In technical domains where text is extracted from PDF files, tokenization, sentence splitting, and noise filtering (invalid segments resulting from floating material mixed into the main text, words broken across lines with hyphens, inconsistent spacing between lines, etc.) are challenging tasks.
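To make the problem concrete, here is a minimal Python sketch of the kind of cleanup pass I have in mind before any linguistic tooling runs. The regexes and thresholds are ad hoc guesses on my part, not taken from any of the references:

```python
import re

def dehyphenate(text):
    # Rejoin words broken across a line break with a hyphen,
    # e.g. "tokeni-\nzation" -> "tokenization". This will also
    # (wrongly) merge genuinely hyphenated compounds that happen
    # to wrap, which a dictionary check could guard against.
    return re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)

def normalize_whitespace(text):
    # Collapse runs of spaces/tabs and inconsistent blank lines.
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

def looks_like_noise(segment, min_length=20, min_alpha_ratio=0.5):
    # Crude heuristic: drop short fragments or segments dominated
    # by non-alphabetic characters (page numbers, table debris,
    # floating captions interleaved with the running text).
    if len(segment) < min_length:
        return True
    alpha = sum(c.isalpha() for c in segment)
    return alpha / len(segment) < min_alpha_ratio

def preprocess(raw_text):
    text = normalize_whitespace(dehyphenate(raw_text))
    # Split on blank lines first, then filter noisy segments.
    segments = [s.strip() for s in text.split('\n\n')]
    return [s for s in segments if s and not looks_like_noise(s)]

# After cleanup, an off-the-shelf sentence splitter should behave
# much better, e.g. with NLTK:
#   from nltk.tokenize import sent_tokenize
#   sents = [s for seg in preprocess(raw) for s in sent_tokenize(seg)]
```

The alphabetic-ratio filter is obviously crude; I suspect a proper solution needs layout-aware extraction upstream, which is part of why I am looking for literature on this.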
I am trying to study the literature on these issues to understand the best approaches for preprocessing the files before sentence analysis. So far, in the Wiki I found some info about the PET input formats and REPP. Any other references?