Tokenization and segmentation


In technical domains where text is obtained from PDF files, tokenization, sentence splitting, and noise filtering are challenging tasks. The noise includes invalid segments resulting from floating parts mixed into the main text, words broken across lines by hyphens, inconsistent spacing between lines, and so on.
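To make the kinds of noise concrete, here is a minimal, stdlib-only sketch of two of the cleanup heuristics mentioned above: rejoining hyphen-broken words and normalizing inconsistent spacing. The regexes are illustrative assumptions, not a vetted pipeline, and will need tuning for real documents (e.g. hyphens that are legitimately part of a word).

```python
import re

def clean_extracted_text(text: str) -> str:
    """Heuristic cleanup for PDF-extracted text (a sketch, not a full pipeline)."""
    # Rejoin words broken across lines by hyphenation: "seg-\nmentation" -> "segmentation".
    # This will also merge genuine hyphenated compounds that happen to break at a line end.
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    # Collapse single line breaks inside a paragraph into spaces,
    # but keep blank lines as paragraph boundaries.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Normalize runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_extracted_text("tokeni-\nzation and seg-\nmentation\n\nnext paragraph"))
```

Sentence splitting would then run on the cleaned text, where a line break no longer implies a sentence boundary.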

I am trying to study the literature on these issues to understand the best approaches for preprocessing the files before sentence analysis. I found:

Any other references? In the Wiki I found some info about PET input formats and REPP.


I’m not sure about technical references, but I did some work with Ryan Georgi and Fei Xia on extracting text from PDFs; however, our goal was fairly narrow: breaking PDF pages into blocks (title headers, columns, figures, paragraphs) in order to detect linguistic examples (IGTs). I can recommend some tools, though:

  • PDFLib’s Text-Extraction Toolkit (TET) was the best utility we sampled for extracting text, as it performed best with Unicode characters and line reconstruction. Unfortunately, it didn’t do well at inspecting IGTs as columnar data, so we wrote our own tool for that part. The license for TET can be expensive, though.

  • PDFMiner is a free and open-source utility that does much of what TET does, but with slightly lower quality for Unicode characters and complex layouts (note: use the PDFMiner.six fork (linked), as the original is dormant). If you are writing your own tool on top of its XML output (as we did), it is pretty decent.
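As an illustration of building on PDFMiner’s XML output, here is a small stdlib-only sketch that walks output in the shape produced by its `pdf2txt.py -t xml` command, where each character is its own `<text>` element nested under `textbox`/`textline`. The sample XML below is hand-written to mimic that shape (the exact element names and attributes may vary between versions), and the bbox values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking pdf2txt.py's XML layout: one character
# per <text> element, grouped into textlines and textboxes with bboxes.
SAMPLE = """<pages>
  <page id="1">
    <textbox id="0" bbox="72.0,700.0,200.0,712.0">
      <textline bbox="72.0,700.0,200.0,712.0">
        <text bbox="72.0,700.0,76.0,712.0">H</text><text>i</text>
      </textline>
    </textbox>
  </page>
</pages>"""

def textboxes(xml_string):
    """Yield (bbox, text) for each textbox, concatenating the
    per-character <text> elements back into a string."""
    root = ET.fromstring(xml_string)
    for box in root.iter("textbox"):
        chars = [t.text or "" for t in box.iter("text")]
        yield box.get("bbox"), "".join(chars)

for bbox, text in textboxes(SAMPLE):
    print(bbox, text)
```

The bbox coordinates are what make the layout analysis possible: blocks can be grouped into columns or flagged as headers/figures by comparing their positions on the page.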