Tokenization and segmentation

In technical domains where the text comes from PDF files, tokenization, sentence splitting, and noise filtering are challenging tasks. Typical noise includes invalid segments produced when floating parts (captions, footnotes, headers) get mixed into the main text, words broken across lines with hyphens, inconsistent spacing between lines, and so on.
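To give an idea of what I mean, here is a minimal sketch of the kind of line-level heuristics I have in mind (plain Python; the function name and the specific rules are just my own illustrations, not from any reference):

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Rough heuristic cleanup of text extracted from a PDF; not a full solution."""
    # Re-join words broken across lines with a hyphen: "seg-\nmentation" -> "segmentation"
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', raw)
    # Collapse single line breaks inside paragraphs into spaces (keep blank lines)
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Normalise runs of spaces and tabs
    text = re.sub(r'[ \t]+', ' ', text)
    return text
```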

I am trying to study the literature on these issues to understand the best approaches for preprocessing the files before sentence analysis. I found

Any other references? In the Wiki I found some info about PET input formats and REPP.

I’m not sure about technical references, but I did some work with Ryan Georgi and Fei Xia on extracting text from PDFs (http://www.lrec-conf.org/proceedings/lrec2018/pdf/947.pdf). However, our goal was fairly narrow: breaking PDF pages into blocks (title headers, columns, figures, paragraphs) in order to detect linguistic examples (IGTs). I can recommend some tools, though:

  • PDFLib’s Text-Extraction-Toolkit (TET) was the best utility we sampled for extracting text, as it performed the best with Unicode characters and line reconstruction. Unfortunately, it didn’t do well at inspecting IGTs as columnar data, so we wrote our own tool for that part. The license for TET can be expensive, though.

  • PDFMiner is a free and open-source utility that does a lot of what TET does, but with slightly lower quality for Unicode characters and complex layouts (note: use the PDFMiner.six fork (linked), as the original is dormant). If you are writing your own tool on top of its XML output (as we did), it is pretty decent; see the sketch after this list.
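In case it’s useful, here is a minimal sketch of reading block-level layout with pdfminer.six (the XML route we used goes through its pdf2txt.py -t xml command; the extract_pages API below is a roughly equivalent starting point in recent versions, and paper.pdf is just a placeholder):

```python
# pip install pdfminer.six
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def text_blocks(path):
    """Yield (page number, bounding box, text) for each text block pdfminer finds."""
    for page_no, page in enumerate(extract_pages(path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left
                yield page_no, element.bbox, element.get_text()

for page_no, bbox, text in text_blocks("paper.pdf"):  # placeholder path
    print(page_no, bbox, text.strip()[:60])
```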

Still related to this topic. I found the chapter ‘Tokenisation and Sentence Segmentation’ by David Palmer, which discusses embedded sentences:

In this example, the main sentence contains an embedded sentence (delimited by dashes), and this embedded sentence also contains an embedded quoted sentence.

(9) The holes certainly were rough - “Just right for a lot of vagabonds like us,” said Bigwig - but the exhausted and those who wander in strange country are not particular about their quarters.

It should be clear from these examples that true sentence segmentation, including treatment of embedded sentences, can only be achieved through an approach which integrates segmentation with parsing. Unfortunately, there has been little research in integrating the two; in fact, little research in computational linguistics has focused on the role of punctuation in written language.

Is this true? How could integrating parsing and segmentation help? I am still looking for references about segmentation and about how to deal with structures with embedded sentences in quotes.
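To make the problem concrete for myself, I tried a toy sketch in Python (straight quotes instead of curly ones; the regexes are only illustrations, not anyone’s actual method). A naive punctuation splitter treats example (9) as one flat sentence, while recovering the embedded spans already requires ad hoc structural rules:

```python
import re

TEXT = ('The holes certainly were rough - "Just right for a lot of '
        'vagabonds like us," said Bigwig - but the exhausted and those '
        'who wander in strange country are not particular about their '
        'quarters.')

# Naive splitter: break after ., ! or ? when followed by whitespace and a capital.
naive = re.split(r'(?<=[.!?])\s+(?=[A-Z])', TEXT)
print(len(naive))  # -> 1: the whole passage is a single flat "sentence"

# Recovering the embedded sentences needs structural rules on top:
dash_span = re.search(r'\s-\s(.*?)\s-\s', TEXT)   # the dash-delimited aside
quote_span = re.search(r'"(.*?)[,.!?]?"', TEXT)   # the quoted sentence inside it
print(dash_span.group(1))   # -> '"Just right for ... us," said Bigwig'
print(quote_span.group(1))  # -> 'Just right for a lot of vagabonds like us'
```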

This chapter was published 20 years ago, and since then there has been a fair amount of research into punctuation and parsing. I don’t think Palmer has given a good justification for why we need to achieve “true sentence segmentation” - or indeed what that actually means. It’s also strange that he didn’t cite Christy Doran’s PhD thesis from only 2 years previously, which directly addresses this issue: Incorporating Punctuation Into the Sentence Grammar: A Lexicalized Tree Adjoining Grammar Perspective https://repository.upenn.edu/ircs_reports/68

Thank you for the reference. BTW, do you know the title of the book this chapter comes from?

That chapter ‘Tokenisation and Sentence Segmentation’ comes from the 1st edition of the Handbook of Natural Language Processing edited by Dale, Moisl and Somers, 2000: https://www.routledge.com/Handbook-of-Natural-Language-Processing/Dale-Moisl-Somers/p/book/9780824790004

David Palmer seems to have written an updated version of the chapter for the 2nd edition of the handbook in 2010: ‘Text Preprocessing’ https://www.taylorfrancis.com/books/9780429149207/chapters/10.1201/9781420085938-10. I don’t know whether there’s a freely downloadable version of this.
