Punctuation as suffixes


The ERG internally (still) analyzes most punctuation marks as pseudo-affixes (rather than as separate tokens, as in the PTB). To accommodate any discrepancies, the grammar includes token mapping rules to adjust (i.e. correct) externally supplied tokenization (see the ChartMapping page for general background); specifically, punctuation marks will be re-combined with preceding or following tokens, reflecting standard orthographic convention.

I am trying to find the arguments in favor of this approach. I remember having read about it somewhere. Can anyone help me with the reference?

BTW, what happens with commas or periods following quotes? Are they treated jointly as suffixes of the preceding word?

Here’s the standard reference:

Nunberg, Geoffrey. 1990. The Linguistics of Punctuation. Lecture Notes 18. Stanford, CA: CSLI Publications.

My original motivation was to have the grammar reflect standard writing conventions, which normally do not treat punctuation marks as separate tokens. There is also a potential loss of information if, for example, a single straight quote is treated as a separate token from the word it was attached to (either to its left or to its right), though of course one can decorate such marks more richly when tokenizing in order to avoid ambiguity.
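
To make the information-loss point concrete, here is a minimal Python sketch. It is not the ERG's token mapping rules, and the function name `reattach` and the `(form, start, end)` token format are assumptions made up for this illustration. It glues punctuation-only tokens back onto a neighbouring token when the character offsets show that no whitespace separated them, which is one way of "decorating" tokens so that a straight quote stays unambiguous.

```python
# Hypothetical illustration only; not the ERG's actual token mapping rules.
PUNCT_CHARS = set('.,;:!?"\'()')


def is_punct(form):
    """True if the token consists solely of punctuation characters."""
    return bool(form) and all(ch in PUNCT_CHARS for ch in form)


def reattach(tokens):
    """Re-combine punctuation tokens with adjacent tokens.

    tokens: list of (form, start, end) character spans, assumed to come
    from an external PTB-style tokenizer. A punctuation token is merged
    into the preceding token if it starts exactly where that token ends;
    otherwise it is held back and merged into the following token if that
    one starts exactly where the punctuation ends.
    """
    merged = []
    pending = None  # punctuation waiting to become a prefix
    for form, start, end in tokens:
        if pending is not None:
            if pending[2] == start:
                # No whitespace in between: attach as a prefix.
                form, start = pending[0] + form, pending[1]
            else:
                # Free-standing punctuation: keep it as its own token.
                merged.append(pending)
            pending = None
        if is_punct(form) and merged and merged[-1][2] == start:
            # No whitespace before it: attach as a suffix.
            prev_form, prev_start, _ = merged.pop()
            merged.append((prev_form + form, prev_start, end))
        elif is_punct(form):
            pending = (form, start, end)
        else:
            merged.append((form, start, end))
    if pending is not None:
        merged.append(pending)
    return [form for form, _, _ in merged]


# Tokens for the string: He said "no".
tokens = [("He", 0, 2), ("said", 3, 7), ('"', 8, 9),
          ("no", 9, 11), ('"', 11, 12), (".", 12, 13)]
print(reattach(tokens))  # ['He', 'said', '"no".']
```

Without the offsets, the two `"` tokens in this input are identical, and nothing in the token sequence alone says which one opens the quotation and which one closes it. That is the ambiguity mentioned above.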

The analysis gets trickier with punctuation clusters such as commas or periods with quotes, and the implementation in the ERG is not (yet) ideal or general enough for these.
