LKB-FOS + % in identifiers

In the Grammar Matrix customization system, we build lex entry identifiers based on the stem values provided, escaping special characters like so:

choices:

noun1_stem1_orth=cat'

lexicon.tdl:

cat%27 := common-noun-lex &
  [ STEM < "cat" >,
    SYNSEM.LKEYS.KEYREL.PRED "_cat_n_rel" ].

error:

Syntax error at line 4, character 4: := expected but not found in CAT
Proceeding assuming :=
Syntax error at line 4, character 4: Expected an identifier but found %

Is this something where the Grammar Matrix should change practice, and if so, what is recommended?

LKB-FOS and the svn source code for both versions of the LKB implement the TDL specification at http://moin.delph-in.net/TdlRfc (starting at ‘Types and Instances’). % is in the ‘blacklist’ of characters that cannot occur in an identifier:

Identifier := /[^\s!"#$%&'(),.\/:;<=>[\]^|]+/

The other DELPH-IN processors should also prohibit %. So yes, change % to another character - maybe ~ or *?
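
For anyone who wants to check generated identifiers before loading a grammar, here is a minimal sketch that ports the Identifier pattern quoted above to Python's re module. The function names is_valid_identifier and illegal_characters are made up for illustration and are not part of the Matrix, PyDelphin, or the LKB.

```python
import re

# The Identifier pattern quoted above, ported to Python's re syntax:
# one or more characters that are neither whitespace nor blacklisted punctuation.
IDENTIFIER = re.compile(r'''[^\s!"#$%&'(),./:;<=>\[\]^|]+''')


def is_valid_identifier(name: str) -> bool:
    """Return True if the whole string matches the Identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None


def illegal_characters(name: str) -> list:
    """Return the characters of `name` that the pattern rejects, in order."""
    return [ch for ch in name if IDENTIFIER.fullmatch(ch) is None]


if __name__ == "__main__":
    for candidate in ["cat%27", "cat~27", "cat*27", "cat_27_1"]:
        print(candidate, is_valid_identifier(candidate), illegal_characters(candidate))
```

This only tests the pattern itself. A given processor may reject further characters in practice, so it is a first filter rather than a guarantee.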

Thanks, John. ACE seems happy enough with this grammar, and I don’t think we changed this recently in the Matrix, so I’m surprised that the source code for the other version of the LKB has the same constraint.

Regardless, it should be something we can change on our end. @olzama is this something you could pick up?

Stephan ported John’s updates for TDL parsing into the original LKB in r28957 last August. I don’t think ACE has made any changes except for the """triple-quoted-docstrings""" since we finalized TdlRfc. So you wouldn’t have been affected a year ago unless you used PyDelphin or LKB-FOS, which I believe have had the stricter TDL parsing since 2018.

Also, if you’re curious, % is disallowed because otherwise when we parse something like this:

n_pl_olr := %suffix ...

we wouldn’t know if %suffix is starting an inflectional rule or if it’s a supertype whose identifier is %suffix. This may seem silly, but we don’t currently have a way to say, for example, that % is disallowed at the start of an identifier but allowed elsewhere.

Thanks for the clarification, @goodmami. It is definitely on the Matrix to get up to speed then, but it’s good to know why we missed this so far.

Regardless, it should be something we can change on our end. @olzama is this something you could pick up?

@ebender, as in, add that to validation?

No — it’s in the code that determines what string to use for the lexical item identifiers if the stem contains a special character.

OK; I will create an issue and will try to address it on Monday.

So which character do we want it to be, then? * should in principle be meaningful; ~ is often used in glossing.

Hi Olga,

As far as I know this is only about the way that the identifiers are generated for lexical entries based on their stem forms, so glossing conventions etc shouldn’t matter. In order to pick something that is visually quiet, however, how about _ ?

Thanks!

I am not sure about the underscore because it is already used for a number of things in the lexicon. Maybe it is not a problem but maybe it is? You tell me :).

Before the change:

cat%27_1 := common-noun-lex &
  [ STEM < "cat'" >,
    SYNSEM.LKEYS.KEYREL.PRED "_cat_n_rel" ].

cat%27_2 := common-noun-lex &
  [ STEM < "cat'" >,
    SYNSEM.LKEYS.KEYREL.PRED "_cathomonym_n_rel" ].

After the proposed change:

cat_27_1 := common-noun-lex &
  [ STEM < "cat'" >,
    SYNSEM.LKEYS.KEYREL.PRED "_cat_n_rel" ].

cat_27_2 := common-noun-lex &
  [ STEM < "cat'" >,
    SYNSEM.LKEYS.KEYREL.PRED "_cathomonym_n_rel" ].

I think that’s fine, but I’d also be open to suggestions for how to improve on it!

If you think that’s fine, let’s go with it then. I should also note that there aren’t even that many characters that work: &, @, and ^ all break something in the TDL parser. The underscore works. I am a bit bothered that it means several things in `lexicon.tdl`, but until we get better ideas, we can probably go with it.

To add some context to the original use of %, the apostrophe is being converted to %27 via URL encoding, and the percent sign is a standard part of that encoding (see e.g. w3schools). I wonder if a better method than mangling the URL encoding would be to just use a different method altogether?

For instance, could the special characters be converted to their Unicode code point and then prepended with the underscore? This would be similar to the URL encoding but more standard, I think.

So, instead of cat_27 (where 27 is the hexadecimal value left over from URL encoding, now sort of a lost reference) it would be cat_39, where 39 is the decimal code point of the apostrophe in ASCII.
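
To make the suggestion concrete, here is a rough sketch of the code-point idea. It assumes the set of characters to escape is the TDL identifier blacklist plus whitespace; encode_codepoints is a hypothetical name, not an existing Matrix function.

```python
import re

# Characters that cannot appear in a TDL identifier (blacklist plus whitespace).
ILLEGAL = re.compile(r'''[\s!"#$%&'(),./:;<=>\[\]^|]''')


def encode_codepoints(orth: str) -> str:
    """Replace each illegal character with '_' followed by its decimal code point."""
    return ILLEGAL.sub(lambda m: "_" + str(ord(m.group())), orth)


if __name__ == "__main__":
    print(encode_codepoints("cat'"))    # cat_39
    print(encode_codepoints("cat_39"))  # cat_39 as well, since '_' is left alone
```

As the second call shows, this version is not reversible and can collide when a stem already contains an underscore followed by digits, so uniqueness would still have to be checked separately.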

the apostrophe is being converted to %27 via URL encoding

I think it’s safer to say it’s inspired by URL encoding (also called percent encoding). The function is defined as TDLencode() in gmcs/utils.py which only converts the lower 127 ASCII codepoints and has a different set of reserved characters:

>>> from urllib.parse import quote
>>> from gmcs.utils import TDLencode
>>> quote("cat'_猫")      # percent-encoded
'cat%27_%E7%8C%AB'
>>> TDLencode("cat'_猫")  # Matrix TDL encoded
'cat%27_猫'
>>> quote("_-+*.~")       # allowed punctuation: _-.~
'_-%2B%2A.~'
>>> TDLencode("_-+*.~")   # allowed punctuation: _-+*
'_-+*%2E%7E'

The nice thing about these encoding schemes is that they can be converted back (in theory; I don’t see a TDLdecode() function, but urllib.parse.unquote() might work), which also means that you can pretty much be guaranteed unique identifiers if questionnaire validation already ensured unique orthographies within a lexical class (that is, where the identifier is some combination of the lexical class identifier and the orthography).
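
A quick way to try the round trip on the two encoded strings shown above is a thin wrapper around urllib.parse.unquote(). The name tdl_decode is hypothetical (no such function exists in gmcs/utils.py as far as I can see), and the sketch assumes TDLencode() only ever emits %XX escapes for single ASCII characters.

```python
from urllib.parse import unquote


def tdl_decode(identifier: str) -> str:
    """Hypothetical inverse of TDLencode(): turn %XX escapes back into characters.

    Non-ASCII characters such as 猫 are passed through unchanged by unquote().
    """
    return unquote(identifier)


if __name__ == "__main__":
    print(tdl_decode("cat%27_猫"))   # cat'_猫
    print(tdl_decode("_-+*%2E%7E"))  # _-+*.~
```

Whether this is a true inverse depends on how TDLencode() treats a literal % in the orthography, which I have not verified.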

I don’t think, however, that relying on unique orthographies for unique identifiers, if that is in fact how it is done, is particularly tenable. I’d rather see a name sanitization method that removes the illegal characters or replaces runs of them with a single replacement character. Uniqueness would then need to be explicitly handled.
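
Something along these lines is what I have in mind, as a sketch only. The names sanitize and unique_identifiers are made up, the replacement character is a parameter rather than a recommendation, and the numeric suffix is just one possible way to handle uniqueness explicitly.

```python
import re
from collections import Counter

# One or more consecutive characters that are illegal in a TDL identifier.
ILLEGAL_RUN = re.compile(r'''[\s!"#$%&'(),./:;<=>\[\]^|]+''')


def sanitize(orth: str, replacement: str = "_") -> str:
    """Replace each run of illegal characters with a single replacement character."""
    return ILLEGAL_RUN.sub(replacement, orth)


def unique_identifiers(orths):
    """Sanitize each orthography and append a per-base counter to keep names distinct."""
    seen = Counter()
    identifiers = []
    for orth in orths:
        base = sanitize(orth)
        seen[base] += 1
        identifiers.append(f"{base}_{seen[base]}")
    return identifiers


if __name__ == "__main__":
    print(unique_identifiers(["cat'", "cat!", "dog"]))
    # ['cat__1', 'cat__2', 'dog_1']
```

The identifiers can no longer be decoded back to the orthography, but they do not need to be, since the STEM value still carries the original form.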

For instance, could the special characters be converted to their Unicode code point and then prepended with the underscore?

That would be ok, but then we’d have to handle actual underscores. Also note that the \ character is not disallowed in TDL.
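
One way to handle actual underscores, sketched under the assumption that we want the encoding to stay reversible: treat the underscore itself as a character to escape, and close each escape with a second underscore so that a following digit is not swallowed. Both function names are hypothetical.

```python
import re

# Illegal TDL identifier characters, plus '_' because it serves as the escape marker.
NEEDS_ESCAPE = re.compile(r'''[\s_!"#$%&'(),./:;<=>\[\]^|]''')
ESCAPE = re.compile(r"_(\d+)_")


def encode_reversible(orth: str) -> str:
    """Write each escaped character as '_<decimal code point>_'."""
    return NEEDS_ESCAPE.sub(lambda m: "_" + str(ord(m.group())) + "_", orth)


def decode_reversible(identifier: str) -> str:
    """Invert encode_reversible() by turning '_<n>_' back into chr(n)."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1))), identifier)


if __name__ == "__main__":
    for orth in ["cat'", "cat_39", "a'1'"]:
        encoded = encode_reversible(orth)
        print(orth, encoded, decode_reversible(encoded) == orth)
```

The cost is that ordinary underscores get noisier (cat_39 becomes cat_95_39), which may or may not be acceptable for identifiers that people read in lexicon.tdl.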