the apostrophe is being converted to %27 via URL encoding
I think it’s safer to say it’s inspired by URL encoding (also called percent encoding). The function is defined as TDLencode() in gmcs/utils.py, which only converts characters in the ASCII range (code points 0 to 127) and has a different set of reserved characters:
>>> from urllib.parse import quote
>>> from gmcs.utils import TDLencode
>>> quote("cat'_猫") # percent-encoded
'cat%27_%E7%8C%AB'
>>> TDLencode("cat'_猫") # Matrix TDL encoded
'cat%27_猫'
>>> quote("_-+*.~") # allowed punctuation: _-.~
'_-%2B%2A.~'
>>> TDLencode("_-+*.~") # allowed punctuation: _-+*
'_-+*%2E%7E'
The nice thing about these encoding schemes is that they can be converted back (in theory; I don’t see a TDLdecode() function, but urllib.parse.unquote() might work). That also means you are pretty much guaranteed unique identifiers if questionnaire validation has already ensured unique orthographies within a lexical class (that is, where the identifier is some combination of the lexical class identifier and the orthography).
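To make the round trip concrete, here is a rough sketch. Note that tdl_like_encode() is a hypothetical stand-in I wrote from the example output above, not the actual TDLencode() implementation, and the set of unescaped punctuation is an assumption inferred from those examples. It only shows that urllib.parse.unquote() can reverse this style of escaping, including when non-ASCII characters are left unescaped.

```python
from urllib.parse import unquote

# Assumption: punctuation left unescaped, inferred from the example output above.
SAFE_PUNCTUATION = "_-+*"


def tdl_like_encode(text: str) -> str:
    """Percent-encode ASCII characters other than letters, digits, and _-+*.

    Hypothetical stand-in for TDLencode(); non-ASCII characters pass through.
    """
    pieces = []
    for ch in text:
        is_ascii = ord(ch) < 128
        if is_ascii and not (ch.isalnum() or ch in SAFE_PUNCTUATION):
            pieces.append(f"%{ord(ch):02X}")
        else:
            pieces.append(ch)
    return "".join(pieces)


for original in ["cat'_猫", "_-+*.~", "100%"]:
    token = tdl_like_encode(original)
    # unquote() reverses the escapes and leaves unescaped characters alone.
    assert unquote(token) == original
```

Because a literal % is itself escaped in this sketch, two different inputs cannot produce the same output. Whether the real TDLencode() has that property would need checking.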
I don’t think, however, that relying on unique orthographies for unique identifiers, if that is in fact how it is done, is particularly tenable. I’d rather see a name sanitization method that removes the illegal characters or replaces runs of them with a single replacement character. Uniqueness would then need to be explicitly handled.
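Something like the following is what I have in mind, as a minimal sketch only. The names sanitize() and unique_identifier() are hypothetical, and the set of illegal characters is an assumption (ASCII characters other than letters, digits, and _-+*), not something taken from the TDL specification or the Matrix code.

```python
import re

# Assumption: any ASCII character other than letters, digits, and _-+* is illegal.
# Non-ASCII characters are left alone.
ILLEGAL_RUN = re.compile(r"(?:(?![A-Za-z0-9_+*-])[\x00-\x7F])+")


def sanitize(name: str, replacement: str = "_") -> str:
    """Replace each run of illegal characters with a single replacement."""
    return ILLEGAL_RUN.sub(replacement, name)


def unique_identifier(name: str, taken: set) -> str:
    """Sanitize name, then add a numeric suffix until it is unused."""
    base = sanitize(name)
    candidate = base
    counter = 2
    while candidate in taken:
        candidate = f"{base}_{counter}"
        counter += 1
    taken.add(candidate)
    return candidate


taken = set()
first = unique_identifier("cat's", taken)
second = unique_identifier("cat.s", taken)  # sanitizes to the same base
assert first != second
```

Since sanitizing is lossy, two orthographies can collapse to the same name, which is why the uniqueness check is a separate, explicit step here.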
For instance, could the special characters be converted to their Unicode code point and then prepended with the underscore?
That would be ok, but then we’d have to handle actual underscores. Also note that the \ character is not disallowed in TDL.
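One way to handle actual underscores would be to escape them too, so that every underscore in the output marks an escape. This is just a sketch of the idea with hypothetical names (underscore_encode(), underscore_decode()); the fixed two-digit hex width and the set of characters left unescaped are my assumptions, and only ASCII characters are escaped.

```python
import re

# Assumption: ASCII punctuation left unescaped. The underscore is deliberately
# absent because it serves as the escape character.
SAFE_PUNCTUATION = "-+*"
ESCAPE = re.compile(r"_([0-9A-F]{2})")


def underscore_encode(text: str) -> str:
    """Write each special ASCII character as '_' plus its two-digit hex code point."""
    pieces = []
    for ch in text:
        is_ascii = ord(ch) < 128
        if is_ascii and not (ch.isalnum() or ch in SAFE_PUNCTUATION):
            pieces.append(f"_{ord(ch):02X}")  # a literal '_' becomes '_5F'
        else:
            pieces.append(ch)
    return "".join(pieces)


def underscore_decode(token: str) -> str:
    """Reverse underscore_encode()."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), token)


token = underscore_encode("cat'_猫")
assert underscore_decode(token) == "cat'_猫"
```

The fixed width matters: with variable-length code points, an escape followed by a hex-looking letter or digit would be ambiguous when decoding.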