I have a student using VSCode for git interaction and finding that doing so corrupts the files in some invisible way such that the segmentation stops working: when we input a sentence with three words, the LKB just sees one. The repp/vanilla.rpp file looks normal, and we changed its encoding from DOS to a Unix UTF encoding, but that didn’t fix things. Any idea what might be going on?
UTF isn’t specific to Unix. I don’t know how VSCode works internally, but I’d be surprised if it weren’t using Unicode. VSCode also seems to provide an easily accessible option for saving files in UTF-8.
The problem might well be due to CRLF line termination. Since the LKB will be running on a Unix-based operating system (even if it’s via a virtual machine) you need LF line termination.
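One way to test the CRLF hypothesis is to count the line terminators in the file's raw bytes. The following is only a minimal sketch; the helper names `count_line_endings` and `to_lf` are made up for illustration and are not part of the LKB or VSCode.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF pair
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF pair
    }


def to_lf(data: bytes) -> bytes:
    """Convert CRLF pairs (and any stray CRs) to plain LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

To use it, read the file in binary mode, for example `open("repp/vanilla.rpp", "rb").read()`, and pass the bytes to `count_line_endings`. A non-zero CRLF count would support this explanation. If git itself is converting line endings on checkout, its `core.autocrlf` setting may be worth a look as well.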
I use VSCode regularly (although I use Emacs + magit for working with Git), and I’m fairly certain it is not directly to blame. But if your student noticed that the problem does happen in VSCode and not with other editors, then there might be some interaction there.
The basic vanilla.rpp only splits on regular spaces (U+0020) and tabs (U+0009). If you see what looks like a space but the input isn’t being split there, it may be one of the many other whitespace characters. A common culprit is the non-breaking space (U+00A0). You can use Python to easily check what kind of space it is (or hexdump or other tools, but Python is easiest for me):
>>> s = "a b\u00a0c" # the second gap is a non-breaking space, though it prints like a normal one
>>> print(s) # still fine
a b c
>>> s.split() # split on any whitespace
['a', 'b', 'c']
>>> s.split(" ") # split only on a regular space... hmm..
['a', 'b\xa0c']
>>> [ord(c) for c in s] # get the decimal codepoint of each character
[97, 32, 98, 160, 99]
>>> s.encode("unicode-escape") # another way to see non-ascii things
b'a b\\xa0c'
Some questions:
What platform (Windows/macOS/Linux) is the student on? On macOS, you can (maybe accidentally) insert a non-breaking space with Option-Space. Sometimes Shift-Space can do it too (VSCode may have custom keyboard mappings).
Is the student using a CJK input method where the space bar might insert a double-width or other space character? Or one that does key chords or combinations to get diacritics and things?
Is the student manually typing these sentences in VSCode, or copy-pasting from a PDF, HTML, etc.? Copied spaces may look normal but be something else.
Does the student use any VSCode plugins that may be interfering?
That’s true, but maybe not the whole picture. Linux and macOS use UTF-8 as a default system encoding, but Windows (last I checked) uses UTF-16. Furthermore, some Windows apps insist on writing a byte-order mark (BOM) at the start of UTF-8 files, which UTF-8 does not need and which Unix tools often don’t expect. Does the LKB expect UTF-8 only?
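Whether a file starts with a BOM is easy to check from its first few bytes. A minimal sketch, using the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` and its labels are made up for illustration.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Report which byte-order mark, if any, the raw bytes start with."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "UTF-8 with BOM"),
        (codecs.BOM_UTF32_LE, "UTF-32 LE"),
        (codecs.BOM_UTF32_BE, "UTF-32 BE"),
        (codecs.BOM_UTF16_LE, "UTF-16 LE"),
        (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    ]
    for bom, label in boms:
        if data.startswith(bom):
            return label
    return "no BOM"
```

A file with no BOM may still be in some other encoding, so this only rules things in, not out. If a UTF-8 BOM does turn up, decoding with Python's `utf-8-sig` codec strips it.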
Assuming this default encoding, the grammar files coming from VSCode will be UTF-8 with Windows line endings (CRLF). On Unix, however, lines of text are assumed to end with the LF character alone, so the VSCode-derived files will have spurious CR characters peppered through them. If you’re lucky this might not adversely affect grammar files in TDL, but it’s very likely to corrupt other parts of the grammar where elements are one-per-line. In the case of REPP, each rule will end up with a spurious CR character at the end. The first rule in vanilla.rpp is meant to break up the input string at space and punctuation characters; if the rule is corrupted in this way, it will only insert a break when it encounters a space/punctuation character immediately followed by a CR.
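The effect can be imitated with an ordinary Python regular expression. To be clear, this is an analogy and not actual REPP syntax or LKB behavior: the one-line "rule file", the pattern, and the `tokenize` helper are all hypothetical.

```python
import re

# A hypothetical one-rule-per-line file saved with CRLF line endings.
crlf_file = "[ \\t]+\r\n"

# A reader that splits on LF only leaves the CR attached to the rule.
corrupted_pattern = crlf_file.split("\n")[0]   # "[ \t]+" followed by a CR
clean_pattern = corrupted_pattern.rstrip("\r")


def tokenize(pattern: str, text: str) -> list:
    """Split text wherever the pattern matches, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


print(tokenize(clean_pattern, "the dog barks"))        # three tokens
print(tokenize(corrupted_pattern, "the dog barks"))    # one token: no space is followed by CR
print(tokenize(corrupted_pattern, "the dog \rbarks"))  # splits only at the space + CR
```

With the CR left on the end of the pattern, a three-word sentence comes back as a single token, which is the symptom described in the original question.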
If this doesn’t fit with what you observe @ebender then I’m afraid I’m out of guesses.