What is the purpose of "glosses.txt" as a MOM input file?

In my Toolbox-to-Xigt conversion script, I have some code that gathers a set of unique POS tags and glosses from the Xigtified data. The user should then sort the POS tags into the appropriate files that MOM requires (noun-tags.txt, verb-tags.txt, etc.). There is also a “glosses.txt” file that gets fed to MOM. Initially I thought this was just all of the unique glosses in the corpus, but it appears to contain only glosses with syntactic information (like 1PL), and not glosses that are just “translations” (like ‘bird’).

What is the purpose of this “glosses.txt” file? Does using regex seem like a good way to store only the correct glosses in this file?
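
A rough sketch of the regex idea, assuming grammatical glosses are written in capitals and digits (as in 1PL) while root translations are lowercase. The pattern and the helper name `looks_like_gram` are illustrative assumptions, not anything MOM defines:

```python
import re

# Assumption: grams consist of capitals/digits, optionally joined by periods,
# e.g. "1PL", "PST", "3SG.F". Anything else is treated as a root translation.
GRAM_PATTERN = re.compile(r"^[A-Z0-9]+(?:\.[A-Z0-9]+)*$")

def looks_like_gram(gloss: str) -> bool:
    """Return True if the gloss matches the assumed gram convention."""
    return bool(GRAM_PATTERN.match(gloss))

glosses = ["1PL", "bird", "PST", "to.be", "3SG.F"]
grams = [g for g in glosses if looks_like_gram(g)]
```

A pattern like this would miss grams written in lowercase (such as “m.sg”), so at best it is a first pass that still needs a manual check.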

glosses.txt should include all of the grams found in the corpora, but no root translations. MOM uses it to determine whether a morpheme is a root or an affix. There is already a script (I think thanks to Olga) at mom/util/collect_tags_xigt.py that generates POS tags and glosses for MOM. The POS tags file it produces needs to be divided into nouns and verbs (and adpositions if you are running the BASIL pipeline). The glosses file it produces includes all unique glosses in the corpus and needs to be filtered to remove any stem translations. I think it sorts by frequency, which makes it pretty easy to remove stem translations by hand.
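
To illustrate how a frequency ordering can help with the manual filtering, here is a small sketch. It is not the code in collect_tags_xigt.py; the toy gloss list and variable names are made up for the example:

```python
from collections import Counter

# Toy gloss tokens, as they might be read off a gloss tier (hypothetical data).
gloss_tokens = ["1PL", "bird", "PST", "1PL", "PST", "1PL", "house"]

# Count each gloss and order from most to least frequent.
by_frequency = Counter(gloss_tokens).most_common()

for gloss, count in by_frequency:
    print(f"{gloss}\t{count}")
```

With an ordering like this, glosses that occur only once or twice end up grouped at the bottom of the list, where they can be reviewed together.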

Finally getting back to this again…

I found the script, and definitely thanks to @olzama that it exists! But I have a suggestion for changing it and would like input on whether the change is a good idea.

As it is now, the script produces 3 files:

  1. pos_tags.txt – set of POS tags retrieved from the Xigt POS tier
  2. glosses.txt – set of glosses retrieved from the Xigt gloss tier
  3. unknown_features.txt – set of glosses not found in the existing Feature Dictionary

Since the unknown_features.txt file is a subset of glosses.txt, and includes nearly all of the root translations plus a handful of potential new feature glosses, I was thinking that maybe glosses.txt should include only the known features. The user can then look in unknown_features.txt and add any glosses that actually are features to both the Feature Dictionary and glosses.txt.

This way the user has to do (slightly) less work and the glosses.txt file won’t include root translations by default anymore.
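
A minimal sketch of the proposed split, assuming the Feature Dictionary can be loaded as a plain set of feature names and that matching is exact and case-sensitive. The function name `partition_glosses` and the sample data are hypothetical, not the script's actual API:

```python
def partition_glosses(glosses, known_features):
    """Split glosses into those found in the feature set and those not found."""
    known = sorted(g for g in glosses if g in known_features)
    unknown = sorted(g for g in glosses if g not in known_features)
    return known, unknown

# Hypothetical inputs: glosses from the gloss tier and a toy feature dictionary.
corpus_glosses = {"1PL", "PST", "bird", "house", "EVID"}
feature_dictionary = {"1PL", "PST", "NOM"}

known, unknown = partition_glosses(corpus_glosses, feature_dictionary)
# `known` would go to glosses.txt; `unknown` to unknown_features.txt for review.
```

In this toy case a real feature that is missing from the dictionary (here “EVID”) lands in the unknown list next to the root translations, which is the part the user would still need to check by hand.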

If I do this… what is the protocol for glosses that appear with periods? It seems that as of now, they show up in glosses.txt as is (e.g. something like “to.be” or “m.sg”), but in unknown_features.txt they are split up. I assume they are split for unknown_features.txt because they should appear separately when inserted into the Feature Dictionary (?), but maybe glosses.txt needs them in their original form?
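
For concreteness, the two representations in question can be sketched like this. The helper name `split_gloss` is made up for illustration, and the sketch takes no position on which form MOM actually expects:

```python
def split_gloss(gloss):
    """Split a period-joined gloss into its parts, dropping empty pieces."""
    return [part for part in gloss.split(".") if part]

raw_glosses = ["to.be", "m.sg", "1PL"]

# Form 1: glosses kept exactly as they appear on the gloss tier.
as_is = sorted(set(raw_glosses))

# Form 2: each period-joined gloss broken into separate entries.
split_parts = sorted({part for g in raw_glosses for part in split_gloss(g)})
```

Here `as_is` keeps “to.be” and “m.sg” whole, while `split_parts` contains “to”, “be”, “m”, and “sg” as separate entries.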