# Xigt data for AGGREGATION

I’m working on organizing the data in the agg repo, and part of this process involves converting FLEx and toolbox data to Xigt.

By looking through the xigt python package, I see that it comes with the ability to convert toolbox data out of the box, but I’m a little unclear on how to get started doing this. I looked at the documentation on the Xigt repository, but I’m still not sure where to start. Does anyone have any sample scripts where they converted toolbox data to Xigt?

Similarly, does anyone have an analogous example script for FLEx data? I see the xigt package doesn’t come with the ability to convert FLEx out of the box, but there appears to be a flex2igt repo on the agg gitlab, so it seems like some work has been done to make this possible.

Any help or pointers to documentation/scripts would be greatly appreciated! Thanks!

I haven’t used or looked at this code in a long time, but as I recall Xigt can import from Toolbox’s SFM files (the ones with lines like `\tx zhe shi yi-zhi bi`), not the Toolbox XML format, and you’ll need the toolbox module to be importable by Python.

I just tried running this and unfortunately the toolbox module does not have an installer (setup.py), so you either need to copy the toolbox.py file somewhere on the import path or adjust PYTHONPATH accordingly. For instance, say I have an example IGT file ex.txt and I convert it to Xigt using the source checkouts of Xigt and the toolbox module:

```
~$ git clone https://github.com/xigt/xigt.git
Cloning into 'xigt'...
[...]
~$ git clone https://github.com/goodmami/toolbox.git
Cloning into 'toolbox'...
[...]
~$ cat ex.txt
\ref item1
\t O Pedro baixou
\m O Pedro bai -xou
\g the.M.SG Pedro lower -PST.IND.3SG
\t a bola
\m a bola
\g the.F.SG ball.F.SG
\f Pedro calmed down.
\l Pedro lowered the ball.
~$ cd xigt/
~/xigt$ PYTHONPATH=../toolbox/ python3 -m xigt.main import -i ../ex.txt -o ex.xml
~/xigt$ cat ex.xml
<xigt-corpus>
  <igt id="item1">
    <tier id="p" type="phrases">
      <item id="p1">O Pedro baixou a bola</item>
    </tier>
    <tier id="w" type="words" segmentation="p">
      <item id="w1" segmentation="p1">O</item>
      <item id="w2" segmentation="p1">Pedro</item>
      <item id="w3" segmentation="p1">baixou</item>
      <item id="w4" segmentation="p1">a</item>
      <item id="w5" segmentation="p1">bola</item>
    </tier>
    <tier id="m" type="morphemes" segmentation="w">
      <item id="m1" segmentation="w1">O</item>
      <item id="m2" segmentation="w2">Pedro</item>
      <item id="m3" segmentation="w3">bai</item>
      <item id="m4" segmentation="w3">-xou</item>
      <item id="m5" segmentation="w4">a</item>
      <item id="m6" segmentation="w5">bola</item>
    </tier>
    <tier id="g" type="glosses" alignment="m">
      <item id="g1" alignment="m1">the.M.SG</item>
      <item id="g2" alignment="m2">Pedro</item>
      <item id="g3" alignment="m3">lower</item>
      <item id="g4" alignment="m4">-PST.IND.3SG</item>
      <item id="g5" alignment="m5">the.F.SG</item>
      <item id="g6" alignment="m6">ball.F.SG</item>
    </tier>
    <tier id="t" type="translations" alignment="p">
      <item id="t1" alignment="p1">Pedro calmed down.</item>
    </tier>
  </igt>
</xigt-corpus>
```


You can also install Xigt with pip install xigt but you’ll still need to make the toolbox module importable somehow.

I created these tools when I knew a bit less about Python packaging than I do now, so it’s not an ideal situation. I had hoped to merge the toolbox code into Xigt, but honestly I don’t know when/if I’ll get around to that. But I hope the above helps get you to the next step.

Finally, I have not used the FLEx importer so I cannot help with that one.

One more thing: the example above conveniently had the same markers as the default set used by the importer. At the top of toolbox.py is an example JSON file that you can adapt for configuring it for your data. For convenience I paste it below:

```json
{
  "record_markers": [
    "\\id",
    "\\ref"
  ],
  "igt_attribute_map": {
    "\\id": "corpus-id"
  },
  "tier_map": {
    "\\t": "w",
    "\\m": "m",
    "\\g": "g",
    "\\p": "pos",
    "\\f": "t"
  },
  "make_phrase_tier": ["w", "p"],
  "tier_types": {
    "p": {"type": "phrases"},
    "w": {"type": "words", "interlinear": true},
    "m": {"type": "morphemes", "interlinear": true},
    "g": {"type": "glosses", "interlinear": true},
    "pos": {"type": "pos", "interlinear": true},
    "t": {"type": "translations"}
  },
  "alignments": {
    "w": ["segmentation", "p"],
    "m": ["segmentation", "w"],
    "g": ["alignment", "m"],
    "pos": ["alignment", "m"],
    "t": ["alignment", "p"]
  },
  "error_recovery_method": "ratio"
}
```


The markers (e.g., `\\t`) are mapped to identifier prefixes (`w`), and these are reused when defining the tier types. The `make_phrase_tier` setting is an ad-hoc way to create a phrasal tier from the words tier; I think the two values are the ID of the words tier and the ID for the resulting phrase tier. The `"interlinear": true` setting informs the importer which tiers are meant to be aligned internally. The `error_recovery_method` tells Xigt how to deal with columnar alignment issues in the Toolbox file. The `ratio` method is probably good; the other methods are `reanalyze` and `strict`.
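To see how the pieces fit together, here's a small standalone sketch (my own illustration, not the importer's actual code) that walks a trimmed-down version of the config above and summarizes what each SFM marker becomes in the Xigt output:

```python
# Trimmed-down importer config, written as a Python dict for illustration
# (markers carry a literal backslash).
config = {
    "tier_map": {"\\t": "w", "\\m": "m", "\\g": "g", "\\p": "pos", "\\f": "t"},
    "tier_types": {
        "p": {"type": "phrases"},
        "w": {"type": "words", "interlinear": True},
        "m": {"type": "morphemes", "interlinear": True},
        "g": {"type": "glosses", "interlinear": True},
        "pos": {"type": "pos", "interlinear": True},
        "t": {"type": "translations"},
    },
    "alignments": {
        "w": ["segmentation", "p"],
        "m": ["segmentation", "w"],
        "g": ["alignment", "m"],
        "pos": ["alignment", "m"],
        "t": ["alignment", "p"],
    },
}

# For each marker: which tier it maps to, that tier's type, and how it
# aligns to another tier.
summary = []
for marker, tier_id in config["tier_map"].items():
    tier_type = config["tier_types"][tier_id]["type"]
    attr, target = config["alignments"][tier_id]
    summary.append(f"{marker} -> tier '{tier_id}' ({tier_type}), {attr}={target}")

print("\n".join(summary))
```

Running this prints one line per marker, e.g. `\t -> tier 'w' (words), segmentation=p`, which matches the tier attributes you can see in the exported XML above.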

Save the JSON file to something like `config.json`, then use the `-c`/`--config` option when importing:

```
~/xigt$ PYTHONPATH=../toolbox/ python3 -m xigt.main import -i ../ex.txt -o ex.xml -c config.json
```


The flex2xigt repo should be pretty straightforward. The usage is just:

```
python3 xigtconversion.py <path_to_flex_data.flextext>
```

I also have a few config files that I used to convert various agg datasets, which I can share with you.

Thank you! This is all really helpful. What are the differences between the different error recovery methods?

Also, thanks for your config files, @kphowell! I was looking at them and I wanted to know if there was a particular reason that these config files have the following:

```
"make_phrase_tier": ["m", "p"]
```

@goodmami's explanation says that this is a way to make a phrase tier from the words tier, so I assume putting “m” there instead would create a phrase tier from the morpheme tier(?), but I’m not sure why this would be done instead of using the “w” tier.

@ecconrad This is based on a year old memory, so double check me, but I think there are two possible reasons:
A) the datasets didn’t have a w tier at all or
B) the m and w tiers didn’t match (i.e., the m tier showed underlying morpheme forms) and I wanted to process the m tier because AGGREGATION/BASIL assumes a morpho-phonological analyzer.

I think A is more likely. B doesn’t quite make sense as I should have constructed my testsuites from the m tier (see the xigt2itsdb converter), rather than the phrases tier anyway, so it shouldn’t matter which tier ‘phrases’ was constructed from. So if B was my reason, it may have been unnecessary, or I might have been cutting a corner for some reason.

Consider the following IGT:

```
\w abcdef        ghi   jkl     mnop
\m ab=cde-f      ghi   jk  l   mno    -p
\g 1SG= qrstu -NOM wxy  z   2PL qrstuv -PST
```


Note that the -NOM and wxy gloss tokens straddle the ghi word/morpheme column, and this shifts everything down the line. Also the jk and l morphemes do not have morpheme boundary markers (- or =), so the attachment is ambiguous, or possibly the jkl grouping as a single token is erroneous (we were working with PDF-to-text or OCR output, which can make such mistakes).

The ratio method looks at the columns (e.g. from a up to but not including g, then from g up to but not including j, etc.) in the previous line and determines which column the current token is most within. For instance, the -NOM token has 3/4 characters in the first column and 1/4 in the second, while wxy is entirely within the second column. It would therefore place -NOM in the first column.
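To make the ratio idea concrete, here is a standalone sketch (my own illustration with made-up character offsets, not Xigt's actual implementation) of the overlap computation:

```python
def overlap_ratio(tok_start, tok_end, col_start, col_end):
    """Fraction of the token's characters that fall within [col_start, col_end)."""
    overlap = max(0, min(tok_end, col_end) - max(tok_start, col_start))
    return overlap / (tok_end - tok_start)

# Columns from the previous line, as (start, end) character offsets.
# These numbers are invented for illustration.
columns = [(0, 14), (14, 20), (20, 28)]

# Say the gloss token "-NOM" occupies offsets 11-15: 3 of its 4 characters
# fall in the first column and 1 in the second.
ratios = [overlap_ratio(11, 15, start, end) for start, end in columns]

# Pick the column the token is "most within": the first one here.
best_column = ratios.index(max(ratios))
```

With these offsets, `ratios` comes out as `[0.75, 0.25, 0.0]`, so `-NOM` is assigned to the first column, matching the 3/4-vs-1/4 reasoning above.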

The reanalyze method expects that token and morpheme boundaries are correct, then ignores whitespace for determining columns. It would also correctly group -NOM with the first column, but because jkl splits into jk and l without any morpheme boundary marker, it would incorrectly assign l as the morpheme of mnop, then the mno and -p morphemes would align to nothing. This method is good if you’re confident in the morpheme boundary markers but not in the spacing. In our work, we were more confident in the columns being nearly correct, so ratio is the default method.

The strict method simply raises an error when it encounters any issues with token alignment. This is useful when you want to stop, correct the source files, and try again.

It’s also worth mentioning that columnar alignment errors we experienced with Toolbox files were often not because of manual editing errors, but because we were reading the files in text mode and the columns were aligned by bytes, and this caused errors when multi-byte (non-ASCII) characters were used. We fixed this in Xigt’s GitHub repository but unfortunately never cut a release for the fix, so I recommend using Xigt from a recent clone from GitHub instead of the pip install version.
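A quick standalone illustration of why byte-aligned columns drift once multi-byte characters appear:

```python
plain = "pedro"
accented = "pédro"  # "é" is one character but two bytes in UTF-8

# Both strings are 5 characters wide on screen...
assert len(plain) == 5 and len(accented) == 5

# ...but they differ in byte length, so columns padded to a fixed *byte*
# width no longer visually line up when non-ASCII characters are present.
print(len(plain.encode("utf-8")))     # 5
print(len(accented.encode("utf-8")))  # 6
```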

I’m trying to use the xigt2itsdb converter now, but I’m not quite sure how to run it/where to look for documentation on how to use the exporters in the xigt package. Do you have any examples of this?

We made some changes to the itsdb exporter to work with the latest agg pipeline. You can grab those changes here: https://git.ling.washington.edu/agg/temp-xigt2itsdb and replace the itsdb.py in the xigt repo (https://github.com/xigt/xigt) with the one in the temp repo.

Then you can run the exporter using xigt.sh in the xigt repo like this:

```
./xigt.sh export -f itsdb -i -o
```

Thanks!

I’m trying to get this to work right now, but there seem to be some bugs, so I’m stepping through with a debugger. The first issue I came across is that the folder xigt/codecs seems to conflict with another package called codecs that Python needs to be able to run. I just renamed it to xigt/xigt_codecs and that seemed to solve it, but I assume this problem was not encountered in your setup. Do you have any ideas about this?

@goodmami , if it does need to be changed, is there a way the xigt repo can be updated?

Yes, feel free to open a pull request at https://github.com/xigt/xigt. There are some minimal instructions for this in CONTRIBUTING.md, but let me know if you need any help.