Exporting gold lexical types for supertagging training data from ERG treebanks

My goal is to have a corpus of all the gold trees from the ERG treebanks in plain text format, paired with a corpus of the unparsed raw items (sentences); in other words, training data for a supertagger. So, ideally, in the end I need to be able to extract a sequence of lexical types (more or less) corresponding to the sequence of tokens.

What is the easiest way to obtain this thing?

  1. I know there is the [incr tsdb()] export option (described here: WeScience · delph-in/docs Wiki · GitHub). I cannot make it work so far, but presumably it is possible with more tweaking. Is that still the recommended way? @bond @Dan I looked at what I can get using Trees | Export via the GUI, and that’s probably the right thing, although the lexical types themselves will still have to be extracted.

  2. Or is there perhaps a pydelphin way? If so, where to look first, here? delphin.itsdb — PyDelphin 1.6.0 documentation @goodmami

  3. Maybe there is a fftb way? @sweaglesw I see here FftbTop · delph-in/docs Wiki · GitHub that there is a way of training a ranking model directly using fftb but that is not what I need at the moment. I need data for seq2seq model training.

Hi,

the gold profiles already have the ERG derivation trees in plain text format and the unparsed raw items: the i-input field of the item file and the derivation field of the result file. So no need to export anything.

I think you can get what you want from PyDelphin:
https://pydelphin.readthedocs.io/en/latest/guides/itsdb.html#tsql-queries-over-test-suites

just select ‘i-id i-input derivation’, …
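
To see where those fields live without any tooling, here is a minimal stdlib-only sketch. It assumes the usual profile layout: a `relations` file listing each table's fields, and one `@`-delimited text file per table, possibly gzipped. The function names `read_schema` and `read_table` are hypothetical helpers, not part of any library, and the sketch ignores the escape sequences that profiles use for special characters inside values.

```python
import gzip
from pathlib import Path


def read_schema(profile_dir):
    """Map each table name to its ordered list of field names."""
    schema, table = {}, None
    text = (Path(profile_dir) / 'relations').read_text(encoding='utf-8')
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and comments
        if not line[0].isspace() and line.rstrip().endswith(':'):
            # an unindented "name:" line starts a new table
            table = line.strip()[:-1]
            schema[table] = []
        elif table is not None:
            # indented lines look like "field-name :type ..."
            schema[table].append(line.split()[0])
    return schema


def read_table(profile_dir, table, schema):
    """Yield one dict per row of a table stored as `table` or `table.gz`."""
    plain = Path(profile_dir) / table
    if plain.is_file():
        opener, path = open, plain
    else:
        opener, path = gzip.open, Path(profile_dir) / (table + '.gz')
    with opener(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # fields are separated by "@"; escapes are NOT decoded here
            yield dict(zip(schema[table], line.rstrip('\n').split('@')))
```

With a real profile, something like `next(read_table(path, 'result', read_schema(path)))['derivation']` should show the stored derivation string. PyDelphin does all of this (plus joins and type casting) properly, so this is only meant to show what is inside the files.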

When I am simply looking at the ERG files, I cannot locate the “derivation” field anywhere. You mean, it should be in the result.gz file, yes?

Indeed, pydelphin will give me the derivation with the following command:

delphin select 'derivation where i-id = 101' erg/2020/tsdb/gold/ccs (for example).

And then extracting the lexical types for each token, is that up to me or is there pydelphin magic for that?

TSQL (including PyDelphin’s implementation) does implicit joins, so you can just specify the fields you want, such as i-id, i-input, and derivation, and the query planner will join the relevant tables for you.

For more on TSQL syntax, you can try the wiki and PyDelphin’s delphin.tsql docs. It’s pretty well documented. The guide that Francis linked to above shows an example of the delphin.tsql module’s programmatic usage (as opposed to the command-line usage).
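
For the programmatic route, a small sketch of what that could look like. It assumes PyDelphin is installed and that `tsql.select()` takes a query string and a `TestSuite` and returns something iterable over rows, as in the guide linked above. `select_rows` and `format_row` are hypothetical helper names, and the profile path in the usage note is only a placeholder.

```python
def select_rows(profile_path, query='i-id i-input derivation'):
    """Run a TSQL query over a profile and return its rows as tuples."""
    # Imported lazily so the formatting helper below works without PyDelphin.
    from delphin import itsdb, tsql
    ts = itsdb.TestSuite(profile_path)
    # The implicit join across the item, parse and result tables is
    # handled by the query planner.
    return [tuple(row) for row in tsql.select(query, ts)]


def format_row(row):
    """Render one (i-id, i-input, derivation) row as a tab-separated line."""
    i_id, i_input, derivation = row
    return f'{i_id}\t{i_input}\t{derivation}'
```

Usage would be along the lines of `for row in select_rows('path/to/profile'): print(format_row(row))`, which gives one line per stored result, so items with several results appear more than once.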

Extracting the lexical types is not possible from the command line. You’ll need to use the Python API, which I hope doesn’t seem like magic as I spent a lot of time documenting it. For this the TSQL queries may be less useful, so instead I’d recommend the processed_items() method of delphin.itsdb.TestSuite (which, at the risk of undermining my point in the previous sentence, is not documented very well, but it does say that it’s a method on TestSuite objects that yields Response objects). See delphin.interface for what you can do with the Response objects it yields and delphin.derivation for what you can do with Derivation objects. Here’s an example:

>>> from delphin import itsdb
>>> ts = itsdb.TestSuite('~/grammars/erg-trunk/tsdb/gold/mrs')
>>> for response in ts.processed_items():
...     # Get the first result and inspect its derivation
...     result = response.result(0)  # may raise IndexError if no results
...     deriv = result.derivation()
...     print(' '.join(f'{term.form}/{term.parent.type}'
...                    for term in deriv.terminals()))
...     break  # stop after the first for this example
... 
it/None rained/None ./None

The lexical type of the preterminal is None because this profile did not export the UDX derivation format variant. You can get ACE to output UDX with the --udx option (see the AceOptions wiki).

$ delphin mkprof --source ~/grammars/erg-trunk/tsdb/gold/mrs/ mrs-udx --quiet
$ delphin process -g ~/grammars/erg.dat mrs-udx --options='-1 --udx'
Processing |################################| 107/107
NOTE: parsed 107 / 107 sentences, avg 4807k, time 2.47997s

Now try that again:

>>> from delphin import itsdb
>>> ts = itsdb.TestSuite('mrs-udx')
>>> for response in ts.processed_items():
...     result = response.result(0)
...     deriv = result.derivation()
...     print(' '.join(f'{term.form}/{term.parent.type}'
...                    for term in deriv.terminals()))
...     break
... 
it/n_-_pr-it-x_le rained/v_-_it_le ./pt_-_period_le
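
For anyone reading the raw derivation strings rather than going through the API: as far as I know, UDX records the lexical type on the preterminal by appending it to the lexical entry identifier after an `@`. A minimal sketch of pulling the two apart, where `split_udx_entity` is a hypothetical helper and `rain_v1` is only an illustrative entry name:

```python
def split_udx_entity(label):
    """Split a preterminal label into (entity, lexical type or None)."""
    entity, sep, lextype = label.partition('@')
    # Plain UDF labels have no "@", so there is no type to report.
    return entity, (lextype if sep else None)


print(split_udx_entity('rain_v1@v_-_it_le'))  # UDX-style label
print(split_udx_entity('rain_v1'))            # plain UDF label
```

This matches the behaviour seen above, where the type is None for a profile that was not exported with UDX.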

Also note that the Response class is just a wrapper around the dictionary of response data from the profile, so it may include other info, like the input string:

>>> print(response.keys())
dict_keys(['i-id', 'i-origin', 'i-register', 'i-format', 'i-difficulty', 'i-category', 'i-input', 'i-tokens', 'i-gloss', 'i-translation', 'i-wf', 'i-length', 'i-comment', 'i-author', 'i-date', 'parse-id', 'run-id', 'ninputs', 'p-input', 'ntokens', 'p-tokens', 'readings', 'first', 'total', 'tcpu', 'tgc', 'treal', 'words', 'l-stasks', 'p-ctasks', 'p-ftasks', 'p-etasks', 'p-stasks', 'aedges', 'pedges', 'raedges', 'rpedges', 'tedges', 'eedges', 'ledges', 'sedges', 'redges', 'unifications', 'copies', 'conses', 'symbols', 'others', 'gcs', 'i-load', 'a-load', 'date', 'error', 'comment', 'results'])
>>> response['i-input']
'It rained.'

Thanks, Michael! That’s great, I will look into it. (And I never mean any offense by the word “magic”; quite the opposite! As for documentation, I am always happy to read a specific piece, but in practice it is usually impossible to locate the relevant piece without the help of those who wrote it. This is a general observation, not about pydelphin in particular. So, this is what forums are for!)

So, if my goal is to use the ERG gold treebanked data, then this won’t work, right? (I checked that the code above outputs None for one of the treebanks). It wouldn’t make sense for me to reparse all of that and then re-treebank it.

Is there any way to use pydelphin on the already treebanked data? Or does it sound like I need to write something on my own here, such as a converter to UDX or else some additional code for the pydelphin module (which is of course not a problem, I am just trying to avoid that if the code is already written).

Perhaps re-exporting the treebanks via [incr tsdb()] is the solution after all?

Hi,

you can use the ltdb (or lisp and the lkb or pydelphin) to get the lexical type from the lexical id, so no need to reparse.

I am in the middle of marking, but could give some help next week.

I can reparse and auto-update the gold profiles using ERG 2020 with the additional setting to record the lexical types, and it will only take a day or so. I didn’t use that setting for the official release, since I have not tested the various scripts and tools that read the derivation trees, to be sure they can all cope with the added annotations.

Thanks, Dan, but I think it’s not necessary. As Francis said, one should be able to look up the lexical types by loading the grammar’s lexicon and looking up the lexical entries in the derivations. No need to risk breaking your scripts unless you were looking for an excuse to export UDX anyway :)

As Michael said, it is not strictly necessary. On the other hand, I think recording the lexical types (and roots) in the gold profiles would be a very good idea and should become the documented best-practice.

@Dan Let me know what you decide please! Of course I am not asking you to spend a day of your time but if in the end you conclude that you might need it yourself, then let me know :).

In the meantime, @goodmami, you mean doing this lookup completely automatically, right? There is a way to write code that will do this, right? E.g., one of the fields in the response object will robustly correspond to something in the lexicon file(s)?

Something like this, maybe, but Francis has more experience here, so maybe wait until he’s done marking:

>>> from delphin import tdl, itsdb
>>> lextypes = {}  # mapping of lexical entry IDs to types
>>> for event, obj, lineno in tdl.iterparse('~/grammars/erg-trunk/lexicon.tdl'):
...     if event == 'TypeDefinition':
...         lextypes[obj.identifier] = obj.supertypes[0]  # assume exactly 1
... 
>>> ts = itsdb.TestSuite('~/grammars/erg-trunk/tsdb/gold/mrs')
>>> for response in ts.processed_items():
...     result = response.result(0)
...     deriv = result.derivation()
...     pairs = [(t.form, t.parent.entity) for t in deriv.terminals()]
...     print(' '.join(f'{form}/{lextypes.get(entity, entity)}'
...                    for form, entity in pairs))
...     break  # for example only
... 
it/n_-_pr-it-x_le rained/v_-_it_le ./period_pct

Notes:

  • Here’s the documentation for tdl.iterparse()
  • Not every terminal had a lexical entry, so I used lextypes.get(...) instead of lextypes[...]. In this case, period_pct was not in the lexicon.

Hi,

that code looks good to me, the only change you would have to make is to iterate through more than one lexicon.
period_pct is in ple.tdl (for punctuation), and I think there is also gle.tdl for generic lexical entries and a couple more.

From english.dat:
;;
;; Lexicon entries (instances of status lex-entry or generic-lex-entry)
;;

:begin :instance :status lex-entry.
:include "lexicon".
:include "lexicon-rbst".
:include "ple".
:end :instance.

:begin :instance :status generic-lex-entry.
:include "gle".
:include "gle-gen".
:end :instance.
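
Putting that together with the earlier lookup code, a sketch of loading all of those files into one mapping. It assumes each `:include` name corresponds to a `.tdl` file in the grammar directory and that every entry has at least one supertype, the first of which is the lexical type. `load_lextypes` is a hypothetical helper, and its `iterparse` parameter exists only so the function can be tried without PyDelphin installed; by default it uses `delphin.tdl.iterparse`.

```python
from pathlib import Path

# File stems taken from the :include lines quoted above.
LEXICON_STEMS = ('lexicon', 'lexicon-rbst', 'ple', 'gle', 'gle-gen')


def load_lextypes(grammar_dir, stems=LEXICON_STEMS, iterparse=None):
    """Map lexical entry identifiers to their first supertype."""
    if iterparse is None:
        from delphin import tdl  # needs PyDelphin installed
        iterparse = tdl.iterparse
    lextypes = {}
    for stem in stems:
        path = Path(grammar_dir).expanduser() / f'{stem}.tdl'
        if not path.is_file():
            continue  # not every grammar version ships every file
        for event, obj, lineno in iterparse(str(path)):
            if event == 'TypeDefinition' and obj.supertypes:
                lextypes[str(obj.identifier)] = str(obj.supertypes[0])
    return lextypes
```

The resulting dictionary can be used in place of `lextypes` in the earlier example. Later files overwrite earlier ones if an identifier is defined twice, which may or may not be what you want.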

My terminals don’t have parents for some reason (AttributeError: 'UDFTerminal' object has no attribute 'parent'), I wonder what the difference is. I have pydelphin version 1.6.0, and I am trying it on the ERG trunk as well, the MRS test suite. So the data is the same.

Any thoughts on what I could be doing differently? @goodmami

Oh, OK, I can get the code working if I do t._parent instead of t.parent. I am still curious about why t.parent worked for you though? Or if you in fact had _parent, then maybe you could edit accordingly and then I can mark your post as the accepted solution. But I understand this is a “protected attribute” so I am assuming I am doing something wrong by accessing it directly like this. Please advise :).
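
As a stopgap until the difference is explained, here is a sketch of a helper that prefers the public attribute and only falls back to the protected one. `preterminal_of` is a hypothetical name, and relying on `_parent` is an assumption based on what worked for me above, not something the documentation promises.

```python
def preterminal_of(terminal):
    """Return the node above a terminal, or None if it cannot be found."""
    # Prefer the public attribute when this version provides it.
    node = getattr(terminal, 'parent', None)
    if node is None:
        # Fall back to the protected attribute (may change between versions).
        node = getattr(terminal, '_parent', None)
    return node
```

In the lookup code above, `t.parent.entity` would then become `preterminal_of(t).entity`, with a check for None if a terminal can ever lack a parent.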

@bond which english.dat do you mean here? Not the one that ace compiles (but that’s the only one I know about?)

If you meant config.tdl, then the one I get by default with the ERG release does not have that…

Regarding UDFTerminal.parent, I’m surprised to see that error because parent is defined on UDFTerminal, and all it does is retrieve _parent.

Maybe you can paste more of the code you’re using? Maybe something else is going on.

Regarding english.dat, I think @bond meant english.tdl.