Updating a TDL lexicon using pydelphin

I need to add a supertype to a bunch of types. I think that should in principle be convenient to do using Pydelphin but I can’t figure out how.

I am trying something like:

        for event, obj, lineno in tdl.iterparse(lexicon):
            if event == 'TypeDefinition':
                obj.supertypes.append(...)  # the new supertype

…but that seems to have zero effect somehow, in that the list of supertypes (which seems to be a normal list) remains unchanged? What’s the right way of doing this?

Scratch this, because: Moving a constraint into a new supertype breaks the grammar (instances can only inherit from one type).

So instead, I need to add a constraint literally spelled out, to each entry in the lexicon. I am still not sure what the easiest way of doing that is; probably there is a pydelphin way?

Alternatively you could create a type which inherits from the regular lexical types + your new type and then change the type for the lex entries. (That is probably better than repeating the constraint.)

Or better yet: rename the existing type, create a subtype with the old type name that has the constraint you want, and leave the lexicon alone!

Rename v_-le to v-_le-super (or some such), then define the old name as a subtype that adds the constraint:


v_-le := v-_le-super & native_le.
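
The rename-and-subtype idea can also be applied mechanically to the type file. Below is a rough sketch that works on the raw TDL text with a regular expression rather than through PyDelphin. wrap_types is a hypothetical helper; it assumes each definition starts at the beginning of a line as `name :=`, and it does not handle type addenda (`:+`) or definition-like text inside comments. The `-super` suffix and `native_le` are just the names used in this thread.

```python
import re

def wrap_types(tdl_text, names, suffix='-super', extra='native_le'):
    """Rename each listed definition and re-create the old name as a subtype."""
    new_defs = []
    for name in names:
        # Match the name only where it heads a definition ("name :=").
        head = re.compile(r'^{}(?=\s*:=)'.format(re.escape(name)), re.MULTILINE)
        tdl_text, count = head.subn(lambda m: name + suffix, tdl_text)
        if count:
            # The old name now inherits from the renamed type plus the extra type.
            new_defs.append('{0} := {0}{1} & {2}.'.format(name, suffix, extra))
    if not new_defs:
        return tdl_text
    return tdl_text.rstrip('\n') + '\n\n' + '\n\n'.join(new_defs) + '\n'

source = 'v_-le := v_-lex.\n\nv_np_le := v_np_lex &\n  [ FOO bar ].\n'
print(wrap_types(source, ['v_-le']))
```

Because the lexical entries still refer to the old name, the lexicon itself stays untouched.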

This may be a moot point now, but the reason the obj.supertypes.append() operation didn’t work is that TypeDefinition.supertypes is not a persistent data structure but a property/accessor that lists the type terms in the TypeDefinition’s top-level conjunction, so appending to the returned list changes nothing in the definition itself. The information in a TypeDefinition lives mainly in its Conjunction object, whose terms can be type identifiers, AVMs, strings, regexes, or coreferences. If you want to add something to the conjunction, you’d need to use Conjunction.add():

>>> from delphin import tdl
>>> obj = next(
...     obj for ev, obj, _ in tdl.iterparse('~/delphin/jacy/japgram.tdl')
...     if ev == 'TypeDefinition'
... )
>>> print(tdl.format(obj))  # what it looks like before
utterance_rule-decl-finite := utterance-sf-type &
  declarative sentence, finite verb
  <ex> 食べる
                                         FIN + ],
                        NON-LOCAL.QUE <! !> ] ].
>>> obj.conjunction.add(tdl.TypeIdentifier('foo'))  # add a single term
>>> print(tdl.format(obj))  # what it looks like after
utterance_rule-decl-finite := utterance-sf-type &
  declarative sentence, finite verb
  <ex> 食べる
                                         FIN + ],
                        NON-LOCAL.QUE <! !> ] ] & foo.

Note that the supertype foo is added to the end of the conjunction. You can get a more canonical ordering with Conjunction.normalize():

>>> obj.conjunction.normalize()
>>> print(tdl.format(obj))
utterance_rule-decl-finite := utterance-sf-type & foo &
  declarative sentence, finite verb
  <ex> 食べる
                                         FIN + ],
                        NON-LOCAL.QUE <! !> ] ].

Hmm well, I need to update a pretty big lexicon, so actually adding the constraint literally seems more realistic perhaps?

I still don’t fully understand how to do it with Pydelphin but I’m sure it’s possible (just adding the same constraint at the same level to each entry). Also, there is the Matrix code which does this sort of thing.

Otherwise I would need to create lots of new supertypes, and then still change each entry in the lexicon?

Thanks very much for this clear example, Michael! I will try to see if I can find the relevant bits to add a constraint though, in the end, not a supertype (see above).

Hmm looks like the Matrix TDL module is only for writing new files, not reading existing ones, in terms of what functions are already implemented. So, probably best to use Pydelphin (since SRG cannot be customized from a Matrix-style “choices” spec).

One of the main motivations for lexical types is to avoid repeating constraints, and thereby have a more maintainable grammar, right? Adding the same constraint to many lexical entries sounds like a short-term solution that you may regret later.

I suppose I could automate this process as follows:

  1. Grab the sole supertype of the lexical entry.
  2. Create a new supertype automatically (but what if something called x_super already exists? Do I need to think of something idiosyncratic here?), add it to letypes.tdl
  3. Replace the supertype for the entry
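
On the name-clash question in step 2, one option is to check the candidate against the names already defined and fall back to a numbered suffix. Below is a minimal sketch in plain Python. unique_type_name is a hypothetical helper, not part of PyDelphin or the Matrix code, and the case-insensitive comparison is an assumption about how the grammar processor treats type names.

```python
def unique_type_name(base, existing, suffix='_super'):
    """Return base + suffix, adding a counter if that name is already taken."""
    # Assumption: type names are compared without regard to case.
    taken = {name.lower() for name in existing}
    candidate = base + suffix
    counter = 2
    while candidate.lower() in taken:
        candidate = '{}{}{}'.format(base, suffix, counter)
        counter += 1
    return candidate

existing_names = ['x', 'x_super', 'y']
print(unique_type_name('x', existing_names))
print(unique_type_name('y', existing_names))
```

The set of existing names could be collected with tdl.iterparse() over the grammar's type files before any new types are generated.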

I’m still not sure what the best library is for parsing a TDL file and then modifying it. It appears that Pydelphin is great for parsing but Matrix’s tdl.py might be better for modifying? I think I’m wrong about that, though, because Pydelphin seems to have enough classes that look useful for modifying TDL; I am just not seeing many examples of that. I’ll try to figure it out.

OK, I think maybe I can do this more or less as follows:

    import copy
    from delphin import tdl as pydelphin_tdl

    # Assume a list of type names to update in this manner called le_to_update
    updated_letypes = []
    for event, obj, lineno in pydelphin_tdl.iterparse(filepath_letypes):
        if event == 'TypeDefinition':
            updated_letypes.append(obj)  # keep the original definition
            if obj.identifier in le_to_update:
                new_id = pydelphin_tdl.TypeIdentifier(obj.identifier + '_native')
                ds = '' if not obj.docstring else obj.docstring
                # Copy the conjunction so the original type is left unchanged
                conj = copy.deepcopy(obj.conjunction)
                conj.add(pydelphin_tdl.TypeIdentifier('native_le'))
                new_type = pydelphin_tdl.TypeDefinition(new_id, conj, ds +
                                                        'This is a native lexical entry type, '
                                                        'for words that are in the lexicon.')
                updated_letypes.append(new_type)
    with open('updated_letypes.tdl', 'w') as f:
        for obj in updated_letypes:
            f.write(pydelphin_tdl.format(obj) + '\n\n')

This way, I get a new file from which I can copy the portion starting from where the _le types start:

v_np_npsv-id_le := v_np_npsv-id_lex.

v_np_npsv-id-ser_le_native := v_np_npsv-id-ser_lex & native_le
  This is a native lexical entry type, for words that are in the lexicon.

It would be cleaner to make a full updated file, without any manual copying, but I am not sure about the comments at the moment (which aren’t docstrings).

To this same topic, here’s an example of how I create generic lexical entries using pydelphin (if there is a simpler/better way, comments are welcome :slight_smile: ).

from delphin import tdl as pydelphin_tdl

def create_generic_entries(tags, supertype, pred):
    entries = []
    for t in tags:
        type_id = pydelphin_tdl.TypeIdentifier(t + '_ge')
        super_id = pydelphin_tdl.TypeIdentifier(supertype)
        v = [pydelphin_tdl.String(supertype)]
        token_id = pydelphin_tdl.TypeIdentifier('generic_token_list')
        # TOKENS.+LIST value: generic_token_list & < [ +POS.+TAGS < "tag" > ] >
        pos_list = pydelphin_tdl.ConsList(values=[pydelphin_tdl.String(t)], end=pydelphin_tdl.EMPTY_LIST_TYPE)
        pos_avm = pydelphin_tdl.AVM([('+POS.+TAGS', pos_list)])
        tok_list = pydelphin_tdl.ConsList(values=[pos_avm], end=pydelphin_tdl.EMPTY_LIST_TYPE)
        token_conj = pydelphin_tdl.Conjunction([token_id, tok_list])
        term = pydelphin_tdl.AVM([('STEM', pydelphin_tdl.ConsList(values=v, end=pydelphin_tdl.EMPTY_LIST_TYPE)),
                                  ('TOKENS.+LIST', token_conj),
                                  ('SYNSEM.LKEYS.KEYREL.PRED', pydelphin_tdl.String('_generic_' + pred + '_rel'))])

        ds = 'Generic lexical entry that will be triggered by tag {}.'.format(t)
        terms = [super_id, term]
        conj = pydelphin_tdl.Conjunction(terms)
        new_type = pydelphin_tdl.TypeDefinition(type_id, conj, ds)
        entries.append(new_type)  # collect the entry so it gets written below
    with open(pred + '_entries.tdl', 'w') as f:
        for e in entries:
            f.write(pydelphin_tdl.format(e) + '\n\n')

adverbial_tags = ['rg', 'rn', 'nc00000']
create_generic_entries(adverbial_tags, 'av_-_i-vm_le', 'adv')

What comes out looks correct, like this:

rg_ge := av_-_i-vm_le &
  [ STEM < "av_-_i-vm_le" >,
    TOKENS.+LIST generic_token_list & < [ +POS.+TAGS < "rg" > ] >,
    SYNSEM.LKEYS.KEYREL.PRED "_generic_adv_rel" ]
  Generic lexical entry that will be triggered by tag rg.

Posting it here as a complete example because it took me a while to assemble the relevant pieces of documentation to make this work.

Did you find a solution for preserving the comments?

I was thinking… the LKB surely has code for that! The main code seems to be in the src/io-tdl folder, but I was not able to find the function that parses a single TDL file in those files. I found:

(read-tdl-lex-or-grammar-rule-file "~/hpsg/terg/lexicon.tdl" t)

But the messages are not helpful. The problem is that the LKB overuses global variables and adopts too many unconventional ‘tricks’ from Lisp! @johnca is really my hero! Almost all definitions in the LKB are in the same namespace, :lkb, maybe because of limitations in past versions of Lisp implementations?

(it seems that the function above is not prepared to be called from the top level)

I don’t think the LKB is suitable for this task. Having said that…

The correct function for reading a TDL lexicon is read-tdl-lex-file-aux (@arademaker’s suggestion of read-tdl-lex-or-grammar-rule-file reads a lexical rule or grammar rule file). There should be no error messages and no need to do anything with global variables. The outcome is a database containing the lexical entries (the permanent storage for the database is the set of files in your ~/tmp directory called ERG.idx etc.). There are then functions that retrieve entries from the lexicon database, and others that output them to a plain text file.

However, even if you get this far you’ll lose any comments in the TDL file because currently the LKB does not preserve them. This will change in an upcoming release, but it’s not there yet.

Still my hero! :wink: But sorry for the misleading information above…

The way you assemble the pieces is pretty much how to do it at the moment. You could try the shorthand for making conjunctions:

>>> from delphin import tdl
>>> supertype = tdl.TypeIdentifier('some-type')
>>> avm = tdl.AVM([('FEATURE', tdl.TypeIdentifier('value'))])
>>> supertype & avm
<Conjunction object at 140341378514464>

But it’s still a pretty manual experience. PyDelphin mostly just translates between an object model of TDL and the serialized form.

Something like this, but I found that it is still not 100% robust, depending on the original TDL file:

from delphin import tdl as pydelphin_tdl

def save_tdl_objects(tdl_objects, filename):
    with open(filename, 'w') as f:
        for event, obj, lineno in tdl_objects:
            if event == 'LineComment':
                # Restore the comment marker if it is missing
                if not obj.startswith(';'):
                    f.write('; ' + obj + '\n')
                else:
                    f.write(obj + '\n')
            elif event == 'BlockComment':
                f.write('#|' + obj + '|#\n\n')
            else:
                # Everything else is serialized by PyDelphin
                f.write(pydelphin_tdl.format(obj) + '\n\n')

If your purpose is to pass through the file and output as you go, this code looks reasonable to me. I can imagine two things PyDelphin won’t accommodate very well:

  1. Lisp commenting conventions regarding multiple semicolons
  2. Comments within TDL type definitions

But if there is something else that doesn’t work, I’d be interested to hear about it.