Reparsing and updating a treebank while keeping previous decisions

I have some old treebanks (produced with an old version of the SRG), and I would like to reparse and update them using ACE and a newer version of the morphophonological analyzer. The grammar is already updated with respect to the morphophonological tags, and I have a pipeline of scripts which converts the morphophonological analyzer output into YY format, which can then be fed to ACE for parsing. Unknown word handling is also working. Thanks a lot, everyone who was involved, for the help!

I also now know how to update the [incr tsdb()] schema. However, refreshing the profile using delphin mkprof with the --refresh option or with the --source option results in an empty decision file.
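
For concreteness, the invocations I mean are along these lines (paths are placeholders; see delphin mkprof --help for the exact flag combinations):

delphin mkprof --refresh --relations new-relations tsdb/old
delphin mkprof --source tsdb/old --relations new-relations tsdb/new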

What is the correct way of preparing a treebank, after updating the schema (and refreshing the profile?..), for updating with fftb using a newer version of the grammar while keeping all the previous data, such as treebanking decisions? Do I just copy over the decisions file?.. Or am I supposed to be using some other tools altogether?


If the decision file was non-empty to begin with, that sounds like a bug. PyDelphin does not do anything special to the decision file on a mkprof command. There are two other things I can think of:

  1. If you processed the profile (delphin process ...), then decision is one of the “affected tables” that gets cleared and rewritten during processing. See delphin.itsdb — PyDelphin 1.7.0 documentation about the “affected tables” of the FieldMapper class, which is used to map responses from processors (like ACE) to [incr tsdb()] profiles.
  2. If the original relations file does not describe decision, then it might not be copied over by mkprof, even if the file exists in the profile directory. I have not confirmed this behavior; it's just a guess (a quick check is sketched below).
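
One quick way to check point 2 is to count the decision records with PyDelphin before and after refreshing (a sketch; the path is a placeholder and assumes the profile's relations file defines decision):

from delphin import itsdb

ts = itsdb.TestSuite('tsdb/old')
print(len(ts['decision']))  # number of decision records visible through the schema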

Note that the above only pertains to refreshing [incr tsdb()] profiles to new versions of the TSDB schema, and not to updating treebanking decisions for newer versions of the grammar.

You will be giving fftb two different profiles as input for the update operation. One of them will be the original profile from the old version of the SRG; call this one gold. The other one, possibly named new, you will create, e.g., using the mkprof method you outlined, followed by a parsing step to fill in the edge relation. At this point it is expected that the decision relation is empty.
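
For example, creating and parsing new could look something like this (a sketch; the grammar and schema paths are placeholders, and -y assumes YY input):

delphin mkprof --source tsdb/gold --relations new-relations tsdb/new
delphin process -g srg-new.dat --full-forest --options='-y' tsdb/new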

Next, invoke fftb to update the new profile. Do something like this to make it happen automatically in batch mode:
fftb -g srg-new.dat --gold tsdb/gold --auto tsdb/new
Alternatively, you can visualize how successful the update was and help resolve remaining ambiguity:
fftb -g srg-new.dat --gold tsdb/gold --browser --webdir /some/where tsdb/new


I am using PyDelphin's ACE wrapper for everything. @goodmami, what is the best way to reparse a tsdb profile? I know how to parse a list of sentences (note that I need YY mode, which I also know how to do, via the command-line arguments interface of the ACE wrapper). I can grab the sentences from ts['item'], using the 'i-input' field of each item, then convert them to YY and feed them to the parser one by one. The question is, how do I stick those back into the tsdb profile; can I just keep track of the index and stick the response lists onto the same testsuite's 'result' field?.. Or is there some interface for reparsing entire profiles that I am not finding?

Hi @olzama ,

I usually do this via the command line:

delphin process -vv -g dict.dat -o "-n 1 --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization" -s INPUT -z OUTPUT

where INPUT is the profile folder I want to parse, and OUTPUT is the folder to be produced. I have never worked with YY inputs (I hope you will teach me how to adapt the Portuguese grammar to use it!), but all ACE options can be given to PyDelphin, so I guess it should work with profiles generated from YY inputs.

For treebanking, I normally use the following. The first command creates the profile and the second prepares it for FFTB.

delphin mkprof --input FILE-ONE-SENT-PER-LINE --relations ~/hpsg/logon/lingo/lkb/src/tsdb/skeletons/english/Relations --skeleton golden
delphin process golden -g erg.dat --full-forest --options='--disable-generalization'

Hmm. I am not sure YY-format can be part of the profile?.. I was assuming that it is necessary to first extract each item’s i-input string and convert it to YY. But maybe I am wrong and I can populate a profile with YY? (Obviously to do that, I would still first need to extract the string and convert it, just the same, but the question is, can that string then be put into the item field of the profile, or not.)

(By the way, @arademaker , YY is just a tokenized format: tokens with some extra information, such as the start and end position of each token, and optionally tags such as POS tags. I just have a script that Luis originally wrote to create it based on the output of Freeling. You could easily create YY-formatted sentences with tags coming from a tagger/analyzer of your choice. You can of course see the scripts I am using here.)
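
For example, a single YY token looks roughly like this, with a token id, start and end vertices, a character span, a link value, the surface form, lexical-rule information, and optional POS tags with probabilities (the tag here is an illustrative Freeling-style one):

(1, 0, 1, <0:6>, 1, "perros", 0, "null", "NCMP000" 0.9500)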

There is: delphin.itsdb — PyDelphin 1.7.0 documentation – the same thing that Alex is pointing out above.

The question still is, what do I do about YY: do I try to replace the items in the testsuite with YY, or do I process each item separately and try to replace the result? I don't think I can do either thing straightforwardly because the testsuite objects (perhaps sensibly) do not allow assignment.

Maybe this is what I need: table.update() (delphin.itsdb — PyDelphin 1.7.0 documentation)?

Or, more likely, I need to use the FieldMapper somehow?.. Does anyone have an example of how that's used?

Yes, I know about the YY format. What I don’t know is:

  1. Why use an external morphophonological analyzer for preprocessing in the SRG? I am assuming that you hit a limit on the capability of the morphophonological analysis in LKB/ACE.

  2. How to prepare the grammar for the YY input.

Those are the things I am interested in for the Portuguese grammar. But this is another thread…

I guess the YY input will be used to construct the profile. Once it is created, the tokenization is already stored in the profile, and processing doesn't need to re-tokenize the input items. I guess…


I may be wrong, but using the YY example from AceUse · delph-in/docs Wiki · GitHub, I produced a profile from a YY file as below.

% delphin mkprof --input teste.yy --relations ~/hpsg/logon/lingo/lkb/src/tsdb/skeletons/english/Relations --skeleton golden

and I was able to process it with ACE:

% delphin process -vv -g erg.dat -o "-n 1 -y --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization" -s golden parsed

I was not expecting the i-input field to hold the YY markup, but it seems to be the way to go… am I right @goodmami and @sweaglesw?

% cat parsed/item
1@@@@1@@(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)@@@@1@12@@@

That's just how the grammar was created. In general, for languages with complex morphology, this seems preferable, especially for phonology: creating a robust morphophonological component can take a very long time. But with a good external analyzer, you get a robust mapping of surface forms to underlying forms, along with tags specifying the morphosyntactic nature of the surface form. Then you map each tag to an inflectional rule in the grammar, via the YY format. You still need to create all the inflectional rules, but you don't have to analyze the surface forms.
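
For example (with an illustrative Freeling-style tag; the actual tag inventory may differ), the analyzer might emit this token for Spanish cantaba, and the grammar then only needs an inflectional rule keyed to the tag VMII3S0, not an analysis of the surface form:

(1, 0, 1, <0:7>, 1, "cantaba", 0, "null", "VMII3S0" 1.0000)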

Is it preferable or necessary? I am still reading @AnnC 's "Implementing Typed Feature Structure Grammars", but page 128, footnote 39, suggests that the morphological rules may be too simple for many languages. I am just trying to confirm how far we can go with them for Portuguese. It seems that LKB is restricted to concatenative morphology and can't deal with alternation rules (see https://www.amazon.com/Finite-State-Morphology-Kenneth-Beesley/dp/1575864347).

Besides morphology, the preprocessing also does the POS tagging, right? That is, the YY format helps with both steps, right? The POS tags can be used to map entries to generic lexical entries, right? If so, another question is the granularity of the tags used for preprocessing. Maybe a deviation from the PTB tags could be worth exploring? Johan Bos explored the idea of 'semantic tagging' (Towards Universal Semantic Tagging - ACL Anthology).

By the way, related to my experiment above, one 'limitation' I see is that we will not have the actual original string of the sentence in the profile. I guess it can be recovered from the YY format, but it would be preferable to have the string in the input file together with the YY input.


This paper might be helpful to your query, @arademaker


Sorry, I didn't pay attention to that part. In my example above, I did that: I added the YY format in the i-input field, and that worked. But it would be better to put the YY markup in the i-tokens field, maybe?


The i-input field of the item file can be YY data. PyDelphin just takes the value of that field and pipes it into ACE, so if ACE was invoked with -y, it will work fine:

$ cat tmp/item
1@@@@1@@(1, 0, 1, <0:3>, 1, "The", 0, "null", "DT" 1.0000) (2, 1, 2, <4:7>, 1, "dog", 0, "null", "NN" 1.0000) (3, 2, 3, <8:13>, 1, "barks", 0, "null", "VBD" 1.0000) (4, 3, 4, <13:14>, 1, ".", 0, "null", "." 1.0000)@@@@1@3@@@
$ delphin process -v --options="-y" -g ~/delphin/erg-2018.dat tmp/
Processing |################################| 1/1
NOTE: parsed 1 / 1 sentences, avg 1558k, time 0.00807s
$ delphin select mrs tmp/ | delphin convert
[ TOP: h0
  INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
  RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
          [ _dog_n_1<4:7> LBL: h7 ARG0: x3 ]
          [ _bark_v_1<8:14> LBL: h1 ARG0: e2 ARG1: x3 ] >
  HCONS: < h0 qeq h1 h5 qeq h7 > ]

[ TOP: h0
  INDEX: e2 [ e SF: prop ]
  RELS: < [ unknown<0:14> LBL: h1 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ARG0: e2 ]
          [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
          [ compound<4:14> LBL: h8 ARG0: e9 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x4 ARG2: x10 [ x IND: + PT: notpro ] ]
          [ udef_q<4:7> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]
          [ _dog_n_1<4:7> LBL: h14 ARG0: x10 ]
          [ _bark_n_1<8:14> LBL: h8 ARG0: x4 ] >
  HCONS: < h0 qeq h1 h6 qeq h8 h12 qeq h14 > ]

If you want stored (as opposed to dynamically generated) YY tokens, then I would store them in the i-input field. It seems like it would be better to keep the original sentence there and put the YY tokens in a field like i-tokens, but it doesn’t work that way. You could keep the original sentence somewhere else, like i-comment or even a separate file.
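
For example, when building the profile, you could do something like the following (a sketch; convert_to_yy stands in for whatever converter you use):

from delphin import itsdb

ts = itsdb.TestSuite('tsdb/new')  # path is a placeholder
item = ts['item']
for i, row in enumerate(item):
    # put the YY tokens where ACE will read them; keep the raw string around
    item.update(i, {'i-input': convert_to_yy(row['i-input']),
                    'i-comment': row['i-input']})
ts.commit()  # write the changes to disk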

I wouldn't try to process items yourself and replace the values of the result file. That's basically just reimplementing the process functionality.

If you want to use Python to programmatically update profiles, then yes, that is an appropriate function.

No, FieldMapper is for mapping the response objects (a PyDelphin thing, not an ACE thing) to [incr tsdb()] fields. You wouldn’t need to directly use FieldMapper unless you were, say, writing a PyDelphin interface to the LKB or doing something drastically different with ACE outputs.


I’m not sure there’s a good way to do this off the shelf. You need to pass YY data to ACE but you also need i-input to be plain text or else FFTB won’t work properly. Off hand I would say a plausible solution would be to store both the plain text in i-input and the preprocessed YY data in i-tokens, and have the delphin process machinery take an option to feed ACE from i-tokens instead of i-input. Looking over the code for art, which does roughly the same thing as delphin process (I think), I see I had a facility to do something related to this, but not quite close enough to work for your problem. @goodmami, would that be easy to add? Or maybe such a facility already exists?


It’s not a problem to convert to YY dynamically given a string, the problem is doing that while also processing the profile as a whole, because the scenario here is updating treebanks…

Given the answers from @goodmami and @sweaglesw, we do have a limitation in the current tools. But I guess you can overcome the limitation with some extra code. You can create the profiles with the commands I suggested, storing the YY markup in the i-input field and the string of the sentences in the comment or any other field of the input file.

For treebanking with fftb, you will need to update the item file, copying the sentence strings to i-input and saving the YY markup elsewhere. Does that work, @sweaglesw? Does it make sense? How does fftb process the data, and how is the string used? I am assuming fftb does not process the string (parsing, tokenizing, etc.) during treebanking. But I may be wrong.

Good call. It actually already exists and I’d forgotten about it. To do this, use the --select option of delphin process, which takes a TSQL select query. To confirm, note that the i-input and i-tokens represent slightly different sentences (The cat meows. and The dog barks., respectively), but the i-tokens one is used in parsing:

$ delphin select "i-input i-tokens" tmp
The cat meows.@(1, 0, 1, <0:3>, 1, "The", 0, "null", "DT" 1.0000) (2, 1, 2, <4:7>, 1, "dog", 0, "null", "NN" 1.0000) (3, 2, 3, <8:13>, 1, "barks", 0, "null", "VBD" 1.0000) (4, 3, 4, <13:14>, 1, ".", 0, "null", "." 1.0000)
$ # see the --select option below
$ delphin process --options="-y" -g ~/delphin/erg-2018.dat --select i-tokens tmp
Processing |################################| 1/1
NOTE: parsed 1 / 1 sentences, avg 1558k, time 0.00758s
$ # confirm it parsed from i-tokens and not i-input
$ delphin select "mrs" tmp/ | delphin convert
[ TOP: h0
  INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
  RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
          [ _dog_n_1<4:7> LBL: h7 ARG0: x3 ]
          [ _bark_v_1<8:14> LBL: h1 ARG0: e2 ARG1: x3 ] >
  HCONS: < h0 qeq h1 h5 qeq h7 > ]

[ TOP: h0
  INDEX: e2 [ e SF: prop ]
  RELS: < [ unknown<0:14> LBL: h1 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ARG0: e2 ]
          [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
          [ compound<4:14> LBL: h8 ARG0: e9 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x4 ARG2: x10 [ x IND: + PT: notpro ] ]
          [ udef_q<4:7> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]
          [ _dog_n_1<4:7> LBL: h14 ARG0: x10 ]
          [ _bark_n_1<8:14> LBL: h8 ARG0: x4 ] >
  HCONS: < h0 qeq h1 h6 qeq h8 h12 qeq h14 > ]

The --select query may be more complex with conditions, etc., but it must yield a single column, which serves as the input to ACE.
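
For example, a query with a condition might look like this (the condition is hypothetical):

$ delphin process --options="-y" -g ~/delphin/erg-2018.dat --select "i-tokens where i-id > 10" tmp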

If you’d like to go this route, the docs have an example of using REPP as a preprocessor when parsing a profile: delphin.interface — PyDelphin 1.7.0 documentation. You could do something similar if you want to write a Python script to do the parsing, as this functionality is not available via the delphin process command.

Thanks, @arademaker , @sweaglesw and @goodmami !

I feel like we are getting somewhere, though for now I am still confused:

OK great, so this allows me to keep the i-input string for fftb later but use i-tokens for parsing. How do I set i-tokens to something?.. In other words, how did you create the input for the example above (assuming it is an itsdb TestSuite instance)?

Just to show you what I currently have:

    sentence_list = read_testsuite(ts) #returns a list of strings from i-input
    script_output = run_script('./sentences2freeling.sh', sentence_list) # returns a list of Freeling output strings
    yy = convert_sentences(script_output) # returns a list of YY strings
    assert len(yy) == len(ts['item'])

I’m trying to understand how to proceed from the above (or what to do differently) if the goal is to have updated treebanked profiles, with the old decisions kept.

The testsuites I have have just the sentences in i-input (and if I look into each item['i-tokens'], what I see is the string 1 for some reason; not sure what that means). I can then obtain the YY format for each string, but something like item['i-tokens'] doesn't allow assignment. I know from the docs that Tables have the update function, but I can't figure out how to use it. How do I update i-tokens given a testsuite item?..

I think I need to do something like:

ts['item'].update(7, data)  

…where 7 is the index of the Row i-tokens. But I don't understand how to create data from a list of YY strings. data is "a mapping of column names to values for replacement". How do I create that? – and there is an example right below, so let me try that:

table.update(0, {'i-input': '...'})

Sorry, too many docs :). I try to read them carefully but I don’t often succeed :).

I still have a question though: the above example with i-input looks like it’s updating one item?.. Or is the ‘…’ somehow a list of things?

ts['item'].update(7,{'i-tokens':yy}) 

does not crash but does not lead to the desired effect, I think, in that I can’t then find the YY tokens anywhere in the testsuite…

What exactly isn’t possible? If I manage to update each i-tokens for each item with YY, won’t I then be able to use process on the entire testsuite?

One important thing to know is that itsdb.TestSuite objects are like open SQL database connections. The data is persisted on disk, but changes are stored in-memory until you commit them. So if you want to use itsdb.Table.update() to change these, don’t forget to run TestSuite.commit() when you’re done. Here is an example:

$ cat update.py 
import sys

from delphin import itsdb
from delphin import repp

tokenizer = repp.REPP()  # default tokenizer, as an example
ts = itsdb.TestSuite(sys.argv[1])
item = ts['item']

for i, row in enumerate(item):
    tokens = tokenizer.tokenize(row['i-input'])
    # the update() function only changes data in-memory
    item.update(i, {'i-tokens': str(tokens)})  # cast to str for YY format
ts.commit()  # commit to write to disk

$ cat tmp/item  # before updating
1@@@@1@@The cat meows.@@@@1@3@@@
2@@@@1@@The dog barks.@@@@1@3@@@
$ python update.py tmp
$ cat tmp/item  # after updating
1@@@@1@@The cat meows.@(0, 0, 1, <0:3>, 1, "The", 0, "null") (1, 1, 2, <4:7>, 1, "cat", 0, "null") (2, 2, 3, <8:14>, 1, "meows.", 0, "null")@@@1@3@@@
2@@@@1@@The dog barks.@(0, 0, 1, <0:3>, 1, "The", 0, "null") (1, 1, 2, <4:7>, 1, "dog", 0, "null") (2, 2, 3, <8:14>, 1, "barks.", 0, "null")@@@1@3@@@

You can also use the low-level delphin.tsdb.write() function with a list of records you’ve created, but this requires a different workflow. The above is probably more user-friendly.
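
A minimal sketch of that alternative, assuming the records are built to match the profile's schema:

from delphin import tsdb

profile_dir = 'tmp'  # profile directory (placeholder)
fields = tsdb.read_schema(profile_dir)['item']
records = [
    tsdb.make_record({'i-id': 1, 'i-input': 'The dog barks.', 'i-length': 3}, fields),
]
tsdb.write(profile_dir, 'item', records, fields)  # rewrites the item file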

The update() function is on the itsdb.Table instance, not the individual rows, so it requires a row index. This is basically just a design limitation – the Row objects are meant to be immutable to save memory and reduce complexity. The index used in update() shouldn’t be a constant like 0 but some variable, such as the i above returned from enumerate(item).

I offered this as an alternative workflow that dynamically generates the tokens when processing. For this, you would not update the profiles with i-tokens, you would just define a delphin.interface.Processor subclass that performs your preprocessing and use the original profiles as inputs. Since you have to write custom Python code for this, you cannot use the delphin process command, but instead write your own script to use the Python API to do the processing. For example:

# define your custom Processor subclass...
from delphin import ace, itsdb

ts = itsdb.TestSuite(ts_path)
with ace.ACEParser(grm, cmdargs=['-y']) as _cpu:
    cpu = PreprocessorWrapper(_cpu, ...)
    ts.process(cpu)
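
A minimal sketch of such a wrapper, modeled on the REPP example in the delphin.interface docs (to_yy stands in for your YY converter):

from delphin import interface

class PreprocessorWrapper(interface.Processor):
    def __init__(self, cpu, to_yy):
        self.cpu = cpu
        self.task = cpu.task  # delegate the task (e.g., 'parse') to the wrapped ACE processor
        self.to_yy = to_yy    # function mapping a plain string to YY tokens

    def process_item(self, datum, keys=None):
        # preprocess the i-input string into YY, then hand it to ACE
        return self.cpu.process_item(self.to_yy(datum), keys=keys)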

Sorry there is not one obvious way to do it. You have choices. Either preprocess your profile as a first step and use delphin process --select ... or write your own script to process the profile with dynamic preprocessing.
