Reparsing and updating a treebank while keeping previous decisions

I may be wrong, but following the YY example from AceUse · delph-in/docs Wiki · GitHub, I produced a profile from a YY file as below:

% delphin mkprof --input teste.yy --relations ~/hpsg/logon/lingo/lkb/src/tsdb/skeletons/english/Relations --skeleton golden

and I was able to process it with ACE:

% delphin process -vv -g erg.dat -o "-n 1 -y --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization" -s golden parsed

I was not expecting the i-input field to hold the YY markup, but it seems to be the way to go… am I right @goodmami and @sweaglesw?

% cat parsed/item
1@@@@1@@(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)@@@@1@12@@@

That’s just how the grammar was created. In general, for languages with complex morphology, this approach seems preferable, especially for phonology: building a robust morphophonological component inside the grammar can take a very long time. With a good external analyzer, you get a robust mapping from surface forms to underlying forms, along with tags specifying the morphosyntactic nature of each surface form. You then map each tag to an inflectional rule in the grammar via the YY format. You still need to create all the inflectional rules, but you don’t have to analyze the surface forms.
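
To make that concrete, here is a minimal sketch of rendering an external analyzer’s output as YY tokens (my own illustration, not part of any tool: to_yy is a made-up name, and the field layout follows the YY token format visible in the item file above):

def to_yy(tokens):
    # tokens: (form, cfrom, cto, tag, prob) tuples from the external analyzer
    parts = []
    for i, (form, cfrom, cto, tag, prob) in enumerate(tokens, 1):
        # YY fields: id, start, end, <from:to>, paths, "form", ipos, "lrule", tag prob
        parts.append(f'({i}, {i-1}, {i}, <{cfrom}:{cto}>, 1, "{form}", 0, "null", "{tag}" {prob:.4f})')
    return ' '.join(parts)

print(to_yy([('Tokenization', 0, 11, 'NNP', 0.7677)]))
# -> (1, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677)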

Is it preferable or necessary? I am still reading @AnnC’s “Implementing Typed Feature Structure Grammars”, but footnote 39 on page 128 suggests that the morphological rules may be too simple for many languages. I am just trying to confirm how far we can go with them for Portuguese. It seems that the LKB is restricted to concatenative morphology and can’t deal with alternation rules (see https://www.amazon.com/Finite-State-Morphology-Kenneth-Beesley/dp/1575864347).

Besides morphology, the preprocessing also does POS tagging, right? That is, the YY format helps with both steps, right? The POS tags can be used to map entries to generic lexical entries, right? If so, another question is the granularity of the tags used for preprocessing. Maybe a deviation from the PTB tags would be worth exploring? Johan Bos explored the idea of ‘semantic tagging’ (Towards Universal Semantic Tagging - ACL Anthology).

By the way, related to my experiment above, one ‘limitation’ I see is that the profile will not contain the actual initial string of the sentence. I guess it can be recovered from the YY format, but it would be preferable to have the string in the input file together with the YY input.


This paper might be helpful for your query, @arademaker


Sorry, I didn’t pay attention to that part. In my example above, I did just that: I put the YY format in the i-input field, and that worked. But maybe it would be better to put the YY markup in the i-tokens field?


The i-input field of the item file can be YY data. PyDelphin just takes the value of that field and pipes it into ACE, so if ACE was invoked with -y, it will work fine:

$ cat tmp/item
1@@@@1@@(1, 0, 1, <0:3>, 1, "The", 0, "null", "DT" 1.0000) (2, 1, 2, <4:7>, 1, "dog", 0, "null", "NN" 1.0000) (3, 2, 3, <8:13>, 1, "barks", 0, "null", "VBD" 1.0000) (4, 3, 4, <13:14>, 1, ".", 0, "null", "." 1.0000)@@@@1@3@@@
$ delphin process -v --options="-y" -g ~/delphin/erg-2018.dat tmp/
Processing |################################| 1/1
NOTE: parsed 1 / 1 sentences, avg 1558k, time 0.00807s
$ delphin select mrs tmp/ | delphin convert
[ TOP: h0
  INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
  RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
          [ _dog_n_1<4:7> LBL: h7 ARG0: x3 ]
          [ _bark_v_1<8:14> LBL: h1 ARG0: e2 ARG1: x3 ] >
  HCONS: < h0 qeq h1 h5 qeq h7 > ]

[ TOP: h0
  INDEX: e2 [ e SF: prop ]
  RELS: < [ unknown<0:14> LBL: h1 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ARG0: e2 ]
          [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
          [ compound<4:14> LBL: h8 ARG0: e9 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x4 ARG2: x10 [ x IND: + PT: notpro ] ]
          [ udef_q<4:7> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]
          [ _dog_n_1<4:7> LBL: h14 ARG0: x10 ]
          [ _bark_n_1<8:14> LBL: h8 ARG0: x4 ] >
  HCONS: < h0 qeq h1 h6 qeq h8 h12 qeq h14 > ]

If you want stored (as opposed to dynamically generated) YY tokens, then I would store them in the i-input field. It seems like it would be better to keep the original sentence there and put the YY tokens in a field like i-tokens, but it doesn’t work that way. You could keep the original sentence somewhere else, like i-comment or even a separate file.
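
For example, something like the following could stash the original sentence in i-comment before overwriting i-input (an untested sketch using the itsdb API discussed further below; yy_strings is assumed to hold one YY string per item):

from delphin import itsdb

ts = itsdb.TestSuite('parsed')  # path is illustrative
item = ts['item']
for i, row in enumerate(item):
    # keep the plain-text sentence retrievable, then overwrite i-input with YY
    item.update(i, {'i-comment': row['i-input'], 'i-input': yy_strings[i]})
ts.commit()  # write the in-memory changes to disk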

I wouldn’t try to process things yourself and then replace the values in the result file. That’s basically just reimplementing the process functionality.

If you want to use Python to programmatically update profiles, then yes, that is an appropriate function.

No, FieldMapper is for mapping the response objects (a PyDelphin thing, not an ACE thing) to [incr tsdb()] fields. You wouldn’t need to directly use FieldMapper unless you were, say, writing a PyDelphin interface to the LKB or doing something drastically different with ACE outputs.


I’m not sure there’s a good way to do this off the shelf. You need to pass YY data to ACE, but you also need i-input to be plain text or else FFTB won’t work properly. Offhand, I would say a plausible solution would be to store both the plain text in i-input and the preprocessed YY data in i-tokens, and have the delphin process machinery take an option to feed ACE from i-tokens instead of i-input. Looking over the code for art, which does roughly the same thing as delphin process (I think), I see I had a facility for something related to this, but not quite close enough to work for your problem. @goodmami, would that be easy to add? Or maybe such a facility already exists?


It’s not a problem to convert to YY dynamically given a string, the problem is doing that while also processing the profile as a whole, because the scenario here is updating treebanks…

Given the answers from @goodmami and @sweaglesw, we do have a limitation in the current tools. But I guess you can overcome it with some extra code. You can create the profiles with the commands I suggested, storing the YY markup in the i-input field and the strings of the sentences in i-comment or any other field of the item file.

For treebanking with fftb, you will need to update the item file, copying the strings of the sentences back into i-input and saving the YY markup elsewhere. Does that work, @sweaglesw? Does it make sense? How does fftb process the data, and how is the string used? I am assuming fftb does not process the string (parsing, tokenizing, etc.) during treebanking, but I may be wrong.

Good call. It actually already exists and I’d forgotten about it. To do this, use the --select option of delphin process, which takes a TSQL select query. To confirm, note that the i-input and i-tokens represent slightly different sentences (The cat meows. and The dog barks., respectively), but the i-tokens one is used in parsing:

$ delphin select "i-input i-tokens" tmp
The cat meows.@(1, 0, 1, <0:3>, 1, "The", 0, "null", "DT" 1.0000) (2, 1, 2, <4:7>, 1, "dog", 0, "null", "NN" 1.0000) (3, 2, 3, <8:13>, 1, "barks", 0, "null", "VBD" 1.0000) (4, 3, 4, <13:14>, 1, ".", 0, "null", "." 1.0000)
$ # see the --select option below
$ delphin process --options="-y" -g ~/delphin/erg-2018.dat --select i-tokens tmp
Processing |################################| 1/1
NOTE: parsed 1 / 1 sentences, avg 1558k, time 0.00758s
$ # confirm it parsed from i-tokens and not i-input
$ delphin select "mrs" tmp/ | delphin convert
[ TOP: h0
  INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
  RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
          [ _dog_n_1<4:7> LBL: h7 ARG0: x3 ]
          [ _bark_v_1<8:14> LBL: h1 ARG0: e2 ARG1: x3 ] >
  HCONS: < h0 qeq h1 h5 qeq h7 > ]

[ TOP: h0
  INDEX: e2 [ e SF: prop ]
  RELS: < [ unknown<0:14> LBL: h1 ARG: x4 [ x PERS: 3 NUM: pl IND: + ] ARG0: e2 ]
          [ _the_q<0:3> LBL: h5 ARG0: x4 RSTR: h6 BODY: h7 ]
          [ compound<4:14> LBL: h8 ARG0: e9 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x4 ARG2: x10 [ x IND: + PT: notpro ] ]
          [ udef_q<4:7> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]
          [ _dog_n_1<4:7> LBL: h14 ARG0: x10 ]
          [ _bark_n_1<8:14> LBL: h8 ARG0: x4 ] >
  HCONS: < h0 qeq h1 h6 qeq h8 h12 qeq h14 > ]

The --select query may be more complex with conditions, etc., but it must yield a single column, which serves as the input to ACE.
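
For example, a query with a condition (the condition here is illustrative) works as long as it still yields one column:

$ delphin process --options="-y" -g ~/delphin/erg-2018.dat --select "i-tokens where i-length <= 10" tmp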

If you’d like to go this route, the docs have an example of using REPP as a preprocessor when parsing a profile: delphin.interface — PyDelphin 1.7.0 documentation. You could do something similar if you want to write a Python script to do the parsing, as this functionality is not available via the delphin process command.

Thanks, @arademaker , @sweaglesw and @goodmami !

I feel like we are getting somewhere, though for now I am still confused:

OK great, so this allows me to keep the i-input string for fftb later but use i-tokens for parsing. How do I set i-tokens to something? In other words, how did you create the input for the example above (assuming it is an itsdb TestSuite instance)?

Just to show you what I currently have:

    sentence_list = read_testsuite(ts) #returns a list of strings from i-input
    script_output = run_script('./sentences2freeling.sh', sentence_list) # returns a list of Freeling output strings
    yy = convert_sentences(script_output) # returns a list of YY strings
    assert len(yy) == len(ts['item'])

I’m trying to understand how to proceed from the above (or what to do differently) if the goal is to have updated treebanked profiles, with the old decisions kept.

The testsuites I have contain just the sentences in i-input (and if I look into each item['i-tokens'], what I see is the string '1' for some reason; not sure what that means). I can then obtain the YY format for each string, but something like item['i-tokens'] doesn’t allow assignment. I know from the docs that Tables have the update function, but I can’t figure out how to use it. How do I update i-tokens given a testsuite item?

I think I need to do something like:

ts['item'].update(7, data)  

…where 7 is the index of the Row. But I don’t understand how to create data from a list of YY strings. data is “a mapping of column names to values for replacement”; how do I create that? There is an example right below in the docs, so let me try it:

table.update(0, {'i-input': '...'})

Sorry, too many docs :). I try to read them carefully but I don’t often succeed :).

I still have a question though: the above example with i-input looks like it’s updating one item?.. Or is the ‘…’ somehow a list of things?

ts['item'].update(7,{'i-tokens':yy}) 

does not crash but does not lead to the desired effect, I think, in that I can’t then find the YY tokens anywhere in the testsuite…

What exactly isn’t possible? If I manage to update each i-tokens for each item with YY, won’t I then be able to use process on the entire testsuite?

One important thing to know is that itsdb.TestSuite objects are like open SQL database connections. The data is persisted on disk, but changes are stored in-memory until you commit them. So if you want to use itsdb.Table.update() to change these, don’t forget to run TestSuite.commit() when you’re done. Here is an example:

$ cat update.py 
import sys

from delphin import itsdb
from delphin import repp

tokenizer = repp.REPP()  # default tokenizer, as an example
ts = itsdb.TestSuite(sys.argv[1])
item = ts['item']

for i, row in enumerate(item):
    tokens = tokenizer.tokenize(row['i-input'])
    # the update() function only changes data in-memory
    item.update(i, {'i-tokens': str(tokens)})  # cast to str for YY format
ts.commit()  # commit to write to disk

$ cat tmp/item  # before updating
1@@@@1@@The cat meows.@@@@1@3@@@
2@@@@1@@The dog barks.@@@@1@3@@@
$ python update.py tmp
$ cat tmp/item  # after updating
1@@@@1@@The cat meows.@(0, 0, 1, <0:3>, 1, "The", 0, "null") (1, 1, 2, <4:7>, 1, "cat", 0, "null") (2, 2, 3, <8:14>, 1, "meows.", 0, "null")@@@1@3@@@
2@@@@1@@The dog barks.@(0, 0, 1, <0:3>, 1, "The", 0, "null") (1, 1, 2, <4:7>, 1, "dog", 0, "null") (2, 2, 3, <8:14>, 1, "barks.", 0, "null")@@@1@3@@@

You can also use the low-level delphin.tsdb.write() function with a list of records you’ve created, but this requires a different workflow. The above is probably more user-friendly.
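
In case it helps, here is a rough sketch of that lower-level route (untested; you must build complete records matching the item schema yourself):

from delphin import tsdb

schema = tsdb.read_schema('tmp')  # parses the relations file
fields = schema['item']
# each record is a tuple with one value per column of the item relation;
# make_record() builds one from a {column: value} mapping
records = [tsdb.make_record({'i-id': 1, 'i-input': 'The dog barks.'}, fields)]
tsdb.write('tmp', 'item', records, fields)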

The update() function is on the itsdb.Table instance, not the individual rows, so it requires a row index. This is basically just a design limitation – the Row objects are meant to be immutable to save memory and reduce complexity. The index used in update() shouldn’t be a constant like 0 but some variable, such as the i above returned from enumerate(item).

I offered this as an alternative workflow that dynamically generates the tokens when processing. For this, you would not update the profiles with i-tokens, you would just define a delphin.interface.Processor subclass that performs your preprocessing and use the original profiles as inputs. Since you have to write custom Python code for this, you cannot use the delphin process command, but instead write your own script to use the Python API to do the processing. For example:

from delphin import ace, itsdb

# define your custom Processor subclass...
ts = itsdb.TestSuite(ts_path)
with ace.ACEParser(grm, cmdargs=['-y']) as _cpu:
    cpu = PreprocessorWrapper(_cpu, ...)
    ts.process(cpu)
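
For reference, a minimal sketch of what such a wrapper might look like (my illustration, assuming delphin.interface.Processor; preprocess is any callable that turns a plain sentence into a YY string, e.g. your Freeling pipeline):

from delphin import interface

class PreprocessorWrapper(interface.Processor):
    # preprocess each input before handing it to the wrapped parser
    task = 'parse'

    def __init__(self, cpu, preprocess):
        self.cpu = cpu  # e.g., an ace.ACEParser instance
        self.preprocess = preprocess

    def process_item(self, datum, keys=None):
        yy = self.preprocess(datum)  # plain text -> YY string
        return self.cpu.process_item(yy, keys=keys)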

Sorry, there is not one obvious way to do it; you have choices: either preprocess your profile as a first step and use delphin process --select ..., or write your own script to process the profile with dynamic preprocessing.


Sorry, but the last message from @olzama is not clear to me. I tend to believe that going as low-level as @goodmami is suggesting may not be safe… But I haven’t heard from @sweaglesw, and I don’t know how and when FFTB uses the i-input… so maybe my naive suggestion of preprocessing with pydelphin/ace and updating some columns in the input before FFTB may not work.

@olzama, is the grammar updated in GitHub - delph-in/srg: Spanish Resource Grammar? Are the treebanks in the same repo?

Sorry, I’m not sure what you’re referring to. I mentioned that delphin.tsdb is a lower-level approach to what the delphin.itsdb module offers, but these methods are all “above-board”, so I’m not sure what is not safe.

@sweaglesw said that i-input needs to be the original input sentence for FFTB to work (presumably for at least highlighting token spans), so the initial suggestion of putting the tokens in the i-input field was not ideal and we backed off of that. I’d forgotten about PyDelphin’s --select option, so it already has a perfectly usable alternative, assuming we want to store tokens in the profile (e.g., in the i-tokens field) before parsing.

If we don’t want to store tokens in the profile and instead want to perform preprocessing at the same time as parsing the profile, then PyDelphin’s command line interface (which is really just a convenient front-end to the Python API) will not work as it does not have an option for custom preprocessors.

Either method seems ok to me. Once the profiles have been parsed, then FFTB is used to update them with earlier decisions and to make new ones as needed.

Oh… I was saying that using a method close to what process offers would be the safer path. Ideally, @olzama should not need to modify the files directly. The process command is the best interface: given a profile, process it with a given grammar and parameters, populating the profile as necessary to record the results.

I believe, @goodmami, that we are on the same page. You are right, the level of abstraction is relative. As long as @olzama does not try to write to the files directly, I mean, writing lines in the text files of the profile herself, I would say it should be fine IMHO, right?

But we are going too deep here without concrete tests. For instance, the idea of using i-tokens and i-input seems fine, but it depends on how FFTB uses the i-input. Suppose FFTB performs tokenization and lexical analysis once we click on a sentence to start the annotation. In that case, the result would differ from the tokenization/POS and morphological analysis done by the external tool @olzama is using. What we want is a way to ensure that whatever FFTB does to prepare the sentence for annotation and to store the human-provided analysis, it starts from the morphological analysis provided by the external tool, right?

Hi @ebender, thank you for sharing this paper. I guess my next reading is https://faculty.washington.edu/ebender/papers/Montage_LREC.pdf. If I got it right, the tools are still under development, right? I believe I got something from the examples in Slave, Sec 3.2.1 and 3.2.2, but I didn’t really understand what Montage does that finite-state techniques can’t. Maybe the point is more about maintainability and use by descriptive linguists.

Also, at the end of the paper, you mention generating the XFST files, so an open-source tool like http://fomafst.github.io can use them. That is something I would be interested in exploring. We are trying to keep the PorGram lexicon in sync with our MorphoBr full-form dictionary.

Sorry for being slow to reply, folks. I don’t know why, but the Discourse website silently drops replies that I send by email these days, and it is less often that I have a computer in front of me that is actually signed into the Discourse site for posting.

FFTB, as far as I recall, only uses i-input for displaying the sentence – both in the “homepage” list of sentences and in the area at the top of the treebanking interface where you select spans with the mouse. FFTB does not do any analysis whatsoever on that text. Any necessary data is prerecorded in the token structures provided by the grammar, which are stored in the edge relation of the profile being treebanked. FFTB does parse those token structures, if I recall correctly, to find the character offsets spanned by them (typically in +FROM and +TO in the token AVM, I think – it’s been a while). FFTB needs to know those character offsets in order to know what part of the i-input string should be click-and-draggable and how to relate that to the different parts of the stored edges.


@arademaker, the most recent version of the SRG is here (the development branch). It is not fully working, though; there are annoying differences from the older logon version (which uses the older morphophonological analyzer).

I have not yet released the treebanks because before I do that, I need to first figure out all these issues that we are discussing. Then I will be able to pair a grammar release with a treebank version. For now, this is still in progress.

Also, may I suggest that we move the general discussion of dealing with morphophonology into a separate thread?

Alright, so I did the following for the old SRG MRS test suite:

  1. Updated the database schema. To check, I confirmed that the new profile loads in fftb.
  2. Updated the profile with i-tokens obtained from the Freeling morphophonological analyzer.
  3. Processed the profile with ACE, using the --full-forest option as well as -y --yy-rules, and specifying i-tokens as the input for processing (see the sketch after this list). I then verified that I can, in principle, treebank using fftb.
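
For reference, step 3 looked roughly like the following (an illustrative sketch composed from the examples earlier in the thread; the grammar image and profile paths are placeholders):

$ delphin process -g srg.dat --full-forest --options="-y --yy-rules" --select i-tokens mrs/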

Now I am trying to do what @sweaglesw suggested above, namely, update the treebank using an old decisions file.

For now, I am encountering an error:

grammar image: /home/olzama/delphin/srg/ace/srg-original.dat
Just one TSDB profile: mrs
Would update from profile: /home/olzama/delphin/logon/upf/srg/tsdb/mrs
listening on http://127.0.0.1:57157/private/
should GET    /private/
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /private/parse?profile=/&id=11
item id 11 -> input 'Llueve.'
unexpected error in tsdb file /home/olzama/delphin/logon/upf/srg/tsdb/mrs/item:1

Any further ideas on how to proceed from here? Thanks a lot again for all the help so far!

Update: after the meeting today with Dan and others, we made some progress (namely, instead of using the actual old gold profile, we created a new, modern profile but added to it the decision, preference, and tree files from the old profile, which had the outdated schema and everything).

Now, if I try to update a modern profile with new edges but empty decision etc., using the “faked” old gold, I get further. It looks like maybe some items actually have to be re-parsed; for others, maybe something else needs to be fixed… We’ll see.

For now, I think it is probably OK to close this thread and start new ones for specific issues. But the last thing I wanted to ask here: what should I do about the “unexpected errors”? Is there any way to debug those? I still have plenty of them, even though they didn’t make it into the screenshot :).