Reparsing and updating a treebank keeping previous decisions

Sorry, but the last message from @olzama is not clear to me. I tend to believe that going too low-level, as @goodmami is suggesting, may not be safe… But I haven’t heard from @sweaglesw, and I don’t know how and when FFTB uses i-input… so my naive suggestion of preprocessing with pydelphin/ace, so that some columns are updated in the input before FFTB, may not work.

@olzama, is the grammar up to date in the delph-in/srg GitHub repository (Spanish Resource Grammar)? Are the treebanks in the same repo?

Sorry, I’m not sure what you’re referring to. I mentioned that delphin.tsdb is a lower-level approach to what the delphin.itsdb module offers, but these methods are all “above-board”, so I’m not sure what is not safe.

@sweaglesw said that i-input needs to be the original input sentence for FFTB to work (presumably for at least highlighting token spans), so the initial suggestion of putting the tokens in the i-input field was not ideal and we backed off of that. I’d forgotten about PyDelphin’s --select option, so it already has a perfectly usable alternative, assuming we want to store tokens in the profile (e.g., in the i-tokens field) before parsing.
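For reference, here is a minimal sketch of the same thing through the Python API that the command line wraps. It assumes PyDelphin 1.x, where `delphin.commands.process` accepts `select`, `full_forest`, and `options` arguments; the wrapper name `parse_from_tokens` is made up for illustration, and the ACE options are simply the ones mentioned in this thread, so please check them against your installed versions.

```python
def parse_from_tokens(grammar_path, profile_path):
    """Parse a profile in place, feeding ACE the i-tokens field.

    Hypothetical wrapper: the argument names follow PyDelphin 1.x as I
    understand it and should be verified before relying on this.
    """
    # Imported lazily so the sketch loads even where PyDelphin is absent.
    from delphin import commands

    commands.process(
        grammar_path,                  # compiled ACE grammar image (.dat)
        profile_path,                  # [incr tsdb()] profile directory
        select="i-tokens",             # parse this field instead of i-input
        full_forest=True,              # keep the full forest for FFTB
        options=["-y", "--yy-rules"],  # extra ACE options for YY-mode input
    )
```

Nothing is parsed until the function is called with a real grammar image and profile directory.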

If we don’t want to store tokens in the profile and instead want to perform preprocessing at the same time as parsing the profile, then PyDelphin’s command line interface (which is really just a convenient front-end to the Python API) will not work as it does not have an option for custom preprocessors.

Either method seems ok to me. Once the profiles have been parsed, then FFTB is used to update them with earlier decisions and to make new ones as needed.

Oh… I was saying that using a method close to what process offers would be the safer path. Ideally, @olzama should not need to modify the files directly. The process command is the best interface: given a profile, it processes it with a given grammar and parameters, populating the profile as necessary to record the results.

I believe @goodmami, we are on the same page. You are right, the level of abstraction is relative. As long as @olzama does not try to write in the files directly, I mean, writing lines in the text files of the profile herself, I would say… it should be fine IMHO, right?

But we are going too deep here without concrete tests. For instance, the idea of using i-tokens and i-input seems fine, but it depends on how FFTB uses i-input. Suppose FFTB does the tokenization and lexical analysis once we click on the sentence to start the annotation. In that case, it would differ from the tokenization/POS and morphological analysis done by the external tool @olzama is using. What we want is a way to enforce that whatever FFTB does to prepare the sentence for annotation and to store the human-provided analysis, it must start from the morphological analysis provided by the external tool, right?

Hi @ebender, thank you for sharing this paper. I guess my next reading is https://faculty.washington.edu/ebender/papers/Montage_LREC.pdf. If I got it right, the tools are still under development, right? I believe I got something from the examples in Slave, Sec 3.2.1 and 3.2.2, but I didn’t really understand what Montage does that finite-state techniques can’t do. Maybe the point is more about maintainability and use by descriptive linguists.

After all, at the end of the paper, you mention the generation of XFST files, so an open-source tool like http://fomafst.github.io can use them. That is something I would be interested in exploring. We are trying to keep the PorGram lexicon in sync with our MorphoBr full-form dictionary.

Sorry for being slow to reply, folks. I don’t know why but the discourse website silently drops replies that I send by email these days, and it is less frequent that I have a computer in front of me that is actually signed into the discourse site for posting.

FFTB, as far as I recall, only uses i-input for displaying the sentence – both in the “homepage” list of sentences and in the area at the top of the treebanking interface where you select spans with the mouse. FFTB does not do any analysis whatsoever on that text. Any necessary data is prerecorded in the token structures provided by the grammar, which are stored in the edge relation of the profile being treebanked. FFTB does parse those token structures, if I recall correctly, to find the character offsets spanned by them (typically in +FROM and +TO in the token AVM, I think – it’s been a while). FFTB needs to know those character offsets in order to know what part of the i-input string should be click-and-draggable and how to relate that to the different parts of the stored edges.
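To make the offset idea concrete, here is a small sketch of recovering character spans from token structures and mapping them back onto i-input. This is not FFTB's code: the serialized AVM strings below are a hypothetical format chosen for illustration, and the real representation stored in the profile may differ.

```python
import re

# Hypothetical serialized token AVMs; the format stored in a real
# profile's edge relation may look different.
TOKENS = [
    'token [ +FORM "llueve" +FROM "0" +TO "6" ]',
    'token [ +FORM "." +FROM "6" +TO "7" ]',
]

FROM_RE = re.compile(r'\+FROM\s+"(\d+)"')
TO_RE = re.compile(r'\+TO\s+"(\d+)"')


def token_span(avm):
    """Return the (start, end) character offsets recorded in one token AVM."""
    start = FROM_RE.search(avm)
    end = TO_RE.search(avm)
    if start is None or end is None:
        raise ValueError(f"no +FROM/+TO offsets in token: {avm!r}")
    return int(start.group(1)), int(end.group(1))


def surface_spans(i_input, avms):
    """Pair each token's offsets with the substring of i-input it covers."""
    return [(s, e, i_input[s:e]) for s, e in map(token_span, avms)]


spans = surface_spans("Llueve.", TOKENS)
print(spans)  # [(0, 6, 'Llueve'), (6, 7, '.')]
```

The point is that the offsets index into the original string, which is why i-input has to stay the unmodified sentence.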

@arademaker , the most recent version of the SRG is here (the development branch). It is not fully working though, there are annoying differences with the older logon version (which uses the older morphophonological analyzer).

I have not yet released the treebanks because before I do that, I need to first figure out all these issues that we are discussing. Then I will be able to pair a grammar release with a treebank version. For now, this is still in progress.

Also, may I suggest that we move the general discussion of dealing with morphophonology into a separate thread?

Alright, so, I did the following, for the old SRG MRS test suite:

  1. Updated the database schema. To check, I confirmed that the new profile loads in fftb.
  2. Updated the profile with i-tokens obtained from the Freeling morphophonological analyzer.
  3. Processed the profile with ACE, using --full-forest option as well as -y --yy-rules and specifying i-tokens as the input for the processing. I then verified that I can in principle treebank, using fftb.
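For anyone following along, step 2 amounts to turning the analyzer output into YY-mode tokens. Below is a rough sketch, assuming the YY token layout `(id, start, end, <from:to>, paths, "form", ipos, "lrules", "tag" prob ...)` as I understand it from the DELPH-IN documentation. The `ANALYSIS` tuples are invented stand-ins for analyzer output, not real Freeling output, and the way the SRG actually encodes tags in its tokens may differ.

```python
def yy_token(tok_id, start, end, cfrom, cto, form, pos_tags):
    """Format one YY-mode token; pos_tags is a list of (tag, probability)."""
    escaped = form.replace("\\", "\\\\").replace('"', '\\"')
    fields = [
        str(tok_id),          # token id
        str(start),           # start vertex in the token lattice
        str(end),             # end vertex in the token lattice
        f"<{cfrom}:{cto}>",   # character span in the original sentence
        "1",                  # lattice path membership
        f'"{escaped}"',       # token form
        "0",                  # ipos
        '"null"',             # no lexical rules pre-applied
    ]
    if pos_tags:
        fields.append(" ".join(f'"{tag}" {prob:.4f}' for tag, prob in pos_tags))
    return "(" + ", ".join(fields) + ")"


# Invented stand-in for analyzer output: (form, cfrom, cto, tag).
ANALYSIS = [("llueve", 0, 6, "VMIP3S0"), (".", 6, 7, "Fp")]

yy_line = " ".join(
    yy_token(i + 1, i, i + 1, cfrom, cto, form, [(tag, 1.0)])
    for i, (form, cfrom, cto, tag) in enumerate(ANALYSIS)
)
print(yy_line)
```

The resulting string is what would go into the i-tokens field for that item, with the character spans still pointing into the untouched i-input.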

Now I am trying to do what @sweaglesw suggested above, namely update the treebank using an old decisions file.

For now, I am just encountering some error:

grammar image: /home/olzama/delphin/srg/ace/srg-original.dat
Just one TSDB profile: mrs
Would update from profile: /home/olzama/delphin/logon/upf/srg/tsdb/mrs
listening on http://127.0.0.1:57157/private/
should GET    /private/
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /favicon.ico
should GET    /private/parse?profile=/&id=11
item id 11 -> input 'Llueve.'
unexpected error in tsdb file /home/olzama/delphin/logon/upf/srg/tsdb/mrs/item:1

Any further ideas on how to proceed from here? Thanks a lot again for all the help so far!

Update: after the meeting today with Dan and others, we made some progress (namely, instead of the actual old gold profile, we created a new, modern profile but added to it decision, preference, and tree from the old profile which had the outdated schema and everything).

Now if I try to update a modern profile with new edges but empty decision etc. using the “faked” old gold, I get further:

It looks like maybe some of them actually have to be re-parsed, for others, maybe something else needs to be fixed… We’ll see.

For now, I think it is probably OK to close this thread and start new ones for specific issues. But the last thing I wanted to ask here: what should I do about the “unexpected errors”? Is there any way to debug those? I still have plenty of them, even though they didn’t make it into the screenshot :).
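One sanity check that might help narrow these down, sketched here under the assumption (not confirmed for FFTB) that some of the errors come from rows whose field count does not match the schema, as can happen with profiles using an outdated schema. It only reads text and never writes to the profile. The function names are made up, and the inline `RELATIONS_TEXT` and `ITEM_TEXT` are toy data rather than the real SRG profile.

```python
def read_schema(relations_text):
    """Map each relation name to its list of field names."""
    schema = {}
    current = None
    for line in relations_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not line[0].isspace() and stripped.endswith(":"):
            current = stripped[:-1]          # e.g. "item:" starts a relation
            schema[current] = []
        elif current is not None:
            schema[current].append(stripped.split()[0])  # field name only
    return schema


def bad_rows(schema, relation, table_text):
    """List (line number, fields found, fields expected) for mismatched rows."""
    expected = len(schema[relation])
    problems = []
    for lineno, row in enumerate(table_text.splitlines(), start=1):
        found = len(row.split("@"))          # "@" separates fields
        if found != expected:
            problems.append((lineno, found, expected))
    return problems


# Toy data: the schema declares three fields but the row has only two.
RELATIONS_TEXT = "item:\n  i-id :integer :key\n  i-input :string\n  i-tokens :string\n"
ITEM_TEXT = "11@Llueve.\n"

schema = read_schema(RELATIONS_TEXT)
problems = bad_rows(schema, "item", ITEM_TEXT)
print(problems)  # [(1, 2, 3)]
```

On a real profile, the two strings would come from reading the `relations` and `item` files in the profile directory; an empty result would rule out this particular cause.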