Understanding and using released ERG treebanks

I spent about an hour reading what I could find on the wiki about treebanking (with great interest) but most of that seems to be about how to do treebanking. At the moment, I want to look at the already treebanked data, say, in [incr tsdb()].

I am pretty sure that the data in the ERG tsdb/gold/ directory is treebanked data. I can load it into tsdb but I can’t figure out how to access the gold parse. I tried filtering for “Annotated” in Options, but I still seem to get all the possible parses, and I don’t think the first one is necessarily the correct one. E.g. in the example below:

[Screenshot from 2021-11-18 11-59-22]

What’s the right way of looking at the treebanked data, or where is it documented in the wiki exactly?

Alternatively, I tried fftb. I installed acetools and art, made sure I was using the 0.9.30 version of everything as per the instructions, compiled the grammar using the 0.9.30 ace, created a fresh profile, and then ran:

fftb -g erg/2020/ace/grm30.dat --browser --webdir acetools/ testprofile/ --gold erg/2020/tsdb/gold/ccs

I can see the items in the browser but clicking on any of them results in the “404 no stored forest” error. How do I access the treebanked version? Either tsdb or fftb would be great.

I can see that the wiki falls short on how to view gold profiles. First, what you originally tried with the LKB and [incr tsdb()] will work, as follows: once you have the table you showed resulting from clicking on Browse–Results for a particular profile, choose a sentence, and then double-click on the “1” in the “derivation” column. This brings up a little window showing the stored derivation tree in bracketed text format, so now double-click on that one content line, and a new window will pop up showing the corresponding graphical parse tree. I think you were double-clicking on the text of the sentence itself in the Browse–Results window, and that causes the LKB to reparse the sentence, producing all analyses.

I think you would need to do more in order to use fftb and ACE to view the treebank results, because fftb expects to have a parse forest for each sentence, and these are stored in the “edge” relation in the [incr tsdb()] profile. But storing the forests for each sentence takes up space, and has not been practical for inclusion in the gold profiles for each ERG release, so I have discarded the “edge” content for each gold profile. One can in principle reparse a gold profile using ACE to “repopulate” that “edge” relation, and then fftb will be happy to show you the treebanked parse selected from each sentence’s parse forest. But it’s probably not worth the effort for your purposes.
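For anyone who wants the stored derivations without going through a GUI, here is a minimal Python sketch that reads them straight from the profile files. It is not an official tool. It assumes the usual [incr tsdb()] on-disk layout: a “relations” file listing the fields of each relation, and one file per relation (possibly gzipped) with “@”-separated fields. The field names i-id, i-input, parse-id and derivation are assumed from the standard schema, and the function names are made up for illustration.

```python
import gzip
from pathlib import Path


def read_schema(profile):
    """Map each relation name to its ordered field names, read from 'relations'."""
    schema, current = {}, None
    text = Path(profile, "relations").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        # An unindented line ending in ':' starts a new relation.
        if not line[0].isspace() and line.strip().endswith(":"):
            current = line.strip()[:-1]
            schema[current] = []
        elif current is not None:
            # An indented line starts with the field name.
            schema[current].append(line.split()[0])
    return schema


def read_relation(profile, name, schema):
    """Yield one dict per row of a relation stored as a plain or gzipped file."""
    plain = Path(profile, name)
    zipped = Path(profile, name + ".gz")
    if plain.exists() and plain.stat().st_size > 0:
        handle = open(plain, "rt", encoding="utf-8")
    elif zipped.exists():
        handle = gzip.open(zipped, "rt", encoding="utf-8")
    else:
        return
    with handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                # Fields are '@'-separated; escape sequences are left as-is here.
                yield dict(zip(schema[name], line.split("@")))


def stored_derivations(profile):
    """Yield (i-id, i-input, derivation) for every row of the 'result' relation."""
    schema = read_schema(profile)
    inputs = {r["i-id"]: r["i-input"] for r in read_relation(profile, "item", schema)}
    items = {r["parse-id"]: r["i-id"] for r in read_relation(profile, "parse", schema)}
    for r in read_relation(profile, "result", schema):
        i_id = items.get(r["parse-id"])
        yield i_id, inputs.get(i_id), r["derivation"]
```

Calling stored_derivations("erg/2020/tsdb/gold/ccs") (the profile path from the question above) should then give the same bracketed derivation strings that the “derivation” column shows in Browse–Results, provided the profile follows the layout assumed here.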


What is your use case, @olzama? It may be an opportunity to show you our reimplementation of the wsi interface.

Hi @Dan and @olzama, which wiki pages do you think we could add @Dan’s explanation above to?

In the past I also got this “404 no stored forest” error and someone explained it to me, though I can’t remember who or where.

@arademaker this is for my project about neural supertagging. I am just starting things, so I simply wanted to look at my prospective training data. Of course I won’t be training using these GUIs but to explore they are very nice.

I would say a small rephrasing somewhere in TreebankingTop, indicating that the linked pages contain info not only on how to create a fresh treebank but also on how to explore an existing one, and then perhaps add a section to Treebanking with the Fine System, if we are talking about [incr tsdb()].

Excellent, thank you, Dan, this works just as you describe. I added these instructions to the ItsdbTreebanking page of the delph-in/docs wiki on GitHub.