Assessing the accuracy of treebanks parsed with a parse ranking model

What’s the best procedure to assess the accuracy of a grammar run on a corpus using a parse ranking model? So, say, with ACE using the -1 flag but not in full-forest mode? FFTB, I think, requires full-forest mode, so I don’t understand how to then accept or reject the result. Surely there is a way to do that with ACE+FFTB?

…Or is the way to load both the gold treebank and the -1 treebank in pydelphin and compare MRS? What do people do normally?


There is a way of doing that with the acetools, described here:


For very simple use cases you can use PyDelphin’s compare command/function:

$ delphin compare tmp-cur/ tmp-gold/ 
1	<0,1,0>
2	<1,0,1>

The output shows <unique-parses-in-current, shared-parses, unique-parses-in-gold>. Note that items where neither side parsed the input (i.e., <0,0,0>) are skipped.

If you want to tabulate the results instead of just viewing the per-item comparisons, you may find it easier to use the Python API:

>>> from delphin.commands import compare
>>> total = diffs = 0
>>> for result in compare('tmp-cur', 'tmp-gold'):
...     total += 1
...     if result['test'] != 0 or result['gold'] != 0:
...         diffs += 1
>>> total
>>> diffs

Again, the total above only includes items where at least one side has a parse. If you want the total number of items, you can get it like this:

>>> from delphin import itsdb
>>> cur = itsdb.TestSuite('tmp-cur')
>>> len(cur['item'])

Thanks, @bond and @goodmami ! I think I may have confused everyone with my second post. I actually don’t want to compare anything (at least not in all cases). What I want is to run a new corpus (e.g. a test corpus) with the -1 flag and without the --full-forest pydelphin flag, because I only want to get one tree per sentence (or maybe at most n trees, e.g. n=5), and then I want to be able to assess the accuracy of that. Without having a full forest ever.

How do people do that? @bond is that the scenario in that Indra documentation (I don’t think so?)

I’m now confused. Is that not what we showed above?

I think you also need to define what your test is for accuracy. The delphin compare method doesn’t require the -1 option; it works with n results in the current or gold profile, where n is 0 or greater. If it returns the tuple <x,y,z>, you might define a success as:

  • <0,0+,0> – Any and all parses are the same in both profiles; neither has any unique parses. This is the approach used by the Grammar Matrix’s regression tests (see
  • <0+,0+,0> – Any and all gold parses have equivalent MRSs in the current profile; current profile may have extra. This focuses on recall over precision.
  • <0+,1+,0+> – There exists a shared parse; ignore unique parses in either side. This doesn’t say much except that the grammar can produce one of the expected parses.
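As a rough sketch, the three criteria above could be tabulated like this. It assumes each comparison result is a dict with 'test', 'shared', and 'gold' counts, as in the compare() loop earlier in the thread; the function names, criterion labels, and sample data are made up for illustration and are not part of PyDelphin.

```python
from collections import Counter


def classify(result):
    """Return the set of success criteria that one comparison result meets.

    `result` is assumed to be a dict with integer counts under the keys
    'test', 'shared', and 'gold' (hypothetical labels below).
    """
    test, shared, gold = result['test'], result['shared'], result['gold']
    passed = set()
    if test == 0 and gold == 0:
        passed.add('exact')       # <0,0+,0>: no unique parses on either side
    if gold == 0:
        passed.add('recall')      # <0+,0+,0>: every gold parse is matched
    if shared >= 1:
        passed.add('any_shared')  # <0+,1+,0+>: at least one shared parse
    return passed


def tabulate(results):
    """Count how many results meet each criterion; also return the total."""
    counts = Counter()
    total = 0
    for result in results:
        total += 1
        counts.update(classify(result))
    return total, counts


# Hypothetical sample: the first two items mirror the <0,1,0> and <1,0,1>
# output shown earlier; the third is invented.
sample = [
    {'id': 1, 'test': 0, 'shared': 1, 'gold': 0},
    {'id': 2, 'test': 1, 'shared': 0, 'gold': 1},
    {'id': 3, 'test': 2, 'shared': 1, 'gold': 0},
]
total, counts = tabulate(sample)
```

With real profiles you would presumably pass the generator from compare('tmp-cur', 'tmp-gold') to tabulate() instead of the sample list. Keep in mind that items where neither side parsed are skipped by compare, so they are not in the total.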

None of the above considers parse ranking order. If you want the top-1 parse to be the same as those in the gold profile, use the -1 option or write your own comparison logic that doesn’t use delphin.mrs.compare_bags().

I think you show comparison to a gold treebank (what I talk about in the second post, erroneously). What I need is accepting or rejecting trees from something that was run with a parse ranking model and not in full-forest mode. Does that make sense? So, suppose it’s an entirely new corpus and I don’t want to ever treebank the full forest. I only want to run it with -1 or -5 or something like that, and then accept or reject trees from that corpus. No full forest. Do I have to use tsdb++, or?..

This much is clear. I never suggested anything above that would use the full-forest mode.

Ok, so you want to check whether (a) the grammar can produce a correct parse and (b) that the parse ranking model puts it at the top-1 or within the top-5? This sounds like regular treebanking using [incr tsdb()].

Unsolicited suggestion: Maybe you truly don’t care about correct parses appearing after the first or top-5, which is fine, but I think there is value in keeping a larger amount if you’re willing to inspect and filter more than 1 or 5 results during treebanking since you can get the rank from the result-id field to see if it is 0 (first parse), less than or equal to 4 (top-5), etc. E.g., maybe later with a profile from a newer version of the grammar/parse-ranking-model you want to know if the correct parse moved up from the top-20 into the top-5.
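To make the rank idea concrete, here is a minimal sketch of computing top-n accuracy once you know which result was accepted for each item. It assumes you already have a mapping from item id to the accepted result-id (or None if no parse was accepted), and that result-ids follow the ranking order as described above. How you extract that mapping from a treebanked profile is not shown; the function name and the data are hypothetical.

```python
def top_n_accuracy(accepted, n):
    """Fraction of items whose accepted parse is ranked within the top n.

    `accepted` maps item id -> accepted result-id (0 is the first parse),
    or None when no parse was accepted for that item.
    """
    if not accepted:
        return 0.0
    hits = sum(1 for rid in accepted.values() if rid is not None and rid < n)
    return hits / len(accepted)


# Hypothetical treebanking outcome for four items.
accepted = {10: 0, 20: 3, 30: None, 40: 17}

top1 = top_n_accuracy(accepted, 1)    # only item 10 counts
top5 = top_n_accuracy(accepted, 5)    # items 10 and 20 count
top20 = top_n_accuracy(accepted, 20)  # items 10, 20, and 40 count
```

Keeping more than 5 results per item is what makes the top-20 figure computable at all; with a -5 run, item 40 would just look like a failure.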


Thanks, @goodmami !

Yes, that’s what I thought (that I should probably be using tsdb++). I remember it being problematic though, and having to switch to fftb (which requires full forest). Hence my question (how do people do it). I suppose the answer still is “with tsdb++”!

I think in the end, we will do a full-forest treebanking pass and then will compare a -1 run with that using pydelphin.

I am still not sure how people actually do this, in the absence of a full-forest profile.

After a profile is treebanked in [incr tsdb()], what happens then? Treebanking populates the tree file in the profile folder, but can I use the PyDelphin libraries to compute accuracy from that file somehow? If not, then what do people do? Look at accuracy in [incr tsdb()]? I just tried, and it often dies when I finish treebanking via Annotate. I now remember that I had a similar experience with it when I tried using it for my dissertation, and the solution was “switch to fftb” (in that case, I was not using a parse ranker, so full forest made sense).

So, the question remains: how exactly do people assess the accuracy of ranked models? What is the adopted process?

@arademaker says he has a procedure with fftb; hopefully we can figure it out. I can’t (fftb won’t load my top-ranked profiles). By the way, the problem with just using pydelphin compare is that you still have to check all the diffs manually, because there may well be equally good results. So, in the end, there must still be a way to accept or reject trees obtained by running a parser with a parse ranking model, preserving the decisions and so on.