Running different versions of ACE on the ERG treebanks and comparing the results

I need to run two different versions of ACE (not the ERG!) on the ERG treebanks, and then compare the results using FFTB to establish differences with respect to parsing accuracy. (Update: I realized that, in the end, given the way I phrased the question, it is irrelevant whether I am comparing ACE versions or ERG versions; what is relevant is that I need to run ACE and end up with an FFTB-loadable profile.)

I added a new option to my experimental version of ACE (temporarily called -z; its value is a path to a directory with some additional data to speed up parsing). So, to just parse, I do:

./ace -g erg.dat -R -z supertags/sentences/ sentences.txt 

But I don’t want to just parse, I want to compare the accuracy using FFTB.

What is the best option? Based on reading about ACE options, I can do:

./ace -g erg.dat --tsdb-stdout --itsdb-forest -R -z supertags/sentences/ sentences.txt 

The above produces some output, but I don't know what to do with it. How do I use it to compare two treebanks with FFTB?

I am used to using pydelphin to parse tsdb profiles for the SRG from the command line (delphin process), but I cannot figure out how to pass the new argument. I will create a separate question about that.

What are you trying to find out in the comparison? Are you looking for differences in coverage? Ambiguity? Parsing time?

I wonder if FFTB isn't the tool you want in this case, but rather maybe something with pydelphin. Once you've parsed the treebanks, pydelphin wouldn't need to know which version of ACE you used or about the -z flag.

I need to know the differences in accuracy, that is, whether the parses I get with this new setup are the same as the gold ones (or how many of them are). So I think FFTB is what I need? (That said, the parsing itself can be done with a pydelphin wrapper; it has some limitations compared to running ACE directly, so these are really just two different options. I am experimenting with pydelphin as well.)

So I think you just want to use pydelphin to compare the profiles, like we do in the Grammar Matrix regression testing setup. The only question I have is how pydelphin interacts with the full-forest encoding…

Ah, thanks for reminding me about the regression test setup! I will take a look at how that’s done.

@sweaglesw replies:

Do you have a useful parse ranking model at the moment or not? If so, it may be more interesting to compare the top ranked result with gold rather than comparing the whole forest with gold.

If what you really want to do is check whether the gold parse is still in the (pruned?) forest produced by your revised ace, it ought to be straightforward to ask fftb to do an automatic update on a profile parsed with the new system and see how many items that were accepted before end up unanalyzed.

The question of getting pydelphin to pass an extra option through to ace may have to be answered by Mike. You could do it using art, e.g.:

# make a fresh profile from the gold skeleton
mkprof -s erg/tsdb/gold/mrs new-mrs-profile
# parse it in full-forest mode with the new ace
art -f -a "./ace -O -g erg.dat -z supertags/sentences/" new-mrs-profile
# automatically update the new profile against the gold treebank
fftb -g erg.dat --gold erg/tsdb/gold/mrs --auto new-mrs-profile

I just tried this (without the -z flag) and got a successful automatic update. Might work for you?

The delphin.ace module defines an allow-list of command line arguments for ACE that PyDelphin can handle. Since PyDelphin has to parse the output of ACE, it does not allow any arguments that alter the format of the output in ways it cannot parse.

If the -z option does not change the format of ACE’s output, then the easiest way to get PyDelphin to accept it would be something like this:

from delphin import ace

# allow -z (which takes a directory path) through PyDelphin's argument check
ace._ace_argparser.add_argument('-z')
parser = ace.ACEParser(..., cmdargs=['-z', 'supertags/sentences/'])
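
For instance, continuing the snippet above (an untested sketch; it assumes the ... has been replaced with a real grammar path, and the sentence is just a placeholder):

# send a sentence to the patched parser and inspect the top-ranked result
response = parser.interact('The dog barks.')
if response.results():
    print(response.result(0)['mrs'])
parser.close()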

Note that this method is only available through the library usage of PyDelphin and not via the delphin process command. Also note that it uses the non-public module attribute _ace_argparser, which is neither documented nor guaranteed to persist across versions, although I have no plans to change it at the moment.

PyDelphin is fully capable of loading and inspecting the profiles, but it has no special functionality to interpret full-forest edges for comparison, so you'd have to implement the comparison logic yourself. If you can instead parse to a regular profile with enumerated results, then you can compare them much as is done for the Grammar Matrix.
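
For concreteness, here is a rough, untested sketch of that kind of comparison, assuming both profiles have enumerated results with MRSs in the result relation, and simplifying by comparing only the top-ranked result per item (the profile paths are just the ones used earlier in this thread):

from delphin import itsdb, mrs
from delphin.codecs import simplemrs

def top_mrs(profile_path):
    """Map each item's i-id to the MRS of its top-ranked (result-id 0) result."""
    ts = itsdb.TestSuite(profile_path)
    # the result relation is keyed by parse-id, so join through the parse relation
    iid_of = {row['parse-id']: row['i-id'] for row in ts['parse']}
    top = {}
    for row in ts['result']:
        if int(row['result-id']) == 0 and row['mrs']:
            top[iid_of[row['parse-id']]] = simplemrs.decode(row['mrs'])
    return top

gold = top_mrs('erg/tsdb/gold/mrs')   # released gold profile
new = top_mrs('new-mrs-profile')      # profile parsed with the new setup

matches = sum(1 for iid, gm in gold.items()
              if iid in new and mrs.is_isomorphic(new[iid], gm))
print(f'{matches} of {len(gold)} items have a top result isomorphic to gold')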


Thanks, @goodmami !

So, I am actually not sure what I need. I am experimenting with a new supertagger and ACE: I parse the ERG treebanks with ACE plus the supertagger and then want to compare time, coverage, and accuracy. Time and coverage are no problem (well, time is interesting, but still not a problem). For accuracy, what I have is the released ERG treebanks. Those are loadable in pydelphin, including the gold derivations; they have a non-empty result file as well as edges. Can I compute accuracy using pydelphin in such a case? I can parse with my experimental setup however I want, assuming I can add -z (which I can; I already did it by hacking my local copy of pydelphin's ace.py).

Ok, so the gold profile has enumerated results with derivations and MRSs. If by accuracy you mean something based on having equivalent MRSs, then you can do this as long as you are parsing the new profile with enumerated results (i.e., not full-forest mode). Once you have the non-full-forest profile, you can use PyDelphin’s delphin compare at the command line or something like delphin.mrs.compare_bags() in Python.

If you are parsing in full-forest mode, PyDelphin cannot help with comparing the MRSs as they are not present in the profile.

I suggest using an alternative strategy for comparison that avoids the need to get entangled in the full-forest treebanking machinery. Since we have a reasonable parse selection model trained for the 2023 version of the ERG that you are using, you should be able to simply parse a profile with the standard ERG finding only the top-ranked parse, and then reparse that same profile with supertagging enabled, again recording only the top-ranked parse. Then compare the two profiles to see how often the supertagger produces a different parse (thus failing to find the top-ranked one that the standard parsing strategy produces). That should be a valid measure of the accuracy of the supertagger, and of course also lets you measure the improvement in efficiency. This method also has the benefit of letting you do the comparison on other data sets that are not in Redwoods, if you wish.
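
If it helps, here is a rough, untested sketch of that strategy using pydelphin's ace wrapper. It assumes the -z argparser workaround from earlier in the thread has been applied, that -n is accepted in cmdargs, and that sentences.txt, erg.dat, and the supertags directory are as in the earlier commands:

from delphin import ace, mrs
from delphin.codecs import simplemrs

sentences = [line.strip() for line in open('sentences.txt') if line.strip()]

def top_parses(cmdargs):
    """Parse all sentences, returning the top-ranked MRS (or None) for each."""
    out = []
    with ace.ACEParser('erg.dat', cmdargs=cmdargs) as parser:
        for sent in sentences:
            response = parser.interact(sent)
            results = response.results()
            out.append(simplemrs.decode(results[0]['mrs']) if results else None)
    return out

baseline = top_parses(['-n', '1'])                              # standard ERG
tagged = top_parses(['-n', '1', '-z', 'supertags/sentences/'])  # with the supertagger

same = sum(1 for b, t in zip(baseline, tagged)
           if b is not None and t is not None and mrs.is_isomorphic(b, t))
print(f'{same} of {len(sentences)} sentences get the same top-ranked analysis')

Here "same parse" is approximated by MRS equivalence (isomorphism); comparing derivation trees instead would be stricter, since distinct derivations can yield equivalent MRSs.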