--full-forest flag for ACE and comparing treebanks in tsdb++

When I run ACE with the --full-forest flag (in order to be able to treebank with fftb, for example) and then load that profile into tsdb++, tsdb++ seems to think there is only one reading per sentence in that profile. If I run without --full-forest, then I get all the readings and can compare profiles with respect to ambiguity and so on. Do I need to keep two versions in order to be able to use fftb to treebank and tsdb++ to compare grammars wrt ambiguity?

I thought [incr tsdb()] did not support full-forest treebanking. Has that changed?

I would imagine not! So, does it follow that I must have two versions of the treebanks if I want to use both tools?

I suppose? Isn’t the regular profile an export of the full-forest one? So I suppose you could view it as an intermediary artifact. You create the exported profile in order to get comparison numbers, then you can throw it away.

Otherwise, trying to keep the full-forest and regular profiles in sync sounds like a pain.

No, you don’t need two versions of the treebank in order to use both ACE and [incr tsdb()], but you will have to parse a particular profile twice, once with --full-forest so you can actually treebank it using the fftb tool, and once without that flag so you can see how many candidate analyses are computed for each sentence.
Each of your profiles contains both an “edge” file and a “result” file. When you supply the --full-forest flag in parsing a profile, ACE stores the full forest for each sentence of the profile in the “edge” file, and when you annotate a sentence to identify the intended analysis, that one preferred derivation is stored in the “result” file. If you don’t ask ACE to produce the full forests, then you are asking it to give you all of the parses for each sentence, and it stores those in the “result” file (leaving the “edge” file empty).
This latter option is okay for relatively short sentences, but for longer sentences you would end up with thousands or millions or even billions of parses, and it would not be practical to store them all in the “result” file. This combinatorial explosion in parses for longer sentences is why the full-forest option for ACE and the “fftb” tool are so valuable.
But if you are currently working with sentences of length 10-15 tokens for your treebanking, you may well be able to afford to store all of the analyses without the full-forest option, and then [incr tsdb()] can indeed give you more detailed comparison between parsing runs over a profile by showing you how many parses were found for each sentence in each run. But you would presumably still like to know if the intended analysis is present among that set of candidates, so if you just work with the (non-full-forest) run where you store them all in the “result” file, you would need to use the older LKB-based treebanking tool to find that intended analysis for each sentence, since I don’t think fftb can work without the full forest stored in the “edge” file.
It seems preferable to parse first with --full-forest and treebank, and then if you want to compare ambiguity numbers with a previous run, reparse without that flag. But you won’t need (or want) to treebank that second run – it will just allow you to compare ambiguity measures.
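To make the “edge” vs. “result” distinction concrete, here is a minimal Python sketch for inspecting a profile directory. It is not part of ACE, fftb or [incr tsdb()]. It assumes that each profile file holds one record per line with fields separated by "@", that the parse id is the first field of “result” (check the profile's “relations” file, and adjust `parse_id_column` if yours differs), and that a file may be stored either plain or gzipped. The function names are made up for illustration.

```python
import gzip
from collections import Counter
from pathlib import Path


def read_rows(profile_dir, name):
    """Return the non-empty lines of a profile file (plain and/or .gz form)."""
    rows = []
    candidates = (
        (Path(profile_dir) / name, open),
        (Path(profile_dir) / (name + ".gz"), gzip.open),
    )
    for path, opener in candidates:
        if path.exists():
            with opener(path, "rt", encoding="utf-8") as handle:
                rows.extend(line.rstrip("\n") for line in handle if line.strip())
    return rows


def has_stored_forest(profile_dir):
    """True if the "edge" file has any rows, i.e. a full forest was stored."""
    return bool(read_rows(profile_dir, "edge"))


def results_per_parse(profile_dir, parse_id_column=0, sep="@"):
    """Count the rows in "result" for each parse id.

    Assumption: the parse id sits in column `parse_id_column` of each row.
    """
    counts = Counter()
    for row in read_rows(profile_dir, "result"):
        counts[row.split(sep)[parse_id_column]] += 1
    return counts
```

On a run parsed without --full-forest, `results_per_parse` should give the number of stored analyses per parse. On a full-forest run that has been treebanked, you would expect at most the one preferred derivation per parse, which is consistent with tsdb++ reporting a single reading.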

Thank you, @Dan! But if FFTB overwrites the result file, then I suppose I am back to having two copies if I want to be able to both use FFTB to treebank and look at ambiguity? In other words, I need to preserve that enormous result file for that case.

Yes, you’re right, you’ll need to keep a separate run to store that very large result file in order to compare ambiguity measures. If that file gets too large, you might consider dividing your profile into smaller segments.
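If you ever want to do the ambiguity comparison outside of [incr tsdb()]'s own comparison facilities, the idea can be sketched in a few lines of Python. This is only an illustration: `ambiguity_changes` is a made-up helper, and it assumes you have already extracted, for each of the two non-full-forest runs, a mapping from item id to number of readings. Items missing from one run are treated as having zero readings there.

```python
def ambiguity_changes(old_counts, new_counts):
    """Map each item whose reading count differs to an (old, new) pair.

    Both arguments are dicts of item id -> number of readings
    (hypothetical inputs, extracted from the two runs beforehand).
    """
    changes = {}
    for item_id in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(item_id, 0)
        new = new_counts.get(item_id, 0)
        if old != new:
            changes[item_id] = (old, new)
    return changes


# Invented numbers, purely for illustration.
old_run = {1: 4, 2: 10, 3: 1}
new_run = {1: 4, 2: 25, 4: 2}
print(ambiguity_changes(old_run, new_run))
```

Since only the counts matter for this comparison, you could keep these per-item numbers and discard the very large result file once they are extracted, if the full set of analyses is not needed for anything else.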