Differences between full-forest and unpacked profiles

Context: @kphowell and I are working on treebanking profiles based on inferred grammars (from the AGGREGATION project), using fftb. These grammars are very noisy and have low coverage (roughly 10%). fftb seems set up to expect nearly complete coverage over profiles: if you click on an item that has no results, you get a 404 and then have to use the browser's back button and keep track of which item you just looked at, since it doesn’t change color to “visited”. So, we’re trying to use PyDelphin to downselect the profiles so that they only contain the examples that the grammar can parse, before treebanking.

@goodmami has provided the following steps to do so:

delphin process -g grm.dat original-profile/
delphin mkprof --full --where 'readings > 0' --source original-profile/ new-profile/
delphin process -g grm.dat --full-forest new-profile/ 

Questions: (for @sweaglesw especially)

  1. When I ran the first step, I noticed that for some sentences, ACE was giving an error saying it ran out of RAM while unpacking. I understand this to mean that there is some loss of coverage in the process of storing the old-fashioned (non-full-forest) profile. Is this correct?
  2. Is there a way to get summary statistics (e.g. coverage) out of fftb profiles?

Yes, that error means the parsing process didn’t enumerate everything theoretically licensed by the grammar. Reduced coverage is a possibility.

You can get an estimate of coverage from the full forest profiles pretty easily too, but again there are reasons it may not be 100% accurate. Resource exhaustion can occur in the chart building phase too (though again you would see a message), possibly leading to undercounting, and the packed chart often hides latent unification failures, possibly leading to overcounting. But if you judge those risks acceptable, the condition you want is that there is at least one root edge in the “edge” file for the input in question. My recollection is that by default the system will only store edges that are connected to roots, i.e. there shouldn’t be any edges at all for items that are outside of coverage. So you could estimate the number of inputs that are parseable with something like:

tsdb -home my-profile -query "select parse-id from edge" | wc -l

… but I am still on vacation driving through California, and don’t have a computer in front of me at the moment, so the above command is probably not 100% correct :-). I expect to be home early next week.
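
For anyone without the tsdb binary handy, the same estimate can be sketched in plain Python by reading the profile files directly. This is only a rough sketch under several assumptions: the profile has a standard `relations` file, table files are `@`-separated text (optionally gzipped), the `edge` table has a `parse-id` field, and there is one parse record per item. The function names (`read_schema`, `read_column`, `estimate_coverage`) are made up for illustration and are not part of PyDelphin or tsdb.

```python
import gzip
from pathlib import Path


def read_schema(profile):
    """Map each table name in the relations file to its ordered field names."""
    schema, table = {}, None
    for line in (Path(profile) / "relations").read_text().splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line[0].isspace() and line.rstrip().endswith(":"):
            table = line.strip()[:-1]  # unindented "name:" starts a table
            schema[table] = []
        elif table is not None:
            schema[table].append(line.split()[0])  # first token is the field name
    return schema


def read_column(profile, schema, table, field):
    """Yield the values of one field from a table file (plain or .gz)."""
    index = schema[table].index(field)
    path = Path(profile) / table
    if path.exists():
        handle = open(path, encoding="utf-8")
    elif path.with_name(table + ".gz").exists():
        handle = gzip.open(path.with_name(table + ".gz"), "rt", encoding="utf-8")
    else:
        return  # a missing table file is treated as empty
    with handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line.split("@")[index]


def estimate_coverage(profile):
    """Return (number of distinct parse-ids with stored edges, number of items)."""
    schema = read_schema(profile)
    with_edges = set(read_column(profile, schema, "edge", "parse-id"))
    total = sum(1 for _ in read_column(profile, schema, "item", "i-id"))
    return len(with_edges), total
```

Calling `estimate_coverage("my-profile")` gives a pair such as (items with edges, total items). Unlike a raw line count over the edge table, this counts each `parse-id` once however many edges it has. The same caveats about resource exhaustion and latent unification failures apply.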

Woodley

Thanks, Woodley!