Thanks, @goodmami !
I think I may have described the issue in a somewhat confusing way, so let me try to rephrase it with a concrete example.
I am working with the latest ERG release, including the treebanks. There are a bunch of them; for each one, the grammar has some raw coverage and some treebanked coverage. I want to use them (specifically the gold trees) for training a supertagger.
While looking at them, I noticed that one of them (wsj23; see the link in the first post) only has stored results for the first 1000 sentences (or rather, for some of the first 1000 sentences: whichever ones are covered by the grammar, I assumed). I found out from @Dan that this simply means he hasn’t treebanked the remaining sentences yet. So the reported coverage in redwoods.xls is somewhere in the 90% range, but if I load the profile into, say, [incr tsdb()], it appears to be only around 32%, which can be confusing. Anyway, maybe that’s fine, but I am trying to understand whether there is a robust way of telling, given a treebank and without any reparsing, whether a sentence was simply skipped (not treebanked yet) or is definitely not covered by the grammar. This is mainly because I want to be sure that I know what I’ve got, have the means of checking things for consistency in my own code, etc. Since looking at wsj23 was confusing, I wanted a way of making sure I can spot any other places like that.
All the potentially confusing items will have no error message associated with them (because they have been processed by ACE), unless ACE ran out of RAM.
I was hopeful about the readings field that you mentioned, but it appears that all the items in wsj23 starting from item #1001 have 0 readings (even though many of them will definitely be covered by the ERG).
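To make the question concrete, here is a rough sketch of the kind of sanity check I could run in the meantime. It is only a heuristic, not a robust answer: it flags a profile whose items end in a long unbroken run with 0 readings and no error message, which is what wsj23 looks like from item #1001 on. I am assuming the rows of the parse relation have already been loaded as plain dicts with `i-id`, `readings`, and `error` keys; the function name `suspicious_tail` and the `min_run` threshold are made up for illustration.

```python
def suspicious_tail(parse_rows, min_run=50):
    """Return the i-id at which a trailing run of zero-reading, error-free
    items begins, or None if the profile does not end in such a run.

    parse_rows: iterable of dicts with 'i-id', 'readings' and 'error' keys
                (assumed layout, loaded from the parse relation beforehand).
    min_run:    hypothetical threshold; shorter runs are treated as ordinary
                coverage gaps rather than untreebanked material.
    """
    rows = sorted(parse_rows, key=lambda r: r["i-id"])
    start = None  # i-id where the current run of empty items began
    run = 0       # length of the current run
    for row in rows:
        empty = row["readings"] == 0 and not row.get("error")
        if empty:
            if run == 0:
                start = row["i-id"]
            run += 1
        else:
            # Any item with readings (or an error) breaks the run.
            run = 0
            start = None
    return start if run >= min_run else None


# Toy data shaped like wsj23: partial coverage up to item 1000,
# then nothing but empty, error-free items.
rows = [{"i-id": i, "readings": 1 if i % 3 else 0, "error": ""}
        for i in range(1, 1001)]
rows += [{"i-id": i, "readings": 0, "error": ""} for i in range(1001, 1101)]
print(suspicious_tail(rows))
```

This cannot distinguish "not treebanked yet" from "a long stretch the grammar really does not cover", which is exactly the distinction I am asking about; it only points at places worth a closer look.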