Wsj23 items 1001-2416

It looks like the ERG 2020 parses wsj23 items up to 1000 (some are of course not parsed, I think 50 items out of these 999), and then it doesn’t parse the remaining 1416. That’s strange, isn’t it? Is the corpus somehow specially sorted?

I guess there are 2416 items in the test suite but only 1000 accounted for in redwoods.xls; not sure what that means?

Yes, so far I have only treebanked the first 1000 items in WSJ section 23, to be used for testing models trained on the first 22 sections (or also including the other Redwoods data). I expect to complete section 23 for the next ERG release.

I guess I am confused about the items not being parsed; I suppose the treebank format makes it look that way? When I work with the profile using pydelphin response objects, it appears as if the grammar has no results for any of those 1416 items. That is of course unlikely, as the ERG undoubtedly parses some of them. That’s what I am confused about.

Maybe it will be best to ignore wsj23 if it confuses your scripts to work with a profile that has only been partially treebanked. I only parsed the first 1000 items in that profile for treebanking, so the remaining 1416 were simply ignored, with nothing recorded for them.
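
For scripts that need to cope with a partially processed profile like this one, it may help to separate items the grammar was never run on from items it tried and failed to parse. Here is a minimal sketch in plain Python. It assumes the item and parse tables have already been loaded as rows that can be indexed by field name, with `i-id` and `readings` fields as in the usual [incr tsdb()] schema. The function name `classify_items` and the toy data are hypothetical and not part of pydelphin or the ERG distribution.

```python
def classify_items(item_rows, parse_rows):
    """Split item ids into parsed, failed, and never-attempted lists.

    Assumption: each row can be indexed by field name, e.g. row["i-id"].
    """
    readings_by_item = {}
    for row in parse_rows:
        i_id = int(row["i-id"])
        readings = int(row["readings"])
        # If an item has several parse rows, keep the highest readings count.
        previous = readings_by_item.get(i_id, readings)
        readings_by_item[i_id] = max(previous, readings)

    parsed, failed, unattempted = [], [], []
    for row in item_rows:
        i_id = int(row["i-id"])
        if i_id not in readings_by_item:
            # No parse row at all: the parser was never run on this item.
            unattempted.append(i_id)
        elif readings_by_item[i_id] > 0:
            parsed.append(i_id)
        else:
            # A parse row exists but records no readings.
            failed.append(i_id)
    return parsed, failed, unattempted


# Toy data (hypothetical): four items, only the first two were run.
items = [{"i-id": n} for n in (1, 2, 3, 4)]
parses = [{"i-id": 1, "readings": 3}, {"i-id": 2, "readings": 0}]

parsed, failed, unattempted = classify_items(items, parses)
print(parsed, failed, unattempted)
```

On wsj23, under these assumptions, items 1001-2416 should all land in the `unattempted` list, since nothing was recorded for them, rather than being counted as parse failures.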


Thank you for the explanation, @Dan !