"Ignored" treebanks in the ERG

What are the “red” (“ignored”) treebanks in the ERG releases, and why are they called “ignored”? Does it mean they should not be used for training, development, or testing? If so, why?

[Screenshot from redwoods.xls]

Sorry for the long delay in replying, @olzama! The profiles marked as “ignored” have special characteristics that make them unsuitable for training or testing of statistical models for parsing standard English. Namely:
– wlb03 and wnb03 consist of user-generated content gathered from Linux and NLP blogs as part of Oslo’s WeSearch project, where many of the items are not well-formed sentences (e.g. “sudo apt-get install darkstat”). Though these profiles are parsed using the standard ERG, the data is noisy enough that it seems best to ignore it for training or testing.
– ntucle is a corpus of second-language learner English gathered by NTU in Singapore, intended to be parsed by a variant of the ERG that includes so-called mal-rules (to parse certain ungrammatical structures). This variant is compiled with the erg/ace/config-mal.tdl file.
– omw is a corpus of dictionary definitions from the Open Multilingual Wordnet, also gathered by NTU, and intended to be parsed by a different variant of the ERG which includes a small number of additional rules for constructions typical of definitions, such as the dropping of an ordinarily obligatory direct object (e.g. “to devour”). This variant is compiled with ace/config/erg-dict.tdl.
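
In practice, this just means skipping those four profiles when assembling training or test data. A minimal sketch in Python, assuming you already have a list of profile names from your treebank directory (the function name and the example listing are made up for illustration; only the four ignored names come from the list above):

```python
# Profiles described above as unsuitable for training or testing
# statistical models of standard English.
IGNORED_PROFILES = {"wlb03", "wnb03", "ntucle", "omw"}


def usable_profiles(profile_names):
    """Return the profiles that are not marked as ignored, preserving order."""
    return [name for name in profile_names if name not in IGNORED_PROFILES]


# Hypothetical directory listing; "cb" and "wsj00a" are placeholder names
# standing in for ordinary (non-ignored) profiles.
example = ["cb", "wlb03", "wsj00a", "omw", "ntucle", "wnb03"]
kept = usable_profiles(example)
print(kept)
```

A hard-coded set is enough here because the list of ignored profiles is short; if it changes between releases, it would be safer to read the “ignored” marking from redwoods.xls itself.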

Thank you, Dan!