Distinguishing between items for which there is no gold tree and items for which there is no parse

Given an ERG-release treebank, is there a robust way to distinguish between items for which no gold tree was recorded and items for which the grammar had no parse, using pydelphin tools such as the processed_items() method?

When iterating over the items, I can see that some have results and some don’t, and the ones without results occasionally have an error message recorded, such as “ran out of RAM”. But many have no error message, and I am not sure I can distinguish between the ones that the grammar couldn’t parse and the ones that @Dan simply hasn’t looked at yet.

Is it possible given the existing tools, without redoing/reparsing anything?

Related to this: Wsj23 items 1000-2416

There are a few fields in a profile to look for, but some depend on how the profile was processed. You can look at readings in the parse relation to see how many parses were found; the number of corresponding results is how many of those are stored in the profile. If the profile was processed with [incr tsdb()], then you can look at the error field in the parse relation for the reason parsing stopped, but an error doesn’t necessarily mean no results were returned (this may also depend on which engine, e.g., ACE or the LKB, was used).

>>> from delphin import itsdb
>>> ts = itsdb.TestSuite('erg-trunk/tsdb/gold/cb')
>>> for response in ts.processed_items():
...     if response['error']:
...         print(response['i-id'], response['readings'], response['error'])
... 
1140 0 ran out of RAM while building forest
1230 0 ran out of RAM while building forest
1270 0 ran out of RAM while building forest
3470 0 ran out of RAM while building forest
4890 -1 PVM message buffer overflow
6320 0 ran out of RAM while building forest
6450 0 ran out of RAM while building forest
6470 0 ran out of RAM while building forest
6790 0 ran out of RAM while building forest
7280 0 ran out of RAM while building forest
7860 0 ran out of RAM while building forest
8000 0 ran out of RAM while building forest
8530 0 ran out of RAM while building forest
8560 0 ran out of RAM while building forest

A profile processed with ACE and PyDelphin will not record the error, because it is output on ACE’s stderr and it is not always feasible to match that to an input item during batch processing. I’m not sure if art pairs up error messages appropriately, either. You could test this by giving ACE a very small RAM limit to induce a memory error.

If you want to find out whether a result was selected as the gold result, I’m not entirely sure how. If the number of results is smaller than the number of readings, that might be a sign that the profile has been “thinned” to remove results not selected during treebanking, or perhaps that ACE ran out of memory while unpacking the results rather than during the parse search (not sure on this one; maybe Woodley can confirm). Otherwise I’m not sure how easy it is to refer to the decision or preference relations (whichever encodes treebanking decisions) to determine the gold result. In any case, these relations will be empty if the profile was processed and not treebanked.
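
The readings-versus-results comparison can be sketched roughly as follows. This is only an illustration, not PyDelphin's API: the helper name `compare_readings_to_results` and the status labels are made up here, and the rows are assumed to be plain dicts keyed by field name (`parse-id`, `readings`), however they were loaded from the `parse` and `result` relations.

```python
from collections import Counter


def compare_readings_to_results(parse_rows, result_rows):
    """Label each parse record by comparing `readings` to stored results.

    Both arguments are iterables of dict-like rows (an assumption of this
    sketch); the labels are hypothetical, not [incr tsdb()] terminology.
    """
    # Count how many result records are stored for each parse-id.
    stored = Counter(row['parse-id'] for row in result_rows)
    report = {}
    for row in parse_rows:
        parse_id = row['parse-id']
        readings = int(row['readings'])
        n_results = stored.get(parse_id, 0)
        if readings < 0:
            status = 'error'          # e.g. the -1 recorded with a PVM error
        elif readings == 0:
            status = 'no-readings'    # unparsed, failed, or never processed
        elif n_results < readings:
            status = 'fewer-results'  # possibly thinned or truncated
        else:
            status = 'all-stored'
        report[parse_id] = (readings, n_results, status)
    return report


if __name__ == '__main__':
    # Toy rows, invented for illustration only.
    parses = [
        {'parse-id': 1, 'readings': 3},
        {'parse-id': 2, 'readings': 0},
        {'parse-id': 3, 'readings': -1},
        {'parse-id': 4, 'readings': 2},
    ]
    results = [
        {'parse-id': 1, 'result-id': 0},
        {'parse-id': 4, 'result-id': 0},
        {'parse-id': 4, 'result-id': 1},
    ]
    report = compare_readings_to_results(parses, results)
    for parse_id in sorted(report):
        print(parse_id, *report[parse_id])
```

A `fewer-results` label alone cannot tell thinning apart from a memory error during unpacking, and `no-readings` cannot tell an unparsable item from one that was never processed; the labels only narrow down which items to inspect.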

Thanks, @goodmami !

I think I may have described the issue in a somewhat confusing way; let me try to rephrase it with a concrete example.

I am working with the latest ERG release, with the treebanks. There are a bunch of them; the grammar has some coverage over each, and some treebanked coverage. I want to use them (specifically the gold trees) for training a supertagger.

While looking at them, I noticed that one of them (specifically wsj23; see the link in the first post) only has stored results for the first 1000 sentences (or rather, for some of the first 1000 sentences; whichever ones are covered by the grammar, I assumed). I found out from @Dan that this simply meant he hasn’t treebanked the remaining sentences yet. So the reported coverage in redwoods.xls is somewhere in the 90% range, but if I load the profile into, say, [incr tsdb()], it appears to be only around 32%, which can be confusing. Anyway, maybe that’s fine, but I am trying to understand whether there is a robust way of telling, given a treebank and without any reparsing, whether a sentence was simply skipped/not treebanked yet or is definitely not covered by the grammar. This is mainly because I want to be sure that I know what I’ve got, have the means of checking things for consistency in my own code, etc. Since looking at wsj23 was confusing, I wanted a way of making sure I can spot any other places like that.

All the potentially confusing items will have no error message associated with them (because they have been processed by ACE), unless ACE ran out of RAM.

I was hopeful about the readings field that you mentioned but it appears all the items in wsj23 starting from item #1001 have 0 readings (even though many of them definitely will be covered by the ERG).

Hi,

I think the answer is something like, if t-active is positive, then the tree will be the result-id in the preferences file (which matches with the results file).

So I think just

In the past Stephan has said:

Thanks @bond for the note about t-active.

Except on this forum, I haven’t known Stephan to be of so few words :wink: Maybe there was some more you intended to convey here?

I think maybe Discourse cut out part of Francis’s message. I asked him in an email and he replied:

t-active is an item-level property, i.e. the above condition will
select a set of items (‘parse’ attempts, to be precise, but these
nearly stand in one-to-one correspondence with items). for each item
where there is at least one active (aka ‘gold’) tree, the next order
of business is to determine which of the ‘result’ tuples (i.e.
separate readings) were accepted by the annotator: the ‘result-id’s
in question are recorded in the ‘preference’ relation, and only the
corresponding ‘result’ records represent good (aka preferred)
readings. i suspect it might well be possible to express the above
nesting of queries and join conditions in full SQL, but not in the
more restricted TSQL sub-language, i am afraid.
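
For what it is worth, the nesting described above can be sketched in full SQL. The example below is an illustration only: it assumes the `tree`, `preference`, and `result` relations have been copied into SQLite tables of the same names (the loading step is not shown, the toy rows are invented, and the helper names are hypothetical), and it uses only fields mentioned in this thread.

```python
import sqlite3

# Gold readings: result records whose result-id is listed in `preference`
# for the latest t-version of a parse whose `tree` record has t-active > 0.
GOLD_QUERY = '''
SELECT r."parse-id", r."result-id"
FROM tree AS t
JOIN preference AS p
  ON p."parse-id" = t."parse-id" AND p."t-version" = t."t-version"
JOIN result AS r
  ON r."parse-id" = p."parse-id" AND r."result-id" = p."result-id"
WHERE t."t-active" > 0
  AND t."t-version" = (SELECT MAX(t2."t-version") FROM tree AS t2
                       WHERE t2."parse-id" = t."parse-id")
ORDER BY r."parse-id", r."result-id"
'''


def make_toy_db():
    """Build an in-memory database with invented rows (illustration only)."""
    conn = sqlite3.connect(':memory:')
    conn.executescript('''
        CREATE TABLE tree
            ("parse-id" INTEGER, "t-version" INTEGER, "t-active" INTEGER);
        CREATE TABLE preference
            ("parse-id" INTEGER, "t-version" INTEGER, "result-id" INTEGER);
        CREATE TABLE result ("parse-id" INTEGER, "result-id" INTEGER);
    ''')
    # Parse 1 was annotated twice; parse 2 has no active tree.
    conn.executemany('INSERT INTO tree VALUES (?, ?, ?)',
                     [(1, 0, 2), (1, 1, 1), (2, 0, 0)])
    conn.executemany('INSERT INTO preference VALUES (?, ?, ?)',
                     [(1, 0, 0), (1, 0, 2), (1, 1, 2)])
    conn.executemany('INSERT INTO result VALUES (?, ?)',
                     [(1, 0), (1, 1), (1, 2), (2, 0)])
    return conn


def gold_results(conn):
    """Return (parse-id, result-id) pairs for preferred readings."""
    return conn.execute(GOLD_QUERY).fetchall()


if __name__ == '__main__':
    print(gold_results(make_toy_db()))
```

In the toy data only result 2 of parse 1 comes back, because the latest annotation of parse 1 prefers it and parse 2 has no active tree.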

Or the definitive answer is in the code:
(defun export-trees (data &key (condition *statistics-select-condition*)
                               path prefix interrupt meter
                               (compressor "gzip -c -9") (suffix "gz")
                               (stream *tsdb-io*))

(loop
  with offset = (cond
                 ((search "vm6" data) 60000)
                 ((search "vm13" data) 130000)
                 ((search "vm31" data) 310000)
                 ((search "vm32" data) 320000)
                 ((search "ecoc" data) 1000000)
                 ((search "ecos" data) 2000000)
                 ((search "ecpa" data) 3000000)
                 ((search "ecpr" data) 4000000)
                 (t 0))
  with target = (format
                 nil
                 "~a/~a"
                 (or path "/lingo/oe/tmp") (directory2file data))
  with lkb::*chart-packing-p* = nil
  with *reconstruct-cache* = (make-hash-table :test #'eql)
  with items = (analyze
                data :thorough '(:derivation :mrs) :condition condition)
  with increment = (when (and meter items)
                     (/ (- (get-field :end meter) (get-field :start meter))
                        (length items) 1))
  with gc-strategy = (install-gc-strategy
                      nil :tenure *tsdb-tenure-p* :burst t :verbose t)

  initially
    #+:allegro (ignore-errors (mkdir target))
    (when meter (meter :value (get-field :start meter)))
  for item in items
  for i-wf = (get-field :i-wf item)
  for input = (or (get-field :o-input item) (get-field :i-input item))
  for i-comment = (get-field :i-comment item)
  for parse-id = (get-field :parse-id item)
  for results = (let ((results (get-field :results item)))
                  (sort (copy-list results) #'<
                        :key #'(lambda (foo) (get-field :result-id foo))))
  for trees = (select '("t-active" "t-version") '(:integer :integer)
                      "tree"
                      (format nil "parse-id == ~a" parse-id)
                      data)
  for version = (when trees
                  (loop
                      for tree in trees
                      maximize (get-field :t-version tree)))
  for active = (when version
                 (let ((foo (select '("result-id") '(:integer)
                                    "preference"
                                    (format
                                     nil
                                     "parse-id == ~a && t-version == ~d"
                                     parse-id version)
                                    data)))
                   (loop
                       for bar in foo
                       collect (get-field :result-id bar))))
  for file = (format
              nil
              "~a/~@[~a.~]~d~@[.~a~]"
              target prefix (+ parse-id offset) suffix)
  when results do
    (format
     stream
     "[~a] export-trees(): [~a] ~a active tree~:[~;s~] (of ~d).~%"
     (current-time :long :short)
     (+ parse-id offset)
     (if version (length active) "all")
     (or (null version) (> (length active) 1))
     (length results))
    (clrhash *reconstruct-cache*)

    #+:allegro
    (multiple-value-bind (stream foo pid)
        (run-process
         compressor :wait nil :input :stream
         :output file :if-output-exists :supersede
         :error-output nil)
      (declare (ignore foo #-:allegro pid))

      (format
       stream
       ";;;~%;;; Redwoods export of `~a';~%;;; (~a@~a; ~a).~%;;;~%~%"
       data (current-user) (current-host) (current-time :long :pretty))
      (format
       stream
       "[~d] (~a of ~d) {~d} `~a' (~a)~%~a~%"
       (+ parse-id offset)
       (if version (length active) "all") (length results) i-wf
       input i-comment
       #\page)

      (export-tree item active :offset offset :stream stream)
      (unless *redwoods-thinning-export-p*
        (export-tree item active
                     :complementp t :offset offset :stream stream))

      (force-output stream)
      (close stream)
      (sys:os-wait nil pid))

    (when increment (meter-advance increment))
  when (interrupt-p interrupt) do
    (format
     stream
     "[~a] export-trees(): external interrupt signal~%"
     (current-time :long :short))
    (force-output stream)
    (return)
  finally
    (when meter (meter :value (get-field :end meter)))
    (when gc-strategy (restore-gc-strategy gc-strategy))))
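
For this question, the relevant part of `export-trees` is how `version` and `active` are computed. A rough Python paraphrase of those two bindings follows, assuming the `tree` and `preference` rows are available as dicts keyed by field name; the function name `active_result_ids` is made up for this sketch.

```python
def active_result_ids(parse_id, tree_rows, preference_rows):
    """Paraphrase of the `version` and `active` bindings in export-trees.

    Returns None when the parse has no `tree` record at all (the Lisp code
    then reports "all" results); otherwise returns the preferred result-ids
    for the highest t-version, which may be an empty list.
    """
    versions = [int(row['t-version']) for row in tree_rows
                if int(row['parse-id']) == parse_id]
    if not versions:
        return None
    version = max(versions)
    return [int(row['result-id']) for row in preference_rows
            if int(row['parse-id']) == parse_id
            and int(row['t-version']) == version]


if __name__ == '__main__':
    # Toy rows, invented for illustration only.
    trees = [
        {'parse-id': 1, 't-version': 0, 't-active': 2},
        {'parse-id': 1, 't-version': 1, 't-active': 1},
        {'parse-id': 2, 't-version': 0, 't-active': 0},
    ]
    prefs = [
        {'parse-id': 1, 't-version': 0, 'result-id': 0},
        {'parse-id': 1, 't-version': 1, 'result-id': 2},
    ]
    for parse_id in (1, 2, 3):
        print(parse_id, active_result_ids(parse_id, trees, prefs))
```

Under this reading, `None` would suggest that no treebanking record exists for the item, while an empty list would suggest it was annotated with no reading accepted. Whether that holds for every profile is an assumption worth checking against a known case such as wsj23.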