Treebanked profiles: Derivations and "thinning"

If I have a treebanked profile (decisions file non-empty), and it has more than one item in the Derivation column, when I look at the results in [incr tsdb()], what does it mean exactly? Does it mean all the items in that column are gold, or is there only one gold tree, in principle (in our setup), and so another, “thinning” step is required?

The “thinning” step is described on the ItsdbTreebanking_ItsdbExporting page of the delph-in/docs wiki on GitHub.

I am concerned that it references “redwoods” in the command; is that command somehow designed only for the ERG?

Update: When I added (setf tsdb::*redwoods-thinning-export-p* t) or (setf tsdb::*redwoods-thinning-export-p* '(:derivation :tree :mrs)) to my .tsdbrc and then did Trees|Export, I did get one output per item, except that I didn’t get any trees. Instead, I think I am getting just the item itself:

[Screenshot, 2022-10-19]

Maybe that’s because of the outdated schema?

The boolean flag tsdb::*redwoods-thinning-export-p* should be set to T (the first of the two settings you tried) so that the export produces a thinned set of items, each of which has at most one result stored. To control whether the export includes the MRS and the parse tree in addition to the derivation, set the list-valued variable tsdb::*redwoods-export-values* to include :mrs and :tree; for example:
(setf tsdb::*redwoods-export-values* '(:derivation :tree :avm :mrs :eds))

The fact that many of the names of variables and functions in the treebanking tool mention “redwoods” is simply due to the tool’s original development being driven by the construction of the ERG’s Redwoods treebank. The machinery is in fact largely grammar-independent, and was used successfully, for example, by Montserrat Marimon in constructing the Tibidabo treebank with the SRG, as you know, so it should be able to serve your current purposes.

In principle yes, though in reality the profiles that Montse gave me are not thinned… And setting tsdb::*redwoods-thinning-export-p* to t only produces the item string, not the tree…

So setting tsdb::*redwoods-export-values* as suggested before you export the thinned profile does not also save the trees for you? It did for me using the “classic” LKB, though I have not yet tried it with LKB_FOS. You probably already know this, but it’s necessary to have the (right version of the) SRG loaded before thinning, since [incr tsdb()] needs the grammar in order to compute the tree (and MRS, if desired) from each item’s stored derivation, by rebuilding the corresponding feature structure.


Thanks, @Dan , I finally succeeded, at least with one of the profiles.

I think in some of the profiles I have outdated schemas, maybe that’s the issue. Or maybe I had forgotten to load the grammar before, or maybe I had the .tsdbrc settings in the wrong order or something.

So now I have text files, one per item, with the gold tree etc.

What I would like to have, however (to reuse my ERG supertagging code) is tsdb profiles with just one gold derivation. How did you achieve that, @Dan ? Did you write your own scripts for that? Thanks again!

Incidentally, how do I read the decision file in the profile? For example, for one sentence, I have 3 derivations (according to [incr tsdb()]):

[Screenshot, 2022-10-21]

… and in the decision file I have the following 4 lines:

580@36@4@3@V_ACC_DLR@Imagínense@0@1@28-apr-2014 15:10:46
580@36@4@3@V_PSV-SE-OR-CAUS-ALT_DLR@Imagínense@0@1@28-apr-2014 15:10:46
580@36@4@2@v_np_le@Imagínense@0@1@28-apr-2014 15:10:46
580@36@1@2@v_np*_prn_le@Imagínense@0@1@28-apr-2014 15:10:46

Does it tell me which derivation was chosen as gold?..

After “thinning”, I get this in the text file:

[580] (1 of 3) {1} `Imagínense.'

[580:0] (active)

(ROOT_S
 (640 HD_OPTSB_C 0.145836 0 2
  (639 HD_OPTCMP-V_C 1.56434 0 2
   (638 HD-PT_C 0.643647 0 2
    (636 +PP3CN000 0 0 1
     (635 VMM03P0 0 0 1
      (6 imaginar_vprn-np@v_np*_prn_le 0 0 1
       ("Imagínense" 1 "\"Imagínense\""))))
    (637 FP -0.0934759 1 2
     (8 fstop_pt@pt_-_fstop_le 0 1 2 ("." 2 "\".\"")))))))

(S (VP (V (V (V (V "Imagínense"))) (PT (PT ".")))))


 [ TOP: h1
   INDEX: e2 [ e SF: IFORCE E.TENSE: PRES E.ASPECT: ASPECT E.MOOD: IMP ]
   RELS: <
          [ "_imaginar_v_rel"
            LBL: h1
            ARG0: e2
            ARG1: u4 [ u DEF: + PNG.PN: 3PL PNG.GEN: GENDER PRONTYPE: PRONTYPE ]
            ARG2: u3 ] >
   HCONS: < > ]

I am assuming there is some kind of encoding here, indicating which line in decision corresponds to the gold tree, and also which line in decision corresponds to which derivation in the order in which they appear in the LKB, but I couldn’t find the documentation for that (I just searched the docs for “tsdb decision”). I am using the classic LKB at the moment, by the way.

I believe “decision” is where the choices made in the process of treebanking are stored, so not one specific tree, but rather “yes” or “no” on specific properties of trees. There are entailment relations between these properties, and it may be that both the actually selected properties and the entailed ones are stored. The Relations file should tell you how to interpret the fields of the decision file, and in particular what the 2nd and 3rd to last fields are (since these look like binary values).

OK, thank you, @ebender .

Then I need to go back to the question of how to read the relations file.

In the relations file, in the “decision” portion, the third-to-last and second-to-last fields are called d-start and d-end. Not sure that’s informative in this case?.. Also, sadly, the relations file only tells me the value type of each field, not the possible range of values (e.g. whether a field is binary and, if so, what each value signifies). So it is still hard for me to interpret. The names of the fields (e.g. d-start) are also cryptic to me, I’m afraid. (There is some documentation, e.g. here and here, but regretfully I can’t say that it helps me much, as it is a bit hard for me to interpret at this point. In particular, the documentation I’m finding doesn’t really explain the relationship between these files and treebanking, I don’t think.)

decision:
  parse-id :integer :key
  t-version :integer
  d-state :integer
  d-type :integer
  d-key :string
  d-value :string
  d-start :integer
  d-end :integer
  d-date :date
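
To make the schema above concrete, here is a minimal Python sketch that splits the four decision lines quoted earlier into the nine named fields. It uses only the standard library and assumes that fields are separated by "@" and that no field contains a literal "@"; the names Decision and parse_decision_line are made up for illustration and are not part of [incr tsdb()] or pydelphin.

```python
from collections import namedtuple

# Field names follow the "decision" entry of the relations file
# (hyphens replaced by underscores so they are valid Python names).
DECISION_FIELDS = ["parse_id", "t_version", "d_state", "d_type",
                   "d_key", "d_value", "d_start", "d_end", "d_date"]
# Fields declared as :integer in the relations file.
INTEGER_FIELDS = {"parse_id", "t_version", "d_state", "d_type",
                  "d_start", "d_end"}

Decision = namedtuple("Decision", DECISION_FIELDS)  # hypothetical helper type


def parse_decision_line(line):
    """Split one '@'-separated decision line into named, typed fields."""
    values = line.rstrip("\n").split("@")
    if len(values) != len(DECISION_FIELDS):
        raise ValueError(f"expected {len(DECISION_FIELDS)} fields, got {len(values)}")
    typed = [int(value) if name in INTEGER_FIELDS else value
             for name, value in zip(DECISION_FIELDS, values)]
    return Decision(*typed)


# The four lines for item 580, copied from the decision file quoted above.
SAMPLE = """580@36@4@3@V_ACC_DLR@Imagínense@0@1@28-apr-2014 15:10:46
580@36@4@3@V_PSV-SE-OR-CAUS-ALT_DLR@Imagínense@0@1@28-apr-2014 15:10:46
580@36@4@2@v_np_le@Imagínense@0@1@28-apr-2014 15:10:46
580@36@1@2@v_np*_prn_le@Imagínense@0@1@28-apr-2014 15:10:46"""

decisions = [parse_decision_line(line) for line in SAMPLE.splitlines()]
for d in decisions:
    print(d.d_state, d.d_type, d.d_key, (d.d_start, d.d_end))
```

Laid out this way, the only line with d-state 1 is the one whose d-key is v_np*_prn_le, which is also the lexical type that appears in the thinned derivation for this item. That is suggestive, but one item is not enough to establish what the d-state and d-type values mean.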

To wrap up, my task at hand here remains obtaining SRG treebank tsdb profiles similar to Redwoods, i.e. so that I have profiles (loadable with pydelphin modules) which have only one gold tree stored in the “derivation” field. I am sure that is doable, probably by understanding the format of the tsdb files, but so far I am not finding the relevant documentation.
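
One possible route to that goal is sketched below in plain Python. It rests on an assumption that is not confirmed anywhere in this thread: that the profile also contains a preference relation with the fields parse-id, t-version and result-id, listing the result(s) marked as gold for each treebanked item. Please check your own relations file before relying on it. The function names and all sample rows are hypothetical, and writing the filtered rows back into a profile is not shown.

```python
def latest_gold_ids(preference_rows):
    """Map each parse-id to the result-ids listed under its highest t-version.

    preference_rows: iterable of (parse_id, t_version, result_id) tuples,
    ASSUMED to mirror a 'preference' relation in the profile.
    """
    latest = {}
    for parse_id, t_version, result_id in preference_rows:
        version, ids = latest.get(parse_id, (-1, set()))
        if t_version > version:
            # A newer treebanking version replaces older preferences.
            latest[parse_id] = (t_version, {result_id})
        elif t_version == version:
            ids.add(result_id)
    return {parse_id: ids for parse_id, (_, ids) in latest.items()}


def thin_results(result_rows, preference_rows):
    """Keep only the result rows whose result-id is marked as preferred."""
    gold = latest_gold_ids(preference_rows)
    return [row for row in result_rows
            if row["result-id"] in gold.get(row["parse-id"], set())]


# Hypothetical sample data: three readings for item 580, abbreviated.
results = [
    {"parse-id": 580, "result-id": 0, "derivation": "(ROOT_S ... reading 0 ...)"},
    {"parse-id": 580, "result-id": 1, "derivation": "(ROOT_S ... reading 1 ...)"},
    {"parse-id": 580, "result-id": 2, "derivation": "(ROOT_S ... reading 2 ...)"},
]
# Hypothetical preference rows: an older version chose reading 2,
# the latest version (36) chose reading 0.
preferences = [(580, 35, 2), (580, 36, 0)]

thinned = thin_results(results, preferences)
print([row["result-id"] for row in thinned])
```

Taking only the highest t-version per item is a design choice made for this sketch, on the guess that older versions record superseded annotation rounds.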

I don’t think you have to fully understand the database schema to achieve your goal, but it might help to read up on the Redwoods approach more generally. Oepen et al 2004 is probably a good place to start:

Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2004). LinGO Redwoods. Research on Language and Computation, 2(4), 575-596.

I’m guessing that d-state is what records the decision on this particular discriminant. d-start and d-end are probably positions in the string (as measured in tokens, not characters).
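
The token-position guess can be checked against the example item with a few lines of Python. This is only a sketch: the token list is copied by hand from the leaves of the thinned derivation quoted earlier (where "Imagínense" spans 0 to 1 and "." spans 1 to 2), and tokens_in_span is a made-up helper name.

```python
# Tokens with their (start, end) positions, copied by hand from the
# leaves of the thinned derivation for item 580.
tokens = [("Imagínense", 0, 1), (".", 1, 2)]


def tokens_in_span(tokens, start, end):
    """Return the token forms that lie entirely inside [start, end]."""
    return [form for form, t_start, t_end in tokens
            if start <= t_start and t_end <= end]


# d-value, d-start and d-end as they appear in all four decision lines.
d_value, d_start, d_end = "Imagínense", 0, 1

covered = tokens_in_span(tokens, d_start, d_end)
print(covered)  # ['Imagínense']
```

For this item the span 0 to 1 covers exactly the token given in d-value, which is consistent with token-based positions. A single one-word example cannot rule out other interpretations, though.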
