Chasing down glitches related to non-local features in the grammar

@Dan, you suggested yesterday in the grammar engineering meeting that the SRG’s miserable performance (parsing speed, in particular due to unpacking) may be connected to bugs involving non-local features (e.g. SLASH).

How should I start chasing those down? You suggested I look at normal sentences which should not have any fillers or gaps in them, right? And then… notice if I do in fact get some gaps in them… without fillers? Is that what I am looking for? Or something else?

(If someone other than @Dan knows what I am looking for, don’t be shy :). )


I think he was describing using ACE in verbose mode to get lots of info dumped out and then grepping through that for SLASH (or the names of the rules?).
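Something like this, perhaps (just a sketch; the sentence and file path are stand-ins):

# Dump ACE's verbose output for one sentence, then search it for SLASH
# or for the names of the extraction rules.
echo "una oración de prueba" | ace -g ace/srg.dat -v > /tmp/verbose.txt 2>&1
grep -i 'slash' /tmp/verbose.txt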

  1. Create an [incr tsdb()] profile with short sentences
  2. Process them with packing off (can be with ACE, doesn’t have to be the LKB); see the sketch after this list for one way to do steps 1–2
  3. I would try with the LKB: look at the parse relation in an [incr tsdb()] profile for a short sentence with a high edge count
  4. Process that sentence individually with the LKB
  5. View the parse chart…
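A minimal sketch of steps 1–2, assuming PyDELPHIN’s mkprof and the art tool are available (the file names here are stand-ins):

# Build an [incr tsdb()] profile from a file of short sentences.
delphin mkprof --input short-sentences.txt --relations path/to/relations myprofile
# Process the profile with ACE via art.
art -a 'ace -g ace/srg.dat' myprofile
# (How to disable packing depends on your setup; check the ACE and art
# documentation.)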

Thanks a lot, @ebender! And do you know what it is that I am looking for specifically with respect to SLASH? Non-empty SLASH lists?

Yes, that would be the likely culprit. Basically lots of edges that don’t contribute to full parses because there isn’t a filler-gap construction in the sentence, but that are built anyway because there might be.


Hmm. Of course, with the many morphological rules for verbs, extraction can indeed happen from any of those possible forms, so in many cases the application of extraction rules is to be expected?…

Does the following parse chart look potentially informative? I see a bunch of extraction rules but I don’t know whether that is suspicious or not, given we have lots of V-lexical rules:

The sentence is just one verb meaning “to imagine something to oneself”. This is the entire chart.

I find it’s often helpful to grab an edge (say 103) and look at the associated (partial) parse tree. Do you see an interaction between rules that can be ruled out?

When @Dan talks about a grammar “leaking” in this context, I think he means that there’s lots of extra processing that ultimately goes nowhere, and it’s nice to find analyses that have fewer leaks.


Looking through the parse charts is a possible method, but a time-intensive one. My suggestion was simpler: Choose a few longish sentences that should not have any filler-gap constructions in their derivations, and parse them exhaustively, storing the resulting derivation trees in a file. Then see if the SLASH feature shows up non-empty anywhere in that file (it should not, if the grammar is airtight). The idea is that if the grammar somewhere fails to constrain the SLASH feature, non-empty values should eventually sneak into full derivations where they should not be. So for example you might see a non-empty SLASH in a modifier phrase where its parent phrase has an empty SLASH.
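In command form, that might look something like the following (a sketch; sentences.txt stands in for your own file of filler-gap-free test sentences, and the grammar path matches the one used elsewhere in this thread):

# Parse exhaustively (ACE's default), saving the full output
# (derivation trees included) for later inspection.
ace -g ace/srg.dat < sentences.txt > /tmp/derivations.txt 2>&1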


What does it mean to parse “exhaustively”? Which ACE flags should I be using? I have just realized I don’t actually know how to get the trees; I am not seeing them in any of the tsdb files, only the MRSs.

With ACE (and the LKB) you have the option of asking the parser to produce just one parse or all parses for a sentence. Parsing “exhaustively” means asking the parser to produce all of the parses it can find for a sentence, “exhausting” or using all of its resources in the search for parses. With ACE you get this exhaustive parsing behavior as the default with no options added; adding the “-1” option would restrict the parser to only produce one parse.
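For example (a sketch, using the same grammar file as the example below):

echo "pescados comen siempre" | ace -g srg.dat        # all parses (the default)
echo "pescados comen siempre" | ace -g srg.dat -1     # at most one parse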
As for recording derivation trees with ACE, you can add the “-v” option (for “verbose”) when you invoke the parser and then pipe the output to a file which you can inspect later. For example:
echo "pescados comen siempre" | ace -g srg.dat -v > /tmp/output.txt
will cause ACE to parse the sentence (meaning either “fish always eat” or “they always eat fish”) using the Spanish grammar, and record the set of resulting derivation trees (along with MRSs and other data) in the file “output.txt” in the /tmp directory.

That’s why I am confused; that’s what I thought, too, but it looks like I am only getting the MRS if I simply run ACE with -v:

(base) olga@condorito:~/delphin/SRG/grammar/srg$ ace -g ace/srg.dat -y --yy-rules -v
NOTE: loading frozen grammar SRG (1008)
NOTE: semantic index hash contains 47027 entries in 65536 slots
NOTE: max-ent model hash contains 77162 entries in 262144 slots
NOTE: 7057 types, 54504 lexemes, 730 rules, 439 orules, 36 instances, 91537 strings, 169 features
permanent RAM: 0k

(1, 0, 1, <0:10>, 1, "imaginar" "imagínense", 0, "vmm03p0" "+pp3cn00", "vmm03p0" "+pp3cn00" 1.00000000) (2, 1, 2, <10:11>, 1, "." ".", 0, "fp", "fp" 1.00000000)
 lexical lookup found lexeme 'imaginar_v-np'
 lexical lookup found lexeme 'imaginar_vprn-cp_p_ind'
 lexical lookup found lexeme 'imaginar_vprn-np'
 lexical lookup found lexeme 'imaginar_v-cp_p_ind'
 lexical lookup found lexeme 'fstop_c'
 lexical lookup found lexeme 'fstop_pt'
SENT: (yy mode)
[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ "_imaginar_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1: i3 ARG2: x4 ] > HCONS: < h0 qeq h1 > ] ;  (923 hd_optsb_c -0.375513 0 2 (922 hd-pt_c 0.000000 0 2 (920 +PP3CN00 0.000000 0 1 (919 vmm03p0 0.000000 0 1 (918 v_acc_dlr 0.000000 0 1 (3 imaginar_v-np 0.000000 0 1 ("imaginar" 1 "token [ +FORM \"imaginar\" +FROM \"0\" +TO \"10\" +ID diff-list [ LIST cons [ FIRST \"1\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"vmm03p0\" REST cons [ FIRST \"+pp3cn00\" REST null ] ] +PRBS cons [ FIRST \"0.000000\" REST cons [ FIRST \"1.000000\" REST null ] ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]"))))) (921 fp 0.000000 1 2 (8 fstop_pt 0.000000 1 2 ("." 2 "token [ +FORM \".\" +FROM \"10\" +TO \"11\" +ID diff-list [ LIST cons [ FIRST \"2\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"fp\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]")))))
[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ "_imaginar_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1: i3 ARG2: x4 ] > HCONS: < h0 qeq h1 > ] ;  (928 hd_optsb_c -0.375513 0 2 (927 hd-pt_c 0.000000 0 2 (926 +PP3CN00 0.000000 0 1 (925 vmm03p0 0.000000 0 1 (924 v_psv-se-or-caus-alt_dlr 0.000000 0 1 (3 imaginar_v-np 0.000000 0 1 ("imaginar" 1 "token [ +FORM \"imaginar\" +FROM \"0\" +TO \"10\" +ID diff-list [ LIST cons [ FIRST \"1\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"vmm03p0\" REST cons [ FIRST \"+pp3cn00\" REST null ] ] +PRBS cons [ FIRST \"0.000000\" REST cons [ FIRST \"1.000000\" REST null ] ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]"))))) (921 fp 0.000000 1 2 (8 fstop_pt 0.000000 1 2 ("." 2 "token [ +FORM \".\" +FROM \"10\" +TO \"11\" +ID diff-list [ LIST cons [ FIRST \"2\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"fp\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]")))))
[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: untensed MOOD: indicative ] RELS: < [ "_imaginar_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1: i3 ARG2: u4 ] > HCONS: < h0 qeq h1 > ] ;  (934 hd_optsb_c -1.542537 0 2 (933 hd_optcmp-v_c -0.060367 0 2 (932 hd-pt_c 0.000000 0 2 (930 +PP3CN00 0.000000 0 1 (929 vmm03p0 0.000000 0 1 (5 imaginar_vprn-np 0.000000 0 1 ("imaginar" 1 "token [ +FORM \"imaginar\" +FROM \"0\" +TO \"10\" +ID diff-list [ LIST cons [ FIRST \"1\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"vmm03p0\" REST cons [ FIRST \"+pp3cn00\" REST null ] ] +PRBS cons [ FIRST \"0.000000\" REST cons [ FIRST \"1.000000\" REST null ] ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]")))) (931 fp 0.000000 1 2 (8 fstop_pt 0.000000 1 2 ("." 2 "token [ +FORM \".\" +FROM \"10\" +TO \"11\" +ID diff-list [ LIST cons [ FIRST \"2\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"fp\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]"))))))
3 hyps / 3 reconstructed / 3 readings
NOTE: 3 readings, added 375 / 133 edges to chart (101 fully instantiated, 12 actives used, 57 passives used)	RAM: 1624k

But among the ACE options, I am only finding the option -T, which suppresses the trees and outputs only the MRS. So it is as if -T were somehow my default?… @sweaglesw, do you know what might be going on?

In your output, look for the semicolon character “;” which separates the MRS from the derivation tree. Your output following that semicolon is the derivation tree, beginning with

(934 hd_optsb_c -1.542537 0 2 (933 hd_optcmp-v_c ...

Aha, thanks! But… does that look like the complete output that would contain SLASHes? I mainly see structures related to tokens there.

You’re right that the derivation tree won’t show the feature structures that would contain occurrences of the SLASH feature. But the grammar will have only a small number of rules that introduce a non-empty SLASH, so you can check for each of those in the output derivation trees using a simple script.
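For example, something along these lines (a sketch; the rule names are made-up stand-ins for whichever SRG rules actually introduce a non-empty SLASH):

# Count how often each extraction rule appears in the stored
# derivation trees; any non-zero count in these sentences is suspect.
for rule in hd_xsb_c hd_xcmp_c hd_xmod_c; do
  printf '%s\t' "$rule"
  grep -o "$rule" /tmp/output.txt | wc -l
done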

Thanks, @Dan !

Still, am I right in thinking that, if we have many possible V-lexical rules, that would very much increase the number of extraction-related and argument-drop-related edges in the chart? (Even if they don’t result in a parse.)

The chart below for Imagínense (“imagine to yourselves”) looks unpleasant, but perhaps this is exactly what should happen, given the number of possible V-rules?

It looks like the grammar should be more tightly constrained so that you don’t have a lexical entry undergo the extracted-complement rule when the next token is a punctuation mark. You’ll see in your chart that the bottom left cell has some occurrences of the xcmp rule, but it would be good to find a way of preventing these. The ERG does this using some token mapping rules that notice the presence of particular punctuation marks such as the period, and stamp a feature on the verb’s edge that prevents that edge from undergoing any other syntactic rules until the punctuation token combines with the verb. But this depends on the assumption that punctuation is attached low. If the SRG attaches punctuation high in the tree, you’ll need some other method to block those unwanted extraction edges, but it will be worthwhile.
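One way to check whether such a change is paying off (a sketch; adapt the invocation and input preprocessing to however you normally run the SRG): parse the same punctuation-final sentence before and after the change, and compare the edge counts in ACE’s final NOTE line.

echo "imagínense." | ace -g ace/srg.dat -v 2>&1 | grep 'edges to chart'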
