Analyzing information structure diffs

Here are the kinds of diffs that I need to analyze for some of the current information structure regression test failures that we have in the trunk (you can open the image in a new tab to see it larger).

Any advice on how to make sense of this? Some of this is just reordering of where the quantifier is; I believe such things never lead to a regression test failure. But it seems to me that all the other differences are also due to reordering: the structures themselves are just in a different order now?

Can anyone spot anything else? What could cause the failure here?

Except for the first MRS, the others have extra (duplicated) ICONS constraints. There may be other differences, but these are what stood out to me.

Also you can try delphin.edm (or the other implementations) if you need a second opinion; it doesn’t use the isomorphism test.

$ delphin edm gold.txt cur.txt --format simplemrs
Precision:	1.0
   Recall:	0.9830508474576272
  F-score:	0.9914529914529915

But I think if you reorder the MRSs, there will be no differences wrt the duplicated focus information; the gold has it too.

What exactly does it tell me? Could you elaborate? For instance, in the example you gave, what do precision and recall mean?

Oh I see, these are all for the same item. In that case the EDM test tells you little since it pairs off one MRS at a time.

When you give it paired-off MRSs, it aligns them by CFROM/CTO values and tells you how similar the graphs are in terms of matching predicates, roles, etc.
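
To make the three numbers above a bit more concrete, here is a minimal sketch of how they relate, assuming the usual definitions: precision is the share of test-side elements that are also in the gold, recall is the share of gold-side elements that are also in the test, and F-score is their harmonic mean. The function name `f_score` and the counts 58 and 59 are my own illustration; they are just small integers that reproduce the recall, not something the tool reported.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the EDM output above.
precision = 1.0
recall = 0.9830508474576272

# Hypothetical counts, chosen only because they reproduce the recall:
# 58 matched elements, 59 elements in the gold, 58 in the test.
matched, in_gold, in_test = 58, 59, 58

print("F-score from P and R:", f_score(precision, recall))
print("precision from counts:", matched / in_test)
print("recall from counts:   ", matched / in_gold)
```

Read this way, a precision of 1.0 with a recall just under 1.0 would mean that everything on the test side was found in the gold, while a small part of the gold had no counterpart on the test side.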

For further debugging, you might try the API form of compare, or rather the mrs.compare_bags() function with count_only=False. Instead of the usual triple of counts, it then returns the actual MRSs grouped into three buckets (gold-only, shared, and test-only in the call below).

>>> from delphin.codecs import simplemrs
>>> from delphin import mrs
>>> gold = simplemrs.load('gold.txt')
>>> test = simplemrs.load('cur.txt')
>>> left, shared, right = mrs.compare_bags(gold, test, count_only=False)
>>> for m in left:
...     print(simplemrs.encode(m, indent=True))
... 
[ TOP: h0
  INDEX: e2 [ e SF: iforce E.ASPECT: aspect E.MOOD: mood E.TENSE: tense ]
  RELS: < [ _kim_n<-1:-1> LBL: h4 ARG0: x5 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ]
          [ exist_q<-1:-1> LBL: h6 ARG0: x5 RSTR: h7 BODY: h8 ]
          [ _chase_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ARG2: x5 ]
          [ _dog_n<-1:-1> LBL: h9 ARG0: x3 ]
          [ exist_q<-1:-1> LBL: h10 ARG0: x3 RSTR: h11 BODY: h12 ] >
  HCONS: < h0 qeq h1 h7 qeq h4 h11 qeq h9 >
  ICONS: < e2 contrast-focus x5 e2 semantic-focus x3 e2 semantic-focus x3 > ]
[ TOP: h0
  INDEX: e2 [ e SF: iforce E.ASPECT: aspect E.MOOD: mood E.TENSE: tense ]
  RELS: < [ _kim_n<-1:-1> LBL: h4 ARG0: x5 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ]
          [ exist_q<-1:-1> LBL: h6 ARG0: x5 RSTR: h7 BODY: h8 ]
          [ _chase_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ARG2: x5 ]
          [ _dog_n<-1:-1> LBL: h9 ARG0: x3 ]
          [ exist_q<-1:-1> LBL: h10 ARG0: x3 RSTR: h11 BODY: h12 ] >
  HCONS: < h0 qeq h1 h7 qeq h4 h11 qeq h9 >
  ICONS: < e2 contrast-focus x5 e2 semantic-focus x3 > ]

These are the MRSs on the gold side that couldn’t be matched with those on the test side.

To make things easier for you, here are the gold and test MRSs that look the most similar (each having only two ICONS):

Gold:

[ TOP: h0
  INDEX: e2 [ e SF: iforce E.ASPECT: aspect E.MOOD: mood E.TENSE: tense ]
  RELS: < [ _kim_n<-1:-1> LBL: h4 ARG0: x5 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ]
          [ exist_q<-1:-1> LBL: h6 ARG0: x5 RSTR: h7 BODY: h8 ]
          [ _chase_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ARG2: x5 ]
          [ _dog_n<-1:-1> LBL: h9 ARG0: x3 ]
          [ exist_q<-1:-1> LBL: h10 ARG0: x3 RSTR: h11 BODY: h12 ] >
  HCONS: < h0 qeq h1 h7 qeq h4 h11 qeq h9 >
  ICONS: < e2 contrast-focus x5 e2 semantic-focus x3 > ]

Test:

[ TOP: h0
  INDEX: e2 [ e SF: iforce E.ASPECT: aspect E.MOOD: mood E.TENSE: tense ]
  RELS: < [ exist_q<-1:-1> LBL: h4 ARG0: x5 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] RSTR: h6 BODY: h7 ]
          [ _kim_n<-1:-1> LBL: h8 ARG0: x5 ]
          [ _chase_v<-1:-1> LBL: h1 ARG0: e2 ARG1: x3 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] ARG2: x5 ]
          [ exist_q<-1:-1> LBL: h9 ARG0: x10 [ x COG-ST: cog-st PNG.GEND: gender PNG.NUM: number PNG.PER: person SPECI: bool ] RSTR: h11 BODY: h12 ]
          [ _dog_n<-1:-1> LBL: h13 ARG0: x10 ] >
  HCONS: < h0 qeq h1 h6 qeq h8 h11 qeq h13 >
  ICONS: < e2 semantic-focus x10 e2 contrast-focus x5 > ]

I notice that the test one has ARG1: x3 on _chase_v, and x3 is not the ARG0 of any other EP. In the gold, x3 links to the _dog_n EP; in the test, _dog_n has ARG0: x10 instead.
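
This kind of dangling argument can also be checked mechanically. Below is a minimal sketch in plain Python, not PyDelphin code: the EPs of the test MRS above are typed in by hand as (predicate, {role: variable}) pairs, and the helper `dangling_args` is a name I made up. With PyDelphin you could presumably build the same pairs from each EP's predicate and arguments, but that part is an assumption and is not shown here.

```python
def dangling_args(eps):
    """Return (predicate, role, variable) for x-type arguments that are
    not the ARG0 of any EP in the same MRS."""
    intrinsic = {args["ARG0"] for _, args in eps}
    found = []
    for pred, args in eps:
        for role, var in args.items():
            if role in ("ARG0", "LBL"):
                continue  # only look at outgoing arguments
            if var.startswith("x") and var not in intrinsic:
                found.append((pred, role, var))
    return found


# EPs of the test MRS from the post above (variable properties omitted).
test_eps = [
    ("exist_q", {"LBL": "h4", "ARG0": "x5", "RSTR": "h6", "BODY": "h7"}),
    ("_kim_n", {"LBL": "h8", "ARG0": "x5"}),
    ("_chase_v", {"LBL": "h1", "ARG0": "e2", "ARG1": "x3", "ARG2": "x5"}),
    ("exist_q", {"LBL": "h9", "ARG0": "x10", "RSTR": "h11", "BODY": "h12"}),
    ("_dog_n", {"LBL": "h13", "ARG0": "x10"}),
]

# EPs of the matching gold MRS.
gold_eps = [
    ("_kim_n", {"LBL": "h4", "ARG0": "x5"}),
    ("exist_q", {"LBL": "h6", "ARG0": "x5", "RSTR": "h7", "BODY": "h8"}),
    ("_chase_v", {"LBL": "h1", "ARG0": "e2", "ARG1": "x3", "ARG2": "x5"}),
    ("_dog_n", {"LBL": "h9", "ARG0": "x3"}),
    ("exist_q", {"LBL": "h10", "ARG0": "x3", "RSTR": "h11", "BODY": "h12"}),
]

print("test:", dangling_args(test_eps))
print("gold:", dangling_args(gold_eps))
```

The check only looks at x-type variables, since other variable types can legitimately be left unbound; that restriction is a simplification on my part.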

Thank you Mike! You nailed it! I couldn’t find it for the life of me. Automatic diff highlights too much…

Hmm, I can think of a couple of ways to produce better diffs. One is to use the isomorphism code to get the largest matching subgraph and highlight the rest. Another is to normalize the EP order and variable names so regular diffing tools (see this thread) are more helpful. So, if you want more things to do this summer you could try implementing that for PyDelphin :slight_smile:
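
As a rough idea of what the second option could look like, here is a sketch in plain Python (again not PyDelphin code, and every name in it is hypothetical). It sorts EPs by predicate, breaks ties between same-named EPs such as the two exist_q by looking at which predicates share their ARG0, and then renames variables in order of first appearance. It ignores variable properties, HCONS, and ICONS, and the tie-breaking is a heuristic rather than a real graph canonicalization.

```python
import difflib


def normalize(eps):
    """Return one text line per EP, with EPs sorted and variables renamed."""
    # Predicates grouped by ARG0, used to tell apart EPs with equal names.
    by_arg0 = {}
    for pred, args in eps:
        by_arg0.setdefault(args["ARG0"], []).append(pred)

    def sort_key(ep):
        pred, args = ep
        return (pred, sorted(by_arg0[args["ARG0"]]))

    names, counts = {}, {}

    def rename(var):
        # Keep the type letter, renumber in order of first appearance.
        if var not in names:
            kind = var[0]
            counts[kind] = counts.get(kind, 0) + 1
            names[var] = f"{kind}{counts[kind]}"
        return names[var]

    lines = []
    for pred, args in sorted(eps, key=sort_key):
        parts = [f"{role}: {rename(args[role])}" for role in sorted(args)]
        lines.append(pred + " " + " ".join(parts))
    return lines


gold_eps = [
    ("_kim_n", {"LBL": "h4", "ARG0": "x5"}),
    ("exist_q", {"LBL": "h6", "ARG0": "x5", "RSTR": "h7", "BODY": "h8"}),
    ("_chase_v", {"LBL": "h1", "ARG0": "e2", "ARG1": "x3", "ARG2": "x5"}),
    ("_dog_n", {"LBL": "h9", "ARG0": "x3"}),
    ("exist_q", {"LBL": "h10", "ARG0": "x3", "RSTR": "h11", "BODY": "h12"}),
]
test_eps = [
    ("exist_q", {"LBL": "h4", "ARG0": "x5", "RSTR": "h6", "BODY": "h7"}),
    ("_kim_n", {"LBL": "h8", "ARG0": "x5"}),
    ("_chase_v", {"LBL": "h1", "ARG0": "e2", "ARG1": "x3", "ARG2": "x5"}),
    ("exist_q", {"LBL": "h9", "ARG0": "x10", "RSTR": "h11", "BODY": "h12"}),
    ("_dog_n", {"LBL": "h13", "ARG0": "x10"}),
]

gold_lines = normalize(gold_eps)
test_lines = normalize(test_eps)
for line in difflib.unified_diff(gold_lines, test_lines, "gold", "test", lineterm=""):
    print(line)
```

On the gold/test pair from this thread, the reordering noise should disappear after normalization and only the _dog_n EP and its quantifier remain as differing lines, which points at the variable that _chase_v no longer shares with _dog_n.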

I’ll bring that up with Emily :slight_smile:.

Yes, I think at a minimum, sorting things is necessary.
