Using pydelphin's edm correctly

I’d like to report some partial DMRS comparison metric, and I understand the pydelphin edm library is exactly for that.

I am trying to use it and I am a bit confused; perhaps I am doing something wrong or am interpreting the results wrong.

I am trying two ways: the command line and the API. I have some experimental results, and the gold is the ERG treebanks under trunk/tsdb/gold. I am using pydelphin’s ACE wrapper to parse the profile. (On a related note, I am seeing different results from pydelphin’s Testsuite process function than from calling Parser interact on each individual item in the profile… But that is perhaps another question.)

With the edm:

Experiment 1

Parsed 11/25 sentences
Coverage: 0.44
2 same, 23 different, 0.08% exact match, 7.843531713485718 sec/sen
EDM: P = 0.2293354943273906, R = 0.06495294927702548, F = 0.10123412627436952

The edm numbers are from using edm through the API, where gold_mrs and results are dicts of item id to mrs:

    from delphin import dmrs, edm

    gold_dmrs = [dmrs.from_mrs(gm) for gm in gold_mrs.values()]
    results_dmrs = [dmrs.from_mrs(r) for r in results.values()]
    edm_p, edm_r, edm_f = edm.compute(gold_dmrs, results_dmrs)

Strangely, when I use edm through the command line, on the same gold and test profiles, I get:

Precision:	0.9384615384615385
   Recall:	0.265886671254875
  F-score:	0.41437254200929563

I do get warnings, both through the API and through command line:

EDSWarning: broken handle constraint: <HCons object (h0 qeq h1) at 140114977756880>
/home/olga/delphin/parsing_with_supertagging/venv/lib/python3.8/site-packages/delphin/eds/ EDSWarning: broken handle constraint: <HCons object (h0 qeq h1) at 140114977752944>

Do the warnings perhaps mean that I cannot trust the numbers unless I get rid of the warnings? Or is the command line the only reliable way to obtain the edm metric?

Regarding the different results from Testsuite process and Parser interact: I believe this may be partially related to the EDM behavior you’re seeing. The process command calls ACEParser.process_item() instead of ACEParser.interact(). The former keeps track of metadata from the profile, such as parse-id, whereas the latter only records the direct output from ACE.

If your problem is more than that, please open a separate thread or PyDelphin bug report about it. Never mind, I think that’s what this thread is about: Different exact MRS match results with pydelphin Testsuite process and Parser interact

What are these gold_mrs and results data structures that you’re calling values() on? It looks like they are dictionaries of some kind.

The EDM comparison relies on the input gold and test arguments being lists of equal length with aligned representations. For example, imagine you have a profile with 3 items. In the gold profile, items 1 and 2 parsed, but 3 didn’t. In the test profile items 1 and 3 parsed, but 2 didn’t. In this case, the lists should look like this:

    gold = [<result1>, <result2>, None]
    test = [<result1>, None, <result3>]

If they are not aligned in this way, EDM won’t be able to do a proper comparison. I think this explains the low numbers you are seeing.
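For dictionaries keyed by item id, as in the question, one way to build aligned lists is sketched below. The helper names align_by_item_id and convert_keeping_gaps are hypothetical (they are not part of PyDelphin), and the sketch assumes the ids are sortable and that a missing key means the item did not parse. The PyDelphin calls are left as comments so the snippet runs without the library installed.

```python
def align_by_item_id(gold, test):
    """Return (ids, gold_list, test_list) with one slot per item id.

    `gold` and `test` are dicts mapping item id -> representation.
    An id that is missing from one side gets None in that side's list.
    """
    item_ids = sorted(set(gold) | set(test))
    gold_list = [gold.get(i) for i in item_ids]
    test_list = [test.get(i) for i in item_ids]
    return item_ids, gold_list, test_list


def convert_keeping_gaps(items, convert):
    """Apply `convert` to each item, leaving None placeholders in place."""
    return [convert(x) if x is not None else None for x in items]


# Toy stand-ins for MRS objects: items 1 and 2 parsed in gold, 1 and 3 in test.
gold_mrs = {1: "gold-mrs-1", 2: "gold-mrs-2"}
results = {1: "test-mrs-1", 3: "test-mrs-3"}

ids, gold_list, test_list = align_by_item_id(gold_mrs, results)
print(ids)        # [1, 2, 3]
print(gold_list)  # ['gold-mrs-1', 'gold-mrs-2', None]
print(test_list)  # ['test-mrs-1', None, 'test-mrs-3']

# With real MRS objects, the aligned lists could then be scored like this:
# from delphin import dmrs, edm
# gold_dmrs = convert_keeping_gaps(gold_list, dmrs.from_mrs)
# test_dmrs = convert_keeping_gaps(test_list, dmrs.from_mrs)
# p, r, f = edm.compute(gold_dmrs, test_dmrs)
```

Calling values() on two separate dicts, as in the original snippet, drops the ids, so the n-th gold item can end up paired with an unrelated test item whenever the two dicts do not contain the same keys in the same order.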

Also, it’s worth noting that when you compare profiles with delphin edm gold-profile test-profile, PyDelphin converts the MRSs in the profile into EDS, not DMRS. I don’t think that EDM with EDS vs with DMRS will make a big difference in the scores, though.

These warnings are from the conversion process. Some of the MRSs are not fully valid. E.g., you might have h0 qeq h1, but maybe there is no EP with LBL: h1. If there is an error in conversion, it is treated as though the item failed to parse, but if it’s just a warning it should be able to proceed.
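If you do the conversion yourself through the API, you can mimic that behavior with a small wrapper. This is only a sketch of the idea, not PyDelphin's own code: convert_or_none and toy_convert are hypothetical names, and catching every Exception is an assumption you may want to narrow to the specific error your conversion raises.

```python
import warnings


def convert_or_none(item, convert):
    """Convert one item; treat a conversion error like a failed parse (None).

    Warnings raised by `convert` do not stop the conversion.
    """
    if item is None:
        return None
    try:
        return convert(item)
    except Exception:  # assumption: any conversion error means "no result"
        return None


def toy_convert(item):
    """Hypothetical converter standing in for e.g. dmrs.from_mrs."""
    if item == "invalid":
        raise ValueError("cannot convert")
    if item == "broken-hcons":
        warnings.warn("broken handle constraint", UserWarning)
    return item.upper()


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the toy warning for this demo
    converted = [
        convert_or_none(x, toy_convert)
        for x in ["ok", "broken-hcons", "invalid", None]
    ]

print(converted)  # ['OK', 'BROKEN-HCONS', None, None]
```

The item that only triggers a warning still produces a result, while the item that raises an error becomes None and keeps its slot in the list, so the alignment with the other list is preserved.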

Thank you @goodmami for the help!

Indeed, making sure that the comparison lists look like this:

    gold = [<result1>, <result2>, None]
    test = [<result1>, None, <result3>]

fixed the problem. I now get the same numbers through calling edm.compute() as through calling edm via command line.