Using ERG for information extraction


#1

After some initial experiments with ERG coverage, I am starting to investigate how to extract information from the parse trees. For instance:

SENT: Fassett referred to sandstones below the La Ventana as basal Cliff House Sandstone.
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: past MOOD: indicative PROG: - PERF: - ]
RELS: < [ proper_q<0:7> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
 [ named<0:7> LBL: h7 CARG: "Fassett" ARG0: x3 ]
 [ _refer_v_to<8:16> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x9 [ x PERS: 3 NUM: pl IND: + ] ]
 [ udef_q<20:51> LBL: h10 ARG0: x9 RSTR: h11 BODY: h12 ]
 [ _sandstone_n_1<20:30> LBL: h13 ARG0: x9 ]
 [ _below_p<31:36> LBL: h13 ARG0: e14 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x9 ARG2: x15 [ x PERS: 3 NUM: sg IND: + ] ]
 [ _the_q<37:40> LBL: h16 ARG0: x15 RSTR: h17 BODY: h18 ]
 [ compound<41:51> LBL: h19 ARG0: e20 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x15 ARG2: x21 [ x PERS: 3 NUM: sg IND: + PT: notpro ] ]
 [ proper_q<41:43> LBL: h22 ARG0: x21 RSTR: h23 BODY: h24 ]
 [ named<41:43> LBL: h25 CARG: "LA" ARG0: x21 ]
 [ named<44:51> LBL: h19 CARG: "Ventana" ARG0: x15 ]
 [ _as_p<52:54> LBL: h1 ARG0: e28 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: e2 ARG2: x29 [ x PERS: 3 NUM: sg ] ]
 [ udef_q<55:83> LBL: h30 ARG0: x29 RSTR: h31 BODY: h32 ]
 [ _basal/JJ_u_unknown<55:60> LBL: h33 ARG0: e34 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x29 ]
 [ compound<61:83> LBL: h33 ARG0: e35 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x29 ARG2: x36 [ x PERS: 3 NUM: sg IND: + PT: notpro ] ]
 [ proper_q<61:66> LBL: h37 ARG0: x36 RSTR: h38 BODY: h39 ]
 [ named<61:66> LBL: h40 CARG: "Cliff" ARG0: x36 ]
 [ compound<67:83> LBL: h33 ARG0: e42 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x29 ARG2: x43 [ x IND: + PT: notpro ] ]
 [ udef_q<67:72> LBL: h44 ARG0: x43 RSTR: h45 BODY: h46 ]
 [ named_n<67:72> LBL: h47 CARG: "House" ARG0: x43 ]
 [ _sandstone_n_1<73:83> LBL: h33 ARG0: x29 ] >
HCONS: < h0 qeq h1 h5 qeq h7 h11 qeq h13 h17 qeq h19 h23 qeq h25 h31 qeq h33 h38 qeq h40 h45 qeq h47 >
ICONS: < > ]

Is there a simple way to inspect this structure with PyDelphin (https://pydelphin.readthedocs.io)? For example, how can I easily enumerate the proper nouns, merging the tokens that are arguments of compounds? I ask because I suspect that many functions for extracting patterns from an MRS already exist, right?


#2

Hi Alexandre,

There are some ways to do this. First I’ll answer for the general case, and below I’ll discuss challenges for the proper-noun case.

General case

PyDelphin does have facilities for MRS inspection, although the general-purpose pattern matching you’re asking for doesn’t quite exist yet. I would look at the methods on Xmrs objects and at the delphin.mrs.query module for building your own solutions.

There is an undocumented “MrsPath” utility, inspired by XPath but designed for MRS graph structures. It was underutilized, however, and it will be removed in a future version of PyDelphin. Rather than for querying, it was used to generate path descriptions of MRS subgraphs given an Xmrs object. See here for some human-readable test cases. In the future I may add support for queries using Semantic Fingerprints, but I don’t know when that will be.

The pydmrs package has support for queries using a DMRS Graph Description Language. Unfortunately I don’t know of much documentation, but you might find this paper, these slides, and these unit tests informative.

In general, I think that DMRS graphs are easier to use than MRS graphs. You might also enjoy the PENMAN serialization of DMRS, which puts the graph in a tree-like structure such that useful relationships are encoded in a parent–child relationship. Such a serialization for your sentence might look like this:

(10002 / _refer_v_to
   :ARG1-NEQ (10001 / named
      :carg "Fassett"
      :RSTR-H-of (10000 / proper_q))
   :ARG2-NEQ (10004 / _sandstone_n_1
      :RSTR-H-of (10003 / udef_q)
      :ARG1-EQ-of (10005 / _below_p
         :ARG2-NEQ (10010 / named
            :carg "Ventana"
            :RSTR-H-of (10006 / _the_q)
            :ARG1-EQ-of (10007 / compound
               :ARG2-NEQ (10009 / named
                  :carg "LA"
                  :RSTR-H-of (10008 / proper_q))))))
   :ARG1-EQ-of (10011 / _as_p
      :ARG2-NEQ (10020 / _sandstone_n_1
         :RSTR-H-of (10012 / udef_q)
         :ARG1-EQ-of (10013 / _basal/jj_u_unknown)
         :ARG1-EQ-of (10014 / compound
            :ARG2-NEQ (10016 / named
               :carg "Cliff"
               :RSTR-H-of (10015 / proper_q)))
         :ARG1-EQ-of (10017 / compound
            :ARG2-NEQ (10019 / named_n
               :carg "House"
               :RSTR-H-of (10018 / udef_q))))))
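As a rough illustration of how regular this notation is, here is a minimal sketch that lists the nodes of such a serialization using only the standard library. The `list_nodes` helper and the `NODE` pattern are names made up for this example, and the regular expression only covers the simple shapes shown above; a real PENMAN parser (for instance the `penman` package) would be the more robust choice.

```python
import re

# A fragment of the serialization above, shortened for the example.
penman_text = '''
(10002 / _refer_v_to
   :ARG1-NEQ (10001 / named
      :carg "Fassett"
      :RSTR-H-of (10000 / proper_q))
   :ARG2-NEQ (10004 / _sandstone_n_1
      :RSTR-H-of (10003 / udef_q)))
'''

# Each node is introduced as "(<id> / <predicate>", optionally followed
# directly by a :carg attribute holding a quoted string.
NODE = re.compile(r'\((\d+) / ([^\s()]+)(?:\s+:carg "([^"]*)")?')


def list_nodes(text):
    """Return (node id, predicate, carg or None) in order of appearance."""
    return [(int(node_id), pred, carg or None)
            for node_id, pred, carg in NODE.findall(text)]


for node in list_nodes(penman_text):
    print(node)
```

This only recovers the nodes, not the links between them, so it is useful for a quick look at which predicates and CARGs are present rather than for real graph queries.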

Proper nouns

Simple proper nouns, like “Fassett” in your example, are named EPs quantified by proper_qs. Slightly more complicated are compounds, like “La Ventana”, where a compound EP joins two named EPs (and in this case, the syntactic head noun is quantified by _the_q instead of proper_q). Coordinated names (“Bill and Melinda Gates”) are more complicated. Some proper nouns include common nouns, such as “The University of Washington”, which has _university_n_1 and not named("University"), so the line starts to blur about where the proper nouns begin and end.

One method from a colleague of mine was to do NER with an off-the-shelf system and then to project the recognized entities onto the relevant EPs in the MRS. This allowed us to avoid making decisions about the above complexities, but it wasn’t always clear how to project words (tokenized with a different scheme than the one the ERG uses) onto the MRSs (see this thread on the mailing list: http://lists.delph-in.net/archives/developers/2017/002537.html). Unfortunately I don’t think I have access to this code now.

You could also crawl the MRS starting from, e.g., named EPs (or things with CARGs, including mofy, dofw, etc), and traverse to acceptable nodes (to proper_q, to compound and its ARG1/ARG2, etc.).
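To make the crawling idea concrete, here is a minimal sketch in plain Python. It does not use PyDelphin’s classes: the `EP` tuple and the `proper_nouns` function are hypothetical stand-ins, and the EPs are copied by hand from the MRS in the first post. It starts from every EP that has a CARG, follows compound EPs between their ARG1 and ARG2 variables, and joins the CARGs of each connected group in surface order.

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal stand-in for an elementary predication (EP);
# this is not PyDelphin's own class.
EP = namedtuple("EP", "pred cfrom carg args")

# Hand-copied subset of the EPs in the example MRS (quantifiers omitted).
eps = [
    EP("named", 0, "Fassett", {"ARG0": "x3"}),
    EP("_refer_v_to", 8, None, {"ARG0": "e2", "ARG1": "x3", "ARG2": "x9"}),
    EP("_sandstone_n_1", 20, None, {"ARG0": "x9"}),
    EP("_below_p", 31, None, {"ARG0": "e14", "ARG1": "x9", "ARG2": "x15"}),
    EP("compound", 41, None, {"ARG0": "e20", "ARG1": "x15", "ARG2": "x21"}),
    EP("named", 41, "LA", {"ARG0": "x21"}),
    EP("named", 44, "Ventana", {"ARG0": "x15"}),
    EP("compound", 61, None, {"ARG0": "e35", "ARG1": "x29", "ARG2": "x36"}),
    EP("named", 61, "Cliff", {"ARG0": "x36"}),
    EP("compound", 67, None, {"ARG0": "e42", "ARG1": "x29", "ARG2": "x43"}),
    EP("named_n", 67, "House", {"ARG0": "x43"}),
    EP("_sandstone_n_1", 73, None, {"ARG0": "x29"}),
]


def proper_nouns(eps):
    """Join the CARGs of EPs whose variables are linked by compound EPs."""
    # A compound EP links its ARG1 (head) and ARG2 (modifier) variables.
    links = defaultdict(set)
    for ep in eps:
        if ep.pred == "compound":
            head, modifier = ep.args["ARG1"], ep.args["ARG2"]
            links[head].add(modifier)
            links[modifier].add(head)

    with_carg = sorted((ep for ep in eps if ep.carg is not None),
                       key=lambda ep: ep.cfrom)
    seen, results = set(), []
    for start in with_carg:
        if start.args["ARG0"] in seen:
            continue
        # Collect every variable reachable over compound links.
        group, agenda = set(), [start.args["ARG0"]]
        while agenda:
            var = agenda.pop()
            if var not in group:
                group.add(var)
                agenda.extend(links[var])
        seen |= group
        parts = [ep.carg for ep in with_carg if ep.args["ARG0"] in group]
        results.append(" ".join(parts))
    return results


print(proper_nouns(eps))  # ['Fassett', 'LA Ventana', 'Cliff House']
```

Note that “Cliff House Sandstone” comes out as “Cliff House”, because the head _sandstone_n_1 has no CARG. Whether common-noun heads like this should be included is exactly the kind of decision this approach forces you to make.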

For my dissertation research I had a subgraph extraction method that looked for MRS fragments that included all predicates in a given bag, plus any predicates I allowed to be included “for free”. For example, I could ask it to find the subgraph including named("La") and named("Ventana"), and allow it to include any compound nodes necessary to join the two.
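The general idea can be sketched as follows. This is not the dissertation code, only a simplified illustration under my own assumptions: nodes are plain (predicate, CARG) pairs keyed by the DMRS node ids used in the PENMAN example, links are treated as undirected, and `extract_subgraph` is a hypothetical name. The search starts at a required node and may only step onto nodes that are required or whose predicate is in the “free” set.

```python
from collections import defaultdict

# Hand-built fragment of the DMRS around "La Ventana".
nodes = {
    10005: ("_below_p", None),
    10006: ("_the_q", None),
    10007: ("compound", None),
    10008: ("proper_q", None),
    10009: ("named", "LA"),
    10010: ("named", "Ventana"),
}
links = [
    (10005, 10010, "ARG2/NEQ"),
    (10006, 10010, "RSTR/H"),
    (10007, 10010, "ARG1/EQ"),
    (10007, 10009, "ARG2/NEQ"),
    (10008, 10009, "RSTR/H"),
]


def extract_subgraph(nodes, links, required, free):
    """Return sorted node ids of a connected subgraph that covers every
    (predicate, carg) pair in `required`, passing only through required
    nodes or nodes whose predicate is in `free`; None if there is none."""
    neighbors = defaultdict(set)
    for source, target, _role in links:
        neighbors[source].add(target)
        neighbors[target].add(source)

    def allowed(node_id):
        pred, _carg = nodes[node_id]
        return nodes[node_id] in required or pred in free

    for start in sorted(nodes):
        if nodes[start] not in required:
            continue
        found, agenda = set(), [start]
        while agenda:
            node_id = agenda.pop()
            if node_id in found or not allowed(node_id):
                continue
            found.add(node_id)
            agenda.extend(neighbors[node_id])
        if required <= {nodes[n] for n in found}:
            return sorted(found)
    return None


bag = {("named", "LA"), ("named", "Ventana")}
print(extract_subgraph(nodes, links, bag, free={"compound"}))
```

With `free={"compound"}` the two named nodes are joined through node 10007; with an empty `free` set no connected subgraph covers the bag and the function returns None. This sketch returns everything reachable through free nodes, so it does not try to find the smallest such subgraph.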

Summary

There are several ways to find patterns in MRS graphs, but it’s not really “simple”. Let me know if any of the above look appealing, and maybe I can provide some more information.


#3

Thank you @goodmami! I will play with the libraries and report my progress. Actually, after reading the paper about the Searchbench (http://aclweb.org/anthology/P/P11/P11-4002.pdf), I started to think that reproducing it would be the best option for my first goal. The question is how reproducible it is: many details are missing from the two papers that I have read so far. Does anyone know if the Searchbench code is available elsewhere? From the paper Advances in Deep Parsing of Scholarly Paper Content, it seems the important piece is the PET parser’s ability to receive XML input with POS and NER already marked. So far I have been using ACE, so there is a new tool to investigate now! A lot of fun! :wink:


#4

Ulrich Schaefer is the primary author of the Anthology Searchbench and its back-end, called Heart of Gold (HoG). He would be able to answer your questions (although he’s not on this forum, so you’ll need to email him or the developers list for him to see the questions). You might also find some answers at these links:

I believe that ACE does not have implementations for several technologies used by HoG, such as RMRS output and some XML input formats, so PET is suggested for such a setup.

I also recall that Stephan Oepen and Milen Kouylekov worked on efficient large-scale search of EDS banks stored as RDF triples (“WeSearch”), which might do what you want:


#5

Alexandre, we (me, Kristen Howell, and Adam Rhine) have a paper (upcoming at COLING) and the associated source code where we crawl DMRS obtained with ACE to extract some of the relations. We can share it with you (though I don’t know how much help it will be for your specific purposes).


#6

Pydmrs has a matching library (https://github.com/delph-in/pydmrs/tree/master/pydmrs/matching), where you can give a query DMRS to be matched against other DMRSs.


#7

Can you share the paper?


#8

The paper is here.

If you find that you would like to set up anything similar, let me know; I can probably help (share some code, etc.). We can’t share the data from that project, which is a bit of an issue since some of the code assumes that particular data format. But I am sure we can figure something out.