Using ERG for information extraction

After some initial experiments with ERG coverage, I am starting to investigate how to extract information from the parse trees. For instance:

SENT: Fassett referred to sandstones below the La Ventana as basal Cliff House Sandstone.
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: past MOOD: indicative PROG: - PERF: - ]
RELS: < [ proper_q<0:7> LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg IND: + ] RSTR: h5 BODY: h6 ]
 [ named<0:7> LBL: h7 CARG: "Fassett" ARG0: x3 ]
 [ _refer_v_to<8:16> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x9 [ x PERS: 3 NUM: pl IND: + ] ]
 [ udef_q<20:51> LBL: h10 ARG0: x9 RSTR: h11 BODY: h12 ]
 [ _sandstone_n_1<20:30> LBL: h13 ARG0: x9 ]
 [ _below_p<31:36> LBL: h13 ARG0: e14 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x9 ARG2: x15 [ x PERS: 3 NUM: sg IND: + ] ]
 [ _the_q<37:40> LBL: h16 ARG0: x15 RSTR: h17 BODY: h18 ]
 [ compound<41:51> LBL: h19 ARG0: e20 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x15 ARG2: x21 [ x PERS: 3 NUM: sg IND: + PT: notpro ] ]
 [ proper_q<41:43> LBL: h22 ARG0: x21 RSTR: h23 BODY: h24 ]
 [ named<41:43> LBL: h25 CARG: "LA" ARG0: x21 ]
 [ named<44:51> LBL: h19 CARG: "Ventana" ARG0: x15 ]
 [ _as_p<52:54> LBL: h1 ARG0: e28 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: e2 ARG2: x29 [ x PERS: 3 NUM: sg ] ]
 [ udef_q<55:83> LBL: h30 ARG0: x29 RSTR: h31 BODY: h32 ]
 [ _basal/JJ_u_unknown<55:60> LBL: h33 ARG0: e34 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x29 ]
 [ compound<61:83> LBL: h33 ARG0: e35 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x29 ARG2: x36 [ x PERS: 3 NUM: sg IND: + PT: notpro ] ]
 [ proper_q<61:66> LBL: h37 ARG0: x36 RSTR: h38 BODY: h39 ]
 [ named<61:66> LBL: h40 CARG: "Cliff" ARG0: x36 ]
 [ compound<67:83> LBL: h33 ARG0: e42 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x29 ARG2: x43 [ x IND: + PT: notpro ] ]
 [ udef_q<67:72> LBL: h44 ARG0: x43 RSTR: h45 BODY: h46 ]
 [ named_n<67:72> LBL: h47 CARG: "House" ARG0: x43 ]
 [ _sandstone_n_1<73:83> LBL: h33 ARG0: x29 ] >
HCONS: < h0 qeq h1 h5 qeq h7 h11 qeq h13 h17 qeq h19 h23 qeq h25 h31 qeq h33 h38 qeq h40 h45 qeq h47 >
ICONS: < > ]

Is there any simple way to inspect this structure with https://pydelphin.readthedocs.io? For example, how can I easily enumerate the proper nouns (merging the token arguments of compounds)? I am asking because I suspect that many functions for extracting patterns from the MRS already exist, right?

Hi Alexandre,

There are some ways to do this. First I’ll answer for the general case, and below I’ll discuss challenges for the proper-noun case.

General case

PyDelphin does have facilities for MRS inspection, although the general-purpose pattern matching you’re asking for doesn’t quite exist yet. I would look at the methods on Xmrs objects and the delphin.mrs.query module for building your own solutions.
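To give a feel for the kind of inspection involved, here is a minimal sketch in plain Python. It does not use any PyDelphin API; it just reads a few EP lines from the MRS above with regular expressions. The names `read_eps`, `EP_RE`, and `ROLE_RE` are made up for illustration, and the variable property blocks (`[ x PERS: 3 ... ]`) are trimmed from the input because this simple pattern does not handle nested brackets.

```python
import re

# A few EP lines copied from the MRS above, with variable properties trimmed.
RELS = '''
[ proper_q<0:7> LBL: h4 ARG0: x3 RSTR: h5 BODY: h6 ]
[ named<0:7> LBL: h7 CARG: "Fassett" ARG0: x3 ]
[ _refer_v_to<8:16> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x9 ]
[ _sandstone_n_1<20:30> LBL: h13 ARG0: x9 ]
'''

# One EP: predicate, character span, then role/value pairs up to the closing bracket.
EP_RE = re.compile(r'\[\s*(?P<pred>\S+?)<(?P<start>\d+):(?P<end>\d+)>(?P<body>[^\]]*)\]')
# One role/value pair; the value is either a quoted string (CARG) or a bare token.
ROLE_RE = re.compile(r'(\w+):\s*("[^"]*"|\S+)')


def read_eps(text):
    """Return one dict per EP with its predicate, span, and roles (hypothetical helper)."""
    eps = []
    for m in EP_RE.finditer(text):
        roles = {role: value.strip('"') for role, value in ROLE_RE.findall(m.group('body'))}
        eps.append({'pred': m.group('pred'),
                    'span': (int(m.group('start')), int(m.group('end'))),
                    **roles})
    return eps


eps = read_eps(RELS)
# EPs that carry a CARG are the name-like ones.
names = [ep['CARG'] for ep in eps if 'CARG' in ep]
print(names)
```

For real work you would want a proper MRS reader instead of regular expressions, but the resulting structure (a list of EPs with roles pointing at shared variables) is what any query has to walk.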

There is an undocumented “MrsPath” utility, inspired by XPath but for MRS graph structures; however, it was underutilized and will be removed in a future version of PyDelphin. Rather than for querying, it was used to generate path descriptions of MRS subgraphs given an Xmrs object. See here for some human-readable test cases. In the future I may add support for queries using Semantic Fingerprints, but I don’t know when that will be.

The pydmrs package has support for queries using a DMRS Graph Description Language. Unfortunately I don’t know of much documentation, but you might find this paper, these slides, and these unit tests informative.

In general, I think that DMRS graphs are easier to use than MRS graphs. You might also enjoy the PENMAN serialization of DMRS, which puts the graph in a tree-like structure such that useful relationships are encoded in a parent–child relationship. Such a serialization for your sentence might look like this:

(10002 / _refer_v_to
   :ARG1-NEQ (10001 / named
      :carg "Fassett"
      :RSTR-H-of (10000 / proper_q))
   :ARG2-NEQ (10004 / _sandstone_n_1
      :RSTR-H-of (10003 / udef_q)
      :ARG1-EQ-of (10005 / _below_p
         :ARG2-NEQ (10010 / named
            :carg "Ventana"
            :RSTR-H-of (10006 / _the_q)
            :ARG1-EQ-of (10007 / compound
               :ARG2-NEQ (10009 / named
                  :carg "LA"
                  :RSTR-H-of (10008 / proper_q))))))
   :ARG1-EQ-of (10011 / _as_p
      :ARG2-NEQ (10020 / _sandstone_n_1
         :RSTR-H-of (10012 / udef_q)
         :ARG1-EQ-of (10013 / _basal/jj_u_unknown)
         :ARG1-EQ-of (10014 / compound
            :ARG2-NEQ (10016 / named
               :carg "Cliff"
               :RSTR-H-of (10015 / proper_q)))
         :ARG1-EQ-of (10017 / compound
            :ARG2-NEQ (10019 / named_n
               :carg "House"
               :RSTR-H-of (10018 / udef_q))))))

Proper nouns

Simple proper nouns, like “Fassett” in your example, are named EPs quantified by proper_qs. Slightly more complicated are compounds, like “La Ventana”, where a compound EP joins two named EPs (and in this case, the syntactic head noun is quantified by _the_q instead of proper_q). Coordinated names (“Bill and Melinda Gates”) are more complicated. Some proper nouns include common nouns, such as “The University of Washington”, which has _university_n_1 and not named("University"), so the line starts to blur about where the proper nouns begin and end.

One method from a colleague of mine was to do NER with an off-the-shelf system and then to project the recognized entities onto the relevant EPs in the MRS. This allowed us to avoid making decisions about the above complexities, but it wasn’t always clear how to project words (tokenized with a different scheme than what the ERG uses) onto the MRSs (see this thread on the mailing list: http://lists.delph-in.net/archives/developers/2017/002537.html). Unfortunately I don’t think I have access to this code now.

You could also crawl the MRS starting from, e.g., named EPs (or things with CARGs, including mofy, dofw, etc), and traverse to acceptable nodes (to proper_q, to compound and its ARG1/ARG2, etc.).
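As a rough sketch of that crawl, assuming the EPs have already been read into simple tuples (the tuple layout and the `merged_names` helper are made up for illustration, not taken from PyDelphin): start from each EP that has a CARG, follow compound EPs between their ARG1 and ARG2 variables, and join the CARGs found along the way in surface order.

```python
from collections import defaultdict

# Hand-transcribed subset of the EPs in the MRS above:
# (predicate, start offset, ARG0, CARG, ARG1, ARG2)
EPS = [
    ('named', 0, 'x3', 'Fassett', None, None),
    ('_sandstone_n_1', 20, 'x9', None, None, None),
    ('compound', 41, 'e20', None, 'x15', 'x21'),
    ('named', 41, 'x21', 'LA', None, None),
    ('named', 44, 'x15', 'Ventana', None, None),
    ('compound', 61, 'e35', None, 'x29', 'x36'),
    ('named', 61, 'x36', 'Cliff', None, None),
    ('compound', 67, 'e42', None, 'x29', 'x43'),
    ('named_n', 67, 'x43', 'House', None, None),
    ('_sandstone_n_1', 73, 'x29', None, None, None),
]


def merged_names(eps):
    """Group CARG-bearing EPs that are joined by compound EPs (hypothetical helper)."""
    # Link the two variables of every compound EP (ARG1 = head, ARG2 = modifier).
    neighbours = defaultdict(set)
    for pred, _, _, _, arg1, arg2 in eps:
        if pred == 'compound':
            neighbours[arg1].add(arg2)
            neighbours[arg2].add(arg1)
    # Index the CARGs by the variable of the EP that carries them.
    cargs = defaultdict(list)
    for _, start, arg0, carg, _, _ in eps:
        if carg is not None:
            cargs[arg0].append((start, carg))
    seen, names = set(), []
    for var in sorted(cargs, key=lambda v: min(cargs[v])):
        if var in seen:
            continue
        # Depth-first crawl over compound links, starting from a CARG-bearing node.
        group, stack = [], [var]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            group.extend(cargs.get(v, []))
            stack.extend(neighbours[v] - seen)
        # Sort by character offset so the pieces come out in surface order.
        names.append(' '.join(carg for _, carg in sorted(group)))
    return names


print(merged_names(EPS))  # ['Fassett', 'LA Ventana', 'Cliff House']
```

Note that the last name comes out as “Cliff House” without “Sandstone”, because the head of that compound is the common noun _sandstone_n_1. Whether to include such heads is exactly the kind of decision you would have to make yourself.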

For my dissertation research I had a subgraph extraction method that looked for MRS fragments that included all predicates in a given bag, plus any predicates I allowed to be included “for free”. For example, I could ask it to find the subgraph including named("La") and named("Ventana"), and allow it to include any compound nodes necessary to join the two.
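To illustrate the idea only (this is a simplified sketch written for this post, not the dissertation code), one could restrict the graph to nodes whose labels are either required or allowed for free, and then look for a connected component that covers every required label. The node ids and links below follow the PENMAN serialization above; the `extract_subgraph` helper and the way labels combine predicate and CARG are assumptions, and the CARG is written “LA” as it appears in the parse.

```python
# Nodes around "La Ventana" from the DMRS above: id -> label.
NODES = {
    10005: '_below_p',
    10006: '_the_q',
    10007: 'compound',
    10008: 'proper_q',
    10009: 'named("LA")',
    10010: 'named("Ventana")',
}
# Undirected view of the DMRS links among those nodes.
LINKS = [(10005, 10010), (10006, 10010), (10007, 10010),
         (10007, 10009), (10008, 10009)]


def extract_subgraph(nodes, links, required, free=()):
    """Return a connected set of node ids covering all required labels, or None."""
    allowed = set(required) | set(free)
    keep = {n for n, label in nodes.items() if label in allowed}
    # Adjacency restricted to the nodes we are allowed to use.
    adjacency = {n: set() for n in keep}
    for a, b in links:
        if a in keep and b in keep:
            adjacency[a].add(b)
            adjacency[b].add(a)
    unvisited = set(keep)
    while unvisited:
        # Collect one connected component with a depth-first search.
        component, stack = set(), [unvisited.pop()]
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        unvisited -= component
        if set(required) <= {nodes[n] for n in component}:
            return component
    return None


bag = ['named("LA")', 'named("Ventana")']
with_compound = extract_subgraph(NODES, LINKS, required=bag, free=['compound'])
without_compound = extract_subgraph(NODES, LINKS, required=bag)
print(with_compound, without_compound)
```

With compound allowed for free, the two names are joined through node 10007; without it, no connected fragment contains both, so nothing is returned. A fuller version would also trim free nodes that are not needed to connect the required ones.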

Summary

There are several ways to find patterns in MRS graphs, but it’s not really “simple”. Let me know if any of the above look appealing, and maybe I can provide some more information.

Thank you @goodmami! I will play with the libraries and report my progress. Actually, after reading the paper about the Searchbench (http://aclweb.org/anthology/P/P11/P11-4002.pdf), I started to think that reproducing it would be the best option for my first goal. The question is how reproducible it is. Many details are missing from the two papers that I have read so far. Does anyone know if the Searchbench code is available anywhere? From the paper Advances in Deep Parsing of Scholarly Paper Content, it seems the important piece is the PET parser’s ability to receive XML input with POS and NER already marked. So far I have been using ACE, so there is a new tool to investigate now! A lot of fun! :wink:

Ulrich Schaefer is the primary author of the Anthology Searchbench and its back-end, called Heart of Gold (HoG). He would be able to answer your questions (although he’s not on this forum, so you’ll need to email him or the developers list for him to see the questions). You might also find some answers at these links:

I believe that ACE does not have implementations for several technologies used by HoG, such as RMRS output and some XML input formats, so PET is suggested for such a setup.

I also recall that Stephan Oepen and Milen Kouylekov worked on efficient large-scale search of EDS banks stored as RDF triples (“WeSearch”), which might do what you want:

Alexandre, we (me, Kristen Howell, and Adam Rhine) have a paper (upcoming at COLING) and the associated source code where we crawl DMRS obtained with ACE to extract some of the relations. We can share it with you (though I don’t know how much help it will be for your specific purposes).

Pydmrs has a matching library (https://github.com/delph-in/pydmrs/tree/master/pydmrs/matching), where you can give a query DMRS to be matched against other DMRSs.


Can you share the paper?

The paper is here.

If you find that you would like to set up anything similar, let me know; I can probably help (share some code, etc.). We can’t share the data from that project, which is a bit of an issue since some of the code assumes that particular data format. But I am sure we can figure something out.


Regarding the http://wesearch.delph-in.net interface, has anyone here used it before? I wrote to Milen yesterday, but I haven’t heard back from him yet. The code is available, but I didn’t understand how to produce the expected input files for the create-index command from the parse output or a profile.

There is actually a CLMS student (Roman) working on creating indices for the ERG 2018 version of the treebanks and updating documentation as he goes. What is your use case?

Hi @ebender,

We want to have an interface for browsing and searching semantic structures so that we can show the value of deep parsing of texts. People could explore the corpus looking for useful patterns to be exploited further for information extraction. For instance, in our corpus, simple queries like

e:*_produce_v_*[ARG2 x]
x:*oil*

can be very informative.
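To spell out what such a query asks for, here is a rough sketch in plain Python. It reflects my reading of the query (a predicate matching `*_produce_v_*` whose ARG2 variable is the ARG0 of a predicate matching `*oil*`), not how WeSearch implements it. The EPs are invented for illustration, and `match_query` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Invented EPs for a sentence like "The well produced oil." (illustration only).
EPS = [
    {'pred': '_well_n_1', 'ARG0': 'x3'},
    {'pred': '_produce_v_1', 'ARG0': 'e2', 'ARG1': 'x3', 'ARG2': 'x8'},
    {'pred': '_oil_n_1', 'ARG0': 'x8'},
]


def match_query(eps, head_pattern, role, arg_pattern):
    """Find (head, argument) predicate pairs linked through `role` (hypothetical helper)."""
    # Index predicates by their intrinsic variable (ARG0).
    by_var = {}
    for ep in eps:
        by_var.setdefault(ep['ARG0'], []).append(ep['pred'])
    results = []
    for ep in eps:
        if fnmatchcase(ep['pred'], head_pattern) and role in ep:
            # Follow the role to the variable it points at and test what lives there.
            for target in by_var.get(ep[role], []):
                if fnmatchcase(target, arg_pattern):
                    results.append((ep['pred'], target))
    return results


hits = match_query(EPS, '*_produce_v_*', 'ARG2', '*oil*')
print(hits)
```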

We have already created a profile using art+ace with ~850 short sentences. We have also noticed that the MRS produced by the parse and redwoods scripts described in http://moin.delph-in.net/ErgProcessing differs from the MRS produced by ACE with the trunk version of ERG: LTOP vs TOP. I am not sure what causes the difference: the ERG version or the parser?

We created a repository for the code downloaded from http://wesearch.delph-in.net at https://github.com/own-pt/wsi. What is not clear is which input formats the create-index command from wesearch supports. It can read files produced by the redwoods script (gz-compressed files like the example in http://moin.delph-in.net/ErgProcessing/SampleExport). But I would like to produce the input for create-index from the profile created with art+ace using the trunk version of ERG. How can I do that?

Is there any documentation about the kind of file produced by the redwoods script? I know that this script actually calls the export function from [incr tsdb()]. I know that we have one item per file, but is this file a simple text file with many possible views (the term used in http://moin.delph-in.net/ErgProcessing, also called ‘perspectives on HPSG analyses’ in http://moin.delph-in.net/ErgProcessing/SampleExport) separated by blank lines, without any specific order or markup to identify the structures?

Can any other tool produce this kind of file from a profile? I didn’t find anything about it in the https://pydelphin.readthedocs.io/en/latest/tutorials/itsdb.html documentation, and the output of the delphin convert command is different: one single file with one kind of representation for each item.

Alternatively, can we import a profile created with art+ace into [incr tsdb()] only to export it in this one-item-per-file format?

Oops! I hope I made myself clear.


Can you share this code?

Thanks, but Matic’s pydmrs isn’t what I was talking about. It was work by a different colleague, as an extension of https://github.com/sinantie/NeuralAmr

For my dissertation code, the script that does what I described is here: https://github.com/goodmami/xmt/blob/master/scripts/extract-subgraphs

Unfortunately I did not document that repo well enough. If you are interested, I’ll need to dig around to put together some instructions for getting it working.


The slides @goodmami pointed to previously present GraphLang. In http://www.lrec-conf.org/proceedings/lrec2016/pdf/634_Paper.pdf this language was mentioned as under construction. What is its current status? Is it fully implemented in pydmrs?

Moreover, https://github.com/matichorvat/pydmrs and https://github.com/delph-in/pydmrs are not related; that is, neither is a fork of the other. I am taking the second one as the ‘official’ pydmrs, but the first one may confuse people. Maybe @goodmami or @guyemerson knows the history behind these two repositories?

They are completely separate, IIRC. Matic’s predates the other and is for his work on realization. The delph-in/pydmrs project is a more general-purpose library maintained by a lab at Cambridge. The shared name is just coincidence, I think.


Thank you.

Historically, Matic’s pydmrs was something like a “version 0” of delph-in/pydmrs, before Ann got us (me, Matic, Alex, Ewa) to start working on a shared codebase.

GraphLang is stable, and is available in Pydmrs as a subpackage (pydmrs/graphlang). Alex used it extensively for his thesis (there is an introduction in section 4.3.2). I realised when checking his thesis that the link to an overview of GraphLang is broken. For the time being, I’ve put it here, but this is not meant as a long-term solution – I guess this information should go in the repo.


Ah, I was mistaken. Thanks for clarifying.

Thank you @guyemerson