New Dev Docs for: Understanding the MRS Format

I’ve put together a developer-focused writeup of “Understanding the MRS Format”. I’ve included lots of information from this forum, such as:

And many others. I’ve tried to capture “everything I wish I had known”, again from a developer perspective.

I’d love any comments on it. My plan is to put it in the how-to section of the main documentation by mid next month, if there are no objections.

Enjoy!


The only thing that calls my attention is the use of some nonstandard terminology from the domain. I am afraid that it can introduce more confusion for people.

I certainly understand that your goal is to educate developers. In contrast, many documents from DELPH-IN are written by academic researchers for academic researchers and students.

Anyway, I would vote for making the distinct nature of each document clear in the documents themselves.


Thanks @arademaker! Can you give me an example of non-standard terminology I’ve used? In the two docs I just posted, in particular, I’ve tried very hard not to introduce any new terminology. I’ve certainly used analogies, like describing a “scopal argument” as analogous to a “lambda function”, but only as analogies.

Help me ferret it out!

I realized the doc has moved and I can’t update the original post. It is here now.

@arademaker any specifics you can give here? I’d love to fix terminology if I’m misusing it.

I’ve updated the disclaimer at the top to try to address your concern:

This section is designed to give application developers an overview of the Minimal Recursion Semantics format, which is the primary artifact used by DELPH-IN to represent the meaning of a phrase. For a deeper dive into MRS, or for one that takes a more academic or linguistic approach, explore Minimal Recursion Semantics: An Introduction.

I can try to read this in more depth later, but it would also be good to distinguish more between the ERG and ACE. While it’s true that a particular processor has idiosyncrasies that lead to particular MRSs and/or trees and other parts of the representation, in an ideal world (I think) all of the processors would work exactly the same, and the grammars would be the only variable. You should get the same MRS for the same utterance and version of the ERG (or any grammar) between ACE and the LKB or Agree or PET.

Each MRS document also has multiple interpretations. Using constraints that are included as part of the MRS, a set of trees (called well-formed trees) can be built from the flat list of predications in a given MRS. These well-formed trees define all the alternative meanings of that particular MRS.

I know the focus on scope-resolved representations has come up before, and I suppose we haven’t come to a conclusion on it. I suspect the use of the term “tree” to refer to these is one of the concerns @arademaker has. In linguistics, while the tree data structure comes up in several fields and areas, it is very much associated with syntax trees, which the ERG produces. In delph-in land we usually refer to those trees as derivations (though I’m not sure why). So it is a bit confusing to say that an utterance gets a reading, which is a pairing of (1) a tree/derivation and (2) an MRS, and then additionally say that an MRS can be expanded into trees.

A DELPH-IN parser like ACE will usually generate more than one MRS document representing the various high-level interpretations of a phrase. Each one contains a list of predicate-logic-like predications and not a tree like you’ll see in many natural language systems. That’s because it is underspecified. Even though the parser has already done one level of interpretation on the phrase, there are still (usually) multiple ways to interpret that.

Technically, the predications are stored as a bag, not a list. Also, I don’t think the predications are stored as a bag because of underspecification, but rather because no information is stored in the ordering of predications and/or arcs between predications.
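To make the bag-not-list point concrete, here is a minimal Python sketch. The predication encoding here is invented for illustration (it is not PyDelphin’s representation): ordering distinguishes the lists, but not the bags.

from collections import Counter

# Two orderings of the same (invented) predications: (predicate, label, args)
rels_a = [("pron", "h4", (("ARG0", "x3"),)),
          ("_like_v_1", "h1", (("ARG0", "e2"), ("ARG1", "x3"), ("ARG2", "x8")))]
rels_b = list(reversed(rels_a))

# As lists they differ; as bags (multisets) they are the same MRS.
print(rels_a == rels_b)                    # False
print(Counter(rels_a) == Counter(rels_b))  # True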

While I’m not particularly opposed to this interpretation of “underspecified” (though others might be), I think that usually an entire MRS isn’t referred to as underspecified; instead, various components are underspecified. But it’s probably not a big deal if clarified in a footnote or introductory note or something.

  • Whether it was actually seen in the text (starts with _) or added abstractly by the system (no initial _)

Probably better to say “by the grammar,” but that also doesn’t really capture what grammatical predicates are for, I think.

If you pick variable values such that the MRS is true for a given world, then you have understood the meaning of the MRS in that world.

I think this is a misstatement of how truth-conditional semantics works. I think traditionally, the sentence “means” the set of truth conditions that are true, as opposed to a single truth condition that is true. For instance, if Bob says “cats walk” and then I see a cat walking, it’s not the case that I understand what Bob meant.

LBL:

It’s unclear why you’re including the colon just for LBL; is this a typo?

  • PT: ?

Looks like PT means “prontype” and is for distinguishing different kinds of pronouns like reflexives, etc. See its definition.

This indicates that the verb go is the “main point of the phrase”. This is called the “syntactic head” in linguistics.

I don’t think “syntactic head” is accurate here. Usually the ARG0 of the verb or other main predicate is the INDEX, but 1. this isn’t the syntactic head (which, especially in delph-in and HPSG, typically refers to the head of a phrase, not of an utterance) and 2. I believe there are cases where the INDEX is not the ARG0 of the main verb/etc. (in a different way than the copula example you provide).

Thanks for the feedback @trimblet! I’ll see if responding to each point separately helps keep the thread more manageable.

I realized in rereading Minimal Recursion Semantics: An Introduction that it never uses the term “scope-resolved tree” and instead uses “scope-resolved MRS structure” or “scope-resolved MRS”.

I’ll switch the docs to “scope-resolved MRS”.

Thanks!


Could you point me at an example that you think needs better distinguishing? Or maybe you are saying that I should point developers at other parser alternatives? I definitely do focus on ACE, since it seemed like a good first choice for developers who just want to consume the MRS output of various grammars.

Got it: I’ll switch “list” to “bag”.

Thanks: I’ll change the phrasing to clarify that it is the connections between the predications that are underspecified and, because the connections are underspecified, the MRS is represented as a bag and not a tree.


Here I’m just trying to give the reader a gentle introduction to how to think about what MRSs represent and how one might go about “processing/solving/doing something with them”. In addition to not talking about the whole set of truth conditions, I suspect I went too broad with “Then you have understood the meaning of the MRS in that world.”

Does something like this work better?

Old: Thinking of MRS variables as variables in a math equation can help: The MRS is effectively defining a formula with variables. If you pick variable values such that the MRS is true for a given world, then you have understood the meaning of the MRS in that world.

New: Thinking of MRS variables as variables in a math equation can help: The MRS is effectively defining a formula with variables. One way to use the formula is to find the set of all variable values that make the MRS true in a given world. Those values, combined with the predications they are used in, provide a valuable tool for getting at the meaning of the original phrase.
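As a toy illustration of the “find variable values that make the formula true” idea, here is a short Python sketch. Everything in it — the world, the individuals, and the drastically simplified predications — is invented for illustration; real MRSs also carry events, quantifiers, and scope constraints.

from itertools import product

# An invented world: which individuals (and pairs) satisfy which predicates.
world = {"_dog_n_1": {"d1"}, "_cat_n_1": {"c1"},
         "_chase_v_1": {("d1", "c1")}}
individuals = ["c1", "d1"]

# Drastically simplified predications for "a dog chases a cat".
predications = [("_dog_n_1", ("x1",)), ("_cat_n_1", ("x2",)),
                ("_chase_v_1", ("x1", "x2"))]

def holds(pred, args, assignment):
    vals = tuple(assignment[a] for a in args)
    return (vals[0] if len(vals) == 1 else vals) in world[pred]

# Try every assignment of individuals to variables; print the ones that work.
for values in product(individuals, repeat=2):
    assignment = dict(zip(("x1", "x2"), values))
    if all(holds(p, a, assignment) for p, a in predications):
        print(assignment)   # -> {'x1': 'd1', 'x2': 'c1'}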


I haven’t read the whole thing, but the first thing that stood out was the use of “format” in “the MRS Format”. It is not a format but a formalism: an abstract metalanguage for describing semantic structures and principles for their well-formedness. What I would call a “format” is one of the various serializations of a representation, such as SimpleMRS, Indexed MRS, MRS XML, etc.
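For developers, that distinction is visible in PyDelphin, which treats these serializations as codecs over one and the same MRS object. A sketch, assuming PyDelphin’s delphin.codecs.simplemrs and delphin.codecs.mrsjson modules (the tiny SimpleMRS string is made up for the example):

from delphin.codecs import simplemrs, mrsjson

# A small SimpleMRS string, e.g. for "it rains".
s = '[ TOP: h0 INDEX: e2 RELS: < [ _rain_v_1<3:9> LBL: h1 ARG0: e2 ] > HCONS: < h0 qeq h1 > ]'

m = simplemrs.decode(s)             # one serialization in...
print(mrsjson.encode(m, indent=2))  # ...another serialization out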

Also, I think it’s fairly common to refer to resolved scopal structures as trees, and not just the syntactic derivations. However, this is only the “scope tree” of the MRS and not the full MRS; non-scopal arguments will not always form a tree. E.g., in The dogs tried to chase a cat, there’s a scope-resolved tree where dogs scopes over cat (multiple cats) and another where it’s the other way around (one cat), but dogs is an argument of both tried and chase, which is not tree-like.


Got it, I’ll update to make it consistent.

I do prefer calling scope-resolved MRSs “scope-resolved trees”; I think it makes things clearer. Hmm.

Could you point me at an example that you think needs better distinguishing? Or maybe you are saying that I should point developers at other parser alternatives? I definitely do focus on ACE, since it seemed like a good first choice for developers who just want to consume the MRS output of various grammars.

My point is less about advertising one processor or another and more that MRSs are produced by the grammar and the formalism, not by the processor. To use a metaphor: programming languages are often discussed in the abstract, in terms of their specification (e.g. Python 3.11.3), and less in terms of a specific implementation (e.g. Jython 2022-09-10 or CPython 3.11.0). The metaphor isn’t perfect, but generally the ERG should be thought of as the programming language specification and standard library, while the processors (ACE, etc.) are more like the underlying implementation language, e.g. C, Java, etc. Following that logic, I think it makes more sense to focus on the specification (the ERG) than on the underlying implementation (ACE), so I would rephrase the first paragraph to something like this:

The DELPH-IN English Resource Grammar (ERG) produces, from an English phrase, a data structure called a “Minimal Recursion Semantics” (MRS), which is a technical representation of human language. The ACE processor, among other processors, processes the grammar and the phrase to produce the output. Processors can be used with any of the other DELPH-IN grammars to convert other natural languages into the MRS format. While the examples below use English, the concepts apply across the DELPH-IN grammars.


Fixed!

It seemed clearer at the time, but obviously it can come across as a typo. Fixed.

Great, thank you! Can you point me at something that helps me figure out what its values could be? I don’t know how to read the grammar files outside of the obvious stuff, and I don’t think these define the actual values that show up in the MRS, do they?

prontype := *sort*.
real_pron := prontype.
notpro_or_refl := prontype.
notpro_or_non_refl := prontype.
non_refl := real_pron & notpro_or_non_refl.
std := non_refl.
recip := non_refl.
refl := real_pron & notpro_or_refl.
impers := non_refl.
demon := non_refl.
zero := non_refl.
notpro := notpro_or_refl & notpro_or_non_refl.

Great, thank you! Can you point me at something that helps me figure out what its values could be? I don’t know how to read the grammar files outside of the obvious stuff, and I don’t think these define the actual values that show up in the MRS, do they?

The situation is slightly more complicated than this, but generally those are the types that show up in the grammar’s output, yes (after going through the SEM-I). You can see some of them in output MRSs, such as for:

$ echo "I like you" | ace -g erg.dat -1Tf
SENT: I like you
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]
 [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]
 [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PERS: 2 IND: + PT: std ] ]
 [ pron<7:10> LBL: h9 ARG0: x8 ]
 [ pronoun_q<7:10> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] >
HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 >
ICONS: < > ]
NOTE: 1 readings, added 890 / 257 edges to chart (132 fully instantiated, 82 actives used, 76 passives used)	RAM: 3763k
$ echo "I like myself" | ace -g erg.dat -1Tf
SENT: I like myself
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ]
RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]
 [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]
 [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PERS: 1 NUM: sg IND: + PT: refl ] ]
 [ pron<7:13> LBL: h9 ARG0: x8 ]
 [ pronoun_q<7:13> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] >
HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 >
ICONS: < > ]
NOTE: 1 readings, added 949 / 317 edges to chart (162 fully instantiated, 87 actives used, 90 passives used)	RAM: 4052k

Note the values std and refl for the second pron, x8, in both of these.
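Mechanically, by the way, the candidate values are just the subtypes of prontype in the snippet you pasted. Here’s a throwaway Python sketch that enumerates them from those exact lines; note this hand-rolled parsing is for illustration only, and PyDelphin’s delphin.tdl module is the proper tool for real TDL:

tdl = """\
prontype := *sort*.
real_pron := prontype.
notpro_or_refl := prontype.
notpro_or_non_refl := prontype.
non_refl := real_pron & notpro_or_non_refl.
std := non_refl.
recip := non_refl.
refl := real_pron & notpro_or_refl.
impers := non_refl.
demon := non_refl.
zero := non_refl.
notpro := notpro_or_refl & notpro_or_non_refl.
"""

# Map each type to its supertypes.
supers = {}
for line in tdl.strip().splitlines():
    name, _, parents = line.rstrip(".").partition(":=")
    supers[name.strip()] = {p.strip() for p in parents.split("&")}

def subtypes(t):
    kids = {n for n, ps in supers.items() if t in ps}
    return kids | {d for k in kids for d in subtypes(k)}

print(sorted(subtypes("prontype")))
# ['demon', 'impers', 'non_refl', 'notpro', 'notpro_or_non_refl',
#  'notpro_or_refl', 'real_pron', 'recip', 'refl', 'std', 'zero']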


Bonus: you can also sometimes mess around with the MRS to figure out what a feature is doing by providing an underspecified MRS and seeing what the ERG generates, e.g. removing the other features for x8:

before: [ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ] RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]  [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]  [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PERS: 1 NUM: sg IND: + PT: refl ] ]  [ pron<7:13> LBL: h9 ARG0: x8 ]  [ pronoun_q<7:13> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] > HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 > ICONS: < > ]
 after: [ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ] RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]  [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]  [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PT: refl ] ]  [ pron<7:13> LBL: h9 ARG0: x8 ]  [ pronoun_q<7:13> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] > HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 > ICONS: < > ]

Then use your new MRS to generate:

$ echo "[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ] RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]  [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]  [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PT: refl ] ]  [ pron<7:13> LBL: h9 ARG0: x8 ]  [ pronoun_q<7:13> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] > HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 > ICONS: < > ]" | ace -g erg.dat -e
I like yourselves.
I like themselves.
I like myself.
I like yourself.
I like ourselves.
I like itself.
I like him / herself.
I like himself / herself.
I like themself.
I like herself.
I like oneself.
I like himself.
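The same trick can be scripted. A sketch, assuming PyDelphin’s ACE wrapper (delphin.ace) and the same compiled erg.dat grammar image used above:

from delphin import ace

# The underspecified "after" MRS from above.
mrs = '[ LTOP: h0 INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: - ] RELS: < [ pron<0:1> LBL: h4 ARG0: x3 [ x PERS: 1 NUM: sg IND: + PT: std ] ]  [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]  [ _like_v_1<2:6> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x8 [ x PT: refl ] ]  [ pron<7:13> LBL: h9 ARG0: x8 ]  [ pronoun_q<7:13> LBL: h10 ARG0: x8 RSTR: h11 BODY: h12 ] > HCONS: < h0 qeq h1 h6 qeq h4 h11 qeq h9 > ICONS: < > ]'

response = ace.generate('erg.dat', mrs)
for result in response.results():
    print(result['surface'])   # "I like myself.", "I like herself.", ...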

I haven’t had a chance to review the document yet, but I wanted to chime in to say that @trimblet’s point is really key. It’s not ACE that is producing these representations. It’s the ERG.

There’s an ongoing confusion in the NLP literature around this, because for treebank-trained parsers there isn’t really a distinction between the processor and the grammar. But there definitely is for us, and everything to do with semantic representation design and other analytical choices rests with the grammar, not the processor.


I’ll definitely update the docs to make that clear.

Sorry, but that misses my point. The scope tree is only part of the MRS; the predicate-argument structure is not a tree.


Maybe getting specific will help me understand your point. I think below is a representation of the predicate linkages for the scope-resolved MRS you refer to for “The dogs tried to chase a cat”, where “dogs scope over cat (multiple cats)”:

_the_q(x3, RSTR, BODY)
 ├─ RSTR: _dog_n_1(x3)
 └─ BODY: _try_v_1(e2, x3, ARG2)
           └─ ARG2: _a_q(x11, RSTR, BODY)
                     ├─ RSTR: _cat_n_1(x11)
                     └─ BODY: _chase_v_1(e10, x3, x11)

I would call this entire representation “a tree” in the computer science sense, since it:

… represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be connected to exactly one parent, except for the root node, which has no parent (i.e., the root node is the top-most node in the tree hierarchy)

I get that not all possible arguments of all predications are scopal, but that doesn’t make it any less of a tree in the computer science sense; it just means that some nodes don’t have children.

I also get (from @trimblet’s point above) that the word “tree” is sensitive in linguistics, since it is usually associated with a syntax tree and thus might bring the wrong intuition to those practiced in the art, so it might be wise to avoid the term when possible for clarity.

I think you might be saying that some parts of the representation above are commonly (in DELPH-IN) referred to as trees, but not the whole thing?

The full MRS is a graph, not a tree, because it includes the predicate-argument structure.

I think the phrase “scope tree of a fully resolved MRS” might be helpful.
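A quick sketch of that distinction in code: in “The dogs tried to chase a cat”, x3 (the dogs) has two incoming ARG1 arcs, so the full predicate-argument structure is a graph with reentrancy rather than a tree. The encoding below is invented for illustration:

# Invented encoding of the argument arcs from the example above.
args = {
    "_try_v_1":   {"ARG1": "x3", "ARG2": "_chase_v_1"},
    "_chase_v_1": {"ARG1": "x3", "ARG2": "x11"},
}

# Collect each node's incoming arcs.
incoming = {}
for pred, roles in args.items():
    for role, target in roles.items():
        incoming.setdefault(target, []).append((pred, role))

# x3 has two parents, which a tree forbids.
print(incoming["x3"])   # [('_try_v_1', 'ARG1'), ('_chase_v_1', 'ARG1')]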
