Trouble with DMRSjson

Hi,

I’m trying to visualize DMRS with py-delphin.

First, I had DMRS files (in the SimpleDMRS format), from which I managed to generate a PDF using delphin.codecs.dmrstikz and pdflatex. Now I'm working with DMRSjson files and am trying to visualize them.

I tried:

import sys

import delphin.codecs.dmrsjson
import delphin.codecs.dmrstikz
import pdflatex

with open(sys.argv[1]) as json_file:
	json_data = delphin.codecs.dmrsjson.load(json_file)
	tex = [delphin.codecs.dmrstikz.dumps(json_data)]
	byte_tex = tex[0].encode('utf-8')
	pdfl = pdflatex.PDFLaTeX.from_binarystring(byte_tex, "output_pdf")
	pdf, log, cp = pdfl.create_pdf(keep_pdf_file=True, keep_log_file=False)

But it doesn’t work. Error:

$ python3 grew_to_pdf.py input.json

Traceback (most recent call last):
  File "grew_to_pdf.py", line 24, in <module>
    json_data = delphin.codecs.dmrsjson.load(json_f)
  File "/…/dmrsjson.py", line 43, in load
    return [from_dict(d) for d in data]
  File "/…/dmrsjson.py", line 43, in <listcomp>
    return [from_dict(d) for d in data]
  File "/…/dmrsjson.py", line 188, in from_dict
    for node in d.get('nodes', []):
AttributeError: 'str' object has no attribute 'get'

The error occurs on the line delphin.codecs.dmrsjson.load(json_file).

It’s strange because the same code portion works well with DMRX input:

with open(sys.argv[1]) as xml_file:
	xml_data = delphin.codecs.dmrx.load(xml_file)
	tex = [delphin.codecs.dmrstikz.dumps(xml_data)]
	byte_tex = tex[0].encode('utf-8')
	pdfl = pdflatex.PDFLaTeX.from_binarystring(byte_tex, "output_pdf")
	pdf, log, cp = pdfl.create_pdf(keep_pdf_file=True, keep_log_file=False)

and creates the PDF file.

So I'm thinking the problem might be caused by the JSON input file, but it was generated with py-delphin. I also tried creating another file containing the DMRSjson example from the py-delphin docs [delphin.codecs.dmrsjson — PyDelphin 1.5.1 documentation], and I got the same error.

Here is the input JSON file I'm using: https://we.tl/t-eQIJfOz9fh

Do you have any idea where the problem could come from?

I have a second question. Is it possible to print the sentence represented by the DMRSjson file? Is there a command to do this with py-delphin?

Thanks

Hello,

The code looks correct, at least the first 3 lines:

with open(sys.argv[1]) as json_file:
	json_data = delphin.codecs.dmrsjson.load(json_file)
	tex = [delphin.codecs.dmrstikz.dumps(json_data)]

(I didn’t test the rest)

The issue is that there is a mismatch between the call to read the JSON data and the JSON data itself. The reason is actually documented, but maybe not in the most intuitive place. At the beginning of the section on Deserialization Functions it says this (emphasis added):

The deserialization functions load() , loads() , and decode() accept textual serializations and return the interpreted semantic representation. Both load() and loads() expect full documents (including headers and footers, such as <mrs-list> and </mrs-list> around a mrx serialization) and return lists of semantic structure objects. The decode() function expects single representations (without headers and footers) and returns a single semantic structure object.

For JSON representations, this means that the document should be an array (i.e., a list of DMRSs) and not an object. That is, instead of:

{
    "top": 10000,
    "nodes": [...],
    ...
}

It should be:

[
    {
        "top": 10000,
        "nodes": [...],
        ...
    }
]

Alternatively, if you know each file will only have one JSON object (one DMRS), then you can use the decode() function instead. Similarly, for the TikZ output, use encode() on that single object, or continue to use dumps() but put json_data in a list (i.e., dumps([json_data])):

with open(sys.argv[1]) as json_file:
    json_data = delphin.codecs.dmrsjson.decode(json_file.read())
    tex = [delphin.codecs.dmrstikz.encode(json_data)]
    ...
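
Or, to keep the full-document output from dumps(), wrap the decoded object in a list. A small, untested variant of the snippet above:

with open(sys.argv[1]) as json_file:
    json_data = delphin.codecs.dmrsjson.decode(json_file.read())
    tex = [delphin.codecs.dmrstikz.dumps([json_data])]
    ...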

(Aside: when testing this I discovered I had left in a debugging print call in the dmrstikz module. I have pushed delphin-latex v1.0.1 to PyPI. Update with pip install --upgrade delphin-latex. This doesn’t change anything functionally, it just avoids some excessive printing to stdout.)

There’s no command for this (like at the command line), but once you’ve loaded the data into a DMRS object you can access it with the DMRS.surface attribute, assuming the data has the surface form encoded already:

>>> from delphin.codecs import dmrsjson
>>> d = dmrsjson.decode(open('input.json').read())
>>> d.surface
'Adams, a Missouri'

Thank you for the answer.

The first part is working perfectly.

For the second part, it's a little more complex. I neglected to mention that I can't use 'surface' directly.

We made changes to the DMRS graphs, so 'surface' no longer corresponds to them. The question, then, is: is it possible to generate a sentence from the node information?

To do that, I need a link between the nodes and 'surface'. As I understand it, they are linked by the 'lnk' feature ('from', 'to'):

"nodes": [
	{"nodeid": 1, [...], "lnk": {"from": 0, "to": 3}},
	{"nodeid": 2, [...], "lnk": {"from": 4, "to": 7}},
	{"nodeid": 3, [...], "lnk": {"from": 8, "to": 12}}
],
"surface": "The new chef"

'from' and 'to' describe the location of the token in the 'surface' string.
Is that correct?
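
For illustration, a minimal sketch of what I mean, assuming PyDelphin exposes the "lnk" values as the cfrom/cto attributes of each node:

from delphin.codecs import dmrsjson

# load a single DMRS from a DMRSjson file
d = dmrsjson.decode(open('input.json').read())
for node in d.nodes:
    # node.cfrom / node.cto come from "lnk": {"from": ..., "to": ...}
    print(node.id, node.predicate, d.surface[node.cfrom:node.cto])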

The problem with this is that if we want to create and add a new node like:

{
	"nodeid": 10001,
	"predicate": "_new_a_1",
	"sortinfo": {
		"lemma": "be",
		"pos": "v",
		"sense": "id",
		"SF": "prop",
		"TENSE": "pres",
		"MOOD": "indicative",
		"PERF": "-",
		"cvarsort": "e"
	}
}

it can't refer to a token present in 'surface'. Moreover, the features of this new node are created according to the parameters of the other nodes, so we don't know in advance exactly which word and which features will be added.
It would be great if a tool were able to generate words from this information.

In the general case, generating a sentence from a DMRS graph requires a grammar (or some other kind of generation system). This is because DMRS is designed to abstract over certain aspects of the surface form, and the mapping between the two is non-trivial.

From the snippets you’ve shown, it looks like you’re using DMRS for English based on the semantics of the ERG. In this case, as long as you have a well-formed DMRS (according to the ERG), you should be able to generate a surface string using a processing engine like ACE or the LKB. Generation using ACE is accessible in PyDelphin: Using ACE from PyDelphin — PyDelphin 1.5.1 documentation
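
For example, here is a rough, untested sketch with PyDelphin's ACE interface. It assumes the ACE binary is installed, a compiled ERG grammar image at 'erg.dat', and an MRS serialized in SimpleMRS format in 'sentence.mrs' (both file names are placeholders):

from delphin import ace

# read a well-formed SimpleMRS string licensed by the ERG
mrs_string = open('sentence.mrs').read()

# run ACE in generation mode with the compiled grammar image
response = ace.generate('erg.dat', mrs_string)
for result in response.results():
    print(result['surface'])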


I agree with @guyemerson, and I'd like to add a couple of things.

The “lnk” info you’re referring to does provide a hint as to the character ranges that semantic fragments correspond to in the string that was parsed to produce that semantic representation. It is useful for things like visualization tools, but note the following:

  1. The “lnk” information is optional; it is not part of the semantics proper; it encodes no meaning.
  2. The “surface” string, also optional, is often not included with a semantic representation. With PyDelphin I started including it in a few situations as it was useful for my purposes.
  3. The same semantic representation may be produced for multiple surface strings, and the same surface string may be parsed into more than one semantic representation. There is not a tight coupling between the two.
  4. The cfrom/cto positions are not always what you'd expect. Effects from tokenization or morphological analysis can shift these values around. Some predications (or nodes in DMRS) do not correspond to any surface token, and some surface tokens do not correspond to any predication.

So I would not look too deeply into cfrom/cto for any purpose beyond tracing the possible character spans that some semantic fragment came from in the sentence that was parsed.
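
For instance, a rough sketch of that kind of tracing, which guards against missing "lnk" or "surface" values (cfrom/cto are reported as -1 when a node has no lnk; 'input.json' is a placeholder):

from delphin.codecs import dmrsjson

d = dmrsjson.decode(open('input.json').read())
for node in d.nodes:
    if node.cfrom < 0 or node.cto < 0 or d.surface is None:
        # no lnk and/or no surface: nothing to trace this node back to
        print(node.id, node.predicate, '(no span)')
    else:
        print(node.id, node.predicate, d.surface[node.cfrom:node.cto])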

The second thing I wanted to mention is that if you're manipulating semantic structures and want to generate strings from them using a grammar, as Guy mentioned above, you'll need to be careful that the semantics is actually licensed by, or valid with respect to, the grammar. There are some principles of grammar composition you can follow to help ensure this validity, but it's impractical to try to predict whether something is ultimately valid; you'll just have to try generating with the grammar and see. You might look into the neural generation work that was done (e.g., Hajdik et al. 2019; citation below). We got good results, and it should be more robust to some kinds of invalidity (I don't think we tested this specifically, however).

  • Hajdik, Valerie, Jan Buys, Michael Wayne Goodman, and Emily M. Bender. "Neural Text Generation from Rich Semantic Representations." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2259–2266. 2019.

Thank you for the help. The ACE generation is very interesting.

But the problem is that we used Csaw instead of ACE for parsing, because ACE had trouble with long sentences. I tried to use the ACE generation function but got multiple errors.

Is it possible to use ACE to generate the sentences associated with the MRSs from Csaw?

If not, is there a way to increase ACE's word/RAM limits so it can be used on longer sentences?

Csaw isn’t guaranteed to produce valid MRS according to the ERG. This is by design, because it’s intended to be more robust. (Some errors might be easy to fix, but I would expect there to be a long tail of difficult cases.)

I believe ACE’s limits on RAM/words can be adjusted with command-line options: AceOptions - Deep Linguistic Processing with HPSG (DELPH-IN)

If you have long sentences, another option to reduce processing cost is to chunk the sentences, parse the chunks, then combine the parses: Semantic chunking


When parsing running text such as sentences in newspaper articles, I use the following command line options when starting ACE:

ace -g erg.dat -1 --max-chart-megabytes=15000 --max-unpack-megabytes=16000 --max-words=150 --timeout=150

If your machine doesn't have 16 GB of RAM, you should reduce both of the megabyte settings to fit within your hardware's RAM limits, and it's best to keep the chart-megabytes limit a little lower than the unpack-megabytes one.
