Testsuites and grammar profiling tools


#1

In a thread about a PyDelphin error, @arademaker asked what “test suites” are, how they’re useful, and also about grammar profiling. I’m starting this separate thread so the discussion doesn’t get lost and so knowledgeable people can more easily contribute.

For the question about test suites, @olzama gave a brief description of what test suites are and linked to PyDelphin docs that explain the terminology (which is from the [incr tsdb()] user manual). I also added how profiles help manage the correspondence between inputs and outputs and store performance-related info.

@arademaker then said:

My task is mainly grammar evaluation. Given a set of sentences, I want to determine the reason when a sentence was not parsed, and which sentences generate more readings and why. I suspect some ambiguities are caused by compound terms not being recognized as such.

So I think it would be useful to lay out the various grammar profiling and development tools and their uses.

  • [incr tsdb()] – the original grammar profiling tool; it has the most support for inspecting the competence and performance of grammars by looking at their outputs over test items. It has a GUI from which you can click on individual items and view parse trees, semantics, etc., as well as filter results on TSQL queries and compare to gold profiles. [incr tsdb()] can process (e.g., parse) test suites using a “CPU” such as the LKB, PET, or ACE, and it can be used to produce treebanks.
  • PyDelphin – implements the database format defined by [incr tsdb()], which facilitates scripting over profiles; also implements a subset of TSQL queries and models derivation trees and MRS semantics. PyDelphin can also process profiles using ACE or via the web API.
  • gTest – scripts built on top of PyDelphin to support regression testing, coverage testing, and auditing the well-formedness of semantic representations.
  • gDelta – a tool for comparing syntactic differences between the outputs of two versions of a grammar.
  • Typediff – a tool for exploring the analyses of grammatical phenomena through syntactic derivations.
  • FFTB – “full forest treebanker”, a standalone treebanking tool built on top of ACE (it can also be integrated with [incr tsdb()]).
  • … more are listed at the ToolsTop wiki
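As a small aside for anyone planning to script over these tools: a profile on disk is just a directory of plain-text files, one per relation, with fields separated by “@” by default (the schema is given by the profile’s relations file). A minimal sketch, using made-up item records:

```python
import os
import tempfile

# An [incr tsdb()] profile is a directory of flat files, one per relation,
# one record per line, with fields separated by "@" by default.
# Write a tiny two-record "item" file by hand, then read it back.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'item')
with open(path, 'w') as f:
    f.write('10@The dog barks.\n')
    f.write('20@The cat sleeps.\n')

items = []
with open(path) as f:
    for line in f:
        i_id, i_input = line.rstrip('\n').split('@')
        items.append((int(i_id), i_input))

print(items)  # [(10, 'The dog barks.'), (20, 'The cat sleeps.')]
```

Real item relations have many more fields than the two shown here; the point is only that the format is simple enough for ad hoc scripting when the dedicated tools don’t fit.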

Also does anyone have links to resources about using these tools? E.g., PyDelphin has a tutorial and [incr tsdb()] has a partial manual. Maybe some Ling567 slides or something?

And @arademaker do you have any more specific questions?


#2

Thank you @goodmami, I don’t have more questions yet. But I thought I could use this thread to share my experiments and findings with the tools. As you said, maybe knowledgeable people can step in and contribute.

PyDelphin:

I parsed 99 sentences manually extracted from the article https://www.slb.com/~/media/Files/resources/oilfield_review/ors09/sum09/basin_petroleum.pdf. My plan was to test the PyDelphin functions for creating a test suite.

>>> from delphin import itsdb
>>> from delphin.interfaces import ace
>>> ts = itsdb.TestSuite('newprof')
>>> with ace.AceParser('/Users/ar/hpsg/ace/erg.dat') as cpu:
...     ts.process(cpu)
NOTE: parsed 73 / 82 sentences, avg 813426k, time 3591.92578s

The code was executed on a macOS MacBook Pro with 16GB of RAM. During the process, the Python process ended up with 27G of RAM allocated. After writing the results to disk, the result file in the test suite folder is 24GB. Loading and saving the test suite on disk is very slow, but inspecting some tables is quite fast:

>>> l = list([x[0] for x in ts.select('parse:readings')])
>>> import functools
>>> functools.reduce(lambda x, y: x + y, l) / len(l)
11491.89898989899

So on average, I got 11K readings per sentence! :roll_eyes: I also discovered that empty lines are not automatically ignored when I create the test suite, so I have some empty items (all with zero readings).
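A simple workaround for the empty items is to drop blank lines before building the item list; something like:

```python
# Drop blank (or whitespace-only) lines from the raw sentence list
# before creating the test suite items.
raw_lines = ['The well produced oil.', '', '   ', 'Drilling resumed.']
sentences = [s.strip() for s in raw_lines if s.strip()]
print(sentences)  # ['The well produced oil.', 'Drilling resumed.']
```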

Next steps are: (1) understand how to control the test suite execution to consume fewer resources (possibly by limiting the number of readings per sentence?); (2) what are the relevant ways to inspect the results; and (3) test the other tools.

So far, I have the impression that test suites may be more suitable for grammar development - with small sentences - than for evaluating grammar competence. Oops! 1AM, time to sleep!


#3

I can respond to some of the points here…

The code was executed on a macOS MacBook Pro with 16GB of RAM. During the process, the Python process ended up with 27G of RAM allocated.

The way you called ACE from PyDelphin did not change the maximum amount of memory for the parse chart or for unpacking the results; the defaults are 1200M and 1500M, respectively, I believe. You would probably see even more readings per sentence if you increased those limits. I tried parsing the first sentence of the PDF:

goodmami@tpy:~/grammars$ echo "The best way to reduce investment risk in oil and gas exploration is to ascertain the presence, types and volumes of hydrocarbons in a prospective structure before drilling." \
 | ace -g erg-2018-x86-64-0.9.30.dat -R 
NOTE: hit RAM limit while unpacking
NOTE: 27922 readings, added 65193 / 58214 edges to chart (13559 fully instantiated, 1506 actives used, 19753 passives used)	RAM: 1536008k
NOTE: parsed 1 / 1 sentences, avg 1536008k, time 14.04341s

(The -R option of ACE tells it not to output results, which is useful for quickly determining how many readings it can find.)

So for this sentence it gets 27,922 readings before it runs out of RAM, and it takes 14s to do so. If I run ACE with --max-chart-megabytes=3500 --max-unpack-megabytes=3500, it takes 36s before it runs out of RAM and finds 122,967 (!) results. If I instead run with -n5 (the top 5 results), it takes 8s and < 1G of RAM to find those 5 results. So I think you’ll want to play around with the RAM limits and the maximum number of results, depending on your task. If you are building an NLP pipeline, there’s often little reason to use more than 1 result per input. For treebanking, I recall Dan used to use the top 500 results (which is the space from which the top parse(s) will be selected), but I think he may now use the full search space with full-forest treebanking.
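To make the trade-off concrete, the options could be kept as small per-task presets. The preset names and grouping below are mine; the values just follow the discussion above:

```python
# Hypothetical option presets for ACE, reflecting the trade-offs above:
# an NLP pipeline rarely needs more than the top result, treebanking
# traditionally used a 500-best list, and exhaustive parsing needs more RAM.
PRESETS = {
    'pipeline': ['-n', '1'],
    'treebank': ['-n', '500'],
    'exhaustive': ['--max-chart-megabytes', '3500',
                   '--max-unpack-megabytes', '3500'],
}

def ace_cmdargs(task):
    """Return a copy of the extra ACE arguments for the given task."""
    return list(PRESETS[task])

print(ace_cmdargs('treebank'))  # ['-n', '500']
```

Lists like these can then be passed as the cmdargs argument when instantiating AceParser.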

As for PyDelphin’s ridiculous memory usage, this is because the current version stores the whole profile (input items, parse info, results, etc.) in memory. Since the memory required to store the 24G profile exceeds the 16G of your system, it has to swap RAM to disk, which partially causes the very slow performance you’re seeing. There is a ticket for implementing “incremental” profile reading/writing, which should help keep the memory of the Python process constant, and will also allow it to start writing results as it gets them, rather than after it has them all. I’ll prioritize fixing this issue.

In the meantime, I suggest adding some options to ACE (answering your question (1)). In PyDelphin, that would look like:

with ace.AceParser('/Users/ar/hpsg/ace/erg.dat',
                   cmdargs=['--max-chart-megabytes', str(MAXCHART),
                            '--max-unpack-megabytes', str(MAXUNPACK),
                            '-n', str(NRESULTS)]) as cpu:
    ts.process(cpu)

where MAXCHART, MAXUNPACK, and NRESULTS are the values you choose to use. I also recommend using gzip=True when you write the profile to disk, so that the non-empty files get compressed.

I also discovered the empty lines are not automatically ignored when I create the test suite, I have some items that are empty items (all with zero readings).

This is the expected behavior, assuming those empty lines were in the input when you created the test suite.

… (2) what are the relevant ways to inspect the results; and (3) test the other tools.

If you want to perform TSQL queries on the profiles, the LOGON tsdb utility is much faster than PyDelphin’s equivalent functionality. Or if you’re familiar with the file structure, unix utilities like cut and awk can be quick, e.g., to get the average number of readings:

goodmami@tpy:~/grammars/erg-trunk$ ~/logon/bin/tsdb -query 'select i-id readings' -home tsdb/gold/wsj02a/ \
  | awk -F' \| ' '{ count += $2 } END { print count / NR }'
0.916

(this profile has a maximum of 1 reading per item, so the average is less than 1)
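If you’d rather stay in Python, the same aggregation over the captured query output is straightforward (the lines below are invented examples of the `i-id | readings` output format):

```python
# Average the second column of tsdb-style "i-id | readings" output lines.
lines = [
    '21000001 | 1',
    '21000002 | 0',
    '21000003 | 1',
]
readings = [int(line.split('|')[1]) for line in lines]
avg = sum(readings) / len(readings)
print(round(avg, 3))  # 0.667
```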

Also, [incr tsdb()] and art can be quicker and less memory-hungry for parsing profiles, but art may be less robust than PyDelphin when ACE crashes.

I hope this helps, and maybe others can fill in more gaps.


#4

Thank you @goodmami for the comments. I didn’t pay attention to the options that I can pass when instantiating the AceParser; good to know. I ran the process again in less than 2 minutes, asking for <= 10 readings per sentence. How to combine the parameters --max-chart-megabytes, --max-unpack-megabytes, and -n is still not clear to me. In Paris, Dan Flickinger suggested --timeout=60 --max-words=150 --max-chart-megabytes=4000 --max-unpack-megabytes=5000 --rooted-derivations --udx --disable-generalization, besides other options for ubertagging. But I still have to learn more about how one option affects the others. Moreover, I haven’t played with PET yet, something I want to try in order to explore the pre-tokenization option.


#5

I am not sure if I should open another thread for that or not. Anyway…

I tried the http://moin.delph-in.net/FftbTop tool. My impression was that it would be easy to explore the profile created with PyDelphin with this tool, but I got the following errors:

First I tried to pass the same compiled ERG grammar file that I used to create the profile, compiled with ACE. I am using the FFTB build for macOS:

$ ./fftb -g ../ace/erg.dat --browser ~/work/papers/manual/current/
grammar image: ../ace/erg.dat
version mismatch: this is ACE version 0.9.19, but this grammar image was compiled by ACE version 0.9.27

I don’t know how to solve the version mismatch problem. My understanding is that libace.dylib in the dependencies directory needs to be recompiled.

Then I tried the erg-1214.dat compiled ERG file provided in the FFTB distribution. Unfortunately, once the browser page opens, for any item I click on I get back the message 404 no stored forest found for this item. Is it related to the mismatch between the version of the grammar that I used to produce the profile and the version that I passed to load the tool? Or is it related to the way PyDelphin created the profile?


#6

Is it related to the mismatch between the version of the grammar that I used to produce the profile and the version that I passed to load the tool? Or is it related to the way PyDelphin created the profile?

Someone else will have to address many of the points in your last post, but I think, as you suspect, that the version of ACE packaged with the default FFTB is not the same as the one you used to parse, so it cannot load the grammar you used. I would think that using the packaged erg-1214.dat file would work, assuming you parsed with a 1214 version of the ERG; otherwise, recompiling with your ACE version is probably the best bet, though it might not be easy to do.

I don’t think the second issue is due to how PyDelphin created the profile, per se, but you may need to call ACE with a special option (see AceOptions) to create a full-forest profile. However, these modes are untested with PyDelphin, so they may not work well. This is about all I can say for now. If you’re using the ERG, you might try these instructions instead.


#7

Woodley has just released a new version of ACE, to go with the ERG 2018 release. I think it would be best to use that if possible, though I don’t know if fftb needs updating as well.

To this question:

My task is mainly grammar evaluation. Given a set of sentences, I want to determine the reason when a sentence was not parsed, and which sentences generate more readings and why. I suspect some ambiguities are caused by compound terms not being recognized as such.

I would encourage you to first look at what the parse selection model is doing for you. Large ambiguity spaces are not a problem if the top analysis is suitable. If the top analysis isn’t suitable (enough of the time), then some options are:

(a) Doing some domain-specific treebanking to help teach the parse selection model what you’re looking for.
(b) Pre-tokenization of domain-specific terms.

I believe that software support for (b) exists, but I wouldn’t know exactly where to point you for it.


#8

Hi @ebender, what do you mean by the top analysis? The first reading returned by the parser, right? If so, it seems a good strategy, but for that I need to understand whether [incr tsdb()] can help me with the process of annotating the best analysis/reading for each sentence.

For (b), I suppose the PET parser has this support with the YY mode, if I understood http://moin.delph-in.net/PetInput correctly.


#9

There is currently no more recent macOS version of FFTB. If ERG 1214 is the version you used to parse the profile you wish to treebank, then using the erg-1214.dat you have, which is compatible with your FFTB version, should not be a problem. If you are using a more specific grammar version, you will need a corresponding grammar image file for that grammar that matches the FFTB version (0.9.19, apparently). You could produce one using the 0.9.19 version of ACE (see http://sweaglesw.org/linguistics/ace/download/).

The “no stored forest found” problem most likely is not a version mismatch, but instead indicates that full forest information was not actually recorded into that profile when you parsed it. You can look at the “edge” or “edge.gz” relation inside the profile directory to see if it is empty or populated. I cannot speak to how to use PyDelphin to create a profile suitable for FFTB; it may be possible, and Mike would be the one to answer that (the critical thing is that the edge relation is populated). If you are using “art” then you will need to pass “-f” on its command line and also “-O” on ACE’s command line. If you are using the LOGON “parse” script you will need to pass “--protocol 2”.
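As a quick check, you could count the records in that relation directly. A sketch (the helper below is mine, not part of any of these tools):

```python
import gzip
import os

def edge_record_count(profile_dir):
    """Count records in the profile's edge relation, whether plain or
    gzipped; return 0 if the relation is absent or empty."""
    plain = os.path.join(profile_dir, 'edge')
    zipped = plain + '.gz'
    if os.path.exists(zipped):
        with gzip.open(zipped, 'rt') as f:
            return sum(1 for line in f if line.strip())
    if os.path.exists(plain):
        with open(plain) as f:
            return sum(1 for line in f if line.strip())
    return 0
```

If this returns 0 for your profile, FFTB has no forest to display, which would match the “404 no stored forest found” message.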

If you are interested in using YY mode and art and ACE together with FFTB, several grammarians have documented their setups for this, e.g.:
ZhongYYMode
IndraTreebanking
JacyYYMode


#10

Yes, by ‘top analysis’ I mean the top-ranked one according to the parse selection model. If you’re running the parser with a parse selection model, this will be the one returned first.

If so, it seems a good strategy, but for that I need to understand whether [incr tsdb()] can help me with the process of annotating the best analysis/reading for each sentence.

I’m not sure what you mean here. I was imagining that you’d run the parser in best-1 mode and just evaluate whether the single parse returned for each sentence was suitable.

If instead you want to see if there is a suitable parse among the top N for some N, then yes, you’ll want to use treebanking software, either [incr tsdb()] or fftb.
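Concretely, the best-1 evaluation boils down to simple arithmetic over per-item reading counts (the numbers below are invented):

```python
# readings[i] is the number of analyses returned for item i;
# in best-1 mode each item has either 0 (no parse) or 1 (parsed).
readings = [1, 1, 0, 1, 0, 1, 1]
coverage = sum(1 for r in readings if r > 0) / len(readings)
print(f'{coverage:.1%}')  # 71.4%
```

Coverage only tells you how often *some* parse was found; judging whether the returned parse is suitable still requires looking at the analyses themselves.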


#11

If -O is the option to use for ACE, then PyDelphin does not support creating full-forest profiles. The option is not documented as far as I know (I added it to the AceOptions wiki by looking at main.c in ACE’s source code, so I just guessed at its meaning), and it appears to change the output format to one PyDelphin does not expect. I’ll make a ticket to add support for it, but it’s not a high priority for me right now.

@arademaker I suggest not using PyDelphin to process profiles if you want to do full-forest treebanking. Instead try [incr tsdb()] or art for processing, then use fftb to do the treebanking step.