Pydelphin testsuite.process() gzip-related error

Why would I sometimes hit this error but not always:

  File "/home/olga/delphin/parsing_with_supertagging/venv/lib/python3.8/site-packages/delphin/itsdb.py", line 891, in process
    _add_row(self, tablename, data, buffer_size)
  File "/home/olga/delphin/parsing_with_supertagging/venv/lib/python3.8/site-packages/delphin/itsdb.py", line 918, in _add_row
    ts.commit()
  File "/home/olga/delphin/parsing_with_supertagging/venv/lib/python3.8/site-packages/delphin/itsdb.py", line 793, in commit
    tsdb.write(
  File "/home/olga/delphin/parsing_with_supertagging/venv/lib/python3.8/site-packages/delphin/tsdb.py", line 845, in write
    raise NotImplementedError('cannot append to a gzipped file')
NotImplementedError: cannot append to a gzipped file

when running:

        with ace.ACEParser(grammar, cmdargs=cmdargs, executable=ace_exec, stderr=errf) as parser:
            ts.process(parser)

cmdargs can be "-1" or "-1 --ubertagging=000.1".

With some profiles the process finishes, but with others I get the error above. Why would that happen on the ERG tsdb profiles, all of which have the same format (item.gz etc.)?

When you parse a profile, the item(.gz) file will be read, but not written to. One possibility for the error is that some of the profiles you are working with have been compressed to save space, but may need to be uncompressed before processing.


What @Dan said is partially true for PyDelphin. The .gz files are compressed, but PyDelphin is happy to read and write them transparently (i.e., usually you don’t need to know or care whether the file was gz-compressed). One exception is when PyDelphin is appending to a file on disk rather than writing the whole file anew, because the gzip compression would work better on the whole file than compressing it piecemeal.
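To see why appending is the exception, here is a quick stdlib-only demonstration (not PyDelphin code): writing the same rows in one pass compresses much better than appending them in small batches, because each append starts a new gzip member with its own header and a fresh compression dictionary.

```python
import gzip
import os
import tempfile

# Stdlib-only illustration of why piecemeal gzip appends are undesirable.
data = [("row %d\tsome repetitive field\n" % i).encode() for i in range(1000)]

with tempfile.TemporaryDirectory() as d:
    whole_path = os.path.join(d, "whole.gz")
    piece_path = os.path.join(d, "piecemeal.gz")
    with gzip.open(whole_path, "wb") as f:   # one-shot write of all rows
        f.write(b"".join(data))
    for chunk in data:                       # simulate repeated appends
        with gzip.open(piece_path, "ab") as f:
            f.write(chunk)
    whole_size = os.path.getsize(whole_path)
    piece_size = os.path.getsize(piece_path)

print(whole_size, piece_size)  # the one-shot file is much smaller
```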

The problem

There are two questions:

  1. Why is it gz-compressing the files?
  2. Why is it appending to the files?

RE (1): if you aren't passing the -z or --gzip option, PyDelphin won't compress the results, but the exception can still be raised if the profile contains already-compressed files. In the snippet below, gzip is the flag that determines whether the results will be compressed, and use_gz indicates whether an existing file is gzipped:

I think the or use_gz part of the condition is some vestigial code from when PyDelphin decided to gzip depending on whether the files were already gzipped (Disallow appending to gzipped files in tsdb.write · delph-in/pydelphin@485d4e9 · GitHub), and it might not be relevant anymore.

RE (2), if you run process on a profile with > 1000 items, PyDelphin will try to append the results in batches as it goes. This was done to help it deal with very large profiles.
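The batching behavior can be sketched generically (hypothetical helper names, not PyDelphin's actual internals): results accumulate in a buffer that is flushed to disk as an append every buffer_size rows, so a large profile never needs to be held in memory all at once.

```python
# Hypothetical sketch of buffered batch-appending, as described above.
def process_rows(rows, flush, buffer_size=1000):
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= buffer_size:
            flush(buf)   # append this full batch to the file on disk
            buf = []
    if buf:
        flush(buf)       # final partial batch

batches = []
process_rows(range(2500), batches.append, buffer_size=1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```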

Feel free to file a bug report on GitHub, since I think PyDelphin is not behaving correctly in this case.

Workarounds

Try to process a version of the profile that does not have compressed files. You can do this with the mkprof command:

delphin mkprof --refresh my-profile  # add --gzip to compress again

If you’re using the Python API instead of the command line, you can also try increasing the buffer size to sidestep the issue:

...
    ts.process(parser, buffer_size=N)

where N is greater than the number of items in your profile.
