gTest for a coverage test against a large skeleton file


#1

Dear all (esp. Mike),

I ran gTest (https://github.com/goodmami/gtest) for a coverage test against a large skeleton file (item file) with 13 million lines/sentences in Indonesian. My computer worked very hard and became unresponsive (I could not do anything else with it).
With the help of a friend, I then tried splitting this large file into 13,000 files (1,000 lines/sentences each) and used all the CPUs in my machine to run gTest in parallel, but again my computer became unresponsive (maybe I just need to wait for hours, but while waiting I cannot do anything else with the computer).
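
For reference, we did the splitting itself with the standard split utility, roughly like this (the file names are made up; -a 3 gives three-letter suffixes, enough for 13,000 output files):

split -l 1000 -a 3 items.txt items-part-
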
Is there a better way to do this?

Last time, when I ran gTest on the JATI skeleton (2,003 lines/sentences) without splitting the file or parallel processing, it took around 4 hours…

Thanks,
David


#2

This is probably not what you are asking about (so I apologize in advance if that’s the case), but have you considered renting Amazon Web Services machines for a short time (for parallel processing)? I have not done it myself, but I am told it is supposed to be relatively inexpensive (e.g. a machine with 16 cores and 100 GB of memory would be something like $2/hour).


#3

Well, if 2,003 sentences take 4 hours, that is roughly 7 seconds per sentence, so you would expect 1,000 sentences to take 2 hours, and 13 million to take about 26,000 hours (about 1,000 days, or 3 years). So it is not so surprising that it is not finishing. Either you need to get a lot more CPUs, or I think you should choose a smaller sample of your corpus (maybe aim for 5,000 sentences, which you can run overnight in about 10 hours). If you really need all 13 million, then you need to set up a large server cluster. For testing (coverage and regression), a smaller set will tell you almost as much as a larger set. If you want to do data mining, then maybe running art (ACE’s batch processor) directly would make things faster.

Do you know what the bottleneck is? Is it the parsing or the POS tagging (I think the UI system is not fast)? As the POS information should not change between runs, you could store the input in YY mode and then process that directly, so you don’t have to run the POS tagger each time.
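
From memory, a YY-mode input line encodes one token per tuple, roughly like this (the Indonesian tokens, spans, and tags here are only illustrative):

(1, 0, 1, <0:6>, 1, "anjing", 0, "null", "NN" 1.0) (2, 1, 2, <7:10>, 1, "itu", 0, "null", "DT" 1.0)

Each tuple carries a token id, the chart vertices, the character span, the surface form, and a POS tag with its probability, so the parser can consume the tagger’s output directly instead of re-running it.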


#4

Hi David,

Some things…

  • gTest calls art in a subprocess, so using art directly probably won’t help
  • Furthermore, art will crash if ACE throws an error or prints unexpected output, so it’s not a good choice for very large jobs. In general, smaller profiles are better for many reasons. If, instead of art, you process with PyDelphin on top of ACE, it is more robust to these crashes, but it will be slower (see the sketch after this list).
  • Processing separately from the gTest analysis is a good idea. In the develop branch, you can run gTest over already-processed profiles with the --static option (example after this list).
  • If your computer is unresponsive with a single process, it’s probably because it exhausted the memory rather than maxed out the CPUs. Parallelizing with multiple processes on the same computer will only exacerbate that problem, and moreover will lead to inaccurate results (more items will time out or run out of memory when they might otherwise be parsable).
  • Try limiting the number of outputs (if you only care about coverage, limit it to -n 1). Besides the memory savings, you’re also less likely to unexpectedly fill up your disk with huge profiles.
  • Set the --max-chart-megabytes and --max-unpack-megabytes options to something reasonable for your system.
  • See if there’s any spurious ambiguity that could be engineered away.
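
For the PyDelphin route, a minimal sketch would look something like the following (the grammar image path and the sentence are made up; check the PyDelphin docs for the exact API of your version):

from delphin import ace

# Parse items one at a time so a single crash or memout only costs
# that one item, not the whole batch. 'indra.dat' is a placeholder
# for your compiled grammar image.
sentences = ['anjing itu tidur']  # e.g. read these from your item file
with ace.ACEParser('indra.dat', cmdargs=['-n', '1']) as parser:
    for s in sentences:
        response = parser.interact(s)
        print(s, '->', len(response.results()), 'result(s)')

The per-item loop is what buys the robustness; it’s also why it’s slower than art’s batch mode.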
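
And for processing and testing separately, the workflow would be roughly as follows (off the top of my head, so double-check the option spellings against art’s and gTest’s usage messages; the profile path is made up):

art -a 'ace -g indra.dat -n 1 --max-chart-megabytes=2048' ~/profiles/mrs
gTest.py -G ~/grammar/INDRA --static C :mrs

The first step fills the profile with parsing results; the second then only reads and analyzes them, so the expensive part never has to be repeated.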

Olga’s and Francis’s suggestions are also good.


#5

Thank you, Olga, Francis, and Mike!

I am sorry for my late reply.

Olga: thanks! Yes, my group leader (I am now doing an internship at a company) said that I can get access to a machine like that if needed. I can try.
Francis: you are right! A smaller set for testing is the best choice. The bottleneck is, I think, that INDRA overparses some constructions.
Mike: that’s it! I only care about coverage and should have limited it to -n 1. Could you please tell me where to put this -n 1 in the command line gTest.py -G ~/grammar/INDRA C :mrs?


#6

> Could you please tell me where to put this -n 1 in the command line gTest.py -G ~/grammar/INDRA C :mrs?

It goes in the options before the ‘C’ command, e.g.:

gTest.py -G ~/grammar/INDRA --ace-opts='-n1 --max-chart-megabytes=2048' C :mrs

But really I recommend processing and testing separately, especially for large jobs. Processing and testing at the same time is good for short regression tests.