On one of my computers, I was perfectly able to put the gold treebanks under tsdb/gold/ and I now see examples come up in LTDB (I cannot get the MRS displayed there but that is another matter). I do not recall any problems setting that up (although of course I may have forgotten).
Now on another machine, I am trying to set that up and, after putting the (same) treebanks under tsdb/gold/ and then running grm2db.py, I get:
Processing /home/olga/delphin/SRG/grammar/srg/tsdb/gold/tbdb02
Traceback (most recent call last):
  File "grm2db.py", line 134, in <module>
    process_tsdb(conn, cfg['ver'], golddir, log)
  File "/home/olga/delphin/tools/ltdb/scripts/gold2db.py", line 186, in process_tsdb
    gold, sent, lexind, typind = process_results(root, log)
  File "/home/olga/delphin/tools/ltdb/scripts/gold2db.py", line 44, in process_results
    deriv = first_result.derivation()
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/interface.py", line 87, in derivation
    drv = derivation.from_string(drv)
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/derivation.py", line 389, in from_string
    udfnode = _from_string(s)
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/derivation.py", line 474, in _from_string
    score=float(gd['score']) if gd['score'] else None,
ValueError: could not convert string to float: '0,000000'
What could be the reason for that? It has to be something very simple if it is working on another machine…
I thought it might be a Python version issue, because I noticed a recommendation to run the scripts with Python 3.8, so I created another virtual environment using 3.8.10. That did not help.
Are these the same files? Or are you re-processing the profiles on different machines?
It looks like it could be a localization issue, with the decimal comma used instead of a decimal point (i.e., 0,000000 == 0.000000). Python’s float only parses decimal points. If that’s the case, you should be able to just replace the , with . in that column.
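To make the failure concrete, here is a minimal sketch in plain Python (no pydelphin involved). The helper name `parse_score` is hypothetical and only illustrates the comma-to-point replacement suggested above; it assumes the field contains nothing but a single number.

```python
def parse_score(text):
    """Parse a score that may use a decimal comma instead of a decimal point.

    Hypothetical helper: assumes `text` holds a single number and nothing else.
    """
    return float(text.replace(",", "."))


# Python's float() does not consult the locale, so the comma form is rejected.
try:
    float("0,000000")
except ValueError as exc:
    print("float() failed:", exc)

# After replacing the comma, the same value parses normally.
print(parse_score("0,000000"))
```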
I do not think I am reprocessing anything. That’s what’s mysterious… I mean, there must be some explanation of course for why there is this comma there on one machine and not the other. But the treebanks I simply downloaded from github in both cases (at least I think so).
Replacing the commas in the derivation string doesn’t sound too robust… I’d rather obtain the correct/expected format from ACE (?) in the first place…
Here’s the workaround I did for now but I really doubt that’s the right thing to do. I still don’t understand why I have the issue in the first place (on one machine out of three; same treebanks, I think).
Here’s where the data comes from, a tsdb profile, processed items:
for response in ts.processed_items():
    sid=response['i-id']
    profile = ts.path.name
    if response['results']:
        first_result=response.result(0)
        deriv = first_result.derivation()
Then we get to the string representation already in pydelphin’s interface.py:
drv = self.get('derivation')
drv = re.sub(r'(\d),(\d)', r'\1.\2', drv) # Added this to get rid of the comma
try:
    from delphin import derivation
    if isinstance(drv, dict):
        drv = derivation.from_dict(drv)
    elif isinstance(drv, str):
        drv = derivation.from_string(drv)
This seems to work, but as I said, I doubt it is the right solution. Shouldn't the fix be to get 0.0000 in the first place?
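If a string-level workaround is needed in the meantime, a narrower pattern than `(\d),(\d)` would lower the risk of touching commas elsewhere, for example inside quoted token strings. The sketch below assumes that non-terminal nodes in the derivation string have the shape `(id entity score start end ...` and rewrites only the score slot. The function name `normalize_udf_scores` and the sample string are made up for illustration, and the function could be applied in one's own script rather than inside pydelphin.

```python
import re

# Matches "(<id> <entity> <integer part>" followed by ",<fraction>" and a space,
# i.e. only the score slot of a node header. Quoted terminals such as ("a,1" ...)
# begin with (" and therefore never match.
_SCORE_WITH_COMMA = re.compile(r'(\(\d+ \S+ -?\d+),(\d+)(?= )')


def normalize_udf_scores(drv):
    """Replace a decimal comma with a decimal point in node scores only."""
    return _SCORE_WITH_COMMA.sub(r'\1.\2', drv)


# Made-up derivation string for illustration only.
sample = '(root (1 subjh 0,000000 0 2 (2 n_rule -1,5 0 1 ("a,1" 3 "tok"))))'
print(normalize_udf_scores(sample))
```

The comma inside the quoted token `"a,1"` is left alone, which the broader regex would have changed.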
I think I am introducing these “0,0000” values by running fftb update on the profiles… I am really not sure why I am only facing the issue now, though. I suppose I must have done most of the treebanking on my laptop, which is an American laptop. I didn’t think so, but the facts point that way… Maybe I updated all of the profiles again before the release, so the released version has the expected “0.0000” format in the result file. But when I update using my European office machine, I get the commas. Not sure what the best solution is?
@sweaglesw What do you think? Does this sound like an FFTB issue, possibly, or is this something more general, perhaps the solution is in the OS setup?
To summarize: it appears that I get tsdb profiles whose result file contains scores in the European format 0,0000 when I run fftb update on one of my machines, namely my office machine, which was set up by someone before me.
Hm, as Francis suggests I suspect that may be FFTB outputting the zero in European format due to locale settings, though I am not sure. I would definitely try setting the number format locale environment variable to American before running and see if that helps.
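One way to try this without changing the machine's configuration is to launch the tool from a small wrapper that overrides the numeric locale for that one process. This is only a sketch: it assumes the tool formats numbers according to the standard `LC_NUMERIC` / `LC_ALL` environment variables, which has not been confirmed for FFTB in this thread. The helper names `numeric_c_env` and `run_with_c_numeric` are hypothetical.

```python
import os
import subprocess


def numeric_c_env(base=None):
    """Return a copy of the environment with the numeric locale forced to "C".

    LC_ALL overrides every LC_* category, so it is dropped here; otherwise
    setting LC_NUMERIC would have no effect. Other categories fall back to LANG.
    """
    env = dict(os.environ if base is None else base)
    env.pop("LC_ALL", None)
    env["LC_NUMERIC"] = "C"
    return env


def run_with_c_numeric(cmd):
    """Run `cmd` (a list of arguments) with decimal-point number formatting."""
    return subprocess.run(cmd, env=numeric_c_env(), check=True)
```

`run_with_c_numeric` would be called with whatever command line is normally used for the update. Afterwards, checking whether the result file contains `0.000000` rather than `0,000000` would show whether the locale was the cause.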