FFTB localization: scores output with comma decimal separator after treebank update

On one of my computers, I was perfectly able to put the gold treebanks under tsdb/gold/ and I now see examples come up in LTDB (I cannot get the MRS displayed there but that is another matter). I do not recall any problems setting that up (although of course I may have forgotten).

Now on another machine, I am trying to set that up and, after putting the (same) treebanks under tsdb/gold/ and then running grm2db.py, I get:

Processing /home/olga/delphin/SRG/grammar/srg/tsdb/gold/tbdb02
Traceback (most recent call last):
  File "grm2db.py", line 134, in <module>
    process_tsdb(conn, cfg['ver'], golddir, log)
  File "/home/olga/delphin/tools/ltdb/scripts/gold2db.py", line 186, in process_tsdb
    gold, sent, lexind, typind = process_results(root, log)
  File "/home/olga/delphin/tools/ltdb/scripts/gold2db.py", line 44, in process_results
    deriv = first_result.derivation()
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/interface.py", line 87, in derivation
    drv = derivation.from_string(drv)
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/derivation.py", line 389, in from_string
    udfnode = _from_string(s)
  File "/home/olga/delphin/tools/ltdb/py3/lib/python3.7/site-packages/delphin/derivation.py", line 474, in _from_string
    score=float(gd['score']) if gd['score'] else None,
ValueError: could not convert string to float: '0,000000'

What could be the reason for that? It has to be something very simple if it is working on another machine…

I thought maybe it was a Python version issue, because I noticed a recommendation to run the scripts with Python 3.8, so I created another virtual environment using 3.8.10. That did not help.

Are these the same files? Or are you re-processing the profiles on different machines?

It looks like it could be a localization issue, with the decimal comma used instead of a decimal point (i.e., 0,000000 == 0.000000). Python’s float only parses decimal points. If that’s the case, you should be able to just replace the , with . in that column.
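
A minimal sketch of the failure and of the substitution suggested above. The helper name normalize_decimal_commas is hypothetical (it is not part of PyDelphin), and the regex assumes that a comma directly between two digits only ever occurs inside a score:

```python
import re


def normalize_decimal_commas(text):
    """Replace a comma that sits directly between two digits with a point.

    Hypothetical helper for illustration; not part of PyDelphin.
    """
    return re.sub(r'(\d),(\d)', r'\1.\2', text)


# float() only accepts a decimal point, whatever the system locale is.
try:
    float('0,000000')
except ValueError as err:
    print(err)  # could not convert string to float: '0,000000'

print(float(normalize_decimal_commas('0,000000')))  # 0.0
```

Commas followed by a space, or not surrounded by digits, are left alone, but this is still a text-level patch and does not address where the commas come from.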

I do not think I am reprocessing anything. That’s what’s mysterious… I mean, there must be some explanation of course for why there is this comma there on one machine and not the other. But the treebanks I simply downloaded from github in both cases (at least I think so).

Replacing the commas in the derivation string doesn’t sound too robust… I’d rather obtain the correct/expected format from ACE (?) in the first place…

Here’s the workaround I did for now but I really doubt that’s the right thing to do. I still don’t understand why I have the issue in the first place (on one machine out of three; same treebanks, I think).

Here’s where the data comes from, a tsdb profile, processed items:

    for response in ts.processed_items():
        sid = response['i-id']
        profile = ts.path.name
        if response['results']:
            first_result = response.result(0)
            deriv = first_result.derivation()

Then we get to the string representation already in pydelphin’s interface.py:

        drv = self.get('derivation')
        try:
            from delphin import derivation
            if isinstance(drv, dict):
                drv = derivation.from_dict(drv)
            elif isinstance(drv, str):
                # Added this to get rid of the comma; done only for strings,
                # since re.sub would fail on a dict or on None
                drv = re.sub(r'(\d),(\d)', r'\1.\2', drv)
                drv = derivation.from_string(drv)

This seems to work but, as I said, I doubt it is the right solution. Shouldn’t the fix be to get 0.0000 written in the first place?

I think I am introducing these “0,0000” values by running fftb update on the profiles. I am not sure why I am only hitting the issue now, though. I suppose I must have done most of the treebanking on my laptop, which is an American laptop. I didn’t think so, but the facts suggest it. Maybe I updated all of the profiles again before the release, so the released version has the expected “0.0000” format in the result file. But when I update using my European office machine, I get the commas. Not sure what the best solution is.
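
To illustrate how the same value can be written differently on two machines, here is a small sketch using Python's locale module, which mirrors C-level printf-style formatting. It rests on the assumption, not confirmed in this thread, that FFTB formats scores with a locale-aware routine. The locale name de_DE.UTF-8 is only an example of a decimal-comma locale, and format_score is a hypothetical helper:

```python
import locale


def format_score(value, locale_name):
    """Format value with '%f' under the given LC_NUMERIC locale.

    The previous LC_NUMERIC setting is restored afterwards.
    """
    previous = locale.setlocale(locale.LC_NUMERIC)  # query current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.format_string('%f', value)
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)


print(format_score(0.0, 'C'))  # 0.000000

try:
    # Prints 0,000000 where this locale is installed.
    print(format_score(0.0, 'de_DE.UTF-8'))
except locale.Error:
    print('de_DE.UTF-8 is not installed on this system')
```

If that assumption holds, the comma would come from the environment in which the update is run rather than from the treebank itself.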

I don’t know if this is an FFTB issue or not, but at least it doesn’t appear to be an LTDB issue, so I am changing the category.

@sweaglesw What do you think? Does this sound like an FFTB issue, or is it something more general, with the solution perhaps lying in the OS setup?

To summarize: it appears that I get tsdb profiles whose result file contains scores in the European format 0,0000 when I run fftb update on one of my machines, namely my office machine, which was set up by someone before me.

Can you set the locale when you run it?

Hm, as Francis suggests I suspect that may be FFTB outputting the zero in European format due to locale settings, though I am not sure. I would definitely try setting the number format locale environment variable to American before running and see if that helps.
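
One way to try this without touching the system configuration is to override the variable for a single child process. The sketch below does that from Python. The actual fftb command line is not shown in this thread, so a child Python process stands in for it, and numeric_c_env is a hypothetical helper name:

```python
import os
import subprocess
import sys


def numeric_c_env(base=None):
    """Return a copy of the environment with LC_NUMERIC forced to 'C'."""
    env = dict(os.environ if base is None else base)
    # LC_ALL, when set, overrides every LC_* category, so drop it here.
    env.pop('LC_ALL', None)
    env['LC_NUMERIC'] = 'C'
    return env


# Stand-in child process that reports the LC_NUMERIC it was started with.
child = [sys.executable, '-c', "import os; print(os.environ['LC_NUMERIC'])"]
result = subprocess.run(child, env=numeric_c_env(), capture_output=True,
                        text=True, check=True)
print(result.stdout.strip())  # C
```

Replacing the stand-in command with the real treebank update call would run it with a decimal point as the numeric separator, provided the program honours LC_NUMERIC.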

Apologies for my ignorance; how exactly should I set the locale? At what level is that done: in the terminal, before I run fftb?

https://unix.stackexchange.com/questions/7309/set-the-language-for-a-single-program-execution

Hmm.

echo $LANG
en_US.UTF-8

So, the locale is American already. This is in the terminal where I run fftb.

You should also check the individual settings:

$ printenv | grep LC

Mine are unexpectedly varied:
bond@tanso:~/deb$ printenv | grep LC
LC_ADDRESS=en_SG.UTF-8
LC_NAME=en_SG.UTF-8
LC_MONETARY=en_SG.UTF-8
LC_PAPER=en_SG.UTF-8
LC_IDENTIFICATION=en_SG.UTF-8
LC_TELEPHONE=en_SG.UTF-8
LC_MEASUREMENT=en_SG.UTF-8
LC_TIME=en_SG.UTF-8
LC_NUMERIC=en_SG.UTF-8
bond@tanso:~/deb$ printenv | grep LANG
LANGUAGE=en_AU:en_US:en
LANG=en_AU.UTF-8
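
The reason LANG alone is not conclusive is the POSIX lookup order for number formatting: LC_ALL first, then LC_NUMERIC, then LANG. A small sketch of that lookup follows; effective_numeric_locale is a hypothetical helper, and the fallback to "C" when nothing is set is the usual default rather than a guarantee:

```python
import os


def effective_numeric_locale(env=None):
    """Return (variable, value) for the setting that decides number formatting.

    Lookup order follows POSIX: LC_ALL, then LC_NUMERIC, then LANG.
    Empty values count as unset.
    """
    env = os.environ if env is None else env
    for name in ('LC_ALL', 'LC_NUMERIC', 'LANG'):
        value = env.get(name)
        if value:
            return name, value
    return None, 'C'


# Values taken from the listing above.
listing = {'LANG': 'en_AU.UTF-8', 'LC_NUMERIC': 'en_SG.UTF-8'}
print(effective_numeric_locale(listing))  # ('LC_NUMERIC', 'en_SG.UTF-8')

# With only LANG set, LANG decides.
print(effective_numeric_locale({'LANG': 'en_US.UTF-8'}))  # ('LANG', 'en_US.UTF-8')
```

Calling it with no argument inspects the current process environment, which is what a program started from that terminal would inherit.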

Aha! See “How to set LC_NUMERIC to English permanently?” on Ask Ubuntu.