"dubious" sentences in the SQL query

What are “dubious” sentences, in [incr tsdb()]?

[Screenshot: Screen Shot 2020-05-22 at 10.59.35 AM]

(As an aside, I am happy I deleted most of my Russian test suites which had the word “final” in them.)

So, they are sentences whose i-wf field is set to 2 (0 being ungrammatical and 1 grammatical).

I am creating my test suites by importing lists of sentences. The ones marked with * end up with i-wf=0 (ungrammatical).

Is there a way of automatically getting i-wf=2 by prepending something like a question mark to a sentence? I checked, and the question mark does not seem to work. Is the only way to then manually edit the “item” file?

Hi Olga. It’s not clear whether you want an explanation of when i-wf is 2, so I’m sorry if you already know this. When i-wf is 2, the content might not be a sentence at all. For example, if you convert an academic article into a test suite, you’ll have a mix of sentences along with mathematical formulae, data tables, citation lists, etc. For this data, you don’t want the grammar to even attempt to parse the non-sentential items, because doing so would give you inaccurate coverage statistics. For grammatical items, getting a parse is good and failing to parse is bad; for ungrammatical items, getting a parse is bad and failing to parse is good. What about non-language data? Just skip it.
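To make the effect on the statistics concrete, here is a minimal sketch in plain Python. It is not how [incr tsdb()] or PyDelphin compute their numbers; the function name and the (i-wf, parsed) pair format are made up for illustration. It only shows that items with i-wf 2 drop out of both counts.

```python
def coverage_stats(items):
    """Compute (coverage, overgeneration) from (i_wf, parsed) pairs.

    Hypothetical helper for illustration: `parsed` is True when the
    grammar produced at least one parse for the item.
    """
    # Grammatical items (i-wf == 1): a parse counts toward coverage.
    grammatical = [parsed for wf, parsed in items if wf == 1]
    # Ungrammatical items (i-wf == 0): a parse counts as overgeneration.
    ungrammatical = [parsed for wf, parsed in items if wf == 0]
    # Items with i-wf == 2 are deliberately left out of both lists.
    coverage = sum(grammatical) / len(grammatical) if grammatical else None
    overgeneration = (
        sum(ungrammatical) / len(ungrammatical) if ungrammatical else None
    )
    return coverage, overgeneration
```

Adding or removing i-wf 2 items, parsed or not, leaves both figures unchanged.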

As for importing these items from text files with sentences, I don’t think the tools have something like the ungrammaticality marker * for the “dubious” items. PyDelphin, at least, lets you make profiles from CSV-like files with a header indicating the columns. For instance:

$ cat sents
i-wf@i-input@i-author
1@The dog barked.@goodmami
0@Barked dog the.@goodmami
2@<img src="dog.jpg"/>@goodmami
$ delphin mkprof dogs --delimiter="@" --relations ../Relations --input sents
    9746 bytes	relations
     131 bytes	item
       0 bytes	analysis
    [...]
$ cat dogs/item
1@@@@1@@The dog barked.@@@@1@3@@goodmami@
2@@@@1@@Barked dog the.@@@@0@3@@goodmami@
3@@@@1@@<img src="dog.jpg"/>@@@@2@2@@goodmami@

You can give it values for i-id, i-length, and i-difficulty if you want; otherwise it will generate appropriate values itself. This, along with correct TSDB-style escaping and outputting the right number of columns based on the relations file, is the advantage of using this method instead of writing out item records manually.
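If the sentences start out in a script rather than a hand-written file, the delimited input shown above could be generated along these lines. This is only a sketch: `to_delimited` is a made-up helper, not part of PyDelphin, and because I am not assuming anything about how the delimiter would be escaped in this input format, it simply refuses sentences that contain the delimiter or a line break.

```python
def to_delimited(items, author, delimiter="@"):
    """Build header-plus-rows text from (i_wf, sentence) pairs.

    Hypothetical helper: the column names match the example input file,
    and the result is meant to be saved and passed to mkprof via --input.
    """
    lines = [delimiter.join(["i-wf", "i-input", "i-author"])]
    for wf, sentence in items:
        # Assumption: no escaping is attempted here, so reject any
        # sentence that would break the one-row-per-item layout.
        if delimiter in sentence or "\n" in sentence:
            raise ValueError(f"cannot encode sentence: {sentence!r}")
        lines.append(delimiter.join([str(wf), sentence, author]))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file such as sents gives the same three-column layout as the example.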

Ah, thank you, Mike! So, different from what I’d mark with ?? in the test suite.

Actually it would probably be appropriate for that, too. If an item is questionable—that is, you cannot decide if it is grammatical or not—it is perhaps best to ignore it for coverage statistics.

Yes. I would use it if I could just use ?? in the list of sentences… I rarely work with profiles directly at this stage, as I keep changing the test suite.

Ok, so you mean you would use the feature if you could prefix the sentences with ?? but not if you need to use the --delimiter option? It’s easy to make that change, but (a) I don’t know how widespread this convention is and (b) why not one “?” instead of two? In any case, I’m unlikely to add this to PyDelphin but I could be convinced.

Right, because I simply use text files for the initial version of the test suite. Oftentimes I even create them with the web questionnaire (Test Sentences) first.

? or ?? does not matter I think; ? would be fine unless we expect languages where it is a character (in which case ?? I suppose is less likely to be a character than ?, maybe).

But what I am talking about here is tsdb; I don’t suppose a change could be forthcoming there :).

Regarding this point, I would not rely on ?? being probably less likely. If it’s a character it is conceivable that it could be doubled, even at the start of a sentence. Instead, you can just insert a space as the first character of the sentence, since the markers are only in the first column, and surrounding spaces are stripped. E.g., if you have sentences (1) “?ello world”, (2) “?ello ?ello ?ello”, and (3) “?ello / ?oodbye” where the first is grammatical, the second is ungrammatical, and the third is dubious, you could encode them like this (if we had ? as a “dubious” marker):

 ?ello world
*?ello ?ello ?ello
??ello / ?oodbye

This way you can even use * as the first character in a sentence.
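The first-column rule can be sketched in a few lines of Python. Note the assumptions: ? as a “dubious” marker is hypothetical (only * is described above as an existing marker), and `split_marker` is an illustrative name, not a function from PyDelphin or [incr tsdb()].

```python
# Marker character -> i-wf value. "?" is the hypothetical dubious marker.
MARKERS = {"*": 0, "?": 2}


def split_marker(line):
    """Return (i_wf, sentence) for one line of a sentence list.

    Only the very first character is checked for a marker; after that,
    surrounding whitespace is stripped from the sentence.
    """
    line = line.rstrip("\n")
    wf = 1  # unmarked lines are treated as grammatical
    if line and line[0] in MARKERS:
        wf = MARKERS[line[0]]
        line = line[1:]
    return wf, line.strip()
```

With the three lines above, this yields i-wf values 1, 0, and 2, and each sentence keeps its own leading ?. A line such as " *star" comes out as grammatical with the sentence "*star".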
