Fields in TSDB++ Profile


#1

I’m creating a list of test cases for my phenomena, and hand-spun a TSDB++ item generator from my tables. I was having some issues until I realized that TSDB++ really wanted an integer value for the field “i-difficulty”, so I just defaulted it to 1. I’m assuming this field means something like “How difficult is it to capture this sentence grammatically?”, but I realized I don’t have a good understanding of all the fields in the Relations file. I figured this would be a good place to ask. Below are the required item fields and my interpretation of them. My question is: What is the intended meaning of the fields I’ve left off or guessed at?

id (integer) - Unique ID
origin - speaker or location the data came from
register - sociolinguistic register??
format - orthography choice??
difficulty - difficulty to parse??
category - ??
input - non-segmented orthography
tokens - segmented orthography
gloss - glossed segments
translation - free translation
wf - well-formedness (0 ungrammatical, 1 grammatical)
length - number of words (why is this necessary?)
comment - free comment field
author - author of the test suite/test item
date - date it was added (also unclear why this is necessary)


#2

`input' is what you want the parser to process, at least in most settings. I'm not sure what `tokens' is. `length' is there (I think) to support the analyses that [incr tsdb()] can give you which break down coverage & performance by sentence length.

I think `category', `difficulty', `register', etc. have to do with the TSNLP approach to testsuite construction (but I'm not quite sure).

Not all of these fields are required. You might look to the make_item script from 567 for clues:
http://courses.washington.edu/ling567/make_item


#3

There are 3 data types in test suites: :integer, :string, and :date. The default (i.e., unset) value for the latter two is an empty string (e.g., in ...@@..., there’s an empty string or date field in between the two @s), but for integer fields the default is -1. Integers are generally used for identifiers, lengths, or counts, and 0 is a meaningful value for these, so -1 is taken as a non-value.
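
As a rough illustration of those defaults, here is a minimal Python sketch that fills unset fields by data type and joins a row with @. The function name encode_row and the four-column schema are made up for the example (a real item relation has more columns), and the sketch assumes no field value itself contains an @.

```python
# Default (unset) values by data type, as described above.
DEFAULTS = {":integer": "-1", ":string": "", ":date": ""}

def encode_row(schema, values):
    """Join one row's fields with '@', filling unset fields with type defaults.

    schema: list of (field_name, datatype) pairs in column order.
    values: dict mapping field names to values; missing or None means unset.
    """
    cells = []
    for name, datatype in schema:
        value = values.get(name)
        cells.append(DEFAULTS[datatype] if value is None else str(value))
    return "@".join(cells)

# Hypothetical four-column schema, for illustration only.
schema = [
    ("i-id", ":integer"),
    ("i-input", ":string"),
    ("i-length", ":integer"),
    ("i-date", ":date"),
]
row = encode_row(schema, {"i-id": 10, "i-input": "Kim sleeps."})
print(row)  # 10@Kim sleeps.@-1@
```

Note the trailing @ in the output: the unset date field is an empty string after the last separator.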

There are some integer fields, called “coded attributes” (defined in *tsdb-coded-attributes* in [incr tsdb()]'s globals.lisp) for which the value does not correspond to an identifier or a scalar, but to categories (like an Enum in some programming languages). For these, the default of -1 might not be acceptable. Those are, with acceptable default values:

  • i-difficulty (1)
  • i-wf (1)
  • polarity (-1)
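
In a hand-written generator, one way to handle this is a small lookup consulted before falling back to -1. This is only a sketch: the names CODED_DEFAULTS and integer_default are mine, and the values are just the ones listed above.

```python
# Acceptable defaults for the coded attributes listed above.
CODED_DEFAULTS = {"i-difficulty": 1, "i-wf": 1, "polarity": -1}

def integer_default(field_name):
    """Return the default for an integer field: coded attributes get their
    listed default, every other integer field gets the non-value -1."""
    return CODED_DEFAULTS.get(field_name, -1)

print(integer_default("i-difficulty"))  # 1
print(integer_default("i-length"))      # -1
```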

The meaning of i-wf is given on the wiki (http://moin.delph-in.net/ItsdbReference), but the other two are not defined. I think your choice of 1 is an acceptable value for i-difficulty. My guess is that the values range from 1 to 6, but that is just conjecture (see here).

i-format may have to do with the kind of input data, e.g., if it’s XML, text, LaTeX, HTML, etc., but I’ve only ever seen none as the value of that field, so I’m not sure.

I think Emily is right about i-category, i-difficulty, and i-register.

i-input is the input string for the system, but this may be pre-segmented (e.g., as Jacy does), or not (as the ERG does). These strings will get tokenized, probably by a REPP, and the result (the YY tokens) is probably what could go in i-tokens, but I’m not sure.

i-length is the number of words (not post-REPP tokens). For example, “Kim didn’t sleep.” is 3 words, but 5 tokens in the ERG’s tokenization (“Kim” “did” “n’t” “sleep” “.”). I’m not sure what you’d do if your language doesn’t use spaces to separate words and you use that form in the i-input field.
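
For a language that does separate words with spaces, a generator could compute i-length with a plain whitespace split. This is a sketch under that assumption only (the helper name i_length is mine), and it does not address the unsegmented case.

```python
def i_length(i_input):
    """Count whitespace-separated words, not post-REPP tokens."""
    return len(i_input.split())

print(i_length("Kim didn’t sleep."))  # 3
```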

Also note that i-comment is sometimes hijacked to embed additional fields in a Lisp S-Expression format (e.g., (:probability 1.0) (:score 0.8)).
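
If you need to read such embedded fields back out, a regular expression is enough for flat (:key value) pairs like the ones in that example. This is a minimal sketch, not a real S-expression reader: the function name is made up, nested parentheses are not handled, and values are left as strings.

```python
import re

# Matches one flat "(:key value)" group; the value may not contain parentheses.
PAIR = re.compile(r"\(\s*:([\w-]+)\s+([^()]*?)\s*\)")

def parse_comment_fields(comment):
    """Extract (:key value) pairs from an i-comment string as a dict of strings."""
    return {key: value for key, value in PAIR.findall(comment)}

fields = parse_comment_fields("(:probability 1.0) (:score 0.8)")
print(fields)  # {'probability': '1.0', 'score': '0.8'}
```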

Most fields (including i-date) are not required. I think all you need is i-id, i-input, i-wf, and maybe i-length.

You might find more information here: http://www.delph-in.net/itsdb/publications/index.html