Treebanked coverage (accuracy) with [incr tsdb()] vs with counting lines

How do I reliably obtain the accuracy for old, let’s call them “frozen”, treebanked profiles that cannot be reparsed using ACE? I don’t want to change anything in them, and I cannot reparse or update them; I simply want to know how many trees in them were marked as accepted.

I am currently trying different methods and getting different results. All methods yield the same number of items in the profile but differ when it comes to counting accepted items.

1: count the lines in the preference file and divide that number by the number of lines in the item file:

This would make you think that there are 155 accepted items in the profile.
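A minimal sketch of this line-counting method in Python, in case it helps to reproduce it. It assumes the profile directory contains uncompressed plain-text files named `preference` and `item` with one record per line; the function names are my own and are not part of any DELPH-IN tool.

```python
from pathlib import Path


def count_lines(path):
    """Count the non-empty lines in one profile file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def naive_accuracy(profile_dir):
    """Return (preference lines, item lines, ratio) for a profile directory.

    Assumption: every line in `preference` stands for exactly one
    accepted item. As discussed below, that may not hold.
    """
    profile = Path(profile_dir)
    accepted = count_lines(profile / "preference")
    total = count_lines(profile / "item")
    return accepted, total, accepted / total
```

Calling `naive_accuracy` on a profile directory gives the raw line counts, so any duplicated or missing identifiers still have to be checked separately.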

  2. Now I open a profile in fftb and manually count the non-accepted items (the yellow and the red ones).

I am not giving you the full list but believe me, I counted 10 times, and there are 27 non-accepted items in that list, which would mean there are 181-27=154 accepted items. Not 155.

Spoiler: there is an item that appears as “accepted” in fftb which does not appear in the preference file; the sentence does not look Spanish (has a typo or is in a different language).

On the other hand, in the preference file, some item IDs appear twice:


So, this would account for the difference between what I see in the preference file and what I see if I open the profile in fftb, but what does it mean?
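To check for both effects at once (identifiers listed twice, and items absent from the preference file), something like the following sketch could be used. It assumes that fields are separated by `@` and that the first field of each line in both files is an identifier that can be compared directly; I have not verified that this holds for every profile, so treat the helper names and the field layout as hypothetical.

```python
from collections import Counter


def first_field(line, sep="@"):
    """Return the first separator-delimited field of a record line."""
    return line.rstrip("\n").split(sep, 1)[0]


def id_counts(lines):
    """Map each identifier to the number of lines it occurs on."""
    return Counter(first_field(line) for line in lines if line.strip())


def summarize(preference_lines, item_lines):
    """Compare identifiers in the preference file against the item file."""
    pref = id_counts(preference_lines)
    items = id_counts(item_lines)
    return {
        "total_items": len(items),
        "preference_lines": sum(pref.values()),
        "distinct_accepted": len(pref),
        # identifiers that occur on more than one preference line
        "duplicated": sorted(i for i, n in pref.items() if n > 1),
        # items with no preference line at all
        "not_in_preference": sorted(set(items) - set(pref)),
    }
```

Passing the lines of the two files (for example via `open(...).readlines()`) shows whether the gap between the line count and the fftb count comes from duplicated identifiers.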

  3. Finally, I tried relying on [incr tsdb()] itself, but I don’t fully understand how to do it there either. I thought what I should do is enter the TSQL condition t-active=1, like this:

And then I would see the number of accepted items which I can then myself divide by the number of total items to get the treebanked coverage (accuracy), e.g. 153/181:

Perhaps [incr tsdb()] knows how to not count the doubly-listed item twice and knows how to exclude the foreign item (because it is marked as such elsewhere in the database?), so perhaps I should trust this 153 number.
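For reference, the three candidate counts from above (155 from the preference file, 154 from fftb, 153 from [incr tsdb()]) translate into the following coverage figures over 181 items. This is only the arithmetic; the `coverage` helper is my own and says nothing about which count is the right one.

```python
def coverage(accepted, total):
    """Treebanked coverage (accuracy) as a fraction of all items."""
    return accepted / total


TOTAL_ITEMS = 181
candidates = {
    "preference file lines": 155,
    "fftb manual count": 154,
    "[incr tsdb()] t-active=1": 153,
}

for method, accepted in candidates.items():
    # Print each method with its count and the resulting percentage.
    print(f"{method}: {accepted}/{TOTAL_ITEMS} = {coverage(accepted, TOTAL_ITEMS):.1%}")
```

The three methods differ by roughly one percentage point in total, so the discrepancy matters mostly for exact reporting.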


If I load another profile from the same corpus, with the same TSQL query, I see a number of “results” which is larger than the number of total items (519 “out of” 388):

So, I don’t trust what I am doing here. Maybe the profiles are broken somehow, but then I can’t trust what I see in [incr tsdb()] anyway.

Any comments on the correct procedure? What is the correct way to query a profile for its accuracy? (Again, assume we can’t reparse it with ACE to create an updated version.)

Has anyone perhaps been answering this via email, and the replies didn’t make it through?

Actually, sorry, let me close this question and open a new related one.

The answer to this question is: thin the profile.