Using pydelphin to create a tsdb profile from other profiles?

Is there a high-level API to take some tsdb profiles, select some items from them, and put those items into a new profile, keeping of course all the information including treebanking decisions, results, etc.? I cannot figure it out from the docs. I can create a new profile using the i-input values written into a file, and then I could go over those new items, look at the original reference items (which I’d keep in some order), and update their ids and their item fields. However, that won’t give me the results, parses, decisions, and so on from the original testsuite.

Is it possible to easily split and combine profiles?..

I have come up with this code but it is not yet correct (and of course I am not even sure at this point that I am doing something sensible in the first place):

def create_dev_test_split(path_to_profiles, output_dir, test_ratio, db_schema):
    # Split item ids into dev and test:
    dev_ids, test_ids = split_ids(path_to_profiles, test_ratio)
    # Create a temporary empty text file:
    with open(output_dir + '/temp.txt', 'w') as f:
        f.write('')
    commands.mkprof(output_dir + '/dev/', source=output_dir + '/temp.txt', schema=db_schema)
    commands.mkprof(output_dir + '/test/', source=output_dir + '/temp.txt', schema=db_schema)
    dev_profile = itsdb.TestSuite(output_dir + '/dev/')
    test_profile = itsdb.TestSuite(output_dir + '/test/')
    for ts_path in glob.glob(path_to_profiles + '/*'):
        ts = itsdb.TestSuite(ts_path)
        for i,item in enumerate(ts['item']):
            if item['i-id'] in dev_ids:
                copy_from_db(dev_profile, item, ts_path)
            elif item['i-id'] in test_ids:
                copy_from_db(test_profile, item, ts_path)
            else:
                print('Item in neither set.')
    dev_profile.commit()
    test_profile.commit()


def copy_from_db(profile, item, ts_path):
    profile['item'].append(item)
    q_parse = '* from parse where i-id = ' + str(item['i-id'])
    selection_parse = commands.select(q_parse, ts_path)
    # The non-empty files in the original database:
    related_tables = ['run', 'decision', 'edge', 'preference', 'result', 'tree']
    for sdp in selection_parse.data:
        r = itsdb.Row(selection_parse.fields, sdp)
        parse_id = r['parse-id']
        profile['parse'].append(r)
        for rt in related_tables:
            q = '* from ' + rt + ' where parse-id = ' + str(parse_id)
            selection = commands.select(q, ts_path)
            for sd in selection.data:
                rs = itsdb.Row(selection.fields, sd)
                profile[rt].append(rs)

However, even though I am using the same relations file as in the original test suite, in the end I cannot write the new test suite out because of a mismatch in fields:

Traceback (most recent call last):
  File "/home/olga/delphin/GAUSS/gauss-repo/venv/lib/python3.8/site-packages/delphin/tsdb.py", line 855, in write
    (join(record, fields) + '\n').encode(encoding))
  File "/home/olga/delphin/GAUSS/gauss-repo/venv/lib/python3.8/site-packages/delphin/tsdb.py", line 492, in join
    _mismatched_counts(values, fields)
  File "/home/olga/delphin/GAUSS/gauss-repo/venv/lib/python3.8/site-packages/delphin/tsdb.py", line 502, in _mismatched_counts
    raise TSDBError('number of columns ({}) != number of fields ({})'
delphin.tsdb.TSDBError: number of columns (23) != number of fields (21)

Upon inspection, the difference is the following:

But at this point I rather suspect I am doing something that I am not supposed to be doing, anyway… It’s just that I have treebanked profiles which I would like to rearrange but I would really like to avoid having to retreebank them.

Hi Olga, I appreciate that you do look through the docs to find answers. The fact that you could not find an answer means there’s still more room for improvement in the docs :slight_smile:

For the delphin.commands.mkprof function (or the delphin mkprof command), you’ll want to use the full=True or --full option to make sure all relevant data is copied, and you can use a TSQL condition in the where=... or --where option to filter particular items. As for joining multiple profiles together, there isn’t functionality to do that specifically. You can just concatenate the files, but if they were processed separately you might have conflicting item or edge ids, and I don’t have a good solution for you other than manipulating the ids yourself to ensure uniqueness.
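A minimal sketch of that suggestion, under a few assumptions: the helper names id_condition and copy_subset are only illustrative, the paths in the usage note are hypothetical, and leaving out schema= assumes that mkprof then takes the relations file from the source profile.

```python
def id_condition(ids):
    """Build a TSQL condition such as 'i-id = 1 or i-id = 3'."""
    ids = sorted(set(ids))
    if not ids:
        # An empty condition would not select anything meaningful.
        raise ValueError('at least one i-id is required')
    return ' or '.join('i-id = {}'.format(i) for i in ids)


def copy_subset(source, destination, ids):
    """Copy the items in `ids`, plus the rows linked to them, to a new profile."""
    # Imported here so id_condition() can be used without PyDelphin installed.
    from delphin import commands
    # full=True copies data from all tables; where= restricts the items.
    # Assumption: with no schema= argument, the source's relations file is used.
    commands.mkprof(destination, source=source,
                    where=id_condition(ids), full=True)
```

With hypothetical paths, a call would look like copy_subset('profiles/A', 'profiles/A-dev', [1, 3, 5]).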

Or if I misunderstood what you’re trying to do, maybe you can elaborate on the task?

Thanks, @goodmami !

The task is:

Suppose I have two profiles, A and B, all items in which are unique. So, just two different datasets. They had been processed with ACE and then treebanked, so, we have edges, parses, results, decisions in the profile. The item IDs are unique for sure. As for edges, that I do not know; probably not.

What I want is to create new profiles, C and D, each of which will be some selection from the original two. Suppose I want to take every odd item from A and B and put it into C, and then every even item from A and B I want to put into D. Such that C and D are then correct profiles with all the treebanking decisions intact.

Is that possible or am I out of luck and once a profile has been processed and treebanked, it is not supposed to be manipulated so as to create a different profile from it? (In which case I suppose the solution is to collect ID numbers once and then just retrieve them from A and B every time, instead of having C and D?)

Sounds like maybe I can achieve what I want by taking A and B and then creating from them A', B', A'', and B'', so, four profiles, and then I can use A' and B' instead of C and A'' and B'' instead of D?

But I am not sure. I can create a copy of a profile (with all the data) using --full but can I then delete items?.. Or do you mean, where can be used in the same step as mkprof?

Ah yes, indeed there is a where option in mkprof. I will try to build a correct query then :sweat_smile:

I think it works like this:

import glob
import os

from delphin import commands

def create_dev_test_split(path_to_profiles, output_dir, test_ratio, db_schema):
    # Split item ids into dev and test (split_ids is a helper not shown here):
    dev_ids, test_ids = split_ids(path_to_profiles, test_ratio)
    # From a list of ids, create a query of the form: 'i-id = id1 or i-id = id2 or ...':
    dev_query = ' or '.join('i-id = ' + str(i) for i in dev_ids)
    test_query = ' or '.join('i-id = ' + str(i) for i in test_ids)
    for ts_path in glob.glob(path_to_profiles + '/*'):
        name = os.path.basename(ts_path)
        # full=True copies the rows linked to the selected items from all tables:
        commands.mkprof(os.path.join(output_dir, 'dev', name + '_dev'), source=ts_path, schema=db_schema, where=dev_query, full=True)
        commands.mkprof(os.path.join(output_dir, 'test', name + '_test'), source=ts_path, schema=db_schema, where=test_query, full=True)

Thanks again, @goodmami !

By the way, I was able to locate all the relevant portions of the documentation, but the problem was that I was also finding much documentation which looked like it may be relevant but was not in this case (such as all the low-level documentation). But that’s a typical problem, I think. With tsdb in particular, my problem often is that I don’t know relational databases all that well and so I often do not think/ask questions in the right terms.

On this point, note that mkprof has the following modes:

  • full=True or --full: copy the relevant data from all tables (AKA relations AKA files); this controls which tables are copied, not which items are selected
  • skeleton=True or --skeleton: only copy files relevant for a skeleton (i.e., the TSDB core files)
  • default (neither of the above is set): only copy the TSDB core files, but instantiate empty files for the rest (this mimics the behavior of Woodley’s mkprof utility)

For all of the above, you can use where= or --where to select which items to include. Selections are internally consistent, so if you filter on i-id, you will also be filtering rows in result where the parse-id is linked to the i-id in the parse table, etc. Edges are also linked by parse-id, but I think the e-id starts over each time you process, so if you process two profiles, you probably won’t be able to join them without reassigning e-id numbers to one of the profiles (e.g., starting at the max e-id from the first profile + 1, or something).
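A rough sketch of the renumbering idea, using plain dicts as a hypothetical stand-in for edge rows (this is not PyDelphin's own row type, and the helper names are made up for illustration). It only shifts the e-id column itself; any other column that refers to edge ids would need the same shift, which is not handled here.

```python
def next_free_id(rows, field):
    """Return one more than the largest value of `field` (0 if no rows)."""
    return max((row[field] for row in rows), default=-1) + 1


def shift_ids(rows, field, offset):
    """Return copies of `rows` with the integer in `field` raised by `offset`."""
    return [{**row, field: row[field] + offset} for row in rows]


# Two separately processed profiles whose e-ids both start at 0.
edges_a = [{'e-id': 0, 'parse-id': 1}, {'e-id': 1, 'parse-id': 1}]
edges_b = [{'e-id': 0, 'parse-id': 7}, {'e-id': 1, 'parse-id': 7}]

# Start the second profile's e-ids at (max e-id of the first profile) + 1.
offset = next_free_id(edges_a, 'e-id')
merged = edges_a + shift_ids(edges_b, 'e-id', offset)
```

The original lists are left untouched, so the shift can be checked before anything is written back to a profile.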

If you can just work with the split profiles that are not rejoined, that would avoid some headaches.

Part of the issue may be how TSDB (and the delphin.tsdb module) uses traditional relational DB terminology like relation, record, etc. and for the delphin.itsdb module I use more SQL-like terms like table, row, etc., and casually we have something else entirely like *-file (item file, etc.), line, etc. I tried to write these up at the top of the guide here: Working with [incr tsdb()] Test Suites — PyDelphin 1.8.1 documentation. I agree, it can be confusing.

Let me know if you have any further questions.

I think that’s the best way, probably. Thanks again!