POS-label output in PyDelphin


#1

I would like to get POS labels as output when I input a sentence using PyDelphin, so that it works like a POS tagger.
In the PyDelphin docs, there’s “Tokens and Token Lattices” (https://pydelphin.readthedocs.io/en/latest/tutorials/walkthrough.html#tokens-and-token-lattices), which is similar to what I want, but when I tried it with INDRA, the output did not show any POS information.
Does anyone (maybe Mike) know how to get the POS labels in PyDelphin?


#2

These are available either in an [incr tsdb()] profile or in an ACE response with a suitably recent version of ACE and PyDelphin (it depends on the --tsdb-stdout option, which has been available since 0.9.24). E.g., to get them from the ERG using ACE in PyDelphin, do something like this:

>>> from delphin.interfaces import ace
>>> response = ace.parse('/home/goodmami/grammars/erg-trunk/erg.dat', 'Abrams chased the dogs to the street.')
NOTE: parsed 1 / 1 sentences, avg 5947k, time 0.03251s
>>> for token in response.tokens('initial').tokens:
...     print(token.form, token.pos)
... 
Abrams [('NNP', 1.0)]
chased [('NNP', 1.0)]
the [('DT', 1.0)]
dogs [('NNS', 1.0)]
to [('TO', 1.0)]
the [('DT', 1.0)]
street [('NN', 1.0)]
. [('.', 1.0)]

But this depends on the grammar having token mapping rules. Jacy, for example, does not give POS tags by default:

>>> response = ace.parse('/home/goodmami/grammars/jacy/jacy.dat', '太郎 が 道 まで 犬 を 追い かけ た .')
NOTE: parsed 1 / 1 sentences, avg 4037k, time 0.02698s
>>> for token in response.tokens('initial').tokens:
...     print(token.form, token.pos)
... 
太郎 []
が []
道 []
まで []
犬 []
を []
追い []
かけ []
た []

(With Jacy you’ll get some POS tags if you use YY mode and pre-tokenize with mecab, but these will be mecab-style tags, like “名詞-人名:太郎-たろう”, though I think Jacy might manipulate those a bit.)

There’s a little bit of info on the wiki about chart mapping, but nothing specifically about token mapping for POS tags: http://moin.delph-in.net/ChartMappingSetup
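
Since token.pos in the examples above is a list of (tag, probability) pairs that can also be empty (as with Jacy), a small helper along these lines might be handy if you only want one tag per token. This is just a sketch: best_tags is a made-up name, not part of PyDelphin, and the stand-in tokens below only assume the .form and .pos attributes shown in the output above.

```python
from types import SimpleNamespace


def best_tags(tokens):
    """Return (form, tag) pairs, keeping only the most probable tag.

    Each token is assumed to have a .form string and a .pos list of
    (tag, probability) pairs. Tokens with an empty .pos list get None.
    """
    pairs = []
    for token in tokens:
        if token.pos:
            # Pick the pair with the highest probability.
            tag, _prob = max(token.pos, key=lambda pair: pair[1])
        else:
            tag = None
        pairs.append((token.form, tag))
    return pairs


# Stand-in tokens for illustration only; with ACE you would pass
# response.tokens('initial').tokens instead.
demo = [
    SimpleNamespace(form='dogs', pos=[('NNS', 1.0)]),
    SimpleNamespace(form='太郎', pos=[]),
]
print(best_tags(demo))
```

Returning None for untagged tokens keeps the output aligned with the input tokens, so a grammar without token mapping rules still gives one entry per token.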


#3

Thank you very much for the reply and the information, Mike! :slight_smile:
In the future I would like to develop INDRA with chart mapping/token mapping rules for POS tags, but for now I think the labels that can be extracted from the parse trees (or from labels.tdl) are enough. I tried:

>>> from delphin.interfaces import ace
>>> response = ace.parse('ind.dat', 'Budi dan saya mungkin akan mengejarnya di rumah jika dua anjing besar itu telah tidur')
NOTE: parsed 1 / 1 sentences, avg 21400k, time 0.07290s
>>> x = response.result(0).tree()
>>> x
['S', ['NP-T', ['PROPN', ['Budi']], ['NP-B', ['CCONJ', ['dan']], ['PRON', ['saya']]]], ['VP', ['VP', ['VP', ['ADV', ['mungkin']], ['VP', ['AUX', ['akan']], ['VP', ['VERB', ['VERB', ['mengejar']]], ['PRON', ['-nya']]]]], ['PP', ['ADP', ['di']], ['NP', ['NOUN', ['rumah']]]]], ['SCONJ', ['SCONJ', ['jika']], ['S', ['NP', ['NOUN', ['NUM', ['dua']], ['NOUN', ['NOUN', ['anjing']], ['ADJ', ['ADJ', ['besar']]]]], ['DET', ['itu']]], ['VP', ['AUX', ['telah']], ['VP', ['tidur']]]]]]]

I got a result similar to what I want. I need to process it a bit so that I can get something like:
Budi ‘PROPN’
dan ‘CCONJ’
saya ‘PRON’
mungkin ‘ADV’
akan ‘AUX’
mengejar ‘VERB’
-nya ‘PRON’
di ‘ADP’
rumah ‘NOUN’
jika ‘SCONJ’
dua ‘NUM’
anjing ‘NOUN’
besar ‘ADJ’
itu ‘DET’
telah ‘AUX’


#4

Something like this maybe (note: Python 3 only)?

>>> def nodelabels(subtree):
...     labels = []
...     label, *children = subtree
...     for child in children:
...         if len(child) == 1:
...             labels.append((child[0], label))
...         else:
...             labels.extend(nodelabels(child))
...     return labels
... 
>>> nodelabels(x)
[('Budi', 'PROPN'), ('dan', 'CCONJ'), ('saya', 'PRON'), ('mungkin', 'ADV'), ('akan', 'AUX'), ('mengejar', 'VERB'), ('-nya', 'PRON'), ('di', 'ADP'), ('rumah', 'NOUN'), ('jika', 'SCONJ'), ('dua', 'NUM'), ('anjing', 'NOUN'), ('besar', 'ADJ'), ('itu', 'DET'), ('telah', 'AUX'), ('tidur', 'VP')]
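
To get from those pairs to the one-per-line listing asked for earlier, a formatting step like the following may be enough. It is only a sketch: format_labels is a made-up name, and the pairs are copied from the output above. Note that 'tidur' comes out as 'VP' because the tree has no POS-level node above it.

```python
def format_labels(pairs):
    """Format (form, label) pairs as lines like: Budi 'PROPN'."""
    return ['{} {!r}'.format(form, label) for form, label in pairs]


# A few pairs copied from the nodelabels(x) output above.
pairs = [('Budi', 'PROPN'), ('dan', 'CCONJ'), ('tidur', 'VP')]

for line in format_labels(pairs):
    print(line)
```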

#5

Thank you very much, Mike! :smiley: