POS-label output in PyDelphin

I would like to see POS labels as output when I input a sentence using PyDelphin, so that it works like a POS tagger.
In the PyDelphin docs there’s “Tokens and Token Lattices” (https://pydelphin.readthedocs.io/en/latest/tutorials/walkthrough.html#tokens-and-token-lattices), which is similar to what I want, but when I tried it with INDRA, the output did not show any POS information.
Does anyone (maybe Mike) know how to get the POS labels in PyDelphin?

These are available either in a [incr tsdb()] profile or in an ACE response, given suitably recent versions of ACE and PyDelphin (it depends on the --tsdb-stdout option, which has been available since 0.9.24). E.g., to get them from the ERG using ACE in PyDelphin, do something like this:

>>> from delphin.interfaces import ace
>>> response = ace.parse('/home/goodmami/grammars/erg-trunk/erg.dat', 'Abrams chased the dogs to the street.')
NOTE: parsed 1 / 1 sentences, avg 5947k, time 0.03251s
>>> for token in response.tokens('initial').tokens:
...     print(token.form, token.pos)
... 
Abrams [('NNP', 1.0)]
chased [('NNP', 1.0)]
the [('DT', 1.0)]
dogs [('NNS', 1.0)]
to [('TO', 1.0)]
the [('DT', 1.0)]
street [('NN', 1.0)]
. [('.', 1.0)]

But this is dependent on the grammar having token mapping rules. Jacy, for example, does not give POS tags by default:

>>> response = ace.parse('/home/goodmami/grammars/jacy/jacy.dat', '太郎 が 道 まで 犬 を 追い かけ た .')
NOTE: parsed 1 / 1 sentences, avg 4037k, time 0.02698s
>>> for token in response.tokens('initial').tokens:
...     print(token.form, token.pos)
... 
太郎 []
が []
道 []
まで []
犬 []
を []
追い []
かけ []
た []

(With Jacy you’ll get some POS tags if you use YY mode and pre-tokenize with MeCab, but these will be MeCab-style tags, like “名詞-人名:太郎-たろう”, though I think Jacy might manipulate those a bit.)
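
If you want to try that route, here is a rough, untested sketch of feeding YY-mode input to ACE through PyDelphin. It assumes an ACE build that accepts the -y (YY input mode) option, and the token IDs, spans, and tags are made up for illustration; in practice you would generate the lattice string from MeCab’s output.

from delphin.interfaces import ace

# A hand-built YY lattice:
# (id, start, end, <from:to>, path, "form", ipos, "lrule", "tag" prob)
yy = ('(1, 0, 1, <0:2>, 1, "太郎", 0, "null", "名詞-人名" 1.0) '
      '(2, 1, 2, <2:3>, 1, "が", 0, "null", "助詞-格助詞" 1.0)')

# -y switches ACE to YY input mode (assumption: your build supports it)
response = ace.parse('jacy.dat', yy, cmdargs=['-y'])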

There’s a little bit of info on the wiki about chart mapping, but not specifically about token mapping for POS tags: http://moin.delph-in.net/ChartMappingSetup

Thank you very much for the reply and the information, Mike! :slight_smile:
I would like to develop INDRA with chart mapping/token mapping rules for POS tags in the future, but for now the labels that can be extracted from the parse trees (or from labels.tdl) are enough, I think. I tried:

>>> from delphin.interfaces import ace
>>> response = ace.parse('ind.dat', 'Budi dan saya mungkin akan mengejarnya di rumah jika dua anjing besar itu telah tidur')
NOTE: parsed 1 / 1 sentences, avg 21400k, time 0.07290s
>>> x = response.result(0).tree()
>>> x
['S', ['NP-T', ['PROPN', ['Budi']], ['NP-B', ['CCONJ', ['dan']], ['PRON', ['saya']]]], ['VP', ['VP', ['VP', ['ADV', ['mungkin']], ['VP', ['AUX', ['akan']], ['VP', ['VERB', ['VERB', ['mengejar']]], ['PRON', ['-nya']]]]], ['PP', ['ADP', ['di']], ['NP', ['NOUN', ['rumah']]]]], ['SCONJ', ['SCONJ', ['jika']], ['S', ['NP', ['NOUN', ['NUM', ['dua']], ['NOUN', ['NOUN', ['anjing']], ['ADJ', ['ADJ', ['besar']]]]], ['DET', ['itu']]], ['VP', ['AUX', ['telah']], ['VP', ['tidur']]]]]]]

I got a result similar to what I want. I need to process the result a bit so that I can get something like:
Budi 'PROPN'
dan 'CCONJ'
saya 'PRON'
mungkin 'ADV'
akan 'AUX'
mengejar 'VERB'
-nya 'PRON'
di 'ADP'
rumah 'NOUN'
jika 'SCONJ'
dua 'NUM'
anjing 'NOUN'
besar 'ADJ'
itu 'DET'
telah 'AUX'

Something like this maybe (note: Python 3 only)?

>>> def nodelabels(subtree):
...     labels = []
...     label, *children = subtree
...     for child in children:
...         if len(child) == 1:
...             labels.append((child[0], label))
...         else:
...             labels.extend(nodelabels(child))
...     return labels
... 
>>> nodelabels(x)
[('Budi', 'PROPN'), ('dan', 'CCONJ'), ('saya', 'PRON'), ('mungkin', 'ADV'), ('akan', 'AUX'), ('mengejar', 'VERB'), ('-nya', 'PRON'), ('di', 'ADP'), ('rumah', 'NOUN'), ('jika', 'SCONJ'), ('dua', 'NUM'), ('anjing', 'NOUN'), ('besar', 'ADJ'), ('itu', 'DET'), ('telah', 'AUX'), ('tidur', 'VP')]

Thank you very much, Mike! :smiley:

Hi Mike,

Given an input (from an FFTB treebanking result):
12@0@-1@-1@-1@-1@-1@-1@-1@-1@(0 subj-head 0.000000 0 2 (0 Adi 0.000000 0 1 ("Adi" 5 "token [ +FORM \\"Adi\\" +FROM \\"0\\" +TO \\"3\\" +ID diff-list [ LIST cons [ FIRST \\"0\\" REST list ] LAST list ] +POS pos [ +TAGS null +PRBS null ] +CLASS non_ne [ +INITIAL luk ] +TRAIT token_trait +PRED predsort +CARG \\"Adi\\" ]")) (0 menggonggong 0.000000 1 2 ("menggonggong" 6 "token [ +FORM \\"menggonggong\\" +FROM \\"4\\" +TO \\"16\\" +ID diff-list [ LIST cons [ FIRST \\"1\\" REST list ] LAST list ] +POS pos [ +TAGS null +PRBS null ] +CLASS non_ne [ +INITIAL luk ] +TRAIT token_trait +PRED predsort +CARG \\"menggonggong\\" ]")))@@@[ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: no-tensed E.ASPECT: non-perf-and-prog E.MOOD: mood ] RELS: < [ named_rel<0:3> LBL: h4 CARG: "Adi" ARG0: x3 [ x SPECI: bool COG-ST: cog-st PNG.PERNUM: pernum ] ] [ "proper_q_rel"<-1:-1> LBL: h6 ARG0: x3 RSTR: h7 BODY: h8 ] [ "_menggonggong_v_rel"<4:16> LBL: h1 ARG0: e2 ARG1: x3 ] > HCONS: < h0 qeq h1 h7 qeq h4 > ICONS: < > ]@

Can I use PyDelphin and the grammar ind.dat to get this output?

[('Adi', 'PROPN'), ('menggonggong', 'VERB')]

Hi @goodmami, I wonder if I could also get the lemmas from the token objects… is that possible?

What you have is the [incr tsdb()] row containing the derivation tree, and you want the node labels from the phrase structure tree, which should be in the column just before the MRS, but that field is blank.

PyDelphin does not have a way to reconstruct the feature structure from a derivation, and I don’t think ACE can do it from the command-line interface, so there’s currently no way to get this through PyDelphin’s ACE interface. In principle, though, you should be able to read in a derivation, reconstruct the feature structure, and output the phrase structure tree, perhaps with the LKB instead of PyDelphin + ACE.
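
PyDelphin can at least read the derivation itself, which gets you the surface forms and lexical entry names (though not the labels.tdl phrase labels). I believe something like the following works with a recent PyDelphin; treat it as a sketch, and note that I have abbreviated the derivation from your row by dropping the token feature structures.

from delphin import derivation

d = derivation.Derivation.from_string(
    '(0 subj-head 0.000000 0 2'
    ' (0 Adi 0.000000 0 1 ("Adi"))'
    ' (0 menggonggong 0.000000 1 2 ("menggonggong")))')

# print each surface form with the lexical entry that licensed it
for node in d.preterminals():
    print(node.daughters[0].form, node.entity)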

It would be easier if FFTB could output the labelled tree when it produces the derivation. Maybe @sweaglesw knows if this is possible?


If you have the parse table in your profile, you can extract the tokens there:

>>> from delphin import itsdb
>>> ts = itsdb.TestSuite('../../grammars/erg-trunk/tsdb/gold/mrs')
>>> ts['parse'][0]['p-input']   # "initial" tokens
'(1, 0, 1, <0:2>, 1, "It", 0, "null", "PRP" 1.0) (2, 1, 2, <3:9>, 1, "rained", 0, "null", "VBD" 1.0) (3, 2, 3, <9:10>, 1, ".", 0, "null", "." 1.0)'
>>> ts['parse'][0]['p-tokens']  # "internal" tokens
'(42, 1, 2, <3:10>, 1, "rained.", 0, "null") (44, 1, 2, <3:10>, 1, "rained.", 0, "null") (45, 1, 2, <3:10>, 1, "rained.", 0, "null") (46, 0, 1, <0:2>, 1, "it", 0, "null") (47, 0, 1, <0:2>, 1, "it", 0, "null")'

These are YY token lattices, so if you want more programmatic access, PyDelphin can help:

>>> from delphin import tokens
>>> lattice = tokens.YyTokenLattice.from_string(ts['parse'][0]['p-input'])
>>> lattice.tokens[0]
YyToken(id=1, start=0, end=1, lnk=<Lnk object <0:2> at 140462602992280>, paths=[1], form='It', surface=None, ipos=0, lrules=['null'], pos=[('PRP', 1.0)])
>>> lattice.tokens[0].form
'It'
>>> lattice.tokens[0].pos
[('PRP', 1.0)]
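
Pairing the forms with the tags is then a one-liner, e.g.:

>>> [(t.form, t.pos) for t in lattice.tokens]
[('It', [('PRP', 1.0)]), ('rained', [('VBD', 1.0)]), ('.', [('.', 1.0)])]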

The form isn’t exactly a lemma, but it’s about the closest you’ll get from those.

@arademaker is that what you meant?


Hi Mike,

Thank you for the reply! Yes, it would be easier if FFTB could output the labelled tree. I can see the labels in the FFTB annotation page but I cannot find them in the output.

Woodley @sweaglesw, do you have any idea how I can get the labels from the output?

Well, I didn’t really get what the difference is between the initial tokens and the internal tokens in your last code example. Actually, I also don’t know much yet about token mapping rules, so this is all new information that I will have to read more about.

Moreover, what is the difference between getting POS and FORM from the token object and from the lattice.tokens[0] object?

I was really talking about the real lemmas of the words. Once the grammar recognizes a form, it does the morphology, right? At some point the grammar understands that “rained” is the simple past of “rain” and associates it with the right predicate in the lexical entry. What structure can contain the lemma besides the predicates in the MRS?

BTW, I really miss having a class diagram and a more global view of the classes and methods available in the PyDelphin library. I hope you don’t give up on writing that article about PyDelphin…

Hi David,

I have updated the ‘recons’ tool in the acetools under SVN with a new -L option, which recomputes the labeled trees for a profile and stores them back. I haven’t given it a version or made a binary release yet. Does that sound like it solves your problem? Also, are you in a position to compile a copy from SVN to try? The repository in question is:

svn co http://sweaglesw.org/svn/libtsdb/trunk libtsdb

cd libtsdb

make recons

./recons -g …/indra.dat …/myprofile -L

I would make a backup of “myprofile” first, in case something doesn’t work as planned :slight_smile: If it is not convenient for you to build your own, I can make you a binary without much trouble.

-Woodley

The lemma is not going to be available from the token objects. Those objects represent the surface form only, together with annotations like the character position the token was found at, capitalization information, and POS tagger output. I believe what you are looking for is the data present on the lexical entry. The lexical entry that is triggered for a token “rained” will be named something like rain_v1 and will have [ ORTHO < “rain” >, SYNSEM.LKEYS.KEYREL.PRED “_rain_v_1_rel” ] in its feature structure. The data from the feature structure is hard to get at from PyDelphin, though. You could grab the lexeme name “rain_v1” pretty easily from the derivation tree, although from a theoretical standpoint it is not meaningful (i.e. not guaranteed to bear any particular relationship to the lemma, although in practice it ?always? does). If you want to be cleaner, you can use the lexical entry name as a key to look up the ORTH value in the lexicon (erg/lexdb.rev is a tab-separated, machine-readable version that would make this very straightforward).
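
For example, a minimal lookup could be sketched like this. The column positions are my assumption (entry name in the first tab-separated field, orthography in the sixth), so verify them against your copy of lexdb.rev:

# Sketch: map lexical entry names (as found in derivation trees)
# to their orthography using the tab-separated erg/lexdb.rev.
orth = {}
with open('lexdb.rev') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 5:
            orth[fields[0]] = fields[5]

print(orth.get('rain_v1'))  # expected: 'rain', per the discussion above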

Dear Woodley,

Thank you very much for your help!
I have tried

svn co http://sweaglesw.org/svn/libtsdb/trunk libtsdb

cd libtsdb

make recons

But I got this error message:

~/tools/libtsdb$ make recons
gcc -g -O2 recons.c -lace -ltsdb -o recons
recons.c:8:18: fatal error: tsdb.h: No such file or directory
compilation terminated.
Makefile:38: recipe for target 'recons' failed
make: *** [recons] Error 1

I checked that the file tsdb.h is there, in the same libtsdb directory.
Could you please help me with this?

Hi David,

Sorry – that was really only an outline of the required steps, and it seems I left a bit out. To be able to do that compilation, you would need to have libtsdb and libace both compiled and installed first, and unless you have root on the system, you would have to find a way to tweak the installation settings and also the compiler flags to make it work in a local installation – none of which is that much fun.

I’ve gone ahead and compiled a new binary and bundled it into the acetools release, so if you download the following you should get a working “recons” tool:

http://sweaglesw.org/linguistics/acetools/acetools-x86-0.9.30.1.tar.gz

Let me know if you have any luck with that!
Woodley

Dear Woodley,

Thank you very much! It worked.
However, out of 715 items, 27 couldn’t be reconstructed, either because of “reunification failed” or “no matching root symbol”, e.g.:

item 3 result 0: reunification failed
item 95 result 0: no matching root symbol

But when I checked the annotation page, I could see the POS labels there in the tree.

Hmm, interesting. I didn’t find even the example “rained” in any lexdb.* file. I am also curious about the size of the ERG lexicon. The lexdb.rev file contains only 55013 lines. This makes the ERG lexicon smaller than the WordNet lexicon, am I right? Or maybe this lexdb.rev file is not up to date?

$ rg rained lexdb.*
lexdb.rev_key
29369:strained_a1	danf	2003-10-31 16:00:00-08	strained
29370:strained_a1	danf	2004-09-27 13:14:31.872398-07	strained
32655:unrestrained_a1	danf	2004-12-18 08:02:03.316058-08	unrestrained
32656:unrestrainedly_adv	danf	2004-11-27 08:40:25.59655-08	unrestrainedly

lexdb.rev
17614:flame-grained_a1	danf	2006-06-10 19:35:24.826647+00	f	aj_-_i_le	flame- grained	"_flame-grained_a_1_rel"	\N\N	\N	\N	\N	\N	con	\N	\N	\N	\N	\N	\N	\N	\N	\N	EN	US	\N\N	\N	\N	1	LinGO
17615:flame-grained_a1	danf	2006-08-22 16:53:16.691019+00	t	aj_-_i_le	flame- grained	"_flame-grained_a_1_rel"	\N\N	\N	\N	\N	\N	con	\N	\N	\N	\N	\N	\N	\N	\N	\N	EN	US	\N\N	\N	\N	1	LinGO
46586:strained_a1	danf	2006-06-10 19:35:24.826647+00	f	aj_-_i_le	strained	"_strained_a_1_rel"	\N	\N\N	\N	\N	\N	con	\N	\N	\N	\N	\N	\N	\N	\N	\N	EN	US	\N	\N\N	\N	1	LinGO
51417:unrestrained_a1	danf	2006-06-10 19:35:24.826647+00	f	aj_-_i_le	unrestrained	"_unrestrained_a_1_rel"	\N	\N\N	\N	\N	\N	voc	\N	\N	\N	\N	\N	\N	\N	\N	\N	EN	US	\N	\N\N	\N	1	LinGO
51418:unrestrainedly_adv	danf	2006-06-10 19:35:24.826647+00	f	av_-_i-vp_le	unrestrainedly	"_unrestrained_a_1_rel"	\N\N	\N	\N	\N	\N	voc	\N	\N	\N	\N	\N	\N	\N	\N	\N	EN	US	\N\N	\N	\N	1	LinGO

But the number of lines with ORTH in the lexicon.tdl file is even smaller:

$ grep ORTH lexicon.tdl  | wc -l
   38259

You will only find base forms in the lexicon, not inflected forms like “rained.”

A few tens of thousands sounds like the right number for the size of the hand-built lexicon, yes. Completely regular open class words are handled by the part-of-speech tagger and “generic” lexical entries instead of the curated lexicon. Incidentally, the lemmatization approach I outlined won’t work for those…

“rained” is transformed to “rain” through a lexical rule, and there is no participial adjective, so there is no need to list it in the lexicon.
Irregular morphological forms are in irregs.tab. There is a way of generating all the inflected forms in the LKB, but I can’t remember how…
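
For instance, a form-to-stem lookup over irregs.tab could be sketched like this; it assumes the usual three-column layout (inflected form, lexical rule, stem), and the example entry in the comment is hypothetical:

# Sketch: build a form -> (rule, stem) table from irregs.tab,
# assuming whitespace-separated columns: form, lexical rule, stem.
irregs = {}
with open('irregs.tab') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3:
            form, rule, stem = parts
            irregs.setdefault(form, []).append((rule, stem))

print(irregs.get('fell'))  # hypothetical output: [('v_pst_olr', 'fall')]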

You mean in the ERG? But I can see cases where they are analyzed as verbs, like the one below:

SENT: I have been agitated long enough by your nonsense.
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: pres MOOD: indicative PROG: - PERF: + ]
RELS: < [ pron<0:1> LBL: h4 ARG0: x3  ]
 [ pronoun_q<0:1> LBL: h5 ARG0: x3 RSTR: h6 BODY: h7 ]
 [ _agitate_v_1<12:20> LBL: h1 ARG0: e2 ARG1: x8  ARG2: x3 ]
 [ _long_a_2<21:25> LBL: h1 ARG0: i9 ARG1: e2 ]
 [ _enough_x_comp<26:32> LBL: h1 ARG0: e10  ARG1: i9 ]
 [ def_explicit_q<36:40> LBL: h11 ARG0: x8 RSTR: h12 BODY: h13 ]
 [ poss<36:40> LBL: h14 ARG0: e15  ARG1: x8 ARG2: x16  ]
 [ pronoun_q<36:40> LBL: h17 ARG0: x16 RSTR: h18 BODY: h19 ]
 [ pron<36:40> LBL: h20 ARG0: x16 ]
 [ _nonsense_n_1<41:50> LBL: h14 ARG0: x8 ] >
HCONS: < h0 qeq h1 h6 qeq h4 h12 qeq h14 h18 qeq h20 >
ICONS: < e2 topic x3 > ]

but also cases where they are analyzed as adjectives

SENT: The interesting story made a compelling point.
[ LTOP: h0
INDEX: e2 [ e SF: prop TENSE: past MOOD: indicative PROG: - PERF: - ]
RELS: < [ _the_q<0:3> LBL: h4 ARG0: x3 RSTR: h5 BODY: h6 ]
 [ _interesting_a_for<4:15> LBL: h7 ARG0: e8 ARG1: x3 ARG2: i9 ]
 [ _story_n_of-about<16:21> LBL: h7 ARG0: x3 ARG1: i10 ]
 [ _make_v_1<22:26> LBL: h1 ARG0: e2 ARG1: x3 ARG2: x11  ]
 [ _a_q<27:28> LBL: h12 ARG0: x11 RSTR: h13 BODY: h14 ]
 [ compound<29:46> LBL: h15 ARG0: e16  ARG1: x11 ARG2: x17  ]
 [ udef_q<29:39> LBL: h18 ARG0: x17 RSTR: h19 BODY: h20 ]
 [ _compel_v_1<29:39> LBL: h21 ARG0: e22  ARG1: i23 ARG2: i24 ]
 [ nominalization<29:39> LBL: h25 ARG0: x17 ARG1: h21 ]
 [ _point_n_of<40:46> LBL: h15 ARG0: x11 ] >
HCONS: < h0 qeq h1 h5 qeq h7 h13 qeq h15 h19 qeq h25 >
ICONS: < > ]

I copied the examples from https://grammar.yourdictionary.com/parts-of-speech/adjectives/what-is-a-participial-adjective.html. I am not sure about the reputation of this website, but this same page says that not all participial adjectives are derived from verbs.