Spanish clitics and Freeling tokenizer/morphological analyzer

I am trying to understand how clitics are supposed to work in the SRG.

The SRG currently relies on Freeling for tokenization, morphological analysis, and POS tagging. I then convert the Freeling output to YY format and feed it to ACE.

In the original grammar, which relies on a specially built binary for the Freeling interface, the clitics work:

[Screenshot, 2023-04-14: parse from the original grammar, showing the +PP3CNA00 rule]

As you can see, a rule called +PP3CNA00 is involved. I can find that rule in the grammar but I don’t understand how it is supposed to work. Here’s what it looks like:

+PP3CNA00 := 
%suffix (vmlo2 vmlo2) 
pp3cna_ilr.

What I don’t understand is what kind of YY input I should have in order for this suffix rule to fire.

Right now, with the way I call Freeling, I get the following:

(1, 0, 1, <0:6>, 1, "costar" "cuesta", 0, "vmip3s0", "vmip3s0" 0.9419642857142858) (2, 1, 2, <7:12>, 1, "creer" "creer", 0, "vmn0000", "vmn0000" 1.0) (3, 2, 3, <12:14>, 1, "lo" "lo", 0, "pp3msa0", "pp3msa0" 1.0) (4, 3, 4, <14:15>, 1, "." ".", 0, "fp", "fp" 1.0)

This is probably not the right thing; however, I don’t understand what the right thing should be.

One relevant thing I found is in the Freeling data files:

## -------------- ENCLITIC PRONOUNS ----------------------
##	clitics	only	admited	after	gerund,	0		or	imperative
## If you want to admit them after any form (e.g. ancient forms "Díjoselo", "Viósela")
## you should replace the pattern "^V.[GNM]" below with just "^V"
lo	*	^V.[GNM]	*	1	1	0	L	1	$$+lo:$$+PP
los	*	^V.[GNM]	*	1	1	0	L	1	$$+los:$$+PP
la	*	^V.[GNM]	*	1	1	0	L	1	$$+la:$$+PP
...

This is in the afixos.dat file, which I believe I am passing to Freeling, because I have the morphological analyzer setting called “affixes” enabled.
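The tag condition in the rules quoted above can be checked in isolation. This is a minimal sketch, assuming that Freeling applies `^V.[GNM]` to the uppercase tag much as Python's `re` module would; the function name `can_host_enclitic` and the sample tag list are illustrative and do not come from Freeling.

```python
import re

# Tag condition copied from the afixos.dat rules quoted above.
CLITIC_HOST = re.compile(r"^V.[GNM]")


def can_host_enclitic(tag: str) -> bool:
    """Return True if an uppercase verb tag satisfies the afixos.dat pattern."""
    return CLITIC_HOST.search(tag) is not None


# Illustrative tags: infinitive, gerund, imperative, and a finite indicative form.
for tag in ["VMN0000", "VMG0000", "VMM02S0", "VMIP3S0"]:
    print(tag, can_host_enclitic(tag))
```

The first three tags satisfy the pattern and the finite form does not. The match is case-sensitive, so the lowercase tags that appear in the YY input (e.g. "vmn0000") would need to be uppercased before a check like this.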

@arademaker, you know Freeling pretty well; do you have any ideas about what I am doing wrong in how I use Freeling?

All, regardless of freeling, what kind of result is the grammar expecting, based on those regular expressions above? How would a suffix vmlo2 materialize? (Although if I replace lo with vmlo2 in the YY string and add this +PP3CNA00 tag instead of the one it’s getting, I don’t have any success either.)

Here’s how I am using freeling currently:

        self.la=pyfreeling.lang_ident(self.DATA+"common/lang_ident/ident-few.dat")
        # create options set for maco analyzer. Default values are Ok, except for data files.
        self.LANG="es"
        self.op= pyfreeling.maco_options(self.LANG)
        self.op.set_data_files( "",
                           self.DATA + "common/punct.dat",
                           self.DATA + self.LANG + "/dicc.src",
                           self.DATA + self.LANG + "/afixos.dat",
                           "",
                           self.DATA + self.LANG + "/locucions.dat",
                           self.DATA + self.LANG + "/np.dat",
                           self.DATA + self.LANG + "/quantities.dat",
                           self.DATA + self.LANG + "/probabilitats.dat")

        # create analyzers
        self.tk=pyfreeling.tokenizer(self.DATA+self.LANG+"/tokenizer.dat")
        self.sp=pyfreeling.splitter(self.DATA+self.LANG+"/splitter.dat")
        self.mf=pyfreeling.maco(self.op)

        # activate morpho modules to be used in next call
        self.mf.set_active_options(umap=False, num=True, pun=True, dat=False,  # select which among created
                              dic=True, aff=True, comp=False, rtk=True,  # submodules are to be used.
                              mw=False, ner=True, qt=False, prb=True )  # default: all created submodules are used

        self.tg=pyfreeling.hmm_tagger(self.DATA+self.LANG+"/tagger.dat",True,1)

...
            # now suppose lin is a sentence:
            s = self.tk.tokenize(lin)
            s = self.sp.split(sid,s,False)
            s = self.mf.analyze(s)
            s = self.tg.analyze(s)

Affixes and dictionary search are enabled, the affixes file is given… I don’t know what else.

I am studying the Freeling docs, but they are vast and not really aimed at Python API users, and so far I have been unable to work out how this should be done.

I have a strong suspicion that in such cases, the input to the grammar should include something like:

"creer+lo" "creerlo", 0, "vmn0000+PP3CNA00"

but that doesn’t work, not surprisingly, because such a rule (“vmn0000+PP3CNA00”) is not in the grammar. I am missing some important piece here: either there was some postprocessing, or something else I don’t see. In particular, I don’t know how many tokens there should be, or, if there is only one token, how the two tags should work together.

I have the old grammar working but I don’t understand how to debug this there either.

Potential clues from the source code for the original freeling interface (I keep forgetting there is source code for that, because it lives in a separate folder in the logon tree):

    wstring tag = a->get_tag();
    wstring alemma = a->get_lemma();
    wstring clitics;
    toXML(alemma);
    
    if (a->is_retokenizable()) {  // clitics
      list<word> rtk=a->get_retokenizable();

      // verb tag
      list<word>::iterator r=rtk.begin();
      tag = r->get_tag();
      r++;
      // clitics tags
      while (r!= rtk.end()) {
	clitics += L"<str>+"+r->get_tag()+L"</str>";
	r++;
      }
    }

...
...
...

    result += L"      <edge source=\""+util::int2wstring(pos)+L"\" target=\""+util::int2wstring(posf)+L"\">\n";
    result += L"        <fs type=\"token\">\n";
    result += L"           <f name=\"+FORM\"><str>"+form+L"</str></f>\n";
    result += L"           <f name=\"+FROM\"><str>"+util::int2wstring(start)+L"</str></f>\n";
    result += L"           <f name=\"+TO\"><str>"+util::int2wstring(finish)+L"</str></f>\n";    
    result += L"           <f name=\"+STEM\"><str>"+stem+L"</str></f>\n";    
    result += L"           <f name=\"+TAG\">"+rid+L"</f>\n";
    if (not clitics.empty()) result += L"           <f name=\"+CLIT\" org=\"list\">"+clitics+L"</f>\n";
    result += L"        </fs>\n";
    result += L"      </edge>\n";   
  }
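For comparison, the clitic-handling part of the C++ excerpt above can be mirrored in Python. This is only a sketch: it assumes the pyfreeling bindings expose `is_retokenizable()`, `get_retokenizable()` and `get_tag()` under the same names as the C++ API, which has not been verified in this thread. `FakeWord` and `FakeAnalysis` are hypothetical stand-ins so the logic can be run without Freeling, and the tags in the demo are placeholders.

```python
def verb_and_clitic_tags(analysis):
    """Mirror the C++ logic: the first retokenizable part gives the verb tag,
    and every following part gives a '+TAG' clitic entry."""
    if not analysis.is_retokenizable():
        return analysis.get_tag(), []
    parts = list(analysis.get_retokenizable())
    verb_tag = parts[0].get_tag()
    clitic_rules = ["+" + part.get_tag() for part in parts[1:]]
    return verb_tag, clitic_rules


class FakeWord:
    """Hypothetical stand-in for a Freeling word (only get_tag is needed)."""

    def __init__(self, tag):
        self._tag = tag

    def get_tag(self):
        return self._tag


class FakeAnalysis:
    """Hypothetical stand-in for a Freeling analysis object."""

    def __init__(self, tag, parts=()):
        self._tag = tag
        self._parts = list(parts)

    def get_tag(self):
        return self._tag

    def is_retokenizable(self):
        return bool(self._parts)

    def get_retokenizable(self):
        return self._parts


# Placeholder tags for a verb followed by one enclitic.
demo = FakeAnalysis("VMN0000", [FakeWord("VMN0000"), FakeWord("PP3CNA00")])
print(verb_and_clitic_tags(demo))
```

One thing that may be worth checking, stated here as an assumption rather than a confirmed diagnosis: in the C++ API the second constructor argument of `hmm_tagger` controls whether retokenizable words are split after tagging. With `True`, as in the setup earlier in this post, the verb and the clitic may already be separate words by the time the tokens are read, in which case there would be no retokenization information left to inspect.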

We think that in any case, in order to get an analysis where creer and lo are one word, they must be one token at the level of YY-input (even if Freeling tokenizes them into two).

The wiki says:

Each token in the above example has the following format:

(id, start, end, [link,] path+, form [surface], ipos, lrule+[, {pos p}+])
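To make the field order concrete, here is a small serializer for one token in that format. It is a sketch rather than a reference implementation: the function name `yy_token` is made up, the path is fixed to `1` and `ipos` to `0` as in the examples in this thread, and quotes inside forms are not escaped.

```python
def yy_token(tok_id, start, end, cfrom, cto, lemma, surface, lrules, tags):
    """Serialize one token as
    (id, start, end, <cfrom:cto>, path, form surface, ipos, lrule+, {pos p}+).

    lrules is a list of lexical rule names; tags is a list of (tag, prob) pairs.
    """
    rules = " ".join(f'"{rule}"' for rule in lrules)
    pos = " ".join(f'"{tag}" {prob}' for tag, prob in tags)
    return (
        f"({tok_id}, {start}, {end}, <{cfrom}:{cto}>, 1, "
        f'"{lemma}" "{surface}", 0, {rules}, {pos})'
    )


# Reproduces the first token of the Freeling-derived input shown earlier.
print(yy_token(1, 0, 1, 0, 6, "costar", "cuesta",
               ["vmip3s0"], [("vmip3s0", 0.9419642857142858)]))
```

Since `lrules` is a list, a token with more than one lexical rule is written by passing several names; they come out as space-separated quoted strings in the `lrule+` slot.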

It would appear that the correct YY-input might then be (scroll all the way to the right):

(1, 0, 1, <0:6>, 1, "costar" "cuesta", 0, "vmip3s0", "vmip3s0" 0.946188340807175) (2, 1, 2, <7:14>, 1, "creer" "creerlo", 0, "vmn0000" "+pp3cna00", "+pp3cna00" 1.0) (3, 2, 3, <14:15>, 1, "." ".", 0, "fp", "fp" 1.0)

Indeed, I do get a parse, yay! I am not yet seeing the surface forms where I want them, but I hope this is a minor issue:

Full ACE output:

(1, 0, 1, <0:6>, 1, "costar" "cuesta", 0, "vmip3s0", "vmip3s0" 0.946188340807175) (2, 1, 2, <7:14>, 1, "creer" "creerlo", 0, "vmn0000" "+pp3cna00", "+pp3cna00" 1.0) (3, 2, 3, <14:15>, 1, "." ".", 0, "fp", "fp" 1.0)
SENT: (yy mode)
[ LTOP: h0 INDEX: event2 [ event SORT: semsort E.TENSE: pres E.ASPECT: aspect E.MOOD: ind SF: prop ] RELS: < [ "_costar_v_rel"<-1:-1> LBL: handle1 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: event2 ARG1: handle7 [ handle SORT: semsort ] ARG2: ref-ind8 [ ref-ind SORT: ani_bpart_hum_soc PNG.PN: pernum PNG.GEN: gender PRONTYPE: prontype DEF: bool DIVISIBLE: bool ] ]  [ "_creer_v_rel"<-1:-1> LBL: handle3 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: event12 [ event SORT: semsort E.TENSE: nontense E.ASPECT: no-aspect E.MOOD: mood SF: prop ] ARG1: ref-ind8 ARG2: ref-ind13 [ ref-ind SORT: semsort PNG.PN: 3sg PNG.GEN: masc PRONTYPE: prontype DEF: bool DIVISIBLE: bool ] ] > HCONS: < h0 qeq handle1 handle7 qeq handle3 > ] ;  (1715 hd-sb_c -1.536990 0 3 (1710 hd_optcmp-v_c -0.119250 0 1 (1709 vmip3s0 0.000000 0 1 (1708 v_ppa*_sbj-vp-inf-oc_dlr 0.000000 0 1 (7 costar_v-pp_a-sbj_cp_p_sub 0.000000 0 1 ("costar" 1 "token [ +FORM \"costar\" +FROM \"0\" +TO \"6\" +ID diff-list [ LIST cons [ FIRST \"1\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"vmip3s0\" REST null ] +PRBS cons [ FIRST \"0.946188\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]"))))) (1714 hd-pt_c -1.417740 1 3 (1712 +PP3CNA00 -1.417740 1 2 (1711 vmn0000 -1.417740 1 2 (1693 v_acc_dlr 0.000000 1 2 (9 creer_v-np_rcp 0.000000 1 2 ("creer" 2 "token [ +FORM \"creer\" +FROM \"7\" +TO \"14\" +ID diff-list [ LIST cons [ FIRST \"2\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"+pp3cna00\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]"))))) (1713 fp 0.000000 2 3 (17 fstop_pt 0.000000 2 3 ("." 
3 "token [ +FORM \".\" +FROM \"14\" +TO \"15\" +ID diff-list [ LIST cons [ FIRST \"3\" REST list ] LAST list ] +POS pos [ +TAGS cons [ FIRST \"fp\" REST null ] +PRBS cons [ FIRST \"1.000000\" REST null ] ] +CLASS token_class +TRAIT token_trait +PRED predsort +CARG string ]")))))
NOTE: 1 readings, added 1054 / 593 edges to chart (453 fully instantiated, 27 actives used, 235 passives used)	RAM: 6281k
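The step that made this work, collapsing Freeling's separate verb and clitic tokens into one YY token with two lexical rules, could be sketched roughly as follows. The dict layout and the name `merge_verb_clitic` are assumptions made for illustration. The clitic rule name is passed in by the caller, because the thread does not spell out how Freeling's "pp3msa0" maps to the grammar's "+pp3cna00".

```python
def merge_verb_clitic(verb, clitic, clitic_rule):
    """Collapse a verb token and the enclitic that follows it into one token.

    verb and clitic are dicts with 'cfrom', 'cto', 'lemma', 'form' and 'tag'.
    clitic_rule is the grammar's suffix rule name, supplied by the caller.
    """
    if verb["cto"] != clitic["cfrom"]:
        raise ValueError("clitic does not start where the verb ends")
    return {
        "cfrom": verb["cfrom"],
        "cto": clitic["cto"],
        "lemma": verb["lemma"],                 # the stem stays the verb lemma
        "form": verb["form"] + clitic["form"],  # surface form of the whole word
        "lrules": [verb["tag"], clitic_rule],   # verb inflection, then clitic
    }


# Character spans and tags taken from the Freeling output earlier in the thread.
verb = {"cfrom": 7, "cto": 12, "lemma": "creer", "form": "creer", "tag": "vmn0000"}
clitic = {"cfrom": 12, "cto": 14, "lemma": "lo", "form": "lo", "tag": "pp3msa0"}
print(merge_verb_clitic(verb, clitic, "+pp3cna00"))
```

The ids and chart positions of the following tokens also need renumbering after a merge (the final period moves from id 4 to id 3 in the working input above); that bookkeeping is left out here.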

As usual, special thanks to @Dan, who said yesterday in the GE meeting: “Surely it is documented somewhere how to write YY when you have more than one lexical rule”. It turned out to be so simple in the end!

:tada: :tada: :tada:

Good investigative work! As you put together some documentation on the wiki for your Spanish grammar reboot effort, please remember to add the link for that wiki page, presumably DELPH-IN Garage

@Dan you mean, a link to SrgTop? Should it be added to… the GrammarCatalogue? But none of the links there appear to be working at the moment?

@olzama I think @Dan means this wiki page

Hmm, well, that’s the PetInput wiki page, but I am still not sure where I am supposed to add it…