Handling UNKnown words with ACE+SRG

(Note: this is a distinct issue from Using generic lexical entries with the SRG+ACE)

With the old SRG (the logon version), which uses Freeling 3.0 and runs in the logon LKB, I can parse sentences containing things like proper nouns:

[Screenshot of the LKB parse, 2022-10-03; image not available]

With the ACE setup, we updated the Freeling tags in the grammar to the Freeling 4.0 versions, and we run Freeling separately and then convert its output to YY mode. I have enabled generics (though I am not sure they are even relevant here, given that Freeling provides a tag which can in turn be found in the grammar and linked to a lexical type via the Freeling “stem”), but I still cannot parse, e.g., proper nouns:

(1, 0, 1, <0:9>, 1, "pitágoras" "Pitágoras", 0, "NP00SP0", "NP00SP0" 1) (2, 1, 2, <10:15>, 1, "ladrar" "ladró", 0, "VMIS3S0", "VMIS3S0" 1) (3, 2, 3, <16:17>, 1, "." ".", 0, "Fp", "Fp" 1)
NOTE: lexemes do not span position 0 `pitágoras'!
NOTE: post reduction gap
SKIP: (yy mode)

The Freeling tag for proper nouns has changed but that is reflected in the grammar:

; -- proper names
np00sp0 :=
%suffix (np00sp0 np00sp0)
np00sp0_ilr.

and:

; -- named entities
np00sp0_ilr :=  infl-ltow-rule & 
  [ SYNSEM.LOCAL.CAT.HEAD noun & [ KEYS.KEY named_rel ] ].

and:

pname := n_-_pn_le & 
  [ STEM < "np00sp0" > ].

Does anyone know what else I am missing in this setup I am trying to use with ACE?

Does anybody have any ideas about this? (P.S.: I hope it’s not email issues preventing replies from being posted, but if so, please email me!)

I tried using exactly the same YY input you posted with the olzama-dev branch of SRG and got a successful parse:

$ ~/cdev/ace/ace -g srg.dat -y --yy-rules -1Tf
(1, 0, 1, <0:9>, 1, "pitágoras" "Pitágoras", 0, "NP00SP0", "NP00SP0" 1) (2, 1, 2, <10:15>, 1, "ladrar" "ladró", 0, "VMIS3S0", "VMIS3S0" 1) (3, 2, 3, <16:17>, 1, "." ".", 0, "Fp", "Fp" 1)
SENT: (yy mode)
[ LTOP: h0
INDEX: event2 [ event SORT: semsort E.TENSE: ppast E.ASPECT: aspect E.MOOD: ind SF: prop ]
RELS: < [ named_rel<-1:-1> LBL: handle4 [ handle SORT: semsort ] CARG: string WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind3 [ ref-ind SORT: non-temp PNG.PN: 3sg PNG.GEN: masc_or_fem PRONTYPE: not_pron DEF: bool DIVISIBLE: - ] ARG1: semarg9 [ semarg SORT: semsort ] ]
 [ "_generic_v_rel"<-1:-1> LBL: handle1 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: event2 ARG1: ref-ind3 ARG2: ref-ind13 [ ref-ind SORT: semsort PNG.PN: 3per PNG.GEN: gender PRONTYPE: not_pron DEF: bool DIVISIBLE: bool ] ]
 [ udef_q_rel<-1:-1> LBL: handle14 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind13 RSTR: handle18 [ handle SORT: semsort ] BODY: handle19 [ handle SORT: semsort ] ]
 [ "_generic_n_rel"<-1:-1> LBL: handle20 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind13 ] >
HCONS: < h0 qeq handle1 handle18 qeq handle20 > ]
NOTE: 1 readings, added 9148 / 8687 edges to chart (1206 fully instantiated, 178 actives used, 1352 passives used) RAM: 90062k

I did have to apply the changes you mentioned to the grammar by hand, as they don’t seem to be in the git repository…?

Here are the complete diffs I have against the git repo:

$ git diff | cat
diff --git a/ace/config.tdl b/ace/config.tdl
index adf4806..f5dd149 100644
--- a/ace/config.tdl
+++ b/ace/config.tdl
@@ -19,7 +19,95 @@ version                   := "../Version.lsp".
 
 irregular-forms 	  := ../irregs.tab.
 
-quickcheck-code           := "../ace/ace-qc.txt".
+;quickcheck-code           := "../ace/ace-qc.txt".
+
+:begin :instance.
+
+qc_unif_set := *top* &
+[ ARGS.SYNSEM.LOCAL.CAT.HEAD "0" #| 1062708 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS "1" #| 762446 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD "2" #| 547322 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SUBJ "3" #| 335982 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.KEYS.KEY "4" #| 318034 |#,
+  ARGS.SYNSEM.LOCAL.CAT.MC "5" #| 316814 |#,
+  ARGS.INFLECTED "6" #| 287620 |#,
+  ARGS.SYNSEM.LOCAL.CONT.HOOK.INDEX "7" #| 165854 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.CLTS "8" #| 152099 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SPR "9" #| 130597 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH.LIST "10" #| 120750 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD.FIRST.LOCAL.CAT.HEAD "11" #| 119531 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH "12" #| 112842 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD.FIRST.LOCAL "13" #| 110827 |#,
+  ARGS.SYNSEM "14" #| 109174 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD.FIRST.LOCAL.CAT.VAL.SPR "15" #| 100989 |#,
+  ARGS.SYNSEM.LOCAL.AGR.PNG.PN "16" #| 92179 |#,
+  ARGS "17" #| 86641 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.REST.FIRST.PRED "18" #| 81556 |#,
+  ARGS.SYNSEM.LOCAL.COORD-STRAT "19" #| 77483 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.FIRST "20" #| 75468 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.VFORM "21" #| 75277 |#,
+  ARGS.SYNSEM.LOCAL.COORD "22" #| 51355 |#,
+  ARGS.SYNSEM.NON-LOCAL.REL "23" #| 50937 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.FIRST.PRED "24" #| 44912 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.KEYS.ALTKEY "25" #| 40093 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH.LAST "26" #| 35079 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.LOCAL "27" #| 33524 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.REST "28" #| 33108 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.TAM.MOOD "29" #| 33010 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH.LIST.FIRST.CAT.HEAD "30" #| 25870 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.INV "31" #| 25009 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.LOCAL.CAT.MC "32" #| 21633 |#,
+  ARGS.SYNSEM.LIGHT "33" #| 18786 |#,
+  ARGS.SYNSEM.LKEYS.KEYREL "34" #| 17935 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.OPT "35" #| 17913 |#,
+  ARGS.SYNSEM.LOCAL.CAT.POSTHEAD "36" #| 16988 |#,
+  ARGS.SYNSEM.PUNCT.RPUNCT "37" #| 16080 |#,
+  ARGS.SYNSEM.NON-LOCAL.QUE "38" #| 15197 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH.LIST.FIRST.CAT.HEAD.KEYS.KEY "39" #| 11901 |#,
+  ARGS.SYNSEM.LOCAL.CONT.HOOK.INDEX.SORT "40" #| 9742 |#,
+  ARGS.SYNSEM.LKEYS.KEYREL.PRED "41" #| 8621 |#,
+  ARGS.SYNSEM.LOCAL.AGR.PNG.GEN "42" #| 6062 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SUBJ.FIRST "43" #| 5783 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SUBJ.FIRST.OPT "44" #| 5574 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.REST.REST.FIRST.PRED "45" #| 5134 |#,
+  ARGS.SYNSEM.LKEYS.KEYREL.ARG0 "46" #| 4956 |#,
+  ARGS.SYNSEM.LOCAL.CONT.HOOK.INDEX.PNG.PN "47" #| 4586 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SPR.FIRST.OPT "48" #| 4384 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.CLTCZD "49" #| 3876 |#,
+  ARGS.ALTS.VCALT "50" #| 3858 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.REST.FIRST.ARG0.SORT "51" #| 3809 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.REST.FIRST.ARG0.PNG.PN "52" #| 2857 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.REST.FIRST.LOCAL "53" #| 2557 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.CLTS.REST "54" #| 2222 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.NON-LOCAL.SLASH "55" #| 2216 |#,
+  ARGS.ALTS "56" #| 2197 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.KEYS.ALT2KEY "57" #| 2068 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.NON-LOCAL.SLASH.LIST "58" #| 1988 |#,
+  ARGS.SYNSEM.NON-LOCAL.SLASH.LIST.FIRST.CAT.VAL.SPR "59" #| 1953 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SPR.FIRST.LOCAL.CAT.HEAD.KEYS.KEY "60" #| 1860 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SUBJ.FIRST.NON-LOCAL.SLASH.LIST "61" #| 1855 |#,
+  ARGS.SYNSEM.LOCAL.STR.HEADING "62" #| 1827 |#,
+  ARGS.SYNSEM.MODIFIED "63" #| 1721 |#,
+  ARGS.SYNSEM.LOCAL.AGR.DIVISIBLE "64" #| 1208 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.VOICE "65" #| 1180 |#,
+  ARGS.SYNSEM.LOCAL.CONT.HOOK.INDEX.E.TENSE "66" #| 1018 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD.FIRST.LOCAL.CONT.HOOK.XARG.PNG.PN "67" #| 1008 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.COMPS.FIRST.NON-LOCAL.SLASH.LAST "68" #| 978 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SPEC.FIRST.LOCAL.CAT.HEAD.KEYS.ALTKEY "69" #| 745 |#,
+  ARGS.ALTS.CAUS "70" #| 701 |#,
+  ARGS.SYNSEM.LOCAL.CONT.RELS.LIST.REST.REST.FIRST.ARG0.PNG.GEN "71" #| 518 |#,
+  ARGS.ALTS.IMPERS "72" #| 367 |#,
+  ARGS.SYNSEM.LOCAL.CONT.HOOK.INDEX.E.MOOD "73" #| 315 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL.SPR.FIRST.LOCAL.CONT.RELS.LIST.REST "74" #| 201 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.AUX "75" #| 104 |#,
+  ARGS.SYNSEM.LKEYS.ALTKEYREL.PRED "76" #| 76 |#,
+  ARGS.SYNSEM.PUNCT.LPUNCT "77" #| 29 |#,
+  ARGS.SYNSEM.LOCAL.CAT.VAL "78" #| 26 |#,
+  ARGS.SYNSEM.LOCAL.CAT.HEAD.MOD.FIRST.PUNCT.RPUNCT "79" #| 26 |# ].
+
+:end :instance.
+
+quickcheck-instance           := qc_unif_set.
 
 ;post-model-path           := "english-postagger.hmm".
 
diff --git a/generics.tdl b/generics.tdl
index 3de654c..cb94d89 100644
--- a/generics.tdl
+++ b/generics.tdl
@@ -1,8 +1,12 @@
+pname := n_-_pn_le &
+  [ STEM < "np00sp0" > ].
+
+
 ;;; Hey, emacs (1), this is -*- mode:tdl; Coding: utf-8; -*-
 
 
 n_-_mc_ge := n_-_mc_le &
-[ STEM < *top* >,
+[ STEM < "n_-_mc_ge" >,
   SYNSEM.LKEYS.KEYREL.PRED "_generic_n_rel" ].
 
 ;n_pp_mc_ge := n_pp_mc_le &
@@ -10,17 +14,17 @@ n_-_mc_ge := n_-_mc_le &
 ;  SYNSEM.LKEYS.KEYREL.PRED "_generic_n_rel" ].
 
 v_np_ge := v_np*_le &
-[ STEM < *top* >,
+[ STEM < "vp_np_ge" >,
   SYNSEM.LKEYS.KEYREL [ PRED "_generic_v_rel" ] ].
 
 aj_-_i_ge := aj_-_i_le & 
-[ STEM < *top* >,
+[ STEM < "aj_-_i_ge" >,
   SYNSEM.LKEYS.KEYREL.PRED "_generic_a_rel" ].
 
 av_-_i-sm_ge := av_-_i-sm_le &
-[ STEM < *top* >,
+[ STEM < "av_-_i-sm_ge" >,
   SYNSEM.LKEYS.KEYREL.PRED "_generic_x_rel" ].
 
 av_-_i-vm-spd_ge := av_-_i-vm-spd_le &
-[ STEM < *top* >,
+[ STEM < "av_-_i-sm_ge" >,
   SYNSEM.LKEYS.KEYREL.PRED "_generic_x_rel" ].
diff --git a/inflr.tdl b/inflr.tdl
index 5a42357..51e8a49 100644
--- a/inflr.tdl
+++ b/inflr.tdl
@@ -1,4 +1,10 @@
 
+; -- proper names
+np00sp0 :=
+%suffix (np00sp0 np00sp0)
+np00sp0_ilr.
+
+
 ;;; Hey, emacs (1), this is -*- mode:tdl; Coding: utf-8; -*-
 ;;;
 ;;;  Montserrat Marimon  
diff --git a/irtypes.tdl b/irtypes.tdl
index 1ff3319..5eb8a00 100644
--- a/irtypes.tdl
+++ b/irtypes.tdl
@@ -1,4 +1,8 @@
 
+; -- named entities
+np00sp0_ilr :=  infl-ltow-rule &
+  [ SYNSEM.LOCAL.CAT.HEAD noun & [ KEYS.KEY named_rel ] ].
+
 ;;; Hey, emacs (1), this is -*- mode:tdl; Coding: utf-8; -*-
 ;;; 
 ;;;  Montserrat Marimon
diff --git a/srg.tdl b/srg.tdl
index 50b395d..71061e9 100644
--- a/srg.tdl
+++ b/srg.tdl
@@ -40,9 +40,9 @@
 :include "lexicon".
 :end :instance.
 
-;:begin :instance :status generic-lex-entry.
-;:include "generics".
-;:end :instance.
+:begin :instance :status generic-lex-entry.
+:include "generics".
+:end :instance.
 
 ;;
 ;; grammar rules and lexical rules (instances of status rule)

Thanks, Woodley, for looking into this!

Hmmm.

In your diffs, are the lines starting with + from the repo? I first assumed they were yours, because the quickcheck diffs look like something that is not in the repo to me either (adding them did not help me). But then the changes to generics.tdl and irtypes.tdl all look like they are already in the repo. Until now I had not checked in the diff for srg.tdl, but I did have that change locally all this time (and I have now checked it in anyway).

I am still getting no parse for this Pitagoras sentence somehow…

Not sure what can be the difference between your setup and mine. Do you mind sending me your grammar by email?

Which version of ACE were you using? I am using 0.9.34.

Thanks again!

Thanks for sending me the grammar via email, @sweaglesw !

(It appears that your grammar is somehow a mix of the one that is currently in the olzama-dev branch of the repo and an older version. For example, generics.tdl is updated but inflr.tdl and irtypes.tdl are not? Not sure how that happened, assuming you pulled the same dev version from the repo…)

Comparing our versions, it appears that I can get the same result as you by adding the following to generics.tdl:

pname := n_-_pn_le &
  [ STEM < "np00sp0" > ].

It’s encouraging that I can get the same result and in particular that I can get a parse.

However, I note that this leads to every token being mapped to a generic entry, somehow. Not only in sentences which have personal names in them but also in just any sentence.

That does not seem right; does anyone have ideas about why this is and how to fix it?

To resummarize the issue:

I am currently getting all lexical entries as generic entries, no matter what I parse. What could be the reason for that and how to proceed?

[Screenshot, 2022-10-21; image not available]

I pulled my tree directly from the github link you shared with me, and then made edits until it seemed to work. I added the pname entry to my generics.tdl file since you had posted that you had created such an entry. I didn’t think deeply about whether it was a good idea ;-). Did you add yours to the main lexicon instead of the generics lexicon by any chance? That would certainly give different results.

Entries in the main lexicon are triggered by their STEM value. “Pitagoras” will never match “np00sp0”. If you want it to, you will have to have your script that makes the YY format change the stem for those proper nouns to “np00sp0”. Something along these lines is probably how this used to work in the old days. I can’t recall how the original orthography would have made it onto the CARG of the named_rel in this world – at least in the case where YY input is used, this is not something ACE ever supported.

Fast forwarding to the “modern” era…

Entries in the generic lexicon are instead triggered by their token feature structure. The parser tries to postulate every generic lexical entry on every token in the input. The grammar is supposed to either make sure those token feature structures are incompatible in all but the cases where the generic lexical entry is actually wanted, or use the so-called lexical filtering phase to remove unwanted generic lexical entries.

Getting all that token feature structure stuff to work properly may be a bit of an exercise, since SRG does not currently seem to have the necessary feature geometry.

Yes. (Still mysterious about the code version! I am sure the files in the olzama-dev branch have all these changes… Oh well).

I didn’t exactly add the entry, rather I updated the existing entry to have the Freeling 4.0 tag instead of the old Freeling tag. It used to be np00000 and now it is np00sp0. The entry lives in lexicon.tdl, in the version of the grammar that I got from the logon tree and started updating with respect to the Freeling tags.

OK. Then the task at hand is to make ACE handle unknown words properly with the SRG. That is, tokens that cannot be mapped to anything in the lexicon should trigger generic entries, but tokens that can be found in the lexicon should not. It is of course crucial that I make this work; otherwise the grammar is not usable.

You are saying that the SRG currently lacks the proper “feature geometry”.

How do I start working on this? I need to understand exactly what I would need to do, in order to decide whether it’s doable with the time and resources that I have.

Looking at the YYToken objects in old SRG treebanks, they include the following structure:

YYToken(id=4, start=3, end=4, lnk=<Lnk object  at 140347593146000>, paths=[0], form='NP00000', surface='Madrid', ipos=0, lrules=['$np00000'], pos=[])

Here, the surface form “Madrid” is explicitly mapped to the form NP00000 (this is an older version of the NP00SP0 tag).

Would I be closer to my goal if I had such YYToken objects? These come from the old treebanks, which work with the older version of the grammar, using the LKB and the prebuilt Freeling interface binary; I need something like that in place for the updated grammar and ACE… But it would be great to at least know which intermediate representations are required.

Ah hah, I see. That suggests that the previous solution to this problem was indeed to have the script that converts FreeLing to YY rewrite the form. You then would be aiming to produce a YY input looking more like this, I think?

(1, 0, 1, <0:9>, 1, "NP00SP0" "Pitágoras", 0, "NP00SP0", "NP00SP0" 1) (2, 1, 2, <10:15>, 1, "ladrar" "ladró", 0, "VMIS3S0", "VMIS3S0" 1) (3, 2, 3, <16:17>, 1, "." ".", 0, "Fp", "Fp" 1)
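To make the rewriting step concrete, here is a minimal Python sketch of a converter that puts the tag in the form slot for proper nouns and the lemma everywhere else. The `FreelingToken` class, its field names, and the `TAG_AS_STEM` set are hypothetical names made up for this illustration (they are not FreeLing's or the SRG's API), and the sketch does no escaping of quotes inside surface forms.

```python
from dataclasses import dataclass


# Hypothetical container for one FreeLing analysis; the field names are
# assumptions of this sketch, not FreeLing's API.
@dataclass
class FreelingToken:
    surface: str  # word as it appears in the sentence
    lemma: str    # FreeLing lemma
    tag: str      # FreeLing 4.0 tag, e.g. "NP00SP0"
    start: int    # character offset where the token starts
    end: int      # character offset where the token ends


# Tags whose tokens should be looked up in the lexicon by tag, not by lemma.
# Only the proper-noun tag discussed in this thread is listed.
TAG_AS_STEM = {"NP00SP0"}


def to_yy(tokens):
    """Render tokens as one YY-mode input line in the layout used above."""
    parts = []
    for i, tok in enumerate(tokens):
        # Rewrite the form to the tag for proper nouns; keep the lemma otherwise.
        form = tok.tag if tok.tag in TAG_AS_STEM else tok.lemma
        parts.append(
            f'({i + 1}, {i}, {i + 1}, <{tok.start}:{tok.end}>, 1, '
            f'"{form}" "{tok.surface}", 0, "{tok.tag}", "{tok.tag}" 1)'
        )
    return " ".join(parts)


sentence = [
    FreelingToken("Pitágoras", "pitágoras", "NP00SP0", 0, 9),
    FreelingToken("ladró", "ladrar", "VMIS3S0", 10, 15),
    FreelingToken(".", ".", "Fp", 16, 17),
]
print(to_yy(sentence))
```

For this three-token sentence the printed line is the same YY input as the one shown just above the sketch.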

With that input, and with all the generics disabled, and after editing the pname in the main lexicon, I get a lexical gap for the punctuation token – but if I delete it I get a parse that is close to what you want:

[ LTOP: h0
INDEX: event2 [ event SORT: semsort E.TENSE: ppast E.ASPECT: aspect E.MOOD: ind SF: prop ]
RELS: < [ named_rel<-1:-1> LBL: handle4 [ handle SORT: semsort ] CARG: string WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind3 [ ref-ind SORT: non-temp PNG.PN: 3sg PNG.GEN: masc_or_fem PRONTYPE: not_pron DEF: bool DIVISIBLE: - ] ARG1: semarg9 [ semarg SORT: semsort ] ]
 [ "_ladrar_v_rel"<-1:-1> LBL: handle1 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: event2 ARG1: ref-ind3 ] >
HCONS: < h0 qeq handle1 > ]

Notably deficient is the underspecified CARG value on the named_rel. At least with the current ACE, I don’t think there is any way to make that link without going to the token feature structure universe.

If you want to tackle the task of updating the SRG to the modern token paradigm, here’s approximately what that would involve:

  • add feature geometry defining a token; lexical entries get an extra feature which is a list of tokens realized thereby. In the ERG this feature is introduced on word_or_lexrule, so that it appears automatically on all lexical entries and can be propagated through lexical rules for reference, but disappears once syntax begins. Take a look at the file tmt.tdl in the ERG; you may be able to copy pretty much all of that over wholesale.
  • tell ACE about said feature geometry and enable token mapping; the sections of the ERG’s ace/config.tdl with subheadings “token settings” and “lattice mapping settings” can likely be copied unchanged if you used the ERG’s tmt.tdl.
  • modify the types or entries in generics.tdl to constrain the values of the tokens they span. The simplest thing would, I believe, be to just have each of them stipulate a value for TOKENS.+LIST.FIRST.+TNT.+MAIN.+TAG indicating which FreeLing tag should trigger it. That will prevent those generic entries from firing for tokens that have a different tag.
  • assuming you make pname a generic lexical entry, add a reentrancy from its KEYREL.CARG to its TOKENS.+LIST.FIRST.+FORM
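
The effect of the last two bullets can be pictured with a toy Python model. This is only an illustration of the intended behaviour: ACE would do this by unification over token feature structures defined in TDL, not with dictionaries. The `GENERIC_TRIGGER_TAG` table is hypothetical; the `pname` pairing follows this thread, the `n_-_mc_ge` pairing is invented, and the case-insensitive tag comparison is an assumption of the sketch.

```python
# Toy model only: it mimics the outcome of constraining +TAG on generic
# entries, not how ACE computes it.

# Hypothetical mapping from a generic entry to the FreeLing tag it stipulates.
GENERIC_TRIGGER_TAG = {
    "pname": "np00sp0",       # pairing taken from this thread
    "n_-_mc_ge": "ncms000",   # pairing invented for illustration
}


def licensed_generics(token):
    """Return (entry, carg) pairs for generics whose tag matches the token."""
    tag = token["tag"].lower()  # case-insensitive match: an assumption here
    result = []
    for entry, wanted in GENERIC_TRIGGER_TAG.items():
        if tag == wanted:
            # Mimic the KEYREL.CARG / +FORM reentrancy for proper names only.
            carg = token["form"] if entry == "pname" else None
            result.append((entry, carg))
    return result


tokens = [
    {"form": "Pitágoras", "tag": "NP00SP0"},
    {"form": "ladró", "tag": "VMIS3S0"},
]
for tok in tokens:
    print(tok["form"], licensed_generics(tok))
```

In this model the proper noun licenses only `pname`, with its surface form carried over as the CARG, and the verb licenses no generic at all because no entry stipulates its tag.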

The above is just my guess at what it might take to get this to work. I am probably missing some things!


Thank you very much, @sweaglesw !

We just had a meeting with Francis and Dan and decided that indeed implementing the proper feature geometry is probably the best way forward. Your instructions are very useful. I will try to work on this in the next few days.

I need this working of course not only for proper names but for unknown words generally. Hopefully it will all work out in the end :).

So, to clarify:

In the state in which the SRG is now (no token mapping), using ACE with YY input, with generic lexical entries enabled in the grammar (but nothing special added there, just the typical generic entries, i.e. no entry for the proper names Freeling tag), is it expected to get every entry parsed as a generic entry?

This is what I am observing, even on input which contains no personal names and no rare words, just mi gato duerme for example.

I want to make sure I am at the right starting point.

(42, 0, 1, <0:2>, 1, "mi" "mi", 0, "dp1css") (43, 1, 2, <4:8>, 1, "perro" "perro", 0, "ncms000") (44, 2, 3, <9:15>, 1, "dormir" "duerme", 0, "vmip3s0")
SENT: (yy mode)
[ LTOP: h0
INDEX: event2 [ event SORT: semsort E.TENSE: pres E.ASPECT: aspect E.MOOD: ind SF: prop ]
RELS: < [ _mi_q_rel<-1:-1> LBL: handle4 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind3 [ ref-ind SORT: semsort PNG.PN: 3sg PNG.GEN: gender PRONTYPE: not_pron DEF: bool DIVISIBLE: bool ] RSTR: handle8 [ handle SORT: semsort ] BODY: handle9 [ handle SORT: semsort ] ]
 [ poss_rel<-1:-1> LBL: handle10 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: individual14 [ individual SORT: semsort ] ARG1: ref-ind3 ARG2: ref-ind15 [ ref-ind SORT: entity PNG.PN: pernum PNG.GEN: gender PRONTYPE: prontype DEF: bool DIVISIBLE: bool ] ]
 [ pronoun_q_rel<-1:-1> LBL: handle16 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind15 RSTR: handle20 [ handle SORT: semsort ] BODY: handle21 [ handle SORT: semsort ] ]
 [ pron_rel<-1:-1> LBL: handle22 [ handle SORT: semsort ] WLINK: c5 CFROM: *top* CTO: *top* ARG0: ref-ind15 ]
 [ "_generic_n_rel"<-1:-1> LBL: handle10 WLINK: list CFROM: *top* CTO: *top* ARG0: ref-ind3 ]
 [ "_generic_v_rel"<-1:-1> LBL: handle1 [ handle SORT: semsort ] WLINK: list CFROM: *top* CTO: *top* ARG0: event2 ARG1: ref-ind3 ARG2: semarg31 [ semarg SORT: semsort ] ] >
HCONS: < h0 qeq handle1 handle8 qeq handle10 handle20 qeq handle22 > ]
NOTE: 1 readings, added 1686 / 1225 edges to chart (135 fully instantiated, 108 actives used, 133 passives used)	RAM: 14363k

I clarified above that I am asking about the grammar without any modifications. So, with no additional entries anywhere, just the generics.tdl imported in the grammar, is it normal to get every word from the input come out as a generic entry, given that I haven’t yet implemented token mapping? Tagging @Dan @bond and @sweaglesw but please let me know if my question is unclear. I will then give excerpts illustrating parts of the grammar, I guess.

I think what you see is a good starting point for adding the token mapping machinery for unknown word handling (since right now there is nothing stopping those generic entries from being added to the chart). Once you have those rules as we discussed, in the “tmr” subdirectory (along with the file “tmt.tdl” which defines the token-mapping types), you will need to add a little more information to your grammar so that you can discard a generic entry when, for example, you already have an entry for that stem defined in your lexicon. This filtering of generics is done in the ERG via rules defined in a file called “lfr.tdl” which we did not discuss earlier. You can decide how you want to encode in your grammar the contrast between native (known) and unknown words; you’ll need some feature which is true for one set and false for the other. I doubt that you’ll want the same contrast that the ERG uses, via the ONSET feature, since this is grammar-specific, so choose a new attribute on the type word, say [NATIVE bool], and set its value to + for ordinary lexical types, but - for the generic types. Then use that distinction in the filtering rules in lfr.tdl, which are called after the rules defined in the “tmr” subdirectory. I hope this will be clear as you make the additions to your grammar.
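
As a rough picture of what such filtering is meant to achieve, here is a toy Python sketch. It is not how lfr.tdl rules are written or applied (those are TDL rules run by the parser); the chart layout and the entry name `mi_det` are hypothetical, and the boolean in each pair stands in for the proposed [NATIVE bool] feature.

```python
# Toy model of the filtering idea, under the assumptions stated above.

def filter_generics(chart):
    """Drop generic (NATIVE -) entries on spans that also have a native entry.

    `chart` maps a token span (start, end) to a list of (entry_name, native)
    pairs, where `native` plays the role of the proposed NATIVE feature.
    """
    filtered = {}
    for span, entries in chart.items():
        has_native = any(native for _, native in entries)
        # Keep native entries always; keep generics only when nothing
        # native covers the span.
        filtered[span] = [
            (name, native)
            for name, native in entries
            if native or not has_native
        ]
    return filtered


chart = {
    # Known word: one native entry plus two postulated generics.
    (0, 1): [("mi_det", True), ("n_-_mc_ge", False), ("v_np_ge", False)],
    # Unknown word: only generics are available.
    (1, 2): [("n_-_mc_ge", False), ("v_np_ge", False)],
}
result = filter_generics(chart)
for span, entries in result.items():
    print(span, entries)
```

The known word keeps only its native entry, while the unknown word keeps its generics, which is the contrast the NATIVE feature is supposed to encode.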


What about unknown verbs, non-proper-name nouns, etc.? Those will just have normal tags, e.g. VMIP3P0 for “plural 3 person present tense”.

In other words, it will be tags which should not always trigger generic entries. Should I still be using them (and then there would be some other stage (lexical filtering?..) to ensure correct behavior), or am I misunderstanding this step?

The fact that I used a proper name as an example was an accident :sweat_smile: I’d like to focus on unknown words generally for now.

@sweaglesw also notes (but the Discourse email server doesn’t seem to like that):

“It is expected that generic lexemes are licensed for every input token. It is NOT expected that that would block the normal readings offered by the grammar. What you should be seeing is massive ambiguity caused by those highly promiscuous generics, on top of regular analyses that are there when you don’t include the generic lexical entries (at least for sentences that are in scope for the grammar).”

The above is indeed what I am seeing; I just forgot I was calling ACE with a -1 flag which would only leave one reading on display.
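
For a sense of scale, here is a back-of-the-envelope Python sketch of how unconstrained generics multiply the lexical choices. It assumes one native entry per token (an assumption, not a measured value) and uses the five generic entries that are active in the generics.tdl shown in the diff; it counts only ways of picking one lexical entry per token, ignoring lexical rules and whatever the syntax later rules out.

```python
from math import prod

# The five generic entries that are active in the generics.tdl from the diff.
GENERICS = [
    "n_-_mc_ge",
    "v_np_ge",
    "aj_-_i_ge",
    "av_-_i-sm_ge",
    "av_-_i-vm-spd_ge",
]


def lexical_sequences(native_counts, n_generics):
    """Count the ways to pick one lexical entry for each token."""
    return prod(n + n_generics for n in native_counts)


native = [1, 1, 1]  # assumed: one native entry per token, three tokens
without = lexical_sequences(native, 0)
with_generics = lexical_sequences(native, len(GENERICS))
print(without, with_generics)
```

Under these assumptions a three-token sentence goes from 1 lexical sequence to 216, so showing only the first reading can easily surface a generic analysis.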