Dump lexical entries (all inflected forms)

I hijacked the topic LKB-FOS: new version available supporting chart mapping - #2 by arademaker. Sorry! I am copying the discussion here to expand on it and to ask more questions about the original idea of dumping all lexical entries from a grammar.

@johnca suggested some initial code in LKB-FOS: new version available supporting chart mapping - #10 by johnca. But I need more than just the generated form, as I said in LKB-FOS: new version available supporting chart mapping - #11 by arademaker. Using the original output-derived-instance-as-tdl function, I got a file with entries like the ones below after loading our PorGram (https://github.com/LR-POR/PorGram):

PASSAR_V10_0 :=
  nom-gen-goa-ditransitive-verb-lex & [ STEM   cons & [ FIRST  "passar",
           REST   null ],
...

:begin :instance.

PASSAR_V10_15 :=
  pres-subj-1sg-lex-rule & [ STEM   cons & [ FIRST  "PASSE",
           REST   list ],
...

So I need to capture

PASSAR_V10_0 nom-gen-goa-ditransitive-verb-lex "passar"
PASSAR_V10_15 pres-subj-1sg-lex-rule "PASSE"
  1. Surely LKB already has accessor functions for the feature structure, right? How do I get the most specific type and the string value of the attribute STEM.FIRST?
  2. Why are the inflected (derived) entries printed in uppercase?
  3. What does *maximal-lex-rule-applications* control? What is the order of rule application?

OK, I see. The output you want corresponds more closely to the option :ebl than :tdl. I suggest the following:

(defun output-for-ebl (orth fs ostream rule-list base-id base-fs ostream2)
  (declare (ignore fs base-fs ostream2))
  (format ostream "~&~A ~:A ~S~%" base-id (reverse rule-list) orth))

(let ((*maximal-lex-rule-applications* 2))
  (output-lex-and-derived :ebl "~/expanded.lex" nil))

A sample of output (from ERG 2018) is:

HIRE_V1 () "hire"
HIRE_V1 (v_3s-fin_olr) "HIRES"
HIRE_V1 (v_psp_olr) "HIRED"
HIRE_V1 (v_pst_olr) "HIRED"
HIRE_V1 (v_prp_olr) "HIRING"
HIRE_V1 (v_prp-nf_olr) "HIRIN"
...
HIRE_V1 (v_v-out_dlr w_sqright_plr) "OUT-HIRE'"
HIRE_V1 (v_v-out_dlr w_sqleft_plr) "'OUT-HIRE"
HIRE_V1 (v_v-out_dlr w_hyphen_plr) "OUT-HIRE-"
...

Derived entries are uppercase because morphological rules and inputs are canonicalised.

The parameter *maximal-lex-rule-applications* is described in the LKB User Manual and in the source code file main/globals.lsp:

the number of lexical rule applications which may be made before it is assumed that some rules are applying circularly

If lexical rules can apply to their own outputs then morphological generation will never stop. Setting *maximal-lex-rule-applications* will force it to stop eventually.
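
To make the need for a cap concrete, here is a toy model in plain Common Lisp. It is only a sketch: the names *toy-rules* and expand-forms are hypothetical, rules are modelled as simple string functions, and none of this reflects how the LKB actually represents or orders lexical rule applications. The single rule can always apply to its own output, so without the limit argument the recursion would never end.

```lisp
;; Toy model only: rules are (name . function) pairs on strings.
;; This is NOT how the LKB stores or applies lexical rules.
(defparameter *toy-rules*
  (list (cons 'plural (lambda (s) (concatenate 'string s "S")))))

(defun expand-forms (form rules limit &optional applied)
  "Return (rule-names form) pairs reachable with at most LIMIT rule applications."
  (cons (list (reverse applied) form)
        (when (plusp limit)
          ;; Try every rule on the current form, one level deeper.
          (loop for (name . fn) in rules
                append (expand-forms (funcall fn form) rules (1- limit)
                                     (cons name applied))))))

(expand-forms "CASA" *toy-rules* 2)
;; => ((NIL "CASA") ((PLURAL) "CASAS") ((PLURAL PLURAL) "CASASS"))
```

With a limit of 2 the circular rule yields one unwanted form and then stops, which is the role the quoted description gives to *maximal-lex-rule-applications*.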

If PorGram contains any derivational or punctuation lexical rules, you’ll probably want to disable them; otherwise you’ll get extraneous forms analogous to 'OUT-HIRE above.

Hmm, thank you. Now I have to understand what the ERG sets up that PorGram does not. When I use the code above after loading the ERG, I get the output you showed above. If I then load PorGram without closing LKB, I can also generate:

ESTRANHO () "estranho"
ESTRANHO (MASC-LEX) "estranho"
ESTRANHO (A-SG-LEX) "estranho"
ESTRANHO (A-PL-SUFFIX) "ESTRANHOS"
ESTRANHO (FEM-SUFFIX) "ESTRANHA"
ESTRANHO (MASC-LEX A-SG-LEX) "estranho"
ESTRANHO (MASC-LEX A-PL-SUFFIX) "ESTRANHOS"
ESTRANHO (FEM-SUFFIX A-SG-LEX) "ESTRANHA"
ESTRANHO (FEM-SUFFIX A-PL-SUFFIX) "ESTRANHAS"
...

But if I open LKB and load PorGram directly, my output only contains the base forms:

ESTRANHO () "estranho"
...

Besides that, the output above shows some redundancy in our lexicon: ‘estranho’ is produced by MASC-LEX, by A-SG-LEX, and by (MASC-LEX A-SG-LEX).

Hi @johnca, if possible, I would like to understand the dump above, mainly why the ‘estranho’ form was produced four times.

Our understanding is that some forms are ‘incomplete’.

ESTRANHO (A-SG-LEX) "estranho" shows only the partial inflection of number,

ESTRANHO (MASC-LEX) "estranho" shows the partial inflection of gender.

Does that make sense? I am trying to understand the code of output-lex-and-derived, and I am unsure whether I can control it to produce only the final forms or whether we need to improve the lexicon modeling in PorGram. BTW, what does EBL stand for?

In general it’s a bad idea to load a grammar into the LKB on top of a different one that’s already loaded. See LkbFaq · delph-in/docs Wiki · GitHub

Although the LKB is generally robust to loading revised versions of the same grammar, loading a completely different grammar into an LKB session may cause problems (because of incompatible globals or user-fns files…

However, it does help me diagnose the behaviour you report!

The :ebl option ignores lexical rules that are not inflectional, as computed by the function inflectional-rule-p in io-general/outputsrc.lsp. This function returns true only if the parameter *lex-rule-suffix* has a string value and the name of the rule the function is called on ends with this string. The ERG sets this parameter to the empty string; PorGram does not set it, so it retains its default value of nil.

Therefore, when you load PorGram into a fresh LKB session, inflectional-rule-p returns false for all lexical rules, so output-lex-and-derived with option :ebl ignores them all. If you load PorGram on top of the ERG, *lex-rule-suffix* is still the empty string, and since every rule name ends with the empty string, inflectional-rule-p returns true for all lexical rules, so all of them may be applied.
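
The suffix test described above can be imitated in a few lines of plain Common Lisp. This is a sketch, not the LKB source of inflectional-rule-p: the helper name rule-name-ends-with-p is hypothetical, it takes the rule name as a string, and the case-insensitive comparison is an assumption.

```lisp
;; Hypothetical stand-in for the suffix test described above;
;; NOT the actual LKB definition of inflectional-rule-p.
(defun rule-name-ends-with-p (rule-name suffix)
  "True only if SUFFIX is a string and RULE-NAME ends with it."
  (and (stringp suffix)
       (let ((start (- (length rule-name) (length suffix))))
         (and (>= start 0)
              ;; Case-insensitive comparison is an assumption.
              (string-equal suffix rule-name :start2 start)))))

(rule-name-ends-with-p "FEM-SUFFIX" nil)        ; false: default value of nil
(rule-name-ends-with-p "FEM-SUFFIX" "")         ; true: "" ends every name
(rule-name-ends-with-p "FEM-SUFFIX" "-SUFFIX")  ; true
(rule-name-ends-with-p "MASC-LEX" "-SUFFIX")    ; false
```

If this reading is right, giving *lex-rule-suffix* a string value in PorGram's own globals file (the empty string, as the ERG does, or a suffix shared only by the inflectional rules) should make the :ebl dump work in a fresh session. That is a suggestion to try, not something verified against PorGram.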

EBL stands for explanation-based learning, and I think a morphological generation dump was used in the paper Applying Explanation-based Learning to Control and Speeding-up Natural Language Generation.

Regarding the partial inflection ‘incomplete’ forms issue, I haven’t got any ideas so I’m afraid I can’t advise.
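
Independently of the modelling question, one possible workaround is to post-process the dump. Each line written by output-for-ebl above is three readable Lisp objects (a symbol, a list, a string), so the file can be read back with the standard reader. The sketch below keeps, for each entry id, only the lines with the longest rule list. That is a heuristic, not a definition of a "complete" form, and the names *sample-dump*, read-dump-entries, and longest-derivations are hypothetical. It also assumes that ids and rule names are readable as plain symbols.

```lisp
;; Sample data copied from the PorGram output shown earlier in the thread.
(defparameter *sample-dump*
  "ESTRANHO () \"estranho\"
ESTRANHO (MASC-LEX) \"estranho\"
ESTRANHO (A-SG-LEX) \"estranho\"
ESTRANHO (A-PL-SUFFIX) \"ESTRANHOS\"
ESTRANHO (FEM-SUFFIX) \"ESTRANHA\"
ESTRANHO (MASC-LEX A-SG-LEX) \"estranho\"
ESTRANHO (MASC-LEX A-PL-SUFFIX) \"ESTRANHOS\"
ESTRANHO (FEM-SUFFIX A-SG-LEX) \"ESTRANHA\"
ESTRANHO (FEM-SUFFIX A-PL-SUFFIX) \"ESTRANHAS\"")

(defun read-dump-entries (stream)
  "Read (id rules orth) triples from STREAM until end of file."
  (let ((*read-eval* nil))              ; never evaluate #. forms in input
    (loop for id = (read stream nil :eof)
          until (eq id :eof)
          collect (list id (read stream) (read stream)))))

(defun longest-derivations (entries)
  "Keep, for each id, only the entries whose rule list is longest."
  (let ((max-len (make-hash-table :test #'eq)))
    ;; First pass: record the longest rule list seen for each id.
    (dolist (e entries)
      (setf (gethash (first e) max-len)
            (max (length (second e)) (gethash (first e) max-len 0))))
    ;; Second pass: keep entries that reach that length, in input order.
    (remove-if-not (lambda (e)
                     (= (length (second e)) (gethash (first e) max-len)))
                   entries)))

(with-input-from-string (in *sample-dump*)
  (mapcar #'third (longest-derivations (read-dump-entries in))))
;; => ("estranho" "ESTRANHOS" "ESTRANHA" "ESTRANHAS")
```

For a real dump, replace with-input-from-string by with-open-file on the output file. The heuristic would not suit a grammar whose derivational rules are still enabled, since those add longer rule lists such as the OUT-HIRE forms above.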