LKB-FOS: new version available supporting chart mapping

I’ve just made a new binary distribution of LKB-FOS (the download link is on the LkbFos wiki page). Source code is in the repository.

This new version supports chart mapping. I’ll talk about it a bit at the upcoming DELPH-IN Summit. Below is a summary of the changes, copied from the README.

John

  • Implemented token mapping and lexical filtering, based on the Adolphs et al. chart mapping paper. See the ‘LKB-FOS Update’ presentation at the DELPH-IN Summit 2022 for instructions on how to enable it. (Briefly, grammars need to set the relevant parameters listed in src/main/globals.lsp, load generic LEs from a sub-lexicon called “gle”, and read in token mapping and lexical filtering rules with read-token-mapping-file-aux and read-lexical-filtering-file-aux respectively). Post-generation mapping rules will be added in a subsequent release.
  • Added :chart-mapping to *features*, allowing the grammarian to use #+:chart-mapping and #-:chart-mapping in script files etc. to control whether to load chart mapping rules and set associated variables (see the sketch after this list). Also changed the LKB version number to 5.6.
  • Added a new parameter *show-incomplete-lex-rule-chains*, which controls whether the parse chart window shows chains of lexical rule applications that are incomplete (because a requisite rule failed to apply).
  • The default for *gen-start-symbol* is ‘sign’, which could lead to even bare lexical entries being considered as potential generator results. This previously triggered an error, since the LKB did not expect to consider a lexical entry as a potential result. Fixed.
  • Fixed a bug where processing test suite profiles in [incr tsdb()] gave up on the first item.
  • Trying to run LKB-FOS inside Emacs 27 onwards gave the message ‘Package cl is deprecated’, and lkb mode was not entered. Fixed.
  • Recent versions of the LKB-FOS Linux binary required glibc 2.28, whereas older Linux distributions, e.g. Ubuntu 18.04, ship with glibc 2.27 or earlier. Fixed by building the binary on an even older version of Linux, relying on glibc backward compatibility.
  • Several internal improvements, including: reducing the amount of indirection in parser data-structures; maintaining dag arcs in an ‘almost sorted’ order to speed up unification and subsumption; and fixing instances of memory not being released promptly after an error.
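
For example, a script file might guard its chart mapping setup with the new feature flag, so that the same script also loads in LKB versions without chart mapping support. A minimal sketch (the file names here are illustrative, not a real grammar's):

#+:chart-mapping
(progn
  ;; generic lexical entries go in a sub-lexicon called "gle"
  (read-cached-sublex-if-available
    "gle" (lkb-pathname (parent-directory) "gle.tdl"))
  ;; token mapping and lexical filtering rules
  (read-token-mapping-file-aux (lkb-pathname (parent-directory) "tmr.tdl"))
  (read-lexical-filtering-file-aux (lkb-pathname (parent-directory) "lfr.tdl")))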

Thank you, @johnca. Those are all fantastic improvements to the LKB. Given the new functionality, which features available in ACE are still missing from the LKB?

One additional question…

We have used a full-form morphological dictionary (http://github.com/LR-POR/MorphoBr) to produce the lexicon entries of our Portuguese grammar (http://github.com/LR-POR/PorGram). Does the LKB have any functionality to dump all lexical forms in the lexicon, applying all possible lexical rules? That way I could double-check our procedure, verifying that MorphoBr can be reproduced from the PorGram lexicon. Does that make sense?

I too am enthusiastic about the arrival of this new machinery, and can’t wait to try it out. I made one attempt to get it running with the ERG even before the summit tutorial, and think I got the relevant files loaded and the globals set, but the parser does not seem to come up with the generic entries it would need when I try a sentence with a token that needs chart-mapping. If you have a toy grammar that illustrates the machinery in action, I’d be glad to see if I could learn from it. Otherwise, I’ll try to be patient for another week or two.

Apologies for the too-brief, cryptic explanation of how to enable chart mapping. @Dan, I think your problem with the generic entries might be that you're loading them as a lexicon rather than as a sub-lexicon, i.e. they should be loaded with read-cached-sublex-if-available.

This is the full set of changes and additions that I’ve been using with ERG 2018 and 2020:

;;; rpp/setup.lsp
;; REPP preprocessor modules to apply when tokenizing input:
(setf *repp-interactive* '(:tokenizer :xml :ascii :quotes :wiki :quotes :gml :html))

;;; lkb/globals.lsp
;; don't ignore any rules when parsing:
(setf *parse-ignore-rules* nil)

;; token type:
(defparameter *token-type*               'token)

;; paths in token fs:
(defparameter *token-form-path*          '(+FORM))
(defparameter *token-id-path*            '(+ID))
(defparameter *token-from-path*          '(+FROM))
(defparameter *token-to-path*            '(+TO))
(defparameter *token-postags-path*       '(+TNT +TAGS))
(defparameter *token-posprobs-path*      '(+TNT +PRBS))

;; path to token feature structures in lexical items:
(defparameter *lexicon-tokens-path*      '(TOKENS +LIST))
(defparameter *lexicon-last-token-path*  '(TOKENS +LAST))

;; paths in chart mapping rules:
(defparameter *chart-mapping-context-path*  '(+CONTEXT))
(defparameter *chart-mapping-input-path*    '(+INPUT))
(defparameter *chart-mapping-output-path*   '(+OUTPUT))
(defparameter *chart-mapping-position-path* '(+POSITION))
(defparameter *chart-mapping-jump-path*     '(+JUMP))

;;; lkb/script.lsp
;; load the main lexicon, then the generic lexical entries as a
;; sub-lexicon named "gle":
(read-cached-lex-if-available
  (list
    (lkb-pathname (parent-directory) "lexicon.tdl")))
(read-cached-sublex-if-available
  "gle" (lkb-pathname (parent-directory) "gle.tdl"))

;; read the token mapping rule files, then the lexical filtering rules:
(loop for file in '(
      "tmr/gml.tdl" "tmr/ptb.tdl" "tmr/spelling.tdl" "tmr/ne1.tdl" "tmr/split.tdl"
      "tmr/ne2.tdl" "tmr/class.tdl" "tmr/ne3.tdl" "tmr/punctuation.tdl"
      "tmr/pos.tdl" "tmr/finis.tdl")
    do
    (read-token-mapping-file-aux (lkb-pathname (parent-directory) file)))
(read-lexical-filtering-file-aux (lkb-pathname (parent-directory) "lfr.tdl"))

@arademaker, good questions. My view is that the LKB and ACE have different goals, so one should not expect feature parity. The LKB focuses more on supporting grammar development, while ACE is better suited to being part of a text processing pipeline. (In some scenarios, though, this distinction is less clear-cut.)

Here are some features that the LKB doesn't have: an interface to a POS tagger, the ability to run ubertagging models, treebank construction from parse forests, robust processing of extra-grammatical inputs, and disambiguation using grandparent features (the LKB currently only uses parent features).

On the other hand, distinctive features of the LKB are: close integration with [incr tsdb()], an Emacs interface, a built-in menu-driven GUI, convenient browsing of type hierarchies, and fast loading of very large type hierarchies.

Regarding dumping lexical forms, I think there's something suitable; I'll check and respond separately.

Thanks, @johnca. I had figured out the sub-lexicon issue, so the only missing piece for me was the *repp-interactive* setting. Now I can happily get most of the ERG chart-mapping rules to apply. What I am still missing is the treatment of unknown words determined by POS tags. Does the new machinery include use of a tagger (presumably TnT for the ERG), and if so, do I need to do something further to invoke it, so that the tag properties are present on the tokens and the relevant chart-mapping rules can apply?

Oh, sorry @johnca, I've now read your response to @arademaker and see that POS tagging is not part of the story. No worries. The new functionality you've enabled will be most welcome.

Can the LKB understand the YY input mode? That would allow an external POS tagger to give the ERG enough information to instantiate the generic LEs, right?
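
If I remember the format correctly from the DELPH-IN wiki (so treat the details as illustrative), a YY token carries POS tags and their probabilities in its final fields, e.g.:

(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)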

There are a few ways in which POS tagging could be integrated into the LKB. (In fact, code already exists that invokes TnT, Genia, ChaSen and RASP – originally to interface [incr tsdb()] with PET.)

One way of adding tagging would be for the LKB to start up a TnT sub-process and send each sentence to it (like ACE's --tnt-model option). Another would be to implement an HMM tagger within the LKB (like ACE's english-pos-tagger configuration parameter). Alternatively, tagging could be supported more loosely, by making it easy to run a tagger to enrich the input representation before processing proper starts (as in @arademaker's suggestion).
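
To give a flavour of the first option, a sub-process wrapper might look something like the sketch below. This is not part of the LKB: it assumes SBCL, a tnt binary on the PATH, and a model called "wsj", and the exact TnT command line may differ; a real integration would also parse the output and put the tags onto tokens via *token-postags-path* and *token-posprobs-path*.

(defun tag-with-tnt (tokens &key (model "wsj"))
  ;; write one token per line to a temporary file, as TnT expects
  (let ((tmp (format nil "/tmp/tnt-input-~D.txt" (random 1000000))))
    (with-open-file (out tmp :direction :output :if-exists :supersede)
      (dolist (tok tokens) (write-line tok out)))
    ;; run TnT over the file, returning its raw output as a string
    (with-output-to-string (s)
      (sb-ext:run-program "tnt" (list model tmp)
                          :search t :output s :error nil))))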

I could look into this, but my first question is: to what extent would adding pos tagging make life better for a grammar developer? A further question is: how tightly should tagging be integrated to make it convenient to use in grammar development?

@arademaker , you asked about dumping all lexical forms after the application of all possible lexical rules. Here’s a code snippet that you could adapt to do this:

(defun output-derived-instance-as-tdl (str fs stream lex-name idno)
  ;; output just the derived (inflected) form, one per line
  (declare (ignore fs lex-name idno))
  (format stream "~&~A~%" str))

(let ((*maximal-lex-rule-applications* 2))
  ;; dump each lexical entry and its derived forms to ~/expanded.lex
  (output-lex-and-derived :tdl "~/expanded.lex" nil))

This limits the number of lexical rule applications to 2 in order to stop the output blowing up exponentially. You could define output-derived-instance-as-tdl differently to output other information, or you might have to change output-lex-and-derived to get what you want.

Thank you @johnca, I will try to run that code before the Summit, so that I can ask for help if I can't make it work. Concretely, I want to dump the lexicon from the LKB in a format similar to our MorphoBr.

I suspect the closest I can get is something like:

babas baba n-pl-suffix [...]

that is, the inflected form, the lemma, and the rules that were applied to produce it.
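
Perhaps a variant of John's snippet along these lines would get close (hypothetical: it assumes the lex-name argument holds the identifier of the source lexical entry, and since the applied rule names are not among the function's arguments, listing them would presumably need changes to output-lex-and-derived itself):

(defun output-derived-instance-as-tdl (str fs stream lex-name idno)
  ;; print the inflected form and the source entry name side by side
  (declare (ignore fs idno))
  (format stream "~&~A ~A~%" str lex-name))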

On the question of the utility of having a tagger for grammar development, I don’t see an urgent need for it for work on the ERG, but other grammar developers may find it more important. While it is convenient to take a grammatically problematic example directly from a corpus and present it to the LKB, it’s usually more practical to simplify the sentence to best expose the problematic construction(s) in the example, so substituting known lexical entries for any out-of-vocabulary ones is easy enough.
As for tightness of integration: if a tagger is supported, it would be good to parallel what PET and ACE provide, allowing the choice of a tagger so that the relevant chart-mapping rules can define the mapping from that tagger's tagset to generic lexical entries.