Morphological interfaces

Continuing the discussion from Reparsing and updating a treebank keeping previous decisions:

“Still under development” is probably generous. That paper was a conceptual one that grew out of a planned project that was never funded…


Is my guess about the compatibility with finite-state morphology right?

Somewhat related to this… I am now learning that the SRG is not the first grammar that depends on external pre-processing tools. So the internal morphology support in the LKB (*) is already recognized as insufficient for morphologically rich languages.

So now I am also curious about the possible interfaces between pre-processors and processors (LKB, ACE, PET, etc.). In LKB-FOS: new version now available, @johnca and @olzama mentioned SPPP (LkbSppp · delph-in/docs Wiki · GitHub), which is said to be superseded by SmafTop · delph-in/docs Wiki · GitHub. But the SMAF page does not mention the YY format. We don’t have a page describing the YY format, only three grammar-specific pages talking about YY input, plus the ACE YY-related information. In PetInput · delph-in/docs Wiki · GitHub I found that the YY format seems to supersede SMAF and SPPP, am I right? But this last page does not mention SPPP either.

I guess we need to revise everything in the wiki about the lattice-based input formats. A global revision of the pages would help newcomers. Can anyone add something here about the relation between SPPP and YY? Is YY the sole survivor and winner of the natural selection among the input-format species?
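For concreteness, here is a small Python sketch of what a YY-style token looks like, modeled on the example token shown on the PetInput wiki page. The field order and semantics (id, start/end vertex, character span, path, form, inflection position, inflection rules, POS tags with probabilities) are my reading of that page and should be double-checked there before relying on them.

```python
def yy_token(tid, start, end, cfrom, cto, form, tags):
    """Serialize one token in (approximately) the YY lattice format.

    tags is a list of (POS, probability) pairs; the "1" path and the
    "0, \"null\"" inflection fields are copied from the wiki example.
    """
    tag_str = " ".join(f'"{t}" {p:.4f}' for t, p in tags)
    return (f'({tid}, {start}, {end}, <{cfrom}:{cto}>, 1, '
            f'"{form}", 0, "null", {tag_str})')

# Reproduces the example token from the PetInput wiki page:
print(yy_token(42, 0, 1, 0, 11, "Tokenization",
               [("NNP", 0.7677), ("NN", 0.2323)]))
```

A pre-processor would emit one such line per token, with the start/end vertex numbers encoding the lattice topology, so ambiguous tokenizations are just multiple tokens sharing vertices.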

My motivations for this discussion are:

  1. How to move forward in PorGram, since I am anticipating that the LKB (*) morphology will be insufficient for Portuguese.

  2. In the glosstag corpus, I don’t want to reinvent the wheel. I need a format to hold the sense annotation and its projection into the semantic representation and/or the syntactic output. At the current stage, senses are annotated on top of tokenized and POS-tagged input. The inputs are also processed with the ERG, and the final semantic representation can be linked back to the surface+sense level. But the game will change once I move to WordNet 3.1 and edit the glosses.

  3. We need to keep the DELPH-IN documentation alive. @EricZinda is motivated to do that, but instead of starting from scratch and describing the overall processing from an end-user perspective, I feel that many pages in the wiki need to be revised, consolidated, and also (why not) deleted.

(*) I hope you noticed that I am struggling to find the right way to describe the morphological processing support built into the LKB and described in Section 5.2 of “Implementing Typed Feature Structure Grammars”. Can we call it the DELPH-IN, HPSG-based morphology approach, given that the Grammar Matrix generates TDL files following the description in @AnnC’s book? In other words, is it the “Morphophonology in Morphosyntax” approach? Is there a theoretically grounded morphology approach? Sorry if this is confusing…

Yes, I think so. Foma is an open-source equivalent of XFST, and finite-state techniques are generally a good match for morphophonology (with the sole exception of open-class reduplication).
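To illustrate the kind of alternation such tools handle, here is a toy Python sketch of a single XFST/foma-style replace rule (the real tools compile whole rule cascades into finite-state transducers; the rule and examples below are my own English-like illustration, not from any actual grammar):

```python
import re

def realize(stem, suffix):
    """Apply a toy morphophonological rule cascade to stem+suffix.

    Rule (illustrative): e -> 0 / _ "+" vowel   (e-deletion before a
    vowel-initial suffix), then delete the morpheme boundary "+".
    """
    form = stem + "+" + suffix
    form = re.sub(r"e\+(?=[aeiou])", "+", form)  # e-deletion rule
    return form.replace("+", "")                 # erase the boundary

print(realize("move", "ing"))  # e-deletion applies -> "moving"
print(realize("move", "s"))    # no vowel follows  -> "moves"
```

In XFST/foma the same rule would be written roughly as `e -> 0 || _ "+" Vowel`, and composing many such rules yields a single transducer mapping underlying to surface forms in both directions.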

Yes, my understanding is that it was never developed as a cross-linguistically robust approach to morphophonology. Berthold Crysmann has explored its limits and routinely gets it to do things that I would have thought were impossible, though.

Please don’t delete. A big warning at the top that the page is deprecated would be a much better alternative, ideally with pointers to more current versions.


FWIW: The current approach I’ve been advocating in the documentation working group is:

  • The Wiki is meant as a history of the project and maintains topics of all types. It is the “filing cabinet” of DELPH-IN: a good place to find anything you’d like, but not necessarily well organized, and it includes many historical docs that are out of date.
  • The Documentation Site (consider this a beta: still being reviewed, etc.) is curated to include the most current pages and carefully maintained to have a professional look, consistency, a particular scope, a table of contents, etc. It is the “public face” for people using and learning about DELPH-IN. It gets built automatically and can pull from different sources: the wiki, other GitHub projects, pages built from grammar source, etc.

I think this approach allows us to more effectively highlight the docs that are “current” and reduces the need to remove outdated material that we may want to keep for historical purposes. Pages that need to be revised or consolidated will still need that work done, though.


The best results for lemmatization in NLP nowadays are obtained by supervised approaches, as is the case for the large majority of NLP tasks (except for very specific applications with ad-hoc fine-tuned models). If you check the state of the art across NLP tasks, you will see that this is the current trend of the field, not my personal claim.
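As a minimal sketch of what “supervised” means here, the toy Python code below learns suffix-rewrite rules from (form, lemma) training pairs and applies the longest matching rule to unseen words. Real supervised lemmatizers are far more sophisticated (contextual, neural, character-level), and the Portuguese-like training pairs are invented for illustration:

```python
from collections import Counter

def common_prefix_len(a, b):
    """Length of the shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def train(pairs):
    """Count suffix rewrites (form-suffix -> lemma-suffix) in the data."""
    rules = Counter()
    for form, lemma in pairs:
        k = common_prefix_len(form, lemma)
        rules[(form[k:], lemma[k:])] += 1
    return rules

def lemmatize(form, rules):
    """Apply the longest learned suffix rule that matches the form."""
    for (src, tgt), _ in sorted(rules.items(),
                                key=lambda r: -len(r[0][0])):
        if src and form.endswith(src):
            return form[: -len(src)] + tgt
    return form

# Invented Portuguese-like training pairs, for illustration only.
rules = train([("gatos", "gato"), ("livros", "livro"),
               ("falava", "falar")])
print(lemmatize("carros", rules))  # -> "carro"
```

The point of the sketch is only that the mapping is induced from annotated data rather than hand-written, which is why such systems benefit directly from treebanks and sense-annotated corpora.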

I happened to join the discussion above just to suggest some datasets and tools for the lemmatization of Portuguese.