Generates in lkb/runs out of RAM in ace (clausal mods marked by adverbs)

@sweaglesw and @johnca – if you are interested, I have a grammar from 567 this quarter where there are some sentences that will generate in the LKB but not in ace. The issue seems connected with the semantically empty adverbs associated with one variant of the clausal mods library in the Matrix. I wasn’t able to create a smaller example in the time I had available, but the students have given me permission to share the grammar with you if you want to look at it…

Not a direct response, but because I’m seeing a fair number of related emails, perhaps we could have a session at the Summit about trying to pin down formally what the generator is expected to do? I think there are a number of cases where we could/should have flags that allowed more general but less efficient generation.

I’d be interested (but unlikely to have much expertise to contribute)!

Sounds like a great idea. Ann, could you volunteer to lead the session?

Happy to lead.

This will be most useful if people contribute examples of the generator(s) behaving in ways they don’t expect/like for a particular grammar. I can try to collate the examples I get ahead of time. I don’t think I will be able to look at individual grammars and find out what’s happening, though.

I was able to verify that ACE is spinning out arbitrarily long sentences with extra copies of semantically empty adverbs. The problem is not that ACE refuses to generate the desired string, but rather that it gets bogged down in an infinite loop of producing longer and longer constituents that could have been part of realizations. @johnca, does LKB have an intentional mechanism for preventing this? Here is an example of an input that ACE considers grammatical, and therefore tries to include in its enumeration during generation:

貓 睏 就 就 就 就 就 就 就 就 狗 睏

… and similarly with any number of 就 in that position, whereas the desired output has just one. I am in no position to judge whether such strings are truly grammatical in Southern Min.

I think it would be a mistake to analyse Hokkien/Southern-Min 就/tō as semantically empty, although I can see why it’s an easy analysis, since many of its discourse functions don’t have an “obvious” semantics.

The sentence is “grammatical” in the same way as: “that’s, you know, you know, you know, you know, you know, a problem”, which the ERG can parse, apparently also with a reading where all the "you know"s are semantically empty.

This particular analysis, from the clausal mods library of the customization system, links an EP with the presence of the adverb but introduces the EP through a phrase structure rule, so the adverb itself is semantically empty.

Oh I see! In which case, I suppose there would be some feature to track whether the adverb has been seen, and that feature could presumably also be used to block multiple instances of the adverb. (Assuming we want to block this.)

Indeed, the analysis in the customization system can be improved so that sentences with multiple copies of the adverb are blocked.
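To make the feature-tracking idea concrete, here is a toy Python sketch. It is not TDL and not the customization system's actual analysis: the names `Clause`, `attach_adverb` and `adv_seen` are hypothetical, and where the adverb attaches is not meant to reflect the real grammar. It only shows how a boolean feature that records "an adverb has been attached" would make a second attachment fail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    """Toy stand-in for a clause sign; `adv_seen` is a hypothetical feature."""
    words: tuple
    adv_seen: bool = False


def attach_adverb(clause: Clause, adverb: str = "就") -> Optional[Clause]:
    """Attach a semantically empty adverb unless one was already attached."""
    if clause.adv_seen:
        # The rule does not apply, much like a unification failure on the feature.
        return None
    return Clause((adverb,) + clause.words, adv_seen=True)


once = attach_adverb(Clause(("狗", "睏")))
twice = attach_adverb(once)
print(" ".join(once.words))  # 就 狗 睏
print(twice)                 # None
```

With a constraint like this, the set of realisations for a given input becomes finite again, so neither generator would need a resource limit to terminate on this construction.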

As for the Summit topic that @AnnC proposed: what is the desired behavior of the generator(s) when a grammar does this? It seems similar but not identical to the case where an individual rule spins, which (I think) leads to the “probable runaway rule” error.

There is a bug in the LKB generator that prevents such sentences being generated. The bug is difficult to explain, but very briefly it’s caused by inadvertent overuse of a long-standing technique for maximising subgraph sharing in feature structures (as described by Malouf et al. 2000, pp. 33-36).

I’ve fixed the bug, and now, using normal generation settings, the LKB gets bogged down in the same way as ACE, hypothesising longer and longer constituents that might form part of a realisation. However, it’s relatively simple to get the LKB to perform a breadth-first search of the generation space and return complete realisations in order of length, stopping when resource limits are exceeded. With this setting and a limit of 10K chart edges it produces the following strings.

貓 睏 就 狗 睏
貓 睏 就 就 狗 睏
貓 睏 就 就 就 狗 睏
貓 睏 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 就 就 就 就 狗 睏
貓 睏 就 就 就 就 就 就 就 就 就 就 就 就 狗 睏

(If anyone is interested I can explain how the grammarian could turn on this breadth-first search strategy). The bug fix will be in the next release of LKB-FOS.
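For readers who want a feel for the strategy before the details, here is a rough Python sketch of shortest-first enumeration with resource limits. It is a toy under stated assumptions, not the LKB's chart generator: the function `realisations`, its parameters, and the idea of counting each popped agenda item as one "edge" are all made up for illustration, and the only source of ambiguity modelled is the stackable adverb.

```python
import heapq
from itertools import count


def realisations(max_results=12, max_edges=10000, adverb="就"):
    """Enumerate toy realisations shortest-first, within resource limits."""
    prefix, suffix = ["貓", "睏"], ["狗", "睏"]
    tie = count()  # tie-breaker so the heap never has to compare lists
    agenda = [(1, next(tie), [adverb])]  # (length, tie, adverb sequence)
    results, edges = [], 0
    while agenda and len(results) < max_results and edges < max_edges:
        length, _, advs = heapq.heappop(agenda)
        edges += 1  # toy assumption: one popped item counts as one chart edge
        results.append(" ".join(prefix + advs + suffix))
        # The empty adverb can always be stacked once more, so the search
        # space is infinite and only the limits make the loop stop.
        heapq.heappush(agenda, (length + 1, next(tie), advs + [adverb]))
    return results


for string in realisations(max_results=3):
    print(string)
```

Because the agenda is ordered by length, the shortest realisation always comes out first, and whichever limit is hit first determines how many of the longer variants are returned.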

I’m interested!

I am interested too.

You can get this breadth-first search in LKB-FOS (and in principle also in older LKB implementations) by setting four parameters, as follows:

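;; Stop after finding up to 12 realisations.
(setq *gen-first-only-p* 12)
;; Give up once 10000 chart edges have been created.
(setq *maximum-number-of-edges* 10000)
;; Do not rank the realisations by likelihood.
(setq *unpacking-scoring-hook* nil)
;; Score each task by the reciprocal of the number of leaves it covers,
;; which biases the search towards shorter realisations.
(setq *gen-scoring-hook*
      #'(lambda (x)
          (case (first x)
            (:lexicon 1)
            (:rule (/ 1 (length (g-edge-leaves (third x)))))
            (:active (/ 1 (+ (length (g-edge-leaves (second x)))
                             (length (g-edge-leaves (third x)))))))))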

These parameter settings direct the LKB to: produce up to 12 realisations; give up searching for realisations once 10000 chart edges have been created; not attempt to rank the realisations in terms of likelihood; and bias the search towards shorter realisations first.
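For anyone who does not read Lisp, the scoring hook can be paraphrased in Python roughly as follows. This is only an analogue: tasks are modelled as plain tuples, each edge is replaced directly by its list of leaves, and the assumption that larger scores are tried first is inferred from the description above rather than taken from the LKB source.

```python
from fractions import Fraction


def score(task):
    """Toy analogue of the hook: reciprocal of the number of leaves covered."""
    kind = task[0]
    if kind == "lexicon":
        return Fraction(1)
    if kind == "rule":
        return Fraction(1, len(task[2]))  # task[2] stands in for the edge's leaves
    if kind == "active":
        return Fraction(1, len(task[1]) + len(task[2]))
    return None  # a Lisp `case` with no matching clause returns NIL


tasks = [
    ("rule", "r", ["貓", "睏", "就", "就"]),
    ("lexicon",),
    ("active", ["貓", "睏"], ["就"]),
    ("rule", "r", ["貓", "睏"]),
]
# Assumed ordering: higher score first, so tasks covering fewer leaves win.
ordered = sorted(tasks, key=score, reverse=True)
for task in ordered:
    print(task[0], score(task))
```

`Fraction` is used to mirror the exact rational that `(/ 1 n)` yields in Common Lisp, which avoids floating-point ties between tasks of different lengths.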

Once either of the two limits has been reached, the realisations will be accessible from the “Redisplay realisation” command in the “Generate” menu (and programmatically as the value of the special variable *gen-record*). Currently, realisations are not output as soon as they are found, but this behaviour could be changed with a modest amount of programming.

Note that in the case of the Southern Min grammar, the bug fix I mentioned is necessary for lexical items with empty semantics to work properly. Also, there are a couple of incorrect parameter settings in the grammar’s script file - @ebender I’ll email you about that shortly.

Thank you!