Slow processing with append lists?

We (@ecconrad, @olzama, and I) have discovered a case where grammars created by the most recent version of the Grammar Matrix are significantly slower to process (using ACE) than grammars produced from the same choices file by the pre-append-list version of the Grammar Matrix. On inspection of sample sentences, the global ambiguity is the same and the parse charts look the same, so the same number of passive edges (the profiles produced by art aren’t recording this info for us, so we can’t verify it that way).

The grammars have the same set of rules, and though they are not identical, we suspect that the problem is more likely to be in the cost of unification than in, say, extra active edges. The switch over to append lists isn’t the only change between the two versions of the Matrix, but it is a big one. I’m wondering whether there is some optimization for unification with diff-lists which doesn’t automatically roll over to append-lists, or (worse) whether there is something inherently less efficient about the append-lists.

@sweaglesw and @guyemerson would you be interested in investigating whether the append-lists are in fact the source of the slowdown? @ecconrad has sample grammars & profiles she can share.

Yes, append lists will be inherently slower, because the feature structures are larger. I haven’t investigated how much slower.

For lists like SLASH, this is the price of having actual lists instead of diff-lists.

On the other hand, for the semantic lists (RELS, HCONS, ICONS), I don’t expect that anyone wants to check the length of the lists or manipulate them in unusual ways. For these lists, it’s possible to have a “middle ground” type of list, which is what I originally presented at the Oslo summit, i.e. sticking to diff-lists (so it won’t be slower), but with a nice syntax for diff-list appends. As long as no one’s trying to break the MRS algebra, making this switch should speed up the grammar without affecting functionality. (But at the maintenance/conceptual cost of having two different append operations.)
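
To make that trade-off concrete, here is a minimal sketch in Python (not the actual Matrix TDL; the class and attribute names are just illustrative) of why a diff-list append is cheap: the structure keeps a pointer to the open tail of its list, so appending is a single unification of that tail with the front of the other list, and no new list cells are built. The flip side, as noted above, is that the tail is left open, so you cannot safely ask how long the list is.

```python
# Minimal sketch of a diff-list: LIST points at the front of an open-ended
# list, LAST at its open tail cell.  Appending only touches the tail cell;
# it never walks or copies the list.

class Cell:
    """One cons cell; an empty cell (first=None, rest=None) is the open tail."""
    def __init__(self, first=None, rest=None):
        self.first = first
        self.rest = rest

class DiffList:
    def __init__(self, items=()):
        self.last = Cell()          # the open tail (the LAST pointer)
        self.list = self.last       # the front of the list (the LIST pointer)
        for item in items:
            self.last.first = item
            self.last.rest = Cell()
            self.last = self.last.rest

    def append(self, other):
        # "Unify" our open tail cell with the other's first cell.  (A real
        # unifier would share the cell rather than copy its contents, but the
        # cost profile is the same: constant work per append.)
        self.last.first = other.list.first
        self.last.rest = other.list.rest
        if other.last is not other.list:      # other was non-empty
            self.last = other.last
        return self

    def to_python_list(self):
        out, cell = [], self.list
        while cell is not self.last:
            out.append(cell.first)
            cell = cell.rest
        return out

# <a,b> ++ <c> == <a,b,c>, without building any new cells for the append itself.
print(DiffList(["a", "b"]).append(DiffList(["c"])).to_python_list())
```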

I am wondering why I did not have the kind of problems that @ecconrad is having with any of my dissertation grammars. It would be nice to identify the difference… Why is it that for me it was just fine, but for her it takes prohibitively long in some cases?

It is puzzling — I would expect your testsuites to have longer sentences than the problematic ones that Liz is encountering. It would be great to have a way to investigate in more detail.

One obvious difference between my grammars and the AGG ones is the number of lexical rules. But could that be relevant?

I would expect the slowdown from using append lists to be measurable and perhaps noticeable depending on what gets deleted, but probably not severe enough to result in intolerable wait times in cases which were snappy with diff-lists.

Anything can happen though. In my estimation, dramatic performance problems are most likely to come from either (1) way more edges in the chart or (2) way bigger AVMs. The former can be identified by looking at charts or the passive edge count. The latter can often be identified by looking at a text-based printout of an AVM (renderings usually have collapsed or even completely hidden parts) or, if the number of passive edges is comparable, comparing total memory usage.

Another technique that may or may not be useful in this case is ACE’s built-in time profiler, which is enabled with the -i flag. This will cause ACE to give a breakdown of execution time by various operations upon termination, and comparing these between two versions of a grammar can in some cases shine light on the problem.
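
For the comparison step, something like the rough helper below may save some squinting. It is only a sketch: it assumes each breakdown has been captured to a plain text file with lines of the shape “label : 1h 59m 24s for N events = …” (the format of the printouts quoted further down this thread), and the file handling and parsing details are my assumptions, not anything built into ACE.

```python
# Rough helper for diffing two saved "ace -i" breakdowns.  The parsing details
# below are guesses based on the printouts quoted in this thread.
import re
import sys

UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "µs": 1e-6, "us": 1e-6}

def parse_duration(text):
    """Convert strings like '1h 59m 24s', '199.2ms' or '56.9µs' to seconds."""
    return sum(float(value) * UNITS[unit]
               for value, unit in re.findall(r"([\d.]+)\s*(µs|us|ms|h|m|s)", text))

def parse_profile(path):
    """Map each timing label to its total time in seconds."""
    timings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if " : " not in line or " for " not in line:
                continue
            label, rest = line.rsplit(" : ", 1)
            timings[label.strip()] = parse_duration(rest.split(" for ")[0])
    return timings

def compare(old_path, new_path, top=10):
    """Print the labels whose totals changed the most between the two runs."""
    old, new = parse_profile(old_path), parse_profile(new_path)
    deltas = {k: new.get(k, 0.0) - old.get(k, 0.0) for k in old.keys() | new.keys()}
    for label, delta in sorted(deltas.items(), key=lambda kv: -abs(kv[1]))[:top]:
        print(f"{label:40s} {old.get(label, 0.0):12.3f}s -> "
              f"{new.get(label, 0.0):12.3f}s  ({delta:+.3f}s)")

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```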

I would be surprised if the slowdown from append lists is intolerable. How big is the difference?

If the append operation makes a big difference to unification speed, this would suggest to me that the lists are a substantial part of the feature structure. Perhaps this could be true for very long sentences. For short lists (no matter the amount of ambiguity or number of rules), I wouldn’t expect append lists to cause much of a problem.

(Perhaps there could be a difference in some other part of the parsing algorithm, which I’m less familiar with, like packing or something. I would still be surprised. But I’ve been surprised before.)

I used the -i option per the suggestion and these were the results:

TIMING FOR yaq_kph (grammar customized with previous matrix version)
read_result (all)               	4.273s	for 29299 events = 145.8µs per event
fgets results                   	3.633s	for 30781 events = 118.0µs per event
record_result                   	0.606s	for 29299 events = 20.7µs per event
read_result: strlen             	2.4ms	for 29970 events = 0.1µs per event
read_result: strstr             	27.6ms	for 29299 events = 0.9µs per event
parse table write               	1.8ms	for 224 events = 8.2µs per event


TIMING FOR yaq_ecc (grammar customized with newest matrix version)
read_result (all)               	32m 19s	for 22152 events = 87.6ms per event
fgets results                   	32m 18s	for 23635 events = 82.0ms per event
record_result                   	1.028s	for 22152 events = 46.4µs per event
read_result: strlen             	7.5ms	for 22823 events = 0.3µs per event
read_result: strstr             	44.4ms	for 22152 events = 2.0µs per event
parse table write               	2.1ms	for 224 events = 9.4µs per event

Ok, well that is a profile of ART’s internal processes. It doesn’t look very revealing to me. I was actually trying to suggest getting a profile of ACE instead. You would need to run ACE directly with the -i flag, without using ART. It doesn’t necessarily need to be on the full set of inputs; any input or handful of inputs where you notice a substantial speed difference would be just as good. Incidentally, is the big slowdown on just a few items with good performance on most others, or is it across the board?

Oh my bad, here’s the output from ACE using -i:

parsing : 1h 59m 24s for 1 events = 1h 59m 24s per event
REPP : 137.0µs for 1 events = 137.0µs per event
token mapping : 56.0µs for 1 events = 56.0µs per event
lexical lookup : 3.7ms for 1 events = 3.7ms per event
orthographemic exploration : 0.9ms for 1 events = 0.9ms per event
lexical parsing : 1.6ms for 1 events = 1.6ms per event
noun-pc1_lrt2-lex : 417.0µs for 8 events = 52.1µs per event
noun-pc5_lrt1-suffix : 124.0µs for 1 events = 124.0µs per event
noun-pc4_lrt5-lex : 109.0µs for 3 events = 36.3µs per event
noun-pc9_lrt1-suffix2 : 98.0µs for 2 events = 49.0µs per event
noun-pc27_lrt2-lex : 23.0µs
lexical filtering : 1.0µs for 1 events = 1.0µs per event
übertagging : 1.0µs for 1 events = 1.0µs per event
chart parsing : 199.2ms for 1 events = 199.2ms per event
passive: vp5-top-coord : 81.6ms for 157 events = 0.5ms per event
packing : 61.9ms for 1088 events = 56.9µs per event
passive: comp-head : 9.5ms for 23 events = 415.0µs per event
lexeme : 6.7ms for 19 events = 351.3µs per event
passive: subj-head : 5.6ms for 135 events = 41.7µs per event
passive: vp5-bottom-coord : 5.3ms for 185 events = 28.8µs per event
active: vp5-top-coord : 5.2ms for 153 events = 33.8µs per event
active: comp-head : 4.4ms for 280 events = 15.8µs per event
passive: decl-head-opt-subj : 4.4ms for 100 events = 43.8µs per event
passive: basic-head-opt-comp : 3.8ms for 8 events = 479.8µs per event
unpacking : 1h 59m 24s for 1 events = 1h 59m 24s per event
mrs extraction : 1h 59m 15s for 20897 events = 342.4ms per event
EPs : 1.032s for 20897 events = 49.4µs per event
HCs : 22.4ms for 83128 events = 0.3µs per event

This was for one particularly problematic sentence. When looking at CPU time stats in the LKB for a full test suite profile, it seemed like all the sentences with parse results took longer with the new Matrix version, but it’s only noticeable on sentences with many readings.

Ok, now that is much more informative. As you can see, the majority of the parsing took very little time, and basically the whole 2 hours was spent in the “unpacking” phase. This means one of three things that I can think of. The first possibility is that you asked for exhaustive unpacking (didn’t give a limit for how many results to unpack) and there is some serious combinatorics present. It looks like you got 20,000-plus results for one sentence, which is a lot, but maybe not enough to justify 2 hours of parse time. As the profile shows, that’s only unpacking about 3 results per second.

The second possibility that comes to mind is that packing is producing a lot of garden paths that prove to be inconsistent upon unpacking. The most likely way this could happen is if something new was added to the packing restrictor list in the ACE configuration file. For instance, if some append-list related feature were put on the restrictor, I expect it could result in unconstrained SLASH values, which would then lead to a chart full of erroneous edges which take a long time to sort through in unpacking. Maybe compare the ACE configuration files between the two grammars if you suspect this could be happening.

The last possibility is that the feature structures are getting really really really big, but only during unpacking. I don’t see any hooks in ACE for printing out those structures during unpacking, so that might be hard to diagnose. If you want to send me the guilty grammar and the input in question I could check that.

After checking, the config.tdl files for both are identical (assuming that’s what is meant by ACE configuration files).

Where should I send the grammar and input?

Thanks for sharing the grammar and inputs. I was able to reproduce the slow behavior and take a quick look at it. Of the situations I outlined above, the first two (excess ambiguity and packing garden paths) are not the culprit. The third (enormous feature structures) is definitely happening and accounts for at least a large portion of the slowdown. I’m not sure whether to blame the slowdown 100% on that, or also partly on the fact that the unifier is somewhat less efficient when it has to impose extra constraints at a GLB (i.e. a situation where x & y is neither x nor y, and adds new constraints not present on either x or y). This situation is relatively rare in most processing (things like ocons use it), but it is used heavily in append-lists.

The root-level feature structures for the input measured in the profile above are winding up about 6 times larger for the append-list grammar than for the previous one. This is due to the fact that the append lists are carrying along their computation history. In a different thread a few months back, we had an interesting discussion about whether it was safe to discard those histories using the deleted-daughters setting. I’m not sure there was a conclusive answer – discarding leads to faster processing and improved behavior in coordination, if I recall correctly, but also precludes lexical threading. I know the Matrix folks decided to at least try going down the no-lexical-threading road, but I don’t know how far that work has gone, or whether it is represented in the grammar at issue?

After a bit more pointless searching in the unpacker’s AVM manipulation, I noticed what was visible all along in the profile @ecconrad posted: the full 2 hours is not just in the unpacking portion of the game; it’s specifically in the MRS extraction portion of the unpacker. That was really surprising.

It turns out the problem is caused by the MRS extraction code assuming an all-paths traversal of the CONT is relatively cheap. That assumption is apparently spectacularly violated by these history-preserving append lists. Just about 100% of the time seems to be inside of the dgreset_r() function, which is a utility used to clear temporary fields in AVMs without having to increment the generation counter (sorry if that’s technical mumbo jumbo to most of the current audience…).
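
For anyone to whom that last bit really is mumbo jumbo, here is a toy sketch in Python (an assumed model, not ACE’s actual C internals) of the two ways of invalidating per-node scratch data: stamping each scratch value with a global generation counter, which makes invalidation a single increment, versus walking the structure to clear the fields, which is the kind of work a dgreset_r()-style reset has to do.

```python
# Toy model (assumed, not ACE internals): AVM nodes carry scratch fields that
# must be invalidated between operations, e.g. between unification attempts.

GENERATION = 0

class AvmNode:
    def __init__(self):
        self.arcs = {}             # feature name -> AvmNode
        self._scratch = None
        self._scratch_gen = -1     # generation under which the scratch was set

    def set_scratch(self, value):
        self._scratch = value
        self._scratch_gen = GENERATION

    def get_scratch(self):
        # Scratch values written under an older generation count as unset.
        return self._scratch if self._scratch_gen == GENERATION else None

def invalidate_by_generation():
    """O(1): bump the counter; no feature structure is touched at all."""
    global GENERATION
    GENERATION += 1

def invalidate_by_walking(node, seen=None):
    """Clear the fields by traversal instead.  With the 'seen' set this is
    linear in the number of nodes; an all-paths walk (no 'seen' set) pays once
    per path, which is what explodes on append-list computation histories."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return
    seen.add(id(node))
    node._scratch = None
    node._scratch_gen = -1
    for child in node.arcs.values():
        invalidate_by_walking(child, seen)
```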

I’m going to have to do a bit more digging to figure out why that call is there at all. Just now it looks unnecessary. Commenting it out and hoping for the best leads to enumerating the same number (28288) of readings with the diff-list grammar as the append-list grammar, and both pretty quickly, although certainly still showing a penalty for the larger AVMs: 1.7 seconds total runtime and 439MB used for the diff-list grammar and 3.5 seconds total runtime and 2.1GB for the append-list grammar on that one input that took 2 hours before.


In case anyone other than me is interested in how this particular problem scales…

Static analysis of exactly how many paths there are likely to be through an append-list when parsing an n-word sentence is not easy, but a lower bound isn’t too hard, if I’ve done things right. If a feature like RELS appends at each node in the derivation tree and the contents of every leaf node (and C-CONT) is <>, then a strict binary branching tree for n words has more than n^2 paths through the computation history in the RELS append list, and a strict right- or left-branching tree has more than 2^n paths. If the lists being appended are not empty, things get even worse in a hurry.

This is a count of how many ways there are to get to the various nodes in the append list, not a count of how many nodes there are (that number is still big, but much smaller). It’s not relevant for most algorithms, since you can usually do a mark and sweep sort of thing where you don’t revisit the structure under a node once you’ve already visited it once, but it was relevant (possibly for no good reason) to the procedure ACE was using in this mystery. ACE also uses this procedure when printing an AVM out to the terminal (which doesn’t happen during normal operation, but does during debugging or verbose logging), because in some circumstances it’s important to be able to print the current contents of an AVM without changing any of the scratch fields.
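
Here is a toy illustration in Python of that node-count versus path-count gap (an assumed stand-in structure, not the real append-list AVMs): each of n levels gets two arcs, standing in for the NEW-LIST/END-LIST choice when every appended list is empty, and both arcs lead to the same shared child. The number of nodes grows linearly, while the number of root-to-leaf paths doubles at every level.

```python
# Assumed stand-in for the append-list computation history: at each of `depth`
# levels, two arcs (think NEW-LIST vs. END-LIST over an empty append) both
# lead to the same shared child, so the structure is a small DAG, not a tree.

def build_chain(depth):
    node = {}
    for _ in range(depth):
        node = {"NEW-LIST": node, "END-LIST": node}
    return node

def count_paths(node):
    """What an all-paths traversal pays: a shared child is revisited once per
    distinct path leading to it."""
    if not node:
        return 1
    return sum(count_paths(child) for child in node.values())

def count_nodes(node, seen=None):
    """What a mark-and-sweep traversal pays: each node is visited exactly once."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen) for child in node.values())

for depth in (5, 10, 20):
    root = build_chain(depth)
    print(depth, count_nodes(root), count_paths(root))
# depth 20: 21 nodes, but 2^20 = 1048576 paths
```

So a procedure that revisits shared structure once per path pays on the order of 2^n here, even though a mark-and-sweep pass over the same object only touches n+1 nodes.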

There is no longer lexical threading in the Matrix. However, if I recall correctly, at DELPH-IN there was a theoretical (formal) objection against deleting the daughters, to the effect that it would go against the essence of the formalism?


Thank you, @sweaglesw, this is really interesting!

The “middle ground” append-diff-lists I mentioned would be larger than normal diff-lists (because they still carry around their computation history), but smaller than append lists (because they wouldn’t create new list nodes).

The argument about deleted daughters was that it should only be used when it’s purely for efficiency. I was proposing using it in a way that would change what is licensed by a grammar, and I conceded that point. So I think the aim should be for parsing/generating speed to be tolerable without deleting the APPEND feature, with the option to delete it if desired (for efficiency). (And if lexical threading is used, this option isn’t available.) It sounds like it’s tolerable except for the algorithm used for printing AVMs to the terminal?

Could you say a bit more? With the exponential, is this considering ARGS as well as RELS?


I was only considering appending the value of a single feature. The way I was counting, for a right-branching tree where the append result is empty the whole way up, is as follows: first, find the most deeply embedded append-list. This is the feature’s value on the last word in the sentence. It is visible from the root as FEATURE.(APPEND.REST.FIRST.)^{n-1}. From here we can climb back up to the value in the root structure by following either NEW-LIST features or END-LIST features. Since the list is empty, in this case they will both always take us to the same place, though that isn’t really relevant. The following is an expression for all the ways to get down to that innermost nested list and then back up to the top again:

FEATURE.(APPEND.REST.FIRST.)^{n-1}.LIST.(NEW-LIST or END-LIST)^{n-1}

… and that expression represents 2^{n-1} paths, since the choice between NEW-LIST and END-LIST at each step is arbitrary and there are n-1 steps to climb to the top. The second-to-last word is equally deeply nested, so climbing to the top from its computation history (at FEATURE.(APPEND.REST.FIRST.)^{n-2}.APPEND.FIRST) yields another 2^{n-1} paths, for a grand total of 2^n. You can probably get another 2^{n-1} by adding up all the earlier words’ contributions as well.

Correct – for algorithms that only traverse all the nodes once (i.e. just about any sensible algorithm) it is not an issue. There may be ways to work around the need for the all-paths algorithms occasionally employed by ACE, as well. It hasn’t been a pain point in the past.


Oh I see! I hadn’t thought about that.

It will be painful to combine this with differentiable feature structures (the “plumes” I presented in Oslo), because those are based on “abstract feature structures” which are indexed by paths rather than nodes. It’s good to have advance warning about combinatorial explosions.

Having looked at the slow code some more, I think the reset was necessary, but the full path method was not. I have now replaced the slow reset in MRS extraction with a fast one and checked that into the ACE SVN repo.
