Mal rules for adding articles?

I’m curious whether there are mal-rules around that I could somehow compile into a grammar to insert missing articles, like this:

"Open door" should be "open the door"
"eat apple" should be "eat the apple"
"put apple in barrel" should be "put the apple in the barrel"
"pacifier in bed" should be "put the pacifier in the bed"
"get safe" should be  "get the safe"

I’ve found that people really have a hard time typing full sentence queries and commands to a computer (using voice helps some)…

This is an interesting question. I also have some ungrammatical sentences in QA datasets like http://qald.aksw.org and http://lc-quad.sda.tech. Questions like [1] below. Note also that the proper nouns are lowercase.

  1. * who was governor of minnesota when ankahee was released?
  2. who was the governor of minnesota when ankahee was released?

Ideas would be welcome. Once I add the determiner, as in [2], ACE/ERG gives me 229 readings. ERG was able to recognize “minnesota” as a named entity (probably it is in the lexicon), but I would not expect ERG to know that Ankahee is the title of a movie: Ankahee (1985 film) - Wikipedia. LKB/ERG does not give me an analysis for [2]… I guess the lkb/script in the ERG source is not 100% compatible with the ace/config, or ACE has some robustness not present in the LKB?

Yes, you can configure the ERG to accommodate missing determiners, but be careful what you wish for! More ambiguity is not always welcome.
Below are the instructions on how to enable the mal-rule for missing determiners of singular count nominals as in “put apple in barrel”. But be warned that adding this rule will cause the parser to give you surprising and unwelcome behavior for all but the shortest sentences, due to the amazing amount of additional ambiguity this relaxed constraint introduces. Feel free to try it, and maybe for short commands, it could work well, but you should test the revised grammar with a good variety of expected inputs.
To enable the rule:

  1. In the file erg/constructions.tdl, move the comment-out characters #| from above the definition of the rule hdn_bnp-rbst_c to the line immediately following the end of the definition, i.e. after the line “[ RNAME bnpr ].” (see the sketch after these steps).
  2. Save the file and recompile with ACE.
    You will now get a good parse for e.g. “We admire cat.”
    Note that the parse-ranking model has not been trained with any exposure to this rule (or any other mal-rules), so it will not do any reasonable thing in trying to rank parses using this rule.
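
Schematically, the change looks like this. This is my sketch, not the actual file contents: the body of the rule in constructions.tdl is elided as “...”, and whatever else the #| … |# block comment contains stays commented out:

Before:

#|
hdn_bnp-rbst_c := ...   ; rule definition (body elided)
  [ RNAME bnpr ].
...
|#

After:

hdn_bnp-rbst_c := ...   ; rule is now active
  [ RNAME bnpr ].
#|
...
|#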

As for your “governor” example, @arademaker, two things:
a. Your starred example (1) is in fact well-formed, since office-naming nouns like “governor” or “president” can be used predicatively as in your example without a determiner. The ERG gets this right for “president”, “director”, and several others, but sadly not yet for “governor” in the 2020 release.
b. You’re right that the LKB does not yet employ the token-mapping machinery used by ACE to deal with unknown words, so it ignores the token “ankahee”, which is not in the pre-defined lexicon for the ERG. I think there is some progress on enhancing the LKB for token mapping.


Thanks @Dan, I was worried about that. I’ll give it a try.

Ironically, one solution I have been trying is to use GPT3 to correct humans’ poor grammar so that a computer can understand it, with pretty good success.

If I submit the following template to GPT3’s davinci model using OpenAI.com, it has given me overwhelmingly good answers in my admittedly limited testing so far:

"Open door" should be "open the door"
"eat apple" should be "eat the apple"
"restart" should be "restart"
"help" should be "help"
"put apple in barrel" should be "put the apple in the barrel"
"pacifier in bed" should be "put the pacifier in the bed"
"get safe" should be  "get the safe"
"where is my grand children's house" should be "where is my grandchildren's house"
"go home" should be "go home"
"put boot table" should be "put the boot on the table"
"where the diaper bag" should be "where is the diaper bag"
"<user input>" should be:
<GPT3 puts output here>

For example:

put book floor -> put the book on the floor
take ball -> take the ball
look chest -> look in the chest

and interestingly:
who was governor of minnesota when ankahee was released? -> 
     who was the governor of Minnesota when Ankahee was released?

etc.

My prompt above is really just the first set of failures I found in my test suite, so I imagine I can tune it to work better, but it worked pretty amazingly with almost zero work, when it works.

I do occasionally get the expected completely bizarre ones, which have been easy to filter out so far.

I’ll see how it works “in the field” but the smoke test was pretty impressive…

Yes, as @Dan says, there is progress on enhancing the LKB for token mapping. I’ve finished the implementation and it will be in the next release of LKB-FOS – around a week from now. The new version of LKB-FOS/ERG with token mapping (and lexical filtering) successfully parses “Who was the governor of Minnesota when Ankahee was released?”


So for (1), my bad for not being a native English speaker. So I just need to know how to add the missing lexical entry, or how to properly change the lexical type of “governor”, right?

Can you share the step-by-step instructions for using GPT-3? I didn’t understand your description above. Your link is not working.

@arademaker, my mistake, it should have been openai.com (fixed above). I’ve made much more progress on testing it since then and can give you what I’ve got so far.

The easiest way to get started is to set up an account on openai.com and use the “Playground”. It is trivial (5 lines of code?) to get it working in your language of choice once it is doing nearly what you want; they have good docs for that.
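
For a concrete idea, here is a minimal sketch in Python using the legacy openai library; few_shot_prompt is a placeholder for the instruction-plus-examples text I paste in below, and the input phrase is just for illustration:

import openai

openai.api_key = "sk-..."  # your API key from openai.com

# few_shot_prompt holds the example block shown further below,
# minus the final "<the text you want to test>" line
response = openai.Completion.create(
    model="text-davinci-002",   # the model recommended below
    prompt=few_shot_prompt + '"take lamp" should be',
    temperature=0,              # no creativity, just consistent answers
    max_tokens=60,
)
print(response.choices[0].text.strip())  # e.g. "take the lamp"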

My scenario is getting GPT3 to transform what I call the “computereze” language that people throw at my game (mostly just missing articles and using verb/object syntax like “take lamp”) into valid English that the ERG will parse well.

So far, I’ve used what OpenAI calls “text completion” to do this, which basically means literally writing down instructions, followed by examples, and then giving it what you want to be transformed and hoping it will follow the pattern.

I have to post-filter what it gives me to detect when it goes off the rails. So far it’s just two rules (sketched in code after this list):

  • if it has more than one line it is bogus
  • if it isn’t surrounded by quotes it is bogus
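
A sketch of that filter in Python, implementing just those two rules:

def looks_bogus(completion: str) -> bool:
    # Reject a GPT3 response that doesn't match the expected shape.
    text = completion.strip()
    # rule 1: if it has more than one line it is bogus
    if "\n" in text:
        return True
    # rule 2: if it isn't surrounded by quotes it is bogus
    if len(text) < 2 or not (text.startswith('"') and text.endswith('"')):
        return True
    return False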

And it gives very few bogus answers with what I’m using now. Note that you should also run its free “content” filter over its results in case it goes really crazy and sends back something you will regret. The filter will flag racist, sexist, etc. stuff. Mostly. No guarantees. It is an experiment for me, but it does feel scary to use this in production…

To try out what I’ve done, go to the playground, and use the following settings:

Model: text-davinci-002 (the richest model, I've had mixed success so far on others)
Temperature: 0 (we want no creativity or risks, just a consistent answer)

Leave the rest of the settings at their defaults

Below is the exact text I have been using for my purposes; it has worked very well in testing against over 1200 phrases (both ones that shouldn’t be corrected and ones that should). You literally paste every single line of it into the playground window, and then fix the last line to be the text you want corrected. So turn the last line from:

"<the text you want to test>" should be

into (for example):

"who was governor of minnesota when ankahee was released?" should be

And hit submit. The playground will “complete” the phrase with the correction.

As always, this stuff is an art, and I’ve noticed that my model no longer adds the “the” into your text, even though it works well for my purposes. I am not even really a beginner with these systems, but here’s what I did to fix the model when I hit a case like this: add the phrase that didn’t work into the training set and try again with more data from my “treebank”. I kept doing that until I started consistently getting the results I wanted.

Here is my raw completion text; it shows what it took to get it working like I wanted (so far). All the phrases where the original and the correction are basically the same are places where it screwed up its suggestion and I had to add the phrase in to get it right:

Turn short phrases into full English sentences but don't remove any important words. For example:
"Open door" should be "open the door"
"eat apple" should be "eat the apple"
"put apple in barrel" should be "put the apple in the barrel"
"pacifier in bed" should be "put the pacifier in the bed"
"get safe" should be "get the safe"
"give buttercup" should be "give the buttercup"
"drop backpack" should be "drop the backpack"
"put boot table" should be "put the boot on the table"
"where the diaper bag" should be "where is the diaper bag"
"frog green?" should be "The frog is green?"
"a diamond is blue" should be "A diamond is blue"
"the pen is in the diamond cave" should be "The pen is in the diamond cave"
"there is a pen" should be "There is a pen"
"a bottom is on the slug" should be "A bottom is on the slug"
"describe the rocks" should be "Describe the rocks"
"there is blue paint" should be "There is blue paint"
"blue paint is on the table" should be "Blue paint is on the table"
"a roof is wet" should be "a roof is wet"
"go home" should be "go home"
"restart" should be "restart"
"help" should be "help"
"is a book in the entrance?" should be "Is a book in the entrance?"
"put the diamond in Plage" should be "put the diamond in Plage"
"get the rock on the floor" should be "get the rock on the floor"
"put the crystal on the table where the safe is" should be "put the crystal on the table where the safe is"
"where is the diamond at?" should be "Where is the diamond at?"
"are you still in a cave?" should be "Are you still in a cave?"
"get a hand" should be "get a hand"
"read page 1" should be "read page 1"
"read page 2" should be "read page 2"
"turn page 1" should be "Turn page 1"
"look around" should be "look around"
"paint is on the table" should be "paint is on the table"
"go to a cave" should be "go to a cave"
"is a rock in the cave?" should be "Is a rock in the cave?"
"is a girl in the doorway?" should be "Is a girl in the doorway?"
"what is the keyhole on?" should be "what is the keyhole on?"
"get Plage." should be "get Plage"
"go through the safe" should be "go through the safe"
"leave cave" should be "leave the cave"
"there is a front on a safe" should be "there is a front on a safe"
"drop a rock" should be "drop a rock"
"go into the 1st cave" should be "go into the 1st cave"
"where is a living room" should be "where is a living room"
"where is my grand children's house" should be "where is my grandchildren's house"
"<the text you want to test>" should be

Good luck!

As for the missing lexical entry for “governor” in the 2020 ERG, you could add the following to the file lexicon.tdl, so you can get e.g. “… when Cyrenius was governor of Syria.”:

governor_prd_n1 := n_pp_c-prd-of_le &
 [ ORTH < "governor" >,
   SYNSEM [ LKEYS.KEYREL.PRED "_governor_n_of_rel",
            PHON.ONSET con ] ].

Update on using OpenAI as a “mal-rule fixer”: My instructions above use what OpenAI calls “few-shot learning” to train the model, where you put examples right in the prompt every time. This costs more (because you get charged basically by the character) and isn’t the best training approach, but it is good for initial testing. It worked great for my case.

I’ve now completed testing using what they call their “Fine Tuning” approach, where you upload training data, they train the model on their servers, and then you only send the text you really want to use as a prompt. Less text is cheaper, and this training approach gives better (or the same) results.

I also tried my treebank on a fine-tuned “Ada” model (as opposed to Davinci) which is a much smaller and much cheaper model and performs just as well with the data I’ve used so far. So, less text sent + cheaper base model + as good results = much cheaper to use overall (and supposedly faster but I haven’t measured that).

It did take a bit to get the training data right, so here is a sample of what I used. It is the exact same data as the dataset above, but in the fine-tuning format:

  • You have to have a clear string to end the prompt and the completion; that’s why it has the \n\n###\n\n and END tokens in the strings.
  • I also found that if I didn’t surround the prompt and completion phrases with quotes, it was much harder to detect if GPT3 was going off the rails. With quotes, I can check if the response has quotes around it and ignore it if not.
{"prompt": "\"examine me\"\n\n###\n\n", "completion":" \"examine me\" END"}
{"prompt": "\"examine backpack\"\n\n###\n\n", "completion":" \"examine the backpack\" END"}
{"prompt": "\"examine gate\"\n\n###\n\n", "completion":" \"examine the gate\" END"}
{"prompt": "\"examine fence\"\n\n###\n\n", "completion":" \"examine the fence\" END"}
{"prompt": "\"examine house\"\n\n###\n\n", "completion":" \"examine the house\" END"}
{"prompt": "\"open gate\"\n\n###\n\n", "completion":" \"open the gate\" END"}
{"prompt": "\"close gate\"\n\n###\n\n", "completion":" \"close the gate\" END"}
{"prompt": "\"examine bell\"\n\n###\n\n", "completion":" \"examine the bell\" END"}
{"prompt": "\"examine door\"\n\n###\n\n", "completion":" \"examine the door\" END"}
{"prompt": "\"examine intercom\"\n\n###\n\n", "completion":" \"examine the intercom\" END"}

...
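
In case it helps anyone follow along, here is a sketch of that workflow in Python using the legacy openai library and CLI; the fine-tuned model name is a placeholder for whatever name your fine-tune job reports:

# train once from the shell with the legacy CLI:
#   openai api fine_tunes.create -t training_data.jsonl -m ada
import openai

response = openai.Completion.create(
    model="ada:ft-your-org-...",      # placeholder: your fine-tuned model's name
    prompt='"take lamp"\n\n###\n\n',  # same format as the training prompts
    temperature=0,
    max_tokens=60,
    stop=[" END"],                    # END marks the end of a completion
)
text = response.choices[0].text.strip()
# same post-filter as before: ignore responses not surrounded by quotes
if text.startswith('"') and text.endswith('"'):
    print(text)  # e.g. "take the lamp"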