Matching entity value against a large list of synonyms

raja.balasubramanian · June 12, 2023, 5:32pm

We have a service that returns a large list of synonyms in its response. An entity node (list of enumerated items) uses this list of values: the response is fed in as input, and the entity is identified from the utterance.

Example utterance - Get sales amount for EMEA.

EMEA is the region, and the entity (REGION) has approximately 1000 synonyms to match against. For example, EMEA in the utterance is matched against service output synonyms such as Europe, EMEA, Eastern Europe, European Union, etc.

The NLP engine matches the entity value against the list of synonyms. The only issue we observe is the time taken to do that matching, which hurts the overall performance of the bot: it takes almost 12 seconds to respond because of the large amount of synonym data.

Is there any other way to reduce the time taken in matching the synonyms in the entity node?

StacyJPelletier · June 13, 2023, 8:06am

I don’t have the answer for you but am following to hear the response from someone who does. I am curious if the entity node recognizes Concepts in the entity synonyms? If so that may make it easier/faster, but I am not 100% certain.

andy.heydon · June 13, 2023, 8:40am

Yes, you can use concepts in an LoV entity’s synonyms, and that will be faster because there is less work to do.

The LoV entity has to perform a certain amount of pre-processing on its synonyms to get them into a format that is easy to match with. That is partly because of the different styles of synonyms and partly to set up the scoring mechanism.

For a static enumerated list, that pre-processing happens during the publish action and its results are cached. The run-time matching against the user’s utterance is then very quick.

But for dynamic lists, that pre-processing step must be performed at the time the entity becomes active, hence for very large lists with many synonyms that can take some time.

So some approaches to improve performance:

Use static lists if possible
Use concepts in the synonyms to reduce the number of actual words/synonyms per choice. Obviously that might need some changes to the service response or some post processing of the JSON
If you are just using the entity to get the word that represents a region from the user’s utterance then another approach can be to define all of the possible words in a concept and use a Custom entity instead (the “regex” expression in a Custom entity can be the name of a ~concept). You could have a series of optional custom entities, one for each region, or use a set of them in a composite.
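As a rough illustration of the "post processing of the JSON" idea, here is a minimal Python sketch that replaces any synonym covered by a concept with the concept's name, so each choice carries fewer literal synonyms. The response shape (`name`, `synonyms`), the `CONCEPTS` table, and the `~europe` concept name are all assumptions made up for this example; they are not the platform's actual format, and the real service response will likely differ.

```python
import json

# Hypothetical concept definitions: concept name -> lowercase words it covers.
CONCEPTS = {
    "~europe": {"europe", "european union", "eu"},
}


def collapse_synonyms(choices, concepts):
    """Replace synonyms covered by a concept with the concept name (deduplicated)."""
    word_to_concept = {
        word: name for name, words in concepts.items() for word in words
    }
    collapsed = []
    for choice in choices:
        seen = set()
        synonyms = []
        for synonym in choice["synonyms"]:
            # Fall back to the original synonym when no concept covers it.
            replacement = word_to_concept.get(synonym.strip().lower(), synonym)
            if replacement not in seen:
                seen.add(replacement)
                synonyms.append(replacement)
        collapsed.append({"name": choice["name"], "synonyms": synonyms})
    return collapsed


# Assumed service response shape, for illustration only.
service_response = json.loads("""
[{"name": "EMEA",
  "synonyms": ["Europe", "European Union", "EU", "Eastern Europe"]}]
""")

result = collapse_synonyms(service_response, CONCEPTS)
print(result)
```

In this sketch the four synonyms for EMEA shrink to two entries, one concept reference plus the one phrase the concept does not cover.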

Now I would be interested in knowing more about why the regions are dynamic. How much variation is there? How frequently does the list change?

StacyJPelletier · June 14, 2023, 5:47am

Thank you Andy! This is very helpful. One follow-up question for you: in the entity synonyms, I have in the past had to use double quotes around all the words when there is more than one word, so that it would match as desired. Can you tell me if there is a difference in processing an entity synonym of, say, Respectful Workplace Policy when using double quotes vs. not using double quotes (for the list of enumerated items type)? I am not quite certain if/when double quotes are needed and under what circumstances. Thank you!

raja.balasubramanian · June 14, 2023, 6:40am

Thanks Stacy. Will try concepts. Making the synonym list static (JSON) gave a better result.

raja.balasubramanian · June 14, 2023, 8:31am

Thanks Andy. I have removed the dynamic list and made the list static. This gave better performance.

REGION is just one of the entities. Our entity list includes business units and metrics that change often, so we wanted to use dynamic lists instead of static ones. With static lists we will end up updating the list every month, and each update requires a release.

andy.heydon · June 14, 2023, 8:35am

Great, I am glad that you saw a performance improvement.

The notion of being able to cache the pre-processing for dynamic lists keeps coming up. There are many situations where it would not be appropriate, e.g. list of recent transactions, but there are other cases like yours where the data is sourced from a service but is not user specific and doesn’t change that often, so perhaps there can be some way for us to cache the data for a while. Cache invalidation of course being a big issue.
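To make the trade-off concrete, here is a minimal Python sketch of time-based caching around an expensive load-and-pre-process step. It is not a platform feature, only an illustration of the idea discussed above; `TTLCache`, `load_and_preprocess`, and the one-hour lifetime are hypothetical names and values chosen for the example.

```python
import time


class TTLCache:
    """Caches one pre-processed value and rebuilds it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means "nothing cached yet"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache miss or expired entry: pay the pre-processing cost again.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self):
        """Force the next get() to reload, e.g. when the source data changes."""
        self._expires_at = None


calls = {"count": 0}


def load_and_preprocess():
    # Stand-in for calling the service and pre-processing its synonyms.
    calls["count"] += 1
    return {"emea": "EMEA", "europe": "EMEA"}


# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = TTLCache(load_and_preprocess, ttl_seconds=3600, clock=lambda: fake_now[0])

cache.get()            # first call: loads and pre-processes
cache.get()            # within the hour: served from the cache
fake_now[0] = 3601.0
cache.get()            # after expiry: loads again
print(calls["count"])
```

The weak point is the one already named: a fixed lifetime only bounds how stale the data can get, so any change in the source needs either a short lifetime or an explicit `invalidate()` call.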

andy.heydon · June 14, 2023, 9:41am

I presume by “entity synonyms” you mean the synonyms in a List of Items entity. Every entity has a name, of course, and there is a field for entering synonyms for that name that is referenced by the Amend Entity processing, and that is strictly referred to as “entity synonyms”.

But assuming you mean a list of items synonyms then the difference between a double quoted set of words and unquoted is whether all of the words have to be matched as a single phrase or whether partial matches are allowed.

A sequence of words enclosed in quotes means that exact phrase must be entered by the user to match. It is therefore restrictive and limited and the probability of not matching rises for longer phrases. I generally recommend to not use quoted phrases unless absolutely necessary, and where every word is truly significant. Enclosing single words in quotes is redundant, the quotes do not provide any additional benefit in that situation.

When the synonym is not quoted, the matching process allows only a subset of the words to be present. The more words that are present, the higher the score, so that match is taken above other partial matches that match fewer or less significant words. Typically this approach is the most useful and offers the best chance to match something.
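The difference between the two styles can be sketched in a few lines of Python. This is a simplified illustration, not the NLP engine's actual algorithm: the `score` function and its fraction-of-words scoring are assumptions, and it ignores the word significance weighting mentioned above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(synonym, utterance):
    """Return a match score in [0, 1]; 0 means no match."""
    words = tokenize(utterance)
    quoted = len(synonym) >= 2 and synonym.startswith('"') and synonym.endswith('"')
    syn_words = tokenize(synonym)
    if not syn_words:
        return 0.0
    if quoted:
        # Quoted: the whole phrase must appear as one contiguous run of words.
        n = len(syn_words)
        for i in range(len(words) - n + 1):
            if words[i:i + n] == syn_words:
                return 1.0
        return 0.0
    # Unquoted: partial matches count; more matched words give a higher score.
    matched = sum(1 for w in syn_words if w in words)
    return matched / len(syn_words)


utterance = "show me the workplace policy"
print(score("Respectful Workplace Policy", utterance))    # partial match
print(score('"Respectful Workplace Policy"', utterance))  # no match
```

With the user leaving out "respectful", the unquoted synonym still matches on two of its three words, while the quoted one does not match at all.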

Language and how users communicate is infinitely flexible, therefore we should not try and be too strict in what we will understand. Users should not be penalized for not knowing the secret key.

karan.hinduja · June 28, 2023, 8:32pm

Following on from this, we have the following queries:

Kindly suggest what the threshold limit or count is for the List of Items entity type (enumerated and lookup).
As mentioned in the link, the static approach is faster because the data is pre-cached.

Can we have this static data loaded from an external file or any other source that can be pre-cached?
We have a requirement to load a list which will refresh weekly or monthly.

andy.heydon · June 29, 2023, 4:39pm

There are no limits per se for a List of Items entity.

It would not be possible to load data from an external source for a List of Items entity. The data needs to be in an internal format, and it would not be prudent to trust that external data - part of the synonym processing is to clean it up and remove elements that could compromise the system through injection type attacks.