Sanitize Composite Entry Before Processing

Hi,

I have a Composite Entity that has a Sub-Entity of Address type (I named it @RegAddressEntity) and a Sub-Entity of List type that holds a list of street names (@StreetNameEntity).

When entering an address such as “1 Stage Court Unit 502”, it works as expected: since Stage is in the list of @StreetNameEntity, that Sub-Entity is set to “Stage”. However, if the street type is followed by a comma, e.g. “1 Stage Court, Unit 502”, it goes to @RegAddressEntity and skips the first one in the pattern, which is @StreetNameEntity. (The comma is added automatically when using Voice/Smart Assist to enter the values.)

Any help?

In the NLP Properties, the

I am slightly unclear on your composite definition. Could you show the list of composite patterns, please?

The comma is helpful in some situations, Address being one example, because it helps to partition complex phrasing into different parts. Street names are an open-ended class, so one of the things the Address entity looks for is a number followed by something followed by a street type (“Court”) followed by a comma. That helps to give it confidence that this is a street address.

Without the comma it is difficult to know where to stop because, confusingly, street types can be used as street names.

Hi @andy.heydon

I understand why there is a comma there, but what ended up happening is that it was not matching the right pattern in the composite entity.

After finding this post (from you, thanks) I finally figured it out.

I changed the pattern to account for an optional comma and it now works.

This is the pattern now:
@StreetNumberEntity @StreetNameEntity {, court, road, street, court road street boulevard boulevard, drive drive, way way,} @JustANumberEntity

Moishe, I’m glad you are getting closer, but I have a few more suggestions.

In the tokenization of a sentence, a comma is always split into its own word, so the variations in that pattern that have a comma at the end are not necessary.
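To illustrate the point about commas, here is a rough Python sketch. It is not the platform's actual tokenizer; the `tokenize` function and its regex are assumptions made up for this example. It only shows the idea that a comma ends up as its own token, so "court," and "court" look the same to the pattern once the comma is split off.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: runs of letters, digits, or apostrophes form a word,
    # and every comma becomes a separate token of its own.
    return re.findall(r"[A-Za-z0-9']+|,", text)

print(tokenize("1 Stage Court, Unit 502"))
# ['1', 'Stage', 'Court', ',', 'Unit', '502']
```

Since "Court" and "," arrive as two tokens, a variation such as "court," in the group never has anything to match.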

You wouldn’t necessarily know this, but there is actually an internal concept (used primarily by the address parser) that contains a list of all the street types. It is named ~addr_streettype_en. If that is overkill (and it is large at 552 members) then alternatively use your own concept instead of a list of words like you have. Way easier to maintain and reuse.

I would also split the comma and the street type into two distinct optional elements. In general you can think of everything in a {} group (or []) as synonyms, as replacements for each other. Comma is not an alternative for street. A user input could contain either of them or both (comma after street).

Currently, if both a comma and a street type are present in the input, only one of them would be matched by the group, so you would be relying on the pattern variations that we create to add wildcards between pattern tokens.

And finally, if you do have two optional tokens, I would use the special *0 marker between them to block the addition of wildcards. So:
@StreetNumberEntity @StreetNameEntity {~addr_streettype_en} *0 {,} @JustANumberEntity
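If it helps to see the logic spelled out, below is a simplified Python sketch of how that pattern treats the street type and the comma as two independent optional elements. This is only an illustration and not how the platform matches internally: `match_address`, `STREET_NAMES`, and `STREET_TYPES` are hypothetical stand-ins for the entities and the concept, and skipping words such as "Unit" before the final number is an assumption about where wildcards would apply.

```python
import re

STREET_NAMES = {"stage"}  # hypothetical stand-in for @StreetNameEntity
STREET_TYPES = {"court", "road", "street", "boulevard", "drive", "way"}  # stand-in for the street type concept

def match_address(text):
    tokens = re.findall(r"[a-z0-9']+|,", text.lower())
    i = 0
    # Required: street number, then a known street name.
    if i >= len(tokens) or not tokens[i].isdigit():
        return None
    number = tokens[i]
    i += 1
    if i >= len(tokens) or tokens[i] not in STREET_NAMES:
        return None
    name = tokens[i]
    i += 1
    # Optional element 1: a street type.
    street_type = None
    if i < len(tokens) and tokens[i] in STREET_TYPES:
        street_type = tokens[i]
        i += 1
    # Optional element 2: a comma, directly after element 1 (no words skipped in between).
    if i < len(tokens) and tokens[i] == ",":
        i += 1
    # Assumed wildcard: skip words like "unit" until the trailing number.
    while i < len(tokens) and not tokens[i].isdigit():
        i += 1
    if i >= len(tokens):
        return None
    return {"number": number, "name": name, "type": street_type, "unit": tokens[i]}

print(match_address("1 Stage Court, Unit 502"))
print(match_address("1 Stage Court Unit 502"))
```

Both inputs give the same result here, because the street type and the comma are each checked on their own instead of competing inside one group.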

PS. I am slightly surprised that you are not capturing the street type as a distinct entity. Seems unusual that the street name all by itself would be unique.