Why a broad regex for PII can break intent identification (bot not responding after enabling PII)

swagata.sengupta · January 25, 2022, 11:29am

Problem
I set a very broad PII pattern (such as "\w"). After that, the bot does not identify any utterance.

Explanation
Let us first understand what PII is: Personally Identifiable Information, for example emails, SSNs, etc. There is a difference between sensitive information and personally identifiable information: a password is sensitive, but it does not identify a user on its own.
In a conversational setting, a user can enter PII at any time. Users may enter an email even when the bot is not asking for it (that is, not at an email entity prompt). Should the bot fail to treat it as PII in that case? This is how the Kore.ai platform deals with it: when PII is enabled, all inputs are sent to the PII engine first for validation against the defined PII patterns. Whether or not the user is at the matching entity prompt, the PII engine checks the input against every defined PII pattern and, if any pattern matches, returns an obfuscated/redacted form to the NLP engine.
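The flow described above can be sketched roughly as follows. This is only an illustration in Python, not the Kore.ai implementation: the names `PII_PATTERNS`, `redact_pii`, and `handle_input`, the email regex, and the "#" mask character are all assumptions made for the example.

```python
import re

# Hypothetical PII definitions (name -> regex). This is NOT the platform's
# actual configuration format, just a stand-in for "defined PII".
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

def redact_pii(utterance, patterns=PII_PATTERNS, mask="#"):
    """Replace every PII match with mask characters of the same length."""
    for pattern in patterns.values():
        utterance = pattern.sub(lambda m: mask * len(m.group()), utterance)
    return utterance

def handle_input(utterance):
    """Return the text that would be handed to intent detection."""
    # The PII check runs first, whether or not the bot is at an entity prompt.
    return redact_pii(utterance)

print(handle_input("my email is jane@example.com"))  # email is masked
print(handle_input("hello"))                         # unchanged: hello
```

With a narrow pattern like this one, only the email is masked and the rest of the utterance stays available for intent identification.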

If you use a very broad pattern such as \w for an entity you have configured as PII, it will match every utterance. When any PII regex is broad enough to capture "hello", "Hi", and other ordinary utterances, the PII engine masks everything and sends the masked text on for further processing. The NLP engine then has nothing recognizable to match an intent against, so the bot does not respond and the user may perceive the platform as broken.
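To see why this happens, here is a small Python sketch of what a \w pattern does to ordinary utterances. The `redact` helper and the "#" mask are hypothetical; the platform's actual masking format may differ.

```python
import re

def redact(utterance, pattern, mask="#"):
    """Replace every match of the pattern with mask characters (illustrative helper)."""
    return re.sub(pattern, lambda m: mask * len(m.group()), utterance)

BROAD_PATTERN = r"\w"  # matches any single letter, digit, or underscore

print(redact("hello", BROAD_PATTERN))          # -> #####
print(redact("book a flight", BROAD_PATTERN))  # -> #### # ######
```

Nothing of the original wording survives, so there is no text left from which an intent could be identified.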

So, if you MUST use PII, we recommend using a very specific regex pattern for each entity.
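One way to check whether a pattern is specific enough before configuring it is to run it against a handful of ordinary utterances. The sketch below is an assumption-laden example, not a platform feature: `SAMPLE_UTTERANCES`, `utterances_matched`, and the SSN-style pattern are all made up for illustration.

```python
import re

# Ordinary utterances that should never be treated as PII (illustrative sample).
SAMPLE_UTTERANCES = ["hello", "Hi", "book a flight", "what is my balance"]

def utterances_matched(pattern, samples=SAMPLE_UTTERANCES):
    """Return the sample utterances in which the pattern finds any match."""
    compiled = re.compile(pattern)
    return [s for s in samples if compiled.search(s)]

# Assumed SSN-style pattern: 3 digits, 2 digits, 4 digits, separated by hyphens.
SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

print(utterances_matched(r"\w"))        # too broad: matches every sample
print(utterances_matched(SSN_PATTERN))  # specific: matches none of them -> []
```

A pattern that matches any of the ordinary samples is likely too broad to use as a PII definition.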