Labelbox, July 26, 2023

How to detect and extract personal information from datasets for AI

Data breaches are the stuff of nightmares for most organizations. In 2017, an unpatched vulnerability in a web application framework at Equifax exposed the personal and financial data of nearly 150 million people, leaving those affected open to identity theft, fraud, and other crimes. It ultimately cost the company more than $400 million as part of a global settlement with the Federal Trade Commission and other government agencies. When social media giant Meta had a data leak in 2021 that involved over 500 million user accounts, along with their names, phone numbers, emails, locations, and more, the company was charged hefty fines and experienced significant losses in public trust. While the origins of these breaches were different, the impact was similar, and catastrophic.

With the AI revolution well underway, ensuring the security of personal data is a key concern for organizations taking advantage of AI solutions in their business practices. In this article, we’ll cover how organizations building and using AI can identify and remove personal information from their AI data, so that this data does not become the source of a costly and disastrous breach.

What is PII and where can we find it in our datasets?

Personally identifiable information (PII), as defined by the U.S. Department of Labor, is any information that could potentially identify an individual through direct or indirect means. This might include, but is not limited to, full names, street addresses, personal and professional emails, telephone numbers, and birth dates.

Data repositories filled with customer data, biometric data, third-party data, and user-generated content (UGC) serve as the lifeblood for enterprise AI solutions, but they are fraught with the risk of PII exposure. Customer data is a significant reservoir of PII. With each transaction, a customer shares a wealth of personal information, including their name, contact details, and payment information. Similarly, customer support interactions, online or offline, contribute to the accumulation of PII. Individuals engaging with an enterprise's digital platforms also often leave behind digital footprints in the form of IP addresses, geolocation data, and more, which all fall under the category of PII.

Biometric data adds another dimension to the PII spectrum. Advances in technology have led to a rise in the use of biometrics for identification and authentication purposes. Fingerprints, facial recognition, and voice patterns are just a few examples of biometric data that carry PII and demand stringent protection protocols. Third-party data and user-generated content (UGC) also carry a high probability of containing PII.

Harnessing these types of data for AI can help enterprises realize significant benefits, but those who want to build secure AI systems based on this data will need to have:

  • A thorough understanding of the data they are dealing with
  • The ability to discern whether the data used for training the models include PII
  • The tools to ensure that this information can be utilized in a legally compliant and ethical manner

PII is everywhere in an enterprise's data ecosystem. Organizations must cultivate a robust data governance strategy that can effectively identify, manage, and safeguard PII across diverse data sources. Doing so not only ensures compliance with data protection regulations but also fosters trust with customers, a crucial component of a successful business in today's data-driven world.

Examples of PII

Below is a small dataset of the kind of text an enterprise might collect and use to train its AI models. To keep it safe to share, the data was generated with GPT-4 so that every example of PII is fictitious.

"Sarah, welcome to Premium Health Plus! Your premium subscription is all set and active. Reach out to us at your convenience at sarahplus@email.com if you have any questions or need further assistance."

"What's up, folks? Robbie from Raleigh here. Bought this rad eCamera X recently and it's top-notch. Crystal clear pictures every time. Feel free to hit me up at robbies@email.com if you need some user tips!"

"Guess who just got her hands on the latest summer collection dress? This New Yorker! If you want to see how I styled it, follow me on Twitter @FashionistaMichelle."

"Have you met our superstar from the sales team yet? Benny, he's the go-to guy for all your queries. Drop him a line at benny.sales@enterprise.com or give him a call at his LA office."

"Do we all know Jane? She's our in-house Python whisperer. Born on a chilly New Year's Day in '85, she's been coding magic ever since. Reach out to her at jane.python@enterprise.com or catch her on her cell at (555) 555-1111."

"Hello there! My name is Nicholas and I wanted to share my experience with your software product. As a longtime resident of Chicago, I've been in the technology industry for nearly 20 years. Your software solution is one of the most intuitive I've seen in a long time. It was easy to download, install, and begin using right away. Not to mention the responsiveness of your support team. I had a couple of issues early on, but a quick email to your support (the support agent was nick_young_chi@example.com) was all it took for the team to respond. They even followed up on my direct line to ensure the problems were resolved. Well done! I look forward to seeing what else your team develops."

"Just received my eCamera X and I'm thrilled! The image quality is fantastic, and the user interface is very intuitive. I highly recommend this product to photography enthusiasts and professionals alike."

"Exciting news - the project is proceeding ahead of schedule! All team members have been delivering exceptional work and collaborating effectively. Thanks to everyone for their hard work and dedication. Looking forward to the upcoming project milestones."

How to detect PII in your data

Enterprises have long worked with customer data and other information that includes PII, and they typically use regular expressions (regex) to detect PII in their datasets. Regex relies on predefined patterns. For example, an organization can easily identify NSFW language in the comments section of its articles via regex, because it is looking for specific words or phrases.
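
As a rough illustration of the pattern-based approach, here is a minimal Python sketch that scans text for a few rigidly formatted identifiers. The patterns and the names `PII_PATTERNS` and `find_pii` are assumptions made for this example, not a production rule set; real email, phone, and handle formats vary far more widely.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    # "@" not preceded by a word character or dot, so emails are not re-matched.
    "handle": re.compile(r"(?<![\w.])@[A-Za-z0-9_]{2,}"),
}


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits


# A sentence taken from the sample dataset above.
sample = ("Reach out to her at jane.python@enterprise.com or catch her "
          "on her cell at (555) 555-1111.")
print(find_pii(sample))
# [('email', 'jane.python@enterprise.com'), ('phone', '(555) 555-1111')]
```

This works because email addresses and phone numbers follow a predictable shape. The name "Jane" in the same record has no such shape, so none of these patterns can find it.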

When it comes to searching through the complexities and subtleties of human language in unstructured text for personal identifiers, however, regex is less reliable. It lacks the capability to recognize context, synonyms, and the diverse ways in which PII can be presented in a text. This often leads to both false negatives, where PII is missed, and false positives, where non-PII data is incorrectly marked as PII.
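
Both failure modes can be shown with one small pattern. The sketch below assumes a simple numeric-date regex (`DATE` and `flags_date` are names invented for this example). The first sentence comes from the sample dataset, and the second is a hypothetical sentence added to show a date that is not personal.

```python
import re

# Matches numeric dates such as 01/01/1985; illustrative, not exhaustive.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def flags_date(text):
    """Return True when the numeric-date pattern fires on text."""
    return DATE.search(text) is not None


# From the sample dataset: a birth date expressed in words.
birth_in_words = ("Born on a chilly New Year's Day in '85, "
                  "she's been coding magic ever since.")
# Hypothetical sentence: a date that says nothing about a person.
release_note = "Version 2.0 of the app shipped on 07/26/2023."

print(flags_date(birth_in_words))  # False: PII is missed (false negative)
print(flags_date(release_note))    # True: non-PII is flagged (false positive)
```

Adding more patterns can narrow these gaps, but each new pattern only matches a surface form and still cannot tell whether a date belongs to a person.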

In contrast, large language models (LLMs) and other foundation models can overcome many of these constraints. Trained on extensive datasets, they can untangle intricate language patterns, understand context, and handle ambiguity, which makes for a more nuanced approach to PII detection and extraction.

Watch the short video below to learn how to quickly and easily use Model Foundry to leverage GPT-4 to find instances of PII in a text dataset.

Below are a few ways in which LLMs outperform regex in this domain:

  • Contextual understanding: Unlike regex, LLMs possess an intrinsic understanding of language context. They can discern when a certain piece of information, such as a date or a number, is being used as PII or just a common data point. This substantially reduces the occurrence of false positives and negatives.
  • Learning from data: Foundation models learn from a vast corpus of data, absorbing the diverse ways in which PII can appear in texts. This makes them adept at identifying unconventional or subtle presentations of PII that regex might miss.
  • Adaptability: LLMs are designed to learn and adapt. As new forms of PII emerge or existing ones evolve, these models can be retrained to incorporate these changes, making them a more flexible and future-proof solution for PII detection.
  • Efficiency: Although pattern matching is faster per record, a single LLM pass can cover many types of PII at once, which can reduce the overall effort of handling the expansive and varied data landscapes within enterprises, from customer data and biometric data to user-generated content and third-party data.
  • Less maintenance: Regex requires constant updates to keep up with the evolving nature of PII. LLMs, once trained, can handle a wide array of PII forms with minimal updates, reducing the maintenance load.
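
One way such a workflow might look in code is sketched below. It is a sketch under stated assumptions, not the method used in Model Foundry: the prompt wording, the JSON response format, and the names `detect_pii`, `complete`, and `fake_complete` are all hypothetical. The model call is passed in as a function, and a canned stand-in replaces it here so the example runs without any API.

```python
import json

# Hypothetical prompt; the wording and output format are assumptions.
PROMPT_TEMPLATE = (
    "List every piece of personal information in the text below. "
    'Respond with a JSON array of objects with keys "type" and "value". '
    "Return [] if there is none.\n\nText: {text}"
)


def detect_pii(text, complete):
    """Ask an LLM, via the injected `complete` callable, for PII in text.

    Only findings whose value appears verbatim in the text are kept,
    which guards against the model returning values it made up.
    """
    raw = complete(PROMPT_TEMPLATE.format(text=text))
    try:
        findings = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable response: report nothing rather than guess
    if not isinstance(findings, list):
        return []
    return [
        f for f in findings
        if isinstance(f, dict)
        and isinstance(f.get("value"), str)
        and f["value"] in text
    ]


def fake_complete(prompt):
    """Stand-in for a real model call; returns a canned response."""
    return json.dumps([
        {"type": "name", "value": "Robbie"},
        {"type": "location", "value": "Raleigh"},
        {"type": "email", "value": "robbies@email.com"},
        {"type": "phone", "value": "555-0000"},  # not in the text
    ])


# Abridged from the sample dataset above.
review = ("What's up, folks? Robbie from Raleigh here. Feel free to hit "
          "me up at robbies@email.com if you need some user tips!")
for finding in detect_pii(review, fake_complete):
    print(finding["type"], "->", finding["value"])
```

In a real pipeline, `fake_complete` would be replaced by a call to whichever model the team uses. Checking each returned value against the source text is a cheap safeguard, since a model can return plausible values that never appeared in the record.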

In essence, the sophisticated capabilities of LLMs and foundation models mark a paradigm shift in PII detection. They offer a more precise, context-aware, and efficient solution that can keep pace with the dynamism of PII and the ever-growing data repositories of enterprises. By adopting these models, businesses can not only ensure rigorous PII management but also unleash the full potential of their data in a secure, privacy-compliant manner.

Stay tuned for a follow-up blog post featuring a technical exploration of how enterprise AI teams can quickly and easily leverage LLMs to detect and extract personal information from their datasets with high accuracy.