Labelbox • July 26, 2023
How to detect and extract personal information from datasets for AI
Data breaches are the stuff of nightmares for most organizations. An unpatched web application framework vulnerability at Equifax resulted in the leakage of personal and financial data of nearly 150 million people in 2017, leaving those affected open to identity theft, fraud, and other crimes. It ultimately cost the company more than $400 million as part of a global settlement with the Federal Trade Commission and other government agencies. When social media giant Meta suffered a data leak in 2021 that exposed over 500 million user accounts, along with their names, phone numbers, emails, locations, and more, the company was hit with hefty fines and a significant loss of public trust. While the origins of these breaches were different, the impact was similar — and catastrophic.
With the AI revolution well underway, ensuring the security of personal data is a key concern for organizations taking advantage of AI solutions in their business practices. In this article, we’ll cover the steps organizations building and using AI can take to identify and remove personal information from their AI data, preventing costly and disastrous data breaches.
What is PII and where can we find it in our datasets?
Personally Identifiable Information (PII), as defined by the U.S. Department of Labor, is any information that could potentially identify an individual through direct or indirect means. This might include, but is not limited to, full names, street addresses, personal and professional emails, telephone numbers, and birth dates.
Data repositories filled with customer data, biometric data, third-party data, and user-generated content (UGC) serve as the lifeblood for enterprise AI solutions, but they are fraught with the risk of PII exposure. Customer data is a significant reservoir of PII. With each transaction, a customer shares a wealth of personal information, including their name, contact details, and payment information. Similarly, customer support interactions, online or offline, contribute to the accumulation of PII. Individuals engaging with an enterprise's digital platforms also often leave behind digital footprints in the form of IP addresses, geolocation data, and more, which all fall under the category of PII.
Biometric data adds another dimension to the PII spectrum. Advances in technology have led to a rise in the use of biometrics for identification and authentication purposes. Fingerprints, facial recognition, and voice patterns are just a few examples of biometric data that carry PII and demand stringent protection protocols. Third-party data and user-generated content (UGC) also carry a high probability of containing PII.
Harnessing these types of data for AI can help enterprises realize significant benefits, but those who want to build secure AI systems based on this data will need to have:
- A thorough understanding of the data they are dealing with
- The ability to discern whether the data used for training the models include PII
- The tools to ensure that this information can be utilized in a legally compliant and ethical manner
PII is everywhere in an enterprise's data ecosystem. Organizations must cultivate a robust data governance strategy that can effectively identify, manage, and safeguard PII across diverse data sources. Doing so not only ensures compliance with data protection regulations but also fosters trust with customers, a crucial component of a successful business in today's data-driven world.
Examples of PII
Below is a small sample of the kind of PII data an enterprise could collect and use for training its AI models. For safety, the records were generated with GPT-4 so that every example of PII is fictitious.
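Rendered as a small Python structure, a comparable sample might look like the sketch below. Every value is a documentation-safe invention (note the 555 phone prefixes and the reserved example.com domains); none describes a real person.

```python
# Fictitious records illustrating common PII fields an enterprise might hold.
# All values are invented placeholders, not data about real individuals.
sample_records = [
    {
        "full_name": "Maria T. Alvarez",
        "email": "maria.alvarez@example.com",
        "phone": "+1-555-014-2298",
        "street_address": "482 Birchwood Lane, Springfield, IL 62704",
        "date_of_birth": "1987-03-14",
    },
    {
        "full_name": "James O'Connell",
        "email": "j.oconnell@example.org",
        "phone": "+1-555-019-7731",
        "street_address": "19 Harbor View Rd, Portland, ME 04101",
        "date_of_birth": "1975-11-02",
    },
]
```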
How to detect PII in your data
Enterprises have long worked with customer data and other information that includes PII, and they typically employ regular expressions, or regex, to detect PII in their datasets. Regex relies on predefined patterns to search for PII — for example, an organization can easily flag NSFW language in the comments section of its articles via regex, because it is looking for specific words or phrases.
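For instance, a minimal regex-based scan in Python might look like the following. The patterns are deliberately simplified illustrations; production rule sets are much larger and, as discussed next, still fall short on unstructured text.

```python
import re

# Simplified, illustrative patterns only. Real rule sets grow far larger
# and still miss many of the ways PII appears in free-form text.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"(?:\+1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def regex_scan(text: str) -> dict:
    """Return every substring that matches a known PII pattern."""
    return {label: pattern.findall(text) for label, pattern in PII_PATTERNS.items()}

print(regex_scan("Reach Maria at maria.alvarez@example.com or 555-014-2298."))
# {'email': ['maria.alvarez@example.com'], 'us_phone': ['555-014-2298'], 'ssn': []}
```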
When it comes to searching through the complexities and subtleties of human language in unstructured text for personal identifiers, however, regex is less reliable. It lacks the capability to recognize context, synonyms, and the diverse ways in which PII can be presented in a text. This often leads to both false negatives, where PII is missed, and false positives, where non-PII data is incorrectly marked as PII.
In contrast, large language models (LLMs) and foundation models emerge as powerful tools that transcend these constraints, paving the way for enhanced PII detection and extraction. These models, trained on extensive datasets, can untangle intricate language patterns, understand context, and handle ambiguities, thereby offering a more nuanced approach to PII detection.
Watch the short video below to learn how to quickly and easily use Model Foundry to leverage GPT-4 to find instances of PII in a text dataset.
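Outside of Model Foundry, the same idea can be sketched against any chat-completion API. The snippet below uses the OpenAI Python client; the prompt wording and the "type"/"value" output schema are assumptions made for this illustration, not Labelbox's implementation.

```python
import json
from openai import OpenAI  # assumes the v1+ openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a PII auditor. List every piece of personally identifiable "
    "information in the text below as a JSON array of objects with "
    '"type" and "value" keys. Return [] if none is found.\n\nText:\n{text}'
)

def llm_pii_scan(text: str) -> list:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep audit output as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    # In practice the reply should be validated: models can wrap JSON in prose.
    return json.loads(response.choices[0].message.content)

print(llm_pii_scan("Maria mentioned she grew up at 482 Birchwood Lane."))
# e.g. [{"type": "name", "value": "Maria"},
#       {"type": "street_address", "value": "482 Birchwood Lane"}]
```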
Below are a few ways in which LLMs outperform regex in this domain:
- Contextual understanding: Unlike regex, LLMs possess an intrinsic understanding of language context. They can discern when a certain piece of information, such as a date or a number, is being used as PII or is just a common data point, which substantially reduces the occurrence of false positives and negatives (illustrated in the sketch after this list).
- Learning from data: Foundation models learn from a vast corpus of data, absorbing the diverse ways in which PII can appear in texts. This makes them adept at identifying unconventional or subtle presentations of PII that regex might miss.
- Adaptability: LLMs are designed to learn and adapt. As new forms of PII emerge or existing ones evolve, these models can be retrained to incorporate these changes, making them a more flexible and future-proof solution for PII detection.
- Efficiency: A single regex runs quickly, but covering every form of PII demands an ever-growing library of hand-built patterns. One well-prompted LLM can cover the expansive and varied data landscapes within enterprises, from customer data and biometric data to user-generated content and third-party data.
- Less maintenance: Regex requires constant updates to keep up with the evolving nature of PII. LLMs, once trained, can handle a wide array of PII forms with minimal updates, reducing the maintenance load.
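To make the contextual-understanding point concrete, consider a toy case of the false-positive problem: a bare digit pattern cannot tell a phone number from an order ID, while a model that reads the surrounding words can.

```python
import re

text = "Your order #5550142298 has shipped. Questions? Call 555-014-2298."

# A naive ten-digit pattern flags both numbers, though only one is PII.
naive = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{4}\b")
print(naive.findall(text))  # ['5550142298', '555-014-2298'] -- one false positive
```

An LLM prompted as in the earlier sketch can use the cues "order #" and "Call" to report only the second number as PII.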
In essence, the sophisticated capabilities of LLMs and foundation models mark a paradigm shift in PII detection. They offer a more precise, context-aware, and efficient solution that can keep pace with the dynamism of PII and the ever-growing data repositories of enterprises. By adopting these models, businesses can not only ensure rigorous PII management but also unleash the full potential of their data in a secure, privacy-compliant manner.
Stay tuned for a follow-up blog post featuring a technical exploration of how enterprise AI teams can quickly and easily leverage LLMs to detect and extract personal information from their datasets with high accuracy.