Labelbox • July 26, 2023
Data breaches are the stuff of nightmares for most organizations. An unpatched web application framework vulnerability at Equifax resulted in the leakage of personal and financial data of nearly 150 million people in 2017, leaving those affected open to identity theft, fraud, and other crimes. It ultimately cost the company more than $400 million as part of a global settlement with the Federal Trade Commission and other government agencies. When social media giant Meta had a data leak in 2021 that exposed over 500 million user accounts, along with names, phone numbers, emails, locations, and more, the company was charged hefty fines and suffered a significant loss of public trust. While the origins of these breaches were different, the impact was similarly catastrophic.
With the AI revolution well underway, ensuring the security of personal data is a key concern for organizations taking advantage of AI solutions in their business practices. In this article, we'll cover how organizations building and using AI can take steps to identify and remove personal information from their AI data, preventing it from causing costly and disastrous data breaches.
Personally Identifiable Information (PII), as defined by the U.S. Department of Labor, is any information that could potentially identify an individual through direct or indirect means. This might include, but is not limited to, full names, street addresses, personal and professional emails, telephone numbers, and birth dates.
Data repositories filled with customer data, biometric data, third-party data, and user-generated content (UGC) serve as the lifeblood for enterprise AI solutions, but they are fraught with the risk of PII exposure. Customer data is a significant reservoir of PII. With each transaction, a customer shares a wealth of personal information, including their name, contact details, and payment information. Similarly, customer support interactions, online or offline, contribute to the accumulation of PII. Individuals engaging with an enterprise's digital platforms also often leave behind digital footprints in the form of IP addresses, geolocation data, and more, which all fall under the category of PII.
Biometric data adds another dimension to the PII spectrum. Advances in technology have led to a rise in the use of biometrics for identification and authentication purposes. Fingerprints, facial recognition, and voice patterns are just a few examples of biometric data that carry PII and demand stringent protection protocols. Third-party data and user-generated content (UGC) also carry a high probability of containing PII.
Harnessing these types of data for AI can help enterprises realize significant benefits, but those who want to build secure AI systems on top of this data will need a reliable way to identify, manage, and safeguard the PII it contains.
PII is everywhere in an enterprise's data ecosystem. Organizations must cultivate a robust data governance strategy that can effectively identify, manage, and safeguard PII across diverse data sources. Doing so not only ensures compliance with data protection regulations but also fosters trust with customers, a crucial component of a successful business in today's data-driven world.
Below is a small dataset of potential PII an enterprise might collect and use for training its AI models. For security, this data was generated with GPT-4 so that all examples of PII are fictitious.
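To give a concrete sense of what such a dataset looks like, here are a few illustrative rows. Like the GPT-4-generated examples described above, every value below is made up for this sketch; the field names are assumptions, not a specific schema from the original dataset.

```python
# Illustrative fictitious PII records of the kind an enterprise
# might hold in customer data. No value belongs to a real person.
fictitious_records = [
    {"name": "Maria Chen", "email": "maria.chen@example.com",
     "phone": "555-014-2236", "address": "12 Birchwood Lane, Springfield"},
    {"name": "Omar Haddad", "email": "o.haddad@example.org",
     "phone": "555-990-4417", "address": "88 Riverside Ave, Lakeview"},
]

# Every field in these records is a potential PII exposure point
# that a detection pipeline would need to flag.
for record in fictitious_records:
    print(sorted(record.keys()))
```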
Enterprises have long worked with customer data and other information that includes PII, and they typically employ regular expressions, or regex, to detect PII among their datasets. Regex relies on predefined patterns to search for PII — for example, an organization can easily use regex to flag NSFW language in its article comment sections, because the search targets specific words or phrases.
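A minimal sketch of this pattern-based approach might look like the following. The patterns here are deliberately simple illustrations; production-grade regexes for emails, phone numbers, and SSNs are considerably more involved.

```python
import re

# Simple, illustrative regex patterns for a few common PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return a list of (pii_type, matched_string) pairs found in text."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((pii_type, match.group()))
    return hits

sample = ("Contact Jane Doe at jane.doe@example.com; her number is "
          "555-867-5309, and she lives at 42 Elm Street.")
print(find_pii(sample))
```

Note that this catches the email address and phone number but silently misses the name and street address — exactly the kind of false negative the next section discusses.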
When it comes to searching through the complexities and subtleties of human language in unstructured text for personal identifiers, however, regex is less reliable. It lacks the capability to recognize context, synonyms, and the diverse ways in which PII can be presented in a text. This often leads to both false negatives, where PII is missed, and false positives, where non-PII data is incorrectly marked as PII.
In contrast, large language models (LLMs) and foundation models emerge as powerful tools that transcend these constraints, paving the way for enhanced PII detection and extraction. These models, trained on extensive datasets, can untangle intricate language patterns, understand context, and handle ambiguities, thereby offering a more nuanced approach to PII detection.
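One common way to apply an LLM to this task is to prompt it to return structured output. The sketch below shows a prompt template and a parser for the model's reply; the prompt wording, JSON schema, and example reply are all illustrative assumptions, not any particular vendor's API, and the actual model call is omitted.

```python
import json

# Illustrative prompt template for LLM-based PII extraction.
# The wording and output schema are assumptions for this sketch.
PROMPT_TEMPLATE = """You are a data-privacy assistant. Identify every piece of
personally identifiable information (PII) in the text below. Return a JSON
list of objects with "type" (e.g. "name", "email", "address") and "value".

Text:
{text}
"""

def build_pii_prompt(text):
    """Fill the template with the document to be scanned."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_pii_response(raw):
    """Parse the model's JSON reply into (type, value) pairs."""
    return [(item["type"], item["value"]) for item in json.loads(raw)]

# With a real client you would send build_pii_prompt(...) to the model;
# here we parse a plausible hand-written reply instead.
reply = ('[{"type": "name", "value": "Jane Doe"}, '
         '{"type": "address", "value": "42 Elm Street"}]')
print(parse_pii_response(reply))
```

Unlike the regex approach, the model can be asked to flag names, addresses, and other free-form identifiers that no fixed pattern would capture.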
Watch the short video below to learn how to quickly and easily use Model Foundry to leverage GPT-4 to find instances of PII in a text dataset.
Below are a few ways in which LLMs outperform regex in this domain:

- Context awareness: LLMs can recognize when a string is being used as a personal identifier, rather than matching patterns blindly.
- Robustness to variation: they handle synonyms, ambiguities, and the many formats in which PII can appear in unstructured text.
- Fewer errors: this understanding reduces both false negatives (PII that is missed) and false positives (non-PII data incorrectly flagged).
In essence, the sophisticated capabilities of LLMs and foundation models mark a paradigm shift in PII detection. They offer a more precise, context-aware, and efficient solution that can keep pace with the dynamism of PII and the ever-growing data repositories of enterprises. By adopting these models, businesses can not only ensure rigorous PII management but also unleash the full potential of their data in a secure, privacy-compliant manner.
Stay tuned for a follow-up blog post featuring a technical exploration of how enterprise AI teams can quickly and easily leverage LLMs to detect and extract personal information from their datasets with high accuracy.