Labelbox•August 2, 2023
How to use LLMs to detect and extract personal data from AI datasets
As enterprise AI teams build more powerful models on customer data, biometrics, and user-generated content to automate and enhance their business processes, it’s more vital than ever to remove personally identifiable information (PII) from their datasets. Otherwise, their AI systems can leak data, putting customers, users, and the organization at risk. In the previous post of this blog series, we explored why extracting PII from datasets is important, where PII might come up in datasets, and how AI and data science teams typically identify PII with regular expressions, which can be unreliable when used on unstructured text data.
In this post, we’ll explore how you can use a large language model (LLM) to identify and extract PII from your datasets more accurately and efficiently than with regular expressions.
Building LLM prompts for detecting PII
Our first step in using LLMs to detect personal information is to define the set of details we want our prompts to flag, usually called a taxonomy. In this example, these are the details we will look for in our dataset:
- Name: The complete name of an individual, including first name, middle name, and last name.
- Email Address: The personal email address of an individual.
- Social Security Number (SSN): The nine-digit number that the U.S. government issues to all U.S. citizens and eligible U.S. residents.
- Date of Birth (DOB): The full birth date of an individual, including day, month, and year.
- Residential Address: The full home address of an individual, including house number, street name, city, state, and ZIP code.
- Other PII: This category includes all other pieces of information that can potentially identify an individual.
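One way to keep this taxonomy in a single place, so that prompts and any later reporting stay in sync with it, is a small data structure. The following is a minimal Python sketch. The `PIICategory` class, the short `key` labels, and the `taxonomy_as_text` helper are illustrative names chosen for this example and are not part of any Labelbox or model-provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PIICategory:
    key: str          # short machine-readable label (hypothetical naming)
    name: str         # category name as listed in the taxonomy
    description: str  # what the category covers (abbreviated here)


# The six categories from the taxonomy above.
PII_TAXONOMY = (
    PIICategory("name", "Name", "Complete name: first, middle, and last name."),
    PIICategory("email", "Email Address", "Personal email address."),
    PIICategory("ssn", "Social Security Number (SSN)",
                "Nine-digit number issued by the U.S. government."),
    PIICategory("dob", "Date of Birth (DOB)",
                "Full birth date: day, month, and year."),
    PIICategory("address", "Residential Address",
                "Full home address, including city, state, and ZIP code."),
    PIICategory("other", "Other PII",
                "Any other information that can potentially identify an individual."),
)


def taxonomy_as_text(taxonomy=PII_TAXONOMY) -> str:
    """Render the taxonomy as a bullet list that can be pasted into a prompt."""
    return "\n".join(f"- {c.name}: {c.description}" for c in taxonomy)


print(taxonomy_as_text())
```

Keeping the categories as data rather than as text scattered across prompts makes it easier to add or rename a category later without editing every prompt by hand.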
For this post, I am using GPT-4 to maximize results. However, for PII use cases, teams may want to use an open source model like LLaMA or BERT to solve a similar task. Below is a carefully tuned prompt for GPT-4 that returns several pieces of information regarding PII. The first part of the prompt asks the model to detect whether a specific piece of data contains any PII and to list all of the examples at the end. You can also provide the model with several pieces of information that include PII so that it has context for what to look for.
The next prompts will let us know if the data contains any names, email addresses, social security numbers, dates, addresses and any other pieces of information. After we use all these prompts, we can then look for any additional notes.
"For each piece of data: "Example of enterprise data from before"
Please identify whether or not the above text contains PII
Please identify and extract any instances where a person's name is given
Please identify all email addresses concealed in the following text.
Find any instances of a nine-digit identification number, typically given by the US government to its citizens. It would follow the format ###-##-####.
I need to find any dates that correspond to the day a person was born. They typically follow the format MM/DD/YYYY or DD/YY or other combinations. Can you find them in the text?
Find any locations that could be a person's home address in this text.
Please find any other examples of PII I might have missed
Once you identify this please answer this with a binary classification example and the exact example of the PII example. For the exact example please provide no extra information at this time. For the first question please list the PII examples. Then add any additional notes to the piece of data that may be important."
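The prompt above can also be assembled programmatically so that the same instructions are applied to every record. The sketch below is one possible way to do that, not the exact setup used for this post. The names `build_pii_prompt`, `detect_pii`, and `echo_stub` are hypothetical, and `complete` stands for any function that sends a prompt string to a model and returns its text reply. No specific vendor client or API signature is assumed, so a stub is used in place of a real model call.

```python
from typing import Callable

# Instructions taken from the prompt above, one per line.
PII_QUESTIONS = (
    "Please identify whether or not the above text contains PII",
    "Please identify and extract any instances where a person's name is given",
    "Please identify all email addresses concealed in the following text.",
    "Find any instances of a nine-digit identification number, typically given "
    "by the US government to its citizens. It would follow the format ###-##-####.",
    "I need to find any dates that correspond to the day a person was born. "
    "They typically follow the format MM/DD/YYYY or DD/YY or other combinations. "
    "Can you find them in the text?",
    "Find any locations that could be a person's home address in this text.",
    "Please find any other examples of PII I might have missed",
)

CLOSING = (
    "Once you identify this please answer this with a binary classification "
    "example and the exact example of the PII example. For the exact example "
    "please provide no extra information at this time. For the first question "
    "please list the PII examples. Then add any additional notes to the piece "
    "of data that may be important."
)


def build_pii_prompt(text: str) -> str:
    """Combine one piece of data with the detection questions."""
    questions = "\n".join(PII_QUESTIONS)
    return f'For each piece of data: "{text}"\n{questions}\n{CLOSING}'


def detect_pii(text: str, complete: Callable[[str], str]) -> str:
    """Send the assembled prompt to whichever model `complete` wraps."""
    return complete(build_pii_prompt(text))


def echo_stub(prompt: str) -> str:
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"(stub) received a prompt of {len(prompt)} characters"


sample = "Drop him a line at benny.sales@enterprise.com or give him a call at his LA office."
print(detect_pii(sample, echo_stub))
```

Passing the model in as a plain callable keeps the prompt logic independent of the provider, so the same code could be pointed at GPT-4 or at an open source model by swapping the function supplied as `complete`.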
As we can see, these prompts are far less complex than writing a separate regular expression for each type of PII we want to detect and extract. They can also be used with various LLMs, depending on your preferences.
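For comparison, here is roughly what the regex route looks like: one hand-written pattern per category. This is a sketch under stated assumptions. The three patterns are simplified illustrations, not the expressions from the previous post or a production-grade scanner, and the sample text is the first example from the section below.

```python
import re

# Simplified, illustrative patterns (assumptions, not production rules).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # ###-##-####
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # MM/DD/YYYY only
}


def regex_scan(text: str) -> dict:
    """Return every match for each pattern, keyed by category label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


jane = (
    "Do we all know Jane? She's our in-house Python whisperer. Born on a "
    "chilly New Year's Day in '85, she's been coding magic ever since. Reach "
    "out to her at jane.python@enterprise.com or catch her on her cell at "
    "(555) 555-1111."
)

print(regex_scan(jane))
# {'email': ['jane.python@enterprise.com'], 'ssn': [], 'dob': []}
```

The patterns catch the email address but not the name "Jane" or the birth date written as "New Year's Day in '85". Covering that kind of free-form phrasing with regular expressions would take many more rules, and it is the kind of phrasing the prompts above are meant to handle.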
Examples of PII detection with an LLM
Let’s walk through a few examples, ranging from easiest to most challenging, to see how foundation models can expedite the PII detection process.
"Do we all know Jane? She's our in-house Python whisperer. Born on a chilly New Year's Day in '85, she's been coding magic ever since. Reach out to her at jane.python@enterprise.com or catch her on her cell at (555) 555-1111."
"Have you met our superstar from the sales team yet? Benny, he's the go-to guy for all your queries. Drop him a line at benny.sales@enterprise.com or give him a call at his LA office."
"Hello there! My name is Nicholas and I wanted to share my experience with your software product. As a longtime resident of Chicago, I've been in the technology industry for nearly 20 years. Your software solution is one of the most intuitive I've seen in a long time. It was easy to download, install, and begin using right away. Not to mention the responsiveness of your support team. I had a couple of issues early on, but a quick email to your support (the support agent was nick_young_chi@example.com) was all it took for the team to respond. They even followed up on my direct line to ensure the problems were resolved. Well done! I look forward to seeing what else your team develops."
"Just received my eCamera X and I'm thrilled! The image quality is fantastic, and the user interface is very intuitive. I highly recommend this product to photography enthusiasts and professionals alike."
Looking ahead, it is clear that the frontier of PII detection and extraction will be shaped by LLMs. By moving beyond the constraints of regex and embracing the sophistication of foundation models, we take a bold leap into a future where data privacy is safeguarded without compromising the immense value that our data holds. The intersection of AI and data governance heralds an exciting era of innovation, enhancing our capacity to manage data in an ethically responsible and legally compliant manner.
Harnessing the full potential of LLMs for PII detection, however, can be a challenge for AI teams who don’t already have infrastructure in place to easily connect these models to their AI development tech stack. With Labelbox’s data engine, you can explore and compare foundation models, use them for data processing, fine-tune them for your requirements, and more — making PII detection and management a streamlined, efficient, and accurate process. Watch the video below for a brief demo of how to use Labelbox for finding and removing PII from text data.
Get started today
As AI models become more powerful and rely on sensitive customer data, protecting privacy is paramount. While traditional methods like regular expressions can be unreliable, advanced techniques using LLMs offer a more accurate and efficient way to identify and remove PII.
If you're interested in leveraging Labelbox's tools for your data detection and extraction tasks, sign up for a free Labelbox account to try it out, or contact us to learn more.