Labelbox, August 2, 2023

How to use LLMs to detect and extract personal data from AI datasets

As enterprise AI teams build more powerful models based on customer data, biometrics, and user-generated content to automate and enhance their business processes, it’s more vital than ever to remove personally identifiable information (PII) from their datasets. Otherwise, their AI systems can leak data, putting customers, users, and the organization at risk. In the previous post of this blog series, we explored why removing PII from datasets is important, where PII might come up in datasets, and how AI and data science teams typically identify PII with regular expressions, which can be unreliable on unstructured text data.

In this post, we’ll explore how you can easily leverage a large language model (LLM) to identify and extract PII from your datasets with much higher accuracy, efficiency, and speed.

Building LLM prompts for detecting PII

Our first step in using LLMs to detect personal information is to define the categories of detail we want to flag, usually called a taxonomy, and then build prompts around them. In this example, these are the details we will look for in our dataset:

  • Name: The complete name of an individual, including first name, middle name, and last name.
  • Email Address: The personal email address of an individual.
  • Social Security Number (SSN): The nine-digit number that the U.S. government issues to all U.S. citizens and eligible U.S. residents.
  • Date of Birth (DOB): The full birth date of an individual, including day, month, and year.
  • Residential Address: The full home address of an individual, including house number, street name, city, state, and ZIP code.
  • Other PII: This category includes all other pieces of information that can potentially identify an individual.
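
The taxonomy above can also be kept as data, so the same definitions feed prompts, reports, and review tooling. The following is a minimal sketch; the `PIICategory` class, the `key` values, and `describe_taxonomy` are hypothetical names introduced for illustration, while the labels and descriptions are taken from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PIICategory:
    key: str          # short machine-readable identifier (hypothetical naming)
    label: str        # human-readable name from the taxonomy
    description: str  # definition that a prompt or report can reuse


# Labels and descriptions mirror the taxonomy in this post.
PII_TAXONOMY: List[PIICategory] = [
    PIICategory("name", "Name",
                "The complete name of an individual, including first name, "
                "middle name, and last name."),
    PIICategory("email", "Email Address",
                "The personal email address of an individual."),
    PIICategory("ssn", "Social Security Number (SSN)",
                "The nine-digit number that the U.S. government issues to all "
                "U.S. citizens and eligible U.S. residents."),
    PIICategory("dob", "Date of Birth (DOB)",
                "The full birth date of an individual, including day, month, "
                "and year."),
    PIICategory("address", "Residential Address",
                "The full home address of an individual, including house "
                "number, street name, city, state, and ZIP code."),
    PIICategory("other", "Other PII",
                "All other pieces of information that can potentially "
                "identify an individual."),
]


def describe_taxonomy(categories: List[PIICategory] = PII_TAXONOMY) -> str:
    """Render the taxonomy as one bullet per category."""
    return "\n".join(f"- {c.label}: {c.description}" for c in categories)


print(describe_taxonomy())
```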

For this post I am using GPT-4 to maximize results. However, teams working on PII use cases may prefer an open source model such as LLaMA or BERT for a similar task. Below is a prompt tuned for GPT-4 that returns several pieces of information about PII. The first part of the prompt asks the model to detect whether a specific piece of data contains any PII and to list all of the PII it finds. You can also provide the model with several pieces of information that include PII so it has context for what to look for.

The next prompts will let us know if the data contains any names, email addresses, social security numbers, dates, addresses and any other pieces of information. After we use all these prompts, we can then look for any additional notes.

"For each piece of data: "Example of enterprise data from before"

Please identify whether or not the above text contains PII
Please identify and extract any instances where a person's name is given 
Please identify all email addresses concealed in the following text. 
Find any instances of a nine-digit identification number, typically given by the US government to its citizens. It would follow the format ###-##-####. 
I need to find any dates that correspond to the day a person was born. They typically follow the format MM/DD/YYYY or DD/YY or other combinations.  Can you find them in the text? 
Find any locations that could be a person's home address in this text.
Please find any other examples of PII I might have missed

Once you identify this please answer this with a binary classification example and the exact example of the PII example. For the exact example please provide no extra information at this time. For the first question please list the PII examples. Then add any additional notes to the piece of data that may be important." 
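
One way to assemble this prompt in code is sketched below. The question and instruction strings are taken from the prompt above; `build_pii_prompt` and `detect_pii` are hypothetical helper names, and `complete` stands in for whatever client wrapper you use to send text to your model. No specific vendor API is assumed.

```python
from typing import Callable

# Questions taken from the prompt in this post, one per detection task.
PII_QUESTIONS = [
    "Please identify whether or not the above text contains PII",
    "Please identify and extract any instances where a person's name is given",
    "Please identify all email addresses concealed in the following text.",
    "Find any instances of a nine-digit identification number, typically "
    "given by the US government to its citizens. It would follow the format "
    "###-##-####.",
    "I need to find any dates that correspond to the day a person was born. "
    "They typically follow the format MM/DD/YYYY or DD/YY or other "
    "combinations. Can you find them in the text?",
    "Find any locations that could be a person's home address in this text.",
    "Please find any other examples of PII I might have missed",
]

OUTPUT_INSTRUCTIONS = (
    "Once you identify this please answer this with a binary classification "
    "example and the exact example of the PII example. For the exact example "
    "please provide no extra information at this time. For the first question "
    "please list the PII examples. Then add any additional notes to the piece "
    "of data that may be important."
)


def build_pii_prompt(text: str) -> str:
    """Wrap one piece of data in the questions and output instructions."""
    questions = "\n".join(PII_QUESTIONS)
    return (f'For each piece of data: "{text}"\n\n'
            f"{questions}\n\n{OUTPUT_INSTRUCTIONS}")


def detect_pii(text: str, complete: Callable[[str], str]) -> str:
    """Send the prompt through a caller-supplied completion function.

    `complete` is an assumption: any callable that takes a prompt string
    and returns the model's reply as a string.
    """
    return complete(build_pii_prompt(text))
```

Keeping the model call behind a plain callable makes it straightforward to swap GPT-4 for another model without touching the prompt text.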

As we can see, these prompts are far less complex than setting up unique regular expressions for PII detection and extraction. They can be used with various LLMs, depending on your preferences.

Examples of PII detection with an LLM

Let’s walk through a few examples ranging from easiest to most challenging to see how foundation models can expedite the PII detection process.

"Do we all know Jane? She's our in-house Python whisperer. Born on a chilly New Year's Day in '85, she's been coding magic ever since. Reach out to her at jane.python@enterprise.com or catch her on her cell at (555) 555-1111."

| Question | Answer | PII |
|---|---|---|
| Contains PII | Yes | "Jane", "New Year's Day in '85", "jane.python@enterprise.com", "(555) 555-1111" |
| Person's Name | Yes | "Jane" |
| Email Addresses | Yes | "jane.python@enterprise.com" |
| Social Security Number | No | |
| Birth Dates | Yes | "New Year's Day in '85" |
| Home Address | No | |
| Other PII | Yes | Phone Number, "(555) 555-1111" |

Other Notes: The phone number is a common placeholder and may not represent actual PII. Context and real-world verification would be needed to determine if it's indeed PII.
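
For comparison, here is roughly what a regex-only baseline does with the same text. The patterns are illustrative assumptions written for this sketch, not the expressions from the previous post, and `regex_baseline` is a hypothetical helper.

```python
import re

# Illustrative patterns only; real-world PII regexes need far more cases.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

JANE_TEXT = (
    "Do we all know Jane? She's our in-house Python whisperer. Born on a "
    "chilly New Year's Day in '85, she's been coding magic ever since. "
    "Reach out to her at jane.python@enterprise.com or catch her on her "
    "cell at (555) 555-1111."
)


def regex_baseline(text: str) -> dict:
    """Return every match for each pattern, keyed by category."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


for category, matches in regex_baseline(JANE_TEXT).items():
    print(category, matches)
```

These patterns pick up the email address and the phone number, but nothing in them can recognize "Jane" as a name or "New Year's Day in '85" as a birth date, both of which the LLM flagged.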


"Have you met our superstar from the sales team yet? Benny, he's the go-to guy for all your queries. Drop him a line at benny.sales@enterprise.com or give him a call at his LA office."

| Question | Answer | PII |
|---|---|---|
| Contains PII | Yes | "Benny", "benny.sales@enterprise.com", "LA office" |
| Person's Name | Yes | "Benny" |
| Email Addresses | Yes | "benny.sales@enterprise.com" |
| Social Security Number | No | |
| Birth Dates | No | |
| Home Address | No | |
| Other PII | Potentially | "LA office" |

Other Notes: In this text, "LA office" may not be typically considered as PII since it refers to a business location rather than a personal address.
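
Because the model can answer "Potentially" as well as "Yes" or "No", downstream code needs a third state. The sketch below shows one possible policy: redact confirmed PII, keep clean fields, and send ambiguous ones to a human reviewer. The `Verdict`, `Finding`, and `action_for` names and the routing policy itself are assumptions for illustration, not part of the workflow described in this post.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    POTENTIALLY = "potentially"


def parse_verdict(answer: str) -> Verdict:
    """Map the model's free-text answer onto three states."""
    normalized = answer.strip().lower()
    if normalized == "yes":
        return Verdict.YES
    if normalized in {"no", "none"}:
        return Verdict.NO
    # Anything else (e.g. "Potentially") is treated as ambiguous.
    return Verdict.POTENTIALLY


@dataclass
class Finding:
    category: str
    verdict: Verdict
    items: List[str] = field(default_factory=list)


# Hypothetical policy: what to do with each kind of answer.
ACTIONS = {
    Verdict.YES: "redact",
    Verdict.NO: "keep",
    Verdict.POTENTIALLY: "review",
}


def action_for(finding: Finding) -> str:
    return ACTIONS[finding.verdict]


# Rows from the example above.
benny_findings = [
    Finding("Person's Name", parse_verdict("Yes"), ["Benny"]),
    Finding("Email Addresses", parse_verdict("Yes"),
            ["benny.sales@enterprise.com"]),
    Finding("Home Address", parse_verdict("No ")),
    Finding("Other PII", parse_verdict("Potentially"), ["LA office"]),
]

for f in benny_findings:
    print(f.category, "->", action_for(f))
```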


"Hello there! My name is Nicholas and I wanted to share my experience with your software product. As a longtime resident of Chicago, I've been in the technology industry for nearly 20 years. Your software solution is one of the most intuitive I've seen in a long time. It was easy to download, install, and begin using right away. Not to mention the responsiveness of your support team. I had a couple of issues early on, but a quick email to your support (the support agent was nick_young_chi@example.com) was all it took for the team to respond. They even followed up on my direct line to ensure the problems were resolved. Well done! I look forward to seeing what else your team develops."

| Question | Answer | PII |
|---|---|---|
| Contains PII | Yes | "Nicholas", "Chicago", "nick_young_chi@example.com" |
| Person's Name | Yes | "Nicholas" |
| Email Addresses | Yes | "nick_young_chi@example.com" |
| Social Security Number | No | |
| Birth Dates | No | |
| Home Address | No | No |
| Other PII | None | No |

Other Notes: Although the text does contain some elements of PII (a first name, a city of residence, and an email address), none of these items on their own or in combination would typically be sufficient to uniquely identify an individual.
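
Since the prompt asks for the exact PII strings, removing them from the text can be a plain substitution. The following is a minimal sketch; `redact` and the `[REDACTED]` placeholder are hypothetical choices, and the sample text is an excerpt of the example above.

```python
import re
from typing import Iterable


def redact(text: str, pii_items: Iterable[str],
           placeholder: str = "[REDACTED]") -> str:
    """Replace every exact occurrence of each PII string in one pass."""
    # Longest items first so a longer string wins over any substring of it.
    items = sorted({i for i in pii_items if i}, key=len, reverse=True)
    if not items:
        return text
    pattern = re.compile("|".join(re.escape(i) for i in items))
    return pattern.sub(placeholder, text)


excerpt = (
    "Hello there! My name is Nicholas and I wanted to share my experience "
    "with your software product. As a longtime resident of Chicago, I've "
    "been in the technology industry for nearly 20 years."
)
flagged = ["Nicholas", "Chicago", "nick_young_chi@example.com"]

print(redact(excerpt, flagged))
```

Matching on exact strings only works when the model returns the PII verbatim, which is why the prompt asks for "the exact example" with no extra information.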


"Just received my eCamera X and I'm thrilled! The image quality is fantastic, and the user interface is very intuitive. I highly recommend this product to photography enthusiasts and professionals alike."

| Question | Answer | PII |
|---|---|---|
| Contains PII | No | |
| Person's Name | No | |
| Email Addresses | No | |
| Social Security Number | No | |
| Birth Dates | No | |
| Home Address | No | |
| Other PII | No | |

Other Notes: It's a good example of customer feedback that is useful for businesses but does not involve any privacy concerns.


Looking ahead, LLMs are likely to shape the frontier of PII detection and extraction. By moving beyond the constraints of regex and adopting foundation models, we can safeguard data privacy without compromising the value that our data holds. The intersection of AI and data governance enhances our capacity to manage data in an ethically responsible and legally compliant manner.

Harnessing the full potential of LLMs for PII detection, however, can be a challenge for AI teams who don’t already have infrastructure in place to easily connect these models to their AI development tech stack. With Labelbox’s data engine, you can explore and compare foundation models, use them for data processing, fine-tune them for your requirements, and more — making PII detection and management a streamlined, efficient, and accurate process. Watch the video below for a brief demo of how to use Labelbox for finding and removing PII from text data.


Get started today

As AI models become more powerful and rely on sensitive customer data, protecting privacy is paramount. While traditional methods like regular expressions can be unreliable, advanced techniques using LLMs offer a more accurate and efficient way to identify and remove PII.

If you're interested in leveraging Labelbox's tools for your data detection and extraction tasks, sign up for a free Labelbox account to try it out, or contact us to learn more.