
Automatically label text with 96%+ accuracy using foundation models

Overview

In this guide, you’ll learn how to exponentially increase your labeling speed and efficiency by leveraging Labelbox and foundation models. We will be walking through a text classification task: identifying news articles that talk about sports.

Here's a high-level summary of the process:

  • Connect your data to Labelbox with just a few lines of code.
  • Labelbox leverages foundation models to automatically enrich your data.
  • Use the powerful search capabilities of Labelbox to quickly find articles with similar traits and classify them in one click. With the help of foundation models, you can instantaneously label large amounts of data. Pro tip: Combine a variety of search techniques, such as similarity search, natural language search, and keyword search, and investigate clusters of similar data, to boost your results.
  • While foundation models are a helpful starting point, they might not always correctly classify data, especially on challenging or rare data points. In this case, utilize human-in-the-loop labeling and QA by pre-labeling data using foundation models and sending it for your internal or external labeling team to review.
  • Automatically apply these rules to all new incoming data by creating a slice in Labelbox.

Now, let’s take a look at how we can do the above in Labelbox. As a sneak peek into the process, by leveraging foundation models, we managed to classify 88% of our news articles in minutes with a 96.5% accuracy rate. An additional 15% of our news articles were successfully pre-labeled using foundation models, with 85% accuracy, and sent for human review. This left us with only 493 data points that were missed by foundation models – a massive efficiency gain for any labeling team.


Step 1: Connect your data to Labelbox with a few lines of code

Since this is a classification task, our goal is to have the model correctly identify news articles about sports. We will be using the Hugging Face dataset ag_news, which contains 120,000 articles, including 30,000 about sports, for our analysis.

To begin, let's connect this data to Labelbox. Simply retrieve the dataset from Hugging Face and integrate it with Labelbox in just a few lines of code.

import labelbox as lb
from datasets import load_dataset

# authenticate with your Labelbox API key
client = lb.Client(api_key="YOUR_API_KEY")

# load the ag_news training split from Hugging Face
dataset = load_dataset("ag_news", split="train")

# iterate over the data and build one payload per article
payloads = []
global_keys = []

for counter, data in enumerate(dataset):
    text = data["text"]
    global_key = "ag_news_" + str(counter)
    global_keys.append(global_key)

    # create payload for texts
    payloads.append({
        "row_data": text,
        "global_key": global_key,
    })

# create dataset in Labelbox
lb_dataset = client.create_dataset(name="ag_news")

# add data rows in Labelbox (asynchronous)
task = lb_dataset.create_data_rows(payloads)
task.wait_till_done()
print(task.errors)  # check for errors

Step 2: Leverage foundation models to instantly enhance your data

Labelbox will automatically compute and store MPNet embeddings for your data, using an MPNet implementation available through Hugging Face.

Once your data has been uploaded, watch as Labelbox enriches your data with foundation model embeddings. These embeddings are powerful in that they can be harnessed to automatically label, or pre-label, your data.

If you don’t want to use the default embeddings by Labelbox, you can also upload custom embeddings from any other foundation model, with up to 100 custom embeddings for each data point.

Whether you’re using the default or custom embeddings, embeddings are helpful in curating and finding subsets of data that share similar characteristics. For instance, embeddings power Labelbox’s similarity search, natural language search, and 2D projector view. You can search and explore all of your data with tools that help you powerfully surface specific subsets of similar texts.

Step 3: Use powerful search capabilities to quickly find data

With powerful search capabilities in Labelbox, you can easily find and classify data that share similar characteristics. This is a special case of zero-shot and few-shot learning: the challenge is to find all examples of sports articles, based on zero (or a few) examples. With the help of foundation models, and minimal human signal, you can quickly label a lot of data in just a few clicks. The following are tools in Labelbox that help provide labeling signal to make it easy to automatically classify your data:

Zero-shot Labeling: Projector View for Classification

Labelbox allows you to visualize data clusters in 2D. For this example, we can see distinct clusters. By inspecting a few examples, we discover that some of the clusters correspond to sports news articles. We manually select each such cluster and tag it with "UMAP: sports". We intentionally leave out data points situated between clusters, as these are the challenging cases: no labeling function is perfect in isolation, and some data points are genuinely ambiguous.

We then repeat the process with t-SNE instead of UMAP and tag each sports cluster with "t-SNE: sports".

UMAP: a cluster of news articles about sports
t-SNE: a subcluster of news articles about basketball

Zero-shot Labeling: Keyword search for classification

Labelbox enables you to search for all data points that contain certain keywords. We filter all data points that contain any of the following keywords: sport, sports, basketball, baseball, soccer, football, tennis, hockey, and tag these 5,990 texts as “Keyword search: sports”.
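A keyword filter of this kind can be sketched in a few lines of plain Python (the helper name and sample texts below are illustrative, not part of the Labelbox API; Labelbox performs this search for you in Catalog):

```python
import re

# illustrative keyword list from the walkthrough
KEYWORDS = ["sport", "sports", "basketball", "baseball",
            "soccer", "football", "tennis", "hockey"]

# one word-boundary pattern, so "sport" doesn't match "transport"
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)

def keyword_tag(texts):
    """Return the indices of texts that mention any sports keyword."""
    return [i for i, t in enumerate(texts) if PATTERN.search(t)]

# toy sample standing in for ag_news articles
sample = [
    "The basketball team won the championship last night.",
    "Oil prices rose sharply amid supply concerns.",
    "Transport strikes disrupted the city.",  # no match: word boundary
]
print(keyword_tag(sample))  # → [0]
```

The word-boundary anchors matter: without them, substrings like "sport" inside "transport" would produce false positives.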

Zero-shot Labeling: Natural language search for classification

Labelbox enables you to conduct natural language searches on text. For example, you can type in “news articles about sports” to surface all pieces of text about sports. Adjusting the similarity threshold will narrow the search to only the most relevant articles. For this use case, we filter for a similarity score higher than 0.85 and tag all of the 6,468 texts as “Natural language search: sports”.

A natural language search will surface thousands of news articles about sports. By adjusting the similarity score, we can keep the most confident zero-shot predictions.
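Under the hood, a natural language search boils down to cosine similarity between a query embedding and each article's embedding. A minimal numpy sketch, with toy 3-d vectors standing in for real 768-d MPNet embeddings:

```python
import numpy as np

def nl_search(query_emb, doc_embs, threshold=0.85):
    """Return indices of documents whose cosine similarity to the
    query embedding exceeds the threshold."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q
    return np.where(sims > threshold)[0].tolist()

# toy embeddings; real ones would come from an MPNet encoder
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0],   # close to the query
    [0.0, 1.0, 0.0],   # orthogonal
    [1.0, 0.05, 0.0],  # very close
])
print(nl_search(query, docs))  # → [0, 2]
```

Raising the threshold keeps only the most confident matches, which is exactly what the 0.85 filter above does.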

Few-shot Labeling: Similarity search for classification

Labelbox also streamlines few-shot labeling. Quickly browse all your data in Catalog to surface 5 news articles about sports. For each of them, run a similarity search and tag the top results (e.g., those with a similarity score higher than 0.85) as “Similarity search: sports”. This gives us 5 new labeling functions that surface sports news articles.

A similarity search example with an anchor article about college basketball. We can filter to keep the most confident few-shot predictions.

Combining different sources of signal: weak labeling

While each of these labeling signals is powerful on its own, you can combine multiple sources in Labelbox. This allows you to apply simple rules in a weak supervision fashion to further enhance your results. Integrate different labeling signals, such as similarity searches, natural language searches, and data clusters, to boost your outcomes. You can combine various filters by using the AND and OR functions.
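As a mental model, combining labeling signals amounts to boolean logic over the tags each labeling function fired. A hypothetical sketch (the minimum-vote rule below is one illustrative weak-supervision choice, not Labelbox's exact behavior):

```python
# each data row carries the set of tags its labeling functions fired
rows = [
    {"UMAP: sports", "t-SNE: sports", "Keyword search: sports"},
    {"Natural language search: sports"},
    set(),
]

def weak_label(tags, min_votes=2):
    """Label a row as sports when at least `min_votes` independent
    labeling functions agree."""
    return len(tags) >= min_votes

print([weak_label(t) for t in rows])  # → [True, False, False]
```

Requiring agreement between two or more independent signals is what lets imperfect labeling functions combine into a high-precision rule.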

Step 4: Automatically classify data with foundation models and use human-in-the-loop QA for challenging cases

High confidence data points: direct classification

Foundation models are highly confident about most data points. So much so that we can directly classify data points leveraging Labelbox’s bulk classification feature. With this new feature, you can specify and send your texts to a specific step of the labeling and review workflow. We can directly move these high-confidence data points straight to the “Done” task.

We classify thousands of texts in bulk, and send them to the “Done” task of our labeling project, in just a click, since foundation models are confident on those.

In practice, these high-confidence data points are those that match one of the following rules:

  • They belong to the sports clusters in both UMAP and t-SNE. But just how accurate are these predictions? Checking against the Hugging Face ground truths, 704 out of the 21,560 sports predictions are incorrect.
  • Or, their similarity score to two or more anchors is higher than 0.85. This results in 3,548 sports classifications, all of which are accurate except 185.
  • Or, they belong to a sports cluster in UMAP or t-SNE and have a natural language search score higher than 0.85. This results in 1,219 sports classifications, all of which are accurate except 58.
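The three rules can be encoded as a single routing function. A sketch under the stated thresholds (the field names are hypothetical, not a Labelbox schema):

```python
def is_high_confidence(row):
    """Route a data row to direct classification when it satisfies
    any of the three high-confidence rules described above."""
    in_both_clusters = row["umap_sports"] and row["tsne_sports"]
    multi_anchor = sum(s > 0.85 for s in row["anchor_sims"]) >= 2
    cluster_plus_nl = (row["umap_sports"] or row["tsne_sports"]) \
        and row["nl_score"] > 0.85
    return in_both_clusters or multi_anchor or cluster_plus_nl

row = {"umap_sports": True, "tsne_sports": False,
       "anchor_sims": [0.9, 0.7, 0.3], "nl_score": 0.9}
print(is_high_confidence(row))  # → True (cluster + NL search rule)
```

Rows that fail all three checks fall through to the low-confidence, human-in-the-loop path described in the next section.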

This method of surfacing high-confidence data points enables us to directly classify 26,427 pieces of text, with only 947 errors, achieving an accuracy of 96.5%. Since 26,427 out of 30,000 sports articles were classified directly by foundation models, the coverage is 88%.

What about the 947 errors? Upon closer inspection, it turns out that they are all related to sports, but in the context of World, Business, or Science & Tech news, and hence were labeled on Hugging Face under those categories instead of Sports.

Foundation models failed on 947 news articles. It turns out that these articles are all related to sports, but are classified on Hugging Face as World, Business, or Science & Tech news.

Now, let’s move on to classify the remaining 12% of data rows, on which foundation models are less confident.

Low confidence data points: Human-in-the-Loop labeling

For some pieces of text, foundation models exhibit low confidence. We can bulk classify these data points in Labelbox, but move them to the “To Review” task. This will ensure a human is looped in and will review the classifications coming from foundation models.

We pre-label thousands of data points in bulk, and send them to the “To Review” task of our labeling project, in just a click, since foundation models are only moderately confident on these pieces of text.

In practice, these data points are those that belong to the sports cluster, with UMAP or t-SNE, and that haven’t been classified yet.

Using this approach, we managed to classify 4,723 additional data rows with an accuracy of 85% (696 errors). We send these low-confidence data rows for Human-in-the-Loop review.

Results


| | Direct classification with foundation models | Human in the loop with foundation models | Sports articles missed by foundation models |
| --- | --- | --- | --- |
| # of data rows classified | 26,427 | 4,723 | 493 |
| # of errors | 947 | 696 | - |
| Accuracy | 96.5% | 85% | - |
| Fraction of sports articles | 88% | 15% | 1.8% |

With powerful search capabilities and the bulk classification feature, we managed to classify 26,427 pieces of text (88%) in minutes, with 96.5% accuracy thanks to foundation models. An additional 4,723 data points (15%) have been pre-labeled with foundation models, with 85% accuracy, and sent for human review. This leaves only 493 data points talking about sports, missed by foundation models.

Step 5: Set it and forget it – automatically apply these rules to fresh, incoming data

With Labelbox slices, we automatically classify fresh, incoming news articles about sports.

For example, we can set up a slice that automatically surfaces all new pieces of text that have been connected to Labelbox in the past week and haven’t been classified yet. We can set the slice’s criteria to include only text data rows where the natural language search score for the prompt “news articles about sports” is higher than 0.85, since we know these data rows are very likely to be about sports.
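Conceptually, the slice criteria amount to a filter like the following (a hypothetical sketch; real slices are configured in the Labelbox UI, not with this code):

```python
from datetime import datetime, timedelta, timezone

def in_slice(row, now=None):
    """Mimic the slice criteria: added in the past week, not yet
    classified, and natural-language-search score above 0.85."""
    now = now or datetime.now(timezone.utc)
    recent = now - row["created_at"] <= timedelta(days=7)
    return recent and not row["classified"] and row["nl_score"] > 0.85

now = datetime(2023, 6, 8, tzinfo=timezone.utc)
rows = [
    {"created_at": datetime(2023, 6, 5, tzinfo=timezone.utc),
     "classified": False, "nl_score": 0.92},   # matches the slice
    {"created_at": datetime(2023, 5, 1, tzinfo=timezone.utc),
     "classified": False, "nl_score": 0.95},   # too old
]
print([in_slice(r, now) for r in rows])  # → [True, False]
```

Because the slice re-evaluates as data arrives, new sports articles keep surfacing without any manual querying.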

With slices, you can easily surface and inspect any new, high-impact data that gets added to your data lake.

From there, it only takes one click to classify all of these text data rows as sports articles.

Conclusion

With powerful search capabilities and the bulk classification feature, we managed to classify 26,427 pieces of text (88%) in minutes, with 96.5% accuracy thanks to foundation models. An additional 4,723 data points (15%) have been pre-labeled with foundation models, with 85% accuracy, and sent for human review. This leaves only 493 data points talking about sports, missed by foundation models.


If you’re a current Labelbox user who wants to leverage any foundation model to supercharge your data labeling process in just a few clicks, try our bulk classification feature today or get started with a free Labelbox account.
If you’re interested in seeing how quickly you can label images leveraging foundation models, check out our guide on how to automatically label images with 99% accuracy.