Labelbox, February 23, 2022

How to create high-quality training data

In recent years, the machine learning (ML) community at large has shifted its focus from building models to improving the quality of the data used to train them. Improving the quality of training data, however, can mean something different for every use case, model, and even iteration cycle.

Some teams will need to focus on enhancing their data annotation pipeline with quality management methods, updated ontologies, and better collaboration. Others may need to look more closely at the performance of their model in training to better understand its specific needs. And some may need to simply work on expanding their training dataset to include certain classes and edge cases. In this post, we’ll look at all three areas of improving the quality of training data.

Improving annotations

For many ML teams, the biggest hurdle to building a high-performance model is creating quality labeled training data. It can be a challenge to ensure that each asset is labeled accurately and labeled in a way that will best “teach” the model what it needs to do.

To tackle these challenges, advanced ML teams have developed workflows that improve both labeling quality and the ontology as they better understand the use case and model requirements. While these workflows vary based on the problem at hand, they typically share the following goals.

An iterative approach to producing training data

Many ML teams label their data in large chunks, or even all at once. This “waterfall” approach to AI data labeling makes it difficult to ensure accuracy, as data annotation requirements often evolve as an ML project progresses.

Instead, leading ML teams label their data in smaller batches. They give more supervision and feedback to labelers at the beginning of the project to ensure they understand the task at hand, build a strategy to tackle any edge cases that arise, and adjust the ontology as both labelers and other stakeholders develop a better understanding of the project.

An iterative approach that prioritizes two-way collaboration between the labelers and the ML team enables labelers to create higher quality training data over time.

Ways to correct and improve quality of labeled training data

Many advanced ML teams build quality management workflows into their data annotation pipelines to evaluate and improve their training data. These can include benchmarking, which compares labeled assets to ground truth labels; consensus, which compares multiple labelers’ work on the same asset to ensure that everyone is performing the task the same way; and review queues, which enable other labelers or domain experts to review, approve, or leave feedback on labelers’ annotations.
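To make benchmarking and consensus concrete, here is a minimal sketch in plain Python. It assumes each label is a single class name per asset; the function names, asset IDs, and the 0.7 review threshold are hypothetical choices for illustration and not part of any particular labeling platform. Real pipelines often use richer measures (for example, overlap scores for bounding boxes), but the idea is the same.

```python
from itertools import combinations


def benchmark_accuracy(labels, ground_truth):
    """Fraction of benchmark assets where a labeler matches the ground truth.

    Both arguments map asset ID -> class name. Only assets present in
    both mappings are scored.
    """
    shared = [asset for asset in ground_truth if asset in labels]
    if not shared:
        return 0.0
    correct = sum(labels[a] == ground_truth[a] for a in shared)
    return correct / len(shared)


def consensus_score(votes):
    """Fraction of labeler pairs that agree on one asset (1.0 = unanimous)."""
    pairs = list(combinations(votes, 2))
    if not pairs:
        return 1.0  # a single vote cannot disagree with itself
    return sum(a == b for a, b in pairs) / len(pairs)


def needs_review(votes_by_asset, threshold=0.7):
    """Asset IDs whose consensus falls below the (assumed) threshold."""
    return sorted(
        asset
        for asset, votes in votes_by_asset.items()
        if consensus_score(votes) < threshold
    )


# Hypothetical example data.
ground_truth = {"img_1": "ship", "img_2": "buoy", "img_3": "ship"}
labeler_a = {"img_1": "ship", "img_2": "ship", "img_3": "ship"}
votes_by_asset = {
    "img_4": ["ship", "ship", "ship"],
    "img_5": ["ship", "buoy", "dock"],
    "img_6": ["ship", "ship", "buoy"],
}

print(benchmark_accuracy(labeler_a, ground_truth))  # 2 of 3 benchmark assets match
print(needs_review(votes_by_asset))                 # ['img_5', 'img_6']
```

Assets that fall below the threshold are natural candidates for a review queue, where a domain expert or senior labeler can settle the disagreement.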

Any of these workflows can go a long way toward improving the quality of training data. To assess how effective they are, teams can also use metadata to track how a labeled asset has improved over time.

Expert input

For some use cases in specialized fields such as healthcare or agriculture, a model needs to be trained on tasks that were traditionally done only by those with years of experience and training. Developing quality training data for these use cases may require these experts to do the labeling, which can be a lengthy and expensive process due to other demands on their time.

Genentech, a large biotech company, experienced this challenge firsthand when training ML models to accurately read medical imagery such as radiology scans. Putting together a labeling team consisting entirely of trained radiologists and medical experts for medical image annotation was an almost impossible task, so the team came up with an alternative approach.

Their medical experts trained ordinary labelers on the task and reviewed their work in an iterative process. This workflow required less time from domain experts while still ensuring that the labeling team was producing high-quality training data. Learn more about Genentech’s labeling workflow.

Understanding your model

Developing high-quality training data often requires a more nuanced understanding of the model after every iteration. While ML teams may look at aggregate summary metrics to track model improvements and trends, many teams don’t drill further down into the data. As a result, they miss biases and differences in performance across classes. These issues will keep degrading results unless the next iteration cycle gives the model data that corrects them.

Once a training cycle is complete, teams should look carefully at how the model performs on every class to identify potential edge cases, biases, and other issues. Teams should then choose the next training dataset according to their findings.

For example, when building a computer vision model that uses satellite imagery to find and track ships, an ML team may find that the model performs well on a certain class when the image comes from one satellite, but worse on a similar task when the image comes from a different satellite. If the team were to look only at aggregate metrics, this issue might escape their notice entirely, but drilling into the variance of performance within this class will help them identify it.

They can then include more images sourced from the second satellite in their next set of training data to improve performance and quality. Watch this on-demand webinar, How to diagnose and improve model performance, to learn more about this workflow.
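The satellite example above can be sketched as a simple slice analysis. This is an illustrative sketch only: the record fields (label, prediction, source), the satellite names, the counts, and the 0.1 margin are all assumptions made up for the example, not real results.

```python
from collections import defaultdict


def accuracy_by_slice(records, keys):
    """Group prediction records by metadata keys and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for r in records:
        slice_id = tuple(r[k] for k in keys)
        totals[slice_id][0] += r["label"] == r["prediction"]
        totals[slice_id][1] += 1
    return {s: correct / n for s, (correct, n) in totals.items()}


def underperforming(slice_acc, overall, margin=0.1):
    """Slices whose accuracy trails the overall figure by more than the margin."""
    return sorted(s for s, acc in slice_acc.items() if acc < overall - margin)


# Hypothetical evaluation records for the "ship" class from two satellites.
records = (
    [{"label": "ship", "prediction": "ship", "source": "sat_A"}] * 4
    + [{"label": "ship", "prediction": "ship", "source": "sat_B"}] * 1
    + [{"label": "ship", "prediction": "background", "source": "sat_B"}] * 3
)

per_class = accuracy_by_slice(records, ["label"])
per_source = accuracy_by_slice(records, ["label", "source"])

print(per_class)   # {('ship',): 0.625}
print(per_source)  # {('ship', 'sat_A'): 1.0, ('ship', 'sat_B'): 0.25}
print(underperforming(per_source, overall=per_class[("ship",)]))
# [('ship', 'sat_B')]
```

The class-level number (0.625) hides the fact that nearly all of the errors come from one source. The slices returned by underperforming are the ones to oversample when assembling the next training batch.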

Building a better dataset

Datasets compiled according to performance after each iteration will go a long way toward improving your model’s performance, but this workflow only works for teams that start with an abundant, diverse dataset.

Enterprise ML teams often have to work with limited, proprietary data. Teams may also have niche or highly specialized use cases requiring data that cannot be found among large, publicly available datasets. To address these challenges, advanced ML teams typically explore new ways to round out their labeled training datasets.

Synthetic data

ML teams are increasingly turning to synthetic data to meet their needs. Teams can build a model to generate this data or turn to a vendor for synthetic data. This approach has the added benefit of protecting IP or sensitive data during the model training process as the dataset won’t contain any real information, which can be important if a team’s labeling pipeline includes external teams or less secure infrastructure. Synthetic data can also simulate edge cases and conditions that aren’t represented in real data, helping teams fill in the gaps in their dataset.

However, producing high-quality synthetic training data can be a challenge. Machine-generated data can be unreliable, misrepresent the real world, and perpetuate biases that your team needs to reduce or eliminate with your model. While synthetic data can certainly provide value for specialized use cases, it’s important to weigh the advantages against potential pitfalls and come up with additional solutions to mitigate them.

Expanding data types and sources

Tackling a lack of data can require some creative problem solving for data scientists and ML engineers. Depending on the use case at hand, your team may be able to fill some of the gaps in data by casting a wider net. A computer vision use case, for example, may be augmented with tabular or text data.

You may also be able to use weak supervision methods to sort through a dataset, adding metadata or other structure to unstructured data so that you can find patterns in a dataset that contains multiple types of data. Watch our on-demand webinar for a technical demonstration on how you can use multiple data types.
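As a rough illustration of weak supervision, the sketch below applies a few hand-written heuristics (often called labeling functions) to free-text notes attached to images and combines their votes by simple majority. Everything here is hypothetical: the keywords, class names, and function names are invented for the example, and production systems usually weight the heuristics by estimated accuracy rather than counting votes equally.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_mentions_vessel(text):
    words = ("vessel", "tanker", "cargo")
    return "ship" if any(w in text.lower() for w in words) else ABSTAIN


def lf_mentions_wake(text):
    return "ship" if "wake" in text.lower() else ABSTAIN


def lf_mentions_port(text):
    lowered = text.lower()
    return "port" if "harbor" in lowered or "dock" in lowered else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_vessel, lf_mentions_wake, lf_mentions_port]


def weak_label(text, labeling_functions):
    """Majority vote over non-abstaining functions; abstain on ties or no votes."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the top two classes
    return ranked[0][0]


notes = [
    "Cargo vessel near the coast",
    "Empty dock at the harbor",
    "Tanker moored at the dock",
    "Cloud cover over farmland",
]
for note in notes:
    print(note, "->", weak_label(note, LABELING_FUNCTIONS))
```

Notes that receive a weak label can be attached to their images as metadata for filtering or prioritization, while notes where the heuristics abstain or tie are left for human labelers.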

Final thoughts on creating high-quality training data

Whether you are enhancing your labeling pipeline, trying to gain a more nuanced understanding of your model as it trains, growing your dataset, or committing to a combination of these improvements, Labelbox can help. Watch our short demo to learn how our training data platform can transform your labeling operations.