How to create a quality strategy for your data labeling project
The outcome of a labeling project is often a reflection of how it began. A project put together in a rushed or careless manner often produces suboptimal or even unusable training data. High-quality training data is the ultimate goal of every labeling project, so spending time to develop a quality strategy at the beginning of your labeling project is crucial.
While developing a quality strategy requires some upfront investment, it will ultimately save you time and effort on reviews and corrections downstream.
A quality strategy refers to how you ensure that your labeling project produces, and continues to produce, high-quality training data. Your quality strategy might include automated quality monitoring features, manual review and feedback, insights gleaned from monitoring a quality SLA, and, above all, regular two-way communication with your labeling team.
Labelbox's quality features
Insight into labeling quality is crucial in not only understanding labeling progress, but also in maximizing labeling efficiency to better manage labeling time and cost. The Labelbox Annotation platform has several features that can assist with quality monitoring.
Benchmark
Benchmark is a Labelbox QA tool that lets you designate a label on an asset as the "gold standard"; all other labels created on that asset are then automatically compared to the benchmark label.
You can create this label yourself or choose a "perfect" label created by your labeling team and mark it as a benchmark. The asset is then distributed to all labelers on the project, and each labeler receives a benchmark score depending on how closely their label matches the benchmark label.
Benchmark scores range from 0% (no match between the benchmark label and the labeler's annotations) to 100% (complete match between the benchmark and the labeler's annotations).
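Labelbox calculates benchmark scores for you, and the exact formula is described in the documentation referenced below. As an illustration of the underlying idea, the following sketch scores a labeler's bounding boxes against a "gold standard" label using intersection-over-union; the box format and the averaging scheme are simplifying assumptions, not the platform's actual calculation.

```python
# Illustrative sketch only: Labelbox computes benchmark scores internally.
# This shows the general idea of scoring a labeler's bounding boxes against
# a "gold standard" label.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def benchmark_score(benchmark_boxes, labeler_boxes):
    """Average the best IoU match for each benchmark box (0% = no match, 100% = perfect)."""
    if not benchmark_boxes:
        return 100.0 if not labeler_boxes else 0.0
    best_matches = [
        max((iou(b, candidate) for candidate in labeler_boxes), default=0.0)
        for b in benchmark_boxes
    ]
    return 100.0 * sum(best_matches) / len(benchmark_boxes)

gold = [(10, 10, 50, 50)]       # benchmark label
candidate = [(12, 11, 48, 52)]  # a labeler's annotation
print(f"{benchmark_score(gold, candidate):.1f}%")  # roughly 84.0%
```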
You can learn more about how to set up benchmark labels and how benchmark scores are calculated in our documentation.
How does benchmark support quality?
Benchmark measures labeling accuracy. You can use benchmark on tasks that have one objectively correct answer and measure labeler performance against a "gold standard".
It is helpful to determine a target benchmark score (per labeler or per label). The target percentage may depend on task complexity and quality expectations. A 90% benchmark score might be difficult to achieve on one task, but fall below quality expectations on another task. Therefore, you should think about and communicate your target benchmark score expectations to your labelers and reviewers.
Benchmark scores should be monitored throughout the labeling project. Monitoring at the labeler level can help identify labelers who may struggle more with the task than others, resulting in lower benchmark scores, and you can retrain them or remove them from the project accordingly. Tracking the average benchmark scores across the project can help your team identify general challenges with the task. For example, a drop in scores once a new batch of data has been added may point to a new pattern in the data that needs to be addressed with updated labeling instructions.
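If you export per-label benchmark scores from your project, a few lines of analysis can surface labelers who fall below your agreed target. The sketch below assumes a simple list of records with hypothetical field names, labeler emails, and an example 85% target; adapt it to whatever export format you actually use.

```python
from collections import defaultdict

# Hypothetical exported records: one entry per label, with the benchmark
# score and the labeler who created it (field names are assumptions).
labels = [
    {"labeler": "alice@example.com", "benchmark_score": 0.94},
    {"labeler": "alice@example.com", "benchmark_score": 0.91},
    {"labeler": "bob@example.com",   "benchmark_score": 0.78},
    {"labeler": "bob@example.com",   "benchmark_score": 0.81},
]

TARGET = 0.85  # example target agreed with the labeling team

scores_by_labeler = defaultdict(list)
for label in labels:
    scores_by_labeler[label["labeler"]].append(label["benchmark_score"])

for labeler, scores in scores_by_labeler.items():
    avg = sum(scores) / len(scores)
    flag = "" if avg >= TARGET else "  <-- below target, consider retraining"
    print(f"{labeler}: {avg:.0%} over {len(scores)} labels{flag}")
```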
Consensus
The Consensus feature generates an agreement rate by comparing the annotations provided by all labelers working on the same asset.
When adding a new batch of data to your labeling project, you can enable or disable consensus for that batch and configure batch-specific settings such as priority, consensus coverage, and the number of labels per data row.
How does consensus support quality?
Consensus measures labeling consistency. Use consensus on subjective tasks where you want to resolve disagreements with a majority vote. You can also use it to identify data labeled with low consistency that needs to be checked by an expert.
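The platform computes consensus agreement rates for you; the sketch below is only meant to illustrate the idea of majority voting and low-agreement flagging on a simple classification task. The asset IDs, answers, and 50% review threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical classification answers from three labelers on the same assets.
votes_per_asset = {
    "asset_001": ["positive", "positive", "negative"],
    "asset_002": ["neutral", "neutral", "neutral"],
    "asset_003": ["positive", "negative", "neutral"],
}

REVIEW_THRESHOLD = 0.5  # send anything at or below 50% agreement to an expert

for asset_id, votes in votes_per_asset.items():
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    route = " -> route to expert review" if agreement <= REVIEW_THRESHOLD else ""
    print(f"{asset_id}: majority = {winner} ({agreement:.0%} agreement){route}")
```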
As with benchmarks, there is not a universal "good" or "bad" consensus score percentage. A “good” consensus score percentage will depend on the complexity of the task along with your quality expectations. When used for truly subjective tasks, consensus scores will not indicate the quality of a label, but rather the level of consistency between your labelers' subjective opinions.
Consensus scores should be monitored throughout the life of a labeling project. Monitoring consensus scores at the labeler-level can help identify a labeler that might consistently disagree with other labelers, signaling that the labeler might not have a strong understanding of the task. Average consensus scores across the project can help measure overall consistency. Low scores on an objective task may point towards gaps in labeler training or can allude to ambiguous wording in labeling instructions, leaving room for different interpretations by your labelers.
Manually review data with custom workflows
While manual review and providing feedback can be somewhat time intensive, it is an important component in establishing quality in the early stages of your labeling project and should be a part of your quality strategy.
To ensure quality, labeling should begin with small batches that are manually reviewed. Once quality is established, manual reviews can taper off to a percentage of the data, and you can rely more on reviews performed by the labeling team itself and on quality tools like benchmark and consensus.
In Labelbox Annotate, you can set up customized review steps based on your quality strategy in your project's Workflow tab. As you work with large, complex projects, reviewing every labeled data row becomes increasingly time-consuming and expensive. You can leverage workflows to create a highly customizable, step-by-step review pipeline that drives efficiency and automation into your review process. Upon creating a new project, the default workflow includes one review step ("Initial review task") that applies to 100% of labels.
You can create a new review task and customize your review process with up to 10 review tasks. For example, you can add a second review step on a sample of data, performed by your internal review team after the labeling team has completed its own review. This allows for greater QA before data rows are transitioned to "Done" and are ready for production use. You can further customize your review process by routing labels that include a specific annotation type to a separate review step (e.g., sending all data rows with that annotation type to an expert reviewer).
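Workflows are configured in the project's Workflow tab rather than in code, but if it helps to see the routing logic spelled out, here is a plain-Python sketch of the kind of pipeline described above. The step names, the 20% sample rate, and the "tumor" annotation type are all hypothetical.

```python
import random

# Plain-Python sketch of the review routing described above; in practice you
# configure these steps in the Workflow tab, not in code.
SECOND_REVIEW_SAMPLE = 0.20        # spot-check 20% of labels a second time (example value)
EXPERT_ANNOTATION_TYPE = "tumor"   # hypothetical annotation type that requires an expert

def route(data_row):
    steps = ["Initial review task"]                  # default: applies to 100% of labels
    if EXPERT_ANNOTATION_TYPE in data_row["annotations"]:
        steps.append("Expert review task")           # route specialist content to an expert
    elif random.random() < SECOND_REVIEW_SAMPLE:
        steps.append("Internal spot-check task")     # sampled second review by your team
    steps.append("Done")
    return steps

example = {"id": "row_42", "annotations": ["car", "tumor"]}
print(route(example))  # e.g. ['Initial review task', 'Expert review task', 'Done']
```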
For a more in-depth walkthrough and demonstration of workflows, please refer to our guide on how to customize your annotation review process.
How does manual review support quality?
Manual review can go hand-in-hand with sharing constructive feedback with individual labelers or the entire labeling team.
Rather than simply approving or rejecting a label, you can take the time to share written feedback on why a label was rejected. This can be extremely helpful at the beginning of a new task as your labeling team gets started on your project. The labeling team can then use this feedback to better understand your expectations, fix mistakes, and avoid similar mistakes moving forward.
Labeling data is an inherently collaborative process that requires continuous feedback between labelers and reviewers to ensure high-quality outcomes. You can make use of issues & comments on the Labelbox platform to quickly pinpoint and share written feedback directly on the label. This feature also supports two-way communication, allowing labelers to respond or ask clarifying questions.
SLA monitoring
Implementing a quality SLA for your project is optional. However, once the decision for an SLA is made, it requires consistent monitoring to be effective.
A labeling project with an SLA in place, such as a minimum labeling quality threshold to be reached, requires regular monitoring and review to determine whether the SLA has been met on a batch of data.
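How you measure "met" depends on how your SLA is defined. The sketch below assumes a simple definition, the share of reviewed labels in a batch that were approved, and an example 95% threshold; both are placeholders for whatever you have agreed with your labeling partner.

```python
# Hypothetical review results for one batch; the 95% threshold is an example SLA.
SLA_THRESHOLD = 0.95

batch_reviews = [
    {"data_row": "row_001", "approved": True},
    {"data_row": "row_002", "approved": True},
    {"data_row": "row_003", "approved": False},
    {"data_row": "row_004", "approved": True},
]

approved = sum(r["approved"] for r in batch_reviews)
rate = approved / len(batch_reviews)
status = "met" if rate >= SLA_THRESHOLD else "NOT met"
print(f"Batch approval rate: {rate:.0%} -> SLA {status}")
```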
How does SLA monitoring support quality?
While you could rely solely on the SLA agreement with your labeling partner to meet the required quality threshold, a rigorous review process provides an opportunity to further improve quality. During your review of each batch of data, consider the following questions (a simple error tally, like the sketch after this list, can help answer the first two):
- Is one labeler making significantly more mistakes than others? If so, consider retraining or replacing that labeler.
- Is there one error category that shows up more frequently than others? If so, consider fleshing out the labeling instructions for that error type or putting together a small training project to retrain labelers on it.
- Are there changes in the data from batch to batch that lead to poor quality because labelers lack experience with the new or different data? If so, consider updating your instructions with examples from the new batch. Walk the labeling team through these changes and encourage them to ask questions if they are unsure how to label the new data.
- Are you finding errors that do not fit into one of the error categories established as part of the quality SLA? If so, consider adding a new error category for the next batch and highlight this change to your labeling team so that labelers and reviewers are aware of what to look out for.
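If you record each rejection with the labeler and an error category during review, a quick tally makes these patterns easy to spot. The records, labeler emails, and category names below are placeholders for whatever error taxonomy your SLA defines.

```python
from collections import Counter

# Hypothetical rejection records gathered during batch review.
rejections = [
    {"labeler": "bob@example.com",   "error": "missing annotation"},
    {"labeler": "bob@example.com",   "error": "wrong class"},
    {"labeler": "carol@example.com", "error": "missing annotation"},
    {"labeler": "bob@example.com",   "error": "missing annotation"},
]

errors_by_labeler = Counter(r["labeler"] for r in rejections)
errors_by_category = Counter(r["error"] for r in rejections)

print("Errors per labeler: ", errors_by_labeler.most_common())
print("Errors per category:", errors_by_category.most_common())
```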
Create communication channels
Direct communication is of utmost importance before and during the labeling and feedback process. Communication channels should be carefully chosen, set up, and communicated clearly to all involved parties ahead of starting a project.
Some common communication channels are listed below:
- Chat programs (e.g. Slack)
- Shared documents (e.g. Google Docs or Slides)
- Regular scheduled meetings (in person or remotely, ideally with screen sharing capabilities)
- Communication features on the labeling platform (e.g. Updates and Issues on the Labelbox platform)
Once communication methods have been chosen, it is worthwhile to establish further guidelines around communication, such as availability, response times, the recipients/stakeholders for different topics, the structure of messaging (e.g., different chat groups or channels, multiple email threads versus one big thread), and the accessibility of shared file formats (e.g., Google Docs versus Microsoft Word documents). Initially, these details might not seem important; however, clear communication guidelines improve efficiency and affect the final outcome and quality of your project.
You can rely on a combination of communication channels, such as weekly Q&A or feedback meetings coupled with a group chat to quickly unblock progress between meetings. Communication channels are useful for day-to-day status updates from the labeling team and for announcements of new projects or data volumes from the project owner.
Your chosen method of communication should foster direct communication between the labeling team and a subject-matter expert on the project owner's side. Two-way communication is encouraged at all times during the labeling process as it will help streamline and accelerate the review/rework and feedback process.
To make communication efficient and direct, consider applying the following principles:
- Meet all questions from the labeling team with patience and understanding; labelers aren't as familiar with the data as you are.
- Monitor and respond to your chosen communication channels regularly; this will help unblock the labeling team in a timely manner.
- Define how questions and feedback should be shared. For example, are screenshots or direct links to labels on the platform the most helpful?
- Be sure to communicate the availability of all parties involved (time zone work hours, holidays, etc.).
- Be mindful of any cultural differences. Labelers may be initially hesitant to ask you questions.
Communication is always a two-way street – encourage labelers to reach out to subject-matter experts with any questions or concerns and be sure to respond to questions and share feedback throughout the duration of the labeling task.
How does effective communication support quality?
Imagine receiving a set of tasks, with ambiguous instructions, and having no way of following up to ask clarifying questions. It wouldn't be a huge surprise if you misinterpreted the instructions based on your assumptions and ended up getting every task wrong. In this case, it would be incredibly useful to receive specific feedback that could be used to correct mistakes and understand how to correctly carry out the task.
This is where direct two-way communication can greatly help the labeling team gain assurance and learn more about your expectations. Understanding task instructions and correcting any labeling errors early on in the project's lifecycle can greatly help with labeling quality. Identifying any errors and misconceptions at the project's onset allows labelers to avoid making those same mistakes on future data rows.
While outlining your quality strategy is essential to ensuring that your labeling task is producing high quality training data, you'll also want to make sure that you're able to scale up your labeling operations without sacrificing quality.
A key question for many ML teams is how to maintain consistency and quality as team size or data volume grows. To learn more about how to scale your labeling operations while maintaining quality, please refer to the next guide in this series: how to scale up your labeling operations while maintaining quality.