Labelbox•August 22, 2023
When the Dialpad team, for example, couldn’t meet their quality standards with their data labeling service, they struggled to scale their AI projects. After months of slow feedback loops with few performance gains, the team decided to implement a labeling operations solution that prioritized labeling quality and speed by providing observability features via a performance dashboard.
For many AI teams, getting high-quality training data is a challenge. Improving labeling quality is often integral to scaling up AI development. Labeling quality refers to the accuracy, consistency, and reliability of the annotations produced by human labelers or automated labeling models. It quantifies the extent to which the annotations align with the desired ground truth and meet the requirements of the specific labeling task. The quality of labeled data is often considered the single most important factor in determining the performance of the model.
In this blog post, we’ll explore how labeling quality can be measured and improved with the right observability practices for labeling operations.
To ensure the best business outcome, labeling operations need to strike the right balance between agility and quality, since both are essential to cutting costs while building a robust machine learning model. From an operational perspective, this optimization translates to three measurements: throughput, efficiency, and quality.
In the following sections, we will define each measurement, show how labeling operations teams can visualize them, and explore what actions teams can take based on certain conditions.
Throughput, defined as the rate at which items pass through a process, is the foundational metric to consider in large-scale labeling operations. The following need to be considered when measuring throughput:
The metrics for All (meaning all data rows in the project) and Done (meaning all labeled and approved data rows in the project) should both typically increase at a steady rate to keep the model iterations going. If there is a mismatch between the rate at which data rows enter the project and the rate at which they are marked as done, it could point to a gap in expectations of speed vs. quality between the machine learning and labeling operations teams. Some possible next steps are:
Efficiency, in the context of labeling operations, reflects the turnaround time of quality labels. Efficiency considers the following measurements:
With these metrics, you can take the following actions:
There are two levels to measuring labeling quality that are considered best practices in labeling operations: human review and agreement.
Quality measured by human review is typically the rate at which labels are accepted or rejected when audited by human reviewers. Human reviews, often done by subject matter experts, can be expensive. It is thus important to ensure that your team reviews only when needed. Setting up a customizable review workflow can help balance quality requirements with review expenses.
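As a rough illustration of the review-based measurement described above, the sketch below computes an acceptance rate per labeler from a review log. The log format, the labeler names, and the `acceptance_rates` helper are hypothetical and only meant to show the arithmetic; they are not taken from any particular tool.

```python
from collections import Counter

# Hypothetical review log: (labeler_id, review_outcome) pairs.
reviews = [
    ("alice", "accepted"),
    ("alice", "accepted"),
    ("alice", "rejected"),
    ("bob", "accepted"),
    ("bob", "rejected"),
    ("bob", "rejected"),
]


def acceptance_rates(reviews):
    """Return the fraction of reviewed labels that were accepted, per labeler."""
    accepted = Counter()
    total = Counter()
    for labeler, outcome in reviews:
        total[labeler] += 1
        if outcome == "accepted":
            accepted[labeler] += 1
    return {labeler: accepted[labeler] / total[labeler] for labeler in total}


rates = acceptance_rates(reviews)
for labeler, rate in sorted(rates.items()):
    print(f"{labeler}: {rate:.0%} of reviewed labels accepted")
```

A per-labeler breakdown like this can help decide where expert review time is best spent, rather than reviewing every label uniformly.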
Agreement-based quality looks at agreement between labels. Teams can look at consensus scores (inter-annotator agreement between labels created on the same asset) and benchmark scores (agreement between labels created by annotators and a gold standard label).
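To make the two agreement scores concrete, here is a minimal sketch that assumes simple classification labels and exact-match agreement. Real tools may use other agreement measures (for example, overlap-based measures for bounding boxes or segmentation masks), so the `consensus_score` and `benchmark_score` functions below are illustrative assumptions rather than any product's actual formula.

```python
from itertools import combinations


def consensus_score(labels):
    """Mean pairwise exact-match agreement among labels on the same asset."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return None  # a single label has nothing to agree with
    return sum(a == b for a, b in pairs) / len(pairs)


def benchmark_score(labels_by_asset, gold_by_asset):
    """Fraction of an annotator's labels that match the gold standard label."""
    shared = [asset for asset in labels_by_asset if asset in gold_by_asset]
    if not shared:
        return None  # the annotator labeled no benchmark assets
    matches = sum(labels_by_asset[a] == gold_by_asset[a] for a in shared)
    return matches / len(shared)


# Three annotators labeled the same image; only one of three pairs agrees.
consensus = consensus_score(["cat", "cat", "dog"])

# One annotator's labels compared against gold labels on two benchmark assets.
benchmark = benchmark_score(
    {"img1": "cat", "img2": "dog", "img3": "cat"},
    {"img1": "cat", "img2": "cat"},
)

print(f"consensus: {consensus:.2f}, benchmark: {benchmark:.2f}")
```

Consensus needs no ground truth but costs extra labels per asset, while benchmark scores require a curated gold set but can be computed from a single label per asset.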
With quality metrics, if the agreement rate is below your expected SLA (for example, 90%), you should dig deeper to understand error patterns with specific examples.
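The check described above can be sketched as a threshold filter followed by a tally of the most common disagreements. The 90% threshold comes from the example in the text; the score values, the record format, and the helper names are hypothetical.

```python
from collections import Counter

SLA = 0.90  # example threshold from the text; set this to your own SLA


def below_sla(scores_by_labeler, sla=SLA):
    """Return labelers whose agreement score is strictly below the SLA."""
    return sorted(name for name, score in scores_by_labeler.items() if score < sla)


def confusion_patterns(records):
    """Count (gold_label, given_label) pairs where the two labels disagree."""
    return Counter((gold, given) for gold, given in records if gold != given)


# Hypothetical agreement scores per labeler.
scores = {"alice": 0.95, "bob": 0.82, "carol": 0.90}
flagged = below_sla(scores)

# Hypothetical (gold_label, given_label) records for the flagged labeler.
records = [("cat", "cat"), ("cat", "dog"), ("cat", "dog"), ("dog", "cat")]
patterns = confusion_patterns(records)

print("below SLA:", flagged)
for (gold, given), count in patterns.most_common():
    print(f"gold={gold} labeled as {given}: {count} time(s)")
```

Ranking disagreements by frequency points to the specific examples worth pulling up first, which is usually more actionable than the aggregate score alone.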
If you’re using Labelbox, you can easily filter your data to root out these error patterns. Once you have identified these error patterns, you can communicate feedback to your annotation teams and/or perform a similarity search to proactively identify potential issues in bulk to improve labeling quality.
Once you have set these probes up, they should be monitored continuously as new data gets added and new quality issues get surfaced. Labelbox’s performance dashboard empowers AI teams with near-real-time performance metrics, including throughput, efficiency, and quality metrics at the aggregate level, as well as the ability to explore the data using filters. This observability enables teams not only to ensure that all their labels are high quality, but also to accelerate their labeling process, cutting their overall annotation and review costs. Try it today for free.