Labelbox • September 12, 2024
Inside the data factory: How Labelbox produces the highest quality data at scale
In pursuit of AGI and beyond, data quality is not just a checkbox for frontier AI labs—it's the cornerstone of innovation and a critical competitive advantage. The quality of training data determines the success or failure of these cutting-edge models.
We passionately believe that during this decade, data quality will be the most important factor in advancing model capabilities. Very little is publicly discussed when it comes to measuring data quality and producing high-quality human data efficiently, so we wanted to shed light on our approach and share insights into the strategies and practices we use to achieve these standards.
In this post, we look deep inside the Labelbox AI data factory, revealing important tools, techniques and processes that are the bedrock for producing the highest-grade data at scale. We cover just some of the best practices we follow for measuring and managing data quality. We hope this post sparks some valuable insights, while keeping in mind that we utilize many more advanced strategies to operate the entirety of our modern AI data factory.
Measuring quality
Precision and accuracy are two foundational pillars for measuring data quality. Think of these as the dynamic duo for data quality measurement. Let's break them down in a way that's easy to understand:
Precision: Hitting a target consistently
Imagine you're playing darts. Precision is like throwing multiple darts and having them all cluster tightly together on the board. It doesn't matter if they're in the bullseye or not – what matters is that they're close to each other. In the world of AI data, precision means getting consistent results when collecting human opinions or preferences. It's about reliability and repeatability.
For example, if multiple people rate the same AI-generated text, high precision would mean their ratings are very similar to each other. This consistency is crucial because it shows that your data collection process is reliable and produces strong, clear signals.
Accuracy: Hitting the right target
Now, let's go back to our dart game. Accuracy is about hitting the bullseye – or whatever your specific target is. In AI data quality, accuracy means how close your collected data is to the "truth" or the desired outcome.
However, here's where it gets tricky with generative AI: often, there isn't a single clear-cut "right answer." In such cases, the model creator is the ultimate judge in deciding the right answer aligned to their view of the world. That's why we focus on how quickly we can adjust and improve our accuracy over time. It's like learning to aim better with each round of darts you play.
Why both matter
In the world of AI data quality, we need both precision and accuracy.
High precision ensures that our data is consistent and reliable. When generating human data for subjective tasks (human preferences, evals, RLHF), the human data factory must produce data with high precision. Every RLHF or evaluation task starts with instructions: labeling instructions capture a clear point of view from the model creator's perspective. Because instructions imply a rating criterion, the data generated against them should show high precision. Low precision is most likely caused by poor instructions (including sparse coverage of edge-case examples) or by poor training, onboarding, and execution of labeling projects in production.
Given this context, precision demonstrates a data factory's capability to produce data that has strong signal and consistency.
Good accuracy (or the ability to improve accuracy quickly) ensures that we're collecting our preferred data to train and evaluate our AI models in the way that we believe is most effective and correct. It is also essential to measure the rate of change of accuracy after a few rounds of calibration (feedback).
By measuring and improving both precision and accuracy, we can create a solid foundation for high-quality data – the fuel that powers better, more reliable AI systems. An ideal data factory is highly precise and can quickly calibrate to any desired accuracy level.
In the next two sections, we look at specific ways to measure the precision and accuracy of your data.
Precision metrics
Precision metrics focus on consistency and agreement among labelers, and are primarily derived from Labelbox's built-in consensus capability. Labelbox uses and tests the effectiveness of over 15 such metrics across a wide range of supported annotation types. Here we share some of the metrics we currently prefer.
Inter-rater agreement (IRA)
Inter-rater agreement measures how much consensus there is between different raters (labelers) who are assessing the same data. While there are several methods to calculate IRA, such as Cohen's Kappa and Fleiss' Kappa, we'll focus on Krippendorff's Alpha due to its versatility and robustness in AI data labeling contexts.
Krippendorff's alpha
Krippendorff's Alpha is a popular metric used to assess the agreement among raters because it works well for two or more raters, can handle missing data, and supports nominal, ordinal, and ranking data types. Its values range from -1 to 1, with the following interpretations.
Interpretation
- A value of 1 indicates perfect agreement among raters.
- A value of 0 indicates agreement no better than chance.
- Negative values indicate systematic disagreement.
- As a common rule of thumb, values of 0.8 or higher indicate reliable data, while values between 0.667 and 0.8 support only tentative conclusions.
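As a concrete illustration, here is a minimal sketch of computing Krippendorff's alpha from a raters-by-items matrix, assuming the open-source krippendorff Python package is available; the example data and the ordinal measurement level are assumptions for illustration.

```python
import numpy as np
import krippendorff  # pip install krippendorff (assumed available)

# Rows are raters, columns are items; np.nan marks items a rater skipped.
ratings = np.array([
    [3, 4, np.nan, 5, 2],
    [3, 4, 4,      5, 2],
    [2, 4, 4,      5, 3],
], dtype=float)

# Ordinal ratings (e.g., a 1-5 quality scale); use "nominal" for categorical labels.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```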
Standard deviation of ratings
Standard deviation measures the dispersion of a set of ratings from their mean (average) value. In the context of AI data quality, it quantifies how much variation or spread exists in the ratings given by different AI trainers for the same item or task.
Interpretation
- Lower values indicate higher precision (more clustered ratings).
- Higher values suggest more disagreement or variability among raters.
- The scale of interpretation depends on the rating scale used.
- It's sensitive to outliers, which might skew the interpretation in small sample sizes.
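A minimal NumPy sketch of per-item rating dispersion might look like this; the ratings and the 1-5 scale are made up for illustration.

```python
import numpy as np

# Ratings given by five AI trainers to the same AI-generated response (1-5 scale).
item_ratings = np.array([4, 4, 5, 3, 4])

mean_rating = item_ratings.mean()
spread = item_ratings.std(ddof=1)  # sample standard deviation

print(f"mean={mean_rating:.2f}, std={spread:.2f}")
# A small std relative to the 1-5 scale suggests high precision on this item.
```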
Percent agreement
Percent agreement is a straightforward measure of inter-rater reliability that calculates the proportion of times different raters agree in their judgments. This is particularly useful in classification tasks (enums).
Interpretation
- Ranges from 0% to 100%, with higher percentages indicating better agreement.
- Generally, values above 75-80% are considered good, but this threshold can vary based on task complexity.
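A minimal sketch of pairwise percent agreement for a classification (enum) task could look like the following; the helper function and data layout are illustrative, not a Labelbox API.

```python
from itertools import combinations

def percent_agreement(labels_per_item):
    """Average fraction of rater pairs that agree, across all items."""
    per_item = []
    for labels in labels_per_item:
        pairs = list(combinations(labels, 2))
        agreeing = sum(1 for a, b in pairs if a == b)
        per_item.append(agreeing / len(pairs))
    return 100.0 * sum(per_item) / len(per_item)

# Three raters classifying four items.
labels = [
    ["safe", "safe", "safe"],
    ["safe", "unsafe", "safe"],
    ["unsafe", "unsafe", "unsafe"],
    ["safe", "safe", "unsafe"],
]
print(f"Percent agreement: {percent_agreement(labels):.1f}%")
```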
Accuracy metrics
Accuracy metrics assess how close the labelers' responses are to the ground truth. These are primarily derived from Labelbox's benchmark feature.
For preference ranking, selection, or side-by-side evaluation tasks, you often do not have an initial ground truth available; it must be created. One of the best ways to create it is to use consensus to pick a winner and then verify the result with highly trusted humans. Again, high precision is paramount for creating ground truth and, ultimately, for measuring accuracy.
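One possible sketch of this consensus-then-verify idea is shown below; the function, the agreement threshold, and the trusted-review flag are illustrative assumptions rather than the actual Labelbox workflow.

```python
from collections import Counter

def consensus_winner(votes, min_agreement=0.75):
    """Pick a candidate ground truth from rater votes, or flag it for trusted review."""
    counts = Counter(votes)
    winner, n = counts.most_common(1)[0]
    agreement = n / len(votes)
    needs_trusted_review = agreement < min_agreement
    return winner, agreement, needs_trusted_review

# Five raters pick which of two model responses (A or B) is better.
votes = ["A", "A", "B", "A", "A"]
winner, agreement, review = consensus_winner(votes)
print(winner, f"{agreement:.0%}", "send to trusted reviewer" if review else "accept")
```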
Accuracy score
- What it is: The proportion of correct responses compared to the ground truth.
- How it's calculated: (Number of correct responses / Total number of responses) * 100
- Interpretation: Ranges from 0% to 100%, with higher percentages indicating better accuracy.
- Example: If a labeler correctly classifies 90 out of 100 benchmark tasks, their accuracy score would be 90%.
Mean absolute error (MAE)
- What it is: The average absolute difference between predicted values and actual values.
- How it's calculated: Sum of absolute differences between predictions and actual values, divided by the number of predictions.
- Interpretation: Lower values indicate better accuracy. The scale depends on the range of the values being predicted.
- Example: If labelers are rating the quality of AI-generated text on a scale of 1-10, MAE would show how far off, on average, their ratings are from the ground truth.
F1 score
- What it is: A balanced measure of precision and recall, useful for classification tasks.
- How it's calculated: 2 * ((Precision * Recall) / (Precision + Recall))
- Interpretation: Ranges from 0 to 1, with 1 being the best possible score.
- Example: Useful for tasks like sentiment analysis, where both correctly identifying positive sentiments (precision) and not missing any positive sentiments (recall) are important.
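To make these accuracy metrics concrete, here is a minimal sketch using scikit-learn against a small benchmark set; the labels and ratings are invented for illustration.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error, f1_score

# Benchmark (ground truth) vs. labeler responses for a binary sentiment task.
truth_labels = [1, 0, 1, 1, 0, 1, 0, 1]
rater_labels = [1, 0, 1, 0, 0, 1, 1, 1]

print(f"Accuracy: {accuracy_score(truth_labels, rater_labels):.2%}")  # 75.00%
print(f"F1 score: {f1_score(truth_labels, rater_labels):.2f}")        # 0.80

# Ground-truth vs. labeler quality ratings on a 1-10 scale.
truth_ratings = [7, 4, 9, 6, 8]
rater_ratings = [6, 5, 9, 7, 8]
print(f"MAE: {mean_absolute_error(truth_ratings, rater_ratings):.2f}")  # 0.60
```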
Metrics for various annotations
The choice of metric for evaluating data quality depends heavily on the type of annotation task: different annotation types require different evaluation approaches to accurately assess precision and accuracy, so the metrics above should be matched to the annotation type at hand.
Managing quality
The adage "what gets measured gets managed" is particularly relevant in AI data quality. With real-time quality measurements at our disposal, the next challenge becomes how to effectively improve and manage quality. Below are some common scenarios in production that our teams have to intervene and correct.
Scenarios and strategies
Beyond precision and accuracy: Operational efficiency and trust
While precision and accuracy are crucial for data quality, it's essential to consider other factors that influence the overall effectiveness and efficiency of AI data labeling processes. These additional indicators provide valuable insights into resource allocation, workflow optimization, and labeler reliability.
Operational efficiency indicators
Achieving high precision and accuracy at any cost is often undesirable, as AI teams typically operate within specific data budgets to achieve the expected value in terms of new model capability. Operational efficiency metrics, such as time spent per labeling task, help balance quality with cost.
Notes: Outliers on either side of the mean for these metrics often reveal important insights. For example, consistently fast labelers with high accuracy might be candidates for more complex tasks, while those with long labeling times might need additional training or support.
Alignerr trust score
The Alignerr trust score is a sophisticated metric designed to evaluate and quantify the reliability of individual expert AI trainers (each known as an Alignerr). This multidimensional score incorporates various factors such as historical accuracy, consistency, task completion rate, and the ability to handle complex assignments.
In practice, the Alignerr trust score plays a crucial role in optimizing workflow and maintaining high data quality standards. High-trust AI trainers may be prioritized for more critical or complex tasks, while those with lower scores might receive additional training opportunities or be assigned to tasks with higher levels of oversight. This selective task distribution helps to improve overall data quality without necessarily increasing review overhead. Moreover, the trust score serves as a valuable feedback mechanism, providing AI trainers with insights into their performance while encouraging continuous improvement.
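Purely as an illustration of a multidimensional composite (the actual Alignerr trust score formula is not described here), a weighted blend of the factors above might be sketched as follows; the weights and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TrainerStats:
    historical_accuracy: float   # 0-1, agreement with benchmark tasks
    consistency: float           # 0-1, e.g. derived from consensus metrics
    completion_rate: float       # 0-1, fraction of assigned tasks completed
    complex_task_success: float  # 0-1, success rate on harder assignments

def trust_score(stats: TrainerStats) -> float:
    """Illustrative weighted composite in [0, 1]; the weights are hypothetical."""
    weights = {
        "historical_accuracy": 0.4,
        "consistency": 0.25,
        "completion_rate": 0.15,
        "complex_task_success": 0.2,
    }
    return sum(getattr(stats, field) * w for field, w in weights.items())

print(trust_score(TrainerStats(0.92, 0.88, 0.97, 0.81)))  # ~0.90
```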
Operational aspects of AI data quality management
Ensuring high-quality data for AI training and evaluation goes beyond metrics and measurements. It requires robust operational processes, innovative technologies, and a skilled workforce. This next section explores key operational aspects that contribute to maintaining and improving data quality.
Multi-step review and rework
To enhance data quality, we employ a multi-step review and rework process that draws inspiration from proven scientific methods. One such approach is the double-entry method, commonly used in data entry to reduce errors (a minimal sketch of the comparison step follows the list):
1) Initial labeling: Two independent labelers perform the same task without knowledge of each other's work.
2) Comparison: The results are automatically compared to identify discrepancies.
3) Expert review: Where discrepancies exist, an expert reviewer examines both entries and makes a final determination.
4) Rework: If necessary, the task is sent back for rework with specific feedback.
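Here is a minimal sketch of the comparison-and-routing step (step 2 above); the field names and escalation logic are illustrative, not the actual Labelbox review workflow.

```python
def compare_double_entries(entry_a: dict, entry_b: dict) -> list[str]:
    """Return the fields on which two independent labelers disagree."""
    return [field for field in entry_a if entry_a[field] != entry_b.get(field)]

# Two independent labelers annotating the same model response.
labeler_1 = {"helpfulness": 4, "harmless": True, "preferred_response": "A"}
labeler_2 = {"helpfulness": 4, "harmless": True, "preferred_response": "B"}

discrepancies = compare_double_entries(labeler_1, labeler_2)
if discrepancies:
    # In production this would route the task to an expert reviewer (step 3).
    print(f"Escalate to expert review, disagreement on: {discrepancies}")
else:
    print("Entries match; accept label.")
```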
This process significantly reduces the likelihood of errors and biases, as it requires multiple independent verifications before data is accepted. Additionally, we implement other quality control measures such as:
- Random spot checks by senior AI trainers
- Periodic recalibration sessions to ensure consistency across the team
- Automated checks for logical inconsistencies or outliers
By incorporating these scientific approaches into our workflow, we can consistently produce high-quality data that meets the rigorous standards required for AI training and evaluation.
LLM as a judge
Leveraging the power of Large Language Models (LLMs) can greatly enhance our quality control processes, particularly for text-based tasks. We use fine-tuned LLMs to assess the similarity between annotator-provided explanations and ground truth responses. This approach offers several advantages:
1) Scalability: LLMs can process large volumes of text quickly, allowing for comprehensive quality checks.
2) Consistency: Unlike human reviewers, LLMs apply the same criteria consistently across all evaluations.
3) Semantic understanding: Fine-tuned LLMs can capture nuanced similarities in meaning, even when the exact wording differs.
Our process for using LLMs as judges involves:
1) Fine-tuning an LLM on a dataset of high-quality, expert-verified responses for specific task types.
2) Using the fine-tuned model to generate similarity scores between annotator responses and ground truth.
3) Flagging responses that fall below a certain similarity threshold for human review.
This LLM-assisted approach allows us to efficiently identify potential quality issues while reducing the workload on human reviewers, who can focus their attention on the most challenging cases.
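As a simplified stand-in for the fine-tuned LLM judge described above, the sketch below scores semantic similarity with an off-the-shelf sentence-embedding model and flags low-similarity responses for human review; the sentence-transformers library, the model choice, and the threshold are assumptions for illustration only.

```python
from sentence_transformers import SentenceTransformer, util  # assumed installed

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a fine-tuned judge
SIMILARITY_THRESHOLD = 0.75  # hypothetical cutoff for routing to human review

def flag_for_review(annotator_explanation: str, ground_truth_explanation: str) -> bool:
    """Return True if the annotator's explanation drifts too far from ground truth."""
    embeddings = model.encode(
        [annotator_explanation, ground_truth_explanation], convert_to_tensor=True
    )
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity < SIMILARITY_THRESHOLD

print(flag_for_review(
    "The response ignores the user's constraint about budget.",
    "The answer fails to respect the stated budget limit.",
))
```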
Setting the standard for expert AI trainers
While software and AI technologies are critical for data quality management, the most significant quality gains often come from highly skilled AI trainers (aka human raters). These experts bring nuanced understanding and critical thinking skills that are essential for handling complex generative AI data.
The impact of expert AI trainers is particularly evident in areas requiring deep domain expertise, such as STEM fields, advanced coding, and teaching AI systems complex skills like planning and reasoning. When training AI models to perform complex mathematical proofs, optimize code, or develop advanced problem-solving strategies, human expertise often surpasses current AI capabilities.
With the Alignerr network, Labelbox maintains exceptionally high standards in our recruitment process, with an acceptance rate of just 3%. Our rigorous selection process includes:
1) Initial screening: Looking for advanced degrees in STEM fields, extensive coding experience, or backgrounds in cognitive science and AI development.
2) Skills assessment: Evaluating critical thinking, pattern recognition, and problem-solving abilities crucial for high-quality AI data annotation in complex domains.
3) Expertise tests: Simulating real-world scenarios in STEM problem-solving, code optimization, or designing complex reasoning tasks for AI.
4) Technical interviews: Assessing depth of knowledge in specialty areas and understanding of AI and machine learning concepts.
By investing in top-tier human expertise across crucial domains, we ensure our data quality exceeds what can be achieved through software and AI alone, providing our clients with a competitive edge in developing next-generation AI capabilities.
Curating mission-specific expert AI trainer teams
At Labelbox, we take a unique approach to team formation for each customer project. Rather than assigning available annotators ad hoc, we create and curate dedicated teams specifically tailored to each mission. This approach ensures deep familiarity with the project and fosters a sense of shared purpose among team members.
Key aspects of our team curation process include:
1) Dedicated team formation: We assemble a team of experts whose skills and experience align closely with the project's specific requirements.
2) Minimum hour commitment: Team members are required to dedicate a minimum number of hours to the project. This ensures they gain the necessary context and develop proficiency in the specific task domain.
3) Context building: Through intensive onboarding and ongoing training, we help the team build a comprehensive understanding of the customer's goals, challenges, and quality expectations.
4) Mission-specific motivation: We cultivate a shared sense of purpose within the team, aligning their efforts with the project's broader objectives and potential impact.
5) Continuous improvement: Regular feedback sessions and performance reviews help the team refine their approach and continuously enhance their skills.
This dedicated team model not only leads to higher quality outputs but also results in increased efficiency over time as the team develops deep expertise in the customer's specific domain. By creating a focused, motivated team with a strong grasp of the project's context, we ensure that each customer receives the highest level of service and the best possible results for their AI initiatives.
Bringing it all together
In this post, we explored some of the key ways that Labelbox helps customers capitalize on an AI data factory and our approach to delivering high-quality data at scale. This strategy includes:
1) Precision and accuracy metrics tailored to various annotation types
2) Adaptive quality management strategies for different scenarios
3) Operational efficiency indicators to balance quality with cost-effectiveness
4) The Alignerr trust score for optimizing workflow and maintaining high standards
5) Multi-step review processes and LLM-assisted quality control
6) A rigorous selection process for expert AI trainers
7) Curated, mission-specific teams dedicated to each customer's unique needs
What sets Labelbox apart is our scientific approach to data quality and our ability to operate an AI data factory at scale. Just last month, over 50 million annotations were created with over 200,000 human hours. By continuously monitoring and analyzing data quality as it's produced, we enable immediate interventions and adjustments. This real-time approach allows AI teams to:
- Quickly identify and address quality issues before they compound
- Provide instant feedback to AI trainers, fostering rapid improvement
- Adapt to changing project requirements on the fly
- Ensure consistent, high-quality outputs throughout the entire data production process
Our confidence in this system is so strong that we offer something truly unique in the industry: a data quality guarantee. Customers only pay for data that meets the agreed-upon Service Level Agreement (SLA) for quality, throughput, and efficiency. This guarantee underscores our commitment to delivering not just data, but value and results for our customers.
Data as the bedrock for AGI and beyond
As the AI landscape evolves and frontier labs continue their rapid pace of innovation to drive us closer towards AGI, Labelbox remains committed to providing the highest quality data and most efficient services, ensuring that data quality never becomes the limiting factor in the pursuit of transformative AI technologies. We’re continuing to invest heavily in cutting-edge data science and alignment techniques, pushing the boundaries of what's possible in data quality and service performance.
We hope you found this post helpful for gaining a deeper understanding of how a data factory helps ensure data quality and accelerate the AI development process. If you're interested in learning more, feel free to sign up for a free Labelbox account to try out the platform, or reach out to our team.
References
We’ve compiled a list of articles and research papers that have influenced our approach.
3) Survey of agreement between raters for nominal data using Krippendorff's alpha
4) Assessing Data Quality of Annotations with Krippendorff Alpha For Applications in Computer Vision