
Using Labelbox to improve data quality via AutoQA & advanced labeler review

Working with leading generative AI teams that are building frontier models and developing task-specific generative AI products, we've seen firsthand how much data quality and robust QA processes matter for delivering performant models. Access to human-evaluated data is what sets companies apart in their AI offerings. In this solution accelerator, we will walk through the different workflows and demonstrate how Labelbox can expedite the quality review process for creating better data for generative AI use cases.

The generative AI use case we'll focus on in this guide is multi-turn conversations, similar to what you'd encounter when interacting with an LLM. As an introduction, Labelbox offers multimodal chat in two main flavors:

1. Live, online evaluation

In this scenario, Labelbox provides an experience similar to a chatbot arena, where multiple models, or multiple versions of the same model, are compared against each other. A labeler then ranks the responses to determine which one is better. For example, you can input a prompt and receive three different responses from models such as Gemini, GPT, or Claude, or from different versions of the same model. Importantly, labelers don't know which response corresponds to which model, which prevents bias.

Example:

  • Input a prompt.
  • Receive three responses from different model versions.
  • The labeler then ranks the responses based on quality (defined according to your business-specific needs); a small blinding-and-ranking sketch follows this list.
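
To make the blinding step concrete, here is a minimal, illustrative Python sketch (the model names and responses are placeholders, not Labelbox APIs) showing how responses can be shuffled behind neutral identifiers so that a ranking is only mapped back to model identities after it is submitted. Inside the Labelbox platform this blinding is handled for you; the sketch just clarifies the mechanics.

```python
import random

# Hypothetical responses collected from three different models for one prompt.
# Model names and response texts are illustrative placeholders.
responses = {
    "model_a": "Paris is the capital of France.",
    "model_b": "The capital of France is Paris, located on the Seine.",
    "model_c": "France's capital city is Paris.",
}

# Blind the labeler: shuffle and present responses under neutral identifiers
# so rankings cannot be biased by knowledge of which model produced what.
items = list(responses.items())
random.shuffle(items)
blinded = {f"Response {i + 1}": text for i, (_, text) in enumerate(items)}
identity_map = {f"Response {i + 1}": model for i, (model, _) in enumerate(items)}

for label, text in blinded.items():
    print(f"{label}: {text}")

# The labeler's ranking (best to worst) is recorded against the neutral labels,
# then resolved back to model identities only after the ranking is submitted.
labeler_ranking = ["Response 2", "Response 1", "Response 3"]  # example input
resolved = [identity_map[r] for r in labeler_ranking]
print("Ranking by model:", resolved)
```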

2. Offline multimodal chat

This option allows you to upload existing multi-turn conversations to Labelbox for evaluation across dimensions such as accuracy, relevance, and tone; a minimal data sketch follows the example list below.

Example:

  • Upload a multi-turn conversation between a human and a chatbot.
  • Evaluate and label the conversation across different axes, such as relevance, factuality, and fluency.
  • This can be done on a per-message level or for the entire conversation.
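
To illustrate the offline option, here is a simplified sketch of a multi-turn conversation prepared for upload. The field names are placeholders; in practice you would follow Labelbox's documented conversational text data format.

```python
import json

# Illustrative structure for a multi-turn conversation to be evaluated offline.
# This is a simplified sketch; the exact import schema should follow Labelbox's
# documented conversational text data format.
conversation = {
    "conversation_id": "conv-001",
    "messages": [
        {"role": "human", "text": "How do I reset my password?"},
        {"role": "assistant", "text": "Go to Settings > Security and click 'Reset password'."},
        {"role": "human", "text": "I don't see a Security tab."},
        {"role": "assistant", "text": "On mobile, it is under Settings > Account > Security."},
    ],
}

# Write to JSON so it can be uploaded as a data row for per-message or
# whole-conversation evaluation (relevance, factuality, fluency, etc.).
with open("conversation_001.json", "w") as f:
    json.dump(conversation, f, indent=2)
```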

By leveraging these workflows, teams can effectively utilize Labelbox for various chatbot-based use cases.

Prompt and response generation walkthrough

As a first step, let’s walk through a common workflow in Labelbox involving prompt and response generation, which is crucial for training a model. This process involves creating question and answer pairs and offers three main options: workforce-generated prompts and responses, guided prompt and response creation, and responding to uploaded prompts. 

For workforce-generated prompts, labelers input questions and corresponding responses, selecting the appropriate category for each pair.

As shown above, the guided prompt and response creation option uses an image, code snippet, video, or text as a basis for labelers to generate relevant question and answer pairs. 

Lastly, if you already have prompts, you can upload them and have labelers provide suitable responses, including within multi-turn conversations. This flexible workflow supports various input types, enabling efficient model training through comprehensive prompt and response generation.

After identifying the data to be labeled and setting up the labeling schema, which includes fields for the prompt, response, and category, you can attach detailed instructions as a PDF. These instructions remain accessible to labelers during the labeling process, and labelers can then start labeling from scratch by generating prompt-response combinations.
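
As a rough sketch of that setup, the snippet below creates a project, defines a schema with prompt, response, and category fields, and attaches PDF instructions using the Labelbox Python SDK. Method names and signatures can vary across SDK versions, and the category options are placeholders, so treat this as an outline rather than a drop-in script.

```python
import labelbox as lb

# Minimal setup sketch with the Labelbox Python SDK; exact method names and
# signatures may differ by SDK version.
client = lb.Client(api_key="YOUR_API_KEY")

# Labeling schema with fields for prompt, response, and category
# (category options are illustrative placeholders).
ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(class_type=lb.Classification.Type.TEXT, name="prompt"),
        lb.Classification(class_type=lb.Classification.Type.TEXT, name="response"),
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="category",
            options=[lb.Option(value="coding"), lb.Option(value="creative_writing")],
        ),
    ]
)

project = client.create_project(
    name="prompt-response-generation", media_type=lb.MediaType.Conversational
)
ontology = client.create_ontology(
    "prompt-response-ontology",
    ontology_builder.asdict(),
    media_type=lb.MediaType.Conversational,
)
project.connect_ontology(ontology)  # older SDK versions use project.setup_editor(ontology)

# Attach detailed PDF instructions that labelers can open while labeling.
project.upsert_instructions("labeling_instructions.pdf")
```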

AI-assisted alignment (AI-assisted labeling)

Labelbox allows you to use a large language model or a custom model for pre-labeling by using our model-assisted labeling option. 

You can select any available model, such as GPT, Gemini, or Claude, or bring in your own custom one. By generating a preview, you receive pre-labels that labelers can then modify and correct rather than creating labels from scratch. While this speeds up the labeling process, it can introduce bias.

If your task requires unbiased labeling, you can require labelers to create labels from scratch without any model assistance. The platform offers flexibility, enabling you to choose between speed and accuracy based on your task requirements.
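
The following is an illustrative sketch of the pre-labeling idea: an LLM drafts a candidate response that a labeler then corrects. It uses the OpenAI Python client with a placeholder model name purely for illustration; inside Labelbox, this step is handled by the model-assisted labeling option rather than your own script.

```python
from openai import OpenAI

# Illustrative sketch of generating a draft response with an LLM so labelers
# can correct it instead of writing from scratch. The model name is a placeholder.
llm = OpenAI(api_key="YOUR_OPENAI_KEY")

prompt = "Explain the difference between a list and a tuple in Python."
draft = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

pre_label = {
    "prompt": prompt,
    "response": draft.choices[0].message.content,  # labelers edit this draft
    "source": "model-assisted",  # flag pre-labels so potential bias can be audited later
}
print(pre_label["response"])
```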

Labelbox AI-assisted alignment (AutoQA aka AI critic)

As a next step, let’s walk through how you can leverage Labelbox’s platform to use an LLM as an AI critic, or judge, to help auto-QA your data during the review process. Labelbox employs an LLM to review prompt and response pairs, providing scores along with feedback and critique on label quality and an explanation of why a label was good or bad. This feedback surfaces insights and scores that help you identify which labels require further review.

Scoring and feedback

In the example shown above, the labeling score for this specific data row is 4.75, accompanied by ideas for improving the reasoning. With this view, you can go through all of your data rows and filter them by score to identify labels that need additional scrutiny. For example, you might filter for scores below 4, or set a range, and move those labels to a custom queue for further review. Creating custom queues for specific score ranges, such as between 3 and 4, allows for a more organized review and QA process.
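
As a simple illustration of score-based routing, the sketch below filters hypothetical data rows by their AutoQA score and assigns them to review queues. The field names, scores, and queue names are placeholders; in Labelbox this routing is configured through filters and custom queues in the UI.

```python
# Illustrative sketch of routing labels to review queues by AutoQA score.
# The data rows and field names are placeholders standing in for the scores
# and feedback that the AI critic attaches to each label.
data_rows = [
    {"id": "dr-001", "autoqa_score": 4.75, "feedback": "Minor reasoning gap."},
    {"id": "dr-002", "autoqa_score": 3.40, "feedback": "Response misses the second question."},
    {"id": "dr-003", "autoqa_score": 2.10, "feedback": "Factual error about API behavior."},
]

def assign_queue(score: float) -> str:
    """Map an AutoQA score to a review queue, mirroring custom-queue setup in the UI."""
    if score < 3.0:
        return "rework"
    if score < 4.0:
        return "expert_review"
    return "done"

for row in data_rows:
    queue = assign_queue(row["autoqa_score"])
    print(f"{row['id']} (score {row['autoqa_score']}): -> {queue} | {row['feedback']}")
```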

Reviewers in the loop can view label instructions, scores, and improvement ideas for each label. This information is crucial for understanding the quality of your labels and making necessary adjustments. You can filter and select labels based on their scores and move them to the appropriate queues for further action. As a best practice, the Labelbox team works closely with customers to define the scoring that best fits their review and evaluation needs.

Furthermore, Labelbox allows you to create custom review workflows tailored to your specific needs. As shown below, you can define different metrics and criteria for scoring, such as BLEU scores, which are common for generative AI use cases involving free-form text.

This flexibility ensures that the review process aligns with all of your project requirements and quality standards.
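
For instance, if BLEU is one of your criteria, a score for a free-form response can be computed offline with NLTK as sketched below; the reference and candidate strings are placeholders, and the metric definitions themselves live in your Labelbox review workflow.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative BLEU computation for a free-form text response against a reference.
reference = "Go to Settings > Security and click Reset password.".split()
candidate = "Open Settings, choose Security, then click Reset password.".split()

score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)
print(f"BLEU: {score:.3f}")
```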

Exporting your data

Once labels have completed the review process, they move to the Done stage and are ready for export from the Labelbox platform. You can export the data by selecting it and triggering a JSON export. This process ensures that you have high-quality data ready for downstream use.
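
A minimal export sketch using the Labelbox Python SDK might look like the following; the exact export method and parameters depend on your SDK version, and the project ID is a placeholder.

```python
import labelbox as lb

# Minimal export sketch; the export method (export / export_v2) and its
# parameters vary by Labelbox SDK version.
client = lb.Client(api_key="YOUR_API_KEY")
project = client.get_project("YOUR_PROJECT_ID")

export_task = project.export_v2(params={"data_row_details": True, "label_details": True})
export_task.wait_till_done()

labels = export_task.result  # list of exported data rows with their labels
print(f"Exported {len(labels)} data rows")
```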

Because Labelbox tracks in-depth metrics around labeling operations (average label time, average review time, etc.), disaggregated by labeling team member, you can easily calculate additional inter-annotator agreement metrics, such as Krippendorff's alpha and Cohen's kappa for ordinal ratings (e.g., Likert scales), beyond the off-the-shelf benchmark and consensus metrics that Labelbox already calculates.
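
As a sketch of how those agreement metrics can be computed from exported ratings, the snippet below uses the krippendorff package and scikit-learn on placeholder Likert-style scores from two labelers; in practice you would pull each labeler's scores per data row out of the export JSON.

```python
import krippendorff  # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# Placeholder Likert-style ratings from two labelers on the same six data rows.
labeler_a = [4, 3, 5, 2, 4, 4]
labeler_b = [4, 2, 5, 3, 4, 5]

# Cohen's kappa for two raters; quadratic weights suit ordinal (Likert) scales.
kappa = cohen_kappa_score(labeler_a, labeler_b, weights="quadratic")

# Krippendorff's alpha generalizes to more raters and tolerates missing data.
alpha = krippendorff.alpha(
    reliability_data=[labeler_a, labeler_b],
    level_of_measurement="ordinal",
)
print(f"Cohen's kappa: {kappa:.3f}, Krippendorff's alpha: {alpha:.3f}")
```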

Providing feedback to labelers and ensuring quality

Customers and reviewers can also collaborate easily and provide real-time feedback to labelers by raising issues or leaving instructions, and labelers are notified to make the necessary changes. For example, you can instruct your labelers to “incorporate the feedback from the LLM to make this response better.”

The review and feedback loop helps maintain high-quality data and can be tailored to include as many review steps as needed to meet your quality control standards, allowing you to choose how much data is auto-QA’d, how much Labelbox reviews internally, and how much you review yourself. Labelbox provides full control and typically works with customers to determine how many review steps are needed and which scores to evaluate against.

This ongoing process ensures a consistent and efficient labeling engine: as new data comes in, it gets quickly labeled; if it meets the criteria, it moves to the appropriate queue, where it is reviewed by subject matter experts; and finally, a batch of high-quality labels is delivered to you.

Get started today

We hope this walkthrough gives you a better understanding of how Labelbox makes it easier than ever to iterate quickly with real-time, granular visibility into labels and to implement AutoQA workflows for data quality. In addition, AI labs and generative AI companies can benefit from tapping into diverse pools of expertise to improve the underlying data and model performance with Labelbox.

Ready to improve data quality for your generative AI initiatives? Contact us to learn more; we’d love to hear from you.