Esther Na•December 17, 2024
Advance LLM reasoning with new fact-checking and prompt rating tools
Large language models (LLMs) have made remarkable strides in recent years, but significant opportunities remain to improve their reasoning and accuracy. Frontier models are expected to think critically, explain their logic, and produce reliable, accurate results.
To address these challenges, we are thrilled to announce two new features that help AI teams advance frontier and task-specific models. We have expanded our multi-step reasoning tool to make it easy for raters to review the accuracy of each part of a complex response. In addition, a new prompt rating feature lets you check prompts for compliance with specific guidelines, so raters spend their time on valid responses and can report poor prompts.
Read on to learn more about how these features can help improve your model’s critical thinking and generate more accurate responses. You can also see them in action through the interactive demos below.
Simplify the evaluation of complex prompts and responses
Last month, we announced the release of a powerful new annotation type in our multimodal chat (MMC) solution: multi-step reasoning. Multi-step reasoning improves LLM training by automating the breakdown of complex responses into smaller, manageable steps. Individual evaluators can then score and, when necessary, rewrite a specific step, leading to improved model understanding and more accurate outputs.
Our comprehensive Labelbox platform now includes these two key features:
- Fact-checking tasks: The Labelbox platform automatically splits complex reasoning responses into smaller, manageable pieces of information so labelers can assess the accuracy of each one. Each piece of information can be individually rated, with options to include justifications and corrections for disputed claims.
- Prompt rating tasks: Issues with the prompt itself can now be instantly flagged against pre-defined criteria, for example because the prompt is unratable, false, offensive, or controversial, or because it is not self-contained. Labelbox's customizable ontology also allows additional criteria to be added. When a prompt is flagged, any required tasks associated with it become optional, giving labelers the ability to skip bad prompts and focus on high-value entries (see the sketch after this list).
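To make the output of these two tasks concrete, here is a minimal Python sketch of the kind of per-step ratings and prompt flags they produce. The class, field, and option names are illustrative assumptions for this post, not Labelbox's actual export schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative option sets; the real labels come from your project's ontology.
FACT_CHECK_RATINGS = {
    "Accurate", "Inaccurate", "Disputed",
    "Unsupported", "Can't confidently assess", "No factual information",
}
PROMPT_FLAGS = {"Unratable", "False", "Offensive", "Controversial", "Not self-contained"}

@dataclass
class FactCheckedStep:
    """One automatically split piece of a model response, rated by a labeler."""
    text: str
    rating: str                           # one of FACT_CHECK_RATINGS
    justification: Optional[str] = None   # asked for on Accurate / Inaccurate / Disputed
    correction: Optional[str] = None      # optional rewrite for disputed or inaccurate claims

@dataclass
class RatedConversation:
    """A prompt plus its per-step fact-check results and any prompt-level flags."""
    prompt: str
    prompt_flags: list[str] = field(default_factory=list)  # subset of PROMPT_FLAGS
    steps: list[FactCheckedStep] = field(default_factory=list)

    @property
    def skippable(self) -> bool:
        # If the prompt itself was flagged, response tasks become optional.
        return bool(self.prompt_flags)
```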
These new features are the latest example of the team's commitment to producing the highest-quality data and model evaluations in the industry. By adding more powerful tools to our set of quality control features, we can help you achieve greater precision in your data and develop more accurate AI models.
How do fact-checking and prompt rating work in Labelbox?
With the addition of these new features in Labelbox’s multimodal chat editor, you can now easily determine the veracity of model responses as well as identify and flag any issues with a given prompt.
Here’s how to use the fact-checking feature in Labelbox’s platform:
- Create a new project using the Multimodal chat task type and click to edit or create the ontology.
- Go to “Message step tasks” and select the radio button next to “Factual.” Give the task a name and review the options. Click Save when you are done to complete the ontology configuration. (A rough SDK sketch of an equivalent classification follows these steps.)
- After choosing the model(s) to evaluate and clicking Start labeling, enter a prompt to generate a model response (or multiple responses if evaluating more than one model).
- Once the response is generated, click on “Fact check statements” on the left-hand side of the screen if it is not already selected. The multimodal chat editor will automatically split the response into individual steps and allow you to classify each one as “Accurate”, “Inaccurate”, “Disputed”, “Unsupported”, “Can’t confidently assess”, or “No factual information”.
- Evaluate and rate each step individually. If you select “Accurate”, “Inaccurate”, or “Disputed”, you will be asked to provide additional justification.
- Iterate through this process until all steps have been fact checked.
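For teams that manage ontologies programmatically, the sketch below shows how a radio classification carrying the same six fact-check options might be declared with the Labelbox Python SDK. Treat it as a hedged illustration rather than the official recipe: the dedicated “Factual” message step task is configured in the multimodal chat ontology editor as described above, and the ontology name and media type settings here are assumptions to verify against your own project setup.

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_API_KEY")  # assumes a valid Labelbox API key

# A radio classification mirroring the fact-check options shown in the UI.
fact_check = lb.Classification(
    class_type=lb.Classification.Type.RADIO,
    name="fact_check_statement",
    options=[
        lb.Option(value="accurate"),
        lb.Option(value="inaccurate"),
        lb.Option(value="disputed"),
        lb.Option(value="unsupported"),
        lb.Option(value="cant_confidently_assess"),
        lb.Option(value="no_factual_information"),
    ],
)

builder = lb.OntologyBuilder(classifications=[fact_check])
ontology = client.create_ontology(
    "mmc-fact-checking-demo",                 # hypothetical ontology name
    builder.asdict(),
    media_type=lb.MediaType.Conversational,   # chat-style media type; confirm for your MMC project
)
```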
The new fact-checking feature provides a straightforward and effective process to generate high-quality and accurate responses.
See Labelbox’s new fact-checking feature in action here.
Here’s how to use the prompt rating feature in Labelbox’s platform:
- Create a new project using the Multimodal chat task type, and click to create or edit the ontology.
- Within the ontology configuration screen, add a “Prompt rating task” to the project. Enter a name for the task and then review and edit the options. Options can be configured using checklists, radio buttons, or free text fields. If any of these pre-defined criteria are selected during labeling, then the entire conversation will be marked unratable and can be skipped.
- After choosing the model(s) to evaluate and clicking Start labeling, enter a prompt to generate the model response(s).
- Once the response is generated, you can flag any issues with the prompt. If any of the pre-defined prompt issue options are selected, the red asterisk will be removed from the response task and the labeler will have the option to skip labeling for that response (this rule is sketched below).
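Conceptually, prompt rating boils down to a simple rule: once any prompt issue is flagged, the response tasks stop being required. Here is a minimal sketch of that rule; the function and field names are hypothetical and not Labelbox APIs.

```python
# Hypothetical helper: decide which tasks remain required for a conversation
# after prompt rating. Not a Labelbox API; just an illustration of the rule.
def required_tasks(prompt_flags: list[str], response_tasks: list[str]) -> list[str]:
    if prompt_flags:
        # The prompt was flagged (e.g. "unratable" or "offensive"): response
        # tasks lose their required marker, so the labeler may skip them.
        return []
    return response_tasks

# Example: a flagged prompt frees the labeler to move on to the next entry.
print(required_tasks(["unratable"], ["fact_check_statements"]))  # -> []
print(required_tasks([], ["fact_check_statements"]))             # -> ['fact_check_statements']
```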
By carefully crafting and evaluating prompts, we can significantly improve the overall quality and relevance of LLM outputs. In addition, we can improve the efficiency and utility of the time spent rating responses.
See Labelbox's new prompt rating feature in action here.
Achieve more advanced reasoning with fact-checking and prompt rating
By ensuring data quality and accuracy with our new quality control mechanisms, Labelbox can generate key datasets to train LLMs on complex reasoning and decision-making. Critical steps toward agentic reasoning supported by Labelbox’s fact-checking and prompt rating features include:
- Directly improve accuracy: Fact-checking and prompt rating enhance LLM data quality by identifying and correcting inaccuracies and ensuring clear prompts.
- Provide valuable human feedback: Both features help bridge the gap between human and machine intelligence by serving as human-in-the-loop processes that provide expert guidance to the model's learning workflows.
- Refine reasoning: By providing tools for justifications and corrections, labelers enable the model to learn from its mistakes, resulting in more accurate and reliable responses (see the sketch after this list).
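As a concrete illustration of the last point, per-step ratings and corrections can be folded back into training data, for example by keeping verified steps as positive supervision and pairing corrected steps with their human rewrites. The sketch below is hypothetical, reuses the illustrative record shapes from earlier in this post, and is not part of the Labelbox platform.

```python
# Hypothetical post-processing of fact-checked steps into fine-tuning material,
# using the RatedConversation / FactCheckedStep shapes sketched above.
def build_training_examples(conversations: list[RatedConversation]) -> list[dict]:
    examples = []
    for convo in conversations:
        if convo.skippable:          # prompt was flagged; don't train on it
            continue
        for step in convo.steps:
            if step.rating == "Accurate":
                # Keep verified reasoning steps as positive supervision.
                examples.append({"prompt": convo.prompt, "completion": step.text})
            elif step.rating in {"Inaccurate", "Disputed"} and step.correction:
                # Pair the flawed step with its human rewrite for revision training.
                examples.append({
                    "prompt": convo.prompt,
                    "rejected": step.text,
                    "chosen": step.correction,
                })
    return examples
```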
The future of AI starts with Labelbox
The addition of fact-checking and prompt rating tools marks a major advancement in training LLMs for complex and agentic reasoning tasks. These quality control features enable granular rating and classification of both prompts and model responses, ensuring the generation of high-quality, accurate training data.
Want to learn more?
- Take a quick, interactive tour of the demos for our fact-checking and prompt rating features
- Learn more about our multi-step reasoning feature and how it helps train LLMs to think more critically.
Contact our team anytime with questions or if you are ready to discuss your LLM training needs and how Labelbox might be able to help.