
Labelbox | June 30, 2025

Agentic AI: What it takes to build AI that acts

Agentic AI is redefining how models engage with the world: not just reacting, but taking initiative. We're seeing a real-time shift toward systems that can navigate ambiguity, make informed decisions, and execute complex tasks with minimal guidance. Reaching that level of autonomy, however, requires more than bigger models; it takes sophisticated training techniques and access to rich, high-quality data.

At Labelbox, we’re partnering with leading AI labs to build the data infrastructure that makes this possible. Agentic systems demand rich, structured feedback on trajectories, programmatically verifiable outcomes, and task-specific evaluation pipelines that scale.

In this post, we highlight three recent projects we completed for leading AI labs, each demonstrating what it takes to build and evaluate agentic behavior in real-world contexts:

  1. Complex tool-use tasks that test a model’s ability to plan and adapt across multi-step API interactions
  2. Structured reasoning challenges inspired by everyday planning scenarios
  3. Multi-turn instruction-following benchmarks that push models to adapt to evolving goals with precision and consistency

These examples show how top AI teams are moving beyond understanding language and toward using it to reason, act, and assist autonomously.

#1 Simulating complex tool use: Teaching agents to plan, adapt, and act

To evaluate the planning and decision-making abilities of agentic models, we helped a team simulate real-world workflows like booking travel or managing retail returns. These workflows required agents to use APIs that manipulate structured databases by executing a series of tool calls based on evolving feedback.

We created five domain environments, each with its own database and five API interfaces. Every interface supported multiple API functions such as querying availability, inserting reservations, or modifying user data. For each interface, we built several multi-step tasks that tested the model’s ability to:

  • Interpret natural language instructions
  • Plan a series of tool calls
  • Adapt to the responses returned by the tools
  • Reach a programmatically verifiable end state
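
To make the setup concrete, here is a minimal sketch of the kind of evaluation loop these tasks imply: the agent plans a tool call, observes the result, adapts, and the harness checks the final state. The names (`run_tool_task`, `verify_end_state`, the `tools` registry) are illustrative assumptions, not the actual interfaces built for this project.

```python
# Minimal sketch of a tool-use evaluation loop (illustrative names only).
# The agent proposes tool calls, observes responses, and the harness
# checks whether the final database state matches the expected end state.

from typing import Any, Callable

def run_tool_task(
    agent: Callable[[str, list[dict]], dict | None],
    tools: dict[str, Callable[..., Any]],
    instruction: str,
    verify_end_state: Callable[[], bool],
    max_steps: int = 10,
) -> bool:
    """Execute an agent against a tool environment and verify the outcome."""
    history: list[dict] = []
    for _ in range(max_steps):
        action = agent(instruction, history)      # plan the next tool call
        if action is None:                        # agent signals it is done
            break
        result = tools[action["name"]](**action["arguments"])
        history.append({"call": action, "result": result})  # adapt on feedback
    return verify_end_state()                     # programmatic end-state check
```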

Example tasks:

  • "Reserve three hotel rooms with flexible check-in, and remove one if the total cost exceeds the corporate travel policy."
  • "Request a return for a damaged item, choose in-store drop-off, and resend the return label if not received in 24 hours."
  • "Book a rental car, then apply a loyalty discount and update the booking if the arrival terminal changes."

Evaluation and labeling specifications:

  • AI trainers defined API interfaces using specific coding languages, including argument types and expected outputs.
  • Each task required trainers to craft natural language prompts, outline the optimal sequence of API calls, and verify the resulting system state.
  • API sequences were designed to reflect conditional logic and to adapt based on prior responses, including retries or strategy adjustments.
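
To illustrate the kind of interface definitions and checks this involved, here is a sketch of one typed tool specification and an end-state verifier. The schema format, field names, and the specific check are assumptions for this example, not the lab's actual specification.

```python
# Sketch of a typed tool definition and a programmatic end-state check.
# Schema format and field names are illustrative, not the actual spec.

from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    arguments: dict[str, type]   # argument name -> expected type
    returns: type                # expected output type

create_reservation = ToolSpec(
    name="create_reservation",
    arguments={"room_id": str, "check_in": str, "flexible": bool},
    returns=dict,
)

def verify_end_state(db: dict) -> bool:
    """Return True if the database reached the required final state."""
    reservations = db.get("reservations", [])
    total = sum(r["price"] for r in reservations)
    return len(reservations) <= 3 and total <= db["policy_limit"]
```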

Creating this benchmark required two tiers of human expertise: one group focused on designing interfaces and defining tool schemas, while the other created realistic task prompts and reference action plans. Each task represented a unique planning challenge and was assigned a difficulty rating based on how effectively it challenged baseline models and leading competitors.

In total, the benchmark included several interfaces, hundreds of API functions, and thousands of tool-use tasks across multiple domains. 

The result was a robust testbed for evaluating how well agentic models can reason through multi-step goals, dynamically adapt their behavior, and achieve structured outcomes using tool interfaces.

#2 Verifying structured reasoning: Real-world planning at scale

Everyday consumer use cases like coordinating group travel or scheduling staff shifts require agents to solve constraint-rich problems. In this project, we incorporated the principles of constraint programming and built a reusable benchmark for evaluating how well models handle real-world planning, coordination, and decision-making tasks.

We began by developing unique domains, each with a core planning challenge. For each domain, we generated structured metadata, natural language prompt variants, and input permutations governed by constraints defined in a structured data format. These inputs required the model to reason through multiple conditions simultaneously, including:

  • Temporal coordination across time zones
  • Filtering past, current, and future events
  • Scheduling within budget, availability, and role-based constraints
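
Below is a sketch of how one scenario's structured inputs might be represented. The schema and field names are illustrative; the actual project used its own data format for defining constraints.

```python
# Illustrative structured input for a scheduling scenario: metadata,
# constraints, and the permutable fields used to generate task variants.

meeting_scenario = {
    "domain": "team_scheduling",
    "entities": {
        "employees": [
            {"name": "Asha",  "timezone": "America/New_York", "vacation": ["2025-07-03"]},
            {"name": "Bruno", "timezone": "Europe/Berlin",    "vacation": []},
            {"name": "Chen",  "timezone": "Asia/Singapore",   "vacation": ["2025-07-04"]},
        ],
    },
    "constraints": [
        {"type": "working_hours", "window": ["09:00", "17:00"]},   # per local time zone
        {"type": "exclude_dates", "source": "vacation"},           # no overlapping vacations
        {"type": "duration_minutes", "value": 60},
    ],
    "permutations": {"date_range": ["2025-07-01", "2025-07-07"]},  # used to generate variants
}
```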

Example tasks:

  • "Help me coordinate a team meeting across five employees in three time zones, with no overlapping vacation days."
  • "Schedule a dinner reservation for a family with one vegan and two gluten-free members, prioritizing the earliest available time before 7pm."
  • "Build a semester plan that satisfies prerequisites, avoids time conflicts, and fits within 15 credit hours."

Evaluation and labeling specifications:

  • Trainers followed a standardized schema for defining input variables and constraints (e.g., roles, availability, time slots).
  • For each scenario, trainers structured inputs into natural language prompts and outlined the corresponding expected outputs.
  • Task complexity was assessed based on the number and interaction of constraint pairs present in the scenario (e.g., shift vs employee availability, skill vs role).

Each task had a clearly defined input structure and verifiable output format, enabling programmatic evaluation of correctness and seamless integration with reinforcement learning with verifiable rewards (RLVR) training methods. We also mapped the number of interdependent constraint pairs to a complexity score (moderate, advanced, expert).
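
A simplified version of that evaluation step is sketched below, reusing the hypothetical scenario format from the earlier sketch. The working-hours rule and the complexity thresholds are assumptions for illustration.

```python
# Sketch of programmatic verification for a proposed meeting time, plus the
# mapping from interdependent constraint pairs to a complexity score.
# Thresholds and field names are illustrative assumptions.

from datetime import datetime
from zoneinfo import ZoneInfo

def verify_schedule(slot_utc: str, scenario: dict) -> bool:
    """Return True if the proposed UTC ISO timestamp satisfies every constraint."""
    start = datetime.fromisoformat(slot_utc)
    for emp in scenario["entities"]["employees"]:
        local = start.astimezone(ZoneInfo(emp["timezone"]))
        if not (9 <= local.hour < 17):                    # working-hours constraint
            return False
        if local.date().isoformat() in emp["vacation"]:   # vacation constraint
            return False
    return True

def complexity_score(num_constraint_pairs: int) -> str:
    """Map the count of interdependent constraint pairs to a complexity tier."""
    if num_constraint_pairs <= 2:
        return "moderate"
    if num_constraint_pairs <= 5:
        return "advanced"
    return "expert"
```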

This benchmark has proven valuable for reinforcement learning workflows and long-horizon task evaluation. It gave our partner a scalable, systematic way to test structured reasoning across diverse planning problems.

#3 Benchmarking multi-turn instruction following: Adapting to evolving goals

Instruction-following is the core capability for any agentic system, but real-world use cases often involve multiple turns, where the user introduces new constraints or modifies earlier instructions. To evaluate this behavior, we developed a benchmark that focused on multi-turn adaptation with verifiable outcomes.

Each task simulated a conversation between a user and the model over multiple turns. The user would issue an initial prompt, then follow up with changes to formatting, structure, or content. The final prompt in each dialogue was intentionally challenging, requiring the model to reconcile all prior instructions to produce a coherent, complete response.

Example task flows:

  • Turn 1: "Summarize this article in 100 words." Turn 2: "Make the summary bullet points instead of a paragraph." Turn 3: "Add a bolded title and remove any statistics."
  • Turn 1: "Rewrite this email to be more formal." Turn 2: "Now make it shorter and include a call to action at the end."
  • Turn 1: "Extract product names from this review." Turn 2: "Only include products with positive sentiment."

Evaluation and labeling specifications:

  • Trainers created synthetic dialogues with evolving instructions, ensuring each turn reflected a realistic instruction change.
  • Each task included both model-generated responses and corrected reference completions.
  • For each turn, trainers categorized the instruction type (e.g., formatting, content, tone) and validated the final output through an automated verification process.

Trainers included both incorrect model responses and corrected versions that fulfilled all instructions. Each final turn was paired with an automated verification tool to programmatically assess instruction adherence.
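
For the article-summary flow above, an adherence checker might apply rules along these lines. The specific checks and regular expressions are assumptions for illustration, not the verification tool used in the project.

```python
# Sketch of automated adherence checks for the final turn of the
# article-summary flow (rules and regexes are illustrative assumptions).

import re

def check_final_response(text: str) -> dict[str, bool]:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "has_bold_title": bool(re.match(r"\*\*.+\*\*", lines[0])) if lines else False,
        "uses_bullets": all(ln.startswith(("-", "*", "•")) for ln in lines[1:]) if len(lines) > 1 else False,
        "no_statistics": not re.search(r"\d+%|\b\d{2,}\b", text),
        "within_word_limit": len(text.split()) <= 100,
    }
```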

With this benchmark, our AI lab partner could quickly identify weak points in instruction tuning pipelines and measure how reliably their models adapt to evolving goals without losing consistency or precision.

Benchmark your agents: Introducing the Agentic Leaderboards

As we continue pushing the boundaries of agentic AI, we’ve also launched a new way to evaluate and compare performance: the Labelbox Agentic search leaderboards.

These leaderboards offer an open, structured benchmark suite designed to measure how well models perform across a wide range of real-world agent tasks, from multi-step search and planning to dynamic instruction-following. We constructed hundreds of challenging questions designed to expose cracks in modern retrieval systems across a full spectrum of knowledge domains, including:

  • STEM questions
  • Recent news & current events
  • Historical & archival information
  • Faulty & adversarial prompts
  • Multi-language context
  • Specialized domain knowledge (e.g., law, medicine, finance, and other expert domains)

In our latest blog post, Benchmarking agentic search, we share how we designed this benchmark to evaluate search-based agents, along with key findings on model performance, verification strategies, and common failure modes.

Whether you're training new agents or comparing fine-tuning strategies, these leaderboards provide a scalable, standardized way to quantify progress and accelerate iteration.

What it takes to build agentic AI

Agentic AI depends on systems that can reason, plan, and take actions based on shifting goals. Building and evaluating these capabilities requires:

  • Complex task design grounded in realistic workflows
  • Rich metadata and verifiable outcomes
  • Human-in-the-loop processes that blend technical and domain-specific expertise

At Labelbox, we provide the flexible infrastructure to support agentic training and evaluation, from structured prompt design to rater management, outcome verification, and more.

As agentic systems become the foundation for AI assistants, copilots, and tool-using agents, this kind of data will become mission-critical. 

Want to learn more about how we are helping leading AI labs build and evaluate agentic AI systems from the ground up? Check out our in-depth guide on designing agent workflows, collecting trajectory data, and building evaluation pipelines that scale with your agents.

Curious how we can help you train, evaluate, or fine-tune your next-gen AI agents? Contact us to learn more.