
How a leading AI lab fuels agentic development with frontier data

Problem

Evaluating agentic models demands more than basic instruction-following; it requires assessing how models plan, reason, and adapt across complex, multi-step tool-use tasks. A leading lab needed a robust method to test their models in scenarios mirroring real-world ambiguity, such as booking intricate travel arrangements or managing nuanced retail returns. Success hinged on the models' ability to accurately interpret natural language, chain correct API calls based on evolving feedback, and achieve programmatically verifiable outcomes.

Solution

Labelbox created multiple domain environments, each with its own database and API interfaces supporting 15-20 API functions. They built a high volume of multi-step tasks for each interface, testing the model's ability to interpret natural language, plan tool calls, adapt to responses, and reach a verifiable end state. A two-tiered human expertise model, with one group of trainers defining API interfaces and tool schemas and another creating realistic task prompts and ground-truth action plans, ensured high-quality data and evaluation. The result was a benchmark of 25 interfaces, over 250 API functions, and more than 1,000 tool-use tasks across different domains.

Result

The benchmark enabled the lab to pressure-test agentic performance across a range of structured planning challenges, revealing when and how models adapt, retry, or change strategy based on tool feedback. Each task was scored for difficulty and grounded in programmatically verifiable outcomes, helping the team surface key gaps in reasoning and execution. With this foundation, the lab gained a clearer picture of how its state-of-the-art agentic models behave in complex, real-world tasks, accelerating both product development and model iteration.


Building the foundation for autonomy

The journey to building truly autonomous agents begins with rigorous testing in environments that mirror the complexity of the real world. For a leading AI lab at the forefront of agentic AI development, this meant simulating intricate, multi-step workflows where agents had to interact with various tools and databases, adapting their actions based on dynamic feedback. Labelbox stepped in to help design and execute this ambitious evaluation framework.


Agentic systems thrive on rich, structured feedback, programmatically verifiable outcomes, and scalable, task-specific evaluation pipelines. Labelbox recently collaborated with a leading AI lab on a project that exemplifies the rigorous demands of building and evaluating agentic behavior in real-world contexts.


Designing for real-world complexity

Our collaboration started by defining the scope: creating realistic simulations of common human-computer interactions, such as booking intricate travel arrangements or managing the nuanced process of retail returns. These workflows were not linear; they demanded that agents use a series of APIs to manipulate structured databases, executing a sequence of tool calls that evolved based on the responses received. The challenge was multifaceted: the agents needed to understand natural language instructions, formulate complex plans, execute API calls accurately, and adapt dynamically to information received from the tools, all while striving for a programmatically verifiable end state.
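
In code, the core of such an evaluation loop might look roughly like the sketch below. The names (`run_episode`, `next_action`, `execute`) are illustrative placeholders under assumed interfaces, not the lab's actual harness.

```python
# Minimal sketch of a tool-use evaluation loop, assuming a model wrapper that
# returns either its next tool call or None once it considers the task complete.
# All names here are illustrative, not the lab's actual harness.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


def run_episode(model, environment, instruction: str, max_steps: int = 20) -> list[dict]:
    """Drive the model through a multi-step task until it stops calling tools."""
    context = [{"role": "user", "content": instruction}]
    trace = []
    for _ in range(max_steps):
        action = model.next_action(context)          # plan the next tool call, or None to finish
        if action is None:
            break
        response = environment.execute(action.name, action.arguments)
        trace.append({"call": action, "response": response})
        # Feed the structured tool response back so the model can adapt its plan.
        context.append({"role": "tool", "name": action.name, "content": response})
    return trace
```

The trace of tool calls and responses is what makes the final outcome programmatically verifiable, rather than judged by inspection.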


To achieve this, we meticulously constructed multiple distinct domain environments. Each environment was a self-contained ecosystem, complete with its own database and uniquely designed API interfaces. These interfaces were robust, supporting 15-20 functions each and covering actions like querying availability, inserting reservations, updating user data, and modifying transaction details. The sheer volume and variety of these functions ensured a comprehensive test of an agent's capabilities.
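
As a simplified illustration of what one such interface could look like, the hypothetical travel-domain excerpt below defines a few typed functions over a shared database. The function names, schemas, and data layout are assumptions for the sake of the example, not the benchmark's actual definitions.

```python
# Hypothetical excerpt of a travel-domain interface: each function operates on a
# shared in-memory database and returns a structured, verifiable result.
from datetime import date


def query_availability(db: dict, hotel_id: str, check_in: date, check_out: date) -> list[dict]:
    """Return room offers at the given hotel that cover the requested date range."""
    rooms = db["hotels"][hotel_id]["rooms"]
    return [r for r in rooms if r["available_from"] <= check_in and r["available_to"] >= check_out]


def insert_reservation(db: dict, guest_id: str, room_id: str, check_in: date, check_out: date) -> dict:
    """Create a reservation record and return it, including a generated id."""
    reservation = {
        "id": f"res-{len(db['reservations']) + 1}",
        "guest_id": guest_id,
        "room_id": room_id,
        "check_in": check_in.isoformat(),
        "check_out": check_out.isoformat(),
        "status": "confirmed",
    }
    db["reservations"].append(reservation)
    return reservation


def update_user_data(db: dict, guest_id: str, **changes) -> dict:
    """Apply partial updates to a guest profile and return the updated record."""
    db["guests"][guest_id].update(changes)
    return db["guests"][guest_id]
```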


For every single interface, we developed a high volume of multi-step tasks. These tasks were crafted to rigorously test the agent's ability to:

  • Interpret natural language instructions with precision: Agents had to parse complex, often ambiguous human requests into actionable steps.

  • Plan a series of tool calls effectively: This involved understanding dependencies between API calls and orchestrating them in the correct logical sequence.

  • Adapt to the responses returned by the tools dynamically: Agents needed to handle success, failure, and unexpected data from API calls, adjusting their subsequent actions accordingly.

  • Reach a programmatically verifiable end state: The ultimate goal was for the agent to achieve a desired outcome that could be objectively confirmed by the system.


To illustrate, consider a few of the example tasks:

  • "Help me book a hotel for three people arriving on different dates, then cancel one night if the price exceeds $300." This task tests sequential planning, conditional logic, and the ability to modify prior decisions based on new information (price checks).

  • "Return an item purchased online, update the refund method to store credit, and send a confirmation email." This simulates a customer service scenario requiring multiple system updates and external communication.

  • "Schedule a car rental, then adjust pickup time based on flight delay info." This highlights real-time adaptation and integration of external, time-sensitive data.


Scaling human expertise

Our evaluation and labeling specifications were the backbone of this project, ensuring data integrity and model robustness. Dedicated teams of AI trainers with deep programming expertise defined these API interfaces in languages like Python, specifying argument types and expected outputs with precision. For each task, trainers meticulously crafted natural language prompts that represented realistic user queries. They also enumerated the ideal API call sequence, the "ground truth" action plan, and validated the resulting database state to ensure the agent's actions achieved the intended outcome.
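
That final validation can be pictured as a simple end-state check like the sketch below, which assumes each task carries trainer-specified expected records. The helper and its data layout are illustrative only.

```python
# Illustrative end-state check: compare the database after the agent's run against
# the trainer-specified expected records, table by table and field by field.
def verify_end_state(db: dict, expected: dict) -> bool:
    """Return True if every expected record matches the post-run database."""
    for table, expected_rows in expected.items():
        actual_rows = {row["id"]: row for row in db.get(table, [])}
        for row in expected_rows:
            actual = actual_rows.get(row["id"])
            if actual is None:
                return False
            # Only the fields the trainers chose to pin down need to match exactly.
            if any(actual.get(key) != value for key, value in row.items()):
                return False
    return True
```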


A key challenge, and a critical component of the benchmark, was requiring API sequences to reflect conditional logic. This meant agents had to adapt based on varying responses, including branching logic or alternative actions based on the tool's outputs. The creation of this comprehensive benchmark was a formidable effort, necessitating two distinct tiers of human expertise. One highly specialized group of trainers focused on the architectural foundation: building the interfaces, meticulously defining tool schemas, and ensuring the technical accuracy of each API function. Concurrently, another group, equally expert but with a focus on real-world applicability, invented the realistic task prompts and mapped out the ground truth action plans. Every task represented a unique planning challenge, and its difficulty was carefully labeled based on whether it stumped the lab's base model or other leading competing models.
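
One illustrative way to encode that branching logic in a ground-truth plan is sketched below; the structure and condition strings are hypothetical, meant only to show how the correct continuation can hinge on what a tool returns.

```python
# Hypothetical branching step: rather than a fixed call sequence, the ground-truth
# plan lists alternative continuations keyed on the tool's response.
branching_step = {
    "tool": "query_availability",
    "args": {"hotel_id": "H-101", "check_in": "2025-06-01", "check_out": "2025-06-02"},
    "branches": [
        {
            "when": "len(result) > 0",   # rooms available: book one of them
            "then": [{"tool": "insert_reservation", "args": {"guest_id": "G-1", "room_id": "<from result>"}}],
        },
        {
            "when": "len(result) == 0",  # sold out: fall back to checking another hotel
            "then": [{"tool": "query_availability", "args": {"hotel_id": "H-202", "check_in": "2025-06-01", "check_out": "2025-06-02"}}],
        },
    ],
}
```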


Delivering the benchmark

The technical and business outcome of this delivery was multifold: 25 distinct API interfaces, encompassing more than 250 meticulously defined API functions, and over 1,000 complex tool-use tasks spanning multiple critical domains. Each task was scored for difficulty and verified through structured outputs, enabling the AI lab to evaluate how well models could reason, adapt, and succeed across complex planning scenarios. This robust testbed provided the lab with a powerful and objective mechanism for evaluating how well their agentic models could reason through multi-step goals, dynamically adapt their behavior in the face of uncertainty, and consistently achieve structured outcomes by effectively leveraging tool interfaces. The partnership enabled Labelbox to help our customer advance next-generation AI by providing the essential, high-quality frontier data and evaluation frameworks needed to refine and validate complex agentic systems.