Labelbox • November 19, 2025
Bridging insight and innovation: Introducing Labelbox Applied Research

Today, we’re launching Labelbox Applied Research, along with its three flagship pillars:
- Labelbox Evals: A unified evaluation framework that helps researchers deeply understand system behavior across reasoning, robustness, and alignment dimensions.
- Labelbox Agents: A foundational suite that accelerates the development of agents that are reliable, interpretable, and built with modularity in mind.
- Labelbox Robotics (LBRx): Our cutting-edge robotics division that delivers high-quality, diverse training data essential for teaching robots complex manipulation tasks.
Applied Research exists to address a fundamental gap in modern AI. We can build increasingly powerful models, but we lack equally powerful ways to truly understand, measure, and improve how they behave in the real world.
Together, the three pillars support a unified mission: to build the frameworks, measurements, and infrastructure necessary to understand and improve AI systems not just in theory, but in the environments where they will actually matter.

Evaluations
AI research progresses only as fast as its ability to measure progress, yet traditional benchmarks report scores without revealing the reasons behind model behavior.
Labelbox Evaluations (or Evals for short) define how we measure model quality on real, economically meaningful tasks. Our focus is to build post-training evaluation methods, benchmarks, and leaderboards that stress-test reasoning, robustness, safety, and cost-quality trade-offs across domains and modalities, reflecting real-world complexity and impact. Our Evals team moves beyond saturated, score-driven benchmarks and toward analyses that reveal why a model responds the way it does, how it plans over long horizons, and how its reasoning holds up under real economic and operational constraints.
A core part of this work is our private benchmark program, which captures true model performance on tasks that reflect real operational difficulty rather than saturated public datasets. These private benchmarks surface real-world loss patterns, expose blind spots that traditional benchmarks miss, and guide targeted data collections that enable genuine hill climbing and measurable improvement over time. Our infrastructure plugs into training loops, CI, and production monitoring so model changes can be treated as controlled experiments that tie back to critical business metrics like user adoption and retention.
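The "model changes as controlled experiments" idea above can be sketched in a few lines. This is an illustrative harness, not Labelbox's actual infrastructure: the task format, `check` callables, and pass-rate comparison are all assumptions chosen to show the shape of the workflow.

```python
import statistics

def evaluate(model_fn, benchmark):
    """Score a model on each benchmark task; returns per-task pass/fail."""
    return [task["check"](model_fn(task["prompt"])) for task in benchmark]

def compare(baseline_fn, candidate_fn, benchmark):
    """Treat a model change as a controlled experiment on a fixed task set."""
    base = evaluate(baseline_fn, benchmark)
    cand = evaluate(candidate_fn, benchmark)
    # Tasks the baseline passed but the candidate newly fails: regressions
    # that a single aggregate score would hide.
    regressions = [t["prompt"] for t, b, c in zip(benchmark, base, cand)
                   if b and not c]
    return {
        "baseline_pass_rate": statistics.mean(base),
        "candidate_pass_rate": statistics.mean(cand),
        "regressions": regressions,
    }

# Tiny illustrative benchmark: exact-match arithmetic tasks.
benchmark = [
    {"prompt": "2+2", "check": lambda out: out == "4"},
    {"prompt": "3*3", "check": lambda out: out == "9"},
]
report = compare(lambda p: "4", lambda p: str(eval(p)), benchmark)
```

Wired into CI, a report like this can gate a model change the same way a failing unit test gates a code change.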
We also partner with researchers at leading labs to co-design evaluation suites that span both public frontier-model benchmarks and private, domain-specific tasks, establishing a shared, quantitative standard for intelligence.
What Evals brings
- Private benchmarks & hill climbing: Real-world, domain-specific evaluations that uncover true performance, reveal real-world loss patterns, and power targeted data collections for continuous improvement
- Alignment & safety: Tracking model robustness, behavior under adversarial inputs, and fairness
- Insights: Diagnostic reports and visualizations for model error analysis
- Stumps: Tasks at the frontier of difficulty, designed to probe model limits and expose gaps in reasoning, robustness, and generalization
Labelbox Evals shifts the focus from leaderboard wins to true understanding. Transparent, human-centric evaluation helps researchers know what models truly can and cannot do.

Agents
Modern AI systems are evolving from passive predictors to active agents. Our team’s work focuses on LLM-based agents that operate within real software environments, interacting, reasoning, and acting to achieve complex goals.
This approach brings together customer-facing researchers and applied AI engineers to design and evaluate autonomous agents across diverse environments. We focus on three key areas: defining reward signals and feedback objectives that align with business value and safety, benchmarking agent performance on realistic workflows, and building the tooling needed to evaluate and govern models as agents.
Through this work, we aim to enable agents to dynamically interact with external systems over long horizons, running data programs that leverage trajectories, human feedback, and reinforcement learning to systematically improve reliability and impact. From RL environments to tool and computer use, our team explores how AI perceives, reasons, and manipulates complex systems to achieve multi-step goals safely, reliably, and efficiently.
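The loop described above can be sketched minimally. Everything here is a hypothetical illustration, assuming a policy that picks a tool each step and a trajectory log that later feeds human feedback or RL:

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Minimal observe-decide-act loop. The policy maps an observation to
    a (tool_name, argument) action; each tool result becomes the next
    observation, and the full trajectory is recorded for later evaluation
    or reinforcement-learning-style feedback."""
    observation, trajectory = goal, []
    for _ in range(max_steps):
        action, arg = policy(observation)            # decide
        if action == "finish":
            trajectory.append(("finish", arg, None))
            return arg, trajectory
        result = tools[action](arg)                  # act on the environment
        trajectory.append((action, arg, result))     # record the transition
        observation = result                         # observe
    return None, trajectory                          # horizon exhausted

# Hypothetical environment: one calculator tool and a scripted policy.
tools = {"calc": lambda expr: str(eval(expr))}

def scripted_policy(obs):
    return ("calc", "6*7") if obs == "compute 6*7" else ("finish", obs)

answer, traj = run_agent(scripted_policy, tools, "compute 6*7")
```

Real agents replace the scripted policy with an LLM call and the calculator with real tools, but the trajectory log, the reward signals attached to it, and the step budget are where the evaluation and governance work described above attaches.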
What Agents offers
- Agent architectures & templates for modular reasoning, planning, and execution
- Planner modules that support decision-making under uncertainty and dynamic goals
- Execution & control frameworks for safe actuation, fallback logic, and continuous monitoring
Our work within Agents makes it possible to build reliable, interpretable, and modular agents. By tightly coupling reasoning, planning, and action, it accelerates progress toward trustworthy autonomous systems.

Robotics (LBRx)
Robotics brings AI into the physical world, where perception, control, and real-time decision-making intersect. LBRx delivers the high-quality, diverse training data essential for teaching robots complex manipulation tasks.
We combine deep robotics expertise, a powerful unified platform, access to a global network of vetted human experts, and state-of-the-art systems. The result: faster breakthroughs and accelerated product development powered by high-quality robotics data.
What LBRx enables
- Robotics teleops: Expert teleoperators collect high-fidelity demonstrations of robotic arm control, producing multimodal datasets essential for learning fine-grained manipulation policies.
- Human egocentric: Capturing human perspective, decision-making, and interaction patterns through vision, motion sensors, and wearables to teach AI systems how people see, move, and behave in real-world contexts.
- Annotation & synthetic data: Integrated pipelines for precisely annotated robotics datasets, paired with synthetic generation grounded in real-world observations.
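To make "multimodal teleop datasets" concrete, here is one plausible shape for a demonstration record. The field names and structure are illustrative assumptions, not LBRx's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TeleopFrame:
    """One timestep of a teleoperated demonstration (all fields illustrative)."""
    timestamp_s: float
    joint_angles_rad: list           # robot arm joint positions
    gripper_open: float              # 0.0 = closed, 1.0 = fully open
    camera_frames: dict = field(default_factory=dict)  # view name -> image path

@dataclass
class Demonstration:
    """A full teleop episode, annotated with the task it demonstrates."""
    task_label: str
    operator_id: str
    frames: list = field(default_factory=list)

demo = Demonstration(task_label="pick_and_place", operator_id="op-042")
demo.frames.append(
    TeleopFrame(0.0, [0.1] * 7, 1.0, {"wrist": "frame_0000.png"})
)
```

A manipulation policy trained by imitation consumes exactly this pairing: synchronized proprioception and camera views as input, the operator's next action as the target.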
LBRx bridges data, models, and deployment, providing a structured path to bring AI into real-world environments safely and efficiently.
During his visit to our SF robotics lab, Dwarkesh got a firsthand look at how we're redefining next-gen robotics development with cutting-edge data collection.
Shaping the next era of frontier AI
These three offerings from our Applied Research team are built to accelerate the remarkable progress our customers are making on the path towards superintelligence. Our shared goal is to provide benchmarks that deliver clearer insight into model behavior, agent frameworks that make complex behaviors easier to study, and robotics systems that connect models to the physical world.
We welcome researchers, practitioners, and builders to partner with us, offering feedback and insights that help shape the next generation of AI capabilities.