Labelbox • November 19, 2025
Bridging insight and innovation: Introducing Labelbox Applied Research

Today, we’re launching Labelbox Applied Research, along with its three flagship pillars:
- Labelbox Evals: A unified evaluation framework that helps researchers deeply understand system behavior across reasoning, robustness, and alignment dimensions.
- Labelbox Agents: A foundational suite that accelerates the development of agents that are reliable, interpretable, and built with modularity in mind.
- Labelbox Robotics (LBRx): Our cutting-edge robotics division that delivers high-quality, diverse training data essential for teaching robots complex manipulation tasks.
Applied Research exists to address a fundamental gap in modern AI. We can build increasingly powerful models, but we lack equally powerful ways to truly understand, measure, and improve how they behave in the real world.
Together, the three pillars support a unified mission: to build the frameworks, measurements, and infrastructure necessary to understand and improve AI systems not just in theory, but in the environments where they will actually matter.

Evaluations
AI research progresses only as fast as its ability to measure progress, yet traditional benchmarks report scores without revealing the reasons behind model behavior.
Labelbox Evaluations (or Evals for short) define how we measure model quality on real, economically meaningful tasks. Our focus is to build post-training evaluation methods, benchmarks, and leaderboards that stress-test reasoning, robustness, safety, and cost-quality trade-offs across domains and modalities, reflecting real-world complexity and impact. Our Evals team moves beyond saturated, score-driven benchmarks and toward analyses that reveal why a model responds the way it does, how it plans over long horizons, and how its reasoning holds up under real economic and operational constraints.
A core part of this work is our private benchmark program, which captures true model performance on tasks that reflect real operational difficulty rather than saturated public datasets. These private benchmarks surface real-world loss patterns, expose blind spots that traditional benchmarks miss, and guide targeted data collections that enable genuine hill climbing and measurable improvement over time. Our infrastructure plugs into training loops, CI, and production monitoring so model changes can be treated as controlled experiments that tie back to critical business metrics like user adoption and retention.
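The "model changes as controlled experiments" idea above can be sketched in a few lines. This is an illustrative harness, not Labelbox's actual infrastructure: the task format, `check` callables, and pass-rate comparison are all assumptions chosen to show the shape of the workflow.

```python
import statistics

def evaluate(model_fn, benchmark):
    """Score a model on each benchmark task; returns per-task pass/fail."""
    return [task["check"](model_fn(task["prompt"])) for task in benchmark]

def compare(baseline_fn, candidate_fn, benchmark):
    """Treat a model change as a controlled experiment on a fixed task set."""
    base = evaluate(baseline_fn, benchmark)
    cand = evaluate(candidate_fn, benchmark)
    # Tasks the baseline passed but the candidate newly fails: regressions
    # that a single aggregate score would hide.
    regressions = [t["prompt"] for t, b, c in zip(benchmark, base, cand)
                   if b and not c]
    return {
        "baseline_pass_rate": statistics.mean(base),
        "candidate_pass_rate": statistics.mean(cand),
        "regressions": regressions,
    }

# Tiny illustrative benchmark: exact-match arithmetic tasks.
benchmark = [
    {"prompt": "2+2", "check": lambda out: out == "4"},
    {"prompt": "3*3", "check": lambda out: out == "9"},
]
report = compare(lambda p: "4", lambda p: str(eval(p)), benchmark)
```

Wired into CI, a report like this can gate a model change the same way a failing unit test gates a code change.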
We also partner with researchers at leading labs to co-design evaluation suites that span both public frontier-model benchmarks and private, domain-specific tasks, establishing a shared, quantitative standard for intelligence.
What Evals brings
- Private benchmarks & hill climbing: Real-world, domain-specific evaluations that uncover true performance, reveal real-world loss patterns, and power targeted data collections for continuous improvement
- Alignment & safety: Tracking model robustness, behavior under adversarial inputs, and fairness
- Insights: Diagnostic reports and visualizations for model error analysis
- Stumps: Tasks at the frontier of difficulty, designed to probe model limits and expose gaps in reasoning, robustness, and generalization
Labelbox Evals shifts the focus from leaderboard wins to true understanding. Transparent, human-centric evaluation helps researchers know what models truly can and cannot do.

Agents
Modern AI systems are evolving from passive predictors to active agents. Our team’s work focuses on LLM-based agents that operate within real software environments, interacting, reasoning, and acting to achieve complex goals.
This approach brings together customer-facing researchers and applied AI engineers to design and evaluate autonomous agents across diverse environments. We focus on three key areas: defining reward signals and feedback objectives that align with business value and safety, benchmarking agent performance on realistic workflows, and building the tooling needed to evaluate and govern models as agents.
Through this work, we aim to enable agents to dynamically interact with external systems over long horizons, running data programs that leverage trajectories, human feedback, and reinforcement learning to systematically improve reliability and impact. From RL environments to tool and computer use, our team explores how AI perceives, reasons, and manipulates complex systems to achieve multi-step goals safely, reliably, and efficiently.
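The loop described above can be sketched minimally. Everything here is a hypothetical illustration, assuming a policy that picks a tool each step and a trajectory log that later feeds human feedback or RL:

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Minimal observe-decide-act loop. The policy maps an observation to
    a (tool_name, argument) action; each tool result becomes the next
    observation, and the full trajectory is recorded for later evaluation
    or reinforcement-learning-style feedback."""
    observation, trajectory = goal, []
    for _ in range(max_steps):
        action, arg = policy(observation)            # decide
        if action == "finish":
            trajectory.append(("finish", arg, None))
            return arg, trajectory
        result = tools[action](arg)                  # act on the environment
        trajectory.append((action, arg, result))     # record the transition
        observation = result                         # observe
    return None, trajectory                          # horizon exhausted

# Hypothetical environment: one calculator tool and a scripted policy.
tools = {"calc": lambda expr: str(eval(expr))}

def scripted_policy(obs):
    return ("calc", "6*7") if obs == "compute 6*7" else ("finish", obs)

answer, traj = run_agent(scripted_policy, tools, "compute 6*7")
```

Real agents replace the scripted policy with an LLM call and the calculator with real tools, but the trajectory log, the reward signals attached to it, and the step budget are where the evaluation and governance work described above attaches.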
What Agents offers
- Agent architectures & templates for modular reasoning, planning, and execution
- Planner modules that support decision-making under uncertainty and dynamic goals
- Execution & control frameworks for safe actuation, fallback logic, and continuous monitoring
Our work within Agents makes it possible to build reliable, interpretable, and modular agents. By tightly coupling reasoning, planning, and action, it accelerates progress toward trustworthy autonomous systems.

Robotics (LBRx)
Robotics brings AI into the physical world, where perception, control, and real-time decision-making intersect. LBRx delivers the high-quality, diverse training data essential for teaching robots complex manipulation tasks.
We combine deep robotics expertise, a powerful unified platform, access to a global network of vetted human experts, and state-of-the-art systems. The result: faster breakthroughs and accelerated product development powered by high-quality robotics data.
What LBRx enables
- Robotics teleops: Expert teleoperators collect high-fidelity demonstrations of robotic arm control, producing multimodal datasets essential for learning fine-grained manipulation policies.
- Human egocentric: Capturing human perspective, decision-making, and interaction patterns through vision, motion sensors, and wearables to teach AI systems how people see, move, and behave in real-world contexts.
- Annotation & synthetic data: Integrated pipelines for precisely annotated robotics datasets, paired with synthetic generation grounded in real-world observations.
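To make "multimodal teleop datasets" concrete, here is one plausible shape for a demonstration record. The field names and structure are illustrative assumptions, not LBRx's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TeleopFrame:
    """One timestep of a teleoperated demonstration (all fields illustrative)."""
    timestamp_s: float
    joint_angles_rad: list           # robot arm joint positions
    gripper_open: float              # 0.0 = closed, 1.0 = fully open
    camera_frames: dict = field(default_factory=dict)  # view name -> image path

@dataclass
class Demonstration:
    """A full teleop episode, annotated with the task it demonstrates."""
    task_label: str
    operator_id: str
    frames: list = field(default_factory=list)

demo = Demonstration(task_label="pick_and_place", operator_id="op-042")
demo.frames.append(
    TeleopFrame(0.0, [0.1] * 7, 1.0, {"wrist": "frame_0000.png"})
)
```

A manipulation policy trained by imitation consumes exactly this pairing: synchronized proprioception and camera views as input, the operator's next action as the target.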
LBRx bridges data, models, and deployment, providing a structured path to bring AI into real-world environments safely and efficiently.
During his visit to our SF robotics lab, Dwarkesh got a firsthand look at how we're redefining next-gen robotics development with cutting-edge data collection.
Shaping the next era of frontier AI
These three offerings from our Applied Research team are built to accelerate the remarkable progress our customers are making on the path towards superintelligence. Our shared goal is to provide benchmarks that deliver clearer insight into model behavior, agent frameworks that make complex behaviors easier to study, and robotics systems that connect models to the physical world.
We welcome researchers, practitioners, and builders to partner with us, offering feedback and insights that help shape the next generation of AI capabilities.