Your all-in-one hub for advancing frontier AI. Explore research, product updates, guides, and real-world use cases.
Takeaways on the themes and research directions likely to shape the year ahead. We focus on two core areas: how to rigorously measure AI capabilities, and how to build interactive systems that learn through experience over time.
Labelbox•December 16, 2025
Most real-world tasks are underspecified. We introduce Implicit Intelligence to test whether agents can infer hidden constraints, and Agent-as-a-World, a simple YAML framework for simulating environments without brittle, hard-coded worlds.
Ved Sirdeshmukh•November 21, 2025
Today we’re launching Labelbox Applied Research with three flagship pillars: Labelbox Evals for unified model evaluation, Labelbox Agents for building reliable and interpretable agents, and Labelbox Robotics (LBRx) for delivering high-quality training data for advanced robotic manipulation.
Labelbox•November 19, 2025
We've released a research paper on R-ConstraintBench, a novel benchmark for evaluating LLM reasoning on realistic resource-constrained project scheduling problems (RCPSP), a well-known NP-complete challenge.
Labelbox•August 22, 2025
Introducing Labelbox’s deep research leaderboard: an open, continuously updated scorecard showing how top AI agents from OpenAI, Google, and Anthropic perform on long-form research tasks.
Labelbox•July 21, 2025
We tested rubric-based rewards and GRPO on a real-world e-commerce task and found they outperformed sparse rewards by 300%. This helps validate their effectiveness for complex, multi-step business workflows.
Labelbox•July 1, 2025
Agentic AI is emerging as a new frontier in autonomy, where models can plan, adapt, and take action independently. In this post we highlight three real-world projects with leading AI labs, spanning multi-step tool use, structured reasoning, and dynamic instruction following.
Labelbox•June 30, 2025
Enterprises need search-augmented LLMs that deliver fast, trustworthy, and up-to-date answers—not just polished language. Since public benchmarks rarely test for this, the Labelbox research team conducted its own study across three frontier models: Gemini 2.5 Pro, GPT-4.1, and Claude 4.0 Opus.
Labelbox•June 13, 2025