Environments for post-training, at scale
RL training gyms & evals for reasoning, tool use, and computer use — built for the domains where AI creates the most economic value.
RL environments for the hardest problems in AI
Software-generated RL environments at scale, with calibrated reward signals tuned to your target pass@k, across the most valuable knowledge work domains.
Scientific knowledge work examples
The simulation platform for enterprise knowledge work
WorldSim recreates the full enterprise software stack — GitLab, Jira, CRM, email, chat, and more — seeded with realistic business data. Configurable world effects generate diverse scenarios at scale: outages, PRs, database mutations, content changes. Agents navigate 600+ MCP tools across computer use and terminal use tasks, graded deterministically or with LLM-as-judge — tuned to your pass@k and integrated with your RL infrastructure.
The reward signal problem, solved at scale
Building RL environments that produce good reward signals is hard. Task design, verification logic, complexity gradients, credit assignment — get any of it wrong and your model learns shortcuts instead of capabilities.
Labelbox's software generates environments that encode that expertise — calibrated to your reward objectives and pass@k targets, at the scale your post-training demands.
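For readers unfamiliar with the metric, pass@k is commonly estimated from n sampled attempts per task, of which c pass verification. The sketch below uses the widely cited unbiased estimator, 1 - C(n-c, k) / C(n, k). It illustrates the metric only; the function names and the target band are hypothetical, and nothing here describes how Labelbox calibrates its environments.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def in_target_band(n: int, c: int, k: int, low: float, high: float) -> bool:
    """Hypothetical check: does a task's pass@k fall inside a chosen band?"""
    return low <= pass_at_k(n, c, k) <= high


if __name__ == "__main__":
    # 20 attempts on one task, 5 of them passed.
    print(pass_at_k(20, 5, 1))
    print(in_target_band(20, 5, 1, low=0.2, high=0.4))
```

A task that every attempt passes, or that none passes, carries little learning signal, which is one reason a team might filter tasks to a band like the one assumed above.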
Teaching models taste
Robert Pirsig called it Quality — recognizable before it's definable. Preference labels are that signal: structured comparative judgments on agentic trajectories from RL environments, capturing what makes a long-horizon response genuinely good, not just correct.
Operating across the most economically valuable domains
Autonomous AI research
Long-horizon reasoning tasks with intermediate reward signals. Multi-step hypothesis generation, verification, and structured knowledge synthesis. Environments co-designed with domain experts.
Agent coding & software engineering
Long-horizon software tasks across real codebases — debugging production failures, authoring PRs, navigating full SDLC workflows. Agents operate on real code with real consequences, not toy problems.
Multimodal knowledge work
Tasks spanning text, images, charts, documents, and structured data — requiring agents to reason across modalities within a single workflow. Built for the full complexity of real knowledge work.
Voice with agentic tool use
Voice-native agents that reason, plan, and execute tool calls mid-conversation. Tasks designed for the latency, interruption, and context challenges unique to voice interfaces.
Computer use
GUI-based task execution across enterprise software. Verifiable multi-step outcomes in environments that mirror production toolchains.
Cybersecurity
Attack and defense scenarios. CTF-style tasks with programmatic verification. Adversarial edge cases designed to surface brittleness.
Hundreds of AI teams build with Labelbox

How Meta built GIM with Labelbox data to evaluate frontier AI reasoning
Problem
Meta needed a benchmark that remained discriminative as existing LLM evaluations saturated. The team wanted tasks grounded in practical reasoning rather than obscure knowledge or synthetic puzzles, with enough rubric detail to capture partial credit and enough quality control to support a public-private contamination diagnostic.
Solution
Labelbox produced the data foundation for GIM: 820 expert-authored problems across seven cognitive categories, including 229 multimodal items and 528 rubric-graded prompts. The work included original prompt creation, structured scoring criteria, review, annotation, and quality assurance, enabling Meta to calibrate a 2PL IRT model over more than 200,000 prompt-response pairs.
Result
Meta released GIM-615, calibrated item parameters, and an evaluation framework that benchmarked 22 models across 47 reporting configurations. The paper found GIM remains far from saturated, with roughly 20% of items above frontier ability, giving researchers a durable way to compare model capability, thinking budgets, and future systems.
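For context, the two-parameter logistic (2PL) IRT model mentioned above is usually written as follows. The notation is the conventional textbook form and is an assumption here, not taken from the GIM paper: θ_j is the ability of model j, b_i the difficulty of item i, and a_i its discrimination.

```latex
P(x_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}
```

Under this form, one plausible reading of an item being "above frontier ability" is that its difficulty b_i exceeds the highest estimated θ_j among the evaluated models, so even the strongest model is expected to answer it correctly less than half the time.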

Human preference signal for evaluating LLMs inside Vertex AI


Tracking surgical instruments in video to advance robotic surgery
