Environments for post-training, at scale

RL training gyms & evals for reasoning, tool use, and computer use — built for the domains where AI creates the most economic value.

RL environments for the hardest problems in AI

Software-generated RL environments at scale, with reward signals calibrated to your target pass@k, across the most valuable knowledge work domains.

Scientific knowledge work examples

The simulation platform for enterprise knowledge work

WorldSim recreates the full enterprise software stack (GitLab, Jira, CRM, email, chat, and more), seeded with realistic business data. Configurable world effects generate diverse scenarios at scale: outages, PRs, database mutations, content changes. Agents navigate 600+ MCP tools across computer-use and terminal-use tasks. Each task is graded deterministically or with an LLM-as-judge, tuned to your target pass@k, and integrated with your RL infrastructure.

The reward signal problem, solved at scale

Building RL environments that produce good reward signals is hard. Task design, verification logic, complexity gradients, credit assignment — get any of it wrong and your model learns shortcuts instead of capabilities.

Labelbox's software generates environments that encode that expertise — calibrated to your reward objectives and pass@k targets, at the scale your post-training demands.
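For readers who want the arithmetic behind a pass@k target, here is a minimal sketch in Python. It uses the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts on a task and c is how many passed. The helper `in_target_band` and its thresholds are hypothetical, added only for illustration; this is not a description of how Labelbox calibrates environments.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset contains at least one pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def in_target_band(n: int, c: int, k: int, low: float, high: float) -> bool:
    """Hypothetical check: is a task's pass@k inside a chosen difficulty band?"""
    return low <= pass_at_k(n, c, k) <= high


if __name__ == "__main__":
    # Illustrative numbers only: 16 rollouts on one task, 4 of them passed.
    print(pass_at_k(16, 4, 1))
    print(in_target_band(16, 4, 1, low=0.1, high=0.5))
```

A task that passes on every attempt or on none contributes little learning signal, which is one reason a team might filter tasks to a band like the one assumed above.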

Teaching models taste

Robert Pirsig called it Quality — recognizable before it's definable. Preference labels are that signal: structured comparative judgments on agentic trajectories from RL environments, capturing what makes a long-horizon response genuinely good, not just correct.

Operating across the most economically valuable domains

Autonomous AI research

Long-horizon reasoning tasks with intermediate reward signals. Multi-step hypothesis generation, verification, and structured knowledge synthesis. Environments co-designed with domain experts.

Agentic coding & software engineering

Long-horizon software tasks across real codebases — debugging production failures, authoring PRs, navigating full SDLC workflows. Agents operate on real code with real consequences, not toy problems.

Multimodal knowledge work

Tasks spanning text, images, charts, documents, and structured data — requiring agents to reason across modalities within a single workflow. Built for the full complexity of real knowledge work.

Voice with agentic tool use

Voice-native agents that reason, plan, and execute tool calls mid-conversation. Tasks designed for the latency, interruption, and context challenges unique to voice interfaces.

Computer use

GUI-based task execution across enterprise software. Verifiable multi-step outcomes in environments that mirror production toolchains.

Cybersecurity

Attack and defense scenarios. CTF-style tasks with programmatic verification. Adversarial edge cases designed to surface brittleness.

Hundreds of AI teams build with Labelbox

How Meta built GIM with Labelbox data to evaluate frontier AI reasoning

Problem

Meta needed a benchmark that remained discriminative as existing LLM evaluations saturated. The team wanted tasks grounded in practical reasoning rather than obscure knowledge or synthetic puzzles, with enough rubric detail to capture partial credit and enough quality control to support a public-private contamination diagnostic.

Solution

Labelbox produced the data foundation for GIM: 820 expert-authored problems across seven cognitive categories, including 229 multimodal items and 528 rubric-graded prompts. The work included original prompt creation, structured scoring criteria, review, annotation, and quality assurance, enabling Meta to calibrate a two-parameter logistic (2PL) item response theory (IRT) model over more than 200,000 prompt-response pairs.

Result

Meta released GIM-615, calibrated item parameters, and an evaluation framework that benchmarked 22 models across 47 reporting configurations. The paper found GIM remains far from saturated, with roughly 20% of items above frontier ability, giving researchers a durable way to compare model capability, thinking budgets, and future systems.
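As background on the 2PL IRT model mentioned above, here is a minimal sketch in Python of the standard two-parameter logistic form, P(correct) = 1 / (1 + exp(-a(θ - b))), where θ is model ability, a is item discrimination, and b is item difficulty. The function names and item values are hypothetical and are not GIM's calibrated parameters. Reading "above frontier ability" as "difficulty b exceeds the strongest model's θ" is an assumption made for this illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a model with ability theta answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def share_above_ability(difficulties: list[float], theta: float) -> float:
    """Fraction of items whose difficulty b exceeds the given ability theta."""
    if not difficulties:
        raise ValueError("difficulties must not be empty")
    return sum(b > theta for b in difficulties) / len(difficulties)


if __name__ == "__main__":
    # Made-up difficulties and ability, for illustration only.
    item_difficulties = [-1.0, 0.0, 1.0, 2.0, 3.0]
    frontier_theta = 1.5
    print(p_correct(frontier_theta, a=1.0, b=1.5))
    print(share_above_ability(item_difficulties, frontier_theta))
```

When θ equals b the model has an even chance on the item, so items with b above the best model's θ are ones it is more likely to miss than to solve.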

Technology and software

Experience the difference with Labelbox

Get started with high-quality RL data today.