How Meta built GIM with Labelbox data to evaluate frontier AI reasoning

Problem

Meta needed a benchmark that remained discriminative as existing LLM evaluations saturated. The team wanted tasks grounded in practical reasoning rather than obscure knowledge or synthetic puzzles, with enough rubric detail to capture partial credit and enough quality control to support a public-private contamination diagnostic.

Solution

Labelbox produced the data foundation for GIM: 820 expert-authored problems across seven cognitive categories, including 229 multimodal items and 528 rubric-graded prompts. The work included original prompt creation, structured scoring criteria, review, annotation, and quality assurance, enabling Meta to calibrate a two-parameter logistic (2PL) Item Response Theory (IRT) model over more than 200,000 prompt-response pairs.

Result

Meta released GIM-615, calibrated item parameters, and an evaluation framework that benchmarked 22 models across 47 reporting configurations. The paper found that GIM remains far from saturated, with roughly 20% of items above the ability of the strongest reported frontier configuration, giving researchers a durable way to compare model capability, thinking budgets, and future systems.

Building a benchmark for integrated reasoning

As LLM benchmarks saturated, Meta Superintelligence Labs set out to measure something harder to fake: whether models can integrate multiple cognitive skills inside practical, grounded tasks. The result was the Grounded Integration Measure (GIM), a benchmark designed to test reasoning across constraints, state tracking, epistemic vigilance, audience calibration, planning, and multimodal context.

Read the full Meta GIM research paper on arXiv.

Meta's GIM paper describes a benchmark of 820 original problems, including 615 public items and 205 private items for contamination diagnostics. Rather than relying on obscure expert trivia or purely abstract puzzles, GIM uses broadly accessible knowledge and raises difficulty through integration density: each task asks models to coordinate several forms of reasoning at once.
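
One simple way to picture a public-private contamination check is to compare a model's mean score on the two splits. The Python sketch below is an illustration only: the function name, the sample scores, and the mean-difference statistic are assumptions, not the diagnostic defined in the GIM paper.

```python
from statistics import mean


def contamination_gap(public_scores, private_scores):
    """Hypothetical diagnostic: mean public score minus mean private score.

    A model that has memorized public items would tend to score higher on
    them than on comparable private items, producing a positive gap.
    """
    if not public_scores or not private_scores:
        raise ValueError("both splits need at least one score")
    return mean(public_scores) - mean(private_scores)


# Made-up per-item scores in [0, 1] for a single model.
public = [0.8, 0.6, 0.7]
private = [0.5, 0.6, 0.4]
gap = contamination_gap(public, private)
```

A gap near zero is consistent with no memorization of public items. A large positive gap would warrant a closer look, although differences in item difficulty between the splits can also produce one.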

Producing the data foundation behind GIM

Labelbox produced the data behind the research paper, turning Meta's benchmark design into a high-quality evaluation dataset. The work spanned original expert-authored prompts, multimodal items, rubric-decomposed scoring criteria, and reviewed annotations that made the benchmark reliable enough for calibrated model comparisons.

The paper notes that each GIM prompt required, on average, about 11 person-hours from drafting through reference-answer derivation, rubric decomposition, peer review, and quality assurance. Across all 820 problems, that comes to roughly 9,000 person-hours (820 × 11 ≈ 9,020), or about four person-years of expert author time.

Key elements of the dataset included:

820 original problems across seven primary cognitive categories and eighteen sub-categories. This let Meta slice results by logical reasoning, quantitative reasoning, world knowledge, language and intent, procedural planning, constraints, and spatial or intuitive reasoning.
528 rubric-graded prompts with a median of six independently judged criteria. Rubrics converted open-ended responses into partial-credit signals instead of flattening each answer to a binary pass or fail.
229 multimodal items spanning images and PDFs. These tasks tested whether models could reason over supporting context, not merely read text prompts.
A balanced public-private split. The 615 public and 205 private items give researchers reproducibility while preserving a built-in way to detect benchmark contamination.
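
To make the partial-credit idea in the rubric-graded prompts concrete, here is a minimal Python sketch. It assumes each criterion is judged independently as met or not met and that all criteria carry equal weight. The class, the function, and the equal weighting are hypothetical, and the paper's actual aggregation rule may differ.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One independently judged rubric criterion (hypothetical structure)."""
    description: str
    met: bool


def rubric_score(criteria):
    """Fraction of criteria met, assuming equal weights (an assumption)."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


# A response judged against six criteria, four of which it satisfies.
judged = [
    Criterion("states the correct final answer", True),
    Criterion("respects the budget constraint", True),
    Criterion("tracks the updated inventory state", True),
    Criterion("flags the unreliable source", False),
    Criterion("matches the requested audience level", True),
    Criterion("orders the steps feasibly", False),
]
score = rubric_score(judged)  # 4 of 6 criteria met
```

Under binary grading this response would simply fail. The rubric instead records that it satisfied four of six criteria, which is the kind of graded signal a continuous-response model can use.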

Turning expert judgment into calibrated evaluation signal

GIM's value depends on more than the prompts themselves. The benchmark needed scoring criteria that were atomic, self-contained, mutually exclusive, collectively exhaustive, and easy to verify. That structure gave Meta a way to score open-ended answers with partial credit and compare models even when frontier-model runs had missing, truncated, or failed responses.

With the dataset in place, Meta calibrated a continuous-response two-parameter logistic Item Response Theory model over more than 200,000 prompt-response pairs. This produced ability estimates that remain comparable across model runs, item subsets, and thinking configurations.
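
For readers unfamiliar with IRT, the sketch below shows the textbook 2PL curve and a simple grid-search ability estimate that skips missing responses. It is not Meta's calibration code. The function names, the grid range, and the choice to treat a partial-credit score as the target of a Bernoulli-style quasi-likelihood are all assumptions made for illustration.

```python
import math


def expected_score(theta, a, b):
    """Textbook 2PL curve: expected score in (0, 1) for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(items, scores):
    """Grid-search the theta that maximizes a quasi-log-likelihood.

    items  : list of (a, b) pairs with already calibrated parameters
    scores : list of scores in [0, 1]; None marks a missing response
    """
    grid = [i / 100 for i in range(-400, 401)]  # theta from -4.00 to 4.00

    def loglik(theta):
        total = 0.0
        for (a, b), s in zip(items, scores):
            if s is None:
                continue  # missing or failed responses are left out
            p = expected_score(theta, a, b)
            total += s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
        return total

    return max(grid, key=loglik)


# Illustrative item parameters and scores (made up, not from the paper).
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
theta_strong = estimate_ability(items, [1.0, 0.9, None])
theta_weak = estimate_ability(items, [0.5, 0.2, None])
```

Because each estimate uses only the items a model actually answered, together with those items' calibrated parameters, two models can be placed on the same ability scale even when they were scored on different subsets.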

A durable benchmark for frontier AI

The GIM paper reports a leaderboard spanning 22 models and 47 reporting configurations, showing that model configuration choices such as thinking budget and quantization can matter as much as model family. It also finds that GIM remains far from saturated: roughly 20% of items are above the ability of the strongest reported frontier configuration.
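
One way to read "above the ability of the strongest configuration" is that an item's calibrated difficulty exceeds the highest estimated ability on the leaderboard. The Python sketch below computes that share under this reading. The function name, the strict comparison, and the numbers are hypothetical and not taken from the paper.

```python
def fraction_above_frontier(difficulties, abilities):
    """Share of items whose difficulty exceeds the best ability estimate.

    Both lists are assumed to sit on the same IRT scale.
    """
    if not difficulties or not abilities:
        raise ValueError("need at least one item and one ability estimate")
    frontier = max(abilities)
    return sum(b > frontier for b in difficulties) / len(difficulties)


# Made-up values: five items, three model configurations.
share = fraction_above_frontier(
    difficulties=[-1.0, 0.0, 0.5, 1.0, 3.0],
    abilities=[0.2, 1.5, 2.0],
)
```

A benchmark approaches saturation as this share falls toward zero, so tracking it over successive model releases gives a rough sense of remaining headroom.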

By producing the benchmark's data foundation, Labelbox helped Meta turn complex, real-world reasoning demands into a reproducible evaluation framework. GIM now gives researchers a public item bank, calibrated parameters, and an extensible way to track frontier model capability as models, reasoning budgets, and agentic systems evolve.
