
Labelbox · July 21, 2025

Benchmarking deep research agents

Our latest benchmark takes an in-depth look into the latest advances in research-grade AI


In this post, we introduce Labelbox’s agentic leaderboard for deep research: an open, continuously updated scorecard that shows how the leading research-grade agents from Google, OpenAI, and Anthropic perform when presented with long-form, research-driven questions and tasks.

Most public leaderboards reward eloquent answers to short factual prompts. In real-world settings, however, enterprises, academics, and analysts want to assess:

  • Depth: Can the agent reliably trace an argument across 20k tokens of technical text?
  • Evidence: Does the agent cite peer‑reviewed literature (and not random blogs) in the correct style?
  • Taste: Can the agent reliably summarize nuance, note counter‑arguments, and know when to say “the evidence is inconclusive”?

How we built the test set

| Phase | What we did | Why it matters |
|---|---|---|
| Prompt design | Partnered with PhDs who drafted domain‑expert questions spanning physics, bio‑engineering, economics, political theory, art history, and more. | Ensures each prompt demands genuine reasoning and literature synthesis. |
| Rubric creation | The same scholars wrote hyper‑detailed “answer traces” (key citations, logic checkpoints, common pitfalls) to anchor grading. | Prevents “fluent nonsense” from sneaking through. |
| Dual evaluation | Responses were scored first by an LLM‑as‑a‑judge model trained on the traces, then spot‑audited by human reviewers. | Combines scale with expert oversight. |
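To make the dual-evaluation phase concrete, here is a minimal sketch of how an LLM-as-a-judge pass anchored to expert answer traces might be wired up. Everything in it is illustrative: the AnswerTrace fields mirror the trace elements described above, judge_llm() is a hypothetical stand-in for whichever judge-model endpoint is used, and the 10% audit sample rate is an assumption rather than our production setting.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class AnswerTrace:
    """Expert-written grading anchor: key citations, logic checkpoints, pitfalls."""
    key_citations: list[str]
    logic_checkpoints: list[str]
    common_pitfalls: list[str]


def judge_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the trained judge model; returns raw JSON text."""
    raise NotImplementedError("Wire this to your judge-model endpoint.")


def score_response(question: str, response: str, trace: AnswerTrace) -> dict:
    """First pass: the judge model grades a response against its answer trace."""
    rubric = (
        "Score the response from 0 to 100 on depth, evidence, and taste.\n"
        f"Logic checkpoints the answer must hit: {trace.logic_checkpoints}\n"
        f"Citations that should appear: {trace.key_citations}\n"
        f"Pitfalls to penalize: {trace.common_pitfalls}\n"
        'Return JSON of the form {"score": <int>, "rationale": <str>}.'
    )
    raw = judge_llm(f"{rubric}\n\nQuestion:\n{question}\n\nResponse:\n{response}")
    return json.loads(raw)


def spot_audit_sample(scored: list[dict], sample_rate: float = 0.1) -> list[dict]:
    """Second pass: route a random sample of judged responses to human reviewers."""
    return [item for item in scored if random.random() < sample_rate]
```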

Headline results

| Rank | Model (research mode) | Composite (0‑100) | STEM | Humanities | Econ / Politics |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 79 | 75 | 77 |
| 2 | GPT‑o4-mini | 71.46% | 72 | 70 | 71 |
| 3 | Claude 4 Opus | 69.05% | 69 | 67 | 68 |

What the numbers tell us

  • Gemini 2.5 Pro currently leads, especially on long‑context STEM reasoning (e.g., deriving scaling laws from 30‑page fluid‑dynamics pre‑prints) and on weaving EU policy documents into economic critiques.
  • GPT‑o4-mini delivers tight mathematical proofs and elegant prose but loses points for sporadic citation lapses and a weaker showing on historical nuance.
  • Claude 4 Opus shines at explanatory clarity yet trails on cutting‑edge source retrieval, particularly in fast‑moving policy debates.

We conducted a rigorous evaluation using actual PhD-level research as our benchmark, comparing Google's Deep Research product against OpenAI's and Anthropic's research capabilities. Rather than relying on synthetic tests, we used real academic research with established ground truth and optimal research methodologies as our comparison standard. Google's Deep Research outperformed both competitors across every major metric, including research quality, source integration accuracy, methodological rigor, and output reliability.

This outcome mirrors our previous Search Agent leaderboard findings and highlights Google's core advantage: decades of expertise in both information retrieval and web search. Their deep understanding of how to find, evaluate, and synthesize information at scale translated directly into superior research capabilities, demonstrating that domain expertise in search and knowledge organization provides a measurable edge in AI-powered research tools, even when tested against the highest academic standards.

What each research agent does well, and where it can do better

| Capability | Best‑in‑class | Areas where errors tend to occur |
|---|---|---|
| Ultra‑long technical synthesis | Gemini parsed 60k‑token treatises without truncating context. | GPT‑o4 occasionally dropped early table notes; Claude compressed mid‑section arguments. |
| Evidence discipline | GPT‑o4 kept inline APA citations 92% of the time. | All models hallucinated DOIs in ~5% of cases; citation QA still required. |
| Cross‑domain agility | Gemini connected semiconductor export bans to nitrogen fertiliser futures in Ukraine, demonstrating cross‑vertical reasoning. | Claude undercited primary economic data; GPT‑o4 misdated one WTO ruling. |
| Taste / critical voice | Claude offered the most balanced historiography critiques. | Gemini occasionally defaulted to policy buzzwords; GPT‑o4 tended to hedge excessively. |

STEM subrankings

| Rank | Model (research mode) | Composite (0‑100) | STEM |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 79.4% |
| 2 | GPT‑o4-mini | 71.46% | 72.7% |
| 3 | Claude 4 Opus | 69.05% | 69.8% |


Gemini emerges as the strongest STEM researcher here, largely because of its tight hooks into academic databases, which allow it to surface brand-new studies and well‑cited conference papers faster than its peers. GPT‑o4-mini follows closely, often providing the most rigorous step‑by‑step derivations and mathematical reasoning, though it can trail when the prompt hinges on very recent literature. Claude distinguishes itself with exceptionally clear explanations, which are great for audiences who need concepts unpacked, yet it still misses the occasional cutting‑edge citation, which ultimately nudges it behind the other two.

Economics, politics and global affairs subranking

| Rank | Model (research mode) | Composite (0‑100) | Econ / Policy |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 75.9% |
| 2 | GPT‑o4-mini | 71.46% | 70.6% |
| 3 | Claude 4 Opus | 69.05% | 67.4% |

When it comes to politics and economics-related tasks, the advantage shifts toward tools that ingest real‑time news feeds: Gemini usually cites fresher legislative updates and market data, giving its answers a timeliness edge. GPT‑o4-mini shines when prompts demand comparative or multi‑actor analysis, piecing together nuanced arguments from a broad document set, though it may reference filings a day or two out of date. Claude offers thoughtful normative commentary and clear policy framing, yet its supporting references are somewhat less diverse and occasionally miss subtle date discrepancies, keeping it just shy of the front‑runners.

Humanities & arts subranking

| Rank | Model (research mode) | Composite (0‑100) | Humanities / Arts |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 74.1% |
| 2 | GPT‑o4-mini | 71.46% | 71.8% |
| 3 | Claude 4 Opus | 69.05% | 68.8% |

Across humanities prompts (history, cultural analysis, literature, and more), Gemini maintains a narrow lead by blending multilingual source-gathering with concise synthesis. GPT‑o4-mini is the model most adept at weaving together multiple perspectives and theoretical frameworks, but it sometimes leans too heavily on secondary commentary where primary texts would be stronger. Claude’s prose is the most elegant and contextual, making its answers highly readable; however, its source list tends to be shorter and skews toward mid‑tier journals, which slightly diminishes evidentiary depth.

Key insights

Complex questions reveal true capabilities

Simple factual queries make all three models look impressive, but enterprise-grade challenges expose their distinct strengths and weaknesses. We found the largest performance gaps emerged from:

  • Time-sensitive queries requiring fresh information
  • Multi-perspective analysis of contested policy issues
  • Adversarial prompts designed to test robustness

Citation reliability remains a universal challenge

Despite improvements in retrieval quality, none of the models can guarantee accurate attribution. Common issues include:

  • Broken or outdated links
  • Mismatched passages and sources
  • Fabricated citations that look legitimate

Recommendation: Treat independent citation validation as essential infrastructure, not an optional add-on. Any production deployment should include automated fact-checking or human oversight for source verification.
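As a rough illustration of what that infrastructure can look like, the sketch below runs two cheap checks on a finished research answer: that cited links still respond and that any DOIs actually resolve via doi.org. It assumes citations appear as plain URLs and DOI strings and uses the requests library; a production pipeline would go further and verify that each cited passage actually supports the claim attached to it.

```python
import re
import requests

# Rough DOI shape; good enough for flagging, not for strict validation.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def link_is_live(url: str, timeout: float = 10.0) -> bool:
    """Catch broken or outdated links before they reach a report."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False


def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """A DOI that doi.org cannot resolve is likely broken or fabricated.

    Note: some publishers reject HEAD requests, so treat failures as
    'needs human review' rather than proof of fabrication.
    """
    try:
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False


def validate_citations(answer_text: str, cited_urls: list[str]) -> dict:
    """Minimal citation-QA pass over a single research answer."""
    dois = DOI_PATTERN.findall(answer_text)
    return {
        "dead_links": [u for u in cited_urls if not link_is_live(u)],
        "unresolvable_dois": [d for d in dois if not doi_resolves(d)],
    }
```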

Model specialization creates portfolio opportunities

Rather than seeking one "best" model, our data suggests clear domain-specific advantages:

  • Gemini 2.5 Pro: Excels at cutting-edge scientific content and real-time information synthesis
  • GPT-o4-mini: Delivers superior cross-source analysis and quantitative reasoning
  • Claude 4 Opus: Provides exceptional narrative clarity and explanatory depth

Strategic implication: Organizations should consider a portfolio approach, routing different query types to the model best suited for that domain and urgency level.
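A minimal sketch of what such routing could look like in practice follows. The domain and intent labels, the model identifier strings, and the routing choices themselves are illustrative placeholders derived from the observations above, not an exact taxonomy or a recommendation of specific API model names.

```python
from enum import Enum


class Domain(Enum):
    STEM = "stem"
    HUMANITIES = "humanities"
    ECON_POLICY = "econ_policy"


# Illustrative routing table based on the strengths observed in this benchmark.
# Model identifiers are placeholders; swap in whatever your providers expose.
ROUTING_TABLE: dict[tuple[Domain, str], str] = {
    (Domain.STEM, "fresh_literature"): "gemini-2.5-pro",
    (Domain.STEM, "step_by_step_derivation"): "gpt-o4-mini",
    (Domain.ECON_POLICY, "time_sensitive"): "gemini-2.5-pro",
    (Domain.ECON_POLICY, "multi_actor_analysis"): "gpt-o4-mini",
    (Domain.HUMANITIES, "explanatory_synthesis"): "claude-4-opus",
}


def route_query(domain: Domain, intent: str, default: str = "gemini-2.5-pro") -> str:
    """Pick the research agent best suited to a query's domain and intent."""
    return ROUTING_TABLE.get((domain, intent), default)


# Example: a time-sensitive policy question goes to the model with the freshest retrieval.
print(route_query(Domain.ECON_POLICY, "time_sensitive"))
```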


You can check out the deep research leaderboard here. We plan to update it regularly to reflect major releases and model updates. If you're interested in implementing research-grade evaluations for your AI models and projects, contact us to explore how Labelbox can help you run a powerful evaluation of your models.