
Labelbox · July 21, 2025

Benchmarking deep research agents

Our latest benchmark takes an in-depth look into the latest advances in research-grade AI


In this post, we introduce Labelbox’s agentic leaderboard for deep research: an open, continuously updated scorecard that shows how the leading research-grade agents from Google, OpenAI, and Anthropic perform when presented with long-form, research-driven questions and tasks.

Most public leaderboards reward eloquent answers to short factual prompts. In real-world settings, however, enterprises, academics, and analysts want to assess:

  • Depth: Can the agent reliably trace an argument across 20k tokens of technical text?
  • Evidence: Does the agent cite peer‑reviewed literature (and not random blogs) in the correct style?
  • Taste: Can the agent reliably summarize nuance, note counter‑arguments, and know when to say “the evidence is inconclusive”?

How we built the test set

| Phase | What we did | Why it matters |
|---|---|---|
| Prompt design | Partnered with PhDs who drafted domain‑expert questions spanning physics, bio‑engineering, economics, political theory, art history, and more. | Ensures each prompt demands genuine reasoning and literature synthesis. |
| Rubric creation | The same scholars wrote hyper‑detailed “answer traces” (key citations, logic checkpoints, common pitfalls) to anchor grading. | Prevents “fluent nonsense” from sneaking through. |
| Dual evaluation | Responses were scored first by an LLM‑as‑a‑judge model trained on the traces, then spot‑audited by human reviewers. | Combines scale with expert oversight. |
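To make the dual-evaluation phase concrete, here is a minimal sketch of how an LLM-as-a-judge pass anchored to expert answer traces might be wired up. Everything in it is illustrative: the AnswerTrace fields mirror the trace elements described above, judge_llm() is a hypothetical stand-in for whichever judge-model endpoint is used, and the 10% audit sample rate is an assumption rather than our production setting.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class AnswerTrace:
    """Expert-written grading anchor: key citations, logic checkpoints, pitfalls."""
    key_citations: list[str]
    logic_checkpoints: list[str]
    common_pitfalls: list[str]


def judge_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the trained judge model; returns raw JSON text."""
    raise NotImplementedError("Wire this to your judge-model endpoint.")


def score_response(question: str, response: str, trace: AnswerTrace) -> dict:
    """First pass: the judge model grades a response against its answer trace."""
    rubric = (
        "Score the response from 0 to 100 on depth, evidence, and taste.\n"
        f"Logic checkpoints the answer must hit: {trace.logic_checkpoints}\n"
        f"Citations that should appear: {trace.key_citations}\n"
        f"Pitfalls to penalize: {trace.common_pitfalls}\n"
        'Return JSON of the form {"score": <int>, "rationale": <str>}.'
    )
    raw = judge_llm(f"{rubric}\n\nQuestion:\n{question}\n\nResponse:\n{response}")
    return json.loads(raw)


def spot_audit_sample(scored: list[dict], sample_rate: float = 0.1) -> list[dict]:
    """Second pass: route a random sample of judged responses to human reviewers."""
    return [item for item in scored if random.random() < sample_rate]
```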

Headline results

| Rank | Model (research mode) | Composite (0‑100) | STEM | Humanities | Econ / Politics |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 79 | 75 | 77 |
| 2 | GPT‑o4-mini | 71.46% | 72 | 70 | 71 |
| 3 | Claude 4 Opus | 69.05% | 69 | 67 | 68 |

What the numbers tell us

  • Gemini 2.5 Pro currently leads, especially on long‑context STEM reasoning (e.g., deriving scaling laws from 30‑page fluid‑dynamics pre‑prints) and on weaving EU policy documents into economic critiques.
  • GPT‑o4-mini delivers tight mathematical proofs and elegant prose but loses points for sporadic citation lapses and a weaker showing on historical nuance.
  • Claude 4 Opus shines at explanatory clarity yet trails on cutting‑edge source retrieval, particularly in fast‑moving policy debates.

We conducted a rigorous evaluation using actual PhD-level research as our benchmark, comparing Google's Deep Research product against OpenAI's and Anthropic's research capabilities. Rather than relying on synthetic tests, we used real academic research with established ground truth and optimal research methodologies as our comparison standard. Google's Deep Research outperformed both competitors across every major metric, including research quality, source integration accuracy, methodological rigor, and output reliability.

This outcome mirrors our previous Search Agent leaderboard findings and highlights Google's core advantage: decades of expertise in both information retrieval and web search. Their deep understanding of how to find, evaluate, and synthesize information at scale translated directly into superior research capabilities, demonstrating that domain expertise in search and knowledge organization provides a measurable edge in AI-powered research tools, even when tested against the highest academic standards.

What each research agent does well, and where it can do better

| Capability | Best‑in‑class | Areas where errors tend to occur |
|---|---|---|
| Ultra‑long technical synthesis | Gemini parsed 60k‑token treatises without truncating context. | GPT‑o4 occasionally dropped early table notes; Claude compressed mid‑section arguments. |
| Evidence discipline | GPT‑o4 kept inline APA citations 92% of the time. | All models hallucinated DOIs in ~5% of cases; citation QA still required. |
| Cross‑domain agility | Gemini connected semiconductor export bans to nitrogen fertiliser futures in Ukraine, demonstrating cross‑vertical reasoning. | Claude undercited primary economic data; GPT‑o4 misdated one WTO ruling. |
| Taste / critical voice | Claude offered the most balanced historiography critiques. | Gemini occasionally defaulted to policy buzzwords; GPT‑o4 tended to hedge excessively. |

STEM subrankings

| Rank | Model (research mode) | Composite (0‑100) | STEM |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 79.4% |
| 2 | GPT‑o4-mini | 71.46% | 72.7% |
| 3 | Claude 4 Opus | 69.05% | 69.8% |


Gemini emerges as the strongest STEM researcher here, largely because of its tight hooks into academic databases, which allow it to surface brand-new studies and well‑cited conference papers faster than its peers. GPT‑o4-mini follows closely, often providing the most rigorous step‑by‑step derivations and mathematical reasoning, though it can trail when the prompt hinges on very recent literature. Claude distinguishes itself with exceptionally clear explanations, which are great for audiences who need concepts unpacked, yet it still misses the occasional cutting‑edge citation, which ultimately nudges it behind the other two.

Economics, politics and global affairs subranking

| Rank | Model (research mode) | Composite (0‑100) | Econ / Policy |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 75.9% |
| 2 | GPT‑o4-mini | 71.46% | 70.6% |
| 3 | Claude 4 Opus | 69.05% | 67.4% |

When it comes to politics and economics-related tasks, the advantage shifts toward tools that ingest real‑time news feeds: Gemini usually cites fresher legislative updates and market data, giving its answers a timeliness edge. GPT‑o4-mini shines when prompts demand comparative or multi‑actor analysis, piecing together nuanced arguments from a broad document set, though it may reference filings a day or two out of date. Claude offers thoughtful normative commentary and clear policy framing, yet its supporting references are somewhat less diverse and occasionally miss subtle date discrepancies, keeping it just shy of the front‑runners.

Humanities & arts subranking

| Rank | Model (research mode) | Composite (0‑100) | Humanities / Arts |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | 77.55% | 74.1% |
| 2 | GPT‑o4-mini | 71.46% | 71.8% |
| 3 | Claude 4 Opus | 69.05% | 68.8% |

Across humanities prompts (history, cultural analysis, literature, and more), Gemini maintains a narrow lead by blending multilingual source-gathering with concise synthesis. GPT‑o4-mini is the model most adept at weaving together multiple perspectives and theoretical frameworks, but it sometimes leans too heavily on secondary commentary where primary texts would be stronger. Claude’s prose is the most elegant and contextual, making its answers highly readable; however, its source list tends to be shorter and skews toward mid‑tier journals, which slightly diminishes evidentiary depth.

Key insights

Complex questions reveal true capabilities

Simple factual queries make all three models look impressive, but enterprise-grade challenges expose their distinct strengths and weaknesses. We found the largest performance gaps emerged from:

  • Time-sensitive queries requiring fresh information
  • Multi-perspective analysis of contested policy issues
  • Adversarial prompts designed to test robustness

Citation reliability remains a universal challenge

Despite improvements in retrieval quality, none of the models can guarantee accurate attribution. Common issues include:

  • Broken or outdated links
  • Mismatched passages and sources
  • Fabricated citations that look legitimate

Recommendation: Treat independent citation validation as essential infrastructure, not an optional add-on. Any production deployment should include automated fact-checking or human oversight for source verification.
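As a rough illustration of what that infrastructure can look like, the sketch below runs two cheap checks on a finished research answer: that cited links still respond and that any DOIs actually resolve via doi.org. It assumes citations appear as plain URLs and DOI strings and uses the requests library; a production pipeline would go further and verify that each cited passage actually supports the claim attached to it.

```python
import re
import requests

# Rough DOI shape; good enough for flagging, not for strict validation.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def link_is_live(url: str, timeout: float = 10.0) -> bool:
    """Catch broken or outdated links before they reach a report."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False


def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """A DOI that doi.org cannot resolve is likely broken or fabricated.

    Note: some publishers reject HEAD requests, so treat failures as
    'needs human review' rather than proof of fabrication.
    """
    try:
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False


def validate_citations(answer_text: str, cited_urls: list[str]) -> dict:
    """Minimal citation-QA pass over a single research answer."""
    dois = DOI_PATTERN.findall(answer_text)
    return {
        "dead_links": [u for u in cited_urls if not link_is_live(u)],
        "unresolvable_dois": [d for d in dois if not doi_resolves(d)],
    }
```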

Model specialization creates portfolio opportunities

Rather than seeking one "best" model, our data suggests clear domain-specific advantages:

  • Gemini 2.5 Pro: Excels at cutting-edge scientific content and real-time information synthesis
  • GPT-o4-mini: Delivers superior cross-source analysis and quantitative reasoning
  • Claude 4 Opus: Provides exceptional narrative clarity and explanatory depth

Strategic implication: Organizations should consider a portfolio approach, routing different query types to the model best suited for that domain and urgency level.
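A minimal sketch of what such routing could look like in practice follows. The domain and intent labels, the model identifier strings, and the routing choices themselves are illustrative placeholders derived from the observations above, not an exact taxonomy or a recommendation of specific API model names.

```python
from enum import Enum


class Domain(Enum):
    STEM = "stem"
    HUMANITIES = "humanities"
    ECON_POLICY = "econ_policy"


# Illustrative routing table based on the strengths observed in this benchmark.
# Model identifiers are placeholders; swap in whatever your providers expose.
ROUTING_TABLE: dict[tuple[Domain, str], str] = {
    (Domain.STEM, "fresh_literature"): "gemini-2.5-pro",
    (Domain.STEM, "step_by_step_derivation"): "gpt-o4-mini",
    (Domain.ECON_POLICY, "time_sensitive"): "gemini-2.5-pro",
    (Domain.ECON_POLICY, "multi_actor_analysis"): "gpt-o4-mini",
    (Domain.HUMANITIES, "explanatory_synthesis"): "claude-4-opus",
}


def route_query(domain: Domain, intent: str, default: str = "gemini-2.5-pro") -> str:
    """Pick the research agent best suited to a query's domain and intent."""
    return ROUTING_TABLE.get((domain, intent), default)


# Example: a time-sensitive policy question goes to the model with the freshest retrieval.
print(route_query(Domain.ECON_POLICY, "time_sensitive"))
```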


You can check out the deep research leaderboard here. We plan to update it regularly to reflect major releases and model updates. If you're interested in implementing research-grade evaluations for your AI models and projects, contact us to explore how Labelbox can help you run a powerful evaluation of your models.