Labelbox • July 21, 2025
Benchmarking deep research agents

Our latest benchmark takes an in-depth look at recent advances in research-grade AI agents
In this post, we introduce Labelbox's agentic leaderboard for deep research: an open, continuously updated scorecard showing how the leading research-grade agents from Google, OpenAI, and Anthropic perform on long-form, research-driven questions and tasks.
Most public leaderboards reward eloquent answers to short factual prompts. In real-world settings, however, enterprises, academics, and analysts want to assess the following (a rough scoring sketch appears after the list):
- Depth: Can the agent reliably trace an argument across 20k tokens of technical text?
- Evidence: Does the agent cite peer‑reviewed literature (and not random blogs) in the correct style?
- Taste: Can the agent reliably summarize nuance, note counter‑arguments, and know when to say “the evidence is inconclusive”?
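To make these dimensions concrete, here is a minimal sketch of how a grading rubric along these lines could be encoded. The 0–5 scale, the weights, and the field names are illustrative assumptions for this sketch, not the exact rubric behind the leaderboard.

```python
from dataclasses import dataclass

# Illustrative rubric: dimension names follow the list above; the 0-5 scale
# and the weights are assumptions for this sketch, not our production rubric.
@dataclass
class RubricScore:
    depth: int      # 0-5: traces arguments across long technical contexts
    evidence: int   # 0-5: cites peer-reviewed sources, in the correct style
    taste: int      # 0-5: captures nuance, counter-arguments, and uncertainty

    def weighted_total(self, w_depth=0.4, w_evidence=0.4, w_taste=0.2) -> float:
        """Collapse the three dimensions into a single 0-5 score."""
        return (w_depth * self.depth
                + w_evidence * self.evidence
                + w_taste * self.taste)

# Example: a response that is deep and well-cited but glosses over nuance.
print(RubricScore(depth=5, evidence=4, taste=3).weighted_total())  # ~4.2
```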
How we built the test set
Headline results
What the numbers tell us
- Gemini 2.5 Pro currently leads, especially on long‑context STEM reasoning (e.g., deriving scaling laws from 30‑page fluid‑dynamics pre‑prints) and on weaving EU policy documents into economic critiques.
- OpenAI's o4-mini delivers tight mathematical proofs and elegant prose but loses points for sporadic citation lapses and a weaker showing on historical nuance.
- Claude 4 Opus shines at explanatory clarity yet trails on cutting‑edge source retrieval, particularly in fast‑moving policy debates.
We conducted a rigorous evaluation using actual PhD-level research as our benchmark, comparing Google's Deep Research product against OpenAI's and Anthropic's research capabilities. Rather than relying on synthetic tests, we used real academic research with established ground truth and optimal research methodologies as our comparison standard. Google's Deep Research outperformed both competitors across every major metric, including research quality, source integration accuracy, methodological rigor, and output reliability.
This outcome mirrors our previous Search Agent leaderboard findings and highlights Google's core advantage: decades of expertise in information retrieval and web search. That deep understanding of how to find, evaluate, and synthesize information at scale translated directly into superior research capabilities. It demonstrates that domain expertise in search and knowledge organization provides a measurable edge in AI-powered research tools, even when tested against the highest academic standards.
What each research agent does well, and where it can do better
STEM subrankings
Gemini emerges as the strongest STEM researcher here, largely because of its tight hooks into academic databases, which let it surface brand-new studies and well-cited conference papers faster than its peers. GPT‑4.1 follows closely, often providing the most rigorous step‑by‑step derivations and mathematical reasoning, though it can trail when the prompt hinges on very recent literature. Claude distinguishes itself with exceptionally clear explanations, ideal for audiences who need concepts unpacked, yet it still misses the occasional cutting‑edge citation, which ultimately nudges it behind the other two.
Economics, politics and global affairs subranking
When it comes to politics and economics-related tasks, the advantage shifts toward tools that ingest real‑time news feeds: Gemini usually cites fresher legislative updates and market data, giving its answers a timeliness edge. GPT‑4.1 shines when prompts demand comparative or multi‑actor analysis, piecing together nuanced arguments from a broad document set, though it may reference filings a day or two out of date. Claude offers thoughtful normative commentary and clear policy framing, yet its supporting references are somewhat less diverse, and it occasionally misses subtle date discrepancies, keeping it just shy of the front‑runners.
Humanities & arts subranking
Across humanities prompts (history, cultural analysis, literature, and so on), Gemini maintains a narrow lead by blending multilingual source gathering with concise synthesis. GPT‑4.1 is the model most adept at weaving together multiple perspectives and theoretical frameworks, but it sometimes leans too heavily on secondary commentary where primary texts would be stronger. Claude's prose is the most elegant and contextual, making its answers highly readable; however, its source list tends to be shorter and skews toward mid‑tier journals, which slightly diminishes evidentiary depth.
Key insights
Complex questions reveal true capabilities
Simple factual queries make all three models look impressive, but enterprise-grade challenges expose their distinct strengths and weaknesses. We found the largest performance gaps emerged from:
- Time-sensitive queries requiring fresh information
- Multi-perspective analysis of contested policy issues
- Adversarial prompts designed to test robustness
Citation reliability remains a universal challenge
Despite improvements in retrieval quality, none of the models can guarantee accurate attribution. Common issues include:
- Broken or outdated links
- Mismatched passages and sources
- Fabricated citations that look legitimate
Recommendation: Treat independent citation validation as essential infrastructure, not an optional add-on. Any production deployment should include automated fact-checking or human oversight for source verification.
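As a rough illustration of what that infrastructure can look like, here is a minimal sketch of an automated citation check. It only verifies that a cited URL resolves and that a quoted passage actually appears on the page; the function name and report fields are assumptions for this sketch, not part of our evaluation pipeline.

```python
import requests

def validate_citation(url: str, quoted_passage: str, timeout: int = 10) -> dict:
    """Flag the failure modes above: broken links, mismatched passages,
    and citations that do not resolve to any real page."""
    result = {"url": url, "reachable": False, "passage_found": False}
    try:
        resp = requests.get(url, timeout=timeout)
        result["reachable"] = resp.status_code == 200
        if result["reachable"]:
            # Crude containment check; a production system would strip HTML,
            # normalize whitespace, and use fuzzy matching instead.
            result["passage_found"] = quoted_passage.lower() in resp.text.lower()
    except requests.RequestException:
        pass  # Unreachable hosts count as broken links.
    return result

# Example: route any citation that fails either check to human review.
report = validate_citation("https://example.com/paper", "scaling laws")
if not (report["reachable"] and report["passage_found"]):
    print("Needs human verification:", report)
```

In practice, this kind of check is cheap to run over every citation an agent produces, and anything it flags can be escalated to a human reviewer.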
Model specialization creates portfolio opportunities
Rather than seeking a single "best" model, our data suggests clear domain-specific advantages:
- Gemini 2.5 Pro: Excels at cutting-edge scientific content and real-time information synthesis
- OpenAI o4-mini: Delivers superior cross-source analysis and quantitative reasoning
- Claude 4 Opus: Provides exceptional narrative clarity and explanatory depth
Strategic implication: Organizations should consider a portfolio approach, routing different query types to the model best suited for that domain and urgency level.
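Here is a minimal sketch of what that routing could look like, assuming a simple upstream step tags each query with a domain and urgency level. The routing table just mirrors the domain strengths described above; the model identifiers, labels, and fallback choice are placeholders, not a recommended production configuration.

```python
# Hypothetical routing table based on the domain strengths observed above.
# The (domain, urgency) keys and the fallback choice are assumptions.
ROUTING_TABLE = {
    ("stem", "fresh"): "gemini-2.5-pro",          # cutting-edge scientific content
    ("stem", "standard"): "o4-mini",              # quantitative reasoning and proofs
    ("policy", "fresh"): "gemini-2.5-pro",        # real-time news, filings, market data
    ("policy", "standard"): "o4-mini",            # cross-source, multi-actor analysis
    ("humanities", "standard"): "claude-4-opus",  # narrative clarity and explanation
}

def route(domain: str, urgency: str = "standard") -> str:
    """Pick a research agent for a query; fall back to the overall leader."""
    return ROUTING_TABLE.get((domain, urgency), "gemini-2.5-pro")

print(route("policy", "fresh"))  # gemini-2.5-pro
print(route("humanities"))       # claude-4-opus
```

The routing signal could come from a lightweight classifier or from metadata already attached to the request; the key point is that the mapping stays explicit and easy to update as leaderboard results shift.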
You can check out the deep research leaderboard here. We plan to update it regularly to reflect major releases and model updates. If you're interested in implementing research-grade evaluations for your AI models and projects, contact us to explore how Labelbox can help you run rigorous evaluations of your models.