Arjun Nargolwala•June 13, 2025
Benchmarking agentic search

Headline results
Composite = weighted mean across source quality, relevance, recency, and multi-language understanding
Research purpose
When enterprises deploy a search-augmented LLM, they expect more than eloquent responses; they need fast, up-to-date answers that cite trustworthy sources across diverse domains of knowledge. Public leaderboards rarely measure that blend of skills in real-world conditions, so the Labelbox research team ran its own comprehensive study. Our target lineup featured three frontier-grade systems with native web search:
- Google Gemini 2.5 Pro (grounded search)
- OpenAI GPT-4.1 (search preview)
- Anthropic Claude 4.0 Opus (web search tool)
See the agentic search leaderboards here.
Building a real-world test set
We constructed 200 challenging questions designed to expose the cracks in modern retrieval across the full spectrum of knowledge domains. Rather than limiting ourselves to simple factual queries, we deliberately crafted a diverse dataset that reflects the complexity of real enterprise search needs.
The search-type query framework
Our methodology centered on what we call "search-type" queries, which had to satisfy two criteria:
- Knowledge gap verification: A super-majority (>66.6%) of standard LLMs without search capabilities failed to answer these prompts accurately, ensuring we were testing genuine retrieval needs rather than memorized knowledge; a minimal sketch of this filter follows the list below.
- Search-engine orientation: The queries mirror the types of information-seeking behavior users exhibit when interacting with search engines or formulating advanced search queries, as opposed to task-oriented prompts like "write a summary" or "create a presentation."
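To make the first criterion concrete, here is a minimal sketch of the kind of filter it implies: a candidate question survives only if more than 66.6% of baseline, no-search models miss it. The `Candidate` structure, the baseline model list, and the `answers_correctly` grader are hypothetical stand-ins for illustration, not the actual pipeline.

```python
# Hypothetical sketch of the knowledge-gap filter: keep a candidate question
# only if more than 66.6% of baseline (no-search) LLMs fail to answer it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    question: str
    reference_answer: str

def passes_knowledge_gap_check(
    candidate: Candidate,
    baseline_models: List[str],
    answers_correctly: Callable[[str, Candidate], bool],  # grader stub (assumed)
    failure_threshold: float = 2 / 3,
) -> bool:
    """True if a super-majority of no-search baselines miss the question."""
    failures = sum(
        1 for model in baseline_models
        if not answers_correctly(model, candidate)
    )
    return failures / len(baseline_models) > failure_threshold
```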
Question categories and distribution
Our 200-question dataset spanned multiple domains and complexity levels (restated as a sampling plan in the sketch after the lists below):
Domain distribution:
- 25% STEM questions — Complex scientific, technical, engineering, and mathematical queries requiring current research findings and precise technical accuracy
- 25% Recent news & current events — Time-sensitive questions about political developments, market changes, and breaking news
- 20% Historical & archival information — Questions requiring synthesis of historical data, archival research, and temporal context
- 15% Faulty & adversarial prompts — Intentionally misleading premises, ambiguous wording, and queries designed to test robustness
- 10% Multi-language context — Questions requiring understanding of non-English sources or cross-cultural perspectives
- 5% Specialized domain knowledge — Niche topics in law, medicine, finance, and other expert domains
Complexity breakdown:
- 50% time-sensitive — "What are the latest amendments to the EU AI Act?" or "How have semiconductor export restrictions evolved in Q1 2025?"
- 30% multi-perspective or comparative — "Contrast Japan's and Germany's chip-subsidy strategies across economic and geopolitical dimensions"
- 20% intentionally challenging — ambiguous wording, niche scientific topics, misleading premises, or queries requiring synthesis across contradictory sources
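For reference, the same quotas can be written as a simple sampling plan. The dictionary below is an illustrative encoding of the percentages listed above (the counts follow from 200 questions; the label names are ours, not dataset fields).

```python
# Target composition of the 200-question set, derived from the percentages
# above. Label names are illustrative, not actual dataset fields.
DOMAIN_QUOTAS = {
    "stem": 50,                       # 25%
    "recent_news": 50,                # 25%
    "historical_archival": 40,        # 20%
    "faulty_adversarial": 30,         # 15%
    "multi_language": 20,             # 10%
    "specialized_domain": 10,         # 5%
}

COMPLEXITY_QUOTAS = {
    "time_sensitive": 100,            # 50%
    "multi_perspective": 60,          # 30%
    "intentionally_challenging": 40,  # 20%
}

assert sum(DOMAIN_QUOTAS.values()) == 200
assert sum(COMPLEXITY_QUOTAS.values()) == 200
```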
Rigorous evaluation methodology
Each query was processed by all three agents with their search features fully enabled. We employed expert human aligners throughout both the creation and evaluation phases, implementing a multi-stage quality control process:
- Prompt design phase: Expert aligners crafted questions based on real enterprise search patterns and current information gaps
- Data control & filtering: Systematic review to ensure query quality, eliminate bias, and verify the "search-type" criteria
- Evaluation phase: The same expert aligners rated responses on our composite ten-point scale
Our evaluation framework weights four critical dimensions, combined into the composite score as sketched after this list:
- Source quality & trustworthiness (30%) — Credibility and authority of cited sources
- Answer relevance & completeness (25%) — Direct response to the query with comprehensive coverage
- Information recency & accuracy (25%) — Up-to-date information with factual precision
- Multi-language & cross-cultural understanding (20%) — Ability to synthesize information across linguistic and cultural boundaries
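Concretely, the composite reported in the headline results is a weighted mean of the four dimension ratings on the ten-point scale. The sketch below shows that arithmetic; the example ratings are invented for illustration.

```python
# Composite score as a weighted mean of the four dimension ratings (0-10).
# Weights mirror the list above; the example ratings are invented.
WEIGHTS = {
    "source_quality": 0.30,
    "relevance_completeness": 0.25,
    "recency_accuracy": 0.25,
    "multi_language": 0.20,
}

def composite_score(ratings: dict) -> float:
    """Weighted mean of per-dimension ratings, each on a ten-point scale."""
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)

example = {
    "source_quality": 8.0,
    "relevance_completeness": 9.0,
    "recency_accuracy": 7.0,
    "multi_language": 6.0,
}
print(round(composite_score(example), 2))  # 7.6
```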
Where each model shines—and stumbles
Recency & real-time information
Gemini demonstrated superior performance in accessing the most current information, almost always linking to articles published within 24 hours of the query. GPT-4.1 and Claude occasionally served day-old or week-old sources, which is particularly problematic for fast-moving news and market information.
STEM & technical accuracy
All three models showed strong performance on technical questions, but Gemini's integration with Google Scholar and academic databases gave it an edge in finding authoritative scientific sources. Claude performed best at explaining complex technical concepts clearly, while GPT-4.1 excelled at mathematical reasoning within STEM contexts.
Multi-perspective policy analysis
GPT-4.1 demonstrated the strongest capability for synthesizing diverse viewpoints on complex policy questions, though Gemini maintained an edge in completeness and source diversity. Claude showed good analytical depth but sometimes struggled with balancing competing perspectives.
Citation reliability & source attribution
This remains a critical weakness across all models. Even Gemini, the top performer, skipped or mangled citations approximately 12% of the time, a rate that presents significant compliance risks for enterprise deployments. GPT-4.1 showed similar citation gaps, while Claude had the highest rate of incomplete or inaccurate source attribution at nearly 18%.
Ambiguity & adversarial robustness
Gemini handled ambiguous queries most gracefully, often asking clarifying questions or providing structured responses that acknowledged uncertainty. GPT-4.1 occasionally hallucinated assumptions when faced with unclear prompts. Claude had the highest rate of choosing incorrect interpretations of ambiguous queries, though it provided more detailed reasoning for its choices.
Multi-language & cross-cultural context
Gemini's multilingual capabilities proved strongest, effectively synthesizing information from non-English sources and maintaining cultural context. All models showed limitations when queries required deep understanding of non-Western perspectives or specialized regional knowledge.
Domain-specific performance breakdown
STEM questions (25% of dataset)
Recent news & current events (25% of dataset)
Historical & archival information (20% of dataset)
Faulty & adversarial prompts (15% of dataset)
Multi-language context (10% of dataset)
Specialized domain knowledge (5% of dataset)
Key insights
1. Question type dramatically affects performance gaps
Using only straightforward factual Q&A would obscure meaningful differences between models. The performance gaps become most apparent in time-sensitive queries, multi-perspective analysis, and adversarial scenarios: precisely the areas where enterprise users need the most reliability.
2. Citation accuracy remains an unsolved challenge
Despite advances in retrieval capabilities, source attribution errors occur frequently enough across all models to require systematic verification processes. Organizations deploying these systems must implement robust citation checking and source validation workflows.
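There is no single fix, but even a lightweight automated pass catches many broken attributions before a human reviewer sees them. The sketch below shows one possible shape for such a check using only the Python standard library: it flags citations whose URL does not resolve or whose host is not on an organization-maintained allowlist. The allowlist contents are placeholders, and a real workflow would also verify that the cited page actually supports the claim.

```python
# Minimal, illustrative citation check (not Labelbox's pipeline): flag cited
# URLs that fail to resolve or come from hosts outside an approved allowlist.
import urllib.request
from urllib.parse import urlparse

TRUSTED_HOSTS = {"reuters.com", "nature.com", "europa.eu"}  # placeholder allowlist

def citation_issues(cited_urls, timeout=10.0):
    """Return a list of human-readable problems in a response's citations."""
    problems = []
    for url in cited_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host not in TRUSTED_HOSTS:
            problems.append(f"untrusted source: {url}")
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                pass  # reachable; content-level verification would go here
        except Exception as exc:
            problems.append(f"failed to fetch {url}: {exc}")
    return problems
```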
3. Domain specialization varies significantly
Each model showed distinct strengths: Gemini for recency and scientific sources, GPT-4.1 for synthesis and reasoning, Claude for explanation clarity. Enterprise deployment strategies should consider these specializations when designing multi-model approaches.
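One way to act on these specializations is a thin routing layer that sends each query to whichever model scored best on that category in your own evaluation. The sketch below is a hypothetical illustration of the pattern, not a recommended routing; the model identifiers and the `call_model` client are placeholders.

```python
# Hypothetical category-based router: dispatch each query to the model that
# performed best on that category in your own evaluation. Routes and model
# identifiers here are illustrative placeholders.
ROUTES = {
    "recent_news": "gemini-2.5-pro",
    "stem_sources": "gemini-2.5-pro",
    "multi_perspective_policy": "gpt-4.1",
    "technical_explanation": "claude-4-opus",
}
DEFAULT_MODEL = "gpt-4.1"

def route_query(category: str) -> str:
    """Pick a model for a query category, falling back to a default."""
    return ROUTES.get(category, DEFAULT_MODEL)

def answer(query: str, category: str, call_model) -> str:
    """call_model(model_name, query) is a placeholder for your client code."""
    return call_model(route_query(category), query)
```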
You can check out the agentic search leaderboards here. If you're interested in implementing robust search-agent evaluations for your AI models and projects, contact us to explore how Labelbox can help you run a rigorous evaluation of your models.