
Arjun Nargolwala · June 13, 2025

Benchmarking agentic search

Headline results

Model | Composite Score*
1. Gemini 2.5 Pro (search) | 8.0 / 10
2. GPT-4.1 (search) | 7.25 / 10
3. Claude 4.0 Opus (search) | 7.0 / 10

*Composite = weighted mean across source quality, relevance, recency, and multi-language understanding

Research purpose

When enterprises deploy a search-augmented LLM, they expect more than eloquent responses; they need fast, up-to-date answers that cite trustworthy sources across diverse domains of knowledge. Public leaderboards rarely measure that blend of skills in real-world conditions, so the Labelbox research team ran its own comprehensive study. Our target lineup featured three frontier-grade systems with native web search:

  • Google Gemini 2.5 Pro (grounded search)
  • OpenAI GPT-4.1 (search preview)
  • Anthropic Claude 4.0 Opus (web search tool)

See the agentic search leaderboards here.

Building a real-world test set 

We constructed 200 challenging questions designed to expose the cracks in modern retrieval across the full spectrum of knowledge domains. Rather than limiting ourselves to simple factual queries, we deliberately crafted a diverse dataset that reflects the complexity of real enterprise search needs.

The search-type query framework

Our methodology centered on what we call "search-type" queries, defined by two criteria (a verification sketch in code follows the list):

  1. Knowledge gap verification: A super-majority (>66.6%) of standard LLMs without search capabilities failed to answer these prompts accurately, ensuring we were testing genuine retrieval needs rather than memorized knowledge.
  2. Search-engine orientation: The queries mirror the types of information-seeking behavior users exhibit when interacting with search engines or formulating advanced search queries, as opposed to task-oriented prompts like "write a summary" or "create a presentation."
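
To make the first criterion concrete, here is a minimal sketch of how such a knowledge-gap filter could work. The grading structure and example data are illustrative assumptions; the post does not specify which no-search baselines were used or how their answers were graded.

```python
# Sketch of the knowledge-gap filter described above (illustrative only).
# baseline_results maps each candidate prompt to a list of booleans indicating
# whether each no-search baseline LLM answered it correctly.

SUPERMAJORITY = 2 / 3  # ">66.6%" of baselines must fail for a prompt to qualify

def is_search_type(baseline_correct: list[bool]) -> bool:
    """Return True if a super-majority of no-search baselines got the answer wrong."""
    if not baseline_correct:
        return False
    failure_rate = sum(not ok for ok in baseline_correct) / len(baseline_correct)
    return failure_rate > SUPERMAJORITY

# Example: 5 of 6 baselines failed -> failure rate ~0.83 > 0.67, so the prompt qualifies.
baseline_results = {
    "What are the latest amendments to the EU AI Act?": [False, False, True, False, False, False],
}
search_type_prompts = [p for p, marks in baseline_results.items() if is_search_type(marks)]
```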

Question categories and distribution

Our 200-question dataset spanned multiple domains and complexity levels (summarized in code after the breakdowns below):

Domain distribution:

  • 25% STEM questions — Complex scientific, technical, engineering, and mathematical queries requiring current research findings and precise technical accuracy
  • 25% Recent news & current events — Time-sensitive questions about political developments, market changes, and breaking news
  • 20% Historical & archival information — Questions requiring synthesis of historical data, archival research, and temporal context
  • 15% Faulty & adversarial prompts — Intentionally misleading premises, ambiguous wording, and queries designed to test robustness
  • 10% Multi-language context — Questions requiring understanding of non-English sources or cross-cultural perspectives
  • 5% Specialized domain knowledge — Niche topics in law, medicine, finance, and other expert domains

Complexity breakdown:

  • 50% time-sensitive — "What are the latest amendments to the EU AI Act?" or "How have semiconductor export restrictions evolved in Q1 2025?"
  • 30% multi-perspective or comparative — "Contrast Japan's and Germany's chip-subsidy strategies across economic and geopolitical dimensions"
  • 20% intentionally challenging — ambiguous wording, niche scientific topics, misleading premises, or queries requiring synthesis across contradictory sources
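
For reference, the published composition can be restated as a small configuration and sanity-checked. The counts below are simply the stated percentages applied to the 200 questions, not additional data from the study.

```python
# The published dataset composition, restated as configuration (percentages from the post).
TOTAL_QUESTIONS = 200

domain_mix = {
    "STEM": 0.25,
    "Recent news & current events": 0.25,
    "Historical & archival information": 0.20,
    "Faulty & adversarial prompts": 0.15,
    "Multi-language context": 0.10,
    "Specialized domain knowledge": 0.05,
}

complexity_mix = {
    "time-sensitive": 0.50,
    "multi-perspective or comparative": 0.30,
    "intentionally challenging": 0.20,
}

# Each breakdown is an independent cut of the same 200 questions and should sum to 100%.
assert abs(sum(domain_mix.values()) - 1.0) < 1e-9
assert abs(sum(complexity_mix.values()) - 1.0) < 1e-9

domain_counts = {name: round(share * TOTAL_QUESTIONS) for name, share in domain_mix.items()}
# e.g. {"STEM": 50, "Recent news & current events": 50, ..., "Specialized domain knowledge": 10}
```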

Rigorous evaluation methodology

Each query was processed by all three agents with their search features fully enabled. We employed expert human aligners throughout both the creation and evaluation phases, implementing a multi-stage quality control process:

  1. Prompt design phase: Expert aligners crafted questions based on real enterprise search patterns and current information gaps
  2. Data control & filtering: Systematic review to ensure query quality, eliminate bias, and verify the "search-type" criteria
  3. Evaluation phase: The same expert aligners rated responses on our composite ten-point scale

Our evaluation framework weights four critical dimensions; a scoring sketch follows the list:

  • Source quality & trustworthiness (30%) — Credibility and authority of cited sources
  • Answer relevance & completeness (25%) — Direct response to the query with comprehensive coverage
  • Information recency & accuracy (25%) — Up-to-date information with factual precision
  • Multi-language & cross-cultural understanding (20%) — Ability to synthesize information across linguistic and cultural boundaries
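
As a worked example, the composite can be reproduced as a weighted mean over the four dimensions. Only the weights come from the rubric above; the per-dimension ratings in the example are placeholders, not numbers reported in the study.

```python
# Weighted composite over the four evaluation dimensions (weights from the rubric above).
WEIGHTS = {
    "source_quality": 0.30,
    "relevance_completeness": 0.25,
    "recency_accuracy": 0.25,
    "multilingual_understanding": 0.20,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each on a 0-10 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)

# Placeholder per-dimension ratings (NOT the study's published numbers):
example = {
    "source_quality": 8.5,
    "relevance_completeness": 8.0,
    "recency_accuracy": 8.0,
    "multilingual_understanding": 7.0,
}
print(round(composite_score(example), 2))  # 7.95
```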

Where each model shines—and stumbles

Recency & real-time information

Gemini demonstrated superior performance in accessing the most current information, almost always linking to articles published within 24 hours of the query. GPT-4.1 and Claude occasionally served day-old or week-old sources, particularly problematic for fast-moving news and market information.

STEM & technical accuracy

All three models showed strong performance on technical questions, but Gemini's integration with Google Scholar and academic databases gave it an edge in finding authoritative scientific sources. Claude performed best at explaining complex technical concepts clearly, while GPT-4.1 excelled at mathematical reasoning within STEM contexts.

Multi-perspective policy analysis

GPT-4.1 demonstrated the strongest capability for synthesizing diverse viewpoints on complex policy questions, though Gemini maintained an edge in completeness and source diversity. Claude showed good analytical depth but sometimes struggled with balancing competing perspectives.

Citation reliability & source attribution

This remains a critical weakness across all models. Even Gemini, the top performer, skipped or mangled citations approximately 12% of the time, a rate that presents significant compliance risks for enterprise deployments. GPT-4.1 showed similar citation gaps, while Claude had the highest rate of incomplete or inaccurate source attribution at nearly 18%.

Ambiguity & adversarial robustness

Gemini handled ambiguous queries most gracefully, often asking clarifying questions or providing structured responses that acknowledged uncertainty. GPT-4.1 occasionally hallucinated assumptions when faced with unclear prompts. Claude had the highest rate of choosing incorrect interpretations of ambiguous queries, though it provided more detailed reasoning for its choices.

Multi-language & cross-cultural context

Gemini's multilingual capabilities proved strongest, effectively synthesizing information from non-English sources and maintaining cultural context. All models showed limitations when queries required deep understanding of non-Western perspectives or specialized regional knowledge.

Domain-specific performance breakdown

STEM questions (25% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 7.8 / 10 | Google Scholar integration, academic source access | Occasional oversimplification of complex concepts
2. GPT-4.1 | 8.2 / 10 | Strong mathematical reasoning, technical synthesis | Limited access to latest research papers
3. Claude 4.0 Opus | 7.2 / 10 | Excellent concept explanation clarity | Weaker on cutting-edge research findings

Recent news & current events (25% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 8.9 / 10 | Same-day article access, breaking news coverage | Occasional information overload
2. GPT-4.1 | 6.8 / 10 | Good synthesis of multiple news sources | 1-2 day lag on latest developments
3. Claude 4.0 Opus | 6.4 / 10 | Thoughtful analysis of news implications | Frequently relies on day-old sources

Historical & archival information (20% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 7.7 / 10 | Access to digitized archives, comprehensive sourcing | Sometimes prioritizes quantity over quality
2. GPT-4.1 | 7.4 / 10 | Excellent historical context and chronology | Limited access to specialized archives
3. Claude 4.0 Opus | 7.1 / 10 | Strong analytical narrative construction | Gaps in accessing primary sources

Faulty & adversarial prompts (15% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 8.3 / 10 | Clarifying questions, uncertainty acknowledgment | Occasionally over-cautious responses
2. GPT-4.1 | 6.2 / 10 | Reasonable assumption handling when unclear | Tendency to hallucinate missing context
3. Claude 4.0 Opus | 5.8 / 10 | Detailed reasoning for interpretation choices | Highest rate of incorrect assumption selection

Multi-language context (10% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 8.2 / 10 | Strong multilingual synthesis, cultural context | Occasional translation nuance losses
2. GPT-4.1 | 6.8 / 10 | Decent cross-language source integration | Weaker on non-Western cultural perspectives
3. Claude 4.0 Opus | 6.9 / 10 | Good explanation of cultural differences | Limited access to non-English sources

Specialized domain knowledge (5% of dataset)

Model | Score | Key Strengths | Notable Weaknesses
1. Gemini 2.5 Pro | 7.4 / 10 | Professional database access, regulatory sources | Complex jargon sometimes unclear
2. GPT-4.1 | 7.1 / 10 | Strong analytical synthesis of expert sources | Limited real-time regulatory updates
3. Claude 4.0 Opus | 6.8 / 10 | Clear explanation of specialized concepts | Gaps in latest professional standards

Key insights

1. Question type dramatically affects performance gaps

Using only straightforward factual Q&A would obscure meaningful differences between models. The performance gaps become most apparent in time-sensitive queries, multi-perspective analysis, and adversarial scenarios, precisely the areas where enterprise users need the most reliability.

2. Citation accuracy remains an unsolved challenge

Despite advances in retrieval capabilities, source attribution errors occur frequently enough across all models to require systematic verification processes. Organizations deploying these systems must implement robust citation checking and source validation workflows.
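
As a starting point, a verification workflow can at least confirm that every cited URL in a response resolves and comes from an allow-listed set of trusted domains. This is a minimal sketch under those assumptions, not the workflow used in the study; the allow-list and the URL-extraction regex are illustrative.

```python
import re
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Illustrative allow-list; a real deployment would maintain its own trusted-domain policy.
TRUSTED_DOMAINS = {"reuters.com", "nature.com", "europa.eu", "sec.gov"}
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def check_citations(response_text: str, timeout: float = 5.0) -> list[dict]:
    """Flag cited URLs that fail to resolve or fall outside the trusted-domain allow-list."""
    findings = []
    for url in URL_PATTERN.findall(response_text):
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        trusted = any(domain == d or domain.endswith("." + d) for d in TRUSTED_DOMAINS)
        try:
            # HEAD request: 4xx/5xx responses raise, which we treat as "does not resolve".
            status = urlopen(Request(url, method="HEAD"), timeout=timeout).status
            resolves = status < 400
        except Exception:
            resolves = False
        findings.append({"url": url, "trusted_domain": trusted, "resolves": resolves})
    return findings
```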

3. Domain specialization varies significantly

Each model showed distinct strengths: Gemini for recency and scientific sources, GPT-4.1 for synthesis and reasoning, Claude for explanation clarity. Enterprise deployment strategies should consider these specializations when designing multi-model approaches.
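
One way to act on these specializations is a simple category-based router that sends each query to the model that scored highest on that domain in this benchmark. The routing table below copies the per-domain scores reported above; the upstream query classifier is assumed to exist and is not specified here.

```python
# Per-domain scores from the tables above (out of 10).
DOMAIN_SCORES = {
    "STEM": {"Gemini 2.5 Pro": 7.8, "GPT-4.1": 8.2, "Claude 4.0 Opus": 7.2},
    "Recent news & current events": {"Gemini 2.5 Pro": 8.9, "GPT-4.1": 6.8, "Claude 4.0 Opus": 6.4},
    "Historical & archival information": {"Gemini 2.5 Pro": 7.7, "GPT-4.1": 7.4, "Claude 4.0 Opus": 7.1},
    "Faulty & adversarial prompts": {"Gemini 2.5 Pro": 8.3, "GPT-4.1": 6.2, "Claude 4.0 Opus": 5.8},
    "Multi-language context": {"Gemini 2.5 Pro": 8.2, "GPT-4.1": 6.8, "Claude 4.0 Opus": 6.9},
    "Specialized domain knowledge": {"Gemini 2.5 Pro": 7.4, "GPT-4.1": 7.1, "Claude 4.0 Opus": 6.8},
}

def route_query(category: str) -> str:
    """Pick the model with the best benchmark score for the query's category."""
    scores = DOMAIN_SCORES[category]
    return max(scores, key=scores.get)

# classify_query(...) is a hypothetical upstream step (e.g. a lightweight classifier or rules).
# category = classify_query("How have semiconductor export restrictions evolved in Q1 2025?")
print(route_query("Recent news & current events"))  # Gemini 2.5 Pro
```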

You can check out the agentic search leaderboards here. If you're interested in implementing robust search-agent evaluations for your AI models and projects, contact us to explore how Labelbox can help you run a rigorous evaluation of your models.