Arjun Nargolwala•June 13, 2025
Benchmarking agentic search

Headline results
Composite = weighted mean across source quality, relevance, recency, and multi-language understanding
Research purpose
When enterprises deploy a search-augmented LLM, they expect more than eloquent responses; they need fast, up-to-date answers that cite trustworthy sources across diverse domains of knowledge. Public leaderboards rarely measure that blend of skills in real-world conditions, so the Labelbox research team ran its own comprehensive study. Our target lineup featured three frontier-grade systems with native web search:
- Google Gemini 2.5 Pro (grounded search)
- OpenAI GPT-4.1 (search preview)
- Anthropic Claude 4.0 Opus (web search tool)
See the agentic search leaderboards here.
Building a real-world test set
We constructed 200 challenging questions designed to expose the cracks in modern retrieval across the full spectrum of knowledge domains. Rather than limiting ourselves to simple factual queries, we deliberately crafted a diverse dataset that reflects the complexity of real enterprise search needs.
The search-type query framework
Our methodology centered on what we call "search-type" queries, which had to satisfy two criteria:
- Knowledge gap verification: A super-majority (>66.6%) of standard LLMs without search capabilities failed to answer these prompts accurately, ensuring we were testing genuine retrieval needs rather than memorized knowledge; a minimal sketch of this filter follows the list below.
- Search-engine orientation: The queries mirror the types of information-seeking behavior users exhibit when interacting with search engines or formulating advanced search queries, as opposed to task-oriented prompts like "write a summary" or "create a presentation."
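To make the first criterion concrete, here is a minimal sketch of the kind of filter it implies: a candidate question survives only if more than 66.6% of baseline, no-search models miss it. The `Candidate` structure, the baseline model list, and the `answers_correctly` grader are hypothetical stand-ins for illustration, not the actual pipeline.

```python
# Hypothetical sketch of the knowledge-gap filter: keep a candidate question
# only if more than 66.6% of baseline (no-search) LLMs fail to answer it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    question: str
    reference_answer: str

def passes_knowledge_gap_check(
    candidate: Candidate,
    baseline_models: List[str],
    answers_correctly: Callable[[str, Candidate], bool],  # grader stub (assumed)
    failure_threshold: float = 2 / 3,
) -> bool:
    """True if a super-majority of no-search baselines miss the question."""
    failures = sum(
        1 for model in baseline_models
        if not answers_correctly(model, candidate)
    )
    return failures / len(baseline_models) > failure_threshold
```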
Question categories and distribution
Our 200-question dataset spanned multiple domains and complexity levels (restated as a sampling plan in the sketch after the lists below):
Domain distribution:
- 25% STEM questions — Complex scientific, technical, engineering, and mathematical queries requiring current research findings and precise technical accuracy
- 25% Recent news & current events — Time-sensitive questions about political developments, market changes, and breaking news
- 20% Historical & archival information — Questions requiring synthesis of historical data, archival research, and temporal context
- 15% Faulty & adversarial prompts — Intentionally misleading premises, ambiguous wording, and queries designed to test robustness
- 10% Multi-language context — Questions requiring understanding of non-English sources or cross-cultural perspectives
- 5% Specialized domain knowledge — Niche topics in law, medicine, finance, and other expert domains
Complexity breakdown:
- 50% time-sensitive — "What are the latest amendments to the EU AI Act?" or "How have semiconductor export restrictions evolved in Q1 2025?"
- 30% multi-perspective or comparative — "Contrast Japan's and Germany's chip-subsidy strategies across economic and geopolitical dimensions"
- 20% intentionally challenging — ambiguous wording, niche scientific topics, misleading premises, or queries requiring synthesis across contradictory sources
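For reference, the same quotas can be written as a simple sampling plan. The dictionary below is an illustrative encoding of the percentages listed above (the counts follow from 200 questions; the label names are ours, not dataset fields).

```python
# Target composition of the 200-question set, derived from the percentages
# above. Label names are illustrative, not actual dataset fields.
DOMAIN_QUOTAS = {
    "stem": 50,                       # 25%
    "recent_news": 50,                # 25%
    "historical_archival": 40,        # 20%
    "faulty_adversarial": 30,         # 15%
    "multi_language": 20,             # 10%
    "specialized_domain": 10,         # 5%
}

COMPLEXITY_QUOTAS = {
    "time_sensitive": 100,            # 50%
    "multi_perspective": 60,          # 30%
    "intentionally_challenging": 40,  # 20%
}

assert sum(DOMAIN_QUOTAS.values()) == 200
assert sum(COMPLEXITY_QUOTAS.values()) == 200
```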
Rigorous evaluation methodology
Each query was processed by all three agents with their search features fully enabled. We employed expert human aligners throughout both the creation and evaluation phases, implementing a multi-stage quality control process:
- Prompt design phase: Expert aligners crafted questions based on real enterprise search patterns and current information gaps
- Data control & filtering: Systematic review to ensure query quality, eliminate bias, and verify the "search-type" criteria
- Evaluation phase: The same expert aligners rated responses on our composite ten-point scale
Our evaluation framework weights four critical dimensions, combined into the composite score as sketched after this list:
- Source quality & trustworthiness (30%) — Credibility and authority of cited sources
- Answer relevance & completeness (25%) — Direct response to the query with comprehensive coverage
- Information recency & accuracy (25%) — Up-to-date information with factual precision
- Multi-language & cross-cultural understanding (20%) — Ability to synthesize information across linguistic and cultural boundaries
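Concretely, the composite reported in the headline results is a weighted mean of the four dimension ratings on the ten-point scale. The sketch below shows that arithmetic; the example ratings are invented for illustration.

```python
# Composite score as a weighted mean of the four dimension ratings (0-10).
# Weights mirror the list above; the example ratings are invented.
WEIGHTS = {
    "source_quality": 0.30,
    "relevance_completeness": 0.25,
    "recency_accuracy": 0.25,
    "multi_language": 0.20,
}

def composite_score(ratings: dict) -> float:
    """Weighted mean of per-dimension ratings, each on a ten-point scale."""
    return sum(WEIGHTS[dim] * ratings[dim] for dim in WEIGHTS)

example = {
    "source_quality": 8.0,
    "relevance_completeness": 9.0,
    "recency_accuracy": 7.0,
    "multi_language": 6.0,
}
print(round(composite_score(example), 2))  # 7.6
```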
Where each model shines—and stumbles
Recency & real-time information
Gemini demonstrated superior performance in accessing the most current information, almost always linking to articles published within 24 hours of the query. GPT-4.1 and Claude occasionally served day-old or week-old sources, which is particularly problematic for fast-moving news and market information.
STEM & technical accuracy
All three models showed strong performance on technical questions, but Gemini's integration with Google Scholar and academic databases gave it an edge in finding authoritative scientific sources. Claude performed best at explaining complex technical concepts clearly, while GPT-4.1 excelled at mathematical reasoning within STEM contexts.
Multi-perspective policy analysis
GPT-4.1 demonstrated the strongest capability for synthesizing diverse viewpoints on complex policy questions, though Gemini maintained an edge in completeness and source diversity. Claude showed good analytical depth but sometimes struggled with balancing competing perspectives.
Citation reliability & source attribution
This remains a critical weakness across all models. Even Gemini, the top performer, skipped or mangled citations approximately 12% of the time, a rate that presents significant compliance risks for enterprise deployments. GPT-4.1 showed similar citation gaps, while Claude had the highest rate of incomplete or inaccurate source attribution at nearly 18%.
Ambiguity & adversarial robustness
Gemini handled ambiguous queries most gracefully, often asking clarifying questions or providing structured responses that acknowledged uncertainty. GPT-4.1 occasionally hallucinated assumptions when faced with unclear prompts. Claude had the highest rate of choosing incorrect interpretations of ambiguous queries, though it provided more detailed reasoning for its choices.
Multi-language & cross-cultural context
Gemini's multilingual capabilities proved strongest, effectively synthesizing information from non-English sources and maintaining cultural context. All models showed limitations when queries required deep understanding of non-Western perspectives or specialized regional knowledge.
Domain-specific performance breakdown
STEM questions (25% of dataset)
Recent news & current events (25% of dataset)
Historical & archival information (20% of dataset)
Faulty & adversarial prompts (15% of dataset)
Multi-language context (10% of dataset)
Specialized domain knowledge (5% of dataset)
Key insights
1. Question type dramatically affects performance gaps
Using only straightforward factual Q&A would obscure meaningful differences between models. The performance gaps become most apparent in time-sensitive queries, multi-perspective analysis, and adversarial scenarios: precisely the areas where enterprise users need the most reliability.
2. Citation accuracy remains an unsolved challenge
Despite advances in retrieval capabilities, source attribution errors occur frequently enough across all models to require systematic verification processes. Organizations deploying these systems must implement robust citation checking and source validation workflows.
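There is no single fix, but even a lightweight automated pass catches many broken attributions before a human reviewer sees them. The sketch below shows one possible shape for such a check using only the Python standard library: it flags citations whose URL does not resolve or whose host is not on an organization-maintained allowlist. The allowlist contents are placeholders, and a real workflow would also verify that the cited page actually supports the claim.

```python
# Minimal, illustrative citation check (not Labelbox's pipeline): flag cited
# URLs that fail to resolve or come from hosts outside an approved allowlist.
import urllib.request
from urllib.parse import urlparse

TRUSTED_HOSTS = {"reuters.com", "nature.com", "europa.eu"}  # placeholder allowlist

def citation_issues(cited_urls, timeout=10.0):
    """Return a list of human-readable problems in a response's citations."""
    problems = []
    for url in cited_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host not in TRUSTED_HOSTS:
            problems.append(f"untrusted source: {url}")
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                pass  # reachable; content-level verification would go here
        except Exception as exc:
            problems.append(f"failed to fetch {url}: {exc}")
    return problems
```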
3. Domain specialization varies significantly
Each model showed distinct strengths: Gemini for recency and scientific sources, GPT-4.1 for synthesis and reasoning, Claude for explanation clarity. Enterprise deployment strategies should consider these specializations when designing multi-model approaches.
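One way to act on these specializations is a thin routing layer that sends each query to whichever model scored best on that category in your own evaluation. The sketch below is a hypothetical illustration of the pattern, not a recommended routing; the model identifiers and the `call_model` client are placeholders.

```python
# Hypothetical category-based router: dispatch each query to the model that
# performed best on that category in your own evaluation. Routes and model
# identifiers here are illustrative placeholders.
ROUTES = {
    "recent_news": "gemini-2.5-pro",
    "stem_sources": "gemini-2.5-pro",
    "multi_perspective_policy": "gpt-4.1",
    "technical_explanation": "claude-4-opus",
}
DEFAULT_MODEL = "gpt-4.1"

def route_query(category: str) -> str:
    """Pick a model for a query category, falling back to a default."""
    return ROUTES.get(category, DEFAULT_MODEL)

def answer(query: str, category: str, call_model) -> str:
    """call_model(model_name, query) is a placeholder for your client code."""
    return call_model(route_query(category), query)
```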
You can check out the agentic search leaderboards here. If you're interested in implementing robust search-agent evaluations for your AI models and projects, contact us to explore how Labelbox can help you run a rigorous evaluation of your models.