building a deep research api benchmark from scratch
last week i read a 95-page arxiv survey on deep research systems. it catalogs 80+ implementations, proposes taxonomies, discusses architectural patterns. comprehensive academic work.
one thing it doesn't have: actual comparative data on whether these systems cite real sources.
the paper references benchmarks like HLE and GAIA for general reasoning, but nothing that answers the basic question i care about: when a deep research api gives me a citation, does that url actually exist? does it support the claim?
so i started building one.
the gap
the survey identifies five "easy metrics" for evaluation¹:
- latency — how long does the research take
- cost — what does a query actually cost
- url validity — do the cited urls exist
- citation count — how many sources
- domain diversity — are sources from varied domains or all from one site
and one "hard metric":
- claim-source alignment — does the cited source actually support the claim made
everyone building deep research comparison tools (including me, at research.site) focuses on features, pricing, and subjective "quality" assessments. nobody's systematically checking whether the citations are real.
the system
i built an automated benchmark that runs standardized queries across providers and computes metrics on each response.
the architecture²:
queries → provider adapters → raw responses → metric computation → results
each provider gets an adapter that normalizes the api differences into a common BenchmarkResponse format. openai and gemini require background polling (10-30 min latencies). perplexity and parallel return faster.
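a minimal sketch of what that normalization layer could look like. the field names here are illustrative guesses, not the benchmark's actual schema:

```python
# sketch of the common response shape -- field names are illustrative, not the real schema
from dataclasses import dataclass, field


@dataclass
class Citation:
    url: str
    title: str | None = None


@dataclass
class BenchmarkResponse:
    provider: str                     # e.g. "perplexity-deep"
    query: str
    answer_text: str
    citations: list[Citation] = field(default_factory=list)
    latency_s: float = 0.0            # wall-clock time, including any background polling
    cost_usd: float = 0.0             # derived from the provider's token/request pricing


class ProviderAdapter:
    """each provider implements run() and maps its raw api response into BenchmarkResponse."""

    def run(self, query: str) -> BenchmarkResponse:
        raise NotImplementedError
```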
for url checking, i hit each cited url directly. if it returns 403 (bot blocked), i fall back to tavily's search api to verify the page exists. anything else gets flagged for manual review.
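the check itself is simple. roughly something like this, where `tavily_page_exists` is a stand-in for the search-api fallback (the helper name, user agent, and thresholds are mine, not a real tavily client call):

```python
import requests


def tavily_page_exists(url: str) -> bool:
    """placeholder for the fallback: ask a search api (tavily, in my case) whether the page exists."""
    raise NotImplementedError


def check_url(url: str, timeout: float = 10.0) -> str:
    """returns 'valid' or 'needs_review'; the review queue is resolved by hand."""
    try:
        resp = requests.get(
            url,
            timeout=timeout,
            allow_redirects=True,
            headers={"User-Agent": "deep-research-benchmark/0.1"},
        )
    except requests.RequestException:
        return "needs_review"              # dns failure, timeout, tls error, etc.
    if resp.status_code < 400:
        return "valid"
    if resp.status_code == 403:            # bot-blocked: verify via search fallback instead
        return "valid" if tavily_page_exists(url) else "needs_review"
    return "needs_review"                  # everything else goes to manual review
```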
domain diversity uses shannon entropy — higher score means citations spread across more unique domains rather than clustering on one source.
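a sketch of that computation. i'm assuming the score is normalized by the maximum possible entropy so it lands between 0 and 1, which matches the numbers in the table below:

```python
import math
from collections import Counter
from urllib.parse import urlparse


def domain_diversity(urls: list[str]) -> float:
    """shannon entropy over cited domains, normalized to [0, 1].
    (the normalization is my assumption; raw entropy would also work.)"""
    domains = [urlparse(u).netloc.lower() for u in urls if u]
    if not domains:
        return 0.0
    counts = Counter(domains)
    total = len(domains)
    if len(counts) == 1:
        return 0.0                                     # every citation from one domain
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))
```

a citation list drawn entirely from one domain scores 0.0, which is exactly what happens with gemini's redirect urls further down.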
first real data
ran the benchmark on one query ("what was the total funding raised by openai through 2024, broken down by funding round?") across four providers:
| provider | latency | citations | cost | url validity | diversity |
|---|---|---|---|---|---|
| perplexity-deep | 54s | 58 | $0.97 | 94.8% | 0.87 |
| openai-o4-mini-deep | 4.5min | 12 | $0.11 | 100% | 0.81 |
| gemini-deep | 5.4min | 23 | $0.03 | 17.4% | 0.00 |
| parallel-pro | 14.8min | 22 | $0.10 | 90.9% | 0.94 |
no clear winner. tradeoffs everywhere.
what the data shows
perplexity is fastest and most prolific. 54 seconds, 58 citations. but it's also 10x the cost of competitors. the high citation count might be noise — need the claim alignment metric to know if those citations actually matter.
openai has perfect url validity but few citations. 12 sources, all real. this might be the "quality over quantity" bet — worth investigating whether those 12 citations cover the key facts better than perplexity's 58.
parallel has the highest source diversity. 0.94 entropy score means it's pulling from the widest range of domains. slowest though (14.8 min).
gemini has a problem.
17.4% url validity with 0.0 diversity. what's happening?
gemini's api returns urls like:
vertexaisearch.cloud.google.com/grounding-api-redirect/...
instead of actual source urls. every citation points to a google redirect wrapper, not the real source. from a benchmarking perspective, this makes gemini's citations essentially unverifiable through automated means.³
this might be a limitation of the grounding api, or there might be a way to extract the actual destination urls from the redirect parameters. either way, it's a real finding that affects how you'd evaluate gemini's citation quality programmatically.
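for now the benchmark just needs to avoid counting these as ordinary citations. a trivial guard, with the host string copied from the urls gemini actually returns:

```python
# gemini grounding citations come back as redirect wrappers, not source urls;
# flag them so they're reported as unverifiable rather than silently "valid"
GEMINI_REDIRECT_HOST = "vertexaisearch.cloud.google.com"


def is_gemini_redirect(url: str) -> bool:
    return GEMINI_REDIRECT_HOST in url
```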
what's not built yet
the hard part: claim-source alignment.
the current system tells you whether urls exist. it doesn't tell you whether the content at those urls actually supports the claims being made. that requires:
- extracting specific claims from the response
- fetching the full content of each cited source
- using an llm to verify whether the source supports the claim
- aggregating into supported/not found/contradicted categories
this is where you'd go from "interesting comparison" to "credible benchmark." it's also significantly more complex and expensive to run.
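to make the shape concrete, here's roughly how i expect the pipeline to hang together. every step is a stub because none of it exists yet, and the naive claim-by-source loop would need real claim-to-citation attribution before it's affordable:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    NOT_FOUND = "not_found"
    CONTRADICTED = "contradicted"


@dataclass
class ClaimCheck:
    claim: str
    source_url: str
    verdict: Verdict


def extract_claims(answer_text: str) -> list[str]:
    """step 1: split the response into discrete, checkable factual claims (likely llm-assisted)."""
    raise NotImplementedError


def fetch_source_text(url: str) -> str:
    """step 2: fetch the cited page and strip it down to readable text."""
    raise NotImplementedError


def judge_claim(claim: str, source_text: str) -> Verdict:
    """step 3: ask an llm whether the source supports, omits, or contradicts the claim."""
    raise NotImplementedError


def claim_source_alignment(answer_text: str, citation_urls: list[str]) -> list[ClaimCheck]:
    """step 4: aggregate verdicts. in practice each claim should only be judged
    against the sources actually attached to it, not every citation."""
    checks = []
    for claim in extract_claims(answer_text):
        for url in citation_urls:
            checks.append(ClaimCheck(claim, url, judge_claim(claim, fetch_source_text(url))))
    return checks
```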
other gaps:
- only tested 1 of 10 planned queries
- using json file storage instead of a proper database
- manual runs instead of automated scheduling
- no multi-judge verification for llm-based evaluations
design decisions
why these providers? openai and gemini are the obvious inclusions. perplexity has the most mature api. parallel is the main startup competitor. together they cover the realistic options someone evaluating deep research apis would consider.
why these metrics? the arxiv survey's framework is solid. easy metrics are automatable and objective. the hard metric (claim alignment) is what actually matters but requires more infrastructure.
why start with factual queries? they have verifiable ground truth. "what was openai's funding" has a correct answer. this lets me validate the system before moving to fuzzier categories like comparative analysis or open-ended research.
cost projection for full benchmark: ~$12 per complete run (10 queries × 4 providers). cheap enough to run regularly, expensive enough that i'm not going to spam it during development.
what's next
- fix or document the gemini redirect issue
- run remaining 9 queries to complete the v1 dataset
- build the claim extraction pipeline
- publish methodology and raw data
the goal isn't to declare a "winner" — it's to give people building on these apis actual data about what they're getting. citation count doesn't matter if the citations are wrong. speed doesn't matter if the research is shallow.
if you're evaluating deep research apis, the current answer is "it depends on what you're optimizing for." i want the answer to be "here's the data, decide for yourself."
tracking this at research.site. raw benchmark code is private for now but methodology will be published with v1 results.
Footnotes

1. metrics framework from xu & peng, "a comprehensive survey of deep research" (arxiv:2506.12594), section 4.1. the easy/hard distinction maps to automatable vs. requires-human-judgment.
2. architecture follows the adapter pattern recommended in the survey (section 4.1.1). each provider normalizes to a common BenchmarkResponse with citations, timing, and cost metadata.
3. gemini grounding api documentation doesn't clearly specify whether direct source urls are available. the redirect wrapper may be intentional (for analytics/safety) or an artifact of the api design. investigating.