building a deep research api benchmark from scratch
last week i read a 95-page arxiv survey on deep research systems. it catalogs 80+ implementations, proposes taxonomies, discusses architectural patterns. comprehensive academic work.
one thing it doesn't have: actual comparative data on whether these systems cite real sources.
the paper references benchmarks like HLE and GAIA for general reasoning, but nothing that answers the basic question i care about: when a deep research api gives me a citation, does that url actually exist? does it support the claim?
so i started building one.
the gap
the survey identifies five "easy metrics" for evaluation¹:
- latency — how long does the research take
- cost — what does a query actually cost
- url validity — do the cited urls exist
- citation count — how many sources
- domain diversity — are sources from varied domains or all from one site
and one "hard metric":
- claim-source alignment — does the cited source actually support the claim made
everyone building deep research comparison tools (including me, at research.site) focuses on features, pricing, and subjective "quality" assessments. nobody's systematically checking whether the citations are real.
the system
i built an automated benchmark that runs standardized queries across providers and computes metrics on each response.
the architecture²:
queries → provider adapters → raw responses → metric computation → results
each provider gets an adapter that normalizes the api differences into a common BenchmarkResponse format. openai and gemini require background polling (10-30 min latencies). perplexity and parallel return faster.
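a minimal sketch of what that normalization layer could look like. the field names here are illustrative guesses, not the benchmark's actual schema:

```python
# sketch of the common response shape -- field names are illustrative, not the real schema
from dataclasses import dataclass, field


@dataclass
class Citation:
    url: str
    title: str | None = None


@dataclass
class BenchmarkResponse:
    provider: str                     # e.g. "perplexity-deep"
    query: str
    answer_text: str
    citations: list[Citation] = field(default_factory=list)
    latency_s: float = 0.0            # wall-clock time, including any background polling
    cost_usd: float = 0.0             # derived from the provider's token/request pricing


class ProviderAdapter:
    """each provider implements run() and maps its raw api response into BenchmarkResponse."""

    def run(self, query: str) -> BenchmarkResponse:
        raise NotImplementedError
```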
for url checking, i hit each cited url directly. if it returns 403 (bot blocked), i fall back to tavily's search api to verify the page exists. anything else gets flagged for manual review.
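the check itself is simple. roughly something like this, where `tavily_page_exists` is a stand-in for the search-api fallback (the helper name, user agent, and thresholds are mine, not a real tavily client call):

```python
import requests


def tavily_page_exists(url: str) -> bool:
    """placeholder for the fallback: ask a search api (tavily, in my case) whether the page exists."""
    raise NotImplementedError


def check_url(url: str, timeout: float = 10.0) -> str:
    """returns 'valid' or 'needs_review'; the review queue is resolved by hand."""
    try:
        resp = requests.get(
            url,
            timeout=timeout,
            allow_redirects=True,
            headers={"User-Agent": "deep-research-benchmark/0.1"},
        )
    except requests.RequestException:
        return "needs_review"              # dns failure, timeout, tls error, etc.
    if resp.status_code < 400:
        return "valid"
    if resp.status_code == 403:            # bot-blocked: verify via search fallback instead
        return "valid" if tavily_page_exists(url) else "needs_review"
    return "needs_review"                  # everything else goes to manual review
```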
domain diversity uses shannon entropy — higher score means citations spread across more unique domains rather than clustering on one source.
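a sketch of that computation. i'm assuming the score is normalized by the maximum possible entropy so it lands between 0 and 1, which matches the numbers in the table below:

```python
import math
from collections import Counter
from urllib.parse import urlparse


def domain_diversity(urls: list[str]) -> float:
    """shannon entropy over cited domains, normalized to [0, 1].
    (the normalization is my assumption; raw entropy would also work.)"""
    domains = [urlparse(u).netloc.lower() for u in urls if u]
    if not domains:
        return 0.0
    counts = Counter(domains)
    total = len(domains)
    if len(counts) == 1:
        return 0.0                                     # every citation from one domain
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))
```

a citation list drawn entirely from one domain scores 0.0, which is exactly what happens with gemini's redirect urls further down.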
first real data
ran the benchmark on one query ("what was the total funding raised by openai through 2024, broken down by funding round?") across four providers:
| provider | latency | citations | cost | url validity | diversity |
|---|---|---|---|---|---|
| perplexity-deep | 54s | 58 | $0.97 | 94.8% | 0.87 |
| openai-o4-mini-deep | 4.5min | 12 | $0.11 | 100% | 0.81 |
| gemini-deep | 5.4min | 23 | $0.03 | 17.4% | 0.00 |
| parallel-pro | 14.8min | 22 | $0.10 | 90.9% | 0.94 |
no clear winner. tradeoffs everywhere.
what the data shows
perplexity is fastest and most prolific. 54 seconds, 58 citations. but it's also 10x the cost of competitors. the high citation count might be noise — need the claim alignment metric to know if those citations actually matter.
openai has perfect url validity but few citations. 12 sources, all real. this might be the "quality over quantity" bet — worth investigating whether those 12 citations cover the key facts better than perplexity's 58.
parallel has the highest source diversity. 0.94 entropy score means it's pulling from the widest range of domains. slowest though (14.8 min).
gemini has a problem.
17.4% url validity with 0.0 diversity. what's happening?
gemini's api returns urls like:
vertexaisearch.cloud.google.com/grounding-api-redirect/...
instead of actual source urls. every citation points to a google redirect wrapper, not the real source. from a benchmarking perspective, this makes gemini's citations essentially unverifiable through automated means.³
this might be a limitation of the grounding api, or there might be a way to extract the actual destination urls from the redirect parameters. either way, it's a real finding that affects how you'd evaluate gemini's citation quality programmatically.
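for now the benchmark just needs to avoid counting these as ordinary citations. a trivial guard, with the host string copied from the urls gemini actually returns:

```python
# gemini grounding citations come back as redirect wrappers, not source urls;
# flag them so they're reported as unverifiable rather than silently "valid"
GEMINI_REDIRECT_HOST = "vertexaisearch.cloud.google.com"


def is_gemini_redirect(url: str) -> bool:
    return GEMINI_REDIRECT_HOST in url
```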
what's not built yet
the hard part: claim-source alignment.
the current system tells you whether urls exist. it doesn't tell you whether the content at those urls actually supports the claims being made. that requires:
- extracting specific claims from the response
- fetching the full content of each cited source
- using an llm to verify whether the source supports the claim
- aggregating into supported/not found/contradicted categories
this is where you'd go from "interesting comparison" to "credible benchmark." it's also significantly more complex and expensive to run.
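to make the shape concrete, here's roughly how i expect the pipeline to hang together. every step is a stub because none of it exists yet, and the naive claim-by-source loop would need real claim-to-citation attribution before it's affordable:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    NOT_FOUND = "not_found"
    CONTRADICTED = "contradicted"


@dataclass
class ClaimCheck:
    claim: str
    source_url: str
    verdict: Verdict


def extract_claims(answer_text: str) -> list[str]:
    """step 1: split the response into discrete, checkable factual claims (likely llm-assisted)."""
    raise NotImplementedError


def fetch_source_text(url: str) -> str:
    """step 2: fetch the cited page and strip it down to readable text."""
    raise NotImplementedError


def judge_claim(claim: str, source_text: str) -> Verdict:
    """step 3: ask an llm whether the source supports, omits, or contradicts the claim."""
    raise NotImplementedError


def claim_source_alignment(answer_text: str, citation_urls: list[str]) -> list[ClaimCheck]:
    """step 4: aggregate verdicts. in practice each claim should only be judged
    against the sources actually attached to it, not every citation."""
    checks = []
    for claim in extract_claims(answer_text):
        for url in citation_urls:
            checks.append(ClaimCheck(claim, url, judge_claim(claim, fetch_source_text(url))))
    return checks
```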
other gaps:
- only tested 1 of 10 planned queries
- using json file storage instead of a proper database
- manual runs instead of automated scheduling
- no multi-judge verification for llm-based evaluations
design decisions
why these providers? openai and gemini are the obvious inclusions. perplexity has the most mature api. parallel is the main startup competitor. together they cover the realistic options someone evaluating deep research apis would consider.
why these metrics? the arxiv survey's framework is solid. easy metrics are automatable and objective. the hard metric (claim alignment) is what actually matters but requires more infrastructure.
why start with factual queries? they have verifiable ground truth. "what was openai's funding" has a correct answer. this lets me validate the system before moving to fuzzier categories like comparative analysis or open-ended research.
cost projection for full benchmark: ~$12 per complete run (10 queries × 4 providers). cheap enough to run regularly, expensive enough that i'm not going to spam it during development.
what's next
- fix or document the gemini redirect issue
- run remaining 9 queries to complete the v1 dataset
- build the claim extraction pipeline
- publish methodology and raw data
the goal isn't to declare a "winner" — it's to give people building on these apis actual data about what they're getting. citation count doesn't matter if the citations are wrong. speed doesn't matter if the research is shallow.
if you're evaluating deep research apis, the current answer is "it depends on what you're optimizing for." i want the answer to be "here's the data, decide for yourself."
tracking this at research.site. raw benchmark code is private for now but methodology will be published with v1 results.
Footnotes

1. metrics framework from xu & peng, "a comprehensive survey of deep research" (arxiv:2506.12594), section 4.1. the easy/hard distinction maps to automatable vs. requires-human-judgment.
2. architecture follows the adapter pattern recommended in the survey (section 4.1.1). each provider normalizes to a common BenchmarkResponse with citations, timing, and cost metadata.
3. gemini grounding api documentation doesn't clearly specify whether direct source urls are available. the redirect wrapper may be intentional (for analytics/safety) or an artifact of the api design. investigating.