Deep research APIs are a new category. OpenAI, Perplexity, Google, and startups like Parallel are shipping systems that can browse the web, synthesize sources, and return cited answers in a single API call. These tools are powerful. Comparing them is hard. The Deep Research API Index is an independent platform to evaluate, compare, and rank these APIs through community-driven blind battles and comprehensive metrics.

What's Here

Arena

Run blind battles between providers. Two random models, same prompt; you vote on which response is better.

Leaderboard

Community-driven rankings based on blind battle wins. See which providers actually perform best.

Providers Table

Side-by-side metrics: pricing, latency, context windows, benchmarks, structured output support.

Museum of Queries

Real outputs from deep research providers, preserved for comparison. Same prompt, different approaches.

Unified API

One endpoint to query multiple providers. Fallback strategies, budget caps, normalized responses.
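
To make that concrete, here is a minimal sketch of what a call might look like. The endpoint URL, request fields (providers, fallback, max_budget_usd), and response shape are illustrative assumptions for this example, not the actual API.

```python
# Illustrative sketch only: the endpoint, request fields, and response
# shape below are assumptions, not the real Unified API contract.
import requests

payload = {
    "query": "What changed in the EU AI Act's final text on foundation models?",
    "providers": ["openai", "perplexity", "parallel"],  # preferred order (assumed field)
    "fallback": True,            # try the next provider if one fails (assumed field)
    "max_budget_usd": 0.50,      # stop before exceeding this spend (assumed field)
}

resp = requests.post(
    "https://example.com/v1/research",  # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=300,  # deep research calls can take minutes
)
resp.raise_for_status()
result = resp.json()

# A normalized response would expose the same fields regardless of
# which provider actually answered.
print(result.get("provider"))
print(result.get("answer"))
for citation in result.get("citations", []):
    print(citation.get("url"))
```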

Writing

Deep dives on verification debt, citation quality, provider updates, and what actually matters.

Who's Behind This

Vani

I'm Vani, a Math + Informatics student at UW, currently a TA for Data Structures & Algorithms (CSE 373), and the incoming instructor for the course in Summer 2026.

I built this because I kept running into the same problem: trying to pick the right deep research API and finding zero serious, neutral comparisons. So I made the resource I wished existed—and turned it into a community-driven arena.

Methodology

Provider metrics come from official documentation, published benchmarks, and direct API testing. Leaderboard rankings are based entirely on community blind votes—no synthetic benchmarks, just real human preferences. If something is wrong, missing, or outdated, I want to know.

Independence Note

This is an independent project. I'm not affiliated with OpenAI, Perplexity, Google, Parallel, or any other provider listed here. I don't take money from providers.

Get in Touch

Building with these tools, noticed an error, or want to debate evaluation criteria?