citation quality metrics

when comparing deep research apis, citation count is the wrong metric. what matters is citation quality — whether the sources actually support the claims, and whether those sources are appropriate for the query type.

beyond citation count

a report with 50 citations isn't necessarily better than one with 10. if 40 of those citations are redundant news aggregators repeating the same wire story, you have one effective source dressed up as fifty.

the index tracks several providers. the ones with lower citation counts but higher per-citation quality consistently outperform on downstream tasks.
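
one way to make that concrete is to count effective sources rather than raw citations. the sketch below is a crude version of that idea (python, with a made-up effective_source_count helper and made-up urls): it collapses citations that share a domain. real syndication detection would also need content similarity, but even this exposes padding.

```python
from urllib.parse import urlparse

def effective_source_count(citation_urls):
    """rough count of distinct sources behind a citation list.

    collapses citations that share a registrable domain. a real
    implementation would also cluster syndicated copies of the same
    wire story by headline/body similarity, not just by domain.
    """
    domains = set()
    for url in citation_urls:
        host = urlparse(url).netloc.lower()
        # strip a leading "www." so www.example.com == example.com
        domains.add(host.removeprefix("www."))
    return len(domains)

# hypothetical citation list: three urls, two of them the same outlet
citations = [
    "https://www.example-wire.com/story-1",
    "https://example-wire.com/story-1?utm_source=feed",
    "https://aggregator.example/repost-of-story-1",
]
print(effective_source_count(citations))  # 2 effective sources behind 3 citations
```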

a working framework

here's how i evaluate citation quality in production:

source class

not all sources serve the same function. i categorize them:

  • primary — original research, official documents, firsthand accounts
  • authoritative secondary — peer-reviewed analysis, established reporting
  • aggregation — news wires, content farms, listicles
  • social — forums, social media, comments

a good research output should lean heavily on primary and authoritative secondary. if your citations are mostly aggregation and social, the research is shallow.
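
to make these buckets operational, a minimal sketch is an enum plus a naive domain lookup. the SourceClass names mirror the list above; the domain sets and the classify_source helper are assumptions for illustration, not a real taxonomy.

```python
from enum import Enum
from urllib.parse import urlparse

class SourceClass(Enum):
    PRIMARY = "primary"
    AUTHORITATIVE_SECONDARY = "authoritative secondary"
    AGGREGATION = "aggregation"
    SOCIAL = "social"

# illustrative domain buckets; a real system would use a maintained taxonomy
PRIMARY_DOMAINS = {"arxiv.org", "sec.gov", "data.gov"}
AUTHORITATIVE_DOMAINS = {"nature.com", "reuters.com"}
SOCIAL_DOMAINS = {"reddit.com", "x.com", "news.ycombinator.com"}

def classify_source(url: str) -> SourceClass:
    """naive domain-based classification; unknown domains default to aggregation."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in PRIMARY_DOMAINS:
        return SourceClass.PRIMARY
    if host in AUTHORITATIVE_DOMAINS:
        return SourceClass.AUTHORITATIVE_SECONDARY
    if host in SOCIAL_DOMAINS:
        return SourceClass.SOCIAL
    return SourceClass.AGGREGATION
```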

claim-source alignment

each citation should match the claim type:

  • factual / statistical: primary data, govt sources, academic papers
  • opinion / analysis: named experts, editorial from credible outlets
  • event / news: original reporting, official statements
  • technical: documentation, whitepapers, peer review

misalignment is a red flag. a statistical claim backed only by a blog post should be treated with skepticism.
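
continuing the sketch above, alignment can be expressed as an allow-list of source classes per claim type. the claim type keys and the is_aligned helper are illustrative assumptions; the mapping just encodes the list above.

```python
# allow-list of source classes per claim type (illustrative, not a standard)
ALLOWED_SOURCES = {
    "factual": {SourceClass.PRIMARY, SourceClass.AUTHORITATIVE_SECONDARY},
    "opinion": {SourceClass.AUTHORITATIVE_SECONDARY},
    "event": {SourceClass.PRIMARY, SourceClass.AUTHORITATIVE_SECONDARY},
    "technical": {SourceClass.PRIMARY, SourceClass.AUTHORITATIVE_SECONDARY},
}

def is_aligned(claim_type: str, source_class: SourceClass) -> bool:
    """true if the cited source class is appropriate for the claim type."""
    return source_class in ALLOWED_SOURCES.get(claim_type, set())

# a statistical claim backed only by an aggregator fails the check
print(is_aligned("factual", SourceClass.AGGREGATION))  # False
```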

freshness requirements

freshness depends on the query:

  • current events — sources should be hours to days old
  • market data — real-time or same-day
  • scientific consensus — recent meta-analyses or reviews
  • historical — contemporary sources or established scholarship

a research api that cites 2019 articles for "what's happening in AI agents" is giving you stale information regardless of how many citations it provides.
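
freshness is easy to mechanize once you pick a window per query type. the cutoffs below are assumptions chosen to match the list above, and is_fresh_enough is a hypothetical helper, not part of any api.

```python
from datetime import datetime, timedelta, timezone

# illustrative freshness windows per query type; the exact cutoffs are assumptions
MAX_AGE = {
    "current_events": timedelta(days=2),
    "market_data": timedelta(days=1),
    "scientific_consensus": timedelta(days=3 * 365),
    "historical": None,  # no recency requirement
}

def is_fresh_enough(query_type: str, published_at: datetime) -> bool:
    """check a citation's publish date against the window for this query type."""
    window = MAX_AGE.get(query_type)
    if window is None:
        return True
    return datetime.now(timezone.utc) - published_at <= window

# a 2019 article cited for a current-events query fails
print(is_fresh_enough("current_events", datetime(2019, 6, 1, tzinfo=timezone.utc)))  # False
```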

practical scoring

for each research output, i score citations on a simple 0-2 scale:

  • 2 — primary or authoritative source, matches claim type, appropriate freshness
  • 1 — decent source with minor issues (slightly stale, aggregated but traceable)
  • 0 — poor source, misaligned, or doesn't support the claim

sum the scores, divide by total citations. this gives you a quality ratio. track it over time. compare across providers.
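
the arithmetic is trivial, but writing it down keeps scoring consistent across runs. this sketch implements exactly the rule above: sum the 0-2 scores and divide by the citation count, giving a ratio on a 0-2 scale.

```python
def citation_quality_ratio(scores: list[int]) -> float:
    """sum the per-citation scores (0-2) and divide by the citation count."""
    if not scores:
        return 0.0
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    return sum(scores) / len(scores)

# example: 10 citations scored by hand
print(citation_quality_ratio([2, 2, 1, 2, 0, 1, 2, 2, 1, 0]))  # 1.3 on a 0-2 scale
```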


the apis that win long-term will be the ones that optimize for citation quality, not citation volume. the index is tracking this. the results are already diverging.