verification debt
every deep research api ships citations. most teams assume citations mean the output is trustworthy. this assumption creates what i call verification debt — the gap between cited claims and actually verified claims.
the problem
citations are not verification. a citation tells you where an llm claims it found information. it does not tell you whether that source actually contains the claim, whether the claim was interpreted correctly, or whether the source itself is reliable.
in production systems, verification debt compounds. one unverified claim feeds into another workflow. that workflow generates a report. the report informs a decision. the decision has consequences.
how it accumulates
verification debt grows through three mechanisms:
- source drift — the cited url changes or disappears after the research was conducted
- interpretation error — the model misreads or misattributes information from a valid source
- context collapse — correct information from a source gets applied to the wrong context
most teams don't instrument for any of these. they ship citations as a liability shield rather than a quality signal.
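if you do want to instrument them, a minimal starting point is to tag every failed verification with the mechanism behind it, so the three failure modes show up separately in your metrics instead of as one undifferentiated error count. a sketch, with hypothetical names throughout:

```python
from dataclasses import dataclass
from enum import Enum, auto

class DebtMechanism(Enum):
    """the three ways verification debt accumulates."""
    SOURCE_DRIFT = auto()          # cited url changed or disappeared
    INTERPRETATION_ERROR = auto()  # source is valid, claim misread from it
    CONTEXT_COLLAPSE = auto()      # correct info applied to the wrong context

@dataclass
class VerificationFailure:
    """one failed check, tagged with the mechanism behind it."""
    claim: str
    cited_url: str
    mechanism: DebtMechanism
    note: str = ""

# example: a url that 404s on recheck is source drift
failure = VerificationFailure(
    claim="library X supports streaming",
    cited_url="https://example.com/docs",
    mechanism=DebtMechanism.SOURCE_DRIFT,
    note="url returned 404 on recheck",
)
```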
measuring citation quality
not all citations are equal. a production research system should track four metrics, the first two of which are sketched in code after this list:
- resolution rate — what percentage of cited urls actually resolve?
- content match — does the cited page contain the claimed information?
- source authority — is the source appropriate for the claim type?
- freshness — how old is the cited content relative to the query?
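resolution rate and content match are cheap to approximate. a minimal sketch, assuming a simple Citation record of my own invention and using naive substring matching as a stand-in for real content verification (production systems want fuzzy or semantic matching here):

```python
import urllib.request
from dataclasses import dataclass

@dataclass
class Citation:
    url: str
    claimed_text: str  # what the model says the source supports

def resolves(url: str, timeout: float = 10.0) -> bool:
    """crude resolution check: does the url return a non-error response?"""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def content_matches(url: str, claimed_text: str) -> bool:
    """crude content match: does the page body contain the claimed text?"""
    try:
        with urllib.request.urlopen(url, timeout=10.0) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        return claimed_text.lower() in body.lower()
    except Exception:
        return False

def resolution_rate(citations: list[Citation]) -> float:
    """fraction of cited urls that actually resolve."""
    if not citations:
        return 0.0
    return sum(resolves(c.url) for c in citations) / len(citations)
```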
the index tracks some of these metrics across providers. the variance is significant.
practical approaches
if you're building on deep research apis, here's what actually helps:
sample and verify. pick 5-10 citations per output and manually check them. track the failure rate over time. this gives you a baseline.
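a sketch of the sampling step, assuming citations arrive as dicts with url and claim fields; the "manually check" part stays manual, this just builds the review queue:

```python
import csv
import random

def sample_for_review(citations: list[dict], k: int = 10,
                      out_path: str = "review_queue.csv") -> None:
    """pick up to k citations at random and dump them to a csv
    for manual checking."""
    batch = random.sample(citations, min(k, len(citations)))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "claim", "verified"])
        writer.writeheader()
        for c in batch:
            writer.writerow({"url": c["url"], "claim": c["claim"], "verified": ""})

# reviewers fill in the 'verified' column; the failure rate is just
# the share of rows marked false, tracked run over run.
```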
instrument resolution. log every cited url your system encounters. run async checks to verify they resolve. alert when resolution rates drop.
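a sketch of the async sweep, assuming your cited urls are already logged somewhere and using aiohttp (third-party); the print is a placeholder for whatever alerting you actually use:

```python
import asyncio
import aiohttp  # pip install aiohttp

async def check_url(session: aiohttp.ClientSession, url: str) -> bool:
    """HEAD the url; treat any status under 400 as 'resolves'."""
    try:
        async with session.head(url, allow_redirects=True,
                                timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return resp.status < 400
    except Exception:
        return False

async def resolution_sweep(urls: list[str], alert_below: float = 0.95) -> float:
    """check all logged urls concurrently and return the resolution rate.
    fires a placeholder alert when the rate drops below the threshold."""
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check_url(session, u) for u in urls))
    rate = sum(results) / len(urls) if urls else 1.0
    if rate < alert_below:
        print(f"ALERT: resolution rate {rate:.1%} below {alert_below:.0%}")
    return rate

# rate = asyncio.run(resolution_sweep(logged_urls))
```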
separate citation from confidence. just because something is cited doesn't mean it's high-confidence. build your downstream systems to treat citations as one signal among several.
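one way to make that concrete downstream: score each claim from several signals, where the bare presence of a citation carries little weight on its own. the weights below are illustrative placeholders, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class ClaimSignals:
    """citation presence is one input among several, not the verdict."""
    has_citation: bool
    url_resolves: bool
    content_matched: bool
    source_authority: float  # 0..1, from your own source scoring
    model_confidence: float  # 0..1, if the provider exposes one

def claim_confidence(s: ClaimSignals) -> float:
    """toy weighted combination: a citation alone contributes little,
    verified resolution and content match contribute most."""
    score = 0.0
    score += 0.10 if s.has_citation else 0.0
    score += 0.25 if s.url_resolves else 0.0
    score += 0.30 if s.content_matched else 0.0
    score += 0.20 * s.source_authority
    score += 0.15 * s.model_confidence
    return score
```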
design for failure. assume some percentage of citations will be wrong. what does your system do when that happens? if the answer is "nothing," you have a problem.
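the answer doesn't have to be elaborate. even a three-way gate beats silently shipping every cited claim. a sketch, with arbitrary thresholds:

```python
def route_claim(claim: str, confidence: float,
                accept_at: float = 0.7, review_at: float = 0.4) -> tuple[str, str]:
    """explicit failure path instead of 'nothing': high-confidence claims
    pass through, mid-confidence claims are flagged for human review,
    everything else is excluded from the report."""
    if confidence >= accept_at:
        return ("include", claim)
    if confidence >= review_at:
        return ("flag_for_review", claim)
    return ("exclude", claim)
```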
verification debt isn't a reason to avoid deep research apis. it's a reason to build robust systems around them. the apis that surface citation quality metrics are giving you the tools to manage this debt. use them.