7,372 claims audited across 30 sector-diversified US large-cap earnings releases. Four frontier LLMs. Zero LLMs in the audit loop.
| Model | Claims audited | Accuracy | Fabrication rate |
|---|---|---|---|
| GPT-5.5 Pro | 742 | 74.3 % | 5.1 % |
| OpenAI o3 | 2,038 | 71.5 % | 5.3 % |
| Claude Sonnet 4.6 | 2,372 | 59.6 % | 8.5 % |
| Claude Opus 4.7 (extended thinking, flagship) | 2,220 | 60.0 % | 10.6 % |
A fabrication is a claim with no traceable anchor in the source document. Every single one is, by definition, an audit failure waiting to happen. We put 95 % cluster-bootstrap CIs on every headline number; the OpenAI–Anthropic gap is robust on both the full 203-pair panel and a fully balanced 13-ticker matched subset.
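The interval construction is a cluster bootstrap: resample whole documents, not individual claims, so that within-document correlation is respected. A minimal sketch under assumptions; the function name, data shape, and replicate count are illustrative, not the actual audit pipeline:

```python
import random

def cluster_bootstrap_ci(clusters, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for `stat` by resampling whole clusters with replacement.

    `clusters` is a list of lists: one inner list of 0/1 claim outcomes per
    source document. Resampling documents rather than claims preserves the
    within-document correlation that a naive bootstrap would ignore.
    """
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        sample = [rng.choice(clusters) for _ in clusters]
        flat = [x for doc in sample for x in doc]
        boots.append(stat(flat))
    boots.sort()
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def accuracy(xs):
    return sum(xs) / len(xs)

# Hypothetical data: 1 = claim verified, 0 = not, grouped by earnings release.
docs = [[1, 1, 0, 1], [1, 0, 0], [1, 1, 1, 1, 0], [0, 1, 1]]
lo, hi = cluster_bootstrap_ci(docs, accuracy)
```

With a fixed seed the interval is reproducible, which matters when the headline numbers themselves have to survive an audit.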
Frontier LLMs produce confident, plausible, factually wrong claims at a rate that no prompt or model upgrade eliminates. The two industry-standard fixes both fail in characteristic ways.
The missing primitive is a non-generative verifier: a system that answers, for each claim in an LLM output, what specific source content backs it and how confident we are in the match, without itself being subject to hallucination.
That is the design centre of Project Armour.
Fabrication ranges from 5.1 % (GPT-5.5 Pro) to 10.6 % (Claude Opus 4.7). The cleanest model in the panel still invents roughly one claim in twenty. That is the market.
Counter-intuitive and audit-relevant: the one-page summary emailed to the PM is the riskier artefact. Fabrication concentrates exactly where the model is trying to be precise.
Opus invents. Sonnet misstates. o3 drifts. The right model depends on which failure mode your reviewer can catch by hand. No LLM-as-judge can produce this signal.
A deterministic verifier built to a single hard constraint: no generative AI in the audit loop. Identical input produces an identical verdict, every time. The verifier cannot hallucinate, because it cannot generate.
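The core of such a verifier can be as simple as deterministic span matching: slide over the source text, score each window's similarity to the claim, and return the best-supporting span with its score. A minimal sketch using Python's `difflib`; the function name, window sizes, and tie-breaking rule are illustrative assumptions, not Project Armour's actual engine:

```python
from difflib import SequenceMatcher

def best_anchor(claim: str, source: str, window: int = 200, step: int = 50):
    """Deterministically find the source span most similar to `claim`.

    Slides a fixed-size window over the source. Ties are broken by the
    earliest offset, so identical input always yields an identical
    verdict -- there is no sampling and nothing generated.
    """
    needle = claim.lower()
    best = (0.0, 0, "")
    for start in range(0, max(1, len(source) - window + 1), step):
        span = source[start:start + window]
        score = SequenceMatcher(None, needle, span.lower()).ratio()
        if score > best[0]:  # strict '>' keeps the earliest best span
            best = (score, start, span)
    return {"score": best[0], "offset": best[1], "span": best[2]}

source = "Q3 revenue was $4.2 billion, up 12% year over year."
result = best_anchor("revenue was $4.2 billion", source)
```

A claim whose best score falls below every threshold has no traceable anchor: that is the fabrication verdict, reached without any model in the loop.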
Verdicts are graded from pass to fabrication, with a full audit trail. Not a binary score.

Three profiles: compliance (strictest), balanced, research. Same engine, different verdict thresholds. Trade-offs surfaced explicitly.

The large incumbents (Bloomberg, FactSet, S&P) will eventually build something like this. They will build it badly first, and to defend their data moat. There is a window.
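The profile idea reduces to one fixed verdict ladder with per-profile cut-offs: strictness moves the thresholds, never the engine. A hypothetical illustration; the verdict labels and numbers are assumptions, chosen only to show the shape:

```python
VERDICTS = ["pass", "partial", "unsupported", "fabrication"]

# Hypothetical cut-offs: the minimum match score required for each verdict,
# checked from strictest to loosest. Tighter profiles demand more evidence
# before granting "pass"; the engine itself never changes.
PROFILES = {
    "compliance": {"pass": 0.90, "partial": 0.70, "unsupported": 0.40},
    "balanced":   {"pass": 0.80, "partial": 0.55, "unsupported": 0.30},
    "research":   {"pass": 0.70, "partial": 0.45, "unsupported": 0.20},
}

def verdict(score: float, profile: str) -> str:
    cuts = PROFILES[profile]
    for label in VERDICTS[:-1]:
        if score >= cuts[label]:
            return label
    return "fabrication"  # no traceable anchor at any threshold
```

The same match score of 0.85 passes under the balanced profile but is only partial under compliance, which is exactly the trade-off a reviewer needs to see surfaced.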
If you run AI-assisted research, compliance, or model-risk at a firm where "we used a smarter model" isn't a defensible audit trail, I'd value pushback on these findings before v0.5.