Project Armour · Brief
v0.4 evaluation · May 2026

The marketed flagship LLM has the highest fabrication rate in the panel.

7,372 claims audited across 30 sector-diversified US large-cap earnings releases. Four frontier LLMs. Zero LLMs in the audit loop.

Model                                           Claims   Accuracy   Fabrication
GPT-5.5 Pro                                        742     74.3 %         5.1 %
OpenAI o3                                        2,038     71.5 %         5.3 %
Claude Sonnet 4.6                                2,372     59.6 %         8.5 %
Claude Opus 4.7 (extended thinking, flagship)    2,220     60.0 %        10.6 %

A fabrication is a claim with no traceable anchor in the source document; each one is, by definition, an audit failure waiting to happen. Every headline number carries a 95 % cluster-bootstrap CI, and the OpenAI–Anthropic gap is robust on both the full 203-pair panel and a fully balanced 13-ticker matched subset.
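The interval construction is straightforward to reproduce. A minimal sketch of a cluster bootstrap in Python, assuming per-document arrays of 0/1 fabrication flags; the function name, resample count, and input shape are illustrative, not the actual Armour harness.

    import numpy as np

    def cluster_bootstrap_ci(claims_by_doc, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for a fabrication rate.

        Resamples whole documents (clusters) with replacement, so
        within-document correlation between claims is respected.
        claims_by_doc: one 0/1 array per source document (1 = fabricated).
        """
        rng = np.random.default_rng(seed)
        docs = [np.asarray(d) for d in claims_by_doc]
        stats = np.empty(n_boot)
        for b in range(n_boot):
            # Resample documents, not individual claims, then pool.
            sample = rng.integers(0, len(docs), size=len(docs))
            pooled = np.concatenate([docs[i] for i in sample])
            stats[b] = pooled.mean()
        lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
        return lo, hi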

Why this isn't visible without Project Armour

Frontier LLMs produce confident, plausible, factually wrong claims at a rate that no prompt or model upgrade eliminates. The industry-standard fixes each fail in characteristic ways.

Standard fixes, and why they don't close the gap

  • LLM-as-a-judge. The judge shares training distribution with the model under test. Errors are correlated. The judge agrees with the wrong answer.
  • RAG / better grounding. Grounds the input. Does not constrain the output. The model still interpolates and embellishes.
  • "Use a better model." Reasoning models editorialise more, not less, and the data above shows it.
  • "Trust the model and review." Not an audit trail. A compliance officer cannot defend a sign-off on the basis that the analyst skim-read it.

The missing primitive is a non-generative verifier: a system that, for each claim in an LLM output, answers two questions ("what specific source content backs this?" and "how confident are we in the match?") without itself being subject to hallucination.

That is the design centre of Project Armour.
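What that contract looks like in practice: a hypothetical interface sketch in Python. The type names and fields are illustrative, not Armour's actual API; the point is that verification is a pure function of the claim and the source facts.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Anchor:
        source_span: str         # exact text lifted from the source document
        location: str            # e.g. "press release, table 2, row 'Revenue'"
        match_confidence: float  # deterministic similarity score, not a model logit

    @dataclass(frozen=True)
    class Verdict:
        claim: str
        anchor: Optional[Anchor]  # None = no traceable anchor = fabrication
        outcome: str              # one rung of the graduated verdict ladder

    def verify_claim(claim: str, fact_store) -> Verdict:
        """Pure function of (claim, source facts): identical input,
        identical verdict. No generative step anywhere on this path."""
        ...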

Three findings a buyer should care about

1. Every frontier model fabricates.

Range: 5.1 % (GPT-5.5 Pro) to 10.6 % (Claude Opus 4.7). The cleanest model in the panel still invents one in twenty claims. That is the market.

2. Summaries fabricate more than analyses.

Counter-intuitive and audit-relevant: the one-page summary emailed to the PM is the riskier artefact. Fabrication concentrates exactly where the model is trying to be precise.

3. Different models fail differently.

Opus invents. Sonnet misstates. o3 drifts. The right model depends on which failure mode your reviewer can catch by hand. No LLM-as-judge can produce this signal.

How Armour works

A deterministic verifier built to a single hard constraint: no generative AI in the audit loop. Identical input produces an identical verdict, every time. The verifier cannot hallucinate, because it cannot generate.

  • Source parsing. spaCy NER + regex extraction → structured fact store of entities, numbers, periods, table values.
  • Claim extraction. Every sentence in the LLM output classified by type (numeric, temporal, directional, quote, entity) and origin (restatement, derived, inference, fabrication).
  • Anchoring. Token-set Jaccard with a sentence-embedding fallback (MiniLM, non-generative) for paraphrased restatement; see the first sketch after this list.
  • Per-claim checks. Numeric tolerance, arithmetic recompute, period alignment, direction sign-check, scope fingerprint, source presence; see the second sketch after this list.
  • Verdict ladder. Seven graduated outcomes from pass to fabrication, with a full audit trail. Not a binary score.
  • Profiles. Three configurations: compliance (strictest), balanced, research. Same engine, different verdict thresholds. Trade-offs surfaced explicitly.
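A minimal sketch of the anchoring stage, assuming sentence-level matching with the sentence-transformers MiniLM encoder as the fallback. The thresholds and the two-stage cascade are illustrative; the real cutoffs are profile-dependent.

    import re
    from sentence_transformers import SentenceTransformer, util

    JACCARD_ACCEPT = 0.6   # illustrative thresholds, not Armour's
    EMBED_ACCEPT = 0.8

    _tok = re.compile(r"[a-z0-9.%]+")
    _minilm = SentenceTransformer("all-MiniLM-L6-v2")  # encoder only: non-generative

    def token_jaccard(a: str, b: str) -> float:
        ta, tb = set(_tok.findall(a.lower())), set(_tok.findall(b.lower()))
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def anchor(claim: str, source_sentences: list[str]):
        """Return (sentence, score, method), or None if nothing anchors."""
        # Stage 1: cheap, fully deterministic lexical overlap.
        best = max(source_sentences, key=lambda s: token_jaccard(claim, s))
        j = token_jaccard(claim, best)
        if j >= JACCARD_ACCEPT:
            return best, j, "jaccard"
        # Stage 2: embedding fallback for paraphrased restatement.
        embs = _minilm.encode([claim] + source_sentences, convert_to_tensor=True)
        sims = util.cos_sim(embs[0:1], embs[1:])[0]
        i = int(sims.argmax())
        if float(sims[i]) >= EMBED_ACCEPT:
            return source_sentences[i], float(sims[i]), "minilm"
        return None  # no traceable anchor -> candidate fabrication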
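And a sketch of the per-claim checks, the verdict ladder, and the profile mechanism. The seven rung names are placeholders (the brief only fixes the endpoints, pass and fabrication), and the threshold values are invented for illustration.

    from enum import Enum

    class Outcome(Enum):   # rung names hypothetical; ladder shape per the brief
        PASS = 1
        PASS_PARAPHRASE = 2
        MINOR_DEVIATION = 3
        SCOPE_MISMATCH = 4
        PERIOD_MISMATCH = 5
        CONTRADICTED = 6
        FABRICATION = 7

    # Same engine, different thresholds per profile (values illustrative).
    PROFILES = {
        "compliance": {"numeric_tol": 0.000, "anchor_min": 0.80},
        "balanced":   {"numeric_tol": 0.005, "anchor_min": 0.60},
        "research":   {"numeric_tol": 0.010, "anchor_min": 0.50},
    }

    def numeric_check(claimed: float, sourced: float, tol: float) -> bool:
        """Relative tolerance: a claimed $4.02bn against a sourced $4.018bn
        passes at tol=0.005 but fails under the compliance profile (tol=0)."""
        if sourced == 0:
            return claimed == 0
        return abs(claimed - sourced) / abs(sourced) <= tol

    def direction_check(says_up: bool, current: float, prior: float) -> bool:
        """Sign check: 'revenue grew' must match the sign of current - prior."""
        return says_up == (current > prior)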

Who it's for

The large incumbents (Bloomberg, FactSet, S&P) will eventually build something like this. They will build it badly at first, and they will build it to defend their data moat. There is a window.

Worth fifteen minutes?

If you run AI-assisted research, compliance, or model-risk at a firm where "we used a smarter model" isn't a defensible audit trail, I'd value pushback on these findings before v0.5.