Project Armour · Brief
v0.4 evaluation · May 2026

The marketed flagship LLM has the highest fabrication rate in the panel.

7,372 claims audited across 30 sector-diversified US large-cap earnings releases. Four frontier LLMs. Zero LLMs in the audit loop.

Model                                           Claims   Accuracy   Fabrication
GPT-5.5 Pro                                        742     74.3 %         5.1 %
OpenAI o3                                        2,038     71.5 %         5.3 %
Claude Sonnet 4.6                                2,372     59.6 %         8.5 %
Claude Opus 4.7 (extended thinking, flagship)    2,220     60.0 %        10.6 %

A fabrication is a claim with no traceable anchor in the source document; each one is, by definition, an audit failure waiting to happen. Every headline number carries a 95 % cluster-bootstrap CI, and the OpenAI–Anthropic gap is robust on both the full 203-pair panel and a fully balanced 13-ticker matched subset.
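The interval construction is straightforward to reproduce. A minimal sketch of a cluster bootstrap in Python, assuming per-document arrays of 0/1 fabrication flags; the function name, resample count, and input shape are illustrative, not the actual Armour harness.

    import numpy as np

    def cluster_bootstrap_ci(claims_by_doc, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for a fabrication rate.

        Resamples whole documents (clusters) with replacement, so
        within-document correlation between claims is respected.
        claims_by_doc: one 0/1 array per source document (1 = fabricated).
        """
        rng = np.random.default_rng(seed)
        docs = [np.asarray(d) for d in claims_by_doc]
        stats = np.empty(n_boot)
        for b in range(n_boot):
            # Resample documents, not individual claims, then pool.
            sample = rng.integers(0, len(docs), size=len(docs))
            pooled = np.concatenate([docs[i] for i in sample])
            stats[b] = pooled.mean()
        lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
        return lo, hi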

Why this isn't visible without Project Armour

Frontier LLMs produce confident, plausible, factually wrong claims at a rate that no prompt or model upgrade eliminates. The industry-standard fixes each fail in characteristic ways.

Standard fixes, and why they don't close the gap

  • LLM-as-a-judge. The judge shares training distribution with the model under test. Errors are correlated. The judge agrees with the wrong answer.
  • RAG / better grounding. Grounds the input. Does not constrain the output. The model still interpolates and embellishes.
  • "Use a better model." Reasoning models editorialise more, not less, and the data above shows it.
  • "Trust the model and review." Not an audit trail. A compliance officer cannot defend a sign-off on the basis that the analyst skim-read it.

The missing primitive is a non-generative verifier: a system that, for each claim in an LLM output, answers two questions ("what specific source content backs this?" and "how confident are we in the match?") without itself being subject to hallucination.

That is the design centre of Project Armour.
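What that contract looks like in practice: a hypothetical interface sketch in Python. The type names and fields are illustrative, not Armour's actual API; the point is that verification is a pure function of the claim and the source facts.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Anchor:
        source_span: str         # exact text lifted from the source document
        location: str            # e.g. "press release, table 2, row 'Revenue'"
        match_confidence: float  # deterministic similarity score, not a model logit

    @dataclass(frozen=True)
    class Verdict:
        claim: str
        anchor: Optional[Anchor]  # None = no traceable anchor = fabrication
        outcome: str              # one rung of the graduated verdict ladder

    def verify_claim(claim: str, fact_store) -> Verdict:
        """Pure function of (claim, source facts): identical input,
        identical verdict. No generative step anywhere on this path."""
        ...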

Three findings a buyer should care about

1. Every frontier model fabricates.

Range: 5.1 % (GPT-5.5 Pro) to 10.6 % (Claude Opus 4.7). The cleanest model in the panel still invents one in twenty claims. That is the market.

2. Summaries fabricate more than analyses.

Counter-intuitive and audit-relevant: the one-page summary emailed to the PM is the riskier artefact. Fabrication concentrates exactly where the model is trying to be precise.

3. Different models fail differently.

Opus invents. Sonnet misstates. o3 drifts. The right model depends on which failure mode your reviewer can catch by hand. No LLM-as-judge can produce this signal.

How Armour works

A deterministic verifier built to a single hard constraint: no generative AI in the audit loop. Identical input produces an identical verdict, every time. The verifier cannot hallucinate, because it cannot generate.

  • Source parsing. spaCy NER + regex extraction → structured fact store of entities, numbers, periods, table values.
  • Claim extraction. Every sentence in the LLM output classified by type (numeric, temporal, directional, quote, entity) and origin (restatement, derived, inference, fabrication).
  • Anchoring. Token-set Jaccard with a sentence-embedding fallback (MiniLM, non-generative) for paraphrased restatement; see the first sketch after this list.
  • Per-claim checks. Numeric tolerance, arithmetic recompute, period alignment, direction sign-check, scope fingerprint, source presence; see the second sketch after this list.
  • Verdict ladder. Seven graduated outcomes from pass to fabrication, with a full audit trail. Not a binary score.
  • Profiles. Three configurations: compliance (strictest), balanced, research. Same engine, different verdict thresholds. Trade-offs surfaced explicitly.
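A minimal sketch of the anchoring stage, assuming sentence-level matching with the sentence-transformers MiniLM encoder as the fallback. The thresholds and the two-stage cascade are illustrative; the real cutoffs are profile-dependent.

    import re
    from sentence_transformers import SentenceTransformer, util

    JACCARD_ACCEPT = 0.6   # illustrative thresholds, not Armour's
    EMBED_ACCEPT = 0.8

    _tok = re.compile(r"[a-z0-9.%]+")
    _minilm = SentenceTransformer("all-MiniLM-L6-v2")  # encoder only: non-generative

    def token_jaccard(a: str, b: str) -> float:
        ta, tb = set(_tok.findall(a.lower())), set(_tok.findall(b.lower()))
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def anchor(claim: str, source_sentences: list[str]):
        """Return (sentence, score, method), or None if nothing anchors."""
        # Stage 1: cheap, fully deterministic lexical overlap.
        best = max(source_sentences, key=lambda s: token_jaccard(claim, s))
        j = token_jaccard(claim, best)
        if j >= JACCARD_ACCEPT:
            return best, j, "jaccard"
        # Stage 2: embedding fallback for paraphrased restatement.
        embs = _minilm.encode([claim] + source_sentences, convert_to_tensor=True)
        sims = util.cos_sim(embs[0:1], embs[1:])[0]
        i = int(sims.argmax())
        if float(sims[i]) >= EMBED_ACCEPT:
            return source_sentences[i], float(sims[i]), "minilm"
        return None  # no traceable anchor -> candidate fabrication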
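And a sketch of the per-claim checks, the verdict ladder, and the profile mechanism. The seven rung names are placeholders (the brief only fixes the endpoints, pass and fabrication), and the threshold values are invented for illustration.

    from enum import Enum

    class Outcome(Enum):   # rung names hypothetical; ladder shape per the brief
        PASS = 1
        PASS_PARAPHRASE = 2
        MINOR_DEVIATION = 3
        SCOPE_MISMATCH = 4
        PERIOD_MISMATCH = 5
        CONTRADICTED = 6
        FABRICATION = 7

    # Same engine, different thresholds per profile (values illustrative).
    PROFILES = {
        "compliance": {"numeric_tol": 0.000, "anchor_min": 0.80},
        "balanced":   {"numeric_tol": 0.005, "anchor_min": 0.60},
        "research":   {"numeric_tol": 0.010, "anchor_min": 0.50},
    }

    def numeric_check(claimed: float, sourced: float, tol: float) -> bool:
        """Relative tolerance: a claimed $4.02bn against a sourced $4.018bn
        passes at tol=0.005 but fails under the compliance profile (tol=0)."""
        if sourced == 0:
            return claimed == 0
        return abs(claimed - sourced) / abs(sourced) <= tol

    def direction_check(says_up: bool, current: float, prior: float) -> bool:
        """Sign check: 'revenue grew' must match the sign of current - prior."""
        return says_up == (current > prior)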

Who it's for

The large incumbents (Bloomberg, FactSet, S&P) will eventually build something like this. They will build it badly at first, and they will build it to defend their data moat. There is a window.

Worth fifteen minutes?

If you run AI-assisted research, compliance, or model-risk at a firm where "we used a smarter model" isn't a defensible audit trail, I'd value pushback on these findings before v0.5.