Utalkia
← All insights
Engineering·8 min read

Why RAG alone fails on regulated corpora

By Aghasi Gasparyan

If you can't trace the assertion to the source paragraph, version, and signature, you don't have an audit trail — you have a confident hallucination wearing a citation.

We've integrated RAG into roughly two-thirds of the production systems we've shipped. We've also seen RAG as the single biggest source of post-deployment incidents in regulated workloads. Both things are true. The reason both things are true is that the textbook version of RAG — embed everything, retrieve top-k, generate over the retrieved context — handles 80% of cases beautifully and 20% of cases catastrophically. In regulated work, the 20% is what gets you in front of an examiner.

Three failure modes worth naming.

1. Chunk drift across versions

Regulations get amended. Policies get superseded. Contracts get countersigned. Most RAG systems treat their corpus as a flat namespace of chunks with no concept of which version of a document any given chunk came from.

Two engagements ago we found a system citing the previous year's version of a sanctions list. The text the model returned was technically present in the corpus — but it had been superseded six months earlier, and the new list flagged a new entity the old one didn't cover. The retrieval was correct. The retrieval was also wrong. Welcome to regulated work.

The fix isn't a single technique. It's an invariant: every chunk carries an effective-date range, and every retrieval respects an as-of date the orchestrator passes in. Searches without an as-of date are not allowed in production.

2. Authority confusion

Your corpus contains a draft, a final, an amendment, a redline, a board memo summarizing the amendment, and a customer-facing FAQ paraphrasing the memo. All six documents 'discuss' the same rule. Five of them are not the rule. The retrieval system doesn't know the difference unless you tell it.

The only fix we've found that holds up under audit: every document carries a typed authority level — primary source, derivative, summary, communication — and retrieval can be filtered to authority levels. When a question is 'what does the rule require?', retrieval restricts to primary sources. When the question is 'how have we communicated this to customers?', retrieval restricts to communications. Same corpus, different governance.

3. Silent misses

The most dangerous failure isn't retrieving the wrong thing. It's retrieving nothing relevant and still getting a fluent answer.

Stock RAG behaves the same whether it found a great match or a lousy one. The model writes a confident paragraph either way. In a regulated context, this isn't a quality issue — it's a control issue. You can't have a system that fabricates plausible answers when the evidence isn't there.

We instrument every retrieval with two signals: a relevance score the system can act on, and a coverage signal — does the retrieved set actually contain enough of the relevant ground truth to answer the question? Below either threshold, the system is required to refuse. Refuse, escalate, or ask a clarifying question. Never make something up to fill the silence.

What audit-grade RAG looks like in practice

Stitching the three together: every assertion the system produces is traceable to (a) a specific paragraph, (b) the version of the document that paragraph appeared in, (c) the authority level of that document, and (d) the as-of date the retrieval respected. The evidence package is generated at write time, not synthesized after the fact.

When a regulator or an internal auditor asks 'why did the system say X?', the answer takes ten seconds and includes a hyperlink. When a model is wrong, you can demonstrate exactly which step of the pipeline introduced the error — retrieval, scoring, generation, or verification. That visibility is what 'audit-grade' actually means. Most production RAG systems don't have it.

Closing

RAG is one of the highest-leverage capabilities in the AI stack right now, and it earns its place — but only if you build it for the worst case rather than the median case. In regulated mid-market work, the worst case is the one your buyer is going to be measured on. The median case is what your demo runs on.

Build for the worst case. Demo the median.


Have a workflow that fits the patterns above?

Thirty minutes, no slideware. We'll tell you honestly whether AI fits and where it doesn't.

Book a working session →