RAG systems that work in demos fail in production for a small set of repeatable reasons. Understanding them before you build saves months of debugging after you deploy.
Retrieval-Augmented Generation has become the default architecture for enterprise AI systems that need to answer questions from proprietary data. It is well-understood, well-documented, and demonstrably effective — in demos. In production, it fails in ways that are consistent enough to be predicted, and specific enough to be fixed, if you know what to look for.
Here are the failure modes I encounter most often, and what they actually indicate.
The first failure mode is retrieval that answers the wrong question. It is the most common, and the least obvious to diagnose from the outside. The system produces an answer. The answer sounds reasonable. But it is answering a slightly different question than the one that was asked, because the retrieval step surfaced chunks that were topically adjacent rather than directly relevant.
The underlying cause is almost always one of three things: the embedding model was not trained on text similar to your documents, the chunking strategy doesn't preserve semantic coherence, or the query is being embedded differently from the documents (short questions and long declarative passages can land in different regions of the embedding space).
The fix is rarely "use a better embedding model." It is almost always "evaluate retrieval quality independently from generation quality." If you can measure context relevance (whether the retrieved chunks actually bear on the query) before you ship, you will catch this failure mode before users do.
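A minimal sketch of what that separation looks like, assuming you have a small labeled set of queries mapped to the chunk IDs that should answer them, plus a retrieve() function you can call without the generation model attached. Both names are placeholders for whatever your pipeline exposes.

```python
# Retrieval-only evaluation: no generation model involved.
# retrieve(query, k) is assumed to return ranked chunk IDs.
from typing import Callable

def evaluate_retrieval(
    retrieve: Callable[[str, int], list[str]],
    labeled_queries: dict[str, set[str]],  # query -> IDs of chunks that answer it
    k: int = 5,
) -> dict[str, float]:
    recall_hits, reciprocal_ranks = [], []
    for query, relevant_ids in labeled_queries.items():
        ranked = retrieve(query, k)
        # Recall@k: did any relevant chunk make it into the top k?
        recall_hits.append(any(cid in relevant_ids for cid in ranked))
        # MRR: how high did the first relevant chunk rank?
        rank = next((i + 1 for i, cid in enumerate(ranked) if cid in relevant_ids), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return {
        "recall_at_k": sum(recall_hits) / len(labeled_queries),
        "mrr": sum(reciprocal_ranks) / len(labeled_queries),
    }
```

If recall@k is low here, no amount of prompt tuning downstream will help: the relevant text never reached the model.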
The second failure mode is more dangerous, because it looks like a success. The retrieval returns relevant-looking chunks. The model generates a coherent, confident answer. But the chunks were outdated, or the source document was incorrect, or the chunk was extracted from a context that changes its meaning.
The model is doing its job. The retrieval is doing its job. The problem is that the information in the pipeline is wrong, and neither component is positioned to detect that.
The mitigation is source quality control: knowing which documents are authoritative, how often they change, and how stale content gets evicted from the index. This is usually treated as an operational concern after launch. It should be treated as an architectural requirement before launch.
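As a rough illustration of what that requirement means in practice, here is a sketch of a periodic lifecycle check, assuming each indexed chunk carries metadata about its source document. The field names and the 90-day review window are assumptions, not a standard.

```python
# Hypothetical document-lifecycle metadata: every chunk records where it came
# from, and a periodic job flags anything superseded by a newer version of its
# source or past its review window.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ChunkMeta:
    chunk_id: str
    source_doc: str
    doc_version: int
    last_verified: datetime  # UTC-aware; when the source was last confirmed current
    authoritative: bool      # is this document the system of record for its topic?

def select_stale_chunks(
    index_metadata: list[ChunkMeta],
    max_age: timedelta = timedelta(days=90),
) -> list[str]:
    """Return chunk IDs that should be evicted from the index or sent for review."""
    now = datetime.now(timezone.utc)
    latest_version: dict[str, int] = {}
    for meta in index_metadata:
        latest_version[meta.source_doc] = max(
            latest_version.get(meta.source_doc, 0), meta.doc_version
        )
    return [
        meta.chunk_id
        for meta in index_metadata
        if meta.doc_version < latest_version[meta.source_doc]  # superseded
        or now - meta.last_verified > max_age                  # past review window
    ]
```

The authoritative flag is there so retrieval can prefer the system of record when two documents disagree; eviction is only half of source quality control.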
The third failure mode is a context window full of noise. Enterprise document collections are not clean. They contain formatting artifacts, duplicate content, boilerplate disclaimers, version history, and cross-references that look meaningful in isolation but add no value to a query response. When the retrieval step returns a generous number of chunks, which it tends to do when recall is prioritized over precision, the context window fills with this noise before it fills with signal.
The generation model can only work with what it receives. A context window full of noise produces answers that are either hedged to the point of uselessness or confidently wrong about the wrong thing.
The fix is reranking: a second retrieval stage that scores the initial candidate chunks for relevance before they are sent to the generation model. It adds latency. It is worth it.
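A sketch of what that second stage can look like, using a cross-encoder from the sentence-transformers library. The model name and fetch sizes are illustrative, and vector_search() stands in for whatever first-stage retriever you already have.

```python
# Two-stage retrieval: over-fetch candidates from the vector index, then score
# each (query, chunk) pair with a cross-encoder and keep only the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def retrieve_with_rerank(query: str, vector_search, top_k: int = 5, fetch_k: int = 50) -> list[str]:
    # First stage: cheap, recall-oriented search over-fetches candidate chunk texts.
    candidates = vector_search(query, fetch_k)
    # Second stage: precision-oriented scoring of each (query, chunk) pair.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```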
The fourth failure mode is a mismatch between the pipeline and the query. A RAG pipeline optimized for factual lookup ("what is the refund policy?") will degrade significantly when asked to synthesize across multiple documents or reason about a multi-step problem. The retrieval returns chunks relevant to individual components of the question, but the generation model has no mechanism to structure a response that accounts for all of them coherently.
This is not a failure of the model. It is a mismatch between the architecture and the query type. If your users will ask complex questions — and enterprise users always will — the architecture needs to account for it before launch, not after.
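One way to account for it is to classify query complexity up front and route multi-part questions through decomposition before retrieval. The sketch below assumes a generic llm_complete() text-completion helper and a retrieve() helper that returns chunk texts; both are placeholders, and the prompts are only illustrative.

```python
# Route simple lookups straight through; decompose synthesis questions into
# sub-questions, retrieve for each, and answer once against all gathered context.
def answer(query: str, retrieve, llm_complete) -> str:
    routing = llm_complete(
        "Classify this question as LOOKUP (single fact) or SYNTHESIS "
        f"(multi-step or multi-document). Reply with one word.\n\n{query}"
    ).strip().upper()

    if routing.startswith("LOOKUP"):
        context = "\n\n".join(retrieve(query, k=5))
        return llm_complete(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

    # SYNTHESIS: decompose, retrieve per sub-question, then answer once.
    sub_questions = [
        line.strip()
        for line in llm_complete(
            f"Break this question into 2-4 self-contained sub-questions, one per line:\n{query}"
        ).splitlines()
        if line.strip()
    ]
    context = "\n\n".join(
        f"Sub-question: {sq}\n" + "\n".join(retrieve(sq, k=3)) for sq in sub_questions
    )
    return llm_complete(
        f"Using only the context below, answer the original question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```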
The fifth failure mode is silent degradation. The pipeline works at launch. Three months later it doesn't, and nobody notices, because there is no automated evaluation running against production traffic. The documents changed. The query distribution shifted. A new document category was added that the embedding model handles poorly. And every week that passes without detection makes the problem harder to trace.
This is the failure mode that damages trust most durably, because users stop relying on the system gradually rather than all at once, and by the time the team notices, the reputation of the system has already been established.
The solution is an evaluation loop that runs continuously — not just before launch. Measuring faithfulness, relevance, and groundedness on a sample of production queries, with alerting when scores degrade, is not optional for a system that people are expected to rely on.
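A minimal version of that loop, assuming you log (query, retrieved context, answer) triples and have some evaluator that can score them, whether an off-the-shelf framework or an in-house LLM judge. Every function name and threshold here is a placeholder.

```python
# Scheduled evaluation job: sample recent production traffic, score each
# interaction, and alert when any metric drops below its baseline.
# sample_recent_interactions(), score_interaction(), and send_alert() are
# stand-ins for your logging store, evaluator, and paging system.
import random
import statistics

BASELINES = {"faithfulness": 0.85, "answer_relevance": 0.80, "context_relevance": 0.75}

def run_daily_eval(sample_recent_interactions, score_interaction, send_alert, sample_size: int = 50):
    interactions = sample_recent_interactions(limit=1000)
    sample = random.sample(interactions, min(sample_size, len(interactions)))

    # score_interaction returns a dict of metric -> score for one logged triple
    scores = [score_interaction(item) for item in sample]

    for metric, baseline in BASELINES.items():
        mean_score = statistics.mean(s[metric] for s in scores)
        if mean_score < baseline:
            send_alert(
                f"{metric} degraded: {mean_score:.2f} (baseline {baseline:.2f}) "
                f"over {len(sample)} sampled queries"
            )
```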
None of these failure modes are mysterious. They are predictable consequences of architectural decisions, or the absence of them, made early in the build. The teams that ship RAG systems that stay reliable treat evaluation as a first-class engineering concern, evaluate retrieval quality independently from generation quality, and plan for document lifecycle management before they write the first line of pipeline code.
The teams that don't do these things ship systems that work in the demo and erode in production. The failure modes are different every time on the surface. Underneath, they are the same problem: treating RAG as a prompt engineering task rather than a systems engineering task.