RAG in Production: What Breaks and How to Fix It

Retrieval-augmented generation (RAG) is the fastest way to make an LLM answer questions about your own data—and the fastest way to ship something that looks great in a demo and falls apart in production. The demo works because you tested the three questions you already knew the answers to. Production is the other ten thousand. Here’s what actually breaks, and how we engineer around it.

Retrieval quality is the whole game

If the right chunk never makes it into the context window, no model—however capable—can answer correctly. Most “the AI is hallucinating” complaints are really retrieval failures wearing a costume.

The usual culprits: naive chunking that splits a table or a clause in half; embeddings that don’t capture your domain’s vocabulary; and pure vector search that misses exact-match terms like part numbers or policy codes. Fixes that pay off: chunk on document structure (headings, sections) rather than fixed token counts, add hybrid search (vector + keyword/BM25) so exact terms aren’t lost, and rerank the top candidates before they hit the prompt. Retrieval is an engineering problem with measurable inputs and outputs—treat it like one.
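
As one illustration of the hybrid-search fix, here is a minimal sketch of reciprocal rank fusion (RRF), a common way to merge a vector ranking and a keyword ranking without having to calibrate their scores against each other. The chunk IDs and the two input lists are made up for the example, and k=60 is only a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks that rank well in more than one list rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from a vector index and a BM25 index for one query.
vector_hits = ["c7", "c2", "c9"]
keyword_hits = ["c2", "c4", "c7"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks found by both retrievers (c2 and c7 here) outrank chunks found by only one, which is the behavior you want when an exact part number and a semantic match point at the same passage. A reranker would then reorder the top of this fused list.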

You can’t improve what you don’t evaluate

Many teams ship RAG with no evaluation and then tune by vibes. The first thing we build is an eval set: real questions, expected answers, and the source passages that should be retrieved. That lets you measure retrieval (did we fetch the right context?) separately from generation (given the context, was the answer faithful?).
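
A minimal sketch of the retrieval half of such an eval set, assuming Python 3.9+ and a retriever exposed as a function from question to ranked chunk IDs. The EvalCase fields, the chunk IDs, and the fake_retrieve stand-in are all hypothetical; recall@k is one reasonable retrieval metric among several.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    relevant_chunk_ids: set[str]  # passages that should be retrieved


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(cases, retrieve, k=5):
    """Average recall@k over the eval set for a given retriever."""
    scores = [recall_at_k(retrieve(c.question), c.relevant_chunk_ids, k)
              for c in cases]
    return sum(scores) / len(scores) if scores else 0.0


cases = [
    EvalCase("What is the refund window?", "30 days", {"policy-12"}),
    EvalCase("Which bolts fit model X?", "M8 and M10", {"spec-7", "spec-8"}),
]

# Stand-in for a real retriever: canned ranked results per question.
canned = {
    "What is the refund window?": ["policy-12", "faq-3"],
    "Which bolts fit model X?": ["spec-7", "faq-1"],
}


def fake_retrieve(question):
    return canned[question]


mean_recall = mean_recall_at_k(cases, fake_retrieve, k=5)
print(mean_recall)
```

Because the metric only looks at chunk IDs, it can run without calling a model at all, which keeps the retrieval score independent of whatever the generation step does with the context.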

With automated evals in place, every change—a new chunking strategy, a different embedding model, a prompt tweak—becomes a measurable experiment instead of a gut call. Faithfulness checks (is the answer grounded in the retrieved text?) catch hallucinations before users do.
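
To make the faithfulness idea concrete, here is a deliberately crude sketch: it flags answer sentences whose words mostly do not appear in the retrieved text. Token overlap is only a stand-in for a real groundedness check (typically an entailment model or an LLM judge), and the function name and the 0.6 threshold are assumptions chosen for the example.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer, context, min_overlap=0.6):
    """Return answer sentences with too little word overlap with the context.

    A rough lexical proxy for faithfulness: it catches sentences that
    introduce material absent from the retrieved text, but it cannot
    detect a sentence that reuses the right words to say the wrong thing.
    """
    context_tokens = _tokens(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & context_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


context = "Refunds are accepted within 30 days of purchase with a receipt."
answer = ("Refunds are accepted within 30 days of purchase. "
          "Shipping is free worldwide.")

flagged = unsupported_sentences(answer, context)
print(flagged)
```

The second sentence is flagged because nothing in the retrieved passage mentions shipping. In an eval run, a non-empty result would mark the case as a faithfulness failure to review.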

Latency and cost are features, not afterthoughts

A correct answer that takes nine seconds and costs a dollar per query won’t survive contact with real traffic. Reranking, large context windows, and multi-step agentic retrieval each add latency and token cost, and the totals climb quickly when you stack them.

Levers that matter in production: cache frequent queries and embeddings, retrieve fewer-but-better chunks instead of stuffing the context, route simple queries to smaller/cheaper models, and stream responses so perceived latency drops. Cost per query and p95 latency belong on a dashboard from day one, not in a surprise on the first invoice.
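
Two of these levers, caching and routing, can be sketched in a few lines. This is an illustration under stated assumptions rather than a production design: the cache is an in-memory dict with no expiry or invalidation, the word-count routing rule is a placeholder for a real complexity signal, and "small-model" and "large-model" are hypothetical names.

```python
import hashlib


class AnswerCache:
    """In-memory cache keyed on a normalized form of the query."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Collapse case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, query, compute):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(query)
        self._store[key] = answer
        return answer


def pick_model(query, max_simple_words=12):
    """Route short queries to a cheaper model (hypothetical model names)."""
    if len(query.split()) <= max_simple_words:
        return "small-model"
    return "large-model"


pipeline_calls = []


def fake_pipeline(query):
    # Stand-in for the expensive retrieve-and-generate path.
    pipeline_calls.append(query)
    return f"answer to: {query}"


cache = AnswerCache()
first = cache.get_or_compute("What is the refund window?", fake_pipeline)
second = cache.get_or_compute("  what is the REFUND window? ", fake_pipeline)
print(cache.hits, cache.misses, pick_model("What is the refund window?"))
```

The hit and miss counters are there on purpose: cache hit rate is one of the numbers worth putting next to cost per query and p95 latency. A real cache would also need to expire entries when the underlying documents change.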

Guardrails and monitoring keep it trustworthy

The model will eventually be asked something out of scope, fed a poisoned document, or prompted to misbehave. Production RAG needs guardrails: scope limits, prompt-injection defenses, PII handling, and a graceful “I don’t know” instead of a confident fabrication.
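
A minimal sketch of two of these guardrails: abstaining when retrieval finds nothing relevant enough, and redacting email addresses from the output. The 0.5 score threshold, the fallback wording, and the function names are assumptions for illustration; the regex covers only simple email formats, and prompt-injection defenses are out of scope for this snippet.

```python
import re

# Simple email pattern; real PII handling needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
FALLBACK = "I don't know based on the documents I have access to."


def redact_emails(text):
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def answer_or_abstain(scored_chunks, generate, min_score=0.5):
    """Generate only from chunks above the relevance threshold.

    scored_chunks is a list of (text, score) pairs; generate is any
    callable that turns a list of chunk texts into an answer string.
    """
    usable = [text for text, score in scored_chunks if score >= min_score]
    if not usable:
        # Nothing relevant was retrieved, so do not let the model guess.
        return FALLBACK
    return redact_emails(generate(usable))


def fake_generate(chunks):
    # Stand-in for the model call: echoes the first chunk.
    return f"Per the docs: {chunks[0]}"


out_of_scope = answer_or_abstain([("Unrelated text", 0.2)], fake_generate)
in_scope = answer_or_abstain(
    [("Contact jane.doe@example.com for refunds.", 0.9)], fake_generate
)
print(out_of_scope)
print(in_scope)
```

Abstaining before the model is called is cheaper and more predictable than asking the model to admit uncertainty after the fact, though the threshold itself should be set from eval data rather than guessed.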

And it needs observability. Log queries, retrieved chunks, and answers so you can debug failures, watch for drift as your data changes, and feed real misses back into the eval set. RAG isn’t ship-once—it’s a system you operate.
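
As a sketch of what that logging might look like, the snippet below writes one JSON record per interaction to any file-like sink. The field names are hypothetical, and an in-memory buffer stands in for a real log file or collector.

```python
import io
import json
import time
import uuid


def log_interaction(sink, query, chunk_ids, answer, latency_ms):
    """Append one JSON line describing a single RAG interaction."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "chunk_ids": list(chunk_ids),  # which passages reached the prompt
        "answer": answer,
        "latency_ms": latency_ms,
    }
    sink.write(json.dumps(record) + "\n")
    return record


sink = io.StringIO()  # stand-in for a log file or log shipper
log_interaction(sink, "What is the refund window?", ["policy-12"],
                "30 days.", 412.0)

logged = json.loads(sink.getvalue().splitlines()[0])
print(logged["query"], logged["chunk_ids"], logged["latency_ms"])
```

Storing chunk IDs alongside the query and answer is what makes a failure debuggable later: you can tell whether the right passage was never retrieved or was retrieved and then ignored. Records like these are also the raw material for new eval cases, subject to whatever PII policy applies to stored queries.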

From demo to dependable

The gap between a RAG prototype and a production system is exactly the discipline of AI engineering: retrieval pipelines, evaluation, deployment, and the LLMOps to keep quality high and cost predictable over time.

If you have a RAG demo that impresses in the room but isn’t ready for users, or a stalled pilot you want in production, that’s the work we do. Explore our services for AI engineering & LLMOps and for AI & ML solutions, or book a call to get a clear scope and fixed price before you build.