Designing a RAG Pipeline for Production
Chunking, hybrid retrieval, reranking, grounding with citations, and the evals that separate demo-ware from production.
Most "RAG" demos work on a Sunday afternoon and fail in production on Monday morning. The failure modes are predictable: bad chunking, lexical queries that embeddings can't handle, stale indexes, and no way to tell if the model is making things up. A production RAG system is a small set of unglamorous choices that add up.
Chunk with intent
Naive fixed-size chunking (say 500 tokens) loses structure. Prefer semantic chunking that respects the document: split at headings, keep code blocks together, and overlap by 10–20%. Include the heading chain (H1 > H2 > H3) in the chunk metadata so the LLM can anchor answers.
Hybrid retrieval
Dense embeddings are great at concepts. They're terrible at exact terms — error codes, IDs, SKUs. A BM25 lexical retriever catches what embeddings miss. Run both, combine the lists (reciprocal rank fusion is a solid default), and optionally rerank.
A small, well-built retrieval layer with a mediocre model beats a giant model with poor retrieval. Spend your budget here.
Ground with citations
Every answer should cite. Not just for the user — for you. Citations are the easiest signal to measure: "did the answer cite the right chunk?" is a cleaner eval than "does this sound right?"
Measure what matters
Build a golden set of 50–200 questions that reflect real usage. Score answers on:
- Groundedness — can each claim be traced to a cited chunk?
- Faithfulness — does the answer contradict the chunk?
- Helpfulness — does it actually answer the question?
Run it in CI. Block merges that regress.
Operating it
- Reindex on a schedule; treat indexes as build artifacts.
- Log retrieved chunks + answer for every query (with PII care) so you can debug in production.
- Have a "no answer" branch when retrieval confidence is low. Refusing is a feature.
What I pick
Postgres with pgvector for under ~10M chunks — one database is cheaper than two. Dedicated vector DBs (Qdrant, Pinecone) when you outgrow that. Hybrid retrieval with a small reranker, and evals that are the non-negotiable gate before deploy.