Hassan Raza

Designing a RAG Pipeline for Production

Chunking, hybrid retrieval, reranking, grounding with citations, and the evals that separate demo-ware from production.

13 min
LLM, RAG, AI Systems

Most "RAG" demos work on a Sunday afternoon and fail in production on Monday morning. The failure modes are predictable: bad chunking, lexical queries that embeddings can't handle, stale indexes, and no way to tell if the model is making things up. A production RAG system is a small set of unglamorous choices that add up.

Chunk with intent

Naive fixed-size chunking (say, 500 tokens per chunk) cuts across headings, lists, and code, so it loses structure. Prefer semantic chunking that respects the document: split at headings, keep code blocks together, and overlap adjacent chunks by 10–20%. Include the heading chain (H1 > H2 > H3) in the chunk metadata so the LLM can anchor answers.
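
As a rough illustration of heading-aware chunking, here is a minimal Python sketch. It assumes the source documents are Markdown with `#`-style headings and triple-backtick code fences; the names `Chunk` and `chunk_markdown` are made up for this example. It leaves out size limits and overlap, which a real chunker would still need.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker (assumed input format)


@dataclass
class Chunk:
    heading_chain: list[str]  # e.g. ["Guide", "Install"], stored as metadata
    text: str


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split at headings, never inside a fenced code block."""
    chunks: list[Chunk] = []
    chain: list[str] = []  # heading path of the section being read
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(list(chain), text))
        body.clear()

    for line in doc.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own chain
            level = len(match.group(1))
            del chain[level - 1:]  # drop headings at this depth or deeper
            chain.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading path, so the prompt can show "Guide > Install" next to the text instead of an anonymous 500-token window.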

Hybrid retrieval

Dense embeddings are great at concepts. They're terrible at exact terms — error codes, IDs, SKUs. A BM25 lexical retriever catches what embeddings miss. Run both, combine the lists (reciprocal rank fusion is a solid default), and optionally rerank.
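
Reciprocal rank fusion is short enough to sketch in full. The version below assumes each retriever returns a list of chunk IDs ordered best-first; `k=60` is the constant commonly used with RRF, not something tuned for any particular corpus, and the function name is just for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list of chunk IDs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more for ranking high, and for appearing in
            # more than one list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]  # hypothetical embedding results
bm25_hits = ["c", "a", "d"]   # hypothetical lexical results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF only looks at ranks, so there is no need to normalise BM25 scores against cosine similarities before combining them.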

A small, well-built retrieval layer with a mediocre model beats a giant model with poor retrieval. Spend your budget here.

Ground with citations

Every answer should cite. Not just for the user — for you. Citations are the easiest signal to measure: "did the answer cite the right chunk?" is a cleaner eval than "does this sound right?"
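
A citation check can be as simple as the sketch below. It assumes the model is prompted to cite chunks inline as bracketed IDs such as `[ops-12]`; that marker format and the function names are assumptions for illustration, not a standard.

```python
import re

# Assumed citation format: [chunk-id] with letters, digits, "_" or "-".
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def cited_ids(answer: str) -> set[str]:
    """Collect every chunk ID the answer cites."""
    return set(CITATION.findall(answer))


def citation_hit(answer: str, expected: set[str]) -> bool:
    """True if the answer cites at least one of the expected chunks."""
    return bool(cited_ids(answer) & expected)


answer = "Restart the worker [ops-12]. See also [faq-3]."
hit = citation_hit(answer, expected={"ops-12"})
```

Averaged over a labelled question set, this gives a hit rate that moves when retrieval or prompting changes, without needing a judge model.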

Measure what matters

Build a golden set of 50–200 questions that reflect real usage. Score answers on:

  • Groundedness: can each claim be traced to a cited chunk?
  • Faithfulness: is the answer consistent with the cited chunks, with nothing that contradicts them?
  • Helpfulness: does it actually answer the question?

Run it in CI. Block merges that regress.
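
One way to wire the gate, sketched under assumptions: each golden-set question has already been scored between 0 and 1 on the three metrics (by a judge model or by hand), and a baseline from the main branch is stored alongside. The `tolerance` value and the function names are illustrative.

```python
from statistics import mean

METRICS = ("groundedness", "faithfulness", "helpfulness")


def summarize(results: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over the golden set."""
    return {m: mean(r[m] for r in results) for m in METRICS}


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Metrics that dropped more than `tolerance` below the baseline."""
    return [m for m in METRICS if current[m] < baseline[m] - tolerance]


# Hypothetical per-question scores from one eval run.
results = [
    {"groundedness": 1.0, "faithfulness": 1.0, "helpfulness": 0.5},
    {"groundedness": 0.5, "faithfulness": 1.0, "helpfulness": 1.0},
]
baseline = {"groundedness": 0.9, "faithfulness": 1.0, "helpfulness": 0.75}
failed = regressions(summarize(results), baseline)
# In CI, exit non-zero when `failed` is non-empty so the merge is blocked.
```

The tolerance absorbs run-to-run noise from the judge; set it from the variance you observe rather than from this example.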

Operating it

  • Reindex on a schedule; treat indexes as build artifacts.
  • Log the retrieved chunks and the answer for every query (with PII redacted or access-controlled) so you can debug in production.
  • Have a "no answer" branch when retrieval confidence is low. Refusing is a feature.
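
The "no answer" branch might look like the sketch below. It assumes each retrieved hit carries a relevance score where higher is better (for example a reranker score between 0 and 1). The `min_score` threshold, the `Hit` type, and the `generate` callback are placeholders for whatever your stack provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    chunk_id: str
    score: float  # assumed: higher is better, e.g. a reranker score in [0, 1]


NO_ANSWER = "I couldn't find this in the indexed documents."


def select_context(hits: list[Hit], min_score: float = 0.5,
                   top_k: int = 5) -> list[Hit]:
    """Keep only confident hits, best first; an empty list means refuse."""
    confident = [h for h in hits if h.score >= min_score]
    confident.sort(key=lambda h: h.score, reverse=True)
    return confident[:top_k]


def answer(question: str, hits: list[Hit],
           generate: Callable[[str, list[Hit]], str]) -> str:
    context = select_context(hits)
    if not context:
        # Refuse instead of letting the model answer from weak context.
        return NO_ANSWER
    return generate(question, context)
```

Pick the threshold from the golden set: look at the scores of hits behind wrong answers and set `min_score` just above them.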

What I pick

Postgres with pgvector for up to roughly 10M chunks, because one database is cheaper to run than two. A dedicated vector DB (Qdrant, Pinecone) once you outgrow that. Hybrid retrieval with a small reranker, and evals as the non-negotiable gate before every deploy.
