2025

RAG That Actually Works in Production

RAG, LLM, pgvector, Evals

Problem

Support and product teams were drowning in repetitive questions that were already answered in our docs, tickets, and runbooks. Earlier RAG attempts were 'demo-great, production-bad': plausible-sounding answers, frequent hallucinations, and no trust signals such as citations.

Constraints

All data must stay inside our VPC
Answers must cite sources; unsourced answers are not allowed
Latency budget: time-to-first-token under 500 ms
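
To make the citation constraint concrete, here is a minimal sketch of a post-generation check. It assumes citations appear as bracketed 1-based indices such as [1] that point into the list of retrieved chunks; that format, the RetrievedChunk type, and the refusal text are all hypothetical and not taken from the system described here. A streaming system would need to apply the same idea incrementally.

```python
import re
from dataclasses import dataclass

# Assumed citation format: bracketed 1-based indices like [1], [2] that
# point into the list of retrieved chunks. Illustrative only.
CITATION_RE = re.compile(r"\[(\d+)\]")

REFUSAL = "I could not find a sourced answer to that."


@dataclass
class RetrievedChunk:
    source_url: str
    text: str


def enforce_citations(answer: str, retrieved: list[RetrievedChunk]) -> str:
    """Return the answer only if it cites at least one retrieved chunk and
    every citation index resolves; otherwise return the refusal message."""
    cited = [int(m) for m in CITATION_RE.findall(answer)]
    if not retrieved or not cited:
        return REFUSAL  # nothing retrieved, or an unsourced answer
    if any(i < 1 or i > len(retrieved) for i in cited):
        return REFUSAL  # citation points at a chunk that does not exist
    return answer
```

The check is deliberately strict: one dangling citation rejects the whole answer, which matches a rule that unsourced answers are not allowed.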

Architecture

We chunk documents semantically, respecting headings and code blocks, embed the chunks with a small Gemini embedding model, and store the vectors in pgvector next to the source records so joins are trivial. Retrieval is hybrid: approximate nearest-neighbor (ANN) search over the embeddings plus BM25 lexical search, after which a lightweight reranker selects the top K passages. The LLM layer streams tokens with citations, and a guardrail refuses to answer when the retrieved context is insufficient. Quality is enforced by an eval suite that runs in CI against a golden set of question/answer pairs.
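
As an illustration of the chunking step, the sketch below splits a document on headings while never splitting inside a code block. It assumes the sources are Markdown with # headings and triple-backtick fences, which this write-up does not state; the Chunk type and chunk_markdown function are hypothetical names, and a real chunker would also enforce size limits.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # Markdown code-fence marker
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Setup", "Linux"], kept as context
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def chunk_markdown(doc: str) -> list[Chunk]:
    """Start a new chunk at each heading, but never inside a fenced block."""
    chunks: list[Chunk] = []
    path: list[str] = []
    current = Chunk(heading_path=[])
    in_fence = False

    for line in doc.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            if current.text:
                chunks.append(current)
            level = len(match.group(1))
            # Keep ancestors above this level, then add the new heading.
            path = path[: level - 1] + [match.group(2).strip()]
            current = Chunk(heading_path=list(path))
        else:
            current.lines.append(line)

    if current.text:
        chunks.append(current)
    return chunks
```

Carrying the heading path with each chunk lets it be prepended to the text before embedding, so a passage under "Setup > Linux" keeps that context once it is separated from the page.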

Trade-offs

pgvector over a dedicated vector DB

Why: One system to operate; transactional joins with source records.

Cost: Indexes need careful tuning, and pgvector is not ideal at 100M+ vectors.
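
A rough sketch of what "joins are trivial" can look like in practice. The chunks and documents tables and their columns are hypothetical, not the actual schema. The hnsw index type, the vector_cosine_ops operator class, and the <=> cosine-distance operator are standard pgvector features; the parameter placeholders follow the named style used by psycopg.

```python
# Hypothetical schema: `documents` holds source records, `chunks` holds
# embedded passages. Table and column names are illustrative only.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# `<=>` is pgvector's cosine-distance operator; smaller means closer.
# The join pulls citation metadata in the same query as the vector search.
SEARCH_SQL = """
SELECT c.id, c.body, d.title, d.url,
       c.embedding <=> %(query_vec)s::vector AS distance
FROM chunks AS c
JOIN documents AS d ON d.id = c.document_id
ORDER BY c.embedding <=> %(query_vec)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(query_embedding: list[float], k: int = 8) -> dict:
    """Build the parameter dict expected by SEARCH_SQL."""
    return {"query_vec": to_vector_literal(query_embedding), "k": k}
```

With a psycopg cursor this would run as cur.execute(SEARCH_SQL, search_params(vec)). The index-tuning cost noted above shows up in choices such as the HNSW build parameters, which are left at their defaults here.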

Hybrid retrieval

Why: Dense embeddings find concepts; lexical finds exact terms (IDs, error codes).

Cost: Two retrievers to operate and tune.
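
The write-up does not say how the dense and lexical candidate lists are merged before reranking. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method actually used. The function name is hypothetical, and the smoothing constant k=60 is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[str]:
    """Merge ranked lists of document IDs into a single ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense = ["a", "b", "c"]    # hypothetical IDs from the ANN search
lexical = ["c", "a", "d"]  # hypothetical IDs from BM25
fused = reciprocal_rank_fusion([dense, lexical])
```

RRF uses only rank positions, so it avoids having to normalize cosine distances against BM25 scores, which sit on unrelated scales.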

Evals in CI

Why: Regressions are caught before they hit users.

Cost: The golden set must be maintained, and flaky (non-deterministic) generations are handled with pass thresholds.
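
A minimal sketch of threshold-based evals, assuming each golden case lists phrases the answer must contain and that a case passes when enough sampled generations contain them. The grading rule, the sample count, the 0.8 threshold, and the function names are all illustrative; the actual suite may score answers differently.

```python
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    question: str,
    must_contain: list[str],
    samples: int = 5,
) -> float:
    """Fraction of sampled answers that contain every expected phrase."""
    hits = 0
    for _ in range(samples):
        answer = generate(question).lower()
        if all(phrase.lower() in answer for phrase in must_contain):
            hits += 1
    return hits / samples


def check_golden_set(
    generate: Callable[[str], str],
    golden: list[dict],
    threshold: float = 0.8,
    samples: int = 5,
) -> list[str]:
    """Return the questions whose pass rate falls below the threshold."""
    failures = []
    for case in golden:
        rate = pass_rate(generate, case["question"], case["must_contain"], samples)
        if rate < threshold:
            failures.append(case["question"])
    return failures
```

In CI, a non-empty result from check_golden_set would fail the build. Sampling each question several times means a single unlucky generation does not block a merge, at the price of more model calls per run.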

Outcome

28% reduction in support volume within six weeks
Citations in every answer, with click-through back to source
Quality regressions caught by CI before production