RAG That Actually Works in Production
Problem
Support and product teams were drowning in repetitive questions that were already answered in our docs, tickets, and runbooks. Earlier RAG attempts were 'demo-great, production-bad': plausible answers, frequent hallucinations, no trust signals.
Constraints
- All data must stay inside our VPC
- Answers must cite sources; unsourced answers are not allowed
- Latency budget: time-to-first-token under 500ms
Architecture
We chunk documents semantically with awareness of headings and code blocks, embed with a small Gemini embedding model, and store vectors in pgvector next to the source records so joins are trivial. Retrieval is hybrid: ANN over embeddings + BM25 lexical, then a lightweight reranker picks the top K. The LLM layer streams tokens with citations, and a guardrail refuses when retrieved context is insufficient. Quality is enforced by an eval suite in CI against a golden set of question/answer pairs.
Trade-offs
Why: One system to operate; transactional joins with source records.
Cost: Careful index tuning; not ideal for 100M+ vectors.
Why: Dense embeddings find concepts; lexical finds exact terms (IDs, error codes).
Cost: Two retrievers to operate and tune.
Why: Regressions are caught before they hit users.
Cost: Golden set must be maintained; flaky generations handled by thresholds.
Outcome
- 28% reduction in support volume within six weeks
- Citations in every answer, with click-through back to source
- Quality regressions caught by CI before production