How an LLM answer is actually produced.
An LLM is not one box. Most people think of it as a single step, but in production it's a pipeline: tokenize, retrieve, compose, generate, stream. Each stage has its own trade-offs. This is the mental model I use when designing AI systems.
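To make the stages concrete, here is a minimal sketch of the pipeline as plain functions. Everything in it is a stand-in chosen for illustration: the whitespace tokenizer, the word-overlap retrieval, and the canned `generate` output are assumptions, not how any particular model or vector store behaves.

```python
from typing import Iterator, List


def tokenize(text: str) -> List[str]:
    # Stand-in tokenizer: whitespace split. Real models use subword tokenizers.
    return text.split()


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    # Toy retrieval: rank documents by how many query words they share.
    query_words = set(tokenize(query.lower()))
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(tokenize(doc.lower()))),
        reverse=True,
    )
    return ranked[:k]


def compose(query: str, context: List[str]) -> str:
    # Assemble the final prompt from retrieved context plus the question.
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


def generate(prompt: str) -> Iterator[str]:
    # Placeholder for the model call: yields a canned answer token by token.
    for token in tokenize("This is a placeholder answer."):
        yield token


def answer(query: str, corpus: List[str]) -> Iterator[str]:
    # The whole pipeline: retrieve, compose, then stream the generation.
    context = retrieve(query, corpus)
    prompt = compose(query, context)
    yield from generate(prompt)
```

Each function is a seam where you can measure latency and swap implementations without touching the others.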
Core concepts
Tokens, context windows, and cost
Models don't see characters; they see tokens. Your system prompt, retrieved context, previous messages, and the user's prompt all share a fixed context window. Input tokens plus output tokens drive latency and cost, so a good system trims the prompt aggressively and measures token counts on every request.
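One way to trim is to keep only the most recent conversation turns that fit in the input budget. The sketch below assumes a rough rule of thumb of about 4 characters per token for English text; `estimate_tokens` and `fit_to_budget` are hypothetical helpers, and a real system would count with the model's own tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token in English.
    # Use the model's real tokenizer when accuracy matters.
    return max(1, len(text) // 4)


def fit_to_budget(
    system: str,
    history: List[str],
    question: str,
    context_window: int,
    reserved_for_output: int,
) -> List[str]:
    """Return the newest history messages that fit in the remaining input budget."""
    budget = context_window - reserved_for_output
    budget -= estimate_tokens(system) + estimate_tokens(question)
    kept: List[str] = []
    # Walk from newest to oldest so recent turns survive trimming.
    for message in reversed(history):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Reserving output tokens up front keeps a long prompt from starving the answer.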
Retrieval is the product
The model is commoditized. The product is what you put in the prompt. That means chunking, embeddings, hybrid search, and reranking matter more than picking the shiniest model. A smaller model with great retrieval will usually beat a big model with poor retrieval.
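Hybrid search means merging a keyword ranking with a vector ranking. A common way to do that is reciprocal rank fusion (RRF), which scores each document as the sum of 1 / (k + rank) across the rankings. The sketch below assumes both retrievers have already returned ordered lists of document IDs; k = 60 is a conventional default and the IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["a", "b", "c"]  # hypothetical keyword (e.g. BM25) results
vector_hits = ["b", "c", "a"]   # hypothetical embedding-search results
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF uses only ranks, so it needs no score normalization between retrievers. A reranker can then reorder the top of the fused list.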
Streaming is UX
Token streaming is the difference between a tool that feels instant and a tool that feels broken. Time-to-first-token is a product metric. Measure it, budget for it, and render markdown incrementally as the tokens arrive.
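Measuring time-to-first-token can be as simple as wrapping the token iterator. This is a minimal sketch: `TimedStream` is a hypothetical wrapper, `fake_model_stream` stands in for a real streaming client, and the clock starts when iteration begins, which only approximates the moment the request was sent.

```python
import time
from typing import Iterable, Iterator, Optional


class TimedStream:
    """Wrap a token iterator and record time-to-first-token and total time."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self._tokens = tokens
        self.ttft_seconds: Optional[float] = None
        self.total_seconds: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        start = time.perf_counter()
        for token in self._tokens:
            if self.ttft_seconds is None:
                # First token has arrived: record the latency once.
                self.ttft_seconds = time.perf_counter() - start
            yield token
        self.total_seconds = time.perf_counter() - start


def fake_model_stream() -> Iterator[str]:
    # Stand-in for a streaming model call: slow first token, then fast ones.
    time.sleep(0.05)
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)


stream = TimedStream(fake_model_stream())
text = "".join(stream)  # a UI would render each token as it arrives
```

After the loop, `stream.ttft_seconds` and `stream.total_seconds` can go to your metrics system and be compared against the budget.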
Evals, not vibes
Production LLM systems are only as good as their evals. Curate a golden set of realistic queries, score answers for faithfulness and groundedness, and block regressions in CI. Anything else is vibes-based engineering.
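A regression gate can start very small. The sketch below scores a golden set and reports pass or fail against a threshold. The groundedness metric is a deliberately crude word-overlap proxy and an assumption made for illustration; real setups typically use an LLM judge or an entailment model. `GoldenCase` and `run_evals` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    query: str
    context: str  # the retrieved text the answer should be grounded in


def groundedness(answer: str, context: str) -> float:
    # Crude proxy (assumption): share of answer words that appear in the context.
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    context_words = set(context.lower().split())
    return sum(word in context_words for word in answer_words) / len(answer_words)


def run_evals(
    cases: List[GoldenCase],
    system: Callable[[str, str], str],
    threshold: float = 0.8,
) -> bool:
    """Return True when the mean groundedness score meets the threshold."""
    if not cases:
        raise ValueError("golden set is empty")
    scores = [groundedness(system(c.query, c.context), c.context) for c in cases]
    return sum(scores) / len(scores) >= threshold
```

In CI, a wrapper script could exit non-zero when `run_evals` returns False, which blocks the merge.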