How LLM systems actually work — Hassan Raza


How an LLM answer is actually produced.

An LLM is not one box. In production it's a pipeline — and every stage has its own trade-offs. This is the mental model I use when designing AI systems.



Most people think of the LLM as a single step. In production, a request passes through five stages: tokenize, retrieve, compose, generate, stream.
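To make the stages concrete, here is a minimal sketch of the pipeline as plain functions. Everything in it is a stand-in I made up for illustration: the DOCS corpus, the word-overlap retriever, and the stub generate function that echoes a passage instead of calling a model. A real system would swap each one for the real component.

```python
from typing import Iterator

# Toy corpus standing in for a real document store (hypothetical data).
DOCS = [
    "Tokens are the unit a model processes.",
    "Streaming sends output to the user as it is generated.",
    "Retrieval finds passages relevant to the query.",
]

def tokenize(text: str) -> list[str]:
    # Placeholder: real systems use the model's own subword tokenizer.
    return text.lower().replace("?", " ").replace(".", " ").split()

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by word overlap with the query; a stand-in for vector search.
    q = set(tokenize(query))
    ranked = sorted(DOCS, key=lambda d: len(q & set(tokenize(d))), reverse=True)
    return ranked[:k]

def compose(query: str, passages: list[str]) -> str:
    # Build the final prompt: instruction, retrieved context, then the question.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> Iterator[str]:
    # Stub model: echoes the top passage word by word. A real model call goes here.
    first_passage = prompt.splitlines()[1][2:]  # drop the leading "- "
    for word in first_passage.split():
        yield word + " "

def answer(query: str) -> Iterator[str]:
    passages = retrieve(query)          # tokenize + retrieve
    prompt = compose(query, passages)   # compose
    yield from generate(prompt)         # generate + stream, chunk by chunk
```

Calling "".join(answer("What is streaming?")) walks all five stages. Because answer is a generator, the caller can also forward each chunk to the user as it arrives.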

Core concepts

Embeddings: vector representations used for semantic retrieval.

Tokens: the unit models process; tokens drive context limits and cost.

Prompting: explicit instructions that shape model behavior and output quality.

RAG (Retrieval Augmented Generation): combining retrieval + LLM generation for grounded answers.

Tokens, context windows, and cost

Models don't see characters; they see tokens. Your prompt, system prompt, retrieved context, and previous messages all share a fixed context window. Tokens in plus tokens out drive latency and cost, so a good system trims context aggressively and measures token usage per request.
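A rough sketch of what trimming and measuring can look like. The four-characters-per-token estimate is an assumption that only loosely holds for English text, and the prices are parameters rather than any vendor's real rates. The function names are mine, not from a library.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    # Swap in the model's real tokenizer for billing-grade numbers.
    return max(1, len(text) // 4)

def fit_to_window(system: str, history: list[str], query: str,
                  window: int, reserve_for_output: int) -> list[str]:
    """Keep the newest history messages that fit in the remaining budget."""
    budget = (window - reserve_for_output
              - estimate_tokens(system) - estimate_tokens(query))
    kept: list[str] = []
    for message in reversed(history):      # walk newest first
        cost = estimate_tokens(message)
        if cost > budget:
            break                          # older messages are dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept))            # restore chronological order

def estimate_cost(tokens_in: int, tokens_out: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    # Input and output tokens are usually priced separately.
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k
```

The design choice here is to reserve room for the output first, then spend what is left on history. Otherwise a long conversation can silently squeeze out the answer.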

Retrieval is the product

The model is commoditized. The product is what you put in the prompt. That means chunking, embeddings, hybrid search, and reranking matter more than picking the shiniest model. A smaller model with great retrieval will usually beat a big model with poor retrieval.
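As an illustration of the hybrid search piece, the sketch below fuses a vector ranking and a keyword ranking with reciprocal rank fusion. It assumes the embeddings already exist (the vectors in the usage example are made up), uses a simple word-match count where a real system would likely use BM25, and leaves out chunking and reranking entirely.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query: str, text: str) -> int:
    # Count of query words present in the chunk; a stand-in for BM25.
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per item.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

def hybrid_search(query: str, query_vec: list[float],
                  chunks: dict[str, tuple[str, list[float]]]) -> list[str]:
    # chunks maps a chunk id to (text, embedding).
    by_vector = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c][1]),
                       reverse=True)
    by_keyword = sorted(chunks, key=lambda c: keyword_score(query, chunks[c][0]),
                        reverse=True)
    return rrf([by_vector, by_keyword])

# Usage with made-up two-dimensional embeddings:
chunks = {
    "a": ("refund policy for orders", [1.0, 0.0]),
    "b": ("shipping times by region", [0.0, 1.0]),
    "c": ("refund timelines and policy exceptions", [0.9, 0.1]),
}
ranked = hybrid_search("refund policy", [1.0, 0.0], chunks)
```

Rank fusion is convenient because it only needs the order from each retriever, so scores on different scales never have to be normalized against each other.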

Streaming is UX

Token streaming is the difference between a tool that feels instant and a tool that feels broken. Time-to-first-token is a product metric. Measure it, budget for it, and render the markdown incrementally as tokens arrive.
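One way to measure time-to-first-token is to wrap whatever stream the model client returns. In this sketch fake_model_stream is a hypothetical stand-in for that client, and on_chunk is whatever pushes text to the UI. Both names are mine.

```python
import time
from typing import Callable, Iterable, Iterator

def fake_model_stream(text: str, delay_s: float = 0.01) -> Iterator[str]:
    # Hypothetical stand-in for a streaming model client.
    for word in text.split():
        time.sleep(delay_s)
        yield word + " "

def stream_with_metrics(chunks: Iterable[str],
                        on_chunk: Callable[[str], object]) -> dict:
    """Forward each chunk as it arrives and record time-to-first-token."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start   # first chunk has arrived
        on_chunk(chunk)                          # hand off to the UI right away
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "chunks": count}
```

Passing print or a websocket send function as on_chunk gives the user text immediately, while the returned dictionary can be logged and compared against a latency budget.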

Evals, not vibes

Production LLM systems are only as good as their evals. Curate a golden set of realistic queries, score answers for faithfulness and groundedness, and block regressions in CI. Anything else is vibes-based engineering.
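A minimal sketch of an eval gate, under loud assumptions: the groundedness score below is a crude word-overlap proxy I chose for brevity, not a real faithfulness metric, and the golden-set fields (query, context) are hypothetical. Production setups often use an LLM judge or an entailment model for scoring.

```python
from typing import Callable

def groundedness(answer: str, context: str) -> float:
    # Crude proxy (assumption): share of answer words that appear in the context.
    answer_words = answer.lower().split()
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)

def run_evals(golden_set: list[dict],
              system: Callable[[str, str], str],
              threshold: float = 0.8) -> dict:
    # system(query, context) returns the answer text for one golden case.
    scores = [
        groundedness(system(case["query"], case["context"]), case["context"])
        for case in golden_set
    ]
    mean = sum(scores) / len(scores)
    return {"mean": mean, "passed": mean >= threshold, "scores": scores}

# Usage with a one-item golden set and two fake systems:
golden = [{"query": "How long do refunds take?", "context": "refunds take five days"}]
grounded = run_evals(golden, lambda q, c: c)
ungrounded = run_evals(golden, lambda q, c: "refunds are instant always")
```

In CI, the job would fail when the passed field is false, for example by asserting on it or exiting with a non-zero status, so a regression blocks the merge.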