Over the past couple of years, inference has evolved from "the model just generates tokens" into one of the most complex engineering systems in AI. While you wait 2-3 seconds for a response, dozens of mechanisms are already working behind the scenes: tokenization, embeddings, attention, KV cache, request routing, retrieval, batching, memory management, and entire optimization pipelines.
In one of our earlier articles, we explained the core fundamentals of inference: key concepts, optimization techniques, and hardware trends. But that was a year ago, and the focus of the field is shifting extremely fast.
Inference now is more about system orchestration: a coordinated runtime system where all elements work together to produce an answer under latency and cost constraints.
Today we're going to put all the pieces together into one pipeline. You'll see the full path from tokens to generated answers, and we'll answer the most interesting question: what actually happens in the 2.5 seconds between your prompt and the model's response?
There is more going on there than most people realize.

But first, watch an episode of Attention Span, inspired by Demis Hassabis and OpenAI's incredible achievement in math: AI for Science Just Had Its ChatGPT Moment (and Scientists Aren't Extinct)

In today's episode:
- LLM inference in two phases: prefill and decode
- Prefill unpacked
  - The first layer: Tokens as the runtime currency
  - Embeddings: From token IDs to meaningful geometry
  - Attention: Where representations become context and prefill meets decode
- What's behind decode? The role of attention and KV cache
- Context is not only inside the model
- Inference optimization: batching, chunking, and parallelism
- Why modern inference is system orchestration
- Why attention is not the same as understanding
- Sources and further reading

LLM inference in two phases: prefill and decode
When you write a prompt and send it to a model, a surprisingly complex pipeline starts running. But at the core, the process has two main stages: first, the model processes your request; then, it generates the response. One stage flows directly into the other:
- Prefill: the first stage, when the model reads the entire prompt and builds an understanding of the context. Since all prompt tokens are already known, this step can be heavily parallelized and runs very fast on the GPU. Prefill then flows into decode.
- Decode: the model generates the response one token at a time. Each new token depends on the previous ones, so this stage is mostly sequential and slower.

The first output token usually takes the longest, because the model has to process the whole prompt before it can emit anything. After that, generation becomes a steady stream of tokens.
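To make the two phases concrete, here is a minimal sketch of the control flow. It is not a real model: `toy_next_token`, `VOCAB_SIZE`, and `EOS` are hypothetical stand-ins for a forward pass, and the "cache" is just a list of token IDs where a real engine would keep per-token attention state (the KV cache).

```python
from typing import List, Tuple

VOCAB_SIZE = 50  # hypothetical toy vocabulary size
EOS = 0          # hypothetical end-of-sequence token ID


def toy_next_token(cache: List[int]) -> int:
    """Stand-in for a forward pass: derives the next token from the cached context."""
    return (sum(cache) * 31 + len(cache)) % VOCAB_SIZE


def prefill(prompt_ids: List[int]) -> Tuple[List[int], int]:
    """Process the whole prompt in one pass; return (cache, first output token)."""
    cache = list(prompt_ids)  # all prompt tokens are known, so this can happen at once
    return cache, toy_next_token(cache)


def decode(cache: List[int], first_token: int, max_new_tokens: int) -> List[int]:
    """Generate tokens one at a time; each step depends on everything before it."""
    output = [first_token]
    while len(output) < max_new_tokens and output[-1] != EOS:
        cache.append(output[-1])              # the new token joins the context
        output.append(toy_next_token(cache))  # one sequential step per token
    return output


def generate(prompt_ids: List[int], max_new_tokens: int = 8) -> List[int]:
    cache, first = prefill(prompt_ids)
    return decode(cache, first, max_new_tokens)
```

Calling `generate([3, 7, 11], max_new_tokens=5)` runs one prefill pass followed by up to four sequential decode steps, which is the shape of the workload regardless of how large the real model is.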
When many users send requests at once, inference systems try to balance several goals:
- low latency, meaning fast responses
- high throughput, to serve many users efficiently
- efficient use of GPU memory and good GPU utilization

Speaking of latency, we need to distinguish between two important metrics:

| Metric | What it measures | Main stage | What it affects |
|---|---|---|---|
| Time to First Token (TTFT) | The time between sending a prompt and receiving the first generated token | Mostly prefill latency | How fast the model starts responding |
| Time per Output Token (TPOT) | The average time required to generate each token after the first one | Mostly decode latency | How fast the response streams after generation begins |

So, total latency is approximately: TTFT + (TPOT × number of output tokens).
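As a rough illustration, both metrics can be derived from timestamps and plugged into the approximation above. The function names and the trace below are hypothetical, chosen only to show the arithmetic.

```python
def measure_latency(t_sent: float, token_times: list) -> tuple:
    """Return (ttft, tpot) in seconds from a send time and per-token arrival times."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - t_sent
    later = len(token_times) - 1  # tokens after the first one
    tpot = (token_times[-1] - token_times[0]) / later if later else 0.0
    return ttft, tpot


def estimate_total_latency(ttft: float, tpot: float, n_output_tokens: int) -> float:
    """The approximation from the text: TTFT + TPOT * number of output tokens."""
    return ttft + tpot * n_output_tokens


# Hypothetical trace: prompt sent at t=0.0 s, first token at 0.5 s, then one token every 20 ms.
arrivals = [0.5 + 0.02 * i for i in range(100)]
ttft, tpot = measure_latency(0.0, arrivals)
total = estimate_total_latency(ttft, tpot, n_output_tokens=100)
```

With these made-up numbers the estimate comes to about 2.5 seconds, most of it spent in decode even though each individual token is fast.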
On the hardware side, the key detail is that prefill is mostly compute-bound, while decode is mostly memory-bandwidth-bound.
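A back-of-the-envelope sketch shows where this difference comes from. It assumes a single square weight matrix stored in 16-bit precision and counts only the bytes of weights read, ignoring activations and KV cache traffic. `MACHINE_BALANCE` is a hypothetical figure, not the spec of any particular GPU.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one d_model x d_model matrix multiply."""
    flops = 2 * n_tokens * d_model * d_model             # one multiply and one add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model  # weights are read once per pass
    return flops / weight_bytes


def bound_by(intensity: float, machine_balance: float) -> str:
    """Compare against the hardware's FLOPs-per-byte balance point (roofline-style)."""
    return "compute" if intensity >= machine_balance else "memory bandwidth"


MACHINE_BALANCE = 300.0  # hypothetical accelerator: peak FLOP/s divided by memory bytes/s

prefill_ai = arithmetic_intensity(n_tokens=2048, d_model=4096)  # whole prompt in one pass
decode_ai = arithmetic_intensity(n_tokens=1, d_model=4096)      # one new token per pass
```

Under these assumptions a 2,048-token prefill performs thousands of operations per byte of weights loaded, while a single decode step performs about one, so the same weights have to be streamed from memory again for every generated token.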
But why does each phase use the GPU differently? To understand that, and how systems can be optimized for efficiency and lower GPU usage, we need to look at how all the LLM workflow components (tokenization, embeddings, attention, and others) are distributed across prefill and decode.
There's much more interesting stuff behind this pipeline than just a sequence of steps for processing text and generating responses.
Prefill unpacked
The first layer: Tokens as the runtime currency
Let's start from the very beginning. Before a model can process and generate anything, text gets broken into tokens. This is the job of tokenization: a tokenizer splits raw text into smaller pieces, which are then converted into numerical IDs. Depending on the tokenizer, a token can be a whole word, part of a word, punctuation, whitespace, or even a byte sequence. The aim is for tokens to be small enough to generalize, yet meaningful enough to preserve structure and semantics. In production, tokenization is effectively a learned compression layer sitting between human language and GPU compute.
However, this part of the workflow is not only about counting tokens. The way text gets split defines almost everything about modern AI systems: final sequence lengths, context limits, latency, memory usage, throughput, and even pricing.
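The sketch below illustrates the idea with a toy greedy longest-match tokenizer. It is not how any production tokenizer works: `TOY_VOCAB` and `UNK_BASE` are hypothetical, and real vocabularies are learned from data (for example with byte-pair encoding) rather than written by hand.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries from data.
TOY_VOCAB = {"token": 0, "ization": 1, "izer": 2, " ": 3, "the": 4,
             "s": 5, "un": 6, "believ": 7, "able": 8}
UNK_BASE = 1000  # unknown characters fall back to 1000 + Unicode code point


def tokenize(text: str) -> list:
    """Greedy longest-match split of `text` into (piece, id) pairs."""
    max_len = max(len(piece) for piece in TOY_VOCAB)
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                pieces.append((piece, TOY_VOCAB[piece]))
                i += length
                break
        else:
            # No vocabulary entry matched: emit a single-character fallback token.
            pieces.append((text[i], UNK_BASE + ord(text[i])))
            i += 1
    return pieces


ids = [token_id for _, token_id in tokenize("the tokenizers")]
```

Here a 14-character string becomes 5 tokens. With a different vocabulary the same text could cost more or fewer tokens, which is why the choice of tokenizer feeds directly into sequence length, latency, and price.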
Moreover, not all tokens are equal. A system needs to "understand" exactly which kinds of tokens flow through it. An inference pipeline can involve the following token types, which behave very differently:
- Input tokens are relatively cheap because models process them mostly in parallel during the prefill stage.
- Output tokens are more expensive because generation is sequential: the model predicts one token at a time. They belong to the decode stage.
- Reasoning tokens can silently multiply compute usage by generating long internal chains of thought before the final answer appears.
- Cached tokens reduce cost by reusing previously processed context.
- Retrieval and tool-use tokens often dominate agentic systems because every loop adds more context back into the window.
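One way to see how these token types add up is a small accounting sketch for an agent loop. Everything here is an assumption made for illustration: the relative prices in `PRICE_PER_1K` are invented, and the model of the loop (each step re-sends the growing context, and all earlier context hits the cache) is a simplification.

```python
from dataclasses import dataclass

# Hypothetical relative prices per 1,000 tokens; real pricing varies by provider and model.
PRICE_PER_1K = {"fresh_input": 1.0, "cached": 0.25, "output": 4.0, "reasoning": 4.0}


@dataclass
class TokenUsage:
    fresh_input: int = 0  # new prompt tokens processed during prefill
    cached: int = 0       # prompt tokens whose earlier computation is reused
    output: int = 0       # visible answer tokens produced during decode
    reasoning: int = 0    # hidden chain-of-thought tokens, also produced during decode

    def cost(self) -> float:
        return sum(getattr(self, kind) * price / 1000 for kind, price in PRICE_PER_1K.items())


def agent_loop_usage(base_prompt: int, tool_result: int, steps: int, output_per_step: int) -> TokenUsage:
    """Each loop re-sends the growing context; earlier context is assumed to hit the cache."""
    usage, context = TokenUsage(), 0
    for step in range(steps):
        # After the first step, the new tokens are the previous output plus a tool result.
        fresh = base_prompt if step == 0 else tool_result + output_per_step
        usage.cached += context
        usage.fresh_input += fresh
        usage.output += output_per_step
        context += fresh
    return usage


usage = agent_loop_usage(base_prompt=1000, tool_result=500, steps=3, output_per_step=100)
```

Even in this three-step toy run, context that is fed back into the window (2,200 fresh plus 2,600 cached tokens) far outweighs the 300 generated tokens, which is the pattern the last bullet describes.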

This strongly influences how people design AI systems. A long conversation, a RAG pipeline, or an autonomous agent is now fundamentally a token-management problem. The smartest systems appear to be the ones "deciding" which tokens are actually worth processing, storing, retrieving, or generating in the first place.
Tokenization happens before inference itself starts, but efficient tokenization, and working with only the tokens that are actually needed, is one of the directions for optimizing compute and memory use.
Tokens are what the input consists of. Now let's look at how they start to come "alive" inside the model.
Embeddings: From token IDs to meaningful geometry
After tokenization, the system only has token IDs: integers like 14382 or 5021. On their own they are useless to the model; their meaning has to be reconstructed first. In AI, this meaning is hidden in geometry.
An embedding layer maps every token ID to a dense vector, a learned coordinate in a high-dimensional space. The model then learns relationships between these representations through distance and direction. Similar concepts end up near each other, and this is the key to generalization (like generalizing from "cat" to "dog" or from "room" to "bedroom") without memorizing every possible sentence individually.
Technically, this happens through an embedding matrix: a trainable lookup table where each token maps to a vector. During training, those initially random vectors organize into a semantic space where patterns emerge naturally.
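A minimal sketch of the lookup and of "nearness" in this space follows. The matrix is hypothetical: its four rows are hand-picked to mimic what a trained table might look like, whereas real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 4-dimensional embedding matrix, indexed by token ID.
# Values are hand-picked to mimic a trained table; real rows are learned and much wider.
EMBEDDING_MATRIX = [
    [0.9, 0.8, 0.1, 0.0],  # ID 0: "cat"
    [0.8, 0.9, 0.2, 0.1],  # ID 1: "dog"
    [0.1, 0.0, 0.9, 0.8],  # ID 2: "room"
    [0.0, 0.2, 0.8, 0.9],  # ID 3: "bedroom"
]


def embed(token_ids):
    """Embedding lookup: each token ID selects one row of the matrix."""
    return [EMBEDDING_MATRIX[i] for i in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, dog, room = embed([0, 1, 2])
cat_dog = cosine_similarity(cat, dog)
cat_room = cosine_similarity(cat, room)
```

In this toy table, `cat_dog` is close to 1 while `cat_room` is close to 0, so "cat" sits near "dog" and far from "room". That is the kind of geometric relationship a trained embedding matrix encodes.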
Since models also need to know the order of tokens in the sequence, positional encodings are used to inject the order directly into the vectors. Many systems use a technique called RoPE (Rotary Position Embedding), which rotates the query and key vectors inside the attention layers by an angle that depends on token position, allowing attention to track the relative distance between tokens efficiently. This concrete geometry is finally what the network can reason over.
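The relative-distance property can be checked with a small sketch. It assumes the interleaved-pair formulation with the commonly used base of 10000; some implementations pair dimensions differently. The vectors `q` and `k` are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # earlier pairs rotate faster, later pairs slower
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]  # hypothetical query vector
k = [1.0, 0.4, -0.6, 0.9]  # hypothetical key vector

near = dot(rope(q, 5), rope(k, 3))          # positions 5 and 3, distance 2
shifted = dot(rope(q, 105), rope(k, 103))   # same distance, 100 positions later
```

Up to floating-point error, `near` and `shifted` are equal: the attention score between a rotated query and a rotated key depends only on how far apart the two tokens are, not on where they sit in the sequence.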
Only after this step does the real computation begin:
Attention: Where representations become context and prefill meets decode