Paper Notes: vLLM PagedAttention

March 27, 2026 · Jiangneng Li · 3 min read

Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)

Executive Summary

PagedAttention revolutionizes Large Language Model (LLM) inference by applying operating system virtual memory concepts to KV cache management. Instead of allocating contiguous GPU memory for a sequence’s maximum potential length (which causes massive external and internal fragmentation), PagedAttention divides the KV cache into fixed-size Physical Blocks (e.g., storing 16 tokens each) and maps them dynamically via a Block Table.
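The mapping can be pictured with a minimal Python sketch. The class names (`BlockTable`, `BlockAllocator`) and the free-list allocator are illustrative stand-ins, not vLLM's actual internals; only the block-size arithmetic follows the paper.

```python
# Minimal sketch of block-table indirection, assuming 16-token physical blocks.
# BlockTable / BlockAllocator are illustrative names, not vLLM's real classes.
BLOCK_SIZE = 16

class BlockAllocator:
    """Hands out free physical block ids from a simple free list."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class BlockTable:
    """Maps a sequence's logical block index to a physical block id."""
    def __init__(self):
        self.physical_ids = []  # logical block i -> physical block id

    def block_for_token(self, token_pos):
        return self.physical_ids[token_pos // BLOCK_SIZE]

# A 20-token sequence needs ceil(20 / 16) = 2 physical blocks; only the
# last block is partially filled, so waste is bounded by one block.
alloc = BlockAllocator(num_blocks=8)
table = BlockTable()
num_tokens = 20
for _ in range((num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE):
    table.physical_ids.append(alloc.allocate())
```

Because each logical block resolves through the table, the two physical blocks need not be contiguous in GPU memory, which is exactly what eliminates external fragmentation.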

Core Mechanisms & Critical Insights

1. Memory Management & Copy-on-Write (CoW)

To enable highly efficient memory sharing (e.g., multiple generated sequences sharing the same system prompt), PagedAttention implements a strict Reference Counting (ref_count) mechanism at the physical block level.

  • Shared Pointers: Multiple logical blocks from different sequences can map to the exact same physical block. When this happens, the physical block’s ref_count is incremented.
  • Copy-on-Write (CoW) Execution: When a sequence generates a new token and attempts to append it to the current physical block, the system first checks the ref_count.
    • Trigger Condition: If ref_count > 1 (the block is shared), the sequence may not write in place. Instead, CoW is triggered: the system allocates a brand-new physical block, copies the existing historical tokens into it, decrements the original block’s ref_count, updates the sequence’s Block Table to point at the copy, and finally writes the new token into the newly copied block.
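The trigger logic above can be sketched in a few lines of Python. This is a hedged toy model, not vLLM's implementation: `tokens` stands in for the real KV tensors, and the helper names are invented for illustration.

```python
# Toy model of block-level copy-on-write driven by ref_count.
# PhysicalBlock / append_token are illustrative, not vLLM's real API.
BLOCK_SIZE = 16

class PhysicalBlock:
    def __init__(self):
        self.tokens = []      # stand-in for the block's KV tensors
        self.ref_count = 0

def append_token(seq_blocks, token, allocate_block):
    """Append a token to a sequence's last block, copying first if shared."""
    last = seq_blocks[-1]
    if len(last.tokens) == BLOCK_SIZE:
        # Last block is full: start a fresh, exclusively owned block.
        last = allocate_block()
        last.ref_count = 1
        seq_blocks.append(last)
    elif last.ref_count > 1:
        # Shared block: copy-on-write. Copy the history, drop our
        # reference, and retarget this sequence's block table at the copy.
        copy = allocate_block()
        copy.tokens = list(last.tokens)
        copy.ref_count = 1
        last.ref_count -= 1
        seq_blocks[-1] = copy
        last = copy
    last.tokens.append(token)

# Two sequences share one prompt block (ref_count == 2).
shared = PhysicalBlock()
shared.tokens = ["The", "quick", "brown", "fox", "jumps"]
shared.ref_count = 2
seq_a = [shared]
seq_b = [shared]

append_token(seq_a, "over", PhysicalBlock)  # triggers CoW for seq_a only
```

After the append, `seq_a` owns a private copy holding six tokens, while `seq_b` still references the untouched original, whose ref_count has dropped back to 1.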

2. Hardware Bottleneck Dichotomy: Prefill vs. Decoding

The system must distinctly separate these two phases because they stress completely different physical hardware units on the GPU:

  • Prefill Phase (Strictly Compute-Bound): To process the initial prompt (e.g., 1,000 tokens), the model must compute the Q, K, and V for all tokens. Because of the multi-layer Transformer architecture, every token must perform an Attention calculation with all preceding tokens to generate its distinct output ($O$) before passing through the FFN to the next layer. This results in a massive $O(N^2)$ Dense Matrix-Matrix Multiplication (GEMM). The GPU’s memory bandwidth is sufficient, but the Tensor Cores (ALUs) hit their maximum capacity. (This is where FlashAttention steps in to optimize).

  • Decoding Phase (Strictly Memory-Bound): During autoregressive generation, the model predicts only one token at a time. The arithmetic operation is a tiny Matrix-Vector Multiplication (GEMV). However, to compute this single step, the GPU must fetch the entire historical KV cache from the global HBM into the SRAM. The ALUs sit idle waiting for data to arrive. Thus, decoding speed is strictly bottlenecked by GPU memory bandwidth.
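A back-of-envelope roofline calculation makes the dichotomy concrete. The figures below are illustrative approximations (a 7B-class model in fp16 with d_model = 4096 and 32 layers, A100-like peak numbers), not measurements.

```python
# Rough roofline comparison of prefill (GEMM) vs decode (GEMV).
# All constants are illustrative approximations, not benchmarks.
bytes_per_param = 2           # fp16
d, layers = 4096, 32
peak_flops = 312e12           # A100 fp16 Tensor Core peak (FLOP/s)
peak_bw = 2.0e12              # A100-class HBM bandwidth (bytes/s)
ridge = peak_flops / peak_bw  # intensity needed to be compute-bound (~156)

# Rough dense-weight count per forward pass (attention ~4d^2 + FFN ~8d^2).
params = 12 * layers * d * d

# Decode: one token -> GEMV. ~2 FLOPs per weight element, but every
# weight byte must still be fetched from HBM.
decode_intensity = (2 * params) / (params * bytes_per_param)   # 1 FLOP/byte

# Prefill with N tokens reuses each fetched weight N times -> GEMM.
N = 1000
prefill_intensity = (2 * params * N) / (params * bytes_per_param)  # N FLOP/byte

print(f"ridge point:       ~{ridge:.0f} FLOP/byte")
print(f"decode intensity:   {decode_intensity:.0f} FLOP/byte (memory-bound)")
print(f"prefill intensity:  {prefill_intensity:.0f} FLOP/byte (compute-bound)")
```

Decode sits at roughly 1 FLOP/byte, two orders of magnitude below the ridge point, while a 1,000-token prefill sits far above it: the same weights, two entirely different bottlenecks.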

3. Beam Search & Mid-Block Divergence

While architectural diagrams often simplify Beam Search (or parallel decoding) by showing sequences branching perfectly at the boundary of a block, the engineering reality is much more granular.

  • Token-Level Divergence: Sequences branch at the exact token level, which almost always happens right in the middle of a physical block.
  • CoW Resilience: The Copy-on-Write mechanism seamlessly handles this. If a beam diverges at the 5th token of a 16-token block, the CoW mechanism will copy those 5 tokens into a new physical block, and the new sequence will continue appending its unique 6th token into the new block, leaving the original shared block perfectly intact for the other beams.
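The fork-then-diverge pattern can be sketched as follows. `fork` and `append` are hypothetical helpers invented for this illustration; the point is only the ref_count bookkeeping described above.

```python
# Illustrative sketch of mid-block beam divergence under copy-on-write.
# The beam forks after 5 tokens, well inside a 16-token block.
BLOCK_SIZE = 16

class Block:
    def __init__(self, tokens=None):
        self.tokens = list(tokens or [])
        self.ref_count = 1

def fork(parent_blocks):
    """Fork a beam: share every physical block, bumping its ref_count."""
    for b in parent_blocks:
        b.ref_count += 1
    return list(parent_blocks)  # new block table, same physical blocks

def append(blocks, token):
    last = blocks[-1]
    if last.ref_count > 1 and len(last.tokens) < BLOCK_SIZE:
        copy = Block(last.tokens)   # CoW: duplicate the shared history
        last.ref_count -= 1
        blocks[-1] = copy           # retarget this beam's block table
        last = copy
    last.tokens.append(token)

beam_a = [Block([1, 2, 3, 4, 5])]  # 5 tokens into a 16-token block
beam_b = fork(beam_a)              # both beams share the block (ref_count == 2)

append(beam_a, 6)  # beam A copies the 5 tokens, then writes its own 6th
append(beam_b, 7)  # the original block is now exclusive to beam B again
```

Note that beam B never pays the copy cost: once beam A's CoW drops the ref_count back to 1, beam B writes into the original block in place.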

System Impact

By combining dynamic block allocation, precise reference counting, and Copy-on-Write, PagedAttention achieves near-zero memory waste (less than 4% internal fragmentation in the final block). This fundamentally shifts the LLM inference paradigm, allowing batch sizes to scale significantly higher and dramatically improving overall system throughput.

Author: Jiangneng Li, PhD at NTU researching database systems, Data+AI, and multimedia data analytics.