<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Blog | Jiangneng's Homepage</title><link>https://www.jiangnengli.com/post/</link><atom:link href="https://www.jiangnengli.com/post/index.xml" rel="self" type="application/rss+xml"/><description>Blog</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><image><url>https://www.jiangnengli.com/media/icon_hu_37c904991c0d686.png</url><title>Blog</title><link>https://www.jiangnengli.com/post/</link></image><item><title>Paper Notes: Dynamic Memory Compression</title><link>https://www.jiangnengli.com/post/llm-infra-kv-cache-reduction/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-kv-cache-reduction/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;p&gt;Dynamic Memory Compression (DMC) optimizes LLM inference by allowing the model to autonomously decide when to merge redundant token representations in the KV cache based on learned contextual importance. To achieve this, the algorithm requires &amp;ldquo;retrofitting&amp;rdquo;—fine-tuning the pre-trained LLM on a fraction of its original data to teach the attention mechanism this dynamic pooling behavior. Consequently, while it drastically reduces memory footprint, it fundamentally alters the original model weights, making it incompatible with &amp;ldquo;training-free,&amp;rdquo; plug-and-play inference engines.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;The core limitation of DMC is its dependency on &lt;strong&gt;2% of the original pre-training data&lt;/strong&gt; to train the merging mechanism. This is a non-trivial cost — unless every model provider commits to this retrofitting step, adoption remains impractical at scale. The training overhead makes it fundamentally different from training-free approaches like quantization or eviction-based methods.&lt;/p&gt;
&lt;p&gt;This concern has been echoed in practice: a similar merging request was raised in the vLLM project, where it was pointed out that the required training data makes the approach too expensive for general-purpose deployment.&lt;/p&gt;</description></item><item><title>Paper Notes: Speculative Decoding</title><link>https://www.jiangnengli.com/post/llm-infra-speculative-decoding/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-speculative-decoding/</guid><description>&lt;p&gt;&lt;strong&gt;Papers:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Industry Adoption:&lt;/strong&gt; vLLM ships EAGLE as one of its primary speculative decoding methods.&lt;/p&gt;
&lt;h2 id="why-speculative-decoding-breaks-pagedattention-and-how-vllm-fixes-it"&gt;Why Speculative Decoding Breaks PagedAttention (and How vLLM Fixes It)&lt;/h2&gt;
&lt;h3 id="1-allocation-frequency-steady-pruning-vs-burst-and-kill"&gt;1. Allocation Frequency: Steady Pruning vs. Burst-and-Kill&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Beam Search&lt;/strong&gt; allocates and frees KV cache blocks at the pace of actual generation. With beam width = 3, each forward pass produces 3 new tokens. When a branch is pruned, the system triggers a single deallocation — the churn rate is synchronized with real generation speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speculative Decoding (Tree Attention)&lt;/strong&gt; operates in explosive bursts. The draft model generates an entire speculation tree (e.g., 15 candidate tokens) in a single, ultra-fast forward pass (a few milliseconds). The verifier then instantly rejects most of them — say 12 out of 15.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The critical gap:&lt;/strong&gt; Beam search prunes gradually. Speculative decoding forces the page table to insert 15 pointers, then execute 12 &lt;code&gt;Free&lt;/code&gt; operations just ~10ms later — every single step. This high-frequency &amp;ldquo;instant garbage collection&amp;rdquo; creates severe &lt;strong&gt;lock contention&lt;/strong&gt; on the CPU-side scheduler, and the management overhead can eat into the speedup gained from speculation.&lt;/p&gt;
&lt;h3 id="2-copy-on-write-fragmentation-at-micro-scale"&gt;2. Copy-on-Write Fragmentation at Micro Scale&lt;/h3&gt;
&lt;p&gt;CoW is elegant for beam search branching, but becomes a nightmare at speculative decoding&amp;rsquo;s micro-granularity.&lt;/p&gt;
&lt;p&gt;Consider a block of size 16 tokens with 3 empty slots remaining. Beam search fills one token at a time. But when the draft model forks into 3 parallel paths (A, B, C) each producing 3 tokens, CoW forces the system to immediately copy 3 independent physical blocks:&lt;/p&gt;
$$3 \times 16 = 48 \text{ token slots allocated, but only } 3 \times 3 = 9 \text{ draft tokens stored}$$&lt;p&gt;This is extreme &lt;strong&gt;internal fragmentation&lt;/strong&gt;. After verification kills paths B and C, those under-filled blocks must be freed immediately — adding to the allocation churn.&lt;/p&gt;
&lt;h3 id="3-the-engineering-solution-volatile-draft-buffer"&gt;3. The Engineering Solution: Volatile Draft Buffer&lt;/h3&gt;
&lt;p&gt;Because naive PagedAttention integration causes page table thrashing and internal fragmentation, production systems like vLLM and TensorRT-LLM &lt;strong&gt;do not&lt;/strong&gt; let draft tokens enter the global PagedAttention memory pool.&lt;/p&gt;
&lt;p&gt;Instead, they employ an isolation mechanism:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volatile Draft Buffer:&lt;/strong&gt; Each sequence gets a small, contiguous temporary buffer (a simple array, not managed by the block allocator). Draft tokens are written directly here — no block allocation, no CoW, no fragmentation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-place Overwrite:&lt;/strong&gt; The speculation tree is written into this buffer each step. Rejected tokens are simply overwritten by the next round of drafts — no &lt;code&gt;Free&lt;/code&gt; syscall needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Commit on Verification:&lt;/strong&gt; Only after the verifier confirms tokens as correct are they &amp;ldquo;promoted&amp;rdquo; into the global PagedAttention KV cache as committed tokens, in a single batch write.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the real-world engineering answer: use a short-lived, lock-free ring buffer to absorb the high-frequency allocation/deallocation storm, and only touch the global page table for verified, permanent tokens.&lt;/p&gt;
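&lt;p&gt;The three bullets above can be sketched in a few lines of Python. This is a toy model of the isolation mechanism, not vLLM&amp;rsquo;s actual code; names like &lt;code&gt;DraftBuffer&lt;/code&gt; and &lt;code&gt;append_batch&lt;/code&gt; are illustrative:&lt;/p&gt;

```python
class PagedKVCache:
    """Toy committed store: a flat list standing in for the block-managed KV cache."""
    def __init__(self):
        self.tokens = []

    def append_batch(self, toks):
        # Commit-on-verification: one batch write into the global cache.
        self.tokens.extend(toks)


class DraftBuffer:
    """Fixed-size scratch area; rejected drafts are simply overwritten next step."""
    def __init__(self, capacity):
        self.slots = [None] * capacity   # contiguous, never freed per-step

    def write(self, draft_tokens):
        # In-place overwrite each step: no block allocation, no Free calls.
        for i, t in enumerate(draft_tokens):
            self.slots[i] = t
        return len(draft_tokens)


def step(cache, buf, draft_tokens, n_accepted):
    n = buf.write(draft_tokens)                  # burst write, allocator untouched
    accepted = buf.slots[:min(n_accepted, n)]
    cache.append_batch(accepted)                 # promote only verified tokens


cache, buf = PagedKVCache(), DraftBuffer(capacity=16)
step(cache, buf, ["t1", "t2", "t3", "t4", "t5"], n_accepted=2)
step(cache, buf, ["u1", "u2", "u3"], n_accepted=3)
assert cache.tokens == ["t1", "t2", "u1", "u2", "u3"]
```

&lt;p&gt;The page table is only touched inside &lt;code&gt;append_batch&lt;/code&gt;, once per verification round; the per-step storm of writes and rejections never leaves the buffer.&lt;/p&gt;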
&lt;h2 id="why-vllm-adopts-eagle-but-not-dmc"&gt;Why vLLM Adopts EAGLE but Not DMC&lt;/h2&gt;
&lt;p&gt;The key difference from approaches like DMC is &lt;strong&gt;decoupling&lt;/strong&gt;. EAGLE does not modify the base model weights at all — it trains a lightweight external draft head as a plug-in. If you don&amp;rsquo;t want speculative decoding, you simply remove the EAGLE head and the original model remains a standard, unmodified checkpoint.&lt;/p&gt;
&lt;p&gt;In contrast, DMC requires retrofitting the base model itself with 2% of pre-training data, permanently altering its weights. This makes it impractical unless every model provider commits to the training cost.&lt;/p&gt;
&lt;p&gt;With EAGLE, the training cost is absorbed by the open-source community: labs with compute (e.g., Tsinghua, Berkeley, LMSYS) pre-train EAGLE heads for popular models (Llama-3, Qwen, Mistral, etc.) and publish them on HuggingFace. End users simply download the plug-in weights and enjoy ~3x decoding speedup — no training required.&lt;/p&gt;</description></item><item><title>Paper Notes: FlashAttention</title><link>https://www.jiangnengli.com/post/llm-infra-flashattention1/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-flashattention1/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="1-tiling-and-safe-online-softmax-the-forward-pass-math"&gt;1. Tiling and Safe Online Softmax (The Forward Pass Math)&lt;/h2&gt;
&lt;p&gt;The fundamental bottleneck of standard attention is the $\Theta(N^2)$ memory requirement to materialize the attention score matrix in High Bandwidth Memory (HBM). FlashAttention solves this via &lt;strong&gt;tiling&lt;/strong&gt; (computing block by block in SRAM) combined with the &lt;strong&gt;Safe Online Softmax&lt;/strong&gt; mathematical trick.&lt;/p&gt;
&lt;h3 id="the-overflow-problem--safe-softmax"&gt;The Overflow Problem &amp;amp; Safe Softmax&lt;/h3&gt;
&lt;p&gt;The standard softmax $e^{x_i} / \sum_j e^{x_j}$ overflows numerically when $x_i$ is large (in FP16, $e^{x_i}$ becomes Inf once $x_i \gtrsim 11$, and the subsequent division yields NaN). To prevent this, the maximum $m(x) = \max_i x_i$ is subtracted from all elements:&lt;/p&gt;
$$\text{softmax}(x_i) = \frac{e^{x_i - m(x)}}{\sum_j e^{x_j - m(x)}}$$&lt;h3 id="the-time-travel-reweighting-trick-online-softmax"&gt;The &amp;ldquo;Time-Travel&amp;rdquo; Reweighting Trick (Online Softmax)&lt;/h3&gt;
&lt;p&gt;Because blocks are processed sequentially and earlier blocks are discarded from SRAM, we cannot retroactively subtract a newly discovered global maximum from old blocks. Instead, FlashAttention leverages the exponential property $e^{a-b} = e^a \cdot e^{-b}$ to dynamically &amp;ldquo;decay&amp;rdquo; historical running states.&lt;/p&gt;
&lt;p&gt;For each new block $j$, the GPU computes the local scores $S_j = Q K_j^T$, the local max $m_{local}$, and the local exponentiated values $\tilde{P}_{local} = \exp(S_j - m_{local})$. The running variables are updated entirely in SRAM:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update Global Max:&lt;/strong&gt;
&lt;/p&gt;
$$m_{new} = \max(m_{old}, m_{local})$$&lt;p&gt;&lt;strong&gt;Update Running Denominator ($l$) via Exponential Decay:&lt;/strong&gt;
&lt;/p&gt;
$$l_{new} = l_{old} \cdot \exp(m_{old} - m_{new}) + \exp(m_{local} - m_{new}) \cdot \text{rowsum}(\tilde{P}_{local})$$&lt;p&gt;&lt;strong&gt;Update Running Numerator/Output ($O$) via Weighted Sum:&lt;/strong&gt;
&lt;/p&gt;
$$O_{new} = O_{old} \cdot \exp(m_{old} - m_{new}) + \exp(m_{local} - m_{new}) \cdot \tilde{P}_{local} V_{local}$$&lt;p&gt;By applying the decay factor $\exp(m_{old} - m_{new})$ to the history, and the analogous factor $\exp(m_{local} - m_{new})$ to the current block (a no-op whenever the newest block contains the maximum), the algorithm mathematically aligns all calculations to the new maximum without ever reloading old $K$ and $V$ matrices.&lt;/p&gt;
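&lt;p&gt;The update rules can be sanity-checked against a fully materialized softmax. A minimal single-query NumPy sketch (block size and shapes are arbitrary choices; the factor $\exp(m_{local} - m_{new})$ re-aligns the current block&amp;rsquo;s $\tilde{P}_{local}$ and equals 1 whenever the newest block holds the running max):&lt;/p&gt;

```python
import numpy as np

def online_softmax_attention(q, K, V, block=4):
    """One query row; stream over K/V blocks, keeping only running (m, l, O)."""
    m = -np.inf                      # running max
    l = 0.0                          # running denominator
    o = np.zeros(V.shape[1])         # running unnormalized output
    for j in range(0, K.shape[0], block):
        s = K[j:j + block] @ q                   # local scores S_j
        m_local = s.max()
        p_local = np.exp(s - m_local)            # \tilde{P}_local
        m_new = max(m, m_local)
        decay_hist = np.exp(m - m_new)           # re-align history
        decay_cur = np.exp(m_local - m_new)      # re-align current block
        l = l * decay_hist + decay_cur * p_local.sum()
        o = o * decay_hist + decay_cur * (p_local @ V[j:j + block])
        m = m_new
    return o / l                                 # final normalization

rng = np.random.default_rng(0)
N, d = 16, 8
q = rng.normal(size=d)
K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d))
# Reference: materialize the full safe-softmax row, then weight V.
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```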
&lt;h2 id="2-loop-order-optimization-flashattention-1-vs-flashattention-2"&gt;2. Loop Order Optimization: FlashAttention-1 vs. FlashAttention-2&lt;/h2&gt;
&lt;p&gt;The physical execution speed of GPU kernels is heavily bound by HBM write operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FlashAttention-1 (KV Outer Loop, Q Inner Loop):&lt;/strong&gt; FA1 iterates over $K, V$ blocks in the outer loop. For every inner loop step over $Q$, the intermediate, partially accumulated output block $O_i$ must be read from HBM, &amp;ldquo;un-normalized&amp;rdquo; by multiplying the old denominator, updated with the new block&amp;rsquo;s weighted sum, re-normalized, and written back to HBM. This causes a massive $O_i$ read/write overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FlashAttention-2 (Q Outer Loop, KV Inner Loop):&lt;/strong&gt; FA2 pins a $Q_i$ block in the outer loop and iterates through all $K_j, V_j$ blocks in the inner loop. The running variables $O_{run}$, $m_{run}$, and $l_{run}$ stay exclusively inside the SRAM registers. The intermediate $O_i$ is continuously accumulated using the decay formula and is written to HBM exactly once after the entire inner KV loop finishes. This simple loop swap eliminates the repetitive HBM writes, drastically dropping the constant factor in the $O(N^2 d^2 M^{-1})$ complexity.&lt;/p&gt;
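&lt;p&gt;The payoff of the loop swap can be seen by tallying how often a partial $O_i$ block hits HBM under each schedule. This is a schematic count, not a kernel; &lt;code&gt;n_q&lt;/code&gt; and &lt;code&gt;n_kv&lt;/code&gt; are the number of $Q$ and $K/V$ blocks:&lt;/p&gt;

```python
def o_writes_fa1(n_q, n_kv):
    # KV outer, Q inner: every (j, i) step reads and rewrites a partial O_i in HBM.
    return n_q * n_kv

def o_writes_fa2(n_q, n_kv):
    # Q outer, KV inner: O_i accumulates in registers, written once per Q block.
    return n_q

n_q = n_kv = 64                        # e.g. 1024 tokens with block size 16
assert o_writes_fa1(n_q, n_kv) == 4096
assert o_writes_fa2(n_q, n_kv) == 64   # n_kv times fewer O writes
```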
&lt;h2 id="3-the-backward-pass-and-gradient-recomputation"&gt;3. The Backward Pass and Gradient Recomputation&lt;/h2&gt;
&lt;p&gt;During model training, the backward pass requires the full $N \times N$ attention probability matrix $P$ to calculate gradients using the Chain Rule. Writing this massive matrix to HBM during the forward pass would negate all memory optimizations.&lt;/p&gt;
&lt;h3 id="checkpointing-global-statistics"&gt;Checkpointing Global Statistics&lt;/h3&gt;
&lt;p&gt;Instead of storing the $N \times N$ matrix, the forward pass only saves the per-row softmax statistics to HBM: the row-wise maximum ($m^{global}$) and the row-wise denominator ($l^{global}$), an $O(N)$ footprint instead of $O(N^2)$.&lt;/p&gt;
&lt;h3 id="on-the-fly-recomputation-and-matrix-calculus"&gt;On-the-Fly Recomputation and Matrix Calculus&lt;/h3&gt;
&lt;p&gt;During the backward pass, the GPU loads $Q_i$, $K_j$, $V_j$, and the upstream gradient $dO_i$ into SRAM. Because the true global maximum is already known, there is no need for dynamic reweighting. The exact local probability block $P_{ij}$ is reconstructed instantly:&lt;/p&gt;
$$P_{ij} = \frac{\exp(Q_i K_j^T - m^{global})}{l^{global}}$$&lt;p&gt;With $P_{ij}$ reconstructed locally, the gradients are computed using the multivariable chain rule, and the results are accumulated (+=):&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gradient of V:&lt;/strong&gt;
&lt;/p&gt;
$$dV_j \mathrel{+}= P_{ij}^T \cdot dO_i$$&lt;p&gt;&lt;strong&gt;Gradient of Pre-Softmax Scores ($S$):&lt;/strong&gt;
&lt;/p&gt;
$$dS_{ij} = P_{ij} \circ (dO_i \cdot V_j^T - D_i)$$&lt;p&gt;(where $\circ$ is element-wise multiplication, and $D_i = \text{rowsum}(dO_i \circ O_i)$)&lt;/p&gt;
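&lt;p&gt;These chain-rule identities, together with the $dQ_i$ and $dK_j$ accumulations stated next, can be verified numerically: reconstructing each $P_{ij}$ from the checkpointed per-row statistics and accumulating blockwise must reproduce the full-matrix gradients. A NumPy sketch with toy sizes (illustrative, not the actual kernel):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 8, 4, 2                           # tiny sizes; B = block size
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
dO = rng.normal(size=(N, d))                # upstream gradient

# Forward pass: checkpoint only the per-row statistics m and l.
S = Q @ K.T
m = S.max(axis=1, keepdims=True)
P = np.exp(S - m)
l = P.sum(axis=1, keepdims=True)
P /= l
O = P @ V
D = (dO * O).sum(axis=1, keepdims=True)     # D_i = rowsum(dO ∘ O)

# Full-matrix reference gradients.
dS_ref = P * (dO @ V.T - D)
dV_ref, dQ_ref, dK_ref = P.T @ dO, dS_ref @ K, dS_ref.T @ Q

# Blockwise recomputation with += accumulation, as in the backward pass.
dV, dQ, dK = np.zeros_like(V), np.zeros_like(Q), np.zeros_like(K)
for i in range(0, N, B):
    for j in range(0, N, B):
        Qi, Kj, Vj, dOi = Q[i:i+B], K[j:j+B], V[j:j+B], dO[i:i+B]
        Pij = np.exp(Qi @ Kj.T - m[i:i+B]) / l[i:i+B]   # rebuilt from m, l
        dSij = Pij * (dOi @ Vj.T - D[i:i+B])
        dV[j:j+B] += Pij.T @ dOi
        dQ[i:i+B] += dSij @ Kj
        dK[j:j+B] += dSij.T @ Qi

assert np.allclose(dV, dV_ref) and np.allclose(dQ, dQ_ref) and np.allclose(dK, dK_ref)
```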
&lt;p&gt;&lt;strong&gt;Gradients of Q and K:&lt;/strong&gt;
&lt;/p&gt;
$$dQ_i \mathrel{+}= dS_{ij} \cdot K_j$$&lt;p&gt;
&lt;/p&gt;
$$dK_j \mathrel{+}= dS_{ij}^T \cdot Q_i$$&lt;p&gt;The strict accumulation logic (+=) represents the physical manifestation of the mathematical summation over all blocks. Once the local gradients are added to the accumulators in HBM, the massive $P_{ij}$ and $dS_{ij}$ blocks are immediately destroyed from SRAM, ensuring the memory footprint remains constant $\Theta(1)$ regardless of sequence length.&lt;/p&gt;</description></item><item><title>Paper Notes: vLLM PagedAttention</title><link>https://www.jiangnengli.com/post/llm-infra-pagedattention/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/llm-infra-pagedattention/</guid><description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;/p&gt;
&lt;h2 id="executive-summary"&gt;Executive Summary&lt;/h2&gt;
&lt;p&gt;PagedAttention revolutionizes Large Language Model (LLM) inference by applying operating system virtual memory concepts to KV cache management. Instead of allocating contiguous GPU memory for a sequence&amp;rsquo;s maximum potential length (which causes massive external and internal fragmentation), PagedAttention divides the KV cache into fixed-size &lt;strong&gt;Physical Blocks&lt;/strong&gt; (e.g., storing 16 tokens each) and maps them dynamically via a &lt;strong&gt;Block Table&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="core-mechanisms--critical-insights"&gt;Core Mechanisms &amp;amp; Critical Insights&lt;/h2&gt;
&lt;h3 id="1-memory-management--copy-on-write-cow"&gt;1. Memory Management &amp;amp; Copy-on-Write (CoW)&lt;/h3&gt;
&lt;p&gt;To enable highly efficient memory sharing (e.g., multiple generated sequences sharing the same system prompt), PagedAttention implements a strict &lt;strong&gt;Reference Counting&lt;/strong&gt; (&lt;code&gt;ref_count&lt;/code&gt;) mechanism at the physical block level.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shared Pointers&lt;/strong&gt;: Multiple logical blocks from different sequences can map to the exact same physical block. When this happens, the physical block&amp;rsquo;s &lt;code&gt;ref_count&lt;/code&gt; is incremented.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Copy-on-Write (CoW) Execution&lt;/strong&gt;: When a sequence generates a new token and attempts to append it to the current physical block, the system first checks the &lt;code&gt;ref_count&lt;/code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Trigger Condition&lt;/strong&gt;: If &lt;code&gt;ref_count &amp;gt; 1&lt;/code&gt; (meaning the block is shared), the sequence is not allowed to write directly. Instead, a CoW is triggered: the system allocates a brand-new physical block, copies the existing historical tokens into it, decrements the original block&amp;rsquo;s &lt;code&gt;ref_count&lt;/code&gt;, updates its own Block Table, and finally writes the new token into the newly copied block.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
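&lt;p&gt;The ref-counting and CoW rules above can be sketched as a toy allocator. Class and method names (&lt;code&gt;BlockAllocator&lt;/code&gt;, &lt;code&gt;append_token&lt;/code&gt;) are illustrative, not vLLM&amp;rsquo;s actual code:&lt;/p&gt;

```python
class BlockAllocator:
    """Toy physical-block pool with reference counting and copy-on-write."""

    def __init__(self):
        self.blocks = {}       # phys_id -> list of tokens
        self.ref_count = {}    # phys_id -> number of logical mappings
        self.next_id = 0

    def allocate(self, tokens=()):
        pid = self.next_id; self.next_id += 1
        self.blocks[pid] = list(tokens)
        self.ref_count[pid] = 1
        return pid

    def share(self, pid):
        # Another logical block maps to the same physical block.
        self.ref_count[pid] += 1
        return pid

    def free(self, pid):
        self.ref_count[pid] -= 1
        if self.ref_count[pid] == 0:
            del self.blocks[pid], self.ref_count[pid]

    def append_token(self, pid, token):
        """Append to a block; trigger CoW if the block is shared."""
        if self.ref_count[pid] > 1:                     # shared: no direct write
            new_pid = self.allocate(self.blocks[pid])   # copy historical tokens
            self.ref_count[pid] -= 1                    # drop old mapping
            pid = new_pid
        self.blocks[pid].append(token)
        return pid                     # caller updates its own Block Table

alloc = BlockAllocator()
b0 = alloc.allocate(["sys", "prompt"])     # shared system prompt
b_a = alloc.share(b0)                      # second sequence maps the same block
new_b = alloc.append_token(b_a, "A1")      # CoW: sequence A gets its own copy
assert new_b != b0
assert alloc.blocks[b0] == ["sys", "prompt"]            # original untouched
assert alloc.blocks[new_b] == ["sys", "prompt", "A1"]
assert alloc.ref_count[b0] == 1
```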
&lt;h3 id="2-hardware-bottleneck-dichotomy-prefill-vs-decoding"&gt;2. Hardware Bottleneck Dichotomy: Prefill vs. Decoding&lt;/h3&gt;
&lt;p&gt;The system must distinctly separate these two phases because they stress completely different physical hardware units on the GPU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prefill Phase (Strictly Compute-Bound)&lt;/strong&gt;: To process the initial prompt (e.g., 1,000 tokens), the model must compute the Q, K, and V for all tokens. Because of the multi-layer Transformer architecture, every token must perform an Attention calculation with all preceding tokens to generate its distinct output ($O$) before passing through the FFN to the next layer. This results in a massive $O(N^2)$ Dense Matrix-Matrix Multiplication (GEMM). The GPU&amp;rsquo;s memory bandwidth is sufficient, but the Tensor Cores (ALUs) hit their maximum capacity. (This is where FlashAttention steps in to optimize).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Decoding Phase (Strictly Memory-Bound)&lt;/strong&gt;: During autoregressive generation, the model predicts only one token at a time. The arithmetic operation is a tiny Matrix-Vector Multiplication (GEMV). However, to compute this single step, the GPU must fetch the entire historical KV cache from the global HBM into the SRAM. The ALUs sit idle waiting for data to arrive. Thus, decoding speed is strictly bottlenecked by GPU memory bandwidth.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
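&lt;p&gt;The compute-bound vs. memory-bound split can be made concrete with a back-of-the-envelope arithmetic-intensity estimate. The sketch counts only attention FLOPs ($QK^T$ plus $PV$, 2 FLOPs per multiply-add) against KV-cache bytes read, assuming an FP16 cache; the numbers are illustrative, not a real roofline model:&lt;/p&gt;

```python
def attention_intensity(n_ctx, d, new_tokens, bytes_per_elt=2):
    """Rough FLOPs per byte of KV traffic for one attention layer."""
    flops = 2 * new_tokens * n_ctx * d * 2     # QK^T plus PV, 2 FLOPs per MAC
    kv_bytes = 2 * n_ctx * d * bytes_per_elt   # read K and V for the context
    return flops / kv_bytes

prefill = attention_intensity(n_ctx=1000, d=128, new_tokens=1000)
decode = attention_intensity(n_ctx=1000, d=128, new_tokens=1)
assert prefill == 1000 * decode   # intensity scales with tokens per forward pass
assert decode < 2                 # ~1 FLOP/byte: far below any GPU's roofline balance
```

&lt;p&gt;Prefill amortizes the KV read across every prompt token (keeping Tensor Cores saturated), while decode re-reads the whole cache to produce a single token, which is exactly why it is bandwidth-bound.&lt;/p&gt;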
&lt;h3 id="3-fine-grained-branching-in-beam-search"&gt;3. Fine-Grained Branching in Beam Search&lt;/h3&gt;
&lt;p&gt;While architectural diagrams often simplify Beam Search (or parallel decoding) by showing sequences branching perfectly at the boundary of a block, the engineering reality is much more granular.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token-Level Divergence&lt;/strong&gt;: Sequences branch at the exact token level, which almost always happens right in the middle of a physical block.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CoW Resilience&lt;/strong&gt;: The Copy-on-Write mechanism seamlessly handles this. If a beam diverges at the 5th token of a 16-token block, the CoW mechanism will copy those 5 tokens into a new physical block, and the new sequence will continue appending its unique 6th token into the new block, leaving the original shared block perfectly intact for the other beams.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="system-impact"&gt;System Impact&lt;/h2&gt;
&lt;p&gt;By combining dynamic block allocation, precise reference counting, and Copy-on-Write, PagedAttention achieves near-zero memory waste (less than 4% internal fragmentation in the final block). This fundamentally shifts the LLM inference paradigm, allowing batch sizes to scale significantly higher and dramatically improving overall system throughput.&lt;/p&gt;</description></item></channel></rss>