Paper Notes: SGLang and Safe Eviction

April 9, 2026 · Jiangneng Li · 5 min read

Paper: SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2024)

SGLang’s key systems idea is RadixAttention: instead of discarding KV cache after every request, it retains reusable prefixes in a radix tree and manages them as a cache. The paper explicitly combines three ideas here: tree-structured KV reuse, leaf-first LRU eviction, and reference-count-based protection for in-flight requests.
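To make the tree-structured reuse concrete, here is a minimal token-level prefix tree sketch. This is illustrative only, not SGLang's actual `radix_cache.py` (a real radix tree compresses runs of tokens into edges; all names here are hypothetical):

```python
# Minimal token-level prefix tree illustrating KV reuse.
# Illustrative sketch only, not SGLang's radix_cache.py.

class Node:
    def __init__(self, token=None, parent=None):
        self.token = token          # token this edge represents
        self.parent = parent
        self.children = {}          # token -> Node
        self.last_access = 0        # recency stamp for LRU bookkeeping

class PrefixTree:
    def __init__(self):
        self.root = Node()
        self.clock = 0

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        self.clock += 1
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = self.clock
            matched += 1
        return matched

    def insert(self, tokens):
        """Insert a request's tokens, reusing any cached prefix."""
        self.clock += 1
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node(t, node))
            node.last_access = self.clock

tree = PrefixTree()
tree.insert([1, 2, 3, 4])               # first request: full prefill
print(tree.match_prefix([1, 2, 3, 9]))  # second request reuses 3 cached tokens
```

The point is that the second request skips prefill for the matched prefix; only the divergent suffix needs new KV computation.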

The most intuitive way to understand why this matters is to imagine a tree-search workload such as Tree-of-Thought or an MCTS-style reasoning loop. The paper itself evaluates Tree-of-Thought and discusses dynamic tree structures; the MCTS framing below is my systems interpretation of what goes wrong if eviction is implemented naively.

Why a Naive LRU Can End Up Evicting the Root

The problem is not that SGLang’s own RadixAttention blindly evicts the root. The problem is that a naive recency-only cache, if it ignores the tree structure, can absolutely do that.

1. Temporal Blind Spot

In a tree search, execution often spends a long stretch of time deep inside one branch. While the runtime is busy expanding, scoring, and decoding around a leaf, the root prefix may not be “touched” for several seconds.

To a naive LRU queue, that root prefix now looks cold.

2. Physical Illusion

The root is usually also the largest shared prefix in the system: system prompt, tool instructions, long context, prior reasoning trace, and so on. In practice, it can easily span thousands of tokens.

So a naive LRU sees something that is:

  • large in memory,
  • old in recency,
  • and apparently inactive.

That is exactly the profile of what a traditional cache would want to evict first.

3. Catastrophic Recomputation

But in a tree workload, evicting the root is not like evicting a normal cold page from an OS cache. It is closer to deleting the load-bearing structure of the whole search tree.

Suppose the runtime has been exploring a deep branch for a while and later needs to backtrack to an earlier decision point to open a sibling branch. If the shared prefix near the root has already been evicted, the model must recompute the entire path from scratch before it can continue. What looked like a “good” local eviction decision becomes a globally disastrous one.
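A rough back-of-envelope comparison makes the asymmetry vivid. The numbers below are assumed for illustration, not taken from the paper:

```python
# Back-of-envelope eviction cost on a single backtrack (numbers assumed).
prefix_tokens = 4000   # shared root prefix: system prompt + context
branch_tokens = 200    # per-branch suffix near a leaf

# Evicting one cold leaf: only that branch's suffix might be recomputed.
leaf_cost = branch_tokens

# Evicting the root: the next backtrack must re-prefill the whole prefix
# before any sibling branch can be opened.
root_cost = prefix_tokens

print(root_cost // leaf_cost)  # 20x more redundant prefill per eviction
```

And that is the cost of a single backtrack; if eviction keeps thrashing the root while the search repeatedly returns to it, the redundant prefill multiplies.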

This is why naive LRU is too myopic for tree-shaped KV reuse.

What SGLang Actually Does Differently

1. It Evicts Leaves First

The paper does not describe an arbitrary node-level LRU. It states that RadixAttention uses an eviction policy that removes the least recently used leaf first. This is the crucial structural fix.

If you evict leaves first, shared ancestors stay alive as long as they still support some subtree. Only after descendants disappear and an ancestor itself becomes a leaf does it become a candidate for eviction.

That is exactly the behavior we want in tree search: prune the fringe before touching the trunk.
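A minimal sketch of leaf-first LRU eviction, again illustrative rather than SGLang's actual implementation: recency is only compared among nodes that are currently leaves, so ancestors are structurally ineligible.

```python
# Leaf-first LRU eviction sketch (illustrative, not SGLang's code).
# A node is evictable only if it has no children, so shared ancestors
# survive as long as any descendant still needs them.

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = set()
        self.last_access = 0
        if parent:
            parent.children.add(self)

def evict_leaf_lru(nodes):
    """Evict and return the least recently used *leaf*."""
    leaves = [n for n in nodes if not n.children]
    victim = min(leaves, key=lambda n: n.last_access)
    nodes.remove(victim)
    if victim.parent:
        victim.parent.children.discard(victim)  # parent may now become a leaf
    return victim

root = Node("root"); root.last_access = 0        # oldest node in the tree
a = Node("branch-A", root); a.last_access = 5
b = Node("branch-B", root); b.last_access = 9

nodes = [root, a, b]
print(evict_leaf_lru(nodes).name)  # "branch-A": oldest leaf, never the root
```

Note that the root is by far the oldest node, yet it is not even a candidate: only once all its children are gone does it become a leaf and enter the pool.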

2. It Protects In-Flight Prefixes with Reference Counts

The paper also states that, under continuous batching, nodes used by the currently running batch cannot be evicted; each node therefore maintains a reference counter and becomes evictable only when that counter drops to zero.

Conceptually, this behaves like pinning, even though the implementation is not a standalone pinned=true flag.

In the current SGLang codebase, this protection appears as lock_ref in radix_cache.py:

  • inc_lock_ref(...) walks from a node toward the root and protects the whole ancestor chain,
  • dec_lock_ref(...) releases that protection when the request finishes or moves on,
  • _update_leaf_status(...) removes locked nodes from the evictable leaf set,
  • evict(...) only evicts from evictable leaves and will not continue upward if the parent still has a positive lock reference.

So the runtime behavior is: if a request is actively decoding at a deep node, its parent, grandparent, and so on are all protected transitively.
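The transitive protection can be sketched like this. The code below is modeled on the `lock_ref` idea but is an illustrative simplification, not the actual `radix_cache.py` logic:

```python
# Reference-count protection along the ancestor chain (illustrative
# sketch modeled on SGLang's lock_ref idea, not the actual code).

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = set()
        self.lock_ref = 0
        if parent:
            parent.children.add(self)

def inc_lock_ref(node):
    """Protect a node and every ancestor while a request runs there."""
    while node is not None:
        node.lock_ref += 1
        node = node.parent

def dec_lock_ref(node):
    """Release that protection when the request finishes."""
    while node is not None:
        node.lock_ref -= 1
        node = node.parent

def evictable_leaves(nodes):
    """Only unlocked leaves are eviction candidates."""
    return [n for n in nodes if not n.children and n.lock_ref == 0]

root = Node("root")
mid = Node("mid", root)
leaf = Node("leaf", mid)
idle = Node("idle-branch", root)

inc_lock_ref(leaf)  # a request is actively decoding at `leaf`
nodes = [root, mid, leaf, idle]
print([n.name for n in evictable_leaves(nodes)])  # ['idle-branch']
```

Because the lock walk goes leaf-to-root, a single active request at a deep node pins its entire prefix chain, while unrelated idle branches remain fair game.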

3. It Schedules Requests to Preserve Prefix Locality

SGLang does not only fix the eviction rule. It also fixes the execution order.

The paper proposes a cache-aware scheduler based on longest-shared-prefix-first, and proves that in the offline case this corresponds to a DFS order over the radix tree. Intuitively, once the runtime enters a subtree, it tries to stay there long enough to exploit the hot shared prefix instead of bouncing randomly across unrelated branches.

This matters because even a good eviction policy can still suffer if the scheduler keeps forcing the cache to thrash between disjoint prefixes.
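The scheduling rule itself is simple to sketch. The following is a hypothetical simplification of longest-shared-prefix-first, not SGLang's scheduler (which matches against the radix tree rather than a flat token list):

```python
# Cache-aware scheduling sketch: serve the waiting request whose prompt
# shares the longest prefix with what is already cached. Illustrative
# only; SGLang matches against the radix tree, not a flat token list.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_next(waiting, cached_tokens):
    """Longest-shared-prefix-first: maximize KV reuse on the next request."""
    return max(waiting, key=lambda req: shared_prefix_len(req, cached_tokens))

cached = [1, 2, 3, 4, 5]        # tokens currently held in the cache
waiting = [
    [9, 9, 9],                  # unrelated prompt: 0 shared tokens
    [1, 2, 3, 7],               # shares 3 tokens
    [1, 2, 3, 4, 5, 6],         # shares 5 tokens -> scheduled first
]
print(pick_next(waiting, cached))
```

Greedily picking the best-matching request keeps execution inside the current hot subtree, which is exactly the DFS-like behavior the paper analyzes.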

The Engineering Intuition

Putting the pieces together:

  • A naive flat LRU may see the root as “old and big” and evict it.
  • RadixAttention changes the object being managed from isolated pages to tree-structured prefixes.
  • Leaf-first eviction preserves shared ancestors.
  • Reference counts protect every prefix that is still needed by a live request.
  • Prefix-aware scheduling keeps the active subtree hot.

So the real lesson is not just “use LRU carefully.” It is that eviction must respect the geometry of the workload. Once requests share prefixes in a branching tree, cache management has to become tree-aware as well.

Key Takeaway

SGLang avoids the classic eviction disaster not by abandoning LRU entirely, but by embedding it inside the right structure. A naive LRU can indeed be “dumb enough” to evict the root of a reasoning tree. RadixAttention prevents that by combining radix-tree structure, leaf-first eviction, reference-protected ancestors, and cache-aware scheduling.
