System Notes: Queueing Avalanche

April 9, 2026 · Jiangneng Li · 4 min read

This note explains a failure mode I like to call queueing avalanche. It is not the cost of a single eviction that kills the system. It is the way a moderate control-plane stall gets amplified by a queue, then amplified again by concurrency, until the whole serving stack starts to look “frozen.”

You can think of it as the dynamic version of the story in "SGLang and Safe Eviction": that post focuses on what should be evicted, while this one focuses on how long the eviction machinery itself takes.

1. The Scaling Trap

The first trap is that microbenchmarks lie.

Suppose we measure std::make_heap on a tiny tree and see something like 12,000 ns (12 us). That sounds negligible. The problem is that the number was measured on a structure with only a few thousand nodes.

A real MCTS-style search can easily grow to 100,000 nodes or more. std::make_heap runs in time linear in the number of elements, so once the maintenance work scales with that larger live set, the cost is no longer in microseconds. It can jump into the millisecond range.

That is the dangerous transition:

  • 12 us feels like metadata overhead,
  • 3-10 ms becomes a scheduling event,
  • and repeated 3-10 ms stalls become a system-level failure mode.

So the core mistake is extrapolating from a small-tree benchmark to a large-tree production workload.
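A quick way to see the trap is to time the rebuild at both scales. The sketch below uses Python's heapq.heapify as a stand-in for std::make_heap (both rebuild a heap in linear time); the absolute numbers are illustrative, but the scaling gap is the point:

```python
import heapq
import random
import time

def time_heapify(n, repeats=20):
    """Median wall-clock time (seconds) to rebuild a heap of n items."""
    items = [random.random() for _ in range(n)]
    samples = []
    for _ in range(repeats):
        data = list(items)
        t0 = time.perf_counter()
        heapq.heapify(data)  # O(n) rebuild, analogous to std::make_heap
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

for n in (4_000, 100_000):
    print(f"n={n:>7}: {time_heapify(n) * 1e6:.0f} us")
```

The small-n timing is the number a microbenchmark reports; the large-n timing is the number production pays on every rebuild.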

2. The Event-Loop Choke Point

In SGLang-like serving stacks, the control plane is orchestrated by a Python asyncio event loop. Conceptually, that loop is the traffic controller for request admission, scheduling decisions, and GPU handoff.

If a large eviction operation forces the control path to spend several milliseconds rebuilding or reordering a heap, that work can block the loop at exactly the wrong time.

During that blocked window:

  • new API requests cannot be admitted promptly,
  • TCP accept and request handling begin to backlog,
  • completed GPU batches come back to the CPU asking for the next step,
  • but the CPU-side scheduler is still busy maintaining the priority structure.

This is the absurd outcome: the GPUs are not slow, and memory bandwidth is not the bottleneck. The expensive accelerators go idle because the control plane cannot answer the question, “What should I run next?”

That is why a few milliseconds on the CPU can translate into 0% effective GPU utilization for that window.
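To make the choke point concrete, here is a minimal asyncio sketch (toy code, not SGLang's scheduler): a dispatcher coroutine stands in for batch scheduling, and a synchronous heap rebuild on the event-loop thread freezes it mid-rhythm:

```python
import asyncio
import heapq
import random
import time

async def dispatcher(interval=0.001, steps=50):
    """Stands in for the control-plane task that hands the GPU its next batch.

    Returns the worst observed gap between consecutive wakeups; in a healthy
    loop it stays close to `interval`.
    """
    worst = 0.0
    last = time.perf_counter()
    for _ in range(steps):
        await asyncio.sleep(interval)
        now = time.perf_counter()
        worst = max(worst, now - last)
        last = now
    return worst

async def main():
    task = asyncio.create_task(dispatcher())
    await asyncio.sleep(0.01)  # let the dispatcher settle into its rhythm
    # Synchronous work on the event-loop thread: while this runs, every
    # coroutine -- including the dispatcher -- is frozen.
    data = [random.random() for _ in range(2_000_000)]
    heapq.heapify(data)
    return await task

worst_gap = asyncio.run(main())
print(f"worst dispatch gap: {worst_gap * 1e3:.1f} ms")
```

During the rebuild the dispatcher misses its 1 ms cadence by orders of magnitude, which is exactly the window in which a real GPU would sit idle waiting for its next batch.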

3. Why Concurrency Turns It into an Avalanche

A single 5 ms stall is unpleasant but survivable. The real disaster appears when many clients arrive concurrently.

Imagine 256 concurrent MCTS clients. While the event loop is blocked:

  • existing requests cannot make progress,
  • new requests keep arriving,
  • completed work cannot be dispatched into the next batch,
  • and queue length starts increasing faster than the system can drain it.

Now the next request does not just pay the original 5 ms stall. It pays:

  • the original stall,
  • plus the waiting time of all work already queued ahead of it,
  • plus any additional stalls triggered by the larger backlog.

This is the queueing-theory version of positive feedback: delay creates backlog, backlog increases service delay, and the longer delay creates even more backlog.

That is why the tail latency explosion is so violent. The P99 does not grow linearly. Once the queue starts feeding on itself, it can blow up by orders of magnitude.
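The threshold behavior is easy to reproduce with a toy deterministic queue: fixed-rate arrivals, a fixed service time, and a periodic control-plane stall. All numbers below are illustrative, not measurements from a real system:

```python
def p99_wait(arrival_gap, service, stall_every, stall, n=10_000):
    """P99 waiting time in a single-server queue with a periodic stall.

    Requests arrive every `arrival_gap` ms and take `service` ms; every
    `stall_every`-th request is preceded by a `stall` ms control-plane
    pause that also delays everything queued behind it.
    """
    free_at = 0.0  # time at which the server next becomes free
    waits = []
    for i in range(n):
        arrive = i * arrival_gap
        start = max(arrive, free_at)
        if i and i % stall_every == 0:
            start += stall  # the pause pushes this request and the backlog
        free_at = start + service
        waits.append(start - arrive)
    waits.sort()
    return waits[int(0.99 * n)]

# Spare capacity is 0.1 ms per request, i.e. 10 ms of slack per 100 requests.
light = p99_wait(arrival_gap=1.0, service=0.9, stall_every=100, stall=5.0)
heavy = p99_wait(arrival_gap=1.0, service=0.9, stall_every=100, stall=12.0)
print(f"P99 wait, 5 ms stalls:  {light:.1f} ms")   # backlog drains: bounded
print(f"P99 wait, 12 ms stalls: {heavy:.1f} ms")   # backlog feeds on itself
```

With 5 ms stalls the 10 ms of slack per 100 requests drains each backlog before the next stall, so P99 stays near the stall itself. With 12 ms stalls the backlog grows by 2 ms per cycle and never fully drains: the same mechanism, slightly past the threshold, and P99 explodes by an order of magnitude or more.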

4. The Intuition in One Sentence

The system does not fail because heap maintenance is individually expensive. It fails because a synchronous control-plane pause starves batch scheduling, which idles the GPU, which slows queue draining, which amplifies the next pause.

That feedback loop is the avalanche.

Key Takeaway

The phrase queueing avalanche describes a control-plane collapse where a seemingly modest per-eviction cost becomes catastrophic under scale and concurrency. Small-tree benchmarks hide the problem, but once heap maintenance reaches the millisecond range, a single-threaded scheduler can become the bottleneck for the entire cluster. At that point, the dominant cost is no longer the heap operation itself. It is the cascading queueing delay that follows.

Jiangneng Li
PhD at NTU researching database systems, Data+AI, and multimedia data analytics.