Notes on speculative decoding methods for accelerating LLM inference: a cheap drafter proposes several tokens and the target model verifies them in a single parallel forward pass, so multiple tokens can be accepted per target-model step. Variants include Medusa (extra decoding heads on the target model itself) and EAGLE (autoregressive drafting at the feature level).
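A minimal sketch of the greedy verification step, the part all speculative methods share. Assumptions: `drafted` holds the drafter's proposed tokens and `target_argmax` holds the target model's greedy choice at each of those positions plus one extra (available because verification runs one parallel forward pass over the whole draft); both names are hypothetical, not from any library.

```python
def greedy_verify(drafted, target_argmax):
    """Accept the longest prefix of drafted tokens matching the target
    model's greedy choices. On the first mismatch, emit the target's own
    token instead; on a full match, emit the target's bonus token so at
    least one new token is produced per verification pass."""
    accepted = []
    for i, tok in enumerate(drafted):
        if tok == target_argmax[i]:
            accepted.append(tok)
        else:
            accepted.append(target_argmax[i])  # target's correction
            break
    else:
        accepted.append(target_argmax[len(drafted)])  # bonus token
    return accepted
```

Either way the output distribution is exactly the target model's greedy decode; drafting only changes how many tokens each pass yields.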
DMC (Dynamic Memory Compression) retrofits pretrained LLMs to decide, per head and timestep, whether to append the new key/value pair to the KV cache or merge it into the most recent entry based on learned contextual importance; continued training on roughly 2% of the original pre-training data buys a large reduction in cache memory at inference.
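A toy sketch of the append-vs-merge cache update, using scalars in place of key/value vectors. The weighted-running-average merge and the names `dmc_update`, `merge`, `omega` are illustrative assumptions about the mechanism, not DMC's actual implementation; in the real method the merge decision and the importance weight are predicted by the model.

```python
def dmc_update(keys, values, weights, k, v, merge, omega):
    """Update a per-head KV cache: either merge (k, v) into the last
    entry as a weighted running average with importance weight omega,
    or append it as a fresh entry. keys/values/weights are parallel
    lists; scalars stand in for vectors."""
    if merge and keys:
        w = weights[-1]
        keys[-1] = (w * keys[-1] + omega * k) / (w + omega)
        values[-1] = (w * values[-1] + omega * v) / (w + omega)
        weights[-1] = w + omega  # accumulated mass of the merged entry
    else:
        keys.append(k)
        values.append(v)
        weights.append(omega)
```

Every merge keeps the cache length constant for that step, which is where the memory saving comes from.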
FlashAttention removes the O(N²) memory bottleneck of standard attention by never materializing the full N×N attention matrix: it tiles Q, K, and V into blocks that fit in on-chip SRAM and combines partial results with an online softmax, producing exact (not approximate) attention with far fewer reads and writes to HBM.
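The online softmax trick in isolation, for a single query over scalar values: keep a running max `m`, normalizer `l`, and unnormalized accumulator `acc`, and rescale the old partial sums whenever a new block raises the max. This one-pass sketch is the numerics only; the actual kernel applies it per tile of K/V in SRAM.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One streaming pass computing sum_i softmax(scores)_i * values_i
    without ever holding the full softmax in memory."""
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running softmax normalizer (at scale m)
    acc = 0.0           # running unnormalized weighted sum (at scale m)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + math.exp(s - m_new)        # rescale, then add
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / l
```

Because every partial sum is kept relative to the current max, the result matches a two-pass softmax exactly (up to float rounding) while staying numerically stable.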
PagedAttention (vLLM) applies OS virtual-memory concepts to KV cache management: the cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to them, so memory is allocated on demand and waste is confined to the last partially filled block per sequence.
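A minimal sketch of the block-table bookkeeping, under stated assumptions: `BLOCK = 16` tokens per block and a caller-supplied `allocator` returning free physical block ids are illustrative choices, and the class name is hypothetical rather than vLLM's API.

```python
BLOCK = 16  # tokens per KV block (an assumed block size)

class BlockTable:
    """Per-sequence mapping from logical token positions to
    (physical_block, offset) slots, allocating blocks on demand."""

    def __init__(self, allocator):
        self.blocks = []        # physical block ids, in logical order
        self.alloc = allocator  # callable returning a free block id
        self.length = 0         # tokens stored so far

    def append_slot(self):
        """Return the slot for the next token; a new physical block is
        allocated only when the last one is full, so waste is at most
        BLOCK - 1 slots per sequence."""
        if self.length % BLOCK == 0:
            self.blocks.append(self.alloc())
        slot = (self.blocks[-1], self.length % BLOCK)
        self.length += 1
        return slot
```

Because blocks need not be contiguous, freed blocks from finished sequences are immediately reusable, which is what drives fragmentation toward zero.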