Paper Notes: FlashAttention
FlashAttention eliminates the O(N²) memory bottleneck of standard attention by tiling computation in SRAM with an online softmax trick, achieving exact results with no …
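The core of the online softmax trick can be sketched in plain Python: process the scores one tile at a time while carrying a running max, a running normalizer, and a running weighted sum, so the full score row is never held at once. This is a minimal illustrative sketch, not FlashAttention's actual kernel; the function name and tile size are assumptions for illustration.

```python
import math

def online_softmax_weighted_sum(scores, values, tile=4):
    """Softmax-weighted sum of `values`, computed tile by tile.

    Mirrors the online-softmax idea: keep a running max `m`, running
    normalizer `l`, and running accumulator `acc`, rescaling the old
    partial results whenever a new tile raises the max. The result is
    exact, yet no tile-spanning intermediate is ever materialized.
    (Hypothetical helper, not the real FlashAttention kernel.)
    """
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running softmax normalizer
    acc = 0.0          # running weighted sum
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)  # rescale old partials to new max
        l *= scale
        acc *= scale
        for si, vi in zip(s, v):
            p = math.exp(si - m_new)
            l += p
            acc += p * vi
        m = m_new
    return acc / l
```

Running it on a small input matches the ordinary two-pass softmax computation, which is the "exact results" property the paper emphasizes.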
PagedAttention applies OS virtual-memory paging concepts to KV-cache management, letting LLM inference allocate cache memory block by block on demand and achieve near-zero memory waste.
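The paging idea can be sketched as a block table per sequence, mapping logical cache positions to fixed-size physical blocks allocated from a free list, so at most one block per sequence is partially filled. A minimal sketch under assumed names (this is not vLLM's actual API; KV data storage is elided to a comment):

```python
class PagedKVCache:
    """Hypothetical sketch of PagedAttention-style KV-cache bookkeeping.

    Physical memory is split into fixed-size blocks; each sequence owns
    a block table (logical block index -> physical block id). Blocks are
    allocated on demand and returned to the free list when the sequence
    finishes, so waste is bounded by one partial block per sequence.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens cached

    def append(self, seq_id, kv_entry):
        """Reserve a slot for one token's KV entry; return (block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full (or none yet)
            table.append(self.free.pop())  # allocate a fresh physical block
        block, offset = table[-1], n % self.block_size
        # a real implementation would write kv_entry into block storage here
        self.lengths[seq_id] = n + 1
        return block, offset

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

With block_size 2, caching three tokens for one sequence uses exactly two blocks (one of them half full), and releasing the sequence restores the full free list.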