<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Structure | Jiangneng's Homepage</title><link>https://www.jiangnengli.com/tag/data-structure/</link><atom:link href="https://www.jiangnengli.com/tag/data-structure/index.xml" rel="self" type="application/rss+xml"/><description>Data Structure</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.jiangnengli.com/media/icon_hu_37c904991c0d686.png</url><title>Data Structure</title><link>https://www.jiangnengli.com/tag/data-structure/</link></image><item><title>Data Structure Notes: LSM-Tree</title><link>https://www.jiangnengli.com/post/data-structure-lsmtree/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/data-structure-lsmtree/</guid><description>&lt;p&gt;An &lt;strong&gt;LSM-tree&lt;/strong&gt; (Log-Structured Merge-Tree) is a write-optimized data structure widely used in modern storage engines such as RocksDB, LevelDB, Bigtable-style systems, and many NoSQL databases.&lt;/p&gt;
&lt;p&gt;The core idea is simple: instead of updating data in place like a B-tree, an LSM-tree turns random writes into a sequence of &lt;strong&gt;append, flush, and merge&lt;/strong&gt; operations. That design is much friendlier to disks and SSDs when the workload is write-heavy.&lt;/p&gt;
&lt;h2 id="1-why-lsm-trees-exist"&gt;1. Why LSM-Trees Exist&lt;/h2&gt;
&lt;p&gt;Traditional update-in-place structures are good at keeping data immediately organized, but random writes are expensive. Every insert may require touching existing pages, rewriting internal nodes, and causing scattered I/O.&lt;/p&gt;
&lt;p&gt;LSM-trees take the opposite approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;accept writes quickly in memory,&lt;/li&gt;
&lt;li&gt;write sorted immutable files to disk in batches,&lt;/li&gt;
&lt;li&gt;and reorganize them later through background compaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the design goal is not &amp;ldquo;keep the data perfectly ordered at every moment.&amp;rdquo; It is &amp;ldquo;make writes cheap first, then clean up efficiently in bulk.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="2-the-basic-write-path"&gt;2. The Basic Write Path&lt;/h2&gt;
&lt;p&gt;The write path usually looks like this:&lt;/p&gt;
&lt;h3 id="memtable"&gt;MemTable&lt;/h3&gt;
&lt;p&gt;Incoming writes first go to an in-memory structure, often called a &lt;strong&gt;MemTable&lt;/strong&gt;. This is typically a sorted map or skip list.&lt;/p&gt;
&lt;p&gt;Updating memory is fast, so the system can absorb writes with very low latency.&lt;/p&gt;
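&lt;p&gt;A minimal sketch of the idea in Python (the class and method names are illustrative; production engines typically use a concurrent skip list rather than a plain dict):&lt;/p&gt;

```python
class MemTable:
    """Toy in-memory write buffer. Updates overwrite in place, and the
    contents can be emitted in key order when it is time to flush."""

    def __init__(self, capacity=4):
        self.entries = {}          # key to newest value
        self.capacity = capacity   # flush threshold (hypothetical)

    def put(self, key, value):
        self.entries[key] = value  # pure memory update: very low latency

    def get(self, key):
        return self.entries.get(key)

    def is_full(self):
        # Time to freeze this MemTable and flush it as an SSTable.
        return len(self.entries) == self.capacity

    def snapshot(self):
        # Sorted, immutable view, ready to be written sequentially.
        return sorted(self.entries.items())
```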
&lt;h3 id="wal"&gt;WAL&lt;/h3&gt;
&lt;p&gt;Because memory is volatile, the system also appends each update to a &lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt; for durability.&lt;/p&gt;
&lt;p&gt;So before data reaches long-term storage, it usually exists in two places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;in the WAL for crash recovery,&lt;/li&gt;
&lt;li&gt;and in the MemTable for fast access.&lt;/li&gt;
&lt;/ul&gt;
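&lt;p&gt;The append-then-sync discipline can be sketched like this (a toy JSON-lines log; real WALs add checksums, record batching, and group commit):&lt;/p&gt;

```python
import json
import os

def wal_append(wal_path, key, value):
    """Append one record and force it to stable storage before the
    write is acknowledged. The fsync is the durability point."""
    with open(wal_path, "a") as wal:
        wal.write(json.dumps({"k": key, "v": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # record survives a crash after this call

def wal_replay(wal_path):
    """Rebuild the MemTable contents after a restart by replaying
    the log in order; later records win."""
    mem = {}
    with open(wal_path) as wal:
        for line in wal:
            rec = json.loads(line)
            mem[rec["k"]] = rec["v"]
    return mem
```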
&lt;h3 id="flush-to-sstable"&gt;Flush to SSTable&lt;/h3&gt;
&lt;p&gt;When the MemTable becomes full, it is frozen and flushed to disk as a sorted immutable file, usually called an &lt;strong&gt;SSTable&lt;/strong&gt; (Sorted String Table).&lt;/p&gt;
&lt;p&gt;This is an important moment: instead of many tiny random writes, the engine emits one large sequential write.&lt;/p&gt;
&lt;p&gt;That batching effect is one major reason LSM-trees are so write-efficient.&lt;/p&gt;
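&lt;p&gt;A toy flush might look like the following (the one-record-per-line format is purely illustrative; real SSTables use data blocks, an index block, and a footer):&lt;/p&gt;

```python
import json

def flush_to_sstable(memtable_items, path):
    """Write a frozen MemTable as one sorted, immutable file in a
    single sequential pass, instead of many tiny random writes."""
    with open(path, "w") as f:
        for key, value in sorted(memtable_items):
            f.write(json.dumps([key, value]) + "\n")

def read_sstable(path):
    """Load an SSTable back as a sorted list of (key, value) pairs."""
    with open(path) as f:
        return [tuple(json.loads(line)) for line in f]
```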
&lt;h2 id="3-the-basic-read-path"&gt;3. The Basic Read Path&lt;/h2&gt;
&lt;p&gt;Reads are more complicated than writes, because the newest value may live in several places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the current MemTable,&lt;/li&gt;
&lt;li&gt;one or more immutable MemTables waiting to flush,&lt;/li&gt;
&lt;li&gt;or multiple SSTables on disk across different levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A point lookup therefore checks the newest structures first and may search older ones only if needed.&lt;/p&gt;
&lt;p&gt;Without extra help, this would be expensive, because many SSTables might need to be consulted.&lt;/p&gt;
&lt;p&gt;That is why LSM engines typically use &lt;strong&gt;Bloom filters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a Bloom filter can quickly say &amp;ldquo;this key is definitely not in this SSTable,&amp;rdquo;&lt;/li&gt;
&lt;li&gt;so the engine avoids many unnecessary disk reads,&lt;/li&gt;
&lt;li&gt;and point lookup latency stays practical even when there are many files.&lt;/li&gt;
&lt;/ul&gt;
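&lt;p&gt;The filter-then-read pattern can be sketched as follows (the salted-hash scheme is illustrative; real engines typically derive probe positions by double hashing, and SSTables live on disk rather than in dicts):&lt;/p&gt;

```python
class BloomFilter:
    """Tiny Bloom filter sketch: k hash probes into a bit array.
    A 'no' answer is exact; a 'maybe' answer can be a false positive."""

    def __init__(self, size=1024, probes=3):
        self.size = size
        self.probes = probes
        self.bits = [False] * size

    def _positions(self, key):
        # Salt the hash to get k probe positions (illustrative only).
        return [hash((key, i)) % self.size for i in range(self.probes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def point_lookup(key, memtable, sstables):
    """Newest-first search. Each entry in sstables is a (bloom, data)
    pair, ordered newest file first."""
    if key in memtable:
        return memtable[key]
    for bloom, data in sstables:
        if bloom.might_contain(key):   # skip files that surely lack the key
            if key in data:
                return data[key]
    return None
```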
&lt;h2 id="4-why-compaction-is-the-real-heart-of-the-design"&gt;4. Why Compaction Is the Real Heart of the Design&lt;/h2&gt;
&lt;p&gt;If the system only kept flushing SSTables forever, reads would eventually become too expensive. The engine would have to search too many files, and outdated versions of keys would accumulate.&lt;/p&gt;
&lt;p&gt;So LSM-trees run a background process called &lt;strong&gt;compaction&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Compaction does three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;merges sorted files,&lt;/li&gt;
&lt;li&gt;discards obsolete versions and tombstones when possible,&lt;/li&gt;
&lt;li&gt;and reshapes data into a layout that is cheaper to read later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the hidden cost of the LSM-tree. Writes look cheap on the foreground path because the expensive cleanup work is deferred into the background.&lt;/p&gt;
&lt;p&gt;So compaction is not an implementation detail. It is the mechanism that makes the whole structure sustainable.&lt;/p&gt;
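&lt;p&gt;The merge step can be sketched as a k-way merge that keeps only the newest version of each key (a simplified model: runs are passed newest first, and &lt;code&gt;None&lt;/code&gt; stands in for a tombstone that may be dropped):&lt;/p&gt;

```python
import heapq

def compact(runs):
    """Merge sorted runs into one sorted run. 'runs' is ordered newest
    first; each run is a list of (key, value) pairs sorted by key.
    Keeps the newest version of every key and drops tombstones, as a
    bottom-level compaction would."""
    # Tag each entry with its run's age so that, for equal keys, the
    # newest version (lowest age) comes out of the merge first.
    tagged_runs = [
        [((key, age), value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    result = []
    last_key = object()                  # sentinel: matches no real key
    for (key, age), value in heapq.merge(*tagged_runs):
        if key == last_key:
            continue                     # older version of a key already kept
        last_key = key
        if value is not None:            # tombstones vanish in the output
            result.append((key, value))
    return result
```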
&lt;h2 id="5-the-three-amplifications"&gt;5. The Three Amplifications&lt;/h2&gt;
&lt;p&gt;The cleanest way to understand LSM trade-offs is through the classic trio:&lt;/p&gt;
&lt;h3 id="write-amplification"&gt;Write Amplification&lt;/h3&gt;
&lt;p&gt;Data may be rewritten multiple times during compaction as it moves through levels.&lt;/p&gt;
&lt;p&gt;So one logical write from the application can turn into several physical writes inside the storage engine.&lt;/p&gt;
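&lt;p&gt;As a quick worked example, with made-up numbers:&lt;/p&gt;

```python
def write_amplification(device_bytes, logical_bytes):
    """Bytes the device actually wrote, divided by bytes the
    application asked to write. 1.0 would mean no overhead at all."""
    return device_bytes / logical_bytes

# Hypothetical workload: the application wrote 1 GB, but the flush plus
# several compaction passes rewrote 7 GB on disk.
print(write_amplification(7, 1))   # 7.0
```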
&lt;h3 id="read-amplification"&gt;Read Amplification&lt;/h3&gt;
&lt;p&gt;A read may need to check multiple files or levels before it finds the newest version of a key.&lt;/p&gt;
&lt;p&gt;Bloom filters reduce this cost a lot for point lookups, but range queries still pay for the multi-file layout, because every overlapping file must be merged during the scan.&lt;/p&gt;
&lt;h3 id="space-amplification"&gt;Space Amplification&lt;/h3&gt;
&lt;p&gt;Because old versions and tombstones are not removed immediately, extra disk space is temporarily consumed until compaction catches up.&lt;/p&gt;
&lt;p&gt;An LSM-tree is therefore not &amp;ldquo;free writes.&amp;rdquo; It is a system that trades one kind of cost for another in a controlled way.&lt;/p&gt;
&lt;h2 id="6-leveled-vs-tiered-intuition"&gt;6. Leveled vs. Tiered Intuition&lt;/h2&gt;
&lt;p&gt;Two common compaction styles are:&lt;/p&gt;
&lt;h3 id="leveled-compaction"&gt;Leveled Compaction&lt;/h3&gt;
&lt;p&gt;Each level keeps a relatively strict size bound and non-overlapping key ranges.&lt;/p&gt;
&lt;p&gt;This usually gives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;better read performance,&lt;/li&gt;
&lt;li&gt;lower space amplification,&lt;/li&gt;
&lt;li&gt;but higher write amplification because data is rewritten more aggressively.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="tiered-compaction"&gt;Tiered Compaction&lt;/h3&gt;
&lt;p&gt;The system allows multiple overlapping files to accumulate and merges them less eagerly.&lt;/p&gt;
&lt;p&gt;This usually gives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;better write throughput,&lt;/li&gt;
&lt;li&gt;lower write amplification,&lt;/li&gt;
&lt;li&gt;but worse read amplification and sometimes worse space usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the compaction strategy is really a policy choice about where you want to pay.&lt;/p&gt;
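&lt;p&gt;A rough back-of-the-envelope comparison of the two policies (these are common first-order estimates, not exact formulas, and the numbers are hypothetical):&lt;/p&gt;

```python
def leveled_write_amp(levels, fanout):
    """Rough model: under leveled compaction, each byte is rewritten
    about 'fanout' times per level as it is pushed down."""
    return levels * fanout

def tiered_write_amp(levels):
    """Rough model: under tiered compaction, each byte is rewritten
    about once per level, because overlapping runs are allowed to
    accumulate before they are merged."""
    return levels

# With 4 levels and a fanout of 10, leveled rewrites each byte roughly
# 40 times while tiered rewrites it roughly 4 times; tiered pays for
# that with more overlapping runs to check on every read.
print(leveled_write_amp(4, 10), tiered_write_amp(4))
```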
&lt;h2 id="7-the-mental-model"&gt;7. The Mental Model&lt;/h2&gt;
&lt;p&gt;The best mental model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the MemTable absorbs updates,&lt;/li&gt;
&lt;li&gt;SSTables preserve sorted immutable snapshots,&lt;/li&gt;
&lt;li&gt;Bloom filters keep point lookups from exploding,&lt;/li&gt;
&lt;li&gt;and compaction continuously converts short-term write efficiency into long-term read efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why LSM-trees are so common in write-heavy systems. They do not eliminate cost. They &lt;strong&gt;move and reshape&lt;/strong&gt; it.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;An LSM-tree is a data structure designed to make writes fast by buffering updates in memory, flushing them as immutable sorted files, and restoring order later through compaction. Its power comes from turning random writes into sequential work, but its real trade-offs live in compaction, Bloom-filter-assisted reads, and the balance among write amplification, read amplification, and space amplification.&lt;/p&gt;</description></item></channel></rss>