<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Structure | Jiangneng's Homepage</title><link>https://www.jiangnengli.com/tag/data-structure/</link><atom:link href="https://www.jiangnengli.com/tag/data-structure/index.xml" rel="self" type="application/rss+xml"/><description>Data Structure</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.jiangnengli.com/media/icon_hu_37c904991c0d686.png</url><title>Data Structure</title><link>https://www.jiangnengli.com/tag/data-structure/</link></image><item><title>Data Structure Notes: LSM-Tree</title><link>https://www.jiangnengli.com/post/data-structure-lsmtree/</link><pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.jiangnengli.com/post/data-structure-lsmtree/</guid><description>&lt;p&gt;An &lt;strong&gt;LSM-tree&lt;/strong&gt; (Log-Structured Merge-Tree) is a write-optimized data structure widely used in modern storage engines such as RocksDB, LevelDB, Bigtable-style systems, and many NoSQL databases.&lt;/p&gt;
&lt;p&gt;The core idea is simple: instead of updating data in place like a B-tree, an LSM-tree turns random writes into a sequence of &lt;strong&gt;append, flush, and merge&lt;/strong&gt; operations. That design is much friendlier to disks and SSDs when the workload is write-heavy.&lt;/p&gt;
&lt;h2 id="1-why-lsm-trees-exist"&gt;1. Why LSM-Trees Exist&lt;/h2&gt;
&lt;p&gt;Traditional update-in-place structures are good at keeping data immediately organized, but random writes are expensive. Every insert may require touching existing pages, rewriting internal nodes, and causing scattered I/O.&lt;/p&gt;
&lt;p&gt;LSM-trees take the opposite approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;accept writes quickly in memory,&lt;/li&gt;
&lt;li&gt;write sorted immutable files to disk in batches,&lt;/li&gt;
&lt;li&gt;and reorganize them later through background compaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the design goal is not &amp;ldquo;keep the data perfectly ordered at every moment.&amp;rdquo; It is &amp;ldquo;make writes cheap first, then clean up efficiently in bulk.&amp;rdquo;&lt;/p&gt;
&lt;h2 id="2-the-basic-write-path"&gt;2. The Basic Write Path&lt;/h2&gt;
&lt;p&gt;The write path usually looks like this:&lt;/p&gt;
&lt;h3 id="memtable"&gt;MemTable&lt;/h3&gt;
&lt;p&gt;Incoming writes first go to an in-memory structure, often called a &lt;strong&gt;MemTable&lt;/strong&gt;. This is typically a sorted map or skip list.&lt;/p&gt;
&lt;p&gt;Updating memory is fast, so the system can absorb writes with very low latency.&lt;/p&gt;
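&lt;p&gt;A minimal sketch of the idea in Python (the class and method names are illustrative; production engines typically use a concurrent skip list rather than a plain dict):&lt;/p&gt;

```python
class MemTable:
    """Toy in-memory write buffer. Updates overwrite in place, and the
    contents can be emitted in key order when it is time to flush."""

    def __init__(self, capacity=4):
        self.entries = {}          # key to newest value
        self.capacity = capacity   # flush threshold (hypothetical)

    def put(self, key, value):
        self.entries[key] = value  # pure memory update: very low latency

    def get(self, key):
        return self.entries.get(key)

    def is_full(self):
        # Time to freeze this MemTable and flush it as an SSTable.
        return len(self.entries) == self.capacity

    def snapshot(self):
        # Sorted, immutable view, ready to be written sequentially.
        return sorted(self.entries.items())
```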
&lt;h3 id="wal"&gt;WAL&lt;/h3&gt;
&lt;p&gt;Because memory is volatile, the system also appends each update to a &lt;strong&gt;Write-Ahead Log (WAL)&lt;/strong&gt; for durability.&lt;/p&gt;
&lt;p&gt;So before data reaches long-term storage, it usually exists in two places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;in the WAL for crash recovery,&lt;/li&gt;
&lt;li&gt;and in the MemTable for fast access.&lt;/li&gt;
&lt;/ul&gt;
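&lt;p&gt;The append-then-sync discipline can be sketched like this (a toy JSON-lines log; real WALs add checksums, record batching, and group commit):&lt;/p&gt;

```python
import json
import os

def wal_append(wal_path, key, value):
    """Append one record and force it to stable storage before the
    write is acknowledged. The fsync is the durability point."""
    with open(wal_path, "a") as wal:
        wal.write(json.dumps({"k": key, "v": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # record survives a crash after this call

def wal_replay(wal_path):
    """Rebuild the MemTable contents after a restart by replaying
    the log in order; later records win."""
    mem = {}
    with open(wal_path) as wal:
        for line in wal:
            rec = json.loads(line)
            mem[rec["k"]] = rec["v"]
    return mem
```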
&lt;h3 id="flush-to-sstable"&gt;Flush to SSTable&lt;/h3&gt;
&lt;p&gt;When the MemTable becomes full, it is frozen and flushed to disk as a sorted immutable file, usually called an &lt;strong&gt;SSTable&lt;/strong&gt; (Sorted String Table).&lt;/p&gt;
&lt;p&gt;This is an important moment: instead of many tiny random writes, the engine emits one large sequential write.&lt;/p&gt;
&lt;p&gt;That batching effect is one major reason LSM-trees are so write-efficient.&lt;/p&gt;
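&lt;p&gt;A toy flush might look like the following (the one-record-per-line format is purely illustrative; real SSTables use data blocks, an index block, and a footer):&lt;/p&gt;

```python
import json

def flush_to_sstable(memtable_items, path):
    """Write a frozen MemTable as one sorted, immutable file in a
    single sequential pass, instead of many tiny random writes."""
    with open(path, "w") as f:
        for key, value in sorted(memtable_items):
            f.write(json.dumps([key, value]) + "\n")

def read_sstable(path):
    """Load an SSTable back as a sorted list of (key, value) pairs."""
    with open(path) as f:
        return [tuple(json.loads(line)) for line in f]
```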
&lt;h2 id="3-the-basic-read-path"&gt;3. The Basic Read Path&lt;/h2&gt;
&lt;p&gt;Reads are more complicated than writes, because the newest value may live in several places:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the current MemTable,&lt;/li&gt;
&lt;li&gt;one or more immutable MemTables waiting to flush,&lt;/li&gt;
&lt;li&gt;or multiple SSTables on disk across different levels.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A point lookup therefore checks the newest structures first and may search older ones only if needed.&lt;/p&gt;
&lt;p&gt;Without extra help, this would be expensive, because many SSTables might need to be consulted.&lt;/p&gt;
&lt;p&gt;That is why LSM engines typically use &lt;strong&gt;Bloom filters&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a Bloom filter can quickly say &amp;ldquo;this key is definitely not in this SSTable,&amp;rdquo;&lt;/li&gt;
&lt;li&gt;so the engine avoids many unnecessary disk reads,&lt;/li&gt;
&lt;li&gt;and point lookup latency stays practical even when there are many files.&lt;/li&gt;
&lt;/ul&gt;
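&lt;p&gt;The filter-then-read pattern can be sketched as follows (the salted-hash scheme is illustrative; real engines typically derive probe positions by double hashing, and SSTables live on disk rather than in dicts):&lt;/p&gt;

```python
class BloomFilter:
    """Tiny Bloom filter sketch: k hash probes into a bit array.
    A 'no' answer is exact; a 'maybe' answer can be a false positive."""

    def __init__(self, size=1024, probes=3):
        self.size = size
        self.probes = probes
        self.bits = [False] * size

    def _positions(self, key):
        # Salt the hash to get k probe positions (illustrative only).
        return [hash((key, i)) % self.size for i in range(self.probes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def point_lookup(key, memtable, sstables):
    """Newest-first search. Each entry in sstables is a (bloom, data)
    pair, ordered newest file first."""
    if key in memtable:
        return memtable[key]
    for bloom, data in sstables:
        if bloom.might_contain(key):   # skip files that surely lack the key
            if key in data:
                return data[key]
    return None
```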
&lt;h2 id="4-why-compaction-is-the-real-heart-of-the-design"&gt;4. Why Compaction Is the Real Heart of the Design&lt;/h2&gt;
&lt;p&gt;If the system only kept flushing SSTables forever, reads would eventually become too expensive. The engine would have to search too many files, and outdated versions of keys would accumulate.&lt;/p&gt;
&lt;p&gt;So LSM-trees run a background process called &lt;strong&gt;compaction&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Compaction does three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;merges sorted files,&lt;/li&gt;
&lt;li&gt;discards obsolete versions and tombstones when possible,&lt;/li&gt;
&lt;li&gt;and reshapes data into a layout that is cheaper to read later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the hidden cost of the LSM-tree. Writes look cheap on the foreground path because the expensive cleanup work is deferred into the background.&lt;/p&gt;
&lt;p&gt;So compaction is not an implementation detail. It is the mechanism that makes the whole structure sustainable.&lt;/p&gt;
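&lt;p&gt;The merge step can be sketched as a k-way merge that keeps only the newest version of each key (a simplified model: runs are passed newest first, and &lt;code&gt;None&lt;/code&gt; stands in for a tombstone that may be dropped):&lt;/p&gt;

```python
import heapq

def compact(runs):
    """Merge sorted runs into one sorted run. 'runs' is ordered newest
    first; each run is a list of (key, value) pairs sorted by key.
    Keeps the newest version of every key and drops tombstones, as a
    bottom-level compaction would."""
    # Tag each entry with its run's age so that, for equal keys, the
    # newest version (lowest age) comes out of the merge first.
    tagged_runs = [
        [((key, age), value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    result = []
    last_key = object()                  # sentinel: matches no real key
    for (key, age), value in heapq.merge(*tagged_runs):
        if key == last_key:
            continue                     # older version of a key already kept
        last_key = key
        if value is not None:            # tombstones vanish in the output
            result.append((key, value))
    return result
```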
&lt;h2 id="5-the-three-amplifications"&gt;5. The Three Amplifications&lt;/h2&gt;
&lt;p&gt;The cleanest way to understand LSM trade-offs is through the classic trio:&lt;/p&gt;
&lt;h3 id="write-amplification"&gt;Write Amplification&lt;/h3&gt;
&lt;p&gt;Data may be rewritten multiple times during compaction as it moves through levels.&lt;/p&gt;
&lt;p&gt;So one logical write from the application can turn into several physical writes inside the storage engine.&lt;/p&gt;
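&lt;p&gt;As a quick worked example, with made-up numbers:&lt;/p&gt;

```python
def write_amplification(device_bytes, logical_bytes):
    """Bytes the device actually wrote, divided by bytes the
    application asked to write. 1.0 would mean no overhead at all."""
    return device_bytes / logical_bytes

# Hypothetical workload: the application wrote 1 GB, but the flush plus
# several compaction passes rewrote 7 GB on disk.
print(write_amplification(7, 1))   # 7.0
```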
&lt;h3 id="read-amplification"&gt;Read Amplification&lt;/h3&gt;
&lt;p&gt;A read may need to check multiple files or levels before it finds the newest version of a key.&lt;/p&gt;
&lt;p&gt;Bloom filters reduce this cost a lot for point lookups, but range queries still pay for the multi-file layout, because every overlapping file must be merged during the scan.&lt;/p&gt;
&lt;h3 id="space-amplification"&gt;Space Amplification&lt;/h3&gt;
&lt;p&gt;Because old versions and tombstones are not removed immediately, extra disk space is temporarily consumed until compaction catches up.&lt;/p&gt;
&lt;p&gt;An LSM-tree is therefore not &amp;ldquo;free writes.&amp;rdquo; It is a system that trades one kind of cost for another in a controlled way.&lt;/p&gt;
&lt;h2 id="6-leveled-vs-tiered-intuition"&gt;6. Leveled vs. Tiered Intuition&lt;/h2&gt;
&lt;p&gt;Two common compaction styles are:&lt;/p&gt;
&lt;h3 id="leveled-compaction"&gt;Leveled Compaction&lt;/h3&gt;
&lt;p&gt;Each level keeps a relatively strict size bound and non-overlapping key ranges.&lt;/p&gt;
&lt;p&gt;This usually gives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;better read performance,&lt;/li&gt;
&lt;li&gt;lower space amplification,&lt;/li&gt;
&lt;li&gt;but higher write amplification because data is rewritten more aggressively.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="tiered-compaction"&gt;Tiered Compaction&lt;/h3&gt;
&lt;p&gt;The system allows multiple overlapping files to accumulate and merges them less eagerly.&lt;/p&gt;
&lt;p&gt;This usually gives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;better write throughput,&lt;/li&gt;
&lt;li&gt;lower write amplification,&lt;/li&gt;
&lt;li&gt;but worse read amplification and sometimes worse space usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the compaction strategy is really a policy choice about where you want to pay.&lt;/p&gt;
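&lt;p&gt;A rough back-of-the-envelope comparison of the two policies (these are common first-order estimates, not exact formulas, and the numbers are hypothetical):&lt;/p&gt;

```python
def leveled_write_amp(levels, fanout):
    """Rough model: under leveled compaction, each byte is rewritten
    about 'fanout' times per level as it is pushed down."""
    return levels * fanout

def tiered_write_amp(levels):
    """Rough model: under tiered compaction, each byte is rewritten
    about once per level, because overlapping runs are allowed to
    accumulate before they are merged."""
    return levels

# With 4 levels and a fanout of 10, leveled rewrites each byte roughly
# 40 times while tiered rewrites it roughly 4 times; tiered pays for
# that with more overlapping runs to check on every read.
print(leveled_write_amp(4, 10), tiered_write_amp(4))
```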
&lt;h2 id="7-the-mental-model"&gt;7. The Mental Model&lt;/h2&gt;
&lt;p&gt;The best mental model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the MemTable absorbs updates,&lt;/li&gt;
&lt;li&gt;SSTables preserve sorted immutable snapshots,&lt;/li&gt;
&lt;li&gt;Bloom filters keep point lookups from exploding,&lt;/li&gt;
&lt;li&gt;and compaction continuously converts short-term write efficiency into long-term read efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why LSM-trees are so common in write-heavy systems. They do not eliminate cost. They &lt;strong&gt;move and reshape&lt;/strong&gt; it.&lt;/p&gt;
&lt;h2 id="key-takeaway"&gt;Key Takeaway&lt;/h2&gt;
&lt;p&gt;An LSM-tree is a data structure designed to make writes fast by buffering updates in memory, flushing them as immutable sorted files, and restoring order later through compaction. Its power comes from turning random writes into sequential work, but its real trade-offs live in compaction, Bloom-filter-assisted reads, and the balance among write amplification, read amplification, and space amplification.&lt;/p&gt;</description></item></channel></rss>