Understanding Log-Structured Merge Trees

Authors
  • Nanliang (Potter) Wu

What is an LSM-Tree?

The LSM-tree (Log-Structured Merge tree) is a high-performance data structure widely used in modern databases and storage engines. It is designed to optimize write-heavy workloads by converting random writes into sequential writes, making it ideal for systems that prioritize write throughput. Popular systems such as Apache Cassandra, RocksDB, LevelDB, and ScyllaDB are built on the LSM-tree architecture.

Architecture and Components

The LSM-tree architecture consists of four main components:

MemTable: An in-memory data structure (typically a balanced tree such as a red-black tree, or a skip list) that stores the most recent write operations. It keeps key-value pairs sorted by key for efficient lookups and range queries. Once the MemTable reaches a certain size threshold, it becomes immutable and a new MemTable is created for incoming writes.

ImmutableMemTable: A read-only version of the MemTable that is waiting to be flushed to disk. When the active MemTable reaches its capacity, it is converted to an ImmutableMemTable. This allows new writes to continue in a fresh MemTable while the ImmutableMemTable is being written to disk in the background, ensuring no write operations are blocked.

SSTable (Sorted String Table): Persistent, immutable files stored on disk that contain sorted key-value pairs. SSTables are organized in multiple levels (L0, L1, L2, ..., LN), where each level contains multiple SSTable files. Lower-numbered levels (closer to L0) contain more recent data, while higher-numbered levels store older, compacted data. SSTables are never modified in place; instead, they are merged and rewritten during compaction.

WAL (Write-Ahead Log): A sequential append-only log file that records all write operations before they are applied to the MemTable. This provides durability and crash recovery. If the system crashes before the MemTable is flushed to disk, the WAL can be replayed to restore the lost in-memory data. Once data is safely persisted to SSTables, the corresponding WAL entries can be deleted.
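The MemTable and WAL described above can be modeled in a few lines of Python. This is an illustrative sketch, not a production design: the class names are invented for the example, the sorted list stands in for a skip list or red-black tree, and the WAL is kept in memory rather than on disk.

```python
import bisect

class MemTable:
    """In-memory sorted key-value store. A sorted list stands in for
    the skip list / red-black tree a real engine would use."""

    def __init__(self, size_threshold=4):
        self.keys = []              # keys kept in sorted order
        self.values = {}            # key -> value (None marks a tombstone)
        self.size_threshold = size_threshold

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)   # insert while preserving order
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)

    def is_full(self):
        # When full, a real engine freezes this table (ImmutableMemTable)
        # and flushes it to disk as an SSTable.
        return len(self.keys) >= self.size_threshold


class WAL:
    """Append-only log; every write is recorded here before the MemTable."""

    def __init__(self):
        self.entries = []           # in-memory stand-in for an append-only file

    def append(self, key, value):
        self.entries.append((key, value))

    def replay(self, memtable):
        # Crash recovery: re-apply each logged write, in order, to rebuild
        # the in-memory state that was lost.
        for key, value in self.entries:
            memtable.put(key, value)
```

The `replay` method shows why the WAL provides durability: after a crash, a fresh MemTable rebuilt from the log is identical to the one that was lost.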

How LSM-Tree Works

Write Path

[Figure: Log-Structured Merge write flow]

  1. Write to WAL: Every write operation is first appended to the Write-Ahead Log for durability. This ensures data persistence even if the system crashes before the in-memory data is flushed to disk.

  2. Insert into MemTable: The data is then inserted into the active MemTable, which maintains keys in sorted order. When the MemTable reaches its size threshold, it is converted to an ImmutableMemTable, and a new empty MemTable is created to handle subsequent writes.

  3. Flush to Disk: A background thread periodically flushes ImmutableMemTables to disk as new SSTable files in the L0 level. Once successfully written, the corresponding WAL entries can be safely deleted.

  4. Compaction: SSTables are organized in multiple levels (L0, L1, L2, ...). When a level exceeds its size limit, a compaction process merges SSTables from that level into the next level. During compaction, duplicate keys are resolved (keeping the most recent version), and deleted entries are removed, reducing storage space and improving read performance.
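Step 4 can be sketched as a sorted two-way merge. This is a simplified illustration, not any engine's actual compaction algorithm; the `TOMBSTONE` sentinel and the convention that the first argument is the newer run are assumptions of the example.

```python
TOMBSTONE = None  # sentinel marking a deleted key

def compact(newer, older):
    """Merge two sorted runs of (key, value) pairs into one sorted run.
    On duplicate keys the newer version wins; tombstones are dropped,
    physically reclaiming deleted entries."""
    out = []
    i = j = 0
    while i < len(newer) and j < len(older):
        nk, nv = newer[i]
        ok, ov = older[j]
        if nk < ok:
            if nv is not TOMBSTONE:
                out.append((nk, nv))
            i += 1
        elif ok < nk:
            if ov is not TOMBSTONE:
                out.append((ok, ov))
            j += 1
        else:
            # Same key in both runs: keep the newer version (unless it
            # is a tombstone) and silently discard the older one.
            if nv is not TOMBSTONE:
                out.append((nk, nv))
            i += 1
            j += 1
    # One run is exhausted; drain the other, still filtering tombstones.
    for k, v in newer[i:] + older[j:]:
        if v is not TOMBSTONE:
            out.append((k, v))
    return out
```

Because both inputs are already sorted, the merge is a single linear pass, which is why compaction can stream large SSTables sequentially rather than seeking randomly.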

Read Path

[Figure: Log-Structured Merge read flow]

Reading data from an LSM-tree requires checking multiple locations since data may exist in memory or across different levels on disk. The read path is designed to check the most recent data first:

  1. Check MemTable: The read operation first searches the active MemTable in memory. Since it contains the most recent writes, if the key exists here, it's returned immediately with the lowest latency.

  2. Check ImmutableMemTable: If the key is not found in the MemTable, the search continues to any ImmutableMemTables that are waiting to be flushed to disk. These contain recent data that hasn't been persisted yet.

  3. Search SSTables: If the key isn't in memory, the search moves to disk, checking SSTables level by level (L0 → L1 → L2 → ... → LN). Since lower-numbered levels contain more recent data, they are searched first. Within each level, multiple SSTable files may need to be checked.

  4. Bloom Filter Optimization: To avoid expensive disk I/O for keys that don't exist, each SSTable has an associated Bloom filter—a probabilistic data structure that can quickly determine if a key definitely doesn't exist in the SSTable. This significantly reduces unnecessary disk reads.

  5. Key Merging: If the same key exists in multiple locations (e.g., updates or deletes), the most recent version (from the lowest level) takes precedence. Tombstone markers indicate deleted keys.
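The five steps above can be condensed into a single lookup function. This is a toy sketch under stated assumptions: MemTables are plain dicts, a Python set stands in for the Bloom filter (a real filter is probabilistic and can give false positives, but never false negatives), and the `SSTable` class and `read` function are invented names for the example.

```python
class SSTable:
    """Immutable sorted run (modeled in memory) with a set standing in
    for its Bloom filter."""

    def __init__(self, pairs):
        self.data = dict(pairs)
        self.filter = set(self.data)    # real engines use a Bloom filter here

    def might_contain(self, key):
        # A Bloom filter can say "definitely absent" or "maybe present";
        # the set gives the same answer without false positives.
        return key in self.filter

    def get(self, key):
        return self.data.get(key)


def read(key, memtable, immutables, levels):
    """Check the freshest locations first and return the first hit.
    `levels` is a list of lists of SSTables, L0 first; within a level,
    files are assumed ordered newest-first."""
    # Steps 1-2: memory first (active MemTable, then ImmutableMemTables).
    for table in [memtable, *immutables]:
        if key in table:
            return table[key]           # may be a tombstone (None = deleted)
    # Steps 3-4: disk, L0 -> LN; Bloom filters skip files cheaply.
    for level in levels:
        for sst in level:
            if sst.might_contain(key):
                # Step 5: the first (freshest) version found wins.
                return sst.get(key)
    return None                         # key not found anywhere
```

Note how a key present in both L0 and L1 resolves to the L0 version: the search order itself implements the "most recent version wins" rule.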

This read amplification (checking multiple locations) is the trade-off for LSM's excellent write performance. Various optimizations like Bloom filters, compression, and efficient indexing help minimize the read latency impact.

Trade-offs

Advantages

High Write Throughput: LSM-trees convert random writes into sequential writes, which are much faster on modern storage devices (especially SSDs). This makes LSM ideal for write-intensive applications where data is constantly being inserted or updated.

Efficient Storage Utilization: Through compaction, LSM-trees eliminate duplicate entries and deleted data, resulting in better space efficiency. Compression algorithms can be applied during compaction, further reducing storage costs.

Flash-Friendly Write Pattern: LSM-trees batch writes in memory and flush them sequentially, avoiding small random writes. This extends the lifespan of SSDs by reducing wear, although compaction does rewrite data over time and contributes its own write amplification.

Predictable Performance: Write operations are consistently fast since they only involve appending to the WAL and updating in-memory structures. There's no need to navigate complex tree structures or perform random disk I/O during writes.

Disadvantages

Read Amplification: Reading a single key may require checking multiple SSTables across different levels, leading to higher read latency. Even with optimizations like Bloom filters, reads can be slower, especially for non-existent keys.

Compaction Overhead: Background compaction processes consume CPU and I/O resources. During heavy compaction periods, system performance may degrade, and there can be unpredictable latency spikes.

Space Amplification: Before compaction completes, LSM-trees temporarily store multiple versions of the same key across different levels, leading to higher disk space usage compared to in-place update structures.

Range Query Complexity: While LSM-trees support range queries, data scattered across multiple SSTables and levels may require more disk seeks and merging operations to return sorted results.
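A range query across several sorted runs reduces to a k-way merge, which Python's standard library handles directly. This is a minimal sketch: it assumes each run is sorted and that keys are unique across runs (a full engine would also resolve duplicate versions and tombstones during the merge).

```python
import heapq

def range_scan(low, high, *sorted_runs):
    """Yield (key, value) pairs with low <= key < high, in sorted order,
    merged from any number of sorted runs (MemTable + SSTables)."""
    # heapq.merge performs a lazy k-way merge over the sorted inputs,
    # mirroring how an LSM-tree iterator merges per-level cursors.
    for key, value in heapq.merge(*sorted_runs):
        if key >= high:
            break                   # inputs are sorted, so we can stop early
        if key >= low:
            yield key, value
```

Each input run contributes one cursor, so the cost of a scan grows with the number of runs consulted, which is exactly why compaction (fewer, larger runs) also speeds up range queries.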

Use Cases

Ideal Scenarios

Write-Heavy Workloads: Applications with high write throughput requirements, such as time-series databases, logging systems, and event streaming platforms. Examples include metrics collection systems and IoT data ingestion.

Append-Only or Insert-Heavy Scenarios: Systems that primarily insert new data with few updates, like message queues, audit logs, and blockchain databases.

Large-Scale Distributed Systems: LSM-trees work well in distributed environments where data is partitioned across multiple nodes. Their append-only nature simplifies replication and consistency protocols.

SSD-Optimized Storage: LSM's sequential write pattern is ideal for SSDs, maximizing throughput while minimizing wear. This is why many modern cloud-native databases adopt LSM architecture.

Real-World Examples

  • Cassandra & ScyllaDB: Distributed databases optimized for write-heavy, time-series data
  • RocksDB & LevelDB: Embedded key-value stores used in blockchain nodes and storage engines
  • HBase & Bigtable: Wide-column stores for massive-scale analytical workloads
  • ClickHouse: OLAP database whose MergeTree table engines use an LSM-style merge-on-write design for real-time analytics