Apache Cassandra Compaction Strategies
Cassandra’s Write Path
To understand the importance of compactions in Cassandra, you must first understand how Cassandra writes data to disk. The Cassandra write path in a nutshell:
- Cassandra stores recent writes in memory (in a structure called the Memtable).
- When enough writes have accumulated, Cassandra flushes the Memtable to disk. Data on disk is stored in relatively simple data structures called Sorted String Tables (SSTables). At its simplest, an SSTable can be described as a sorted array of strings.
- Before writing a new SSTable, Cassandra merges and pre-sorts the data in the Memtable by Primary Key. In Cassandra, a Primary Key consists of a Partition Key (a unique key that determines which node the data is stored on) and any Clustering Keys that have been defined.
- The SSTable is written to disk as a single contiguous write operation. SSTables are immutable: once they are written to disk, they are not modified. Any updates to, or deletions of, data within an SSTable are written to a new SSTable. If data is updated regularly, Cassandra may need to read from multiple SSTables to retrieve a single row.
- Compaction operations occur periodically to re-write and combine SSTables. This is required because SSTables are immutable (no modifications once written to disk). Compactions prune deleted data and merge disparate row data into new SSTables in order to reclaim disk space and keep read operations optimised.
If you are unfamiliar with Cassandra’s write path, please read The write path to compaction from Datastax.
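The write path described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how Cassandra is implemented: the class `ToyTable`, its `flush_threshold` parameter, and the integer clock are all hypothetical, and real Cassandra only purges deletion markers (tombstones) after a grace period.

```python
# Toy model of the write path: memtable -> immutable SSTables -> compaction.
# All names are illustrative; this is not Cassandra's actual implementation.

TOMBSTONE = object()  # marker standing in for a deletion


class ToyTable:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memtable = {}   # key -> (timestamp, value)
        self.sstables = []   # oldest first; each is a tuple of (key, timestamp, value)
        self.clock = 0       # stand-in for a write timestamp

    def write(self, key, value):
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        if not self.memtable:
            return
        # Sort by key, then write the whole thing out as one immutable unit.
        rows = sorted(
            ((k, ts, v) for k, (ts, v) in self.memtable.items()),
            key=lambda row: row[0],
        )
        self.sstables.append(tuple(rows))
        self.memtable = {}

    def read(self, key):
        # A read may have to consult the memtable and every SSTable;
        # the version with the newest timestamp wins.
        candidates = []
        if key in self.memtable:
            candidates.append(self.memtable[key])
        for sstable in self.sstables:
            for k, ts, v in sstable:
                if k == key:
                    candidates.append((ts, v))
        if not candidates:
            return None
        _, value = max(candidates, key=lambda c: c[0])
        return None if value is TOMBSTONE else value

    def compact(self):
        # Merge all SSTables into one: keep the newest version of each key
        # and drop deleted data (no grace period in this toy model).
        newest = {}
        for sstable in self.sstables:
            for k, ts, v in sstable:
                if k not in newest or ts > newest[k][0]:
                    newest[k] = (ts, v)
        merged = sorted(
            ((k, ts, v) for k, (ts, v) in newest.items() if v is not TOMBSTONE),
            key=lambda row: row[0],
        )
        self.sstables = [tuple(merged)] if merged else []


table = ToyTable(flush_threshold=2)
table.write("a", 1)
table.write("b", 2)    # second write triggers a flush: first SSTable
table.write("a", 10)   # update goes to a new memtable, not the old SSTable
table.delete("b")      # triggers a flush: second SSTable
print(len(table.sstables))  # 2
print(table.read("a"))      # 10
table.compact()
print(len(table.sstables))  # 1
```

Before compaction, reading `"a"` has to look at both SSTables; afterwards a single SSTable holds only the latest live data.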
Cassandra Compaction Strategies
Multiple Compaction Strategies are included with Cassandra, and each is optimized for a different use case:
Type | Description | When? |
---|---|---|
SizeTiered Compaction Strategy (STCS) | This is the default compaction strategy. It triggers a compaction when multiple SSTables of a similar size are present. Additional parameters allow STCS to be tuned to increase or decrease the number of compactions it performs and to control how tombstones are handled. | This compaction strategy is good for insert-heavy and general workloads. |
Leveled Compaction Strategy (LCS) | This strategy groups SSTables into levels, each of which has a fixed size limit 10 times larger than the previous level. SSTables are of a fixed, relatively small size (160MB by default), so if Level 1 contains at most 10 SSTables, Level 2 contains at most 100. SSTables are guaranteed to be non-overlapping within each level: if any data overlaps when a table is promoted to the next level, the overlapping tables are re-compacted. For example, when Level 1 is full, any new SSTables added to that level are compacted together with any existing tables that contain overlapping data. If these compactions leave Level 1 with too many tables, the additional table(s) overflow to Level 2. | This compaction strategy is best for read-heavy workloads (because tables within a level are non-overlapping, LCS guarantees that 90% of all reads can be satisfied from a single SSTable) or workloads where there are more updates than inserts. |
DateTiered Compaction Strategy (DTCS) | This compaction strategy is designed for use with time-series data. DTCS stores data written within the same time period in the same SSTable. Multiple SSTables that are themselves written in the same time window will be compacted together, up until a certain point, after which the SSTables are no longer compacted. SSTables are also configured with a TTL. SSTables that are older than the TTL will be dropped, incurring zero compaction overhead. | DTCS is highly performant and efficient, but only if the workload matches its strict requirements. DTCS is not designed for workloads with updates to old data or out-of-order inserts. If your workload does not fit these requirements, you may be better off using STCS with a bucketing key (such as hour/day/week) to break up your data. |
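To make the STCS and LCS descriptions in the table more concrete, here is a rough sketch of the two ideas: grouping SSTables of similar size into buckets, and computing how level capacity grows by a factor of 10. The function names are hypothetical. The default values (`bucket_low=0.5`, `bucket_high=1.5`, `min_threshold=4`, 160MB SSTables) follow the documented option defaults, but the logic is heavily simplified compared to what Cassandra actually does.

```python
def stcs_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of 'similar' size (simplified STCS)."""
    buckets = []  # each bucket is a list of sizes in MB
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # An SSTable joins a bucket if its size is close to the bucket average.
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def stcs_candidates(sizes_mb, min_threshold=4):
    """Buckets holding enough similar-sized SSTables to trigger a compaction."""
    return [b for b in stcs_buckets(sizes_mb) if len(b) >= min_threshold]


def lcs_level_capacity(level, sstable_size_mb=160, fanout=10):
    """Max SSTable count and total size (MB) for an LCS level >= 1."""
    max_sstables = fanout ** level
    return max_sstables, max_sstables * sstable_size_mb


sizes = [10, 11, 9, 12, 100, 105, 400]
print(stcs_buckets(sizes))     # [[9, 10, 11, 12], [100, 105], [400]]
print(stcs_candidates(sizes))  # [[9, 10, 11, 12]]

for level in (1, 2, 3):
    print(level, lcs_level_capacity(level))
# 1 (10, 1600)
# 2 (100, 16000)
# 3 (1000, 160000)
```

In this example only the four small SSTables are similar enough, and numerous enough, to be compacted together under STCS. The LCS figures match the table: 10 SSTables at Level 1 and 100 at Level 2.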
Configuring a Compaction Strategy
Compaction options are configured at the table level via CQLSH. This allows each table to be optimised based on how it will be used. If a compaction strategy is not specified, SizeTieredCompactionStrategy will be used.
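As a sketch of what such a table-level setting looks like, the helper below builds the `ALTER TABLE ... WITH compaction = {...}` statement you would run in CQLSH. The helper `alter_compaction_cql` and the keyspace and table names (`shop.orders`, `shop.events`) are hypothetical; `sstable_size_in_mb` and `min_threshold` are existing compaction sub-options, but check the documentation for your Cassandra version before relying on any specific option.

```python
def _cql_literal(value):
    """Render a Python value as a CQL literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def alter_compaction_cql(keyspace, table, strategy, **options):
    """Build an ALTER TABLE statement that sets a table's compaction strategy."""
    settings = {"class": strategy, **options}
    body = ", ".join(f"'{name}': {_cql_literal(val)}" for name, val in settings.items())
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{body}}};"


# Hypothetical read-heavy table: switch it to LCS.
print(alter_compaction_cql("shop", "orders", "LeveledCompactionStrategy",
                           sstable_size_in_mb=160))
# ALTER TABLE shop.orders WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

# Hypothetical insert-heavy table: keep STCS but compact less eagerly.
print(alter_compaction_cql("shop", "events", "SizeTieredCompactionStrategy",
                           min_threshold=6))
# ALTER TABLE shop.events WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 6};
```

The printed statements can be pasted into CQLSH as-is. The same `compaction = {...}` map can also be supplied in a `CREATE TABLE ... WITH` clause.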