Thursday 15 February 2018

Cassandra Study Links:
======================

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/cap_theorem.html

https://www.edureka.co/blog/introduction-to-cassandra-architecture/

https://teddyma.gitbooks.io/learncassandra/content/model/internal_data_structure.html

https://www.slideshare.net/narsiman/cassandra-internals-43964918

http://file.allitebooks.com/20150811/Mastering%20Apache%20Cassandra.pdf

Node Down For Me?

For a read or write request to start in Cassandra, at least as many nodes must be seen as UP by the coordinator node as the request has specified via the ConsistencyLevel. Otherwise the client will get an UnavailableException and the cluster will appear down for that request. That may not necessarily mean it is down for all keys or all requests.
The Replication Factor, number of nodes, the Consistency Level and luck all play a part in determining how many nodes can be lost in a Cassandra cluster before it is unavailable for 100% of the keys. Before it reaches that point though, the cluster may go through a period of partial failure where some keys will not be available at some CL levels.
The partial failure support baked into the system is a nice thing to have. But most people will be interested in keeping 100% of the keys available at the required Consistency Level. So most of the discussion below talks about keeping the cluster up for 100% of the keys at the QUORUM CL.

Which nodes?

When it comes to counting the UP nodes for a request we only consider the Natural Endpoints for a key. These are the nodes identified by the placement_strategy (set when the Keyspace was created) as the replicas for a key, and they never change. All read and write operations for a key, using the same partitioner, will select those same endpoints. Otherwise write operations could plonk down data that reads could never find.
The row key is first Decorated by the partitioner (specified in conf/cassandra.yaml) to create the token used to locate the row in the cluster. For example the RandomPartitioner uses an MD5 transform to turn the key into a 128-bit token.
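For illustration, a minimal sketch of deriving such a token (treating the MD5 digest as an unsigned big-endian integer is my assumption; the real logic lives in Cassandra's RandomPartitioner):

import hashlib

def md5_token(row_key: bytes) -> int:
    # Decorate the raw row key with an MD5 transform to get a 128-bit token.
    return int.from_bytes(hashlib.md5(row_key).digest(), "big")

print(md5_token(b"foo1"))  # a number in [0, 2**128)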
If using the SimpleStrategy it will:
  1. Order the nodes in the ring by their initial_token.
  2. Select the node whose token range includes the token as the first replica.
  3. Select the next RF-1 nodes in the ordered ring as the remaining replicas.
The nodes loop around, so a row may be replicated on the last 2 nodes and the first 1, as sketched below.
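A toy sketch of that selection, assuming a ring of (initial_token, node) pairs (the representation is mine, not Cassandra's internals):

from bisect import bisect_left

def simple_strategy_replicas(ring, token, rf):
    # The first replica is the node whose token range includes the token;
    # the remaining RF-1 replicas are the next nodes walking the ring,
    # wrapping past the last node back to the first.
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

ring = [(0, "A"), (42, "B"), (85, "C"), (127, "D")]
print(simple_strategy_replicas(ring, 100, 3))  # ['D', 'A', 'B']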
The NetworkTopologyStrategy uses a more involved algorithm that considers the Data Centre and Rack each node is assigned to by the Snitch. There is a good discussion of the process from Peter Schuller.

The weird Consistency Level

There is one Consistency Level that does not behave like the others, so let's just get it out of the way first. CL ANY for write requests allows the coordinator node to store the mutation in the form of a Hinted Handoff on any node in the cluster, which in practice means on the coordinator itself.
This is useful in cases where extreme write uptime is needed. The sort of extreme where the write cannot be reliably read until a nodetool repair operation has been completed. Hinted Handoffs must be delivered to Natural Endpoints before they can be included in a read operation.
If you write at CL ANY and some of the Natural Endpoints are up, the write and the Hints will be sent to them. The coordinator will only be used to store the Hints in cases where all the Natural Endpoints are down.
For more information on Hinted Handoff see Jonathan’s recent post.

Consistency Levels

For all the other Consistency Levels the read or write request is directed to one of the Natural Endpoints. Hinted Handoffs may be used as part of the request but are not considered when determining if the cluster is available for the request.
The named Consistency Levels ONE, TWO and THREE are pretty easy to understand. One, two or three replicas for the key must be seen as UP before the operation will start. CL ONE is the most often used of these.
QUORUM is calculated as floor(RF / 2) + 1 (sketched in code after the list below). This is the most used CL and in my opinion should be the starting CL for all applications until a reason is found to change (performance is not a reason).
For RF levels below 3 the QUORUM is the same as the RF level so:
  • RF 1 - QUORUM = 1
  • RF 2 - QUORUM = 2
  • RF 3 - QUORUM = 2
  • RF 4 - QUORUM = 3
  • RF 5 - QUORUM = 3
  • RF 6 - QUORUM = 4
  • RF 7 - QUORUM = 4
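The quorum calculation, as a quick sketch reproducing the table above:

def quorum(rf):
    # QUORUM = floor(RF / 2) + 1
    return rf // 2 + 1

for rf in range(1, 8):
    print("RF %s - QUORUM = %s" % (rf, quorum(rf)))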
When the NetworkTopologyStrategy is used each data centre has its own RF, and the standard QUORUM is calculated using the total RF for the cluster.
LOCAL_QUORUM and EACH_QUORUM can be used with the NetworkTopologyStrategy and they instruct the coordinator to also consider the Data Centre the nodes are located in. A write is always sent to all UP replicas; the Consistency Level controls how many of them the coordinator must block for.
For LOCAL_QUORUM only the RF of the local data centre is considered when:
  • Calculating how many nodes to block for.
  • Checking if enough nodes are UP for the request.
  • Counting if CL nodes have responded to the request.
EACH_QUORUM works in a similar way but the tests apply to every DC in the cluster.
ALL requires that all replicas for the row be UP before the request will start.

A failing Range

A simple, but incomplete, way to think about the cluster being available is to focus on one key range and its replicas.
Consider RF 3, at QUORUM if one node is lost the range will still be available. If two nodes are lost the range will not be available for QUORUM operations, but will still be available for ONE and ANY requests.
For any number of nodes in the cluster, a range will become unavailable if more than (RF - CL) nodes are DOWN.
Now consider a cluster with 50 nodes, RF 5, QUORUM operations and the SimpleStrategy. If 3 adjacent nodes go down the range assigned to the first one will no longer be available, as there will only be 2 replicas UP. The cluster will be down for 2% of the possible keys; not terrible, but it's no longer up for 100%. The range assigned to the second down node will have 3 UP replicas and the range assigned to the third will have 4 UP replicas. The nodes do not have to be adjacent in the ring for this to occur, it could be any nodes in the replica set for the range. It's just easier to think about when they are adjacent.
Spreading replicas for a key across nodes with different physical infrastructure is a good way to mitigate this risk. The NetworkTopologyStrategy distributes the replicas for a DC across the available Racks, as defined by either the RackInferringSnitch, the PropertyFileSnitch, or the EC2Snitch which uses AWS Availability Zones as racks. The SimpleSnitch puts all nodes into rack1 in datacenter1.
In the worst case failure scenario the cluster can sustain up to (RF - CL) failures and still remain available for 100% of the keys.

A failing Cluster

The best case scenario for failure is when the node failures are evenly divided amongst the replicas for a range, so that every RF number of failures only removes one node from the available replica set for each range. To know how many nodes we can lose for a Consistency Level to still be available for 100% of the keys, multiply floor(number of nodes / RF) by (RF - CL).
For a 5 node cluster with RF 3 at QUORUM this is (5 / 3) * (3 - 2) or 1. For other cluster sizes the number is:
def max_failure(num_nodes, rf, block_for):
    # Best case: failures are spread evenly, so every floor(n / rf) failures
    # remove only one replica from each range's replica set.
    print("For up to %s nodes with RF %s and blocking for %s nodes..."
          % (num_nodes, rf, block_for))
    print("Number Nodes / max_failure")
    for n in range(1, num_nodes + 1):
        print("%s / %s" % (n, (n // rf) * (rf - block_for)))

max_failure(10, 3, 2)

For up to 10 nodes with RF 3 and blocking for 2 nodes...
Number Nodes / max_failure
1 / 0
2 / 0
3 / 1
4 / 1
5 / 1
6 / 2
7 / 2
8 / 2
9 / 3
10 / 3
In the best case failure scenario the cluster can sustain up to floor(number of nodes / RF) * (RF - CL) failed nodes and still remain up for 100% of the keys.

The view from one node

Considering which nodes are UP or DOWN is always from the perspective of the coordinator node. Network Partitions also play a part in deciding if the cluster is available, as the nodes must be both running and contactable by the coordinator.
At the small scale, if a client connects to a node in the cluster that has lost connectivity to other nodes in the cluster it will consider them all DOWN and be unavailable for all QUORUM requests. The client will receive an UnavailableException and should connect to another node and retry the request. Other nodes in the cluster may be in a bigger partition that contains enough UP replicas for the request to complete.
At a bigger scale, when using Amazon AWS it's more likely that nodes in an Availability Zone will lose connectivity with nodes from a different AZ than with nodes in the same AZ.
With two AZs, operations at QUORUM will require nodes in both AZs as neither will hold QUORUM replicas. So a network partition between the two would result in 100% of the keys being down in both AZs.
With three AZs each AZ will hold one third of the replicas, and any two together may provide enough UP replicas to support QUORUM operations. The cluster could sustain a network partition so long as each AZ can talk to at least one other AZ.

===============================================================

How are Memtables measured?

Updated: I've added information on the new memtable_total_space_in_mb setting in version 0.8 and improved the information about memtable_throughput. Thanks for the feedback.
In version 0.7 of Cassandra there are 3 configuration settings that can trigger flushing a memtable to disk. Version 0.8 adds support for a global memtable_total_space_in_mb which may replace the previous 3 settings.
First the 0.7 settings.

memtable_flush_after (minutes)

This is the maximum number of minutes a memtable should stay in memory if it has received writes. When the memtable is created the current time is recorded, and it is then checked every 10 seconds. If after the time span either the primary memtable for the Column Family or any secondary index memtables have received writes, they are replaced and flushed to disk.
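A sketch of the age check (the function and field names are mine; only the 10 second cadence and the received-writes condition come from the text):

import time

def should_flush_for_age(created_at, has_writes, flush_after_mins):
    # Checked roughly every 10 seconds: a memtable that has received
    # writes and outlived its threshold is replaced and flushed to disk.
    return has_writes and time.time() - created_at >= flush_after_mins * 60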
Unfortunately as the wiki points out, there is a good reason to make this value small and a good reason to make it large.
A log file cannot be deleted until all of the segments / records it contains have been marked as completed. This happens when a memtable is flushed to disk. Because the log file is shared by all Column Families, one Column Family that has long living memtables can prevent log files from being deleted and their disk space freed up.
However, smaller values can cause multiple memtables to expire at the same time; prior to version 0.7 this could cause flush requests to block. Version 0.7 added memtable_flush_writers and memtable_flush_queue_size, but it can still slow down the IO system (see conf/cassandra.yaml for more info). The best approach is to tune the other memtable thresholds to trigger when you want them, and leave this setting as a backup.
The wiki recommends a value of 1440 minutes, or 24 hours, which is also the default.

memtable_operations (millions)

The Memtable tracks the number of operations applied to it by:
  • counting the number of top level columns (i.e. Columns for a Standard CF or Super Columns for a Super CF) in a mutation.
  • considering a row level deletion as a single operation.
Deletions and Insertions are both considered a mutation; inserting 3 columns increases the count by 3 and deleting 3 named columns increases the count by 3. Note that the number of sub columns in a Super Column is ignored. So inserting 10 sub columns into 1 super column increases the count by one, and deleting a super column by name that has 10 sub columns increases the count by one.
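The counting rule sketched in code (the mutation shape here is hypothetical; only the rule is from the post):

def operation_count(row_deleted, top_level_columns):
    # A row level deletion is one operation; otherwise each top level column
    # counts once, and sub columns inside a super column are ignored.
    return 1 if row_deleted else len(top_level_columns)

print(operation_count(False, ["c1", "c2", "c3"]))         # insert 3 columns -> 3
print(operation_count(False, ["super_with_10_subcols"]))  # 1 super column -> 1
print(operation_count(True, []))                          # row delete -> 1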
The operations threshold (and the size threshold) is checked before applying a mutation and the flush is not requested until after the mutation has completed. So it’s possible for the memtable to contain more than memtable_operations when it is flushed to disk.
A different way to think about this setting, and memtable_throughput, is as sstable_min_operations and sstable_min_bytes. In general operation, new sstables are created after at least sstable_min_operations operations have occurred or at most sstable_min_bytes bytes have been written.
If no value is provided when the CF is created it is set to the default memtable throughput in MB (below) / 64 * 0.3, so it’s 300k ops per 64MB of throughput. If you have a CF that contains many small columns it’s a good idea to look at the log entries for memtable flushes to see if the ops threshold is triggering early and causing small memtables to be frequently written.
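The default works out as below; note that 62MB of throughput yields the 0.2906... figure that appears in the cli output later in this post:

def default_memtable_operations(throughput_mb):
    # Default threshold in millions of operations:
    # throughput / 64 * 0.3, i.e. 300k ops per 64MB of throughput.
    return throughput_mb / 64 * 0.3

print(default_memtable_operations(64))  # 0.3 (million)
print(default_memtable_operations(62))  # 0.290625 (million)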

memtable_throughput

Throughput for the memtable is tracked and tested at the same time as the operation count. But counting the byte size of the data is more involved and depends on the type of the column. The size of the data when serialised to disk is counted as follows:
  • standard column byte size is
    • length of the name byte array plus 2 bytes to store the length
    • 1 byte to indicate if the column has been deleted
    • 8 bytes for the timestamp
    • length of the value byte array plus 4 bytes to store the length
  • expiring columns (those with a TTL) add another 8 bytes to the length of a standard column.
  • deleted columns (tombstones) are the same as standard columns but the value is always 4 bytes long.
  • counter columns (in Cassandra v0.8) add another 8 bytes to the length of a standard column. Note that for a counter column the value will always be an 8 byte long.
  • super columns sum the size of all contained columns and then add
    • length of the name byte array plus 2 bytes to store the length
    • 4 bytes to indicate when it was deleted
    • 8 bytes to store the timestamp for the deletion
    • 4 bytes to store the number of sub columns
(Currently the calculation for the super column only includes the sum of the sub columns. I think this needs to be changed.)
A row deletion will add zero bytes to the throughput counter.
The byte size of the mutation is always added to the counter; if one mutation replaces columns in the memtable their byte size is not subtracted from the counter.
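A sketch of the accounting for a single standard column, following the list above (function names are mine):

def standard_column_size(name: bytes, value: bytes) -> int:
    # 2-byte length prefix + name, 1-byte deletion flag, 8-byte timestamp,
    # 4-byte length prefix + value.
    return 2 + len(name) + 1 + 8 + 4 + len(value)

def expiring_column_size(name: bytes, value: bytes) -> int:
    # A column with a TTL adds another 8 bytes to the standard size.
    return standard_column_size(name, value) + 8

print(standard_column_size(b"bar", b"baz1"))  # 22 bytes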
Getting this setting wrong is a very easy way to run out of memory. From version 0.7 onwards the worst case scenario is up to CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory in the JVM at once. It's best to be conservative, follow the wiki advice and consider that the JVM may take up to 10 times as much memory as it takes to serialise the data to disk.
And that’s the problem with this threshold. It’s not measuring how much memory a memtable is using in the JVM Heap, it’s measuring the maximum amount of bytes it could take to serialise the data (excluding the index and bloom filter) to disk. Which makes it a difficult knob to use when tuning how much memory Cassandra uses.
If a value is not provided when the Column Family is created it will default to 1/16th the maximum size of the JVM Heap at the time. This value is stored with the Column Family meta data and will not change again. Typical values are around 128MB to 256MB.

memtable_total_space_in_mb

Version 0.8 adds the per node memtable_total_space_in_mb setting which makes life easier and may eventually replace the 3 previous settings. While it’s fun to play with the per CF settings, it can also be a pain when building real systems that need to stay up.
If no value is set in conf/cassandra.yaml the setting will default to one third of the JVM max Heap size. If it is set to zero the setting is disabled and only the old per CF thresholds will be used. If the global setting is enabled and there are per CF settings both of them will be used.
There are two parts to the global memtable size, measuring the real memory usage of the memtable and flushing. First the measuring.
Rather than track every byte allocated, the server periodically works out the ratio between the throughput as measured above and the real in memory bytes as measured by the JVM. The in memory byte count is worked out using the Instrumentation Java Package and code from Jonathan Ellis. After a mutation has been applied to the memtable, but before a flush is requested, Cassandra calculates the "Live Ratio" if more than twice as many operations (as calculated above) have been processed since the last time it was calculated.
Measuring the Live Ratio is done asynchronously and involves measuring the real memory size of all the keys, super columns and columns in the memtable and dividing it by the throughput as measured above. For sanity the ratio is clamped between 1.0 and 64.0; if the value is outside of this range a WARN level log message will let you know. Finally the ratio for the Column Family is updated to the new ratio if and only if the new ratio is higher than the previous one. An INFO level message will let you know when the ratio is calculated, how long it took and if it changed.
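A sketch of the ratio update (names are mine; the clamp and the upwards-only update come from the text):

def updated_live_ratio(deep_size_bytes, throughput_bytes, current_ratio):
    # Live Ratio = real in-memory size / serialised throughput, clamped
    # to [1.0, 64.0]; the stored ratio is only ever ratcheted upwards.
    ratio = min(max(deep_size_bytes / throughput_bytes, 1.0), 64.0)
    return max(current_ratio, ratio)

print(updated_live_ratio(140_000_000, 14_000_000, 1.0))  # 10.0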
Next the MeteredFlusher runs every second and uses a two phase approach to keeping the live memory use under the setting. First it looks at the total live bytes for each Column Family, including its secondary indexes, and flushes CF's that could potentially fill the memory if allowed to create memtables of this size. Live bytes are calculated by multiplying the throughput as previously measured by the Live Ratio. The Flusher considers the Column Family to be using too much memory if its current live size is more than memtable_total_space_in_mb divided by the maximum number of memtables the Column Family could have in memory. The calculation for this is similar to the one presented above for memtable_throughput but it includes secondary indexes and a fudge factor that takes into account how the live size is measured.
For example if memtable_total_space_in_mb is 100MB, and memtable_flush_writers is the default 1 (with one data directory), and memtable_flush_queue_size is the default 4, and a Column Family has no secondary indexes, the CF will not be allowed to get above one seventh of 100MB, or 14MB; if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB. At a more sensible 2GB for memtable_total_space_in_mb (1/3 of a 6GB JVM Heap) the CF will be flushed if it is using 292MB of live memory.
(I've skipped a couple of things here such as considering the bytes currently being flushed.)
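Putting the worked example into code (the exact breakdown of the divisor is my assumption, chosen so the defaults give the post's figure of 7 memtables):

def cf_flush_threshold_mb(total_space_mb, flush_writers=1, flush_queue_size=4):
    # One live memtable, the flush queue, the active writers, plus one
    # for slop: 1 + 4 + 1 + 1 = 7 memtables with the defaults.
    max_in_flight = 1 + flush_queue_size + flush_writers + 1
    return total_space_mb / max_in_flight

print(cf_flush_threshold_mb(100))   # ~14 MB
print(cf_flush_threshold_mb(2048))  # ~292 MB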
The flusher process will end there if the number of bytes that were flushing when it started plus the bytes for all the CF’s that were not flushed in the first phase is less than memtable_total_space_in_mb.
The second phase flushes the CF's in order of largest to smallest until the total live size (including the bytes currently being flushed) gets down below the target setting.
This new setting (and the existing flush_largest_memtables_at) should make it harder to shoot yourself in the foot with memory management and easier for new users to feel comfortable with the server.

In Motion

You can check the per CF thresholds as well as the current tracked values for a memtable using bin/nodetool, bin/cassandra-cli or JConsole. I'm not aware of any current features to check the Live Ratio or Live Size of a CF.
bin/nodetool cfstats can tell you the current operation count (‘Memtable Columns Count’) and throughput (‘Memtable Data Size’):
$ ./bin/nodetool -h localhost cfstats
Keyspace: dev
    Read Count: 1
    Read Latency: 0.897 ms.
    Write Count: 2
    Write Latency: 0.051 ms.
    Pending Tasks: 0
        Column Family: data
        SSTable count: 2
        Space used (live): 9530
        Space used (total): 9530
        Memtable Columns Count: 1
        Memtable Data Size: 26
        Memtable Switch Count: 1
        Read Count: 1
        Read Latency: 0.897 ms.
        Write Count: 2
        Write Latency: 0.020 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 2
        Key cache hit rate: 0.0
        Row cache: disabled
        Compacted row minimum size: 51
        Compacted row maximum size: 86
        Compacted row mean size: 73
‘bin/cassandra-cli’ can tell you the current thresholds using either describe keyspace or show keyspaces.
[default@dev] describe keyspace;
Keyspace: dev:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [datacenter1:2]
  Column Families:
    ColumnFamily: data
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 0.29062499999999997/62/1440 (millions of ops/MB/minutes)
Or using JConsole connect to the server, select MBeans and then navigate to org.apache.cassandra.db.ColumnFamilies.<your-keyspace>.<your-column-family>. There you can find the current thresholds:
  • MemtableFlushAfterMins
  • MemtableOperationsInMillions
  • MemtableThroughputInMB
And the running values:
  • MemtableColumnCount
  • MemtableDataSize
=======================================================================

The forces of Write and Read

Requests that write and read data in Cassandra, like any database, have competing characteristics that need to be balanced. This post compares the approach taken by Cassandra to traditional Relational Database Systems.
When writing data the most efficient approach is to splat it somewhere on disk with the minimum of fuss. Preferably at the end of a file, or even in a new file. The write is going to take longer if the data is carefully placed in the correct ordered position in an existing file.
The problem is that reading data is going to be much more efficient if the data is in the correct ordered location, and preferably in just one location. If the data is unordered and spread out in a file, or multiple files, the read is going to take longer.
Traditional Relational databases such as SQL Server are optimised for read requests, so write requests ensure that data is written to the correct ordered location. However the transaction log / write ahead log lets the database delay writing the data to the correct location on disk, providing a handy performance boost while still supporting ACID transactions.

Roughly speaking Write requests

(very) Roughly speaking a write request in a RDBMS follows this path:
  1. Append the write request to the end of the transaction log.
  2. Locate the rows modified by the request or those adjacent to new rows, by reading index, data or other pages from disk if not already in memory.
  3. Modify the index and data pages in memory to insert or update rows.
  4. Acknowledge the write to the client.
  5. Checkpoint the modified pages by flushing them to disk in the correct ordered location, and mark the log records associated with the changes as no longer needed. The Checkpoint process is typically executed asynchronously (SQL Server, MySQL).
(I think in reality the log record is written once a statement completes inserting/updating an individual row. Also a record needs to be written to say if the transaction was committed or aborted, so that during log file recovery the system knows if the transaction should be replayed or ignored. I'm also ignoring all the locking.)
The write ahead log lets the write request dump the data somewhere on disk fast, then work on structures which are hopefully in memory. For updates care is taken to check for existing data, which is then read and mutated in memory. The read request ignores the transaction log; it reads the single source of truth from the correctly ordered index and data pages that are hopefully still in memory.
If the RDBMS uses Multi Version Concurrency Control life is a bit more difficult for the read request, so lets ignore that for now.
On the other hand Cassandra is optimised for write requests, so read requests need to do more work to ensure they have the correct data.
Again roughly speaking a write request for Cassandra follows the path:
  1. Append the write to the end of the write ahead log.
  2. Modify an in memory structure called a Memtable to store the row changes present in the request.
  3. Queue the Memtable to be flushed to disk (in the background) if per Memtable thresholds on operation count, data size or time are violated. A new empty Memtable is immediately put in its place.
  4. Acknowledge the write to the client.
  5. Flush the Memtable to disk by creating a new SSTable (Sorted String Table) on disk, and mark the log records as no longer needed. Typically the flush task is executed asynchronously.
(This is the write process on a single node, I’m ignoring how the write is applied to the cluster and other features required for ‘Eventual Consistency’.)
Again the write ahead log lets the write process dump the data somewhere on disk fast, and then work on structures which are guaranteed to be in memory. No effort is taken to read the existing data; if a row was updated it may now partially exist in memory and in multiple SSTables on disk.
It's probably important to point out that the RDBMS Checkpoint process works with an extent which defines its on disk allocation unit. SQL Server uses extents that contain eight 8KB pages. MySQL uses a default EXTENT_SIZE of 1MB for the extent, and the InnoDB engine has a hard coded 16KB page size. Once one extent has been written it may be necessary to seek to another random part of the disk to write the next extent. Though I'm sure they do all sorts of clever things to make it nice and fast, this is the basic unit they deal with.
In Cassandra the maximum Memtable size is configured per Column Family (sort of like a table but I dislike that analogy). Memory thresholds of 128MB, 256MB or higher are common, though Memtables may flush at lower sizes due to violating other thresholds or in response to certain system events. The Flush task is able to write the entire contents of the Memtable to disk as a new file without having to perform random IO seeks.

Dear Readers

In Cassandra read requests potentially have a lot more to do than those in a traditional RDBMS. A typical read request in a RDBMS to get a row by Primary Key seeks through the nicely ordered b-tree index pages to find the right data page, scans the page to find the appropriate row and reads the complete row (ignoring off page large data). Reading the index and data pages may require disk access, and a query cache may provide direct access to the results of a previous query without needing to negotiate the index.
Cassandra has two caches that can expedite a read request, a key cache (discussed below) and a row cache. The row cache contains all the current data for the row, and a query that needs to read any column from the row can be fully resolved by reading the cache.
If the requested row cannot be served from the row cache it must be read from disk, however the write path may have left fragments of the row in various SSTables. The read request follows the path:
  1. Check the in memory Bloom Filter for each SSTable on disk, to build a list of candidate SSTables where the row may exist (see the sketch after this list).
  2. If enabled probe the key cache for each candidate SSTable from step 1 to get the position of the row in the data file. The key cache may hold only a sample of keys from the file so misses are possible.
  3. If the key cache missed, probe the SSTable index summary held in memory which contains a regular sampling of the keys (and their positions) in the SSTable index file to find the preceding (sampled) key. Seek to the position of the preceding sampled key in the SSTable index file, and then scan the index file until the requested key and its data file position is found.
  4. Seek to the row position in the SSTable data file and then scan the row data to read the columns that match the read request.
  5. Reduce (potentially) multiple values for each column provided by each SSTable and the current Memtable to a single value by considering column timestamps, Time To Live settings and deletes.
  6. Return the result to the client.
(This is the read path on a single node, I’m ignoring how the read is processed by the cluster, features required for ‘Eventual Consistency’, and have not mentioned that memory mapped file access is used by default on 64 bit platforms.)
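Step 1 of the read path could be sketched like this (the SSTable class is a stand-in; a real Bloom Filter can return false positives but never false negatives):

class SSTable:
    def __init__(self, name, keys):
        self.name = name
        self._keys = set(keys)  # stand-in for the real bloom filter bits

    def might_contain(self, row_key):
        # A bloom filter probe: "maybe here" or "definitely not here".
        return row_key in self._keys

def candidate_sstables(sstables, row_key):
    # Keep only the SSTables that might hold the row; each candidate
    # must then be read and its row fragments merged.
    return [s.name for s in sstables if s.might_contain(row_key)]

tables = [SSTable("OneOp-f-1", ["foo1"]), SSTable("OneOp-f-2", ["foo2"])]
print(candidate_sstables(tables, "foo1"))  # ['OneOp-f-1']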
The read request must piece together the row from potentially many files, which may each involve random IO. Each SSTable which contains any columns for a row must be read to determine if it contains columns that match the request criteria. Hot rows may be present in memory in the form of Cassandra key and row caches or OS caches.
Rows which receive many writes (including overwriting the same column) over time have the potential to perform worse than those that are written to within a short time span. Multiple writes to the same row and column are reconciled in memory by the Memtable so that only the current values for a row are written to disk in the SSTable. The Compaction Process provides an opposing force to the propensity of write requests to spread data out over multiple files.
With all this going on it's still possible to get millisecond or better read performance. The important thing to note is that it's possible to have poorly performing reads due to spread out rows.

Compacting for a sustainable future

The Compaction Process is there to help the read requests by reducing the number of SSTables they may have to visit. Compaction is also responsible for finalising Distributed Deletes and Time To Live, but I will ignore those for now.
Major Compactions are triggered manually via the nodetool utility and compact all SSTables for a Column Family into one file. Owing to the way compaction chooses which files to process this may result in the new file not being compacted for a very long time. As a result Major compactions are no longer recommended as Minor compactions can do the same things. I've mentioned them here just for completeness.
Cassandra checks to see if a Minor compaction is needed whenever an SSTable is written to disk. The process groups the files into buckets where every file in a bucket is within 50% of the average size of files in the bucket; small files (less than 50MB) are put in the first bucket. If the bucket contains more than the Column Family defined min_compaction_threshold files the compaction process will compact up to max_compaction_threshold files together.
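A sketch of that bucketing (the 50% band, the 50MB small-file bucket and the two thresholds come from the text; the rest is illustrative):

def compaction_buckets(sstable_sizes_mb, min_threshold=4, max_threshold=32):
    # Group files so each member is within 50% of its bucket's average;
    # files under 50MB all share the first bucket. Buckets with at least
    # min_threshold files are candidates, compacting at most
    # max_threshold files together.
    small, buckets = [], []
    for size in sorted(sstable_sizes_mb):
        if size < 50:
            small.append(size)
            continue
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if 0.5 * avg <= size <= 1.5 * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    buckets = ([small] if small else []) + buckets
    return [b[:max_threshold] for b in buckets if len(b) >= min_threshold]

print(compaction_buckets([0.1, 0.1, 0.1, 0.1, 500]))  # [[0.1, 0.1, 0.1, 0.1]]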
During compaction the row fragments from the SSTables are reconciled to create a single row. When merging columns the one with the highest timestamp is used, if the timestamps are equal the column value is used as a tie breaker.
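The merge rule for a single column as a sketch (the (name, value, timestamp) tuple layout is mine):

def reconcile(col_a, col_b):
    # Highest timestamp wins; on a tie the column value breaks it.
    if col_a[2] != col_b[2]:
        return col_a if col_a[2] > col_b[2] else col_b
    return col_a if col_a[1] >= col_b[1] else col_b

print(reconcile(("bar", "baz1", 100), ("bar", "baz2", 200)))
# -> ('bar', 'baz2', 200)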
The result is a single SSTable that contains an aggregated view of the row that was present in the multiple input SSTables, and a shorter path for any read request that previously needed to potentially read from several SSTables.

In motion

It’s reasonably easy to see this process happening in slow motion by constructing a contrived schema in Cassandra and playing with some of the command line tools.
This schema creates two Column Families. The OneOp CF sets the Memtable to flush to disk after just one operation is written, and FiveOp after 5 operations (memtable_operations is expressed in millions of operations). The min_compaction_threshold tells compaction to start after we see 4 files. The script then inserts some sample data, all against the same row and the same columns to demonstrate overwrites.
Note that the Memtable thresholds are checked before the operation starts, but it is not flushed to disk until after the operation completes. For writes it counts each column in the request as an operation. In our case it means the SSTables will be created after 2 or 6 operations rather than the expected 1 or 5.
Start the Cassandra CLI and paste the script below into it.
create keyspace ReadWrite;

use ReadWrite;

create column family OneOp 
  with memtable_operations = 0.000001
  and min_compaction_threshold = 4;
create column family FiveOp 
  with memtable_operations = 0.000005
  and min_compaction_threshold = 4;
  
set OneOp['foo1']['bar'] = 'baz1';
set OneOp['foo1']['bar'] = 'baz2';
set OneOp['foo1']['bar'] = 'baz3';
set OneOp['foo1']['bar'] = 'baz3';
set OneOp['foo1']['bar'] = 'baz1';
set OneOp['foo1']['bar'] = 'baz1';


set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
set FiveOp['foo1']['bar'] = 'baz';
The data directory for the ReadWrite keyspace should now look like this:
/var/lib/cassandra/data/ReadWrite:
total 200
drwxr-xr-x  22 aaron  wheel   748B 27 Apr 10:42 .
drwxr-xr-x   4 aaron  wheel   136B 27 Apr 10:42 ..
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-3-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-3-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-3-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-3-Statistics.db
There are 3 SSTables for OneOp and 2 for FiveOp, and all of them contain a fragment of the ‘foo1’ row.
The nodetool cfhistograms utility shows recent statistics for requests against a Column Family, where recent means “since the last time it was run”. The same statistics are also available via the org.apache.cassandra.db.ColumnFamilies.<keyspace>.<column_family> MBean in JConsole.
Clear the stats for both Column Families on the command line.
$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp
Now execute two reads via the cli.
[default@ReadWrite] get OneOp['foo1'];
=> (column=626172, value=62617a31, timestamp=1303875486224000)
Returned 1 results.
[default@ReadWrite] get FiveOp['foo1'];
=> (column=626172, value=62617a, timestamp=1303875486248000)
Returned 1 results.
[default@ReadWrite] 
Check the stats for the OneOp Column Family.
$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 3
2                  0                 0                 0                 0                 0
3                  1                 0                 0                 0                 0
4                  0                 0                 0                 0                 0
The histogram is saying that 1 request used 3 SSTables, and there are an estimated 3 columns in all the SSTables the Column Family is tracking. The Column Family should be tracking 3 SSTables, and each should have only one column for the ‘foo1’ row. The insert script overwrote the same column and the Memtable absorbed one overwrite before being flushed to disk.
The stats for the FiveOp Column Family also make sense.
$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp
ReadWrite/FiveOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 2
2                  1                 0                 0                 0                 0
There are only 2 SSTables for the Column Family, both contain a fragment of the ‘foo1’ row and both contain only one column.
A minor compaction on the OneOp Column Family can be triggered by adding just two more wafer thin rows via the cli.
[default@ReadWrite] set OneOp['foo1']['bar'] = 'baz1';
Value inserted.
[default@ReadWrite] set OneOp['foo1']['bar'] = 'baz2';
Value inserted.
[default@ReadWrite]
This should trigger a minor compaction as there are now 4 SSTables of similar size. The data directory should now look something like this.
$ ls -l /var/lib/cassandra/data/ReadWrite/
total 280
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-1-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-2-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-3-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-3-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-3-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-3-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-3-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-4-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-4-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 16:16 OneOp-f-4-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-4-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-4-Statistics.db
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-5-Data.db
-rw-r--r--  1 aaron  wheel  1936 27 Apr 16:16 OneOp-f-5-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-5-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-5-Statistics.db
The SSTables for OneOp numbered 1 through 4 have been compacted; this is tracked in the server and a compacted marker file (e.g. “OneOp-f-4-Compacted”) is written to disk to preserve the information across system restarts. The unused files will be physically deleted during JVM Garbage Collection. If Cassandra detects it is low on disk space when about to write data to disk, it will trigger GC in an effort to reclaim unused space.
There is also a new SSTable OneOp-f-5-Data.db that contains the single reconciled row from the other 4 SSTables. A read request against that row should now only use one SSTable:
[default@ReadWrite] get OneOp['foo1'];
=> (column=626172, value=62617a32, timestamp=1303877811057000)
Returned 1 results.
[default@ReadWrite] 
Check the stats.
$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  1                 0                 0                 0                 1
2                  0                 0                 0                 0                 0
3                  0                 0                 0                 0                 0
4                  0                 0                 0                 0                 0
The request used one SSTable and the Column Family has only 1 column in all the SSTables it is tracking.
===========================================================================================


Cassandra Authentication and Create User

By default when we install Cassandra on a machine it does not have any username and password a...