Wednesday, 21 March 2018

Cassandra Authentication and Create User

By default, a fresh Cassandra installation has no username and password authentication, so anyone can open the Cassandra console with a plain cqlsh command. This post explains how to enable authentication in Cassandra and how to create users with different access permissions on keyspaces (databases).

Basic Key Points :

cassandra.yaml file location:
- Cassandra Packaged installs: /etc/cassandra/conf
- Cassandra Binary installs (tarball): <install_location>/conf
- DataStax Enterprise Packaged installs: /etc/dse/cassandra
- DataStax Enterprise Binary installs (tarball): <install_location>/resources/cassandra/conf
OpsCenter : A tool to create and manage cassandra clusters.

Steps to Enable Authentication :

Edit the cassandra.yaml file on every server in the Cassandra cluster and change the following properties
- authorizer : CassandraAuthorizer
- authenticator : PasswordAuthenticator
Now restart the Cassandra servers:
- For packaged installs (installed as a service), run the command : sudo service cassandra restart
- For binary installs (installed manually), run the following commands
  - ps -eaf | grep cassandra (note the pid of the Cassandra process)
  - kill -9 <pid>
  - <install_location>/bin/cassandra
If OpsCenter is installed on the machine, you can change the above config properties from the OpsCenter UI instead.
- Open the OpsCenter UI in a browser at <serverIp>:8888
- In the upper right corner, click the Cluster Actions dropdown and then click Configure
- Change the above properties in the configuration popup and click Save Configuration at the bottom.
- Restart the cluster by clicking Restart in the same upper right dropdown menu.
Ahoy, you are done.
Now it is time to access the Cassandra console using the default username and password.
Run the command : cqlsh <localhost or server Ip> -u cassandra -p cassandra
Now you can create users:
- CREATE USER 'user_name' WITH PASSWORD '<password>';
- CREATE USER 'user_name' WITH PASSWORD '<password>' SUPERUSER; <to create a superuser>
- Now you can grant access to individual keyspaces to specific users:
  - GRANT SELECT ON KEYSPACE <keyspace_name> TO '<user_name>'; <gives read-only access to the user>
  - GRANT ALL ON KEYSPACE <keyspace_name> TO '<user_name>'; <gives full access to the user>
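
If you need to create several users, the statements above can be generated from a small script. This is only a sketch: the helper names (cql_string, create_user, grant) are made up for illustration, and it only builds the statement text, which you would then paste into cqlsh. It assumes that single quotes inside a CQL string literal are escaped by doubling them.

```python
import re

# Hypothetical helpers; only the statement shapes come from the steps above.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
_PERMISSIONS = {"ALL", "ALTER", "AUTHORIZE", "CREATE", "DROP", "MODIFY", "SELECT"}


def cql_string(value):
    """Quote a value as a CQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_user(user, password, superuser=False):
    """Build a CREATE USER statement, optionally with SUPERUSER."""
    suffix = " SUPERUSER" if superuser else ""
    return "CREATE USER %s WITH PASSWORD %s%s;" % (
        cql_string(user), cql_string(password), suffix)


def grant(permission, keyspace, user):
    """Build a GRANT statement for one keyspace (for example SELECT or ALL)."""
    permission = permission.upper()
    if permission not in _PERMISSIONS:
        raise ValueError("unknown permission: %r" % permission)
    if not _IDENTIFIER.match(keyspace):
        raise ValueError("unexpected keyspace name: %r" % keyspace)
    return "GRANT %s ON KEYSPACE %s TO %s;" % (
        permission, keyspace, cql_string(user))


print(create_user("report_reader", "s3cret"))
print(grant("select", "dev", "report_reader"))
```

The keyspace name is validated rather than quoted so that a typo fails in the script instead of in cqlsh.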

Conclusion :

So this is the way to enable authentication in Cassandra and to create users with different permissions.

Thursday, 15 February 2018

Cassandra Study Links ..

Cassandra Study Links:
======================

https://pandaforme.gitbooks.io/introduction-to-cassandra/content/cap_theorem.html

https://www.edureka.co/blog/introduction-to-cassandra-architecture/

https://teddyma.gitbooks.io/learncassandra/content/model/internal_data_structure.html

https://www.slideshare.net/narsiman/cassandra-internals-43964918

http://file.allitebooks.com/20150811/Mastering%20Apache%20Cassandra.pdf

https://github.com/sjtuhjh/appdocs/blob/master/BooksAndPapers/Cassandra/Cassandra%20High%20Performance%20Cookbook.pdf

https://www.tutorialspoint.com/cassandra/index.htm

http://cassandra.apache.org/doc/latest/getting_started/index.html

https://teddyma.gitbooks.io/learncassandra/content/index.html

https://www.youtube.com/results?search_query=cassandra+tutorial

http://file.allitebooks.com/20171211/Expert%20Apache%20Cassandra%20Administration.pdf

https://www.slideshare.net/nmilford/cassandra-for-sysadmins

https://docs.genesys.com/Documentation/OS/8.1.4/Cassandra/Welcome?action=pdfbook&title=Documentation:OS:Cassandra:Welcome:8.1.4

http://www.tothenew.com/blog/cassandra-authentication-and-create-user/

https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/

Node Down For Me?

For a read or write request to start in Cassandra at least as many nodes must be seen as UP by the coordinator node as the request has specified via the ConsistencyLevel. Otherwise the client will get an UnavailableException and the cluster will appear down for that request. That may not necessarily mean it is down for all keys or all requests.

The Replication Factor, number of nodes, the Consistency Level and luck all play a part of determining how many nodes can be lost in a Cassandra cluster before it is unavailable for 100% of the keys. Before it reaches that point though the cluster may go through a period of partial failure where some keys will not be available at some CL levels.

The partial failure support baked into the system is a nice thing to have. But most people will be interested in keeping 100% of the keys available at the required Consistency Level. So most of the discussion below talks about keeping the cluster up for 100% of the keys at the QUORUM CL.

Which nodes ?

When it comes to counting the UP nodes for a request we only consider the Natural Endpoints for a key. These are the nodes identified by the placement_strategy (set when the Keyspace was created) as the replicas for a key, and they never change. All read and write operations for a key, using the same partitioner, will select those same endpoints. Otherwise write operations could plonk down data that reads could never find.

The row key is first Decorated by the partitioner (specified in conf/cassandra.yaml) to create the token used to locate the row in the cluster. For example the RandomPartitioner uses an MD5 transform to turn the key into a 128 bit token.
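
As a rough illustration of the MD5 transform, the sketch below hashes a row key and treats the 128 bit digest as a number. Taking the absolute value of the signed number is my understanding of how the RandomPartitioner keeps tokens non-negative, so treat that detail, and the function name random_partitioner_token, as assumptions.

```python
import hashlib


def random_partitioner_token(row_key):
    """Approximate a RandomPartitioner token for a row key given as bytes."""
    digest = hashlib.md5(row_key).digest()  # 16 bytes = 128 bits
    # Assumption: the digest is read as a signed big-endian integer and
    # made non-negative with abs().
    signed = int.from_bytes(digest, "big", signed=True)
    return abs(signed)


token = random_partitioner_token(b"user:42")
print(token)
```

The same key always produces the same token, which is what lets reads find the data that writes stored.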

If using the SimpleStrategy it will:

Order the nodes in the ring by their initial_token.
Select the node whose token range includes the token as the first replica.
Select the next RF-1 nodes in the ordered ring as the remaining replicas.

The nodes loop around, so a row may be replicated on the last 2 nodes and the first 1.
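
The three steps above can be sketched in a few lines. This is a simplified model and not Cassandra's code: the ring is a plain dict of hypothetical node names to small integer tokens, and a node is assumed to own every token up to and including its own initial_token.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring, token, rf):
    """Return the rf node names that hold a token, walking the ring clockwise.

    ring maps node name -> initial_token.
    """
    ordered = sorted(ring.items(), key=lambda item: item[1])
    tokens = [t for _, t in ordered]
    # First node whose token is >= the row token; wrap to the first node
    # when the row token is past the last node in the ring.
    first = bisect_left(tokens, token) % len(ordered)
    count = min(rf, len(ordered))
    return [ordered[(first + i) % len(ordered)][0] for i in range(count)]


ring = {"node1": 0, "node2": 25, "node3": 50, "node4": 75}
print(simple_strategy_replicas(ring, 40, 3))  # last 2 nodes and the first 1
```

With token 40 the first replica is node3, and the walk continues to node4 and wraps around to node1.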

The NetworkTopologyStrategy uses a more involved algorithm that considers the Data Centre and Rack the node is assigned to by the Snitch. There is a good discussion of the process from Peter Schuller.

The weird Consistency Level

There is one Consistency Level that does not behave like the others, so let’s just get it out of the way first. CL ANY for write requests allows the coordinator node to store the mutation in the form of a Hinted Handoff on any node in the cluster, which in practice means on the coordinator itself.

This is useful in cases where extreme write uptime is needed. The sort of extreme where the write cannot be reliably read until a nodetool repair operation has been completed. Hinted Handoffs must be delivered to Natural Endpoints before they can be included in a read operation.

If you write at CL ANY and some of the Natural Endpoints are up, the write and the Hints will be sent to them. The coordinator will only be used to store the Hints in cases where all the Natural Endpoints are down.

For more information on Hinted Handoff see Jonathan’s recent post.

Consistency Levels

For all the other Consistency Levels the read or write request is directed to one of the Natural Endpoints. Hinted Handoffs may be used as part of the request but are not considered when determining if the cluster is available for the request.

The named Consistency Levels ONE, TWO and THREE are pretty easy to understand. One, two or three replicas for the key must be seen as UP before the operation will start. CL ONE is the most often used of these.

QUORUM is calculated as floor(RF / 2) + 1. This is the most used CL and in my opinion should be the starting CL for all applications until a reason is found to change (performance is not a reason).

For RF levels below 3 the QUORUM is the same as the RF level so:

RF 1 - QUORUM = 1
RF 2 - QUORUM = 2
RF 3 - QUORUM = 2
RF 4 - QUORUM = 3
RF 5 - QUORUM = 3
RF 6 - QUORUM = 4
RF 7 - QUORUM = 4
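
The table can be reproduced with integer division. This is a minimal sketch of the calculation and the function name quorum is mine.

```python
def quorum(rf):
    """Number of replicas a QUORUM request blocks for: floor(RF / 2) + 1."""
    return rf // 2 + 1


for rf in range(1, 8):
    print("RF %d - QUORUM = %d" % (rf, quorum(rf)))
```

Note that an even RF does not tolerate more failures than the odd RF below it: RF 4 and RF 3 can each lose only one replica and still reach QUORUM.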

When the NetworkTopologyStrategy is used each data centre has its own RF, and the standard QUORUM is calculated using the total RF for the cluster.

LOCAL_QUORUM and EACH_QUORUM can be used with the NetworkTopologyStrategy and they instruct the coordinator to also consider the Data Centre the nodes are located in. A write is always sent to all UP replicas.

For LOCAL_QUORUM only the RF of the local data centre is considered when:

Calculating how many nodes to block for.
Checking if enough nodes are UP for the request.
Counting if CL nodes have responded to the request.

EACH_QUORUM works in a similar way but the tests apply to every DC in the cluster.

ALL requires that all replicas for the row be UP before the request will start.
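
To make the difference between these levels concrete, here is a sketch that works out how many replicas the coordinator would block for. The function block_for and the "*" key meaning "any data centre" are my own conventions, and the data centre names are placeholders.

```python
def quorum(rf):
    return rf // 2 + 1


def block_for(consistency_level, rf_by_dc, local_dc):
    """Replicas to wait for, keyed by data centre ("*" means any data centre).

    rf_by_dc maps each data centre name to its replication factor.
    """
    total_rf = sum(rf_by_dc.values())
    if consistency_level == "QUORUM":
        return {"*": quorum(total_rf)}  # uses the total RF for the cluster
    if consistency_level == "LOCAL_QUORUM":
        return {local_dc: quorum(rf_by_dc[local_dc])}
    if consistency_level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in rf_by_dc.items()}
    if consistency_level == "ALL":
        return {"*": total_rf}
    raise ValueError("unsupported consistency level: %s" % consistency_level)


rf_by_dc = {"dc1": 3, "dc2": 3}
for cl in ("QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(cl, block_for(cl, rf_by_dc, "dc1"))
```

With RF 3 in each of two data centres, QUORUM needs 4 replicas from anywhere while LOCAL_QUORUM needs only 2 from the local data centre.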

A failing Range

A simple, but incomplete, way to think about the cluster being available is to focus on one key range and its replicas.

Consider RF 3, at QUORUM if one node is lost the range will still be available. If two nodes are lost the range will not be available for QUORUM operations, but will still be available for ONE and ANY requests.

For any number of nodes in the cluster, a range will become unavailable if more than (RF - CL) nodes are DOWN.
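
The rule for a single range is small enough to write down directly. In this sketch block_for stands for the number of replicas the Consistency Level requires (2 for QUORUM at RF 3), and both function names are mine.

```python
def range_available(rf, down_replicas, block_for):
    """True if enough replicas of one range are UP for the request to start."""
    return rf - down_replicas >= block_for


def max_down_per_range(rf, block_for):
    """Most replicas of one range that can be DOWN: RF - CL."""
    return rf - block_for


print(range_available(3, 1, 2))  # RF 3, one node lost, QUORUM
print(range_available(3, 2, 2))  # RF 3, two nodes lost, QUORUM
print(range_available(3, 2, 1))  # RF 3, two nodes lost, ONE
```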

Now consider a cluster with 50 nodes, RF 5, QUORUM operations and the SimpleStrategy. If 3 adjacent nodes go down the range assigned to the first one will no longer be available, as there will only be 2 replicas UP. The cluster will be down for 2% of the possible keys, not terrible but it’s no longer up for 100%. The range assigned to the second down node will have 3 UP replicas and the range assigned to the third will have 4 UP replicas. The nodes do not have to be adjacent in the ring for this to occur, it could be any nodes in the replica set for the range. It’s just easier to think about when they are adjacent.

Spreading replicas for a key across nodes with different physical infrastructure is a good way to mitigate this risk. The NetworkTopologyStrategy distributes the replicas for a DC across the available Racks, as defined by either the RackInferringSnitch, PropertyFileSnitch or the EC2Snitch which uses AWS Availability Zones as racks. The SimpleSnitch puts all nodes into rack1 in datacenter1.

In the worst case failure scenario the cluster can sustain up to (RF - CL) failures and still remain available for 100% of the keys.

A failing Cluster

The best case scenario for failure is when the node failures are evenly divided amongst the replicas for a range, so that every RF number of failures only removes one node from the available replica set for each range. To know how many nodes we can lose for a Consistency Level to still be available for 100% of the keys, multiply floor(number of nodes / RF) by RF - CL.

For a 5 node cluster with RF 3 at QUORUM this is floor(5 / 3) * (3 - 2) or 1. For other cluster sizes the number is:

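def max_failure(num_nodes, rf, block_for):
    """Print the best case number of failed nodes each cluster size can sustain."""
    print("For up to %s nodes with RF %s and blocking for %s nodes..." % (
        num_nodes, rf, block_for))
    print("Number Nodes / max_failure")
    for n in range(1, num_nodes + 1):
        # floor(n / rf) complete replica sets, each able to lose (rf - block_for) nodes
        print("%s / %s" % (n, (n // rf) * (rf - block_for)))

max_failure(10, 3, 2)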

For up to 10 nodes with RF 3 and blocking for 2 nodes...
Number Nodes / max_failure
1 / 0
2 / 0
3 / 1
4 / 1
5 / 1
6 / 2
7 / 2
8 / 2
9 / 3
10 / 3

In the best case failure scenario the cluster can sustain up to floor(number of nodes / RF) * (RF - CL) failed nodes and still remain up for 100% of the keys.

The view from one node

Considering which nodes are UP or DOWN is always from the perspective of the coordinator node. Network Partitions also play a part in deciding if the cluster is available, as the nodes must be both running and contactable by the coordinator.

At the small scale, if a client connects to a node in the cluster that has lost connectivity to other nodes in the cluster, it will consider them all DOWN and be unavailable for all QUORUM requests. The client will receive an UnavailableException and should connect to another node and retry the request. Other nodes in the cluster may be in a bigger partition that contains enough UP replicas for the request to complete.

At a bigger scale when using Amazon AWS it’s more likely that nodes in an Availability Zone will lose connectivity with nodes from a different AZ than with nodes in the same AZ.

With two AZs operations at QUORUM will require nodes in both AZs as neither will hold QUORUM replicas. So a network partition between the two would result in 100% of the keys being down in both AZs.

With three AZs each AZ will hold one third of the replicas, and any two together may provide enough UP replicas to support QUORUM operations. The cluster could sustain a network partition so long as each AZ can talk to at least one other AZ.

===============================================================

How are Memtables measured?

Updated: I’ve added information on the new memtable_total_space_in_mb setting in version 0.8 and improved the information about memtable_throughput. Thanks for the feedback.

In version 0.7 of Cassandra there are 3 configuration settings that can trigger flushing a memtable to disk. Version 0.8 adds support for a global memtable_total_space_in_mb which may replace the previous 3 settings.

First the 0.7 settings.

memtable_flush_after (minutes)

This is the maximum number of minutes a memtable should stay in memory for if it has received writes. When the memtable is created the current time is recorded which is then checked every 10 seconds. If after the time span either the primary memtable for the Column Family or any secondary index memtables have received writes they are replaced and flushed to disk.

Unfortunately as the wiki points out, there is a good reason to make this value small and a good reason to make it large.

A log file cannot be deleted until all of the segments / records it contains have been marked as completed. This happens when a memtable is flushed to disk. Because the log file is shared by all Column Families, one Column Family that has long living memtables can prevent log files from being deleted and their disk space freed up.

However, smaller values can cause multiple memtables to expire at the same time. Prior to version 0.7 this could cause flush requests to block. Version 0.7 added memtable_flush_writers and memtable_flush_queue_size, but it can still slow down the IO system (see conf/cassandra.yaml for more info). The best approach is to tune the other memtable thresholds to trigger when you want them, and leave this setting as a backup.

The wiki recommends a value of 1440 minutes (24 hours), which is also the default.

memtable_operations (millions)

The Memtable tracks the number of operations applied to it by:

counting the number of top level columns (i.e. Columns for a Standard CF or Super Columns for a Super CF) in a mutation.
considering a row level deletion as a single operation.

Deletions and Insertions are both considered a mutation: inserting 3 columns increases the count by 3 and deleting 3 named columns increases the count by 3. Note that the number of sub columns in a Super Column is ignored. So inserting 10 sub columns into 1 super column increases the count by one, and deleting a super column by name that has 10 sub columns increases the count by one.

The operations threshold (and the size threshold) is checked before applying a mutation and the flush is not requested until after the mutation has completed. So it’s possible for the memtable to contain more than memtable_operations when it is flushed to disk.

A different way to think about this setting, and memtable_throughput, is as sstable_min_operations and sstable_min_bytes. In general operation new sstables are created after at least sstable_min_operations operations have occurred or at most sstable_min_bytes bytes will be written.

If no value is provided when the CF is created it is set to the default memtable throughput in MB (below) / 64 * 0.3, so it’s 300k ops per 64MB of throughput. If you have a CF that contains many small columns it’s a good idea to look at the log entries for memtable flushes to see if the ops threshold is triggering early and causing small memtables to be frequently written.
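
The default can be checked with a one line calculation. This is a sketch of the formula as described above, with a function name of my choosing.

```python
def default_memtable_operations(throughput_mb):
    """Default ops threshold in millions: throughput (MB) / 64 * 0.3."""
    return throughput_mb / 64.0 * 0.3


print(default_memtable_operations(64))  # about 0.3 million operations
print(default_memtable_operations(62))  # about 0.29 million operations
```

The second call uses the 62MB throughput shown in the describe keyspace output later in this article, which lists roughly 0.29 million operations next to it.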

memtable_throughput

Throughput for the memtable is tracked and tested at the same time as the operation count. But counting the byte size of the data is more involved and depends on the type of the column. The size of the data when serialised to disk is counted as follows:

standard column byte size is
- length of the key byte array plus 2 bytes to store the length
- 1 byte to indicate if the column has been deleted
- 8 bytes for the timestamp
- length of the value byte array plus 4 bytes to store the length
expiring columns (those with a TTL) add another 8 bytes to the length of a standard column.
deleted columns (tombstones) are the same as standard columns but the value is always 4 bytes long.
counter columns (in Cassandra v0.8) add another 8 bytes to the length of a standard column. Note that for a counter column the value will always be an 8 byte long.
super columns sum the size of all contained columns and then add
- length of the name byte array plus 2 bytes to store the length
- 4 bytes to indicate when it was deleted
- 8 bytes to store the timestamp for the deletion
- 4 bytes to store the number of sub columns

(Currently the calculation for the super column only includes the sum of the sub columns. I think this needs to be changed.)

A row deletion will add zero bytes to the throughput counter.
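
The byte counting rules above translate into a handful of small functions. This is only a sketch of the arithmetic as listed, not Cassandra's serialisation code. I am assuming that the "key" of a standard column means the column name, and super_column_size follows the list above rather than the sub-columns-only behaviour mentioned in the parenthetical note.

```python
def standard_column_size(name, value):
    """name length + 2, deleted flag 1, timestamp 8, value length + 4."""
    return (len(name) + 2) + 1 + 8 + (len(value) + 4)


def expiring_column_size(name, value):
    """Columns with a TTL add another 8 bytes."""
    return standard_column_size(name, value) + 8


def tombstone_size(name):
    """A deleted column is a standard column whose value is always 4 bytes."""
    return standard_column_size(name, b"\x00" * 4)


def counter_column_size(name):
    """An 8 byte long value, plus another 8 bytes for the counter."""
    return standard_column_size(name, b"\x00" * 8) + 8


def super_column_size(name, sub_column_sizes):
    """Sum of the sub columns plus name, deletion time, timestamp and count."""
    return sum(sub_column_sizes) + (len(name) + 2) + 4 + 8 + 4


col = standard_column_size(b"col", b"value")
print(col)
print(super_column_size(b"sc", [col, col]))
```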

The byte size of the mutation is always added to the counter, if one mutation replaces columns in the memtable their byte size is not subtracted from the counter.

Getting this setting wrong is a very easy way to run out of memory. From version 0.7 onwards the worst case scenario is up to CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory in the JVM at once. It’s best to be conservative, follow the wiki advice and consider that the JVM may take up to 10 times as much memory as it takes to serialise the data to disk.
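
A quick back of the envelope calculation shows how fast this adds up. The sketch below uses the worst case formula from the paragraph above and treats the "up to 10 times" figure as an upper bound. The function names and the example numbers (5 Column Families, 2 secondary indexes, 64MB throughput) are made up for illustration.

```python
def max_memtables_in_memory(cf_count, secondary_index_count,
                            flush_queue_size=4, flush_writers=1):
    """Worst case number of memtables held in the JVM at once."""
    return cf_count + secondary_index_count + flush_queue_size + flush_writers


def worst_case_heap_mb(memtables, throughput_mb, jvm_overhead=10):
    """Upper bound on heap use if every memtable is full (assumed 10x overhead)."""
    return memtables * throughput_mb * jvm_overhead


memtables = max_memtables_in_memory(cf_count=5, secondary_index_count=2)
print(memtables)
print(worst_case_heap_mb(memtables, throughput_mb=64))
```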

And that’s the problem with this threshold. It’s not measuring how much memory a memtable is using in the JVM Heap, it’s measuring the maximum amount of bytes it could take to serialise the data (excluding the index and bloom filter) to disk. Which makes it a difficult knob to use when tuning how much memory Cassandra uses.

If a value is not provided when the Column Family is created it will default to 1/16th the maximum size of the JVM Heap at the time. This value is stored with the Column Family meta data and will not change again. Typical values are around 128MB to 256MB.

memtable_total_space_in_mb

Version 0.8 adds the per node memtable_total_space_in_mb setting which makes life easier and may eventually replace the 3 previous settings. While it’s fun to play with the per CF settings, it can also be a pain when building real systems that need to stay up.

If no value is set in conf/cassandra.yaml the setting will default to one third of the JVM max Heap size. If it is set to zero the setting is disabled and only the old per CF thresholds will be used. If the global setting is enabled and there are per CF settings both of them will be used.

There are two parts to the global memtable size, measuring the real memory usage of the memtable and flushing. First the measuring.

Rather than track every byte allocated the server periodically works out the ratio between the throughput as measured above and the real in memory bytes as measured by the JVM. The in memory byte count is worked out using the Instrumentation Java Package and code from Jonathan Ellis. After a mutation has been applied to the memtable, but before a flush is requested, Cassandra calculates the “Live Ratio” if more than twice as many operations (as calculated above) have been processed since the last time it was calculated.

Measuring the Live Ratio is done asynchronously and involves measuring the real memory size of all the keys, super columns and columns in the memtable and dividing it by the throughput as measured above. For sanity the ratio is clamped between 1.0 and 64.0, if the value is outside of this range a WARN level log message will let you know. Finally the ratio for the Column Family is updated to the new ratio if and only if the new ratio is higher than the previous one. An INFO level message will let you know when the ratio is calculated, how long it took and if it changed.
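
The clamping and the "only ever goes up" rule can be sketched as follows. This models the description above and is not the server code: the function name is mine and the WARN line is a plain print standing in for the log message.

```python
MIN_RATIO = 1.0
MAX_RATIO = 64.0


def updated_live_ratio(current_ratio, live_bytes, throughput_bytes):
    """Return the new Live Ratio for a Column Family."""
    if throughput_bytes <= 0:
        return current_ratio  # nothing measured yet, keep the old ratio
    measured = live_bytes / float(throughput_bytes)
    clamped = max(MIN_RATIO, min(MAX_RATIO, measured))
    if clamped != measured:
        print("WARN: live ratio %.2f is outside 1.0 to 64.0, clamping" % measured)
    # The ratio is only replaced when the new value is higher.
    return max(current_ratio, clamped)


print(updated_live_ratio(1.0, 500, 100))
print(updated_live_ratio(8.0, 500, 100))
```

The first call raises the ratio to 5.0, while the second keeps the existing 8.0 because the newly measured value is lower.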

Next the MeteredFlusher runs every second and uses a two phase approach to keeping the live memory use under the setting. First it looks at the total live bytes for each Column Family, including its secondary indexes, and flushes CFs that could potentially fill the memory if allowed to create memtables of this size. Live bytes are calculated by multiplying the throughput as previously measured by the Live Ratio. The Flusher considers the Column Family to be using too much memory if its current live size is more than memtable_total_space_in_mb divided by the maximum number of memtables the Column Family could have in memory. The calculation for this is similar to the one presented above for memtable_throughput but it includes secondary indexes and a fudge factor that takes into account how the live size is measured.

For example, suppose memtable_total_space_in_mb is 100MB, memtable_flush_writers is the default 1 (with one data directory), memtable_flush_queue_size is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB, or 14MB, because if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB. At a more sensible 2GB for memtable_total_space_in_mb (1/3 of a 6GB JVM Heap) the CF will be flushed if it is using 292MB of live memory.
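
The two figures in this example can be reproduced with a short sketch. How the divisor of 7 is composed is an assumption on my part: one memtable being written to, one per flush writer, the flush queue, and one extra for the fudge factor, which gives 7 with the defaults. Secondary indexes are left out because the example has none.

```python
def max_in_flight_memtables(flush_writers=1, flush_queue_size=4):
    # Assumption: active memtable + writers + queue + one fudge factor memtable.
    return 1 + flush_writers + flush_queue_size + 1


def per_cf_flush_threshold_mb(total_space_mb, flush_writers=1, flush_queue_size=4):
    """Live size (MB, rounded down) above which a CF is flushed in phase one."""
    return total_space_mb // max_in_flight_memtables(flush_writers, flush_queue_size)


print(per_cf_flush_threshold_mb(100))   # the 100MB example
print(per_cf_flush_threshold_mb(2048))  # the 2GB example
```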

(I’ve skipped a couple of things here such as considering the bytes currently being flushed.)

The flusher process will end there if the number of bytes that were flushing when it started plus the bytes for all the CF’s that were not flushed in the first phase is less than memtable_total_space_in_mb.

The second phase flushes the CFs in order of largest to smallest until the total live size (including the bytes currently being flushed) gets down below the target setting.
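
Putting the two phases together, a simplified model of the flusher's decision might look like this. It is a sketch under stated assumptions and not the MeteredFlusher code: sizes are plain megabyte numbers, the function and parameter names are mine, and a CF chosen for flushing is treated as no longer counting toward the live total, which simplifies the real accounting.

```python
def choose_flushes(live_mb_by_cf, total_space_mb, per_cf_limit_mb, flushing_mb=0):
    """Return the CF names to flush, in the order they would be chosen."""
    to_flush = []
    remaining = {}
    # Phase 1: flush any CF whose live size is above its per CF limit.
    for cf, live in live_mb_by_cf.items():
        if live > per_cf_limit_mb:
            to_flush.append(cf)
        else:
            remaining[cf] = live
    # Phase 2: if still over the target, flush from largest to smallest.
    total = flushing_mb + sum(remaining.values())
    for cf in sorted(remaining, key=remaining.get, reverse=True):
        if total < total_space_mb:
            break
        to_flush.append(cf)
        total -= remaining[cf]
    return to_flush


print(choose_flushes({"users": 20, "events": 5, "logs": 3}, 100, 14))
print(choose_flushes({"users": 10, "events": 12, "logs": 8}, 25, 14))
```

In the first call only users is over its per CF limit. In the second call no CF is over the limit but the total of 30MB is above the 25MB target, so the largest CF (events) is flushed.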

This new setting (and the existing flush_largest_memtables_at) should make it harder to shoot yourself in the foot with memory management and easier for new users to feel comfortable with the server.

In Motion

You can check the per CF thresholds as well as the current tracked values for a memtable using bin/nodetool, bin/cassandra-cli or JConsole. I’m not aware of any current features to check the Live Ratio or Live Size of a CF.

bin/nodetool cfstats can tell you the current operation count (‘Memtable Columns Count’) and throughput (‘Memtable Data Size’):

$ ./bin/nodetool -h localhost cfstats
Keyspace: dev
    Read Count: 1
    Read Latency: 0.897 ms.
    Write Count: 2
    Write Latency: 0.051 ms.
    Pending Tasks: 0
        Column Family: data
        SSTable count: 2
        Space used (live): 9530
        Space used (total): 9530
        Memtable Columns Count: 1
        Memtable Data Size: 26
        Memtable Switch Count: 1
        Read Count: 1
        Read Latency: 0.897 ms.
        Write Count: 2
        Write Latency: 0.020 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 2
        Key cache hit rate: 0.0
        Row cache: disabled
        Compacted row minimum size: 51
        Compacted row maximum size: 86
        Compacted row mean size: 73

‘bin/cassandra-cli’ can tell you the current thresholds using either describe keyspace or show keyspaces.

[default@dev] describe keyspace;
Keyspace: dev:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [datacenter1:2]
  Column Families:
    ColumnFamily: data
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 0.29062499999999997/62/1440 (millions of ops/MB/minutes)

Or, using JConsole, connect to the server, select MBeans and then navigate to org.apache.cassandra.db.ColumnFamilies.<your-keyspace>.<your-column-family>. There you can find the current thresholds:

MemtableFlushAfterMins
MemtableOperationsInMillions
MemtableThroughputInMB

And the running values:

MemtableColumnCount
MemtableDataSize

=======================================================================

Cassandra DBA Information at one place.
