Wednesday, 6 December 2017

Apache Cassandra Interview Questions:

Apache Cassandra Interview Questions:


Q1: How many types of NoSQL databases are there?


Answer: There are four types of NoSQL databases, namely:

  1. Document Stores (MongoDB, Couchbase)
  2. Key-Value Stores (Redis, Volgemort)
  3. Column Stores (Cassandra)
  4. Graph Stores (Neo4j, Giraph)

Q2: What do you understand by Commit log in Cassandra?


Answer: Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

Q3: Define Mem-table in Cassandra.

Answer: It is a memory-resident data structure. After commit log, the data will be written to the mem-table. Mem-table is in-memory/write-back cache space consisting of content in key and column format. The data in mem- table is sorted by key, and each column family consists of a distinct mem-table that retrieves column data via key. It stores the writes until it is full, and then flushed out.

Q4: What is SSTable?

Answer: SSTable or ‘Sorted String Table,’ refers to an important data file in Cassandra. It accepts regular written memtables which are stored on disk and exist for each Cassandra table. Being immutable, SStables do not allow any further addition and removal of data items once written. For each SSTable, Cassandra creates three separate files like partition index, partition summary and a bloom filter.

Q5: What is bloom filter?

Answer: Bloom filter is an off-heap data structure to check whether there is any data available in the SSTable before performing any I/O disk operation.

Q6: Establish the difference between a node, cluster & data centres in Cassandra.

Answer: Node is a single machine running Cassandra.
Cluster is a collection of nodes that have similar type of data grouped together.
Data centres are useful components when serving customers in different geographical areas. Different nodes of a cluster are grouped into different data centres.

Q7: Define composite type in Cassandra?

Answer: In Cassandra, composite type allows to define a key or a column name with a concatenation of data of different type. You can use two types of Composite Types:
  1. Row Key
  2. Column Name

Q8: What is Cassandra Data Model?

Answer: Cassandra Data Model consists of four main components, namely:
  1. Cluster: These are made up of multiple nodes and keyspaces.
  2. Keyspace: It is a namespace to group multiple column families, especially one per partition.
  3. Column: It consists of a column name, value and timestamp
  4. Column family: This refers to multiple columns with row key reference.

Q9: Explain what is a keyspace in Cassandra?

Answer: In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consists of one keyspace per node.

Q10: Elaborate on CQL?

Answer: A user can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.

Q11: Talk about the concept of tunable consistency in Cassandra.

Answer: Tunable Consistency is a characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big data Architects. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistencies – Eventual Consistency and Strong Consistency.

Q12: What are the three components of Cassandra write?

Answer: The three components are:
  1. Commitlog write
  2. Memtable write
  3. SStable write
Cassandra first writes data to a commit log and then to an in-memory table structure memtable and at last in SStable.

Q13: Explain zero consistency.

Answer:  In zero consistency the write operations will be handled in the background, asynchronously. It is the fastest way to write data.

Q14: Mention what are the values stored in the Cassandra Column?

Answer: There are three values in Cassandra Column. They are:
  1. Column Name
  2. Value
  3. Time Stamp

Q15: What do you understand by Kundera?

Answer: Kundera is an object-relational mapping (ORM) implementation for Cassandra which is written using Java annotations.

Q16: What is the concept of SuperColumn in Cassandra?

Answer: Cassandra SuperColumn is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action.

Q17: When do you have to avoid secondary indexes?

Answer: Try not using secondary indexes on columns containing a high count of unique values as that will produce few results.

Q18: List the steps in which Cassandra writes changed data into commitlog?

Answer: Cassandra concatenates changed data to commitlog. Then Commitlog acts as a crash recovery log for data. Until the changed data is concatenated to commitlog, write operation will never be considered successful.

Q19: What is the use of “ResultSet execute(Statement statement)” method?

Answer: This method is used to execute a query. It requires a statement object.

Q20: What is Thrift?

Answer: Thrift is the name of the Remote Procedure Call (RPC) client used to communicate with the Cassandra server.

Q21: Explain the two types of compactions in Cassandra.

Answer: Compaction refers to a maintenance process in Cassandra , in which, the SSTables are reorganized for data optimization of data structures on the disk. There are two types of compaction in Cassandra:
  1. Minor compaction: It starts automatically when a new table is created. Here, Cassandra condenses all the equally sized tables into one.
  2. Major compaction: It is triggered manually using nodetool. It compacts all tables of a ColumnFamily into one.

Q22: Explain what is Cassandra-Cqlsh?

Answer: Cassandra-Cqlsh is a query language that enables users to communicate with its database. By using Cassandra cqlsh, you can do following things:
  1. Define a schema
  2. Insert a data, and
  3. Execute a query

Q23: What is the use of “void close()” method?

Answer: This method is used to close the current session instance.

Q24: What are the collection data types provided by CQL?

Answer: There are three collection data types:
  1. List : A list is a collection of one or more ordered elements.
  2. Map : A map is a collection of key-value pairs.
  3. Set : A set is a collection of one or more elements.

Q25: Describe Replication Factor?

Answer: Replication Factor is the measure of number of data copies existing. It is important to increase the replication factor to log into the cluster.
==========================================================================

Cassandra Interview Questions

What is Apache Cassandra?
Apache Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can server as both
– Real time data store system for online applications
– Also as a read intensive database for business intelligence system
OR
Apache Cassandra is an open source, distributed and decentralized/distributed storage system (database), for managing very large amounts of structured data spread out across the world. It provides highly available service with no single point of failure.It was developed at Facebook for inbox search and it was open-sourced by Facebook in July 2008.
What was the design goal of Cassandra?
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure.
What is NoSQL Database?
NoSQL database (sometimes called as Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have simple API, eventually consistent, and can handle huge amounts of data.
Cassandra is written in which language?
Java
How many types of NoSQL databases?
– Document Stores (MongoDB, Couchbase)
– Key-Value Stores (Redis, Volgemort)
– Column Stores (Cassandra)
– Graph Stores (Neo4j, Graph)
What do you understand by composite type?
Composite Type is a cool feature of Hector and Cassandra.
It allow to define a key or a column name with a concatenation of data of different type.
With Cassandra Unit, you can use Composite Type in 2 places :
– row key
– column name
Mention what are the main components of Cassandra Data Model?
The main components of Cassandra Data Model are
– Cluster
– Keyspace
– Column
– Column & Family
What is the relationship between Apache Hadoop, HBase, Hive and Cassandra?
Apache Hadoop, File Storage, Grid Compute processing via Map Reduce.
Apache Hive, SQL like interface on top of hadoop.
Apache Hbase, Column Family Storage built like BigTable
Apache Cassandra, Column Family Storage build like BigTable with Dynamo topology and consistency.
Desired to gain proficiency on Cassandra? Explore the blog post on Cassandra Training 
to become a pro in Cassandra.
List out some key features of Apache Cassandra?
It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
Cassandra is being used by some of the biggest companies such as Facebook, Twitter,
Cisco, Rackspace, ebay, Twitter, Netflix, and more.
What do you understand by Data Replication in Cassandra?
Database replication is the frequent electronic copying data from a database in one computer or server to a database in another so that all users share the same level of information.
Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule,  the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
What do you understand by Node in Cassandra?
Node is the place where data is stored.
What do you understand by Data center in Cassandra?
Data center is a collection of related nodes.
What do you understand by Cluster in Cassandra?
Cluster is a component that contains one or more data centers.
What do you understand by Commit log in Cassandra?
Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
What do you understand by Mem-table in Cassandra?
Mem-table is a memory-resident data structure. After commit log, the data will be written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
What do you understand by SSTabl in Cassandra?
SSTable is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
What do you understand by Bloom filter in Cassandra?
Bloom filter are nothing but quick, nondeterministic, algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.
What do you understand by CQL?
User can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.
What do you understand by Column Family?
Column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
What is the use of “void close()” method?
This method is used to close the current session instance.
What is the use of “ResultSet execute(Statement statement)” method?
This method is used to execute a query. It requires a statement object.
Which command is used to start the cqlsh prompt?
Cqlsh
What is the use of “cqlsh –version” command?
This command will provides the version of the cqlsh you are using.
What are the collection data types provided by CQL?
List : A list is a collection of one or more ordered elements.
Map : A map is a collection of key-value pairs.
Set : A set is a collection of one or more elements.
What is Cassandra database used for?
Apache Cassandra is a second-generation distributed database originally open-sourced by Facebook. Its write-optimized shared-nothing architecture results inexcellent performance and scalability. The Cassandra storage cluster and S3 archival layer are designed to expand horizontally to any arbitrary size with linear cost.Cassandra’s memory footprint is more dependent on the number of column families than on the size of the data set. Cassandra scales pretty well horizontally for storage and IO, but not for memory footprint, which is tied to your schema and your cache settings regardless of the size of your cluster. some of the important link about cassandara is available-here.
What is the syntax to create keyspace in Cassandra?
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE WITH
What is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consist of one keyspace per node.
What is cqlsh?
cqlsh is a Python-based command-line client for cassandra.
Does Cassandra works on Windows?
Yes, Cassandra works pretty well on windows. Right now we have linux and windows compatible versions available.
What do you understand by Consistency in Cassandra?
Consistency means to synchronize and how up-to-date a row of Cassandra data is on all of its replicas.
Explain Zero Consistency?
In this write operations will be handled in the background, asynchronously. It is the fastest way to write data, and the one that is used to offer the least confidence that operations will succeed.
What do you understand by Thrift?
Thrift is the name of the RPC client used to communicate with the Cassandra server.
What do you understand by Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
What does JMX stands for?
JMX stands for Java Management Extension
How Cassandra provide High availability feature?
Cassandra is a robust software. Nodes joining and leaving are automatically taken care of. With proper settings, Cassandra can be made failure resistant. That means that if some of the servers fail, the data loss will be zero. So, you can just deploy Cassandra over cheap commodity hardware or a cloud environment, where hardware or infrastructure failures may occur.
Interested in mastering Cassandra? Learn more about Cassandra Tutorials in this blog post.
When to use Cassandra?
Being a part of NoSQL family Cassandra offers solution for problem where your requirement is to have very heavy write system and you want to have quite responsive reporting system on top of that stored data. Consider use case of Web analytic where log data is stored for each request and you want to built analytical platform around it to count hits by hour, by browser, by IP, etc in real time manner.
When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?
Cassandra is based on NoSQL database and does not provide ACID and relational data property. If you have strong requirement of ACID property (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make work out of it, however you will end up writing lots of application code to handle ACID property and will loose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.
What are secondary indexes?
Secondary indexes are indexes built over column values. In other words, let’s say you have a user table, which contains a user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, to solve the inverse query given an email, fetch the user ID requires a secondary index.
When to use secondary indexes?
You want to query on a column that isn’t the primary key and isn’t part of a composite key. The column you want to be querying on has few unique values (what I mean by this is, say you have a column Town, that is a good choice for secondary indexing because lots of people will be form the same town, date of birth however will not be such a good choice).
When to avoid secondary indexes?
Try not using secondary indexes on columns contain a high count of unique values and that will produce few results.
I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?
XX%
What happens to existing data in my cluster when I add new nodes?
When a new nodes joins a cluster, it will automatically contact the other nodes in the cluster and copy the right data to itself.
What are “Seed Nodes” in Cassandra?
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed node helps the process of bootstrapping for a new node joining a cluster. Its recommended to use the 2 seed node per data center.
What ports does Cassandra use?
By default, Cassandra uses 7000 for cluster communication, 9160 for clients (Thrift), and 8080 for JMX. These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP.
What do you understand by High availability?
A high availability system is the one that is ready to serve any request at any time. High availability is usually achieved by adding redundancies. So, if one part fails, the other part of the system can serve the request. To a client, it seems as if everything
 worked fine.
======================================================

Q) What is NoSQL ?

NoSQL (sometimes expanded to “not only sql“) is a broad class of database management systems that differ from the classic model of the relational database management system (rdbms) in some significant ways.
NoSQL systems:
  • Specifically designed for high load
  • Natively support horizontal scalability
  • Fault tolerant
  • Store data in denormalised manner
  • Do not usually enforce strict database schema
  • Do not usually store data in a table
  • Sometimes provide eventual consistency instead of ACID transactions
In contrast to RDBMS, NoSQL systems:
  • Do not guarantee data consistency
  • Usually support a limited query language (subset of SQL or another custom query language)
  • May not provide support for transactions/distributed transactions
  • Do not usually use some advanced concepts of RDBMS, such as triggers, views, stored procedures
NoSQL implementations can be categorised by their manner of implementation:
  • Column-oriented
  • Document store
  • Graph
  • Key-value store
  • Multivalue databases
  • Object databases
  • Tripplestore
  • Tuple store

Q. Explain what is Cassandra?

Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can server as both
  • Real time data store system for online applications
  • Also as a read intensive database for business intelligence system

Q. Why Cassandra? Why not any other no SQL like Hbase ?

Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly available, fault-tolerant, tune-ably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. Cassandra lies in CA bucket of CAP Theorem.
Our use case was more of write intensive. Since Cassandra provide Consistency and Availability, which was requirement of our use case we preferred Cassandra.
HBase is really good for Low latency read write kind of use cases.

Q. Explain Cassandra Data Model.

– The Cassandra data model has 4 main concepts which are cluster, keyspace, column,column&family.
– Clusters contain many nodes (machines) and can contain multiple keyspaces.
– A keyspace is a namespace to group multiple column families, typically one per application.
– A column contains a name, value and timestamp.
– A column family contains multiple columns referenced by a row keys.

Q. Explain about Cassandra NoSQL.

Cassandra is an open source scalable and highly available “NoSQL” distributed database management system from Apache. Cassandra claims to offer fault tolerant linear scalability with no single point of failure. Cassandra sits in the Column-Family NoSQL camp.The Cassandra data model is designed for large scale distributed data and trades ACID compliant data practices for performance and availability.Cassandra is optimized for very fast and highly available writes.Cassandra is written in Java and can run on a vast array of operating systems and platform.

Q. Explain how Cassandra writes.

Cassandra writes first to a commit log on disk for durability then commits to an in-memory structure called a memtable. A write is successful once both commits are complete. Writes are batched in memory and written to disk in a table structure called an SSTable (sorted string table). Memtables and SSTables are created per column family. With this design Cassandra has minimal disk I/O and offers high speed write performance because the commit log is append-only and Cassandra doesn’t seek on writes. In the event of a fault when writing to the SSTable Cassandra can simply replay the commit log.

Q. What platforms does Cassandra run on?

Cassandra is a Java Application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM). Datastax Strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance. Packaged releases are provided for RedHat, CentOS , Debian and Ubuntu Linux Platforms.

Q. What is the CQL Language?

Cassandra 0.8 is the first release to introduce Cassandra Query Language(CQL), the first standardized query language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted with the Apache Cassandra project.
CQL Syntax is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINS, for example.

Q. What management tools exist for Cassandra?

Datastax supplies both a free and commercial version of OpsCenter, which is a visual, browser-based management toll for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax Website.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions(JMX). Java Management Extensions (JMX) is a Java technology that supplies tools for managing and monitoring Java Applications and services. Any statistics or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operation such as flushing caches or doing a repair.

Q. Briefly Explain CAP theorem.

The CAP theorem (also called as Brewer’s theorem after its author, Eric Brewer) states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.
CAP theorem states that in any given system, you can strongly support only two of these three.

Q. Why Cassandra is called decentralized no sql data base?

Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. Cassandra is decentralized means that there is no single point of failure. All of the nodes in a Cassandra cluster functions exactly the same. There is NO Master NO Slave.

Q. What do you understand by Elastic Scalability?

Elastic Scalability means that your cluster can seamlessly scale up and scale back down. That actually means that adding more servers to cluster would improve and scale performance of cluster in linear fashion without any manual interventions. Vice versa is equally true.

Q. Cassandra is said to be Tune able Consistent. Why?

Consistency essentially means that a read always returns the most recently written value. Cassandra allows you to easily decide the level of consistency you require, in balance with the level of availability. This is controlled by parameters like replication factor and consistency level.

Q. How Cassandra Achieve High Availability and Fault Tolerance?

Cassandra is highly available. You can easily remove few of Cassandra failed node from cluster without actually losing any data and without bring whole cluster down. In similar fashion you can also improve performance by replicating data to multiple data center.

Q. What is basic difference between data center and cluster in terms of Cassandra?

A collection of related nodes is called so. A data center can be a physical data center or virtual data center. Replication is set by data center. Depending on the replication factor, data can be written to multiple data centers. However, data centers should never span physical locations whereas a cluster contains one or more data centers. It can span physical locations.

Q. What is the use of Cassandra and why to use Cassandra?

Cassandra was designed to handle big data workloads across multiple nodes without any single point of failure.  The various factors responsible for using Cassandra are
  • It is fault tolerant and consistent
  • Gigabytes to petabytes scalabilities
  • It is a column-oriented database
  • No single point of failure
  • No need for separate caching layer
  • Flexible schema design
  • It has flexible data storage, easy data distribution, and fast writes
  • It supports ACID (Atomicity, Consistency, Isolation, and Durability)properties
  • Multi-data center and cloud capable
  • Data compression

Q. Explain what is composite type in Cassandra?

In Cassandra, composite type allows to define key or a column name with a concatenation of data of different type. You can use two types of Composite Type
  • Row Key
  • Column Name

Q. How Cassandra stores data?

  • All data stored as bytes
  • When you specify validator, Cassandra ensures those bytes are encoded as per requirement
  • Then a comparator orders the column based on the ordering specific to the encoding
  • While composite are just byte arrays with a specific encoding, for each component it stores a two byte length followed by the byte encoded component followed by a termination bit.

Q. Mention what are the main components of Cassandra Data Model?

The main components of Cassandra Data Model are
  • Cluster
  • Keyspace
  • Column
  • Column & Family

Q. Explain what is a column family in Cassandra?

Column family in Cassandra is referred for a collection of Rows.

Q. Explain what is a cluster in Cassandra?

A cluster is a container for keyspaces. Cassandra database is segmented over several machines that operate together. The cluster is the outermost container which arranges the nodes in a ring format and assigns data to them.  These nodes have a replica which takes charge in case of data handling failure.

Q. List out the other components of Cassandra?

The other components of Cassandra are
  • Node
  • Data Center
  • Cluster
  • Commit log
  • Mem-table
  • SSTable
  • Bloom Filter

Q. Explain what is a keyspace in Cassandra?

In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consist of one keyspace per node.

Q. What is the syntax to create keyspace in Cassandra?

Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE WITH

Q. Mention what are the values stored in the Cassandra Column?

In Cassandra Column, basically there are three values
  • Column Name
  • Value
  • Time Stamp

Q. Mention when you can use Alter keyspace?

ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_write of a keyspace.

Q. Explain what is Cassandra-Cqlsh?

Cassandra-Cqlsh is a query language that enables users to communicate with its database. By using Cassandra cqlsh, you can do following things
  • Define a schema
  • Insert a data and
  • Execute a query

Q. Mention what does the shell commands “Capture” and “Consistency” determines?

There are various Cqlsh shell commands in Cassandra. Command “Capture”, captures the output of a command and adds it to a file while, command “Consistency” display the current consistency level or set a new consistency level.

Q. What is mandatory while creating a table in Cassandra?

While creating a table primary key is mandatory, it is made up of one or more columns of a table.

Q. Mention what needs to be taken care while adding a Column?

While adding a column you need to take care that the
  • Column name is not conflicting with the existing column names
  • Table is not defined with compact storage option

Q. Mention what is Cassandra- CQL collections?

Cassandra CQL collections help you to store multiple values in a single variable. In Cassandra, you can use CQL collections in following ways
  • List: It is used when the order of the data needs to be maintained, and a value is to be stored multiple times (holds the list of unique elements)
  • SET: It is used for group of elements to store and returned in sorted orders (holds repeating elements)
  • MAP: It is a data type used to store a key-value pair of elements

Q. Explain how Cassandra writes data?

Cassandra writes data in three components
  • Commitlog write
  • Memtable write
  • SStable write
Cassandra first writes data to a commit log and then to an in-memory table structure memtable and at last in SStable

Q. Explain what is Memtable in Cassandra?

  • Cassandra writes the data to a in memory structure known as Memtable
  • It is an in-memory cache with content stored as key/column
  • By key Memtable data are sorted
  • There is a separate Memtable for each ColumnFamily, and it retrieves column data from the key

Q. Explain what is SStable consist of?

SStable consist of mainly 2 files
  • Index file ( Bloom filter & Key offset pairs)
  • Data file (Actual column data)

Q. Explain what is Bloom Filter is used for in Cassandra?

A bloom filter is a space efficient data structure that is used to test whether an element is a member of a set. In other words, it is used to determine whether an SSTable has data for a particular row. In Cassandra it is used to save IO when performing a KEY LOOKUP.

Q. Explain how Cassandra writes changed data into commitlog?

  • Cassandra concatenate changed data to commitlog
  • Commitlog acts as a crash recovery log for data
  • Until the changed data is concatenated to commitlog write operation will be never considered successful
Data will not be lost once commitlog is flushed out to file.

Q. Explain how Cassandra delete Data?

SSTables are immutable and cannot remove a row from SSTables.  When a row needs to be deleted, Cassandra assigns the column value with a special value called Tombstone. When the data is read, the Tombstone value is considered as deleted.

Q. What is a commit log?

It is a crash-recovery mechanism. All data is written first to the commit log (file) for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.

Q. What is SSTable? Is it similar to RDBMS table?

A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.
Whereas RDBMS Table collection of ordered columns fetched by row.

Q. What is Gossip protocol?

Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster.

Q. What is Bloom Filter?

These are quick, nondeterministic, algorithms for testing whether an element is a member of a set. Bloom filters are accessed after every query.

Q. What is Order Preserving partitioner?

This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort order. Configuring your column family to use order-preserving partitioning allows you to perform range slices, meaning that Cassandra knows which nodes have which keys. This partitioner is somewhat the opposite of the Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly distributing keys.
The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra .dht.OrderPreservingPartitionerclass. There is a special kind of OPP called the collating order-preserving partitioner (COPP). This acts like a regular OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering. For this reason, it is useful for locale-aware applications. The COPP is implemented by the org.apache.cassandra .dht.CollatingOrderPreservingParti tioner class. This is implemented in Cassandra by org.apache.cassandra.dht.OrderPreservingPartitioner.

Q. What are key spaces and column family in Cassandra?

In Cassandra logical division that associates similar data is called as column family. Basic Cassandra data structures: the column, which is a name/value pair (and a client-supplied timestamp of when it was last updated), and a column family, which is a container for rows that have similar, but not identical, column sets. We have a unique identifier for each row could be called a row key. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database.

Q. What is the use of HELP command?

It is used to display a synopsis and a brief description of all cqlsh commands.

Q. What is the use of capture command?

Capture command is used to captures the output of a command and adds it to a file.

Q. What is materialized view? Why is it normal practice in Cassandra to have it?

Materialized” means storing a full copy of the original data so that everything you need to answer a query is right there, without forcing you to look up the original data. This is because you don’t have a SQL WHERE clause, you can recreate this effect by writing your data to a second column family that is created specifically to represent that query.

Q. Why Time stamp is so important while inserting data in Cassandra?

This is important because Cassandra use timestamps to determine the most recent write value.

Q. Why are super columns in Cassandra no longer favoured?

Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to deserialize all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.
In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.

Q. What are advantages and disadvantages of secondary indexes in Cassandra?

Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns to the WHERE clause of a SELECT.
When to use secondary indexes: You want to query on a column that isn’t the primary key and isn’t part of a composite key. The column you want to be querying on has few unique values (what I mean by this is, say you have a column Town, that is a good choice for secondary indexing because lots of people will be form the same town, date of birth however will not be such a good choice).
When to avoid secondary indexes: Try not using secondary indexes on columns contain a high count of unique values and that will produce few results. Remember it makes writing to DB much slower, you can find value only by exact index and you need to make requests to all servers in cluster to find value by index.

Q. How do you query Cassandra?

We query Cassandra using cql (Cassandra query language). We use cqlsh for interacting with DB.

Q. What is cqlsh?

It’s a Python-based command-line client for cassandra.

Q. Does Cassandra works on Windows?

Yes Cassandra works pretty well on windows. Right now we have linux and windows compatible versions available.

Q. Why renormalization is preferred in Cassandra?

This is because Cassandra does not support joins. User can join data at its own end.

Q. What is the sue of consistency command?

Consistency command is used to copy data to and from Cassandra to a file.

Q. Does Cassandra Support Transactions?

Yes and No, depending on what you mean by ‘transactions’. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There are no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by ‘transactions’ you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor Does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

Q. What is Compaction in Cassandra?

The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.

Q. What is Anti-Entropy?

Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring
that data on different nodes is updated to the newest version.

Q. What do you understand by Consistency in Cassandra?

Consistency means to synchronize and how up-to-date a row of Cassandra data is on all of its replicas.

Q. Explain Zero Consistency?

In this write operations will be handled in the background, asynchronously. It is the fastest way to write data, and the one that is used to offer the least confidence that operations will succeed.

Q. Explain Any Consistency?

Ii assures that our write operation was successful on at least one node, even if the acknowledgment is only for a hint. It is a relatively weak level of consistency.

Q. Explain ONE consistency?

It is used to ensure that the write operation was written to at least one node, including its commit log and memtable.

Q. Explain QUORUM consistency?

A quorum is a number of nodes that is used to represent the consensus on an operation. It is determined by / 2 + 1.

Q. What is the use of SOURCE command?

SOURCE command is used to execute a file that contains CQL statements.

Q. Explain ALL consistency?

Every node as specified in your configuration entry must successfully acknowledge the write operation. If any nodes do not acknowledge the write operation, the write fails. This has the highest level of consistency and the lowest level of performance.

Q. What are consistency levels for read operations?

  • ONE: This is used to returns the value on the first node that responds and Performs a read repair in the background.
  • QUORUM: It queries all nodes and returns the record with the most recent timestamp after quorums of nodes have responded where a “quorum” is (n /2) + 1.
  • DCQUORUM: It ensures that only nodes in the same data center are queried. It is applicable when using Rack-Aware placement strategy.
  • ALL: Queries all nodes and returns the value with the most recent timestamp. This level waits for all nodes to respond, and if one doesn’t, it fails the read operation.

Q. What do you mean by hint handoff?

It is mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs and a node that is intended to receive that write goes down, a note (the “hint”) is given (“handed off”) to a different live node to indicate that it should replay the write operation to the unavailable node when it comes back online. This does two things: it reduces the amount of time that it takes for a node to get all the data it missed once it comes back online, and it improves write performance in lower consistency levels.

Q. What is Merkle Tree? Where is it used in Cassandra?

Merkle tree is a binary tree data structure that summarizes in short form the data in a  larger dataset. Merkle trees are used in Cassandra to ensure that the peer-to-peer network of nodes receives data blocks unaltered and unharmed.

Q. What do you mean by multiget?

It means a query by column name for a set of keys.

Q. What is Multiget Slice?

It means query to get a subset of columns for a set of keys.

Q. What is a SEED node in Cassandra?

A seed is a node that already exists in a Cassandra cluster and is used by newly added nodes to get up and running. The newly added node can start gossiping with the seed node to get state information and learn the topology of the node ring. There may be
many seeds in a cluster.

Q. What is Slice and Range slice in Cassandra? 

This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use get_range_slice() to return a subset of columns for a range of keys.

Q. What is Tombstone in Cassandra world?

Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a “tombstone,” an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be propagated to other replicas.

Q. What is Thrift?

Thrift is the name of the RPC client used to communicate with the Cassandra server.

Q. What is Batch Mutates? 

Like a batch update in the relational world, the batch_mutate operation allows grouping calls on many keys into a single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of mutations, there will be no rollback, so any updates that have already occurred up to this point will remain intact.

Q. What is Hector?

Hector is an open source project written in Java using the MIT license. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.

Q. What is Kundera?

Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.

Q. What is Random Partitioner? 

This is a kind of Partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of causing inefficient range queries. This is the default partitioner.

Q. What is Read Repair? 

This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the nodes with stale data to get them up to date with the newer data returned from the original read operation. It does this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.

Q. What is Snitch in Cassandra? 


A snitch is Cassandra’s way of mapping a node to a physical location in the network.
It helps determine the location of a node relative to another node in order to assist with discovery and ensure efficient request routing.
=================================================================

Cassandra is one of the most favoredNoSQL distributed database management systems by Apache. With open source technology, Cassandra is efficiently designed to store and manage large volumes of data without any failure. Highly scalable for Big Data models and originally designed by Facebook, Apache Cassandra is written in Java comprising flexible schemas.
Unlike traditional or any other database, Apache Cassandradelivers near real-time performance simplifying the work of Developers, Administrators, Data Analysts and Software Engineers.
• Instead of master-slave architecture, Cassandra is established on peer-to-peer architecture ensuring no failure.
• It also assures phenomenal flexibility as it allows insertion of multiple nodes to any Cassandra cluster in any datacenter. Further, any client can forward its request to any server.
• Cassandra facilitates extensible scalability and can be easily scaled up and scaled down as per the requirements. With a high throughput for read and write operations, this NoSQL application need not be restarted while scaling.
• Cassandra is also revered for its strong data replication capability as it allows data storage at multiple locations enabling users to retrieve data from another location if one node fails. Users have the option to set up the number of replicas they want to create.
• Shows brilliant performance when used for massive datasets and thus, the most preferable NoSQL DB by most organizations.
• Operates on column-oriented structure and thus, quickens and simplifies the process of slicing. Even data access and retrieval becomes more efficient with column-based data model.
• Further, Apache Cassandra supports schema-free/schema-optional data model, which un-necessitate the purpose of showing all the columns required by your application.
Tunable Consistency is a phenomenal characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big data Architects. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistencies -Eventual and Consistency and Strong Consistency.
The former guarantees consistency when no new updates are made on a given data item, all accesses return the last updated value eventually. Systems with eventual consistency are known to have achieved replica convergence.
For Strong consistency, Cassandra supports the following condition:
R + W > N, where
N – Number of replicas
W – Number of nodes that need to agree for a successful write
R – Number of nodes that need to agree for a successful read
Cassandra performs the write function by applying two commits-first it writes to a commit log on disk and then commits to an in-memory structured known as memtable. Once the two commits are successful, the write is achieved. Writes are written in the table structure as SSTable (sorted string table). Cassandra offers speedier write performance.
DataStaxOpsCenter: internet-based management and monitoring solution for Cassandra cluster and DataStax. It is free to download and includes an additional Edition of OpsCenter
• SPM primarily administers Cassandra metrics and various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, zookeeper and other Big Data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, creating real-time graphs with zooming, anomaly detection and heartbeat alerting.
Similar to table, memtable is in-memory/write-back cache space consisting of content in key and column format. The data in memtable is sorted by key, and each ColumnFamily consist of a distinct memtable that retrieves column data via key. It stores the writes until it is full, and then flushed out.
SSTable expands to ‘Sorted String Table,’ which refers to an important data file in Cassandra and accepts regular written memtables. They are stored on disk and exist for each Cassandra table. Exhibiting immutability, SStables do not allow any further addition and removal of data items once written. For each SSTable, Cassandra creates three separate files like partition index, partition summary and a bloom filter.
Associated with SSTable, Bloom filter is an off-heap (off the Java heap to native memory) data structure to check whether there is any data available in the SSTable before performing any I/O disk operation.
With a strong requirement to scale systems when additional resources are needed, CAP Theorem plays a major role in maintaining the scaling strategy. It is an efficient way to handle scaling in distributed systems. Consistency Availability and Partition tolerance (CAP) theorem states that in distributed systems like Cassandra, users can enjoy only two out of these three characteristics. One of them needs to be sacrificed. Consistency guarantees the return of most recent write for the client, Availability returns a rational response within minimum time and in Partition Tolerance, the system will continue its operations when network partitions occur. The two options available are AP and CP.
While a node is a single machine running Cassandra, cluster is a collection of nodes that have similar type of data grouped together. DataCentersare useful components when serving customers in different geographical areas. You can group different nodes of a cluster into different data centers.
Using CQL (Cassandra Query Language).Cqlsh is used for interacting with database.
Cassandra Data Model consists of four main components:
Cluster : Made up of multiple nodes and keyspaces
Keyspace : A namespace to group multiple column families, especially one per partition
Column : It consists of a column name, value and timestamp
ColumnFamily : multiple columns with row key reference.
CQL is Cassandra Query language to access and query the Apache distributed database. It consists of a CQL parser that incites all the implementation details to the server. The syntax of CQL is similar to SQL but it does not alter the Cassandra data model.
Compaction refers to a maintenance process in Cassandra , in which, the SSTables are reorganized for data optimization of data structure son the disk. The compaction process is useful during interactive with memtable. There are two type sof compaction in Cassandra:
Minor compaction : started automatically when a new sstable is created. Here, Cassandra condenses all the equally sized sstables into one.
Major compaction : It is triggered manually using nodetool. Compacts all sstables of a ColumnFamily into one.
Unlike relational databases, Cassandra does not support ACID transactions.
Cqlsh expands to Cassandra Query language Shell that configures the CQL interactive terminal. It is a Python-base command-line prompt used on Linux or Windows and exequte CQL commands like ASSUME, CAPTURE, CONSITENCY, COPY, DESCRIBE and many others. With cqlsh, users can define a schema, insert data and execute a query.
Cassandra Super Column is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action: keystore> column family> super column> column data structure in JSON.
Similar to row keys, super column data entries contains no independent values but are used to collect other columns. It is interesting to note that super column keys appearing in different rows do not necessarily match and will not ever.
• ALL: Highly consistent. A write must be written to commitlog and memtable on all replica nodes in the cluster
• EACH_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in all data centers.
• LOCAL_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in the same center.
• ONE: A write must be written to commitlog and memtableof at least one replica node.
• TWO, Three: Same as One but at least two and three replica nodes, respectively
• LOCAL_ONE: A write must be written for at least one replica node in the local data center
• SERIAL: Linearizable Consistency to prevent unconditional updates
• LOCAL_SERIAL: Same as Serial but restricted to local data center
Both elements work on the principle of tuple having name and value. However, the former‘s value is a string while the value in latter is a Map of Columns with different data types.
Unlike Columns, Super Columns do not contain the third component of timestamp.
As the name suggests, ColumnFamily refers to a structure having infinite number of rows. That are referred by a key-value pair, where key is the name of the column and value represents the column data. It is much similar to a hashmap in java or dictionary in Python. Rememeber, the rows are not limited to a predefined list of Columns here. Also, the ColumnFamily is absolutely flexible with one row having 100 Columns while the other only 2 columns.
Source command is used to execute a file consisting of CQL statements.
Thrift is a legacy RPC protocol or API unified with a code generation tool for CQL. The purpose of using Thrift in Cassandra is to facilitate access to the DB across the programming language.
Tombstone is row marker indicating a column deletion. These marked columns are deleted during compaction. Tombstones are of great significance as Cassnadra supports eventual consistency, where the data must respond before any successful operation.
Since Cassandra is a Java application, it can successfully run on any Java-driven platform or Java Runtime Environment (JRE) or Java Virtual Machine (JVM). Cassandra also runs on RedHat, CentOS, Debian and Ubuntu Linux platforms.
The default settings state that Cassandra uses 7000 ports for Cluster Management, 9160 for Thrift Clients, 8080 for JMX. These are all TCP ports and can be edited in the configuration file: bin/Cassandra.in.sh
• Yes, but keeping in mind the following processes.
• Do not forget to clear the commitlog with ‘nodetool drain’
• Turn off Cassandra to check that there is no data left in commitlog
• Delete the sstable files for the removed CFs
ReplicationFactor is the measure of number of data copies existing. It is important to increase the replication factor to log into the cluster.
Yes, but it will require running repair to alter the replica count of existing data.
Using get_range_slices. You can start iteration with the empty string and after each iteration, the last key read serves as the start key for next iteration.
Cassandra was designed to handle big data workloads across multiple nodes without any single point of failure. The various factors responsible for using Cassandra are
• It is fault tolerant and consistent
• Gigabytes to petabytes scalabilities
• It is a column-oriented database
• No single point of failure
• No need for separate caching layer
• Flexible schema design
• It has flexible data storage, easy data distribution, and fast writes
• It supports ACID (Atomicity, Consistency, Isolation, and Durability)properties
• Multi-data center and cloud capable
• Data compression
In Cassandra, composite type allows to define key or a column name with a concatenation of data of different type. You can use two types of Composite Type
• Row Key
• Column Name
All data stored as bytes.When you specify validator, Cassandra ensures those bytes are encoded as per requirement.Then a comparator orders the column based on the ordering specific to the encoding.While composite are just byte arrays with a specific encoding, for each component it stores a two byte length followed by the byte encoded component followed by a termination bit.
A cluster is a container for keyspaces. Cassandra database is segmented over several machines that operate together. The cluster is the outermost container which arranges the nodes in a ring format and assigns data to them. These nodes have a replica which takes charge in case of data handling failure.
The other components of Cassandra are
• Node
• Data Center
• Cluster
• Commit log
• Mem-table
• SSTable
• Bloom Filter
In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consist of one keyspace per node.
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE WITH
In Cassandra Column, basically there are three values
• Column Name
• Value
• Time Stamp
ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_write of a keyspace.
There are various Cqlsh shell commands in Cassandra. Command “Capture”, captures the output of a command and adds it to a file while, command “Consistency” display the current consistency level or set a new consistency level.
While creating a table primary key is mandatory, it is made up of one or more columns of a table.
While adding a column you need to take care that the Column name is not conflicting with the existing column names .Table is not defined with compact storage option
Cassandra CQL collections help you to store multiple values in a single variable. In Cassandra, you can use CQL collections in following ways
LIST: It is used when the order of the data needs to be maintained, and a value is to be stored multiple times (holds the list of unique elements)
SET: It is used for group of elements to store and returned in sorted orders (holds repeating elements)
MAP: It is a data type used to store a key-value pair of elements
Cassandra concatenate changed data to commitlog.Commitlog acts as a crash recovery log for data.Until the changed data is concatenated to commitlog write operation will be never considered successful.Data will not be lost once commitlog is flushed out to file.
SSTables are immutable and cannot remove a row from SSTables. When a row needs to be deleted, Cassandra assigns the column value with a special value called Tombstone. When the data is read, the Tombstone value is considered as deleted.
The snitch is a configurable component of a Cassandra cluster used to define how the nodes are grouped together within the overall network topology (such as rack and data center groupings). Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Cassandra (it does not control which node a client connects to).
Simple Snitch- It has the strategy of placing the copy of the row on the next available node walking clockwise through the nodes.
Rack Inferring Snitch– It tries to place copies of rows of different racks in the data center. It will know about the rack and data center and will try to place copies in different racks and data centers. From the IP address, it can determine the data center address and the rack. So the IP address will have to be configured in such a way that the second unit of IP address will be used to identify the data center. The third unit identifies the rack.
There are two aspects to Capacity Planning- User data size and Usable Disk capacity. How can one estimate the user data size? If you have one terabyte of information to be stored, there can be some over heads above the one terabyte of data. There might also be some indexes that might take more time and space.
There is something called as Usable disk capacity. Sometimes, extra space is needed to carry out internal processes. Hence, all the disk space is not available to the user so the usable disk capacity has to be calculated first. Therefore, there are certain factors that you have to consider like the various operations happening on the disk, the various internal processes happening on Cassandra cluster etc.
Cassandra is the right choice when you need scalability and high availability without compromising on performance. Cassandra file system is an HDFS file system that is replaceable with your standard HDFS file. You can change the Hadoop configuration and explore and expose Cassandra’s file system as HDFS. With this file system, it is easy to get rid of the name nodes and data node daemons because Cassandra can take care of that.
• The Cassandra file system is decentralized.
• It doesn’t have a single point of failure and has a replication facility.
• It is very similar to HDFS and this is an HDFS compatible system.
• Another important factor about Cassandra file system is that it can be used for indexing.
• Being indexed in an HDFS file system is very difficult since everything gets distributed on blocks but in the Cassandra file system, certainly, you can have the information index and hence, this provides a very unique advantage.
• One can have the index in the Cassandra file system and then the power of Hadoop could be used to traverse the data and do some smart scanning, instead of scanning all the data and finding out respective information.
OLTP is said to be more of an online transactional system or data storage system, where the user does lots of online transactions using the data store. It is also said to have more ad-hoc reads/writes happening on real time basis.
OLAP is more of an offline data store. It is accessed number of times in offline fashion. For example, Bulk log files are read and then written back to data files. Some of the common areas where OLAP is used are Log Jobs, Data mining Jobs, etc.
Cassandra is said to be more of OLTP, as it is real-time, whereas Hadoop is more of OLAP, since it is used for analytics and bulk writes.
In Cassandra, the communication between nodes is often like peer-to-peer communication, where every node talks to the other. If that’s the case, then all the nodes talk to one another, and there is a lot of communication happening. The Gossip Protocol is a method to resolve this communication chaos. In Cassandra, when one node talks to another, the node which is expected to respond, not only provides information about its status, but also provides information about the nodes that it had communicated with before. Through this process, there is a reduction in network log, more information is kept and efficiency of information gathering increases. The main feature of the protocol is to provide the latest information of any node respectively.
Failure Detection :
An important feature of Gossip Protocol is Failure Detection. Basically, when two nodes communicate with one another; for instance, Node A to Node B, then Node A sends a message ‘gossipdigestsynmessage’, which is very similar to TCP protocol to Node B. Here, Node B, once receives the message, sends an acknowledgement message ‘ack’, and then Node A responds with an acknowledgement message to Node B’s ‘ack’ message. This is known as the 3 way handshake.
If in case, the node goes down and does not send the ack message, then it will be a mark down. Even when the nodes are down, the other nodes will be periodically pinging and that is how the failure detection happens.
Starting Cassandra involves connecting to the machine where it is installed with the proper security credentials, and invoking the cassandra executable from the installation’s binary directory. An example of starting Cassandra on Mac could be:
sudo /Applications/Cassandra/apache-cassandra-1.1.1/bin/cassandra
•The basic command line interface (CLI) for logging into and executing commands against Cassandra is the cassandra-cli utility, which is found in the software installation’s bin directory.
•An example of logging into a local machine’s Cassandra installation using the CLI and the default Cassandra port might be:
Welcome to the Cassandra CLI.
Type ‘help;’ or ‘?’ for help.
Type ‘quit;’ or ‘exit;’ to quit.
Cassandra can be used in many different data management situations. Some of the most common use cases for Cassandra include:Serving as the operational/real-time/system-of-record datastore for Web or other online applications needing around-the-clock transactional input capabilities .Applications needing “network independence”, meaning systems that cannot worry about where data lives.
Cassandra is typically not the choice for transactional data that needs per-transaction commit/rollback capabilities. Note that Cassandra does have atomic transactional abilities on a per row/insert basis (but with no rollback capabilities).
The primary difference between Cassandra and Hadoop is that Cassandra targets real-time/operational data, while Hadoop has been designed for batch-based analytic work.
There are many different technical differences between Cassandra and Hadoop, including Cassandra’s underlying data structure (based on Google’s Bigtable), its fault-tolerant, peer-to-peer architecture, multi-data center capabilities, tunable data consistency, all nodes being the same (no concept of a namenode, etc.) and much more.
HBase is an open-source, column-oriented data store modeled after Google Bigtable, and is designed to offer Bigtable-like capabilities on top of data stored in Hadoop. However, while HBase shared the Bigtable design with Cassandra, its foundational architecture is much different.
A Cassandra cluster is much easier to setup and configure than a comparable HBase cluster. HBase’s reliance on the Hadoop namenode equates to there being a single point of failure in HBase, whereas with Cassandra, because all nodes are the same, there is no such issue
In internal performance tests conducted at DataStax (using the Yahoo Cloud Serving Benchmark – YCSB), Cassandra offered literally 5X better performance in writes and 4X better performance on reads than HBase.
MongoDB is a document-oriented database that is built upon a master-slave/sharding architecture. MongoDB is designed to store/manage collections of JSON-styled documents.
By contrast, Cassandra uses a peer-to-peer, write/read-anywhere styled architecture that is based on a combination of Google BigTable and Amazon Dynamo. This allows Cassandra to avoid the various complications and pitfalls of master/slave and sharding architectures. Moreover, Cassandra offers linear performance increases as new nodes are added to a cluster, scales to terabyte-petabyte data volumes, and has no single point of failure.
Cassandra does not use a master/slave architecture, but instead uses a peer-to-peer implementation, which avoids the pitfalls, latency problems, single point of failure issues, and performance headaches associated with master/slave setups.
Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance. When you create a keyspace in Cassandra, you must decide the replica placement strategy: the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other.
The total number of replicas across the cluster is often referred to as the relication factor. A replication factor of 1 means that there is only one copy of each row. A replication factor of 2 means two copies of each row. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.Replication options are defined when you create a keyspace in Cassandra. The snitch is configured per node.
Cassandra provides a number of options to partition your data across nodes in a cluster.
The RandomPartitioner is the default partitioning strategy for a Cassandra cluster. It uses a consistent hashing algorithm to determine which node will store a particular row. The end result is an even distribution of data across a cluster.
The ByteOrderedPartitioner ensures that row keys are stored in sorted order. It is not recommended for most use cases and can result in uneven distribution of data across a cluster.
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure.
Cassandra is capable of offering linear performance benefits when new nodes are added to a cluster.
A new machine can be added to an existing cluster by installing the Cassandra software on the server and configuring the new node so that it knows (1) the name of the Cassandra cluster it is joining; (2) the seed node(s) it should obtain its data from; (3) the range of data that it is responsible for, which is done by assigning a token to the node.
Nodes can be removed from a Cassandra cluster by using the nodetool utility and issuing a decommission command. This can be done without affecting the overall operations or uptime of the cluster.
Cassandra can easily replicate data between different physical datacenters by creating a keyspace that uses the replication strategy currently termed NetworkTopologyStrategy. This strategy allows you to configure Cassandra to automatically replicate data to different data centers and even different racks within datacenters to protect against specific rack/physical hardware failures causing a cluster to go down. It can also replicate data between public Clouds and on-premises machines.
The main Cassandra configuration file is the cassandra.yaml file, which houses all the main options that control how Cassandra operates.
•Cassandra’s architecture make it perfect for full Cloud deployments as well as hybrid implementations that store some data in the Cloud and other data on premises.
•DataStax provides an Amazon AMI that allows you to quickly deploy a Cassandra cluster on EC2. See the online documentation for a step-by-step guide to installing a Cassandra cluster on Amazon.
Cassandra negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures.
•Cassandra is architected in a peer-to-peer fashion and uses a protocol called gossip to communicate with other nodes in a cluster. The gossip process runs every second to exchange information across the cluster.
•Gossip only includes information about the cluster itself (up/down, joining, leaving, version, schema, etc.) and does not manage the data. Data is transferred node-to-node using a message passing like protocol on a distinct port from what client applications connect to.
•The Cassandra partitioner turns a column family key into a token, the replication strategy picks the set of nodes responsible for that token (using information from the snitch) and Cassandra sends messages to those replicas with the request (read or write).
•The gossip protocol is used to determine the state of all nodes in a cluster and if a particular node has gone down.
•The gossip process tracks heartbeats from other nodes and uses an accrual detection mechanism to calculate a per-node threshold that takes into account network conditions, workload, or other conditions that might affect perceived heartbeat rate before a node is actually marked as down.
•The configuration parameter phi_convict_threshold in the cassandra.yaml file is used to control Cassandra’s sensitivity of node failure detection. The default value is appropriate for most situations. However in Cloud environments, such as Amazon EC2, the value should be increased to 12 in order to account for network issues that sometimes occur on such platforms.
Yes, data compression is available with Cassandra 1.0 and above. The snappy compression algorithm from Google is used and is able to deliver fairly impressive storage savings, in some cases compressing raw data up to 80+% with no performance penalties for read/write operations. In fact, because of the reduction in physical I/O, compression actually increases performance in some use cases. Compression is enabled/disabled on a per-column family basis and is not enabled by default.
•Currently, the most common method for backing up data in Cassandra is using the snapshot function in the nodetool utility. This is an online operation and does not require any downtime or block any operations on the server.
•Snapshots are sent by default to a snapshots directory that is located in the Cassandra data directory (controlled via the data_file_directories in the cassandra.yaml file). Once taken, snapshots can be moved off-site to be protected.
•Incremental backups (i.e. data backed up since the last full snapshot) can be performed by setting the incremental_backups parameter in the cassandra.yaml file to ‘true’. When incremental backup is enabled, Cassandra copies every flushed SSTable for each keyspace to a backup directory located under the Cassandra data directory. Restoring from an incremental backup involves first restoring from the last full snapshot and then copying each incremental file back into the Cassandra data directory. Eg.
Create a Cassandra snapshot for a single nodenodetool -h 10.10.10.1 snapshot KEYSPACE_NAME
Create a cluster wide Cassandra snapshot
clustertool -h 10.10.10.1 global_snapshot KEYSPACE_NAME
In general, restoring a Cassandra node is done by first following these procedures
step 1: Shut down the node that is to be restored
step 2: Clear the commit log by removing all the files in the commit log directory
e.g.
rm /var/lib/cassandra/commitlog/*
step 3:
Remove the database files for all keyspaces
e.g.
rm /var/lib/cassandra/data/keyspace1/*.db
Take care so as not to remove the snapshot directory for the keyspace
step 4:
Copy the latest snapshot directory contents for each keyspace to the keyspace’s data directory
e.g.
cp -p /var/lib/cassandra/data/keyspace1/snapshots/56046198758643-snapshotkeyspace1/* /var/lib/cassandra/data/keyspace1
step 5:
Copy any incremental backups taken for each keyspace into the keyspace’s data directory
step 6:
Repeat steps 3-5 for each keyspace
step 7:
Restart the node

div class=”panel-group” >
•Yes. First, data durability is fully supported in Cassandra so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.
•Second, Cassandra offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. Cassandra provides consistency in the CAP sense in that all readers will see the same values.
•Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Cassandra is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.
•In Cassandra, consistency refers to how up-to-date and synchronized a row of data is on all of its replicas. Cassandra offers a number of built-in features to ensure data consistency:
•Hinted Handoff Writes &ndash:Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of write, its corresponding replicas will save hints about the missed writes, and then handoff the affected rows once the node comes back online again. Hinted handoff ensures data consistency due to short, transient node outages.
•Read Repair &ndash: Read operations trigger consistency across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica; a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.
•Anti-Entropy Node Repair &ndash: For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations.
•Most RDBMS’s have an unload utility that allows data to be unloaded to flat files. Once in flat file format, the sstableloader utility can be used to load the data into Cassandra column families.
•Some developers write programs to connect to both an RDBMS and Cassandra and move data in that way.
Read operations trigger consistency checks across all replicas for a requested row using a process called read repair. For reads, there are two types of read requests that a coordinator node can send to a replica; a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the read consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas. Read repair is an optional feature and can be configured per column family.
There are a number of CQL (Cassandra Query Language) drivers and native client libraries available for most all popular development languages (e.g. Java, Ruby, etc.)
•In a relational database, you must specify a data type for each column when you define a table. The data type constrains the values that can be inserted into that column. For example, if you have a column defined as an integer datatype, you would not be allowed to insert character data into that column.
•In Cassandra, you can specify a data type for both the column name (called a comparator) as well as for row key and column values (called a validator).Column and row key data in Cassandra is always stored internally as hex byte arrays, but the compartor/validators are used to verify data on insert and translate data on retrieval.
•In the case of comparators (column names), the comparator also determines the sort order in which columns are stored.
Cassandra comes with the following comparators and validators:
BytesType >
Bytes (no validation)
AsciiType >
US-ASCII bytes
UTF8Type >
UTF-8 encoded strings
LongType >
64-bit longs
LexicalUUIDType >
128-bit UUID by byte value
TimeUUIDType >
Version 1 128-bit UUID by timestamp
CounterColumnType* >
64-bit signed integer
* can only be used as a column validator, not valid as a row key validator or column name comparator
•Yes and No, depending on what you mean by “transactions”. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There is no locking or transactional dependencies when concurrently updating multiple rows or column families.
•But if by “transactions” you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
•Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
•However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
•Cassandra is a gossip-based distributed system. ListenAddress is also ‘contact me here address,’ i.e., the address it tells other nodes to reach it at. Telling other nodes ‘contact me on any of my addresses’ is a bad idea; if different nodes in the cluster pick different addresses for you, Bad Things happen.
•If you don’t want to manually specify an IP to ListenAddress for each node in your cluster (understandable!), leave it blank and Cassandra will use InetAddress.getLocalHost() to pick an address. Then it’s up to you or your ops team to make things resolve correctly (/etc/hosts/, dns, etc).
•One exception to this process is JMX, which by default binds to 0.0.0.0 (Java bug 6425769).
•This is a symptom of memory pressure, resulting in a storm of GC operations as the JVM frantically tries to free enough heap to continue to operate. Eventually, the server will crash from OutOfMemory; usually, but not always, it will be able to log this final error before the JVM terminates.
•You can increase the amount of memory the JVM uses, or decrease the insert threshold before Cassandra flushes its memtables. See MemtableThresholds for details.
•Setting your cache sizes too large can result in memory pressure.
• Yes, but it’s important that you do it correctly.
• Empty the commitlog with ‘nodetool drain.’
• Shutdown Cassandra and verify that there is no remaining data in the commitlog.
• Delete the sstable files (-Data.db, -Index.db, and -Filter.db) for any CFs removed, and rename the files for any CFs that were renamed.
• Make necessary changes to your storage-conf.xml.
• Start Cassandra back up and your edits should take effect.
• No, any node in the cluster will work; Cassandra nodes proxy your request as needed. This leaves the client with a number of options for end point selection:
• You can maintain a list of contact nodes (all or a subset of the nodes in the cluster), and configure your clients to choose among them.
• Use round-robin DNS and create a record that points to a set of contact nodes (recommended).
• Use the get_string_property(‘token map’) RPC to obtain an update-to-date list of the nodes in the cluster and cycle through them. Deploy a load-balancer, proxy, etc.
• When using a higher-level client you should investigate which, if any, options are implemented by your higher-level client to help you distribute your requests across nodes in a cluster.
•Unlike all major relational databases and some NoSQL systems, Cassandra does not use b-trees and in-place updates on disk. Instead, it uses a sstable/memtable model like Bigtable’s: writes to each ColumnFamily are grouped together in an in-memory structure before being flushed (sorted and written to disk). This means that writes cost no random I/O, compared to a b-tree system which not only has to seek to the data location to overwrite, but also may have to seek to read different levels of the index if it outgrows disk cache!
•The downside is that on a read, Cassandra has to (potentially) merge row fragments from multiple sstables on disk. We think this is a tradeoff worth making, first because scaling writes has always been harder than scaling reads, and second because as your data corpus grows Cassandra’s read disadvantage narrows vs b-tree systems that have to do multiple seeks against a large index. See MemtableSSTable for more details.
•This happens when you have the same token assigned to each node. Don’t do that.
•Most often this bites people who deploy by installing Cassandra on a VM (especially when using the Debian package, which auto-starts Cassandra after installation, thus generating and saving a token), then cloning that VM to other nodes.
•The easiest fix is to wipe the data and commitlog directories, thus making sure that each node will generate a random token on the next restart.
•Because get_range_slice says, ‘apply this predicate to the range of rows given,’ meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.
•So to special case leaving out result entries for deletions, we would have to check the entire rest of the row to make sure there is no undeleted data anywhere else either (in which case leaving the key out would be an error).
•This is what we used to do with the old get_key_range method, but the performance hit turned out to be unacceptable.
• Yes, but it will require running repair to change the replica count of existing data.
• Alter the ReplicationFactor for the desired keyspace(s) using cassandra-cli.If you’re reducing the ReplicationFactor:
• Run ‘nodetool cleanup’ on the cluster to remove surplus replicated data. Cleanup runs on a per-node basis.If you’re increasing the ReplicationFactor:
• Run ‘nodetool repair’ to run an anti-entropy repair on the cluster. Repair runs on a per-replica set basis. This is an intensive process that may result in adverse cluster performance. It’s highly recommended to do rolling repairs, as an attempt to repair the entire cluster at once will most likely swamp it.
•Currently Cassandra isn’t optimized specifically for large file or BLOB storage. However, files of around 64Mb and smaller can be easily stored in the database without splitting them into smaller chunks. This is primarily due to the fact that Cassandra’s public API is based on Thrift, which offers no streaming abilities; any value written or fetched has to fit in to memory. Other non Thrift interfaces may solve this problem in the future, but there are currently no plans to change Thrift’s behavior. When planning applications that require storing BLOBS, you should also consider these attributes of Cassandra as well:
•The main limitation on a column and super column size is that all the data for a single key and column must fit (on disk) on a single machine(node) in the cluster. Because keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound. This is an inherent limitation of the distribution model.
•When large columns are created and retrieved, that columns data is loaded into RAM which can get resource intensive quickly. Consider, loading 200 rows with columns that store 10Mb image files each into RAM. That small result set would consume about 2Gb of RAM. Clearly as more and more large columns are loaded, RAM would start to get consumed quickly. This can be worked around, but will take some upfront planning and testing to get a workable solution for most applications. You can find more information regarding this behavior here: memtables, and a possible solution in 0.7 here: CASSANDRA-16.
•Please refer to the notes in the Cassandra limitations section for more information: Cassandra Limitations.
•Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up it’s own listeners and connectors as needed on each end of the exchange. Normally all of this happens behind the scenes transparently, but incorrect name resolution for either the host connecting, or the one being connected to, can result in crossed wires and confusing exceptions.
•If you are not using DNS, then make sure that your /etc/hosts files are accurate on both ends. If that fails try passing the -Djava.rmi.server.hostname=$IP option to the JVM at startup (where $IP is the address of the interface you can reach from the remote machine).
•As a special case, mutations against a single key are atomic but not isolated. Reads which occur during such a mutation may see part of the write before they see the whole thing. More generally, batch_mutate operations are not atomic. batch_mutate allows grouping operations on many keys into a single call in order to save on the cost of network round-trips. If batch_mutate fails in the middle of its list of mutations, no rollback occurs and the mutations that have already been applied stay applied. The client should typically retry the batch_mutate operation.
There is work being done to support more multi-tenant capabilities such as scheduling and auth. For more information, see MultiTenant.
Check if selinux is on, if it is turn it OFF.
SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted.
•Cassandra uses mmap to do zero-copy reads. That is, we use the operating system’s virtual memory system to map the sstable data files into the Cassandra process’ address space. This will ‘use’ virtual memory; i.e. address space, and will be reported by tools like top accordingly, but on 64 bit systems virtual address space is effectively unlimited so you should not worry about that.
•What matters from the perspective of ‘memory use’ in the sense as it is normally meant, is the amount of data allocated on brk() or mmap’d /dev/zero, which represent real memory used. The key issue is that for a mmap’d file, there is never a need to retain the data resident in physical memory. Thus, whatever you do keep resident in physical memory is essentially just there as a cache, in the same way as normal I/O will cause the kernel page cache to retain data that you read/write.
•The difference between normal I/O and mmap() is that in the mmap() case the memory is actually mapped to the process, thus affecting the virtual size as reported by top. The main argument for using mmap() instead of standard I/O is the fact that reading entails just touching memory – in the case of the memory being resident, you just read it – you don’t even take a page fault (so no overhead in entering the kernel and doing a semi-context switch).
Updating a keyspace first takes a snapshot. This involves creating hardlinks to the existing SSTables, but Java has no native way to create hard links, so it must fork ‘ln’. When forking, there must be as much memory free as the parent process, even though the child isn’t going to use it all. Because Java is a large process, this is problematic. The solution is to install Java Native Access so it can create the hard links itself.
• The set of nodes (a single node, or several) responsible for any given piece of data is determined by:
• The row key (data is partitioned on row key)
• replication factor (decides how many nodes are in the replica set for a given row)
• The replication strategy (decides which nodes are part of said replica set)
• In the case of the SimpleStrategy, replicas are placed on succeeding nodes in the ring. The first node is determined by the partitioner and the row key, and the remainder are placed on succeeding node. In the case of NetworkTopologyStrategy placement is affected by data-center and rack awareness, and the placement will depend on how nodes in different racks or data centers are placed in the ring.
• It is important to understand that Cassandra does not alter the replica set for a given row key based on changing characteristics like current load, which nodes are up or down, or which node your client happens to talk to.
You probably have one or more Column Families with very low throughput. These will typically not be flushed by crossing the throughput or operations thresholds, causing old commit segments to be retained until the memtable_flush_after_min threshold has been crossed. The default value for this threshold is 60 minutes and may be decreased via cassandra-cli by doing:
update column family XXX with memtable_flush_after=YY;
where YY is a number of minutes.
•Seeds are used during startup to discover the cluster
•If you configure your nodes to refer some node as seed, nodes in your ring tend to send Gossip message to seeds more often ( Refer to ArchitectureGossip for details ) than to non-seeds. In other words, seeds are worked as hubs of Gossip network. With seeds, each node can detect status changes of other nodes quickly.
•Seeds are also referred by new nodes on bootstrap to learn other nodes in ring. When you add a new node to ring, you need to specify at least one live seed to contact. Once a node join the ring, it learns about the other nodes, so it doesn’t need seed on subsequent boot.
•Newer versions of cassandra persist the cluster topology making seeds less important then they were in the 0.6.X series, where they were used every startup
•You can make a seed a node at any time. There is nothing special about seed nodes. If you list the node in seed list it is a seed
•Recommended usage of seeds:
•pick two (or more) nodes per data center as seed nodes.
•sync the seed list to all your nodes.
If you are using replicated CF on the ring, only one seed in the ring doesn’t mean single point of failure. The ring can operate or boot without the seed. However, it will need more time to spread status changes of node over the ring. It is recommended to have multiple seeds in production system.
a) Replication takes the same data and copies it over multiple nodes. Sharding puts different data on different nodes
b) Sharding is particularly valuable for performance because it can improve both read and write performance. Using replication, particularly with caching, can greatly improve read performance but does little for applications that have a lot of writes. Sharding provides a way to horizontally scale writes.
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure.
NoSQL database (sometimes called as Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have simple API, eventually consistent, and can handle huge amounts of data.
• Document Stores (MongoDB, Couchbase)
• Key-Value Stores (Redis, Volgemort)
• Column Stores (Cassandra)
• Graph Stores (Neo4j, Giraph)
• Apache Hadoop, File Storage, Grid Compute processing via Map Reduce.
• Apache Hive, SQL like interface ontop of hadoop.
• Apache Hbase, Column Family Storage built like BigTable
• Apache Cassandra, Column Family Storage build like BigTable with Dynamo topology and consistency.
Node is the place where data is stored.
Data center is a collection of related nodes.
Cluster is a component that contains one or more data centers.
In this write operations will be handled in the background, asynchronously. It is the fastest way to write data, and the one that is used to offer the least confidence that operations will succeed.
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
Secondary indexes are indexes built over column values. In other words, let’s say you have a user table, which contains a user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, to solve the inverse query given an email, fetch the user ID requires a secondary index.
You want to query on a column that isn’t the primary key and isn’t part of a composite key. The column you want to be querying on has few unique values (what I mean by this is, say you have a column Town, that is a good choice for secondary indexing because lots of people will be form the same town, date of birth however will not be such a good choice).
Try not using secondary indexes on columns contain a high count of unique values and that will produce few results.
A high availability system is the one that is ready to serve any request at any time. High avaliability is usually achieved by adding redundancies. So, if one part fails, the other part of the system can serve the request. To a client, it seems as if everything worked fine.
Cassandra is a robust software. Nodes joining and leaving are automatically taken care of. With proper settings, Cassandra can be made failure resistant. That means that if some of the servers fail, the data loss will be zero. So, you can just deploy Cassandra over cheap commodity hardware or a cloud environment, where hardware or infrastructure failures may occur.
Hector is an open source project written in Java using the MIT license. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.

No comments:

Post a Comment

Cassandra Authentication and Create User

Cassandra Authentication and Create User: By default when we install cassandra on a machine it do not has any username and password a...