Apache Cassandra Interview Questions:
Q1: How many types of NoSQL databases are there?
Answer: There are four types of NoSQL databases, namely:
- Document Stores (MongoDB, Couchbase)
- Key-Value Stores (Redis, Voldemort)
- Column Stores (Cassandra)
- Graph Stores (Neo4j, Giraph)
Q2: What do you understand by Commit log in Cassandra?
Answer: Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
Q3: Define Mem-table in Cassandra.
Answer: It is a memory-resident data structure. After the commit log, the data is written to the mem-table. The mem-table is an in-memory, write-back cache holding content in key and column format. The data in the mem-table is sorted by key, and each column family has its own mem-table, from which column data is retrieved by key. It stores writes until it is full, and is then flushed to disk.
Q4: What is SSTable?
Answer: SSTable, or ‘Sorted String Table’, is the on-disk data file format in Cassandra. Memtables are periodically flushed to disk as SSTables, which are maintained for each Cassandra table. SSTables are immutable: once written, data items cannot be added or removed. For each SSTable, Cassandra also creates separate structures such as a partition index, a partition summary and a bloom filter.
Q5: What is bloom filter?
Answer: Bloom filter is an off-heap data structure to check whether there is any data available in the SSTable before performing any I/O disk operation.
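The membership check described above can be illustrated with a minimal Bloom filter sketch in Python. This is not Cassandra's implementation; the class name, the bit-array size and the choice of salted SHA-256 as hash function are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent", so the disk read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical partition keys stored in one SSTable.
sstable_filter = BloomFilter()
for partition_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(partition_key)

# A key that was added is always reported as possibly present.
print(sstable_filter.might_contain("user:2"))
```

A negative answer is always correct, which is what lets a read skip an SSTable without touching the disk; a positive answer still has to be confirmed by reading the file.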
Q6: Establish the difference between a node, cluster & data centres in Cassandra.
Answer: Node is a single machine running Cassandra.
Cluster is the full collection of nodes, organized into one or more data centres, that together store the data.
Data centres are useful components when serving customers in different geographical areas. Different nodes of a cluster are grouped into different data centres.
Q7: Define composite type in Cassandra?
Answer: In Cassandra, a composite type lets you define a key or a column name as a concatenation of data of different types. Composite types can be used in two places:
- Row Key
- Column Name
Q8: What is Cassandra Data Model?
Answer: Cassandra Data Model consists of four main components, namely:
- Cluster: These are made up of multiple nodes and keyspaces.
- Keyspace: It is a namespace to group multiple column families, typically one per application.
- Column: It consists of a column name, value and timestamp
- Column family: This refers to multiple columns with row key reference.
Q9: Explain what is a keyspace in Cassandra?
Answer: In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.
Q10: Elaborate on CQL?
Answer: A user can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.
Q11: Talk about the concept of tunable consistency in Cassandra.
Answer: Tunable Consistency is a characteristic that makes Cassandra a favored database choice of developers, analysts and big data architects. Consistency refers to how up-to-date and synchronized a row of data is across all of its replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited to their use cases. It supports two kinds of consistency: Eventual Consistency and Strong Consistency.
Q12: What are the three components of Cassandra write?
Answer: The three components are:
- Commitlog write
- Memtable write
- SStable write
Cassandra first writes data to the commit log, then to an in-memory table structure called the memtable, and finally to an SSTable on disk.
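The three write steps can be sketched with a toy in-memory model in Python. The class name TinyStore, the memtable_limit parameter and the flush-on-size rule are hypothetical simplifications, not Cassandra's actual behaviour or API.

```python
class TinyStore:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log = []      # append-only record of pending writes
        self.memtable = {}        # in-memory writes, keyed by row key
        self.sstables = []        # immutable, sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value) -> None:
        self.commit_log.append((key, value))  # 1. append for crash recovery
        self.memtable[key] = value            # 2. apply to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the memtable is full

    def flush(self) -> None:
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying


store = TinyStore(memtable_limit=2)
store.write("b", 2)
store.write("a", 1)   # memtable is full, so it is flushed as a sorted SSTable
store.write("c", 3)   # stays in the commit log and memtable for now
print(store.sstables, store.memtable, store.commit_log)
```

In this sketch a crash after the third write would lose the memtable, but the entry for "c" is still in the commit log and could be replayed.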
Q13: Explain zero consistency.
Answer: In zero consistency the write operations will be handled in the background, asynchronously. It is the fastest way to write data.
Q14: Mention what are the values stored in the Cassandra Column?
Answer: There are three values in Cassandra Column. They are:
- Column Name
- Value
- Time Stamp
Q15: What do you understand by Kundera?
Answer: Kundera is an object-relational mapping (ORM) implementation for Cassandra which is written using Java annotations.
Q16: What is the concept of SuperColumn in Cassandra?
Answer: Cassandra SuperColumn is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action.
Q17: When do you have to avoid secondary indexes?
Answer: Avoid secondary indexes on columns that contain a high count of unique values, since each query on such a column will return only a few results.
Q18: List the steps in which Cassandra writes changed data into commitlog?
Answer: Cassandra appends changed data to the commitlog. The commitlog then acts as a crash-recovery log for the data. A write operation is never considered successful until the changed data has been appended to the commitlog.
Q19: What is the use of “ResultSet execute(Statement statement)” method?
Answer: This method is used to execute a query. It requires a statement object.
Q20: What is Thrift?
Answer: Thrift is the name of the legacy Remote Procedure Call (RPC) interface that clients used to communicate with the Cassandra server.
Q21: Explain the two types of compactions in Cassandra.
Answer: Compaction is a maintenance process in Cassandra in which SSTables are reorganized to optimize the data structures on disk. There are two types of compaction in Cassandra:
- Minor compaction: It starts automatically when a new SSTable is created. Here, Cassandra condenses all the equally sized SSTables into one.
- Major compaction: It is triggered manually using nodetool. It compacts all SSTables of a ColumnFamily into one.
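The merge step at the heart of compaction can be sketched in Python as follows. This is a simplified illustration that assumes each SSTable is a dict mapping a key to a (value, timestamp) pair; the function name compact is hypothetical, and real compaction strategies are considerably more involved.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the newest cell per key."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            # The write with the highest timestamp wins.
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # The result is again sorted by key, like any SSTable.
    return dict(sorted(merged.items()))


older = {"k2": ("b", 2), "k1": ("a", 1)}
newer = {"k1": ("c", 5)}
print(compact([older, newer]))
```

Because SSTables are immutable, an update does not overwrite the old value in place; the stale copy only disappears when a compaction like this one rewrites the files.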
Q22: Explain what is Cassandra-Cqlsh?
Answer: Cassandra-Cqlsh is a command-line shell for the Cassandra Query Language (CQL) that enables users to communicate with the database. Using cqlsh, you can do the following things:
- Define a schema
- Insert data, and
- Execute a query
Q23: What is the use of “void close()” method?
Answer: This method is used to close the current session instance.
Q24: What are the collection data types provided by CQL?
Answer: There are three collection data types:
- List : A list is a collection of one or more ordered elements.
- Map : A map is a collection of key-value pairs.
- Set : A set is a collection of one or more unique elements.
Q25: Describe Replication Factor?
Answer: Replication factor is the number of copies of each piece of data kept in the cluster. A replication factor greater than one is needed for the data to survive the loss of a node.
==========================================================================
Cassandra Interview Questions
What is Apache Cassandra?
Apache Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can serve as both
– Real time data store system for online applications
– Also as a read intensive database for business intelligence system
OR
Apache Cassandra is an open source, distributed and decentralized storage system (database) for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure. It was developed at Facebook for inbox search and was open-sourced by Facebook in July 2008.
What was the design goal of Cassandra?
The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure.
What is NoSQL Database?
A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are typically schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.
Cassandra is written in which language?
Java
How many types of NoSQL databases?
– Document Stores (MongoDB, Couchbase)
– Key-Value Stores (Redis, Voldemort)
– Column Stores (Cassandra)
– Graph Stores (Neo4j, Giraph)
What do you understand by composite type?
Composite Type is a feature of Hector and Cassandra.
It allows you to define a key or a column name as a concatenation of data of different types.
With Cassandra Unit, you can use Composite Type in 2 places:
– row key
– column name
Mention what are the main components of Cassandra Data Model?
The main components of Cassandra Data Model are
– Cluster
– Keyspace
– Column
– Column family
What is the relationship between Apache Hadoop, HBase, Hive and Cassandra?
Apache Hadoop: file storage and grid compute processing via MapReduce.
Apache Hive: a SQL-like interface on top of Hadoop.
Apache HBase: column-family storage built like Bigtable.
Apache Cassandra: column-family storage built like Bigtable, with Dynamo-style topology and consistency.
List out some key features of Apache Cassandra?
It is scalable, fault-tolerant, and consistent.
It is a column-oriented database.
Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
What do you understand by Data Replication in Cassandra?
Database replication is the frequent electronic copying of data from a database on one computer or server to a database on another, so that all users share the same level of information.
Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
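How a replication strategy might map a row to nodes can be sketched in Python with a simple token ring. This is a loose illustration in the spirit of placing replicas on consecutive nodes of a ring; the hash function, the node names and the function names token and replicas_for are assumptions, not Cassandra's partitioner or strategy code.

```python
import bisect
import hashlib


def token(value: str) -> int:
    # Map a node name or row key to a position on the ring (assumed hash).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def replicas_for(key: str, nodes, replication_factor: int):
    """Return the nodes holding copies of `key`, walking the ring clockwise."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # The first replica is the first node whose token follows the key's token.
    start = bisect.bisect_right(tokens, token(key)) % len(ring)
    # There cannot be more distinct replicas than nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
print(replicas_for("user:42", nodes, replication_factor=3))
```

Every node can compute the same answer independently, which fits the point above that no replica is a primary or master.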
What do you understand by Node in Cassandra?
Node is the place where data is stored.
What do you understand by Data center in Cassandra?
Data center is a collection of related nodes.
What do you understand by Cluster in Cassandra?
Cluster is a component that contains one or more data centers.
What do you understand by Commit log in Cassandra?
Commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
What do you understand by Mem-table in Cassandra?
Mem-table is a memory-resident data structure. After commit log, the data will be written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-tables.
What do you understand by SSTable in Cassandra?
SSTable is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
What do you understand by Bloom filter in Cassandra?
Bloom filters are quick, nondeterministic algorithms for testing whether an element is a member of a set. A Bloom filter is a special kind of cache. Bloom filters are accessed after every query.
What do you understand by CQL?
User can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to work with CQL or separate application language drivers.
What do you understand by Column Family?
Column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
What is the use of “void close()” method?
This method is used to close the current session instance.
What is the use of “ResultSet execute(Statement statement)” method?
This method is used to execute a query. It requires a statement object.
Which command is used to start the cqlsh prompt?
cqlsh
What is the use of “cqlsh --version” command?
This command provides the version of the cqlsh you are using.
What are the collection data types provided by CQL?
List : A list is a collection of one or more ordered elements.
Map : A map is a collection of key-value pairs.
Set : A set is a collection of one or more elements.
What is Cassandra database used for?
Apache Cassandra is a second-generation distributed database originally open-sourced by Facebook. Its write-optimized, shared-nothing architecture results in excellent performance and scalability. The Cassandra storage cluster and S3 archival layer are designed to expand horizontally to any arbitrary size with linear cost. Cassandra’s memory footprint is more dependent on the number of column families than on the size of the data set. Cassandra scales pretty well horizontally for storage and IO, but not for memory footprint, which is tied to your schema and your cache settings regardless of the size of your cluster.
What is the syntax to create keyspace in Cassandra?
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE <identifier> WITH <properties>
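As an illustration, the Python sketch below builds a complete CREATE KEYSPACE statement as a string. The helper name create_keyspace_cql and the keyspace name demo are hypothetical; the statement uses the SimpleStrategy replication class and the durable_writes property, and nothing here connects to a cluster.

```python
def create_keyspace_cql(name: str, replication_factor: int,
                        durable_writes: bool = True) -> str:
    """Build a CREATE KEYSPACE statement using SimpleStrategy replication."""
    # Reject names that are not plain identifiers instead of quoting them.
    if not name.isidentifier():
        raise ValueError(f"unsafe keyspace name: {name!r}")
    replication = (
        "{'class': 'SimpleStrategy', 'replication_factor': %d}"
        % replication_factor
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {replication} "
        f"AND durable_writes = {str(durable_writes).lower()};"
    )


print(create_keyspace_cql("demo", 3))
```

The resulting string could be pasted into cqlsh or passed to a driver; the replication map is where the replication factor for the keyspace is set.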
What is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.
What is cqlsh?
cqlsh is a Python-based command-line client for cassandra.
Does Cassandra work on Windows?
Yes, Cassandra works pretty well on Windows. Right now we have Linux- and Windows-compatible versions available.
What do you understand by Consistency in Cassandra?
Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
Explain Zero Consistency?
With zero consistency, write operations are handled in the background, asynchronously. It is the fastest way to write data, and the one that offers the least confidence that operations will succeed.
What do you understand by Thrift?
Thrift is the name of the RPC client used to communicate with the Cassandra server.
What do you understand by Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
What does JMX stand for?
JMX stands for Java Management Extensions.
How does Cassandra provide high availability?
Cassandra is robust software. Nodes joining and leaving are automatically taken care of. With proper settings, Cassandra can be made failure resistant. That means that if some of the servers fail, the data loss will be zero. So you can deploy Cassandra on cheap commodity hardware or in a cloud environment, where hardware or infrastructure failures may occur.
When to use Cassandra?
As part of the NoSQL family, Cassandra suits problems where you need a very write-heavy system and want a responsive reporting system on top of the stored data. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in real time.
When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?
Cassandra is a NoSQL database and does not provide ACID and relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a fit. You can make it work, but you will end up writing lots of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.
What are secondary indexes?
Secondary indexes are indexes built over column values. For example, say you have a user table which contains each user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, the inverse query (given an email, fetch the user ID) requires a secondary index.
When to use secondary indexes?
Use them when you want to query on a column that isn’t the primary key and isn’t part of a composite key, and the column has few unique values. For example, a Town column is a good choice for secondary indexing because lots of people will be from the same town; date of birth, however, is not such a good choice.
When to avoid secondary indexes?
Avoid secondary indexes on columns that contain a high count of unique values, since each query on such a column will return only a few results.
I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?
XX%
What happens to existing data in my cluster when I add new nodes?
When a new node joins a cluster, it automatically contacts the other nodes in the cluster and copies the right data to itself.
What are “Seed Nodes” in Cassandra?
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes help bootstrap a new node joining a cluster. It is recommended to use 2 seed nodes per data center.
What ports does Cassandra use?
By default, Cassandra uses port 7000 for cluster communication, 9042 for CQL native-protocol clients (9160 for legacy Thrift clients), and 7199 for JMX (8080 in releases before 0.8). These are all editable in the configuration file or bin/cassandra.in.sh (for JVM options). All ports are TCP.
What do you understand by High availability?
A high availability system is one that is ready to serve any request at any time. High availability is usually achieved by adding redundancies. So, if one part fails, the other part of the system can serve the request. To a client, it seems as if everything worked fine.
======================================================
Q) What is NoSQL ?
NoSQL (sometimes expanded to “not only SQL”) is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways.
NoSQL systems:
- Specifically designed for high load
- Natively support horizontal scalability
- Fault tolerant
- Store data in denormalised manner
- Do not usually enforce strict database schema
- Do not usually store data in a table
- Sometimes provide eventual consistency instead of ACID transactions
In contrast to RDBMS, NoSQL systems:
- Do not guarantee data consistency
- Usually support a limited query language (subset of SQL or another custom query language)
- May not provide support for transactions/distributed transactions
- Do not usually use some advanced concepts of RDBMS, such as triggers, views, stored procedures
NoSQL implementations can be categorised by their manner of implementation:
- Column-oriented
- Document store
- Graph
- Key-value store
- Multivalue databases
- Object databases
- Triplestore
- Tuple store
Q. Explain what is Cassandra?
Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can serve as both
- Real time data store system for online applications
- Also as a read intensive database for business intelligence system
Q. Why Cassandra? Why not another NoSQL database like HBase?
Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly available, fault-tolerant, tune-ably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. Cassandra is usually placed in the AP (availability and partition tolerance) bucket of the CAP theorem, with consistency that can be tuned per operation.
Our use case was more write-intensive. Since Cassandra provides high availability together with tunable consistency, which our use case required, we preferred Cassandra.
HBase is really good for low-latency read/write use cases.
Q. Explain Cassandra Data Model.
– The Cassandra data model has 4 main concepts: cluster, keyspace, column and column family.
– Clusters contain many nodes (machines) and can contain multiple keyspaces.
– A keyspace is a namespace to group multiple column families, typically one per application.
– A column contains a name, value and timestamp.
– A column family contains multiple columns referenced by a row key.
Q. Explain about Cassandra NoSQL.
Cassandra is an open source, scalable and highly available “NoSQL” distributed database management system from Apache. Cassandra claims to offer fault-tolerant linear scalability with no single point of failure. Cassandra sits in the Column-Family NoSQL camp. The Cassandra data model is designed for large-scale distributed data and trades ACID-compliant data practices for performance and availability. Cassandra is optimized for very fast and highly available writes. Cassandra is written in Java and can run on a vast array of operating systems and platforms.
Q. Explain how Cassandra writes.
Cassandra writes first to a commit log on disk for durability then commits to an in-memory structure called a memtable. A write is successful once both commits are complete. Writes are batched in memory and written to disk in a table structure called an SSTable (sorted string table). Memtables and SSTables are created per column family. With this design Cassandra has minimal disk I/O and offers high speed write performance because the commit log is append-only and Cassandra doesn’t seek on writes. In the event of a fault when writing to the SSTable Cassandra can simply replay the commit log.
Q. What platforms does Cassandra run on?
Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), which includes a Java Virtual Machine (JVM). DataStax strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance. Packaged releases are provided for RedHat, CentOS, Debian and Ubuntu Linux platforms.
Q. What is the CQL Language?
Cassandra 0.8 is the first release to introduce Cassandra Query Language(CQL), the first standardized query language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted with the Apache Cassandra project.
CQL Syntax is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINS, for example.
Q. What management tools exist for Cassandra?
DataStax supplies both a free and a commercial version of OpsCenter, which is a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax website.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions(JMX). Java Management Extensions (JMX) is a Java technology that supplies tools for managing and monitoring Java Applications and services. Any statistics or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operation such as flushing caches or doing a repair.
Q. Briefly Explain CAP theorem.
The CAP theorem (also called Brewer’s theorem after its author, Eric Brewer) states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.
CAP theorem states that in any given system, you can strongly support only two of these three.
Q. Why is Cassandra called a decentralized NoSQL database?
Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. Cassandra being decentralized means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same. There is no master and no slave.
Q. What do you understand by Elastic Scalability?
Elastic scalability means that your cluster can seamlessly scale up and scale back down. In practice, adding more servers to the cluster improves its performance in a linear fashion without any manual intervention, and removing servers works the same way in reverse.
Q. Cassandra is said to be tunably consistent. Why?
Consistency essentially means that a read always returns the most recently written value. Cassandra allows you to easily decide the level of consistency you require, in balance with the level of availability. This is controlled by parameters like replication factor and consistency level.
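The interplay of replication factor and consistency level comes down to simple arithmetic: a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. A minimal Python sketch, with hypothetical function names:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    # The read and write sets must overlap in at least one replica.
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))                                              # 2
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))      # True
print(is_strongly_consistent(rf, 1, 1))                        # False
```

With a replication factor of 3, quorum writes and quorum reads overlap in at least one replica, while writing and reading a single replica each gives only eventual consistency but tolerates more node failures.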
Q. How does Cassandra achieve high availability and fault tolerance?
Cassandra is highly available. You can remove a few failed Cassandra nodes from the cluster without losing any data and without bringing the whole cluster down. In a similar fashion, you can also improve performance by replicating data to multiple data centers.
Q. What is basic difference between data center and cluster in terms of Cassandra?
A data center is a collection of related nodes. It can be a physical data center or a virtual data center. Replication is set by data center. Depending on the replication factor, data can be written to multiple data centers. However, a data center should never span physical locations, whereas a cluster contains one or more data centers and can span physical locations.
Q. What is the use of Cassandra and why to use Cassandra?
Cassandra was designed to handle big data workloads across multiple nodes without any single point of failure. The various factors responsible for using Cassandra are
- It is fault tolerant and consistent
- It scales from gigabytes to petabytes
- It is a column-oriented database
- No single point of failure
- No need for separate caching layer
- Flexible schema design
- It has flexible data storage, easy data distribution, and fast writes
- It provides row-level atomicity and isolation and durable writes, though not full ACID (Atomicity, Consistency, Isolation, and Durability) transactions
- Multi-data center and cloud capable
- Data compression
Q. Explain what is composite type in Cassandra?
In Cassandra, a composite type lets you define a key or a column name as a concatenation of data of different types. Composite types can be used in two places:
- Row Key
- Column Name
Q. How Cassandra stores data?
- All data stored as bytes
- When you specify validator, Cassandra ensures those bytes are encoded as per requirement
- Then a comparator orders the column based on the ordering specific to the encoding
- Composites are just byte arrays with a specific encoding: for each component, a two-byte length is stored, followed by the byte-encoded component, followed by a termination byte.
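The composite layout in the last point can be sketched in Python. The sketch assumes a big-endian two-byte length and a single trailing terminator byte per component; the function names and the sample components are hypothetical, and this is an illustration of the described layout rather than Cassandra's own serializer.

```python
import struct


def encode_composite(components, end_of_component: int = 0) -> bytes:
    """Encode each component as: 2-byte length, raw bytes, terminator byte."""
    out = bytearray()
    for component in components:
        out += struct.pack(">H", len(component))  # two-byte big-endian length
        out += component                          # the byte-encoded value
        out.append(end_of_component)              # terminator (assumed one byte)
    return bytes(out)


def decode_composite(data: bytes):
    """Reverse of encode_composite: split the byte array into components."""
    components = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
        components.append(data[offset:offset + length])
        offset += length + 1  # skip the value and its terminator byte
    return components


encoded = encode_composite([b"us", b"2024"])
print(encoded)
print(decode_composite(encoded))
```

The length prefix is what allows components of different types and sizes to be concatenated into one key or column name and still be split apart again.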
Q. Mention what are the main components of Cassandra Data Model?
The main components of Cassandra Data Model are
- Cluster
- Keyspace
- Column
- Column family
Q. Explain what is a column family in Cassandra?
A column family in Cassandra is a collection of rows.
Q. Explain what is a cluster in Cassandra?
A cluster is a container for keyspaces. Cassandra database is segmented over several machines that operate together. The cluster is the outermost container which arranges the nodes in a ring format and assigns data to them. These nodes have a replica which takes charge in case of data handling failure.
Q. List out the other components of Cassandra?
The other components of Cassandra are
- Node
- Data Center
- Cluster
- Commit log
- Mem-table
- SSTable
- Bloom Filter
Q. Explain what is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.
Q. What is the syntax to create keyspace in Cassandra?
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE <identifier> WITH <properties>
Q. Mention what are the values stored in the Cassandra Column?
In Cassandra Column, basically there are three values
- Column Name
- Value
- Time Stamp
Q. Mention when you can use Alter keyspace?
ALTER KEYSPACE can be used to change properties of a keyspace such as the number of replicas and the durable_writes setting.
Q. Explain what is Cassandra-Cqlsh?
Cassandra-Cqlsh is a command-line shell for the Cassandra Query Language (CQL) that enables users to communicate with the database. Using cqlsh, you can do the following things
- Define a schema
- Insert data and
- Execute a query
Q. Mention what the shell commands “Capture” and “Consistency” do?
There are various cqlsh shell commands in Cassandra. The command “Capture” captures the output of a command and adds it to a file, while the command “Consistency” displays the current consistency level or sets a new consistency level.
Q. What is mandatory while creating a table in Cassandra?
While creating a table primary key is mandatory, it is made up of one or more columns of a table.
Q. Mention what needs to be taken care while adding a Column?
While adding a column you need to take care that the
- Column name is not conflicting with the existing column names
- Table is not defined with compact storage option
Q. Mention what is Cassandra- CQL collections?
Cassandra CQL collections help you store multiple values in a single column. In Cassandra, you can use CQL collections in the following ways
- List: It is used when the order of the data needs to be maintained and a value may be stored multiple times (holds ordered, possibly repeating elements)
- SET: It is used for a group of elements to be stored and returned in sorted order (holds unique elements)
- MAP: It is a data type used to store a key-value pair of elements
Q. Explain how Cassandra writes data?
Cassandra writes data in three stages
- Commitlog write
- Memtable write
- SSTable write
Cassandra first writes data to the commit log, then to an in-memory table structure called the memtable, and finally flushes it to an SSTable on disk
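The three stages above can be sketched as a toy model in Python. This is a simplified illustration only, not how Cassandra is implemented: the class name ToyNode and the memtable_limit parameter are invented for the example, and "disk" is just a Python list.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory table, flushed when it fills up
        self.sstables = []     # immutable, key-sorted snapshots ("on disk")
        self.memtable_limit = memtable_limit  # assumed flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first, for durability
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the memtable is full

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the commit log


node = ToyNode(memtable_limit=2)
node.write("b", 1)
node.write("a", 2)  # the second key fills the memtable and triggers a flush
node.write("c", 3)  # this write lives only in the commit log and memtable
```

After these three writes the node holds one SSTable containing keys a and b in sorted order, while key c is still only in the commit log and the memtable.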
Q. Explain what is Memtable in Cassandra?
- Cassandra writes the data to an in-memory structure known as the Memtable
- It is an in-memory cache with content stored as key/column
- Memtable data is sorted by key
- There is a separate Memtable for each ColumnFamily, and it retrieves column data from the key
Q. Explain what an SSTable consists of?
An SSTable consists of mainly 2 files
- Index file (Bloom filter & key offset pairs)
- Data file (actual column data)
Q. Explain what a Bloom filter is used for in Cassandra?
A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. In other words, it is used to determine whether an SSTable may have data for a particular row. In Cassandra it is used to save I/O when performing a key lookup.
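To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is not Cassandra's implementation: the bit-array size, the number of hash functions and the use of SHA-256 to derive bit positions are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    bf.add(row_key)

print(bf.might_contain("user:1"))  # a key that was added is always reported
```

A read can skip an SSTable whenever might_contain returns False for the requested key, which is where the I/O saving comes from.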
Q. Explain how Cassandra writes changed data into the commitlog?
- Cassandra appends changed data to the commitlog
- The commitlog acts as a crash recovery log for data
- A write operation is never considered successful until the changed data has been appended to the commitlog
Data will not be lost once the commitlog has been flushed to file.
Q. Explain how Cassandra deletes data?
SSTables are immutable, so a row cannot be removed from them. When a row needs to be deleted, Cassandra writes a special marker called a tombstone in place of the column value. When the data is read, anything marked with a tombstone is treated as deleted.
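The following Python sketch illustrates how a tombstone hides older data at read time. It is a simplified model under stated assumptions: each SSTable is represented as a dict of key -> (timestamp, value), and the TOMBSTONE sentinel and read function are names invented for the example.

```python
TOMBSTONE = object()  # sentinel standing in for a tombstone marker


def read(key, sstables):
    """Return the newest value for key, or None if it is missing or deleted.

    Each SSTable is modelled as a dict of key -> (timestamp, value).
    """
    newest = None
    for table in sstables:
        if key in table and (newest is None or table[key][0] > newest[0]):
            newest = table[key]
    if newest is None or newest[1] is TOMBSTONE:
        return None  # a tombstone newer than the data hides that data
    return newest[1]


older = {"k1": (100, "alice"), "k2": (100, "bob")}
newer = {"k1": (200, TOMBSTONE)}  # the delete of k1 is written as a tombstone

print(read("k1", [older, newer]))  # deleted, although "alice" is still on disk
print(read("k2", [older, newer]))  # untouched row is still readable
```

Note that the old value for k1 still exists in the older SSTable; it is only hidden until compaction eventually discards it.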
Q. What is a commit log?
It is a crash-recovery mechanism. All data is written first to the commit log (file) for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
Q. What is SSTable? Is it similar to RDBMS table?
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.
An RDBMS table, by contrast, is a collection of ordered columns fetched by row.
Q. What is Gossip protocol?
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster.
Q. What is Bloom Filter?
These are quick, probabilistic data structures for testing whether an element is a member of a set. They can report false positives but never false negatives. Bloom filters are checked on every read query, before any disk access.
Q. What is Order Preserving partitioner?
This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort order. Configuring your column family to use order-preserving partitioning allows you to perform range slices, meaning that Cassandra knows which nodes have which keys. This partitioner is somewhat the opposite of the Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly distributing keys.
The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra.dht.OrderPreservingPartitioner class. There is a special kind of OPP called the collating order-preserving partitioner (COPP). This acts like a regular OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering. For this reason, it is useful for locale-aware applications. The COPP is implemented by the org.apache.cassandra.dht.CollatingOrderPreservingPartitioner class.
Q. What are keyspaces and column families in Cassandra?
In Cassandra, the logical division that groups similar data is called a column family. The basic Cassandra data structures are the column, which is a name/value pair (plus a client-supplied timestamp of when it was last updated), and the column family, which is a container for rows that have similar, but not identical, column sets. Each row has a unique identifier called a row key. A keyspace is the outermost container for data in Cassandra, corresponding closely to a database in the relational model.
Q. What is the use of HELP command?
It is used to display a synopsis and a brief description of all cqlsh commands.
Q. What is the use of the capture command?
The capture command is used to capture the output of a command and add it to a file.
Q. What is a materialized view? Why is it normal practice in Cassandra to have it?
“Materialized” means storing a full copy of the original data so that everything you need to answer a query is right there, without forcing you to look up the original data. Because Cassandra does not offer the full SQL WHERE clause, you can recreate its effect by writing your data to a second column family that is created specifically to represent that query.
Q. Why is the timestamp so important while inserting data in Cassandra?
This is important because Cassandra uses timestamps to determine the most recent write value.
Q. Why are super columns in Cassandra no longer favoured?
Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to deserialize all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.
In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.
Q. What are advantages and disadvantages of secondary indexes in Cassandra?
Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns to the WHERE clause of a SELECT.
When to use secondary indexes: You want to query on a column that isn’t the primary key and isn’t part of a composite key, and the column has few unique values. For example, a Town column is a good choice for secondary indexing because many people will be from the same town, whereas a date-of-birth column is a poorer choice.
When to avoid secondary indexes: Avoid secondary indexes on columns that contain a high count of unique values and whose queries return few results. Remember that an index makes writes slower, that values can be found only by exact match, and that a lookup by index has to send requests to all servers in the cluster.
Q. How do you query Cassandra?
We query Cassandra using CQL (Cassandra Query Language). We use cqlsh to interact with the database.
Q. What is cqlsh?
It’s a Python-based command-line client for Cassandra.
Q. Does Cassandra work on Windows?
Yes, Cassandra works pretty well on Windows. Right now we have Linux and Windows compatible versions available.
Q. Why is denormalization preferred in Cassandra?
This is because Cassandra does not support joins. Any joining of data has to be done by the user on the client side.
Q. What is the use of the consistency command?
The consistency command is used to display the current consistency level or to set a new consistency level.
Q. Does Cassandra Support Transactions?
Yes and No, depending on what you mean by ‘transactions’. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There are no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by ‘transactions’ you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
Q. What is Compaction in Cassandra?
The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.
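As a rough illustration of what compaction does, the Python sketch below merges several SSTables into one, keeps only the newest version of each key and drops tombstoned entries. It is a simplification: SSTables are modelled as dicts of key -> (timestamp, value), the names TOMBSTONE and compact are invented for the example, and real Cassandra only evicts a tombstone after a configured grace period has passed.

```python
TOMBSTONE = object()  # sentinel standing in for a tombstone marker


def compact(sstables):
    """Merge SSTables into one, keeping the newest version of each key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Evict tombstones and emit the surviving rows in key order.
    return {
        key: version
        for key, version in sorted(merged.items())
        if version[1] is not TOMBSTONE
    }


old = {"b": (1, "y"), "a": (1, "x")}
new = {"a": (2, "z"), "b": (2, TOMBSTONE), "c": (2, "w")}

compacted = compact([old, new])
print(list(compacted))  # keys that survive compaction, in sorted order
```

Here key a keeps only its newest value, key b disappears because its newest version is a tombstone, and key c is carried over unchanged.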
Q. What is Anti-Entropy?
Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.
Q. What do you understand by Consistency in Cassandra?
Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
Q. Explain Zero Consistency?
At this level, write operations are handled in the background, asynchronously. It is the fastest way to write data, and the one that offers the least confidence that operations will succeed.
Q. Explain Any Consistency?
It assures that our write operation was successful on at least one node, even if the acknowledgment is only for a hint. It is a relatively weak level of consistency.
Q. Explain ONE consistency?
It is used to ensure that the write operation was written to at least one node, including its commit log and memtable.
Q. Explain QUORUM consistency?
A quorum is the number of nodes that represents consensus on an operation. It is determined by (replication factor / 2) + 1, with the division rounded down to a whole number.
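The quorum formula can be worked through with a few lines of Python. The function name quorum is chosen for this example; the integer division reflects the rounding down in the formula.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    needed = quorum(rf)
    # Replicas that can be down while a quorum is still reachable.
    tolerated = rf - needed
    print(f"RF={rf}: quorum={needed}, tolerated failures={tolerated}")
```

With a replication factor of 3, for instance, two replicas must respond, so one replica can be unavailable without failing QUORUM operations.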
Q. What is the use of SOURCE command?
SOURCE command is used to execute a file that contains CQL statements.
Q. Explain ALL consistency?
Every node as specified in your configuration entry must successfully acknowledge the write operation. If any nodes do not acknowledge the write operation, the write fails. This has the highest level of consistency and the lowest level of performance.
Q. What are consistency levels for read operations?
- ONE: Returns the value from the first node that responds and performs a read repair in the background.
- QUORUM: Queries all nodes and returns the record with the most recent timestamp once a quorum of nodes has responded, where a “quorum” is (n / 2) + 1.
- DCQUORUM: It ensures that only nodes in the same data center are queried. It is applicable when using Rack-Aware placement strategy.
- ALL: Queries all nodes and returns the value with the most recent timestamp. This level waits for all nodes to respond, and if one doesn’t, it fails the read operation.
Q. What do you mean by hinted handoff?
It is a mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs and a node that is intended to receive that write goes down, a note (the “hint”) is given (“handed off”) to a different live node to indicate that it should replay the write operation to the unavailable node when it comes back online. This does two things: it reduces the amount of time that it takes for a node to get all the data it missed once it comes back online, and it improves write performance in lower consistency levels.
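The hint-and-replay idea can be sketched in Python as follows. This is a toy model, not Cassandra's implementation: the Replica class and the write and replay_hints functions are invented for the example, and the hint is simply stored on the first live replica.

```python
class Replica:
    """Toy replica that can be up or down and can hold hints for others."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target replica, key, value) kept for down nodes


def write(key, value, replicas):
    live = [r for r in replicas if r.up]
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        elif live:
            # Hand the hint off to a live node instead of dropping the write.
            live[0].hints.append((replica, key, value))


def replay_hints(holder):
    remaining = []
    for target, key, value in holder.hints:
        if target.up:
            target.data[key] = value  # replay the missed write
        else:
            remaining.append((target, key, value))
    holder.hints = remaining


a, b, c = Replica("a"), Replica("b"), Replica("c")
c.up = False
write("k", "v", [a, b, c])  # c misses the write; a stores a hint for it
c.up = True
replay_hints(a)             # once c is back, the hint is replayed to it
```

After the replay, all three replicas hold the value and no hints are left pending.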
Q. What is Merkle Tree? Where is it used in Cassandra?
A Merkle tree is a binary tree of hashes that summarizes, in short form, the data in a larger dataset. Cassandra uses Merkle trees during anti-entropy repair to compare the data held by replicas and to verify that nodes have received data blocks unaltered.
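A small Python sketch shows how a Merkle root summarizes a dataset so that two replicas can be compared cheaply. The hash function (SHA-256), the rule of duplicating the last hash on odd levels and the sample rows are assumptions made for the example, not details of Cassandra's own trees.

```python
import hashlib


def _hash(data):
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Return the Merkle root hash of a list of byte strings."""
    if not blocks:
        return _hash(b"")
    level = [_hash(block) for block in blocks]  # leaves: one hash per block
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        # Each parent is the hash of its two children concatenated.
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


replica_a = [b"row1=x", b"row2=y", b"row3=z"]
replica_b = [b"row1=x", b"row2=y", b"row3=z"]
replica_c = [b"row1=x", b"row2=STALE", b"row3=z"]

print(merkle_root(replica_a) == merkle_root(replica_b))  # identical data
print(merkle_root(replica_a) == merkle_root(replica_c))  # one row differs
```

Comparing two short root hashes is enough to tell whether the replicas agree; only when the roots differ do the subtrees need to be examined to locate the differing rows.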
Q. What do you mean by multiget?
It means a query by column name for a set of keys.
Q. What is Multiget Slice?
It means query to get a subset of columns for a set of keys.
Q. What is a SEED node in Cassandra?
A seed is a node that already exists in a Cassandra cluster and is used by newly added nodes to get up and running. The newly added node can start gossiping with the seed node to get state information and learn the topology of the node ring. There may be many seeds in a cluster.
Q. What is Slice and Range slice in Cassandra?
This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use get_range_slice() to return a subset of columns for a range of keys.
Q. What is Tombstone in Cassandra world?
Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a “tombstone,” an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be propagated to other replicas.
Q. What is Thrift?
Thrift is the name of the RPC client used to communicate with the Cassandra server.
Q. What is Batch Mutates?
Like a batch update in the relational world, the batch_mutate operation allows grouping calls on many keys into a single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of mutations, there will be no rollback, so any updates that have already occurred up to this point will remain intact.
Q. What is Hector?
Hector is an open source project written in Java and released under the MIT license. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.
Q. What is Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
Q. What is Random Partitioner?
This is a kind of Partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of causing inefficient range queries. This was the default partitioner in early Cassandra releases; since version 1.2 the default is Murmur3Partitioner.
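The idea of hashing a key to a position on the node ring can be sketched in Python. This is a toy version only: the function names md5_token and owner, the four-node ring and the 128-bit token space are assumptions for the example and do not match Cassandra's exact token range.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128  # size of the toy token space (an MD5 digest is 128 bits)


def md5_token(key):
    """Hash a key to an integer token using MD5."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")


def owner(key, ring):
    """Return the node owning key.

    ring is a list of (token, node) sorted by token. A key belongs to the
    first node whose token is >= the key's token, wrapping around at the end.
    """
    tokens = [token for token, _ in ring]
    index = bisect_left(tokens, md5_token(key)) % len(ring)
    return ring[index][1]


# Four nodes with evenly spaced tokens.
ring = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]

for key in ("user:1", "user:2", "user:3"):
    print(key, "->", owner(key, ring))
```

Because adjacent keys hash to unrelated tokens, rows that are neighbours in key order usually land on different nodes, which is why load spreads evenly but range queries become inefficient.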
Q. What is Read Repair?
This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the nodes with stale data to get them up to date with the newer data returned from the original read operation. It does this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.
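A simplified Python sketch of read repair is shown below. It models each replica as a dict of key -> (timestamp, value) and compares timestamps only; the function name read_with_repair and this representation are assumptions for the example, and checksums are left out.

```python
def read_with_repair(key, replicas):
    """Read key from all replicas, return the newest value, repair stale ones.

    Each replica is modelled as a dict of key -> (timestamp, value).
    """
    responses = [replica[key] for replica in replicas if key in replica]
    if not responses:
        return None
    newest = max(responses, key=lambda version: version[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # write the newest version to stale replicas
    return newest[1]


r1 = {"k": (200, "new")}
r2 = {"k": (100, "old")}   # stale replica
r3 = {}                    # replica that missed the write entirely

value = read_with_repair("k", [r1, r2, r3])
print(value)
```

After the read, all three replicas hold the version with timestamp 200, so later reads see consistent data whichever replica answers.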
Q. What is Snitch in Cassandra?
A snitch is Cassandra’s way of mapping a node to a physical location in the network.
It helps determine the location of a node relative to another node in order to assist with discovery and ensure efficient request routing.
- Yes. First, data durability is fully supported in Cassandra so that any data written to a database cluster is first written to a commit log in the same fashion as nearly every popular RDBMS does.
- Second, Cassandra offers tunable data consistency so that a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modifications be made to all nodes, with any unsuccessful attempt on a node resulting in a failed data operation. Cassandra provides consistency in the CAP sense in that all readers will see the same values.
- Other forms of tunable consistency involve having a quorum of nodes written to or just one node for the loosest form of consistency. Cassandra is very flexible and allows data consistency to be chosen on a per operation basis if needed so that very strong consistency can be used when desired, or very loose consistency can be utilized when the use case permits.