Introduction to Apache Cassandra


Apache Cassandra was originally developed at Facebook and was open-sourced in July 2008. It is widely regarded as a strong choice when users need scalability and high availability without sacrificing performance. Cassandra is a highly scalable, high-performance distributed database designed to handle very large amounts of data across many commodity servers with no single point of failure. Compared with other popular distributed databases such as Riak, HBase and Voldemort, Cassandra offers a robust and expressive interface for modeling and querying data. Cassandra is a NoSQL database engine and, unlike traditional relational databases, it is well suited to storing and accessing largely unstructured data.

Some of the distinguishing characteristics of Apache Cassandra are:

  • Scalable and fault-tolerant database with tunable consistency.
  • Column-oriented, with a distributed design based on Amazon’s Dynamo and a data model based on Google’s Bigtable.
  • Implements a Dynamo-style replication model with no single point of failure and adds a more powerful “column family” data model.
  • Provides high write and read throughput. A Cassandra cluster has no special nodes, i.e. no masters, slaves or elected leaders.

Features

The following are the top features of Apache Cassandra:

  1. Elastic Scalability: Cassandra makes it easy to scale a cluster up or down. Nodes can be added or removed without downtime and without restarting the cluster, and throughput grows as nodes are added.
  2. High Availability and Fault Tolerance: Data is replicated across nodes, so if one node fails the data remains available from other nodes, depending on the configured replication. Cassandra also provides advanced back-up and recovery options.
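  3. Transaction Support: Cassandra does not offer full ACID (Atomicity, Consistency, Isolation, Durability) transactions with rollback or locking in the way a relational database does. Instead, writes are atomic and isolated at the partition level and durable through the commit log, consistency is tunable, and lightweight transactions cover compare-and-set operations.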
  4. Column-Oriented: The data model is column-oriented, and columns are stored and looked up by column name. Each row can therefore contain a number of columns.
  5. Tunable Consistency: Users can choose the consistency level for each read and write operation. Eventual consistency often conjures up fear and doubt in the minds of application developers, but in practice replicas usually reach a consistent state within a very short time.
  6. Gossip Protocol: Cassandra uses a gossip protocol to discover the state of all nodes in a cluster. Each node exchanges state information about itself and about the other nodes it knows of with a maximum of 3 other nodes. To reduce network load, nodes do not exchange information with every other node in the cluster; they exchange it with a few nodes, and over time the state of every node propagates throughout the cluster. The gossip protocol also facilitates failure detection.
  7. Linear Scaling and Design-Time Schema: Due to its multi-master architecture, Cassandra scales linearly: doubling the number of nodes in a cluster allows it to handle twice the writes. Cassandra also requires the schema and data types to be defined at design time, so users define the schema first.
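
The gossip exchange described in the feature list can be pictured with a small simulation. The sketch below is only an illustration, not Cassandra's implementation: the function names are hypothetical, each node's "state" is reduced to the set of nodes it has heard about, and real gossip also carries versioned heartbeat data. It assumes each node contacts up to 3 random peers per round and that both sides merge what they know.

```python
import random


def gossip_round(views, fanout=3, rng=random):
    """Run one round: every node exchanges state with up to `fanout` random peers."""
    nodes = list(views)
    for node in nodes:
        peers = [n for n in nodes if n != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            # After an exchange, both sides know the union of what either knew.
            merged = views[node] | views[peer]
            views[node] = set(merged)
            views[peer] = set(merged)


def rounds_until_converged(num_nodes, fanout=3, seed=0):
    """Count the gossip rounds needed until every node knows about every node."""
    rng = random.Random(seed)
    # Assumption: at start-up each node only knows about itself.
    views = {n: {n} for n in range(num_nodes)}
    everyone = set(range(num_nodes))
    rounds = 0
    while any(view != everyone for view in views.values()):
        gossip_round(views, fanout, rng)
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(rounds_until_converged(100))
```

Even though no node talks to more than a few peers per round, the number of rounds needed stays small compared with the cluster size, which is the point of limiting the fan-out.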

 

Apache Cassandra vs. Traditional Relational Database Management Systems

The following table highlights the differences between Apache Cassandra and traditional RDBMS systems:

| Basis of Difference | Apache Cassandra | Traditional RDBMS |
|---|---|---|
| Data Types | Handles unstructured data, including sound, video and images. As a NoSQL database, it can support huge volumes of data. | Handles structured data (text, characters or numbers) in moderate amounts. |
| Schema | Highly scalable and flexible; often described as schema-less. | Fixed schema, and generally many limitations on data storage. |
| Table Dimension | Row x Column Key x Column Value. A row is the unit of replication, a column is the unit of storage, and relationships are represented using collections. | Row x Column. A row is an individual record, a column represents an attribute of a relation, and there are foreign keys, joins, etc. |
| Storage | Handles large data volumes. The keyspace is the outermost storage unit. Data transfer is very fast and data distribution is automatic. | Handles moderate data volumes. The database is the outermost storage unit. Data transfer is slower and data can be distributed manually. |
| Misc. Features | Decentralized deployments; transactions written in many locations; deployed horizontally. | Centralized deployments; transactions written in one location; deployed vertically. |

 

Cassandra Architecture

The primary objective of Cassandra is to handle large data workloads across multiple nodes without a single point of failure. Cassandra is a peer-to-peer distributed system, and data is distributed among all the nodes in a cluster.

  • Every node in the cluster plays the same role. Each node is independent and, at the same time, interconnected with the other nodes.
  • Every node in the cluster can accept read and write requests, regardless of where the data is located in the cluster.
  • If a node fails, the data can be read from other nodes in the network.

Writing and Reading Data

Data written to a Cassandra node is first recorded in an on-disk commit log and then written to an in-memory structure called a memtable. When the memtable’s size exceeds a configurable threshold, its contents are flushed to an immutable file on disk called an SSTable. Buffering writes in memory in this way makes each flush a fully sequential operation, with many megabytes of disk I/O happening at once rather than as many small writes spread over a long period.

Figure: Cassandra write procedure
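
The write path can be sketched as a toy model. This is a simplified illustration under stated assumptions, not Cassandra's storage engine: the `Node` class, the `memtable_limit` parameter and the row-count threshold are all hypothetical, the "commit log" is just an in-memory list, and the flush is triggered when the memtable reaches the limit.

```python
class Node:
    """Toy model of one node's write path (commit log -> memtable -> SSTable)."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold, in rows
        self.commit_log = []  # stands in for the append-only on-disk commit log
        self.memtable = {}    # in-memory buffer of recent writes
        self.sstables = []    # immutable, sorted "files", oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write in the commit log
        self.memtable[key] = value            # 2. then apply it to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. write the whole memtable out at once as a sorted, immutable SSTable
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs to be replayed

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # the newest SSTable wins
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

For example, with `memtable_limit=2`, the second write triggers a flush, and a later write to the same key is served from the memtable ahead of the older SSTable value.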


Reading data from Cassandra can involve several in-memory caches and other mechanisms designed to produce fast read response times. For a read request, Cassandra consults an in-memory data structure called a Bloom filter, which estimates whether an SSTable is likely to contain the requested data. The Bloom filter can tell very quickly whether the file probably has the needed data or certainly does not have it, so SSTables that cannot contain the data are skipped.

Figure: Cassandra read procedure
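
The "probably present, or certainly absent" behaviour of a Bloom filter can be shown in a few lines. The sketch below is a generic Bloom filter, not Cassandra's own; the class name, the bit-array size, the number of hash functions and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive `num_hashes` bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "certainly not present"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(key))
```

In this picture, one such filter would be kept per SSTable, and a read would only open the SSTables whose filter returns `True` for the requested key.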

 

Data Distribution and Replication

Data Distribution

Cassandra automatically distributes and maintains data across a cluster, freeing developers and architects to direct their energies into value-creating application features.

Cassandra has an internal component called a partitioner, which determines how data is distributed across the nodes that make up a database cluster.

Cassandra also automatically maintains the balance of data across a cluster even when existing nodes are removed or new nodes are added to a system.
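
One common way to picture what a partitioner does is a token ring: each node owns several tokens, and a row goes to the node owning the first token at or after the hash of its partition key. The sketch below is an illustration under assumptions, not Cassandra's code: `TokenRing` and `tokens_per_node` are hypothetical names, and SHA-256 stands in for the Murmur3 hash used by Cassandra's default partitioner.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring mapping partition keys to nodes."""

    def __init__(self, nodes, tokens_per_node=8):
        # Each node is placed on the ring at several pseudo-random tokens.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # Stand-in hash: the first 8 bytes of SHA-256 as an integer.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def owner(self, partition_key):
        # First token at or after the key's token, wrapping around the ring.
        idx = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[idx % len(self._ring)][1]
```

With this layout, adding a node only moves the keys that fall into the new node's token ranges; all other keys keep their owner, which is why the balance can be maintained without reshuffling the whole data set.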

Data Replication

Cassandra features a replication mechanism that is very easy to configure and administer. A Cassandra cluster can have one or more keyspaces. Replication is configured at the keyspace level, allowing different keyspaces to have different replication models. Cassandra is able to replicate data to multiple nodes in a cluster, which helps ensure reliability, continuous availability, and fast I/O operations. Cassandra automatically maintains that replication even when nodes are removed, added, or fail.
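
Replication interacts with the tunable consistency mentioned under Features: with a replication factor RF, a read is guaranteed to overlap with the latest acknowledged write when the number of replicas written (W) plus the number of replicas read (R) is greater than RF. The helper below is a small sketch of that arithmetic; the function names are hypothetical, and only the ONE, TWO, THREE, QUORUM and ALL consistency levels are modelled.

```python
FIXED_LEVELS = {"ONE": 1, "TWO": 2, "THREE": 3}


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a majority of the replicas
    if level == "ALL":
        return replication_factor
    return FIXED_LEVELS[level]


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor
```

With a replication factor of 3, writing and reading at QUORUM gives 2 + 2 > 3, so reads see the latest write, while ONE and ONE gives 1 + 1 = 2, which is only eventually consistent.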

Multi-Data Center and Cloud Support

Cassandra’s replication supports multiple data centers and cloud availability zones. Users can set up replication so that data is replicated across geographically diverse data centers. Applications can then read from and write to any data center they choose, and the data is automatically synchronized across all locations.

 

Cassandra Query Language (CQL)

The Cassandra Query Language (CQL) is the primary language for communicating with the Cassandra database. CQL is purposefully similar to Structured Query Language (SQL) used in relational databases like MySQL and Postgres. 

The most basic way to interact with Cassandra is through the CQL shell, cqlsh, a command-line shell for running CQL statements. With cqlsh the user can perform many operations, such as defining a schema, inserting and altering data, and executing queries. CQL adds an abstraction layer that hides the implementation details of the underlying storage structure and provides native syntax for collections and other common encodings.

Common ways to access CQL are:

  • Start cqlsh, the Python-based command-line client, on the command line of a Cassandra node.
  • For developing applications, use one of the C#, Java, or Python open-source drivers.

CQL Schema:

Creating Table:

  CREATE (TABLE | COLUMNFAMILY) <tablename> (<column-definition>, <column-definition>)
  (WITH <option> AND <option>)

 

Inserting Data into Table

 INSERT INTO KeyspaceName.TableName (ColumnName1, ColumnName2, ColumnName3, ...) VALUES (Column1Value, Column2Value, Column3Value, ...)

 

Updating Data in Table

   UPDATE KeyspaceName.TableName
          SET ColumnName1 = NewColumn1Value,
              ColumnName2 = NewColumn2Value,
              ColumnName3 = NewColumn3Value
          WHERE ColumnName = ColumnValue

 

Deleting Data from Table

   DELETE FROM KeyspaceName.TableName WHERE ColumnName1 = ColumnValue

 

Selecting Data from Table

    SELECT ColumnNames FROM KeyspaceName.TableName
            WHERE ColumnName1 = Column1Value AND ColumnName2 = Column2Value

CQL imposes the following restrictions:

  • No arbitrary WHERE clauses – Apache Cassandra does not allow arbitrary predicates in a WHERE clause. WHERE clauses must reference columns that are part of the primary key.
  • No JOINs – You cannot join data from two Apache Cassandra tables.
  • No arbitrary GROUP BY – GROUP BY can only be applied to partition key or clustering columns. GROUP BY support in SELECT statements was added in Apache Cassandra 3.10.
  • No arbitrary ORDER BY clauses – ORDER BY can only be applied to a clustering column.
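
The WHERE restriction can be made concrete with a small checker. This is a deliberately simplified sketch, not Cassandra's actual validation logic: the function name and the example table are hypothetical, only equality predicates are considered, and secondary indexes and ALLOW FILTERING are ignored. Under those assumptions, a WHERE clause is accepted when it restricts the whole partition key and, optionally, a leading prefix of the clustering columns.

```python
def where_is_allowed(partition_key, clustering_columns, restricted):
    """Check a set of equality-restricted columns against a table's primary key."""
    restricted = set(restricted)
    if not restricted:
        return True  # no WHERE clause at all: a full scan, allowed but expensive
    if not restricted <= set(partition_key) | set(clustering_columns):
        return False  # a column outside the primary key is restricted
    if not set(partition_key) <= restricted:
        return False  # the full partition key must be specified
    # Clustering columns may only be restricted as a leading prefix.
    gap_seen = False
    for column in clustering_columns:
        if column in restricted:
            if gap_seen:
                return False
        else:
            gap_seen = True
    return True


if __name__ == "__main__":
    # Hypothetical table: PRIMARY KEY ((sensor_id), day, hour)
    pk, ck = ["sensor_id"], ["day", "hour"]
    print(where_is_allowed(pk, ck, ["sensor_id", "day"]))   # restricts a prefix
    print(where_is_allowed(pk, ck, ["sensor_id", "hour"]))  # skips "day"
```

The second call is rejected because it restricts `hour` without restricting `day`, the clustering column that precedes it.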

 

Conclusion

Cassandra is a replicated, distributed database. There is no master and no slave. It is always on and it is performant, and these are some of the features and characteristics that make Cassandra a fantastic solution to the big data challenge.

 

Reference:

Apache Cassandra

Cassandra client libraries

 

 

Dr. Anand Nayyar

About author
Dr. Anand Nayyar is an Academician, Researcher, Author, Writer, Inventor, Innovator, Scientist, Consultant and Orator. He is currently working as Professor, Researcher and Scientist in Graduate School at Duy Tan University, Vietnam. He can be reached at anandnayyar@duytan.edu.vn. YouTube: Gyaan with Anand Nayyar