Cassandra: A Write Optimised distributed database (What it Offers and What it Doesn't)

Cassandra is a fully distributed, masterless database that offers better scalability and fault tolerance than traditional single-master databases such as Oracle and MySQL. It also differs from other distributed databases such as Riak, HBase, and Voldemort in that it offers a robust and expressive interface for modeling and querying data.
In this post we will look in detail at what Cassandra offers that sets it apart from other distributed databases, and at which features of relational databases it does not provide.

Horizontal scalability

Horizontal scalability is the ability to expand the storage and processing capacity of a database by adding more servers to the database cluster.
A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows that server, it must be sharded among multiple independent database instances that know nothing of each other. It is then the application's responsibility to know which instance a given piece of data belongs to.
Cassandra is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know which machine a piece of data belongs to. Instead, data can be read from or written to any instance (node) in the cluster, and that node will forward the request to the instance where the data actually belongs.
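To make the forwarding idea concrete, here is a minimal sketch in Python of how a node could work out which instance owns a piece of data. It is a toy model, not Cassandra's implementation: the `Ring` class, the `handle_request` function, the node names, and the use of MD5 as the hash are all assumptions made for illustration.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: each node owns the hash range that ends at its token."""

    def __init__(self, nodes):
        # One token per node, derived from the node name (an assumption of this sketch).
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is used here only as a stable, well-distributed hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def owner(self, partition_key):
        # The first node token at or after the key's token owns the key;
        # the modulo wraps around to the start of the ring.
        token = self._hash(partition_key)
        idx = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._ring[idx][1]


def handle_request(coordinator, ring, partition_key):
    """Any node may receive the request; it forwards it if it is not the owner."""
    owner = ring.owner(partition_key)
    if owner == coordinator:
        return f"{coordinator} serves {partition_key!r} locally"
    return f"{coordinator} forwards {partition_key!r} to {owner}"


ring = Ring(["node-a", "node-b", "node-c"])
for coordinator in ("node-a", "node-b", "node-c"):
    print(handle_request(coordinator, ring, "user:42"))
```

Whichever node the client contacts, the request ends up at the same owner, which is why the application never has to track where data lives.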

High availability

A single-instance database running on a single server is highly vulnerable to interruption. If the server is affected by a hardware failure or a network outage, the application's ability to read and write data is completely lost until the server is restored.
Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, and none of these nodes holds an authoritative master copy. If a machine becomes unavailable, Cassandra continues writing data to the other nodes that share data with that machine, queues the missed operations, and applies them to the failed node when it rejoins the cluster (a mechanism known as hinted handoff). The replication factor decides how many copies of each piece of data are stored; it is configured per keyspace, and 3 is a commonly used value.
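The behaviour described above can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not Cassandra's code: the `Cluster` class, its methods, and the byte-sum placement rule are all hypothetical.

```python
from collections import defaultdict


class Cluster:
    """Toy model of replication with queued writes for unavailable nodes."""

    def __init__(self, nodes, replication_factor):
        self.nodes = list(nodes)
        self.rf = replication_factor
        self.up = set(self.nodes)
        self.data = {node: {} for node in self.nodes}
        self.hints = defaultdict(list)  # down node -> queued (key, value) writes

    def replicas(self, key):
        # Simplified placement (an assumption): pick a start position from the
        # key's bytes, then take the next `rf` nodes around the ring.
        start = sum(key.encode("utf-8")) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key, value):
        """Write to every live replica; queue the write for replicas that are down."""
        acks = 0
        for node in self.replicas(key):
            if node in self.up:
                self.data[node][key] = value
                acks += 1
            else:
                self.hints[node].append((key, value))
        return acks

    def mark_down(self, node):
        self.up.discard(node)

    def mark_up(self, node):
        # Replay the queued writes when the node rejoins the cluster.
        self.up.add(node)
        for key, value in self.hints.pop(node, []):
            self.data[node][key] = value


cluster = Cluster(["n1", "n2", "n3", "n4"], replication_factor=3)
down = cluster.replicas("order:7")[0]
cluster.mark_down(down)
acks = cluster.write("order:7", "shipped")   # still succeeds on the live replicas
cluster.mark_up(down)                        # the missed write is replayed
print(acks, cluster.data[down])
```

The write is accepted even though one replica is down, and the recovered node catches up without the application doing anything.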

Write Optimisation 

In traditional relational and document databases, writes are more expensive than reads because these databases are optimised for read performance. Inserting or updating data in place is an expensive operation in terms of disk I/O.
Cassandra, on the other hand, is highly optimised for write throughput and never modifies data in place on disk; it only appends to existing files or creates new ones. Because such writes are sequential, their disk I/O cost is low, which is what gives Cassandra its high write throughput.
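A rough way to picture this append-only approach is the sketch below. It is a simplified in-memory model, loosely following the commit log, memtable, and SSTable terms from Cassandra's write path; the `AppendOnlyStore` class and its `memtable_limit` parameter are made up for illustration.

```python
class AppendOnlyStore:
    """Toy append-only store: updates add new entries, old ones are never edited."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # every write is appended here first
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Create a NEW immutable segment instead of rewriting an existing one.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


store = AppendOnlyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # memtable is full, so it is flushed as the first segment
store.write("a", 3)   # an "update" is just another appended entry
store.write("c", 4)   # second flush

print(store.read("a"))
print(store.sstables[0])
```

Note that the first segment still holds the old value of `a`; the read path resolves which value is current. This is the trade: writes stay cheap, while reads may need to consult more than one segment.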

Structured Record

Relational databases are well known for storing structured data sets along with metadata that describes their structure.
In Cassandra, all records are structured in the same way as in a relational database, using tables, rows, and columns. Other distributed databases such as Riak and Voldemort are purely key-value stores; they have no knowledge of the internal structure of the record stored at a particular key, whereas Cassandra does.

Secondary Index

In a relational database, a secondary index (usually called simply an index) is a structure that allows efficient lookup of records by some attribute other than their primary key.
Cassandra has limited support for secondary indexes; they are not as versatile as indexes in a typical relational database.

Result ordering (Clustering columns)

Even in a relational database, sorting data on the fly is a fundamentally expensive operation, so databases must keep information about record ordering persisted on disk in order to return ordered results efficiently. In a relational database this is the job of a secondary index.
In Cassandra, secondary indexes can't be used for result ordering. Cassandra fulfils this requirement with clustering columns instead: tables can be structured so that, within each partition, rows are always kept sorted by a given column or columns (the clustering columns).
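The effect of clustering columns can be imitated in a few lines of Python. The sketch below is only an analogy: the `ClusteredTable` class and the sensor data are hypothetical, and a real table would be declared in CQL rather than built by hand.

```python
import bisect
from collections import defaultdict


class ClusteredTable:
    """Toy table: rows in each partition are kept sorted by the clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition key -> sorted (clustering key, value) rows

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # A 1-tuple sorts just before any row with the same clustering key,
        # so this finds the position where that key is or should be.
        idx = bisect.bisect_left(rows, (clustering_key,))
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, value)       # same primary key: overwrite
        else:
            rows.insert(idx, (clustering_key, value))  # keep the partition sorted

    def select(self, partition_key, start=None, end=None):
        """Return rows with start <= clustering key < end, already in order."""
        rows = self._partitions.get(partition_key, [])
        lo = 0 if start is None else bisect.bisect_left(rows, (start,))
        hi = len(rows) if end is None else bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


table = ClusteredTable()
table.insert("sensor-1", 30, "21.5C")
table.insert("sensor-1", 10, "20.1C")
table.insert("sensor-1", 20, "20.9C")
table.insert("sensor-1", 20, "21.0C")  # overwrites the earlier reading at 20

print(table.select("sensor-1"))
print(table.select("sensor-1", start=10, end=30))
```

Because the ordering is maintained at write time, an ordered or range read is just a slice of data that is already sorted, with no sort step at query time.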

Immediate consistency

In a relational database, it is guaranteed that once data is committed, all running processes see the updated values. This guarantee is called immediate consistency.
Distributed systems like Cassandra typically do not provide an immediate consistency guarantee, because immediate consistency is a direct tradeoff with high availability. Cassandra puts high availability first and therefore prefers eventual consistency, which means that when data is updated, the system will reflect that update at some point in the future. In Cassandra this tradeoff is made explicit through tunable consistency, which is configurable at the query level.
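The arithmetic behind tunable consistency is simple enough to sketch. ONE, QUORUM, and ALL are names of Cassandra consistency levels; the helper functions below are hypothetical and only illustrate the commonly cited rule that a read overlaps the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor):
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """How many replicas must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read set and the write set must share at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(read_level, write_level, read_sees_latest_write(read_level, write_level, 3))
```

Lower levels favour availability and latency, since fewer nodes have to respond; higher levels favour consistency. Choosing the levels per query is how the application decides where it sits on that tradeoff.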


Join Operation 

Relational databases provide join operations, which are handy when the data set is normalised. Cassandra does not support joins. Instead, applications using Cassandra typically denormalise data and make appropriate use of clustering in order to perform the sorts of data access that would otherwise need a join.
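As a small illustration of denormalisation, the sketch below copies the author's name into every post row at write time, so that reading a user's posts needs no join. The table names, field names, and sample data are invented for this example.

```python
from collections import defaultdict

# Normalised source of truth for users (hypothetical sample data).
users = {"u1": {"name": "Asha"}, "u2": {"name": "Ben"}}

# Query table: one partition per author; each row carries a copy of the author's name.
posts_by_user = defaultdict(list)


def add_post(user_id, post_id, title):
    """Write path: duplicate the author name into the post row."""
    posts_by_user[user_id].append(
        {"post_id": post_id, "title": title, "author_name": users[user_id]["name"]}
    )


def posts_with_author(user_id):
    """Read path: a single partition lookup, no join against the users table."""
    return posts_by_user.get(user_id, [])


add_post("u1", "p1", "Hello Cassandra")
add_post("u1", "p2", "Clustering columns")
add_post("u2", "p3", "Tunable consistency")

for row in posts_with_author("u1"):
    print(row["author_name"], "-", row["title"])
```

The cost is that duplicated values must be kept in sync by the application when they change, which fits a database where writes are cheap.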

Below is a high-level comparison of Cassandra with relational databases and other distributed databases.

Note:- Applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.

