Monday, October 24, 2011

NoSQL Databases the "new" Popular Kid on the Block

In this post, I'm going to go over the NoSQL ("Not Only SQL") data stores. NoSQL data stores can mean several things (depending on who you ask) but certainly it refers to a broad class of databases that differ from the traditional Relational Database Management Systems (RDBMS). NoSQL data stores are about new approaches to store and access data.


We live in an "information age" and the amount of digital information (data) has grown dramatically in the last decade. This trend will only continue to grow as we move ourselves more and more into the digital world.  Some organizations have been taking notice of this, and are now rethinking in how to better organized their data. Traditionally, RDBMS systems were (and still are) an integral part of any application's back-end system. However, there is a "new" kid on the block offering us alternative ways to store and access our data. The so called NoSQL databases take a different approach at handling data that might not be well suited for a traditional RDBMS. Here are a few examples of the many different NoSQL solutions available. 
Key-Value Stores - Redis, SimpleDB
Column Stores - HBase, Cassandra
Document Store - MongoDB, CouchDB
Graph Store- Neo4J
In general they are all designed to handle huge amounts of data, have no table schema nor joins and are particularly suited to scale horizontally. And is this last one - horizontal scaling - which can make a huge different (performance overhead) when dealing with huge amounts of data.  


A gentle note in scalability before we move on.  This is important, even more with today applications that  must handle (read/write) enormous amount of data.
  • Vertical Scaling or scaling up essentially means adding more computing power to a single machine ( CPU's, RAM, disk space, etc) where the application or database resides.
  • Horizontal Scaling or scaling out essentially means that when on application or database needs more computer power then we add additional servers and distribute the load across them.
Traditional RDBMS tend to have limitations to scale horizontally.  The reasons for this limitation can be better explained by going over a fundamental theorem. 


CAP Theorem
The CAP Theorem was formulated by Eric A. Brewer a UC Berkeley professor. It practically states that there are 3 desirable properties for a distributed system that share date must have. Let see these three in summary.
Consistency:  all clients see the same version of the data, data is correct at all the time.
Availability:  all clients can always find at least one copy of requested data, even if some servers in the cluster are down. It is always on, there is no downtime.
Partition-tolerance: system should continue to function in case network is partitioned or disrupted. If nodes are added or nodes fails the system still works.
The interesting aspect about the CAP Theorem is that it states that we can only have 2 of the 3 properties at the same time. Let's keep this in mind as we continue...


ACID
Most of the RDBMS databases have loosen up the Portion-tolerance property in favor of Consistency and Availability. Most if not all them are ACID compliance, which practically means that database transactions are processed in a way that guarantees immediately data consistency.
Atomic - transactions are all-or-nothing. Either complete or not complete but never leave data in some in between state.  
Consistent - only valid data is persisted into the database. Any modifications change the data from one consistent state to another consistent state.  
Isolated - concurrent transactions must not interfere with one another. In other words other transactions cannot access or see data that has been modified during a transaction that has not yet completed.  
Durable - actions results in a permanent change of state of the system, even in case of system failure.
ACID operations are definitively of great value and are a crucial aspect of RDBMS, they are fundamentally important in many scenarios.  For example, in a financial application is extremely important that my transaction for withdrawing money from one account is ACID-compliant even if this imposes some performance overhead. However, there are other applications that don't really need to pay for this performance overhead.  Most if not all NoSQL databases sacrifice ACID compliance to eliminate this overhead.


BASE
Most of the NoSQL data stores have loosen up on the requirement for Consistency in order to achieve Availability and Partition-tolerance. The result is know as BASE
Basically Available, - system seems to work all the time
Soft-state, - it doesn't have to be consistent all the time
Eventual consistency - becomes consistent at some time later
So what does this means, really? It essentially means that consistency is ensured but not necessarily immediately.  Let's use a hypothetical social networking applications as an example. Suppose this system is distributed across multiple servers, in multiple sites and in multiple countries. If I make a change to my relationship status from single to married, then this change might not be immediately reflected (someone in China might still see my status as single) but this change will eventually at some time be reflected (consistent). So BASE system simply guarantees consistence after some time, where as an ACID systems ensures consistency after every operation. Back to my hypothetical example, is it really that important for my relational status to be ACID... I'll say no! it makes no difference in this case, but it sure does when I am withdrawing money from my bank's ATM.


To conclude this long post :)  NoSQL data stores are not a replacement to traditional RDBMS. They are not a "silver bullet".  It is however, great to have more options available to choose from.

No comments:

Post a Comment