As a business grows, so does its volume of data, and a traditional RDBMS becomes a less efficient and less cost-effective choice. For this reason, RDBMS is often considered a declining technology for very large workloads.
An RDBMS does not scale well enough to meet the needs of high-volume data arriving at high velocity.
What does it mean to say that an RDBMS is not scalable? Horizontal scaling (distributing data across multiple nodes/servers) is hard to achieve in a traditional relational database. Vertical scaling (adding processing power and storage to a single machine) is feasible, but it has an upper limit.
To deal with high volumes of data, a new family of database technologies evolved, known as "NoSQL" databases. NoSQL is usually expanded as "not only SQL". Unlike relational databases, which deal mainly with structured data, NoSQL databases also handle unstructured and semi-structured data and are built around the concept of distributed databases. Data is distributed across multiple processing nodes/servers, and consistency is traded (BASE instead of ACID) for speed and agility. Because of this distributed architecture, NoSQL is horizontally scalable: as data continues to grow, you add more commodity hardware to keep the system up and running.
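As a rough illustration of what distributing data across nodes means, here is a minimal sketch that assigns each record key to a node by hashing it. The node names, the key format and the simple modulo placement are assumptions made for this example; they do not describe how any particular NoSQL product places data.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node that owns a key by hashing the key (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement: dict[str, list[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


# Hypothetical keys and node names, used only for illustration.
keys = [f"user-{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-1", "node-2", "node-3"])
four_nodes = distribute(keys, ["node-1", "node-2", "node-3", "node-4"])

# Adding a fourth node spreads the same keys over more machines.
for name, owned in four_nodes.items():
    print(name, len(owned))
```

With modulo placement, adding a node moves most keys to a different owner, which is why real systems tend to use schemes such as consistent hashing instead. The sketch only shows the basic idea of scaling out by adding machines.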
The performance of relational versus NoSQL databases can be summarised as follows: in a relational database, performance decreases as the volume of data increases, while in a NoSQL database performance remains largely unaffected by the volume of data.
For small volumes of data, a relational database outperforms a NoSQL database.
High volumes of data (in TB or PB) flowing at high velocity gave rise to the term "Big data". Big data is commonly described by four V's: Volume, Velocity, Variety and Veracity.
- Volume - The amount of data to be dealt with is in the TB or PB range.
- Velocity - The frequency at which data is generated, captured and shared.
- Variety - A large share of stored data takes the form of documents, emails, video and voice. This is unstructured data; emerging data types include geo-spatial and location data, log data, machine data, metrics, mobile, RFID, search, streaming data, social, text and so on.
- Veracity - As mentioned earlier, the benefits of a distributed database come at the cost of consistency, so there is a chance that the data retrieved or received is not what was intended. Reliability is therefore always a concern in such systems.
Hadoop is a software ecosystem that allows massively parallel computing using HDFS and MapReduce. Over time, various software systems have been added to the Hadoop ecosystem, such as Pig, Hive and HBase. Note that Hadoop has an abstract notion of a filesystem, of which HDFS is just one implementation; others include WebHDFS, Azure and Local.
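To give a feel for the MapReduce programming model mentioned above, here is a minimal single-process sketch of a word count. It is plain Python, not the Hadoop API: the function names are made up for this example, and a real job would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return word, sum(counts)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map, shuffle and reduce one after another in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```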
Hadoop Distributed File System (HDFS): HDFS is a highly fault-tolerant, high-throughput distributed file system designed to run on commodity hardware (affordable and easy to obtain). Hadoop assumes that hardware failure is the norm rather than the exception, so data is distributed across multiple servers (the Hadoop cluster) and replicated to handle node failures. HDFS stores data in fixed-size blocks, and a commodity machine in the Hadoop cluster where data is stored is called a DataNode.
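The idea of fixed-size blocks plus replication can be sketched in a few lines. The numbers below (a 128 MB block size and 3 replicas) are assumed here because they are the commonly cited HDFS defaults; the text above does not state them. The round-robin placement is a simplification for illustration and is not the placement policy HDFS actually uses.

```python
def split_into_blocks(file_size: int, block_size: int) -> list[int]:
    """Return the size of each block; every block is full except possibly the last."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, data_nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A hypothetical 300 MB file with an assumed 128 MB block size (sizes in MB).
blocks = split_into_blocks(300, 128)
print(blocks)  # [128, 128, 44]

# Hypothetical node names; 3 replicas per block is an assumption.
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block, nodes in placement.items():
    print(block, nodes)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block.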
- HDFS is designed for applications that do batch processing rather than interactive operations.
- HDFS follows a write-once, read-many approach for its files and applications. It assumes that a file, once written to HDFS, will not be modified.
- The focus of HDFS is on high throughput rather than low latency.
Latency: the amount of time needed to complete a task or produce output.
Throughput: the number of tasks completed per unit of time.
- HDFS stores filesystem metadata and application data separately and follows a master-slave architecture. Filesystem metadata is stored on the NameNode and application data is stored on DataNodes. The NameNode acts as the master and the DataNodes are the slaves.
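To make the latency versus throughput distinction above concrete, here is a small worked example. All numbers are hypothetical and chosen only to show that a system can have high throughput and high latency at the same time.

```python
def throughput(tasks_completed: int, elapsed_seconds: float) -> float:
    """Tasks completed per second."""
    return tasks_completed / elapsed_seconds


# Hypothetical interactive system: answers one request at a time, 0.5 s each.
interactive_latency = 0.5
interactive_throughput = throughput(1, interactive_latency)

# Hypothetical batch system: processes 10,000 records as one job taking 100 s.
batch_latency = 100.0
batch_throughput = throughput(10_000, batch_latency)

print(interactive_latency, interactive_throughput)  # 0.5 2.0
print(batch_latency, batch_throughput)              # 100.0 100.0
```

The batch system completes fifty times more work per second, but no result is available until the whole job finishes. That is the trade-off a batch-oriented design accepts.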
NameNode: It manages the file system namespace (the NameNode inserts the file name into the file system tree and allocates a data block for it), determines the mapping of blocks to DataNodes, and maintains metadata about the blocks stored on each DataNode. It executes file system namespace operations such as opening, closing, and renaming files and directories. For each Hadoop cluster there is only one NameNode, which acts as the master.
DataNode: It provides the actual storage of data and is responsible for serving read and write requests from the file system’s clients. DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Refer to this post for more detail about the NameNode and DataNode.
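The division of labour between the two roles can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class and method names are invented, the block size is tiny, and the client writes every replica itself, which is simpler than what a real HDFS client does. The point is only that the NameNode holds metadata while the DataNodes hold the bytes.

```python
class DataNode:
    """Stores the actual bytes of each block it has been given."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[int, bytes] = {}

    def write_block(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data

    def read_block(self, block_id: int) -> bytes:
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: the file namespace and the block-to-DataNode map."""

    def __init__(self, data_nodes: list[DataNode], block_size: int, replication: int) -> None:
        self.data_nodes = data_nodes
        self.block_size = block_size
        self.replication = replication
        self.namespace: dict[str, list[int]] = {}        # path -> block ids
        self.block_locations: dict[int, list[str]] = {}  # block id -> DataNode names
        self._next_block_id = 0

    def create(self, path: str) -> None:
        """Insert a new, empty file into the namespace."""
        self.namespace[path] = []

    def allocate_block(self, path: str) -> tuple[int, list[DataNode]]:
        """Allocate a block for a file and choose the DataNodes that will hold it."""
        block_id = self._next_block_id
        self._next_block_id += 1
        count = len(self.data_nodes)
        targets = [self.data_nodes[(block_id + r) % count] for r in range(self.replication)]
        self.namespace[path].append(block_id)
        self.block_locations[block_id] = [node.name for node in targets]
        return block_id, targets

    def get_block_locations(self, path: str) -> list[tuple[int, list[str]]]:
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


def write_file(name_node: NameNode, path: str, data: bytes) -> None:
    """Ask the NameNode where each block goes, then send the bytes to DataNodes."""
    name_node.create(path)
    for start in range(0, len(data), name_node.block_size):
        block_id, targets = name_node.allocate_block(path)
        for node in targets:
            node.write_block(block_id, data[start:start + name_node.block_size])


def read_file(name_node: NameNode, nodes_by_name: dict[str, DataNode], path: str) -> bytes:
    """Look up block locations on the NameNode, then read each block from a DataNode."""
    parts = []
    for block_id, locations in name_node.get_block_locations(path):
        parts.append(nodes_by_name[locations[0]].read_block(block_id))
    return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
master = NameNode(nodes, block_size=8, replication=2)
write_file(master, "/logs/app.log", b"hello hadoop world")
by_name = {node.name: node for node in nodes}
print(read_file(master, by_name, "/logs/app.log"))
print(master.get_block_locations("/logs/app.log"))
```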
- In a multi-node cluster, the NameNode and DataNodes run on different machines; in a single-node cluster, both run on the same machine. A multi-node cluster has only one NameNode and multiple DataNodes, which is why the NameNode is sometimes called the single point of failure of a Hadoop cluster.
- To reduce the impact of a NameNode failure, Hadoop runs a SecondaryNameNode on a different machine (in a multi-node cluster). It stores images of the filesystem metadata (the metadata of data blocks) at regular checkpoints, and these images can be used to restore the NameNode. It is not a hot standby: it does not take over automatically if the NameNode fails.
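The checkpointing idea can be sketched as merging a log of recent changes into the last saved metadata image. In HDFS these two pieces are usually called the fsimage and the edit log; the tuple format of the edits below and the function names are hypothetical and exist only for this example.

```python
def apply_edit(namespace: dict[str, list[int]], edit: tuple) -> None:
    """Apply one logged namespace change to an in-memory image."""
    operation = edit[0]
    if operation == "create":
        namespace[edit[1]] = []
    elif operation == "add_block":
        namespace[edit[1]].append(edit[2])
    elif operation == "rename":
        namespace[edit[2]] = namespace.pop(edit[1])
    elif operation == "delete":
        del namespace[edit[1]]
    else:
        raise ValueError(f"unknown operation: {operation}")


def checkpoint(fsimage: dict[str, list[int]], edit_log: list[tuple]) -> dict[str, list[int]]:
    """Merge the edit log into a copy of the image and return the new image."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


# Hypothetical last checkpoint and the changes logged since then.
old_image = {"/a.txt": [1]}
edits = [
    ("create", "/b.txt"),
    ("add_block", "/b.txt", 2),
    ("rename", "/a.txt", "/c.txt"),
]
print(checkpoint(old_image, edits))
```

After a checkpoint the edit log can start again from empty, so restoring the metadata means loading the latest image and replaying only the few edits logged since.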
The read and write operations in Hadoop give a broader picture of how the NameNode and DataNodes work together.
Figure: DataNodes store blocks B1 and B2, and the NameNode records the whereabouts of those blocks on the DataNodes.
In the next post we will revisit the NameNode, DataNode and Secondary NameNode and discuss them in detail.