Note: Hadoop has went thorough a substantial architectural change from hadoop 1.x to hadoop 2.x, Hadoop 2.x has changed/modified some concept and added new feature to improve performance of hadoop cluster. Explanation below is more relevant and apt for hadoop 1.x. Read Haddop 2.0 vs Hadoop 1.0.
Before going read and write operation in detail, lets first understand: How data blocks are arranged in hadoop cluster/ what network topology is adopted and what is data block placement policy in hadoop cluster ?
Hadoop by default assumes that network is flat - a single level hierarchy(all nodes on single rack in a singe data centre). In a large hadoop cluster number of nodes are in large numbers and grow unheeded(commodity hardware is added to improve storage and fault tolerance) so it is be highly inefficient to adopt flat(switched) network topology because:
- In flat topology message send /broadcast packet will interrupt CPU on each node and operation execution will be order N(No of nodes in network or broadcast domain)
- In non hierarchical network overload on CPU of node also increases. CPU has to communicate many other routers and switches in order to reach at correct datanode.
|Hierarchical topology in hadoop cluster: data-nodes, racks and switches inter-connectivity|
Replica placement:- In hadoop cluster data blocks are replicated to handle failure condition. There is trade-off between reliability and bandwidth utilization(read/write) - If data blocks are replicatedon single node then no redundancy is achieved, on node failure data is lost. On the other hand, if data blocks are replicated on different data centres then redundancy may be maximized however penalty of read/write bandwidth would be huge.
What is default strategy followed for placement of data blocks in racks in hadoop cluster ?
- First replica of data block is placed on the same node where writer program is running. And second replica is placed on different rack than first and the third replicas is placed on same rack as second but on different node.
- If replication factor is more than 3, rest of replicas are placed on random nodes with a constraint that no more than one replica is placed at anyone node and no more than two replicas are placed in the same rack
In hadoop cluster, data is stored in file(create new or append in existing one) and file is break into multiple blocks.Once data is written in file it cannot be deleted only append new data is possible, HDFS follows single-writer, multiple-reader model. Lets understand how hadoop uses this model and perform write operation.
- When HDFS client request Namenode to create a new file(say input.txt), namenode checks permission for that and create appropriate inode.
- The HDFS client that creates/open file for writing is granted lease for the file and it is locked for that client and no other client has write access on it(Remember single writer,multiple reader model). Client has to renew lease before it expires by sending light weight message.Duration of lease is bounded by soft limit and hard limit.
Soft limit:- time duration before which if client did not renewed lease, other client can take over write access form this client.
Hard limit:- time duration(1 hour) after which write access will be revoked automatically, if lease is not renewed.
- From namenode HDFS client requests for block where file will be stored (Remember in HDFS file is stored in blocks).
- Namenode returns unique block id and list of datanodes for block replicas to be stored.
- Client write first block on datanode (block id which was returned by namenode) and passes list of data-nodes(provided by namenode for block replication) to that datanode where first block was written.Along with actual data client also send checksum to datanode for that block.
Client send data block in form of collection of packets to first datanode and datanode form a pipeline with list of datanodes which minimizes the total network distance from the client to the last datanode. Consider following diagram to understand how block splits into packets and one block is written to multiple datanodes:
Block transmission from client to data-node (Diagram reference: aosabook.org)
- The datanodes writes data block in file system and also stores checksum metadata in separate file,which is used while reading to verify data is valid or corrupted.
- Datanode acknowledges namenode and client about block persisted to file system.
- After cycle for one block completes,again client ask namenode for block id and datanodes and above process (4 to 7) repeats for other .
- After all blocks have be consumed, namenode checks if at least one copy of each block has been written.
- If above check is successful,then it release lease and send success message to client.
|Write operation in Hadoop|
In haoop since data is stored in file so when we need to read data we have to read data blocks of file. When a user application/HDFS client wants to read a file present in HDFS, the HDFS client perform following tasks:
- HDFS client requests namenode for the list of datanodes that stores replicas of blocks(file consist of blocks) for the given file name.
- For each block, namenode returns address of datanodes that has a copy of that block(metadata information).
- Once client receives metadata about blocks present at datanode, client request datanode to transfer blocks.
- HDFS client once receives metadata from namenode, locations of each block are ordered by their distance from the reader and client tries to read from closet replica first,if it fails then it try for next replica and so on. Data is streamed from datanode back to client.Below diagram gives an overview of read operation in HDFS.
|Read operation in HDFS|
1. Namenode does not ask datanode directly to transfer block.Client contacts datanode for blocks and is guided by namenode (best datanode for each block), this design allows HDFS to scale large number of concurrent clients. Following block diagram gives an overview of read operation in hadoop.
2. Read operation may fail ,when datanode is not available/corrupted or checksum of block does not matches when block is read by client.
Watch Hadoop and NoSQL Downfall Parody and I bet only hadoop learner can appreciate it.
Next: MapReduce - An efficient data processing technique