Internals of Read and Write Operations in Hadoop

In the previous post we learned about the HDFS building blocks: namenode, datanode and secondary namenode. In this post, I will discuss read and write operations in HDFS. It is the HDFS client (a library that exports the HDFS filesystem interface) that communicates with the namenode on behalf of the user/user application. In HDFS, data is stored in files, and each file is split into multiple blocks, sized according to the configured block size, which are stored on datanodes.
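For reference, this is roughly how an application obtains that HDFS client through Hadoop's Java FileSystem API. This is only a minimal sketch; the namenode host and port below are assumed example values.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();

        // FileSystem.get() returns the HDFS client that communicates with the
        // namenode on behalf of the application; the URI is an assumed example.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020/"), conf);
        System.out.println("Working directory: " + fs.getWorkingDirectory());
        fs.close();
    }
}
```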
Note: Hadoop went through a substantial architectural change from Hadoop 1.x to Hadoop 2.x; Hadoop 2.x modified some concepts and added new features to improve the performance of a Hadoop cluster. The explanation below is more relevant and apt for Hadoop 1.x. Read Hadoop 2.0 vs Hadoop 1.0.
Before going into read and write operations in detail, let's first understand how data blocks are arranged in a Hadoop cluster: what network topology is adopted, and what the data block placement policy is.
Hadoop by default assumes that the network is flat - a single-level hierarchy (all nodes on a single rack in a single data centre). In a large Hadoop cluster the number of nodes is high and keeps growing (commodity hardware is added to improve storage and fault tolerance), so adopting a flat (switched) network topology would be highly inefficient because:
  1. In a flat topology a broadcast message interrupts the CPU on every node, so the operation is of order N (the number of nodes in the network/broadcast domain).
  2. In a non-hierarchical network the load on each node's CPU also increases, because traffic has to pass through many routers and switches to reach the correct datanode.
A hierarchical/modular topology minimizes the number of hops through routers/switches, shortens the distance between nodes (inter- and intra-rack) and increases bandwidth utilization. In a Hadoop cluster, datanodes are grouped under racks: all datanodes in a rack are connected to the rack's switch, and all racks are connected through one or more top-level switches. A Hadoop cluster with two racks and three datanodes can be depicted as:
Hierarchical topology in a Hadoop cluster: interconnectivity of datanodes, racks and switches
Note: HDFS provides the flexibility to configure a script that decides whether the cluster uses multiple racks or all datanodes are considered to be under a single rack. The namenode resolves the rack location of each datanode and also keeps track of the datanode's whereabouts. When a datanode registers with the namenode, the namenode runs that script to decide which rack the datanode belongs to.
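As a rough illustration of how this is wired up, the rack-resolution script is registered in core-site.xml. This is only a sketch: the script path is hypothetical, and the property is named net.topology.script.file.name in Hadoop 2.x (topology.script.file.name in Hadoop 1.x).

```xml
<!-- core-site.xml (sketch): register a rack-resolution script.
     Hadoop invokes the script with datanode IPs/hostnames as arguments and
     expects it to print a rack path such as /rack1 for each one.
     The path below is a hypothetical example. -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>
```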

Replica placement: In a Hadoop cluster, data blocks are replicated to handle failures. There is a trade-off between reliability and read/write bandwidth utilization: if all replicas of a block are placed on a single node then no redundancy is achieved and the data is lost on node failure; on the other hand, if replicas are placed in different data centres then redundancy is maximized, but the read/write bandwidth penalty is huge.
What is the default strategy for placing data blocks on racks in a Hadoop cluster?
  • The first replica of a data block is placed on the same node where the writer program is running. The second replica is placed on a different rack than the first, and the third replica is placed on the same rack as the second but on a different node.
  • If the replication factor is more than 3, the remaining replicas are placed on random nodes, with the constraint that no more than one replica is placed on any one node and no more than two replicas are placed in the same rack.
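The number of replicas placed this way comes from the dfs.replication property, which can be set cluster-wide in hdfs-site.xml; a minimal sketch:

```xml
<!-- hdfs-site.xml (sketch): default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```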
The default value of dfs.replication is 3, so unless a different value is configured, each block is replicated three times. Now we will move on to the main objective of this post: read and write operations in a Hadoop cluster.

Write operation:

In a Hadoop cluster, data is stored in files (by creating a new file or appending to an existing one), and a file is broken into multiple blocks. Once data is written to a file it cannot be modified in place; only appending new data is possible. HDFS follows a single-writer, multiple-reader model. Let's understand how Hadoop uses this model to perform a write operation.
  1. When the HDFS client requests the namenode to create a new file (say input.txt), the namenode checks permissions and creates the appropriate inode.
  2. The HDFS client that creates/opens the file for writing is granted a lease on the file: the file is locked for that client and no other client has write access to it (remember the single-writer, multiple-reader model). The client has to renew the lease before it expires by sending a lightweight message. The duration of the lease is bounded by a soft limit and a hard limit.
    Soft limit: the time duration after which, if the client has not renewed the lease, another client can take over write access from it.
    Hard limit: the time duration (1 hour) after which write access is revoked automatically if the lease is not renewed.
  3. The HDFS client requests a block from the namenode where the file's data will be stored (remember that in HDFS a file is stored as blocks).
  4. The namenode returns a unique block id and a list of datanodes on which the block replicas are to be stored.
  5. The client writes the first block to a datanode (the block id returned by the namenode) and passes the list of datanodes (provided by the namenode for block replication) to that datanode. Along with the actual data, the client also sends the checksum for that block.
    The client sends the data block as a collection of packets to the first datanode, and the datanodes form a pipeline ordered to minimize the total network distance from the client to the last datanode. Consider the following diagram to understand how a block is split into packets and written to multiple datanodes:
    Block transmission from client to datanodes (diagram reference: aosabook.org)
    First the pipeline is set up, then data streaming is carried out by sending packets; for each packet an acknowledgement is sent back to the client, and finally the pipeline connection is torn down.
  6. Each datanode writes the data block to its local file system and also stores the checksum metadata in a separate file, which is used during reads to verify whether the data is valid or corrupted.
  7. The datanode acknowledges to the namenode and the client that the block has been persisted to the file system.
  8. After the cycle for one block completes, the client again asks the namenode for a block id and datanodes, and the above process (steps 4 to 7) repeats for the remaining blocks.
  9. After all blocks of the file have been written, the namenode checks whether at least one replica of each block has been written.
  10. If the above check is successful, the namenode releases the lease and sends a success message to the client.
The write operation can be summarized in the following diagram:
Write operation in Hadoop
Here the client wants to create input.txt in HDFS: it requests a block from the namenode, transfers the first block to a datanode, and the datanode persists it to its file system. Similarly, the second block is written; finally the lease is released and a success ACK is sent to the client.
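From the application's point of view, all of these steps are hidden behind the FileSystem API: create() obtains the inode and lease from the namenode, and the returned output stream takes care of block allocation, the datanode pipeline and checksums. Below is a minimal, hedged sketch; the file path is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the namenode to add an inode for the file and grants
        // this client the write lease (single-writer, multiple-reader model).
        Path file = new Path("/user/demo/input.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            // Writes are buffered into packets and streamed down the datanode
            // pipeline chosen by the namenode; checksums are computed and sent
            // along with the data automatically.
            out.writeBytes("hello hdfs\n");
        }   // close() completes the last block and releases the lease

        fs.close();
    }
}
```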

Read operation:

In Hadoop, since data is stored in files, reading data means reading the data blocks of a file. When a user application/HDFS client wants to read a file present in HDFS, the HDFS client performs the following tasks:
  1. The HDFS client requests from the namenode the list of datanodes that store replicas of the blocks (a file consists of blocks) of the given file.
  2. For each block, the namenode returns the addresses of the datanodes that have a copy of that block (metadata information).
  3. Once the client has received this metadata, it requests the datanodes to transfer the blocks.
  4. The block locations returned by the namenode are ordered by their distance from the reader, and the client tries to read from the closest replica first; if that fails, it tries the next replica, and so on. Data is streamed from the datanode back to the client. The diagram below gives an overview of the read operation in HDFS.
Read operation in HDFS



Note:
1. The namenode does not ask datanodes directly to transfer blocks. The client contacts the datanodes for blocks and is only guided by the namenode (which points it at the best datanode for each block); this design allows HDFS to scale to a large number of concurrent clients.
2. A read operation may fail when a datanode is unavailable, the block is corrupted, or the checksum of the block does not match when the block is read by the client.
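Putting the read path together from the client's side: open() fetches the block locations from the namenode, and the returned input stream contacts the closest datanode for each block, falling back to another replica if a read or checksum check fails. Below is a minimal, hedged sketch; the file path is a hypothetical example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the namenode for the block locations of the file; the
        // stream then reads each block from the closest datanode and verifies
        // the stored checksums while reading.
        Path file = new Path("/user/demo/input.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```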
Watch the Hadoop and NoSQL Downfall Parody - I bet only a Hadoop learner can truly appreciate it.




Next: MapReduce - An efficient data processing technique
