NameNode and DataNode: An HDFS cluster has two types of nodes operating in a master-slave pattern: a namenode (the master) and a number of datanodes (slaves/workers). The namenode stores metadata and the datanodes handle the actual storage. Datanodes are the workhorses of the filesystem; they store files as smaller blocks (default size 128 MB).
The namenode manages the file-system namespace. The namespace (name system) consists of files and directories, which are represented on the namenode by inodes (an inode is a data structure that represents a file-system object, a file or a directory, in a tree structure). Inodes record attributes of a file or directory such as permissions, modification and access times, and namespace and disk space quotas. The namenode maintains the file-system tree and the metadata for all files and directories in it.
The inodes, together with the list of blocks that belong to each file, make up the namespace metadata, which is called the image (fsimage). The namenode keeps the image in RAM, and a persistent record of the image, called the checkpoint, is stored in the namenode's local file system. Along with the checkpoint, the namenode maintains a write-ahead log called the journal, also in its local file system.
- Every change made in HDFS is recorded in the journal, and the journal keeps growing until its changes are merged into a checkpoint.
- Block locations are not stored persistently because, on system start-up, this information is obtained from the datanodes.
- The namenode never modifies the checkpoint file in place. On start-up it loads the namespace image from the checkpoint, replays the journal, and then writes a new checkpoint and an empty journal to its native file system.
- To increase reliability and availability, replicated copies of the checkpoint and journal are kept on separate volumes or servers.
- When a newly initialized datanode joins the namenode, it is given the cluster's namespace ID, which is stored persistently on every datanode. The namespace ID is used to maintain file-system integrity.
- When the system starts, every datanode in the cluster sends a handshake message to the namenode to prove that it belongs to that namenode's namespace. If the namenode finds that a datanode's namespace ID is incorrect, that datanode is not allowed to join the cluster and shuts down automatically. The namespace ID thus prevents datanodes with a different namespace ID from joining the cluster. After a successful handshake, the datanode is registered with the namenode and becomes part of the Hadoop cluster.
- A storage ID is assigned to each datanode the very first time it registers with a namenode, and it never changes after that. The storage ID is an internal identifier of the datanode.
- Since the namenode does not store block locations persistently, datanodes send a block report to the namenode once they are registered. A block report contains the block ID, generation stamp and block length of every block replica the datanode holds. Datanodes then send block reports periodically (every hour) to give the namenode an up-to-date view of the blocks in the cluster.
- In addition to block reports, datanodes send lightweight messages called heartbeats to the namenode to confirm that they are alive and that the block replicas they host are available. The default heartbeat interval is three seconds.
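The handshake and block-report bullets above can be sketched in a few lines of Python. This is only an illustrative model, not Hadoop code: the class `NameNodeSketch`, its methods and the `BlockInfo` record are hypothetical names, and the real protocol carries more information than is shown here.

```python
from dataclasses import dataclass, field


@dataclass
class BlockInfo:
    """One entry of a block report (fields as listed in the text)."""
    block_id: int
    generation_stamp: int
    length: int


@dataclass
class NameNodeSketch:
    """Hypothetical, heavily simplified model of the namenode side."""
    namespace_id: int
    # block_id -> set of storage IDs holding a replica; kept in memory only
    block_locations: dict = field(default_factory=dict)
    registered: set = field(default_factory=set)

    def handshake(self, storage_id, datanode_namespace_id):
        """Register the datanode only if its namespace ID matches."""
        if datanode_namespace_id != self.namespace_id:
            return False  # the rejected datanode would shut itself down
        self.registered.add(storage_id)
        return True

    def block_report(self, storage_id, blocks):
        """Replace what is known about this datanode with its latest report."""
        if storage_id not in self.registered:
            raise ValueError("datanode is not registered")
        # Forget stale replicas of this datanode, then record the fresh ones.
        for holders in self.block_locations.values():
            holders.discard(storage_id)
        for block in blocks:
            self.block_locations.setdefault(block.block_id, set()).add(storage_id)


nn = NameNodeSketch(namespace_id=42)
nn.handshake("dn-1", 42)   # accepted
nn.handshake("dn-2", 7)    # rejected: wrong namespace ID
nn.block_report("dn-1", [BlockInfo(1, 1001, 128), BlockInfo(2, 1001, 64)])
```

Because `block_locations` lives only in memory, it is empty after a restart and is rebuilt entirely from the reports the datanodes send.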
What happens if the namenode does not receive a heartbeat from a datanode?
If the namenode does not receive a heartbeat from a datanode for ten minutes, it marks that datanode as dead and the block replicas hosted on it as unavailable. The namenode then initiates replication of every block that was stored on the dead datanode.
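The timeout rule just described can be illustrated with a small Python sketch. The names `HeartbeatMonitor`, `record_blocks` and `blocks_to_replicate` are hypothetical, timestamps are passed in explicitly as seconds to keep the example deterministic, and the sketch assumes the set of blocks per datanode is already known from block reports.

```python
HEARTBEAT_INTERVAL_S = 3      # default heartbeat interval from the text
DEAD_AFTER_S = 10 * 60        # ten minutes without a heartbeat


class HeartbeatMonitor:
    """Hypothetical model of the namenode's liveness bookkeeping."""

    def __init__(self):
        self.last_seen = {}   # storage_id -> time of last heartbeat (seconds)
        self.blocks_on = {}   # storage_id -> set of block IDs (from block reports)

    def record_blocks(self, storage_id, block_ids):
        self.blocks_on[storage_id] = set(block_ids)

    def heartbeat(self, storage_id, now):
        self.last_seen[storage_id] = now

    def dead_datanodes(self, now):
        """Datanodes whose last heartbeat is more than ten minutes old."""
        return sorted(
            dn for dn, seen in self.last_seen.items() if now - seen > DEAD_AFTER_S
        )

    def blocks_to_replicate(self, now):
        """Blocks that lost a replica because their datanode is considered dead."""
        lost = set()
        for dn in self.dead_datanodes(now):
            lost |= self.blocks_on.get(dn, set())
        return lost


monitor = HeartbeatMonitor()
monitor.record_blocks("dn-1", [1, 2])
monitor.record_blocks("dn-2", [2, 3])
monitor.heartbeat("dn-1", now=0)
monitor.heartbeat("dn-2", now=590)
print(monitor.dead_datanodes(now=601))       # dn-1 has been silent for over 600 s
print(monitor.blocks_to_replicate(now=601))  # its blocks need new replicas
```

A real namenode would also choose target datanodes for the new replicas; that placement decision is left out here.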
- A heartbeat message carries the datanode's total storage capacity and the fraction of that space in use, so heartbeats also help the namenode with load balancing and block allocation decisions.
- The namenode does not send requests to datanodes directly. It uses its replies to heartbeats to instruct a datanode to send a block report, replicate blocks, shut down, and so on.
- Maintain a standby node, called the secondary namenode, on a server other than the one that hosts the namenode.
- Copy the image and log files to a remote server periodically and, when a failure occurs, read them from that location to recover.
[Figure: Namenode, datanode and secondary namenode (checkpoint node) block diagram]
Every change to HDFS is first logged in the journal, which grows until it is merged with the checkpoint and a new journal is created. Consider a Hadoop cluster that has not been restarted for months, so that the journal has become very large. When the cluster restarts, the namenode has to restore the image (the file-system metadata) from the earlier checkpoint and merge it with the journal. The time the restart takes depends on the size of the journal.
This is where the secondary namenode comes to the rescue. The secondary namenode runs on a different server from the namenode. Its main purpose is to periodically merge the journal with the checkpoint and produce a new checkpoint and a new journal. The secondary namenode follows these steps:
- The secondary namenode downloads both the checkpoint and the journal from the namenode.
- It merges them locally, creating a new checkpoint and an empty journal.
- It returns the updated checkpoint to the namenode.
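The three steps above can be sketched as follows. This is a simplified model under stated assumptions: the image is a plain dictionary from path to metadata, the journal is a list of hypothetical `("create" | "delete", path, metadata)` records, and changes that arrive while the merge is running are ignored. The function names are illustrative and do not come from Hadoop.

```python
def apply_journal(image, journal):
    """Return a new image: the checkpoint plus every journaled change, in order."""
    merged = dict(image)
    for op, path, meta in journal:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown journal operation: {op}")
    return merged


def secondary_checkpoint(namenode_state):
    """Model one checkpoint cycle of the secondary namenode."""
    # Step 1: download the checkpoint and the journal from the namenode.
    checkpoint = dict(namenode_state["checkpoint"])
    journal = list(namenode_state["journal"])
    # Step 2: merge them locally into a new checkpoint.
    new_checkpoint = apply_journal(checkpoint, journal)
    # Step 3: hand the new checkpoint back; the journal starts out empty again.
    namenode_state["checkpoint"] = new_checkpoint
    namenode_state["journal"] = []
    return new_checkpoint


state = {
    "checkpoint": {"/a": {"perm": "644"}},
    "journal": [("create", "/b", {"perm": "600"}), ("delete", "/a", None)],
}
print(secondary_checkpoint(state))  # image after replaying both journal records
```

The same `apply_journal` replay is what a restarting namenode would have to do by itself, which is why a short journal means a fast restart.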
Note: The secondary namenode does not take over as the namenode when the namenode fails. The name is simply a naming convention that causes confusion. The secondary namenode is also called the checkpoint node.