Sep 13, 2016

Apache Hadoop Interview Questions - Set 1

1. What is Hadoop and how is it related to Big Data?
Answer:- In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity systems using programming models such as MapReduce. (Commodity hardware is an inexpensive system without high-availability traits.)
As businesses expand, the volume of data also grows, and unstructured data is dumped onto many machines for analysis. The major challenge is not storing large data but retrieving and analysing it, especially when the data is spread across different machines.
This is where the Hadoop framework comes to the rescue. Hadoop can analyse data spread across many machines quickly and cost-effectively. It uses the MapReduce programming model, which enables it to process data sets in parallel.
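
To make the MapReduce idea above concrete, here is a minimal single-machine sketch of the classic word-count example in Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and each input line simply stands in for a chunk of data that a separate node could process in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Every line could be mapped on a different machine; here we do it in a loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data needs hadoop", "hadoop processes big data"]))
```

Because each map call looks at only its own line and each reduce call at only its own key, the work can be spread over many machines without the pieces needing to talk to each other.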

2. What is the Hadoop ecosystem and what are its building blocks?
Answer:- The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects, and to the ways that they work together.
Core components of Hadoop are:
1. MapReduce, a framework for parallel processing of vast amounts of data.
2. Hadoop Distributed File System (HDFS), a sophisticated distributed file system.
3. YARN, a Hadoop resource manager.
In addition to these core elements of Hadoop, Apache has also delivered other accessories and complementary tools for developers. These include Apache Hive, a data analysis tool; Apache Spark, a general engine for processing big data; Apache Pig, a data flow language; HBase, a database tool; and Ambari, which can be considered a Hadoop ecosystem manager, as it helps to administer the use of these various Apache resources together.

3. What is the fundamental difference between classic Hadoop 1.0 and Hadoop 2.0?
Answer:-
Hadoop 1.x | Hadoop 2.x
Limited to about 4000 nodes per cluster. | Potentially up to 10000 nodes per cluster.
Supports only the MapReduce processing model. | Along with MapReduce, supports other (non-MR) distributed computing models such as Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase co-processors.
The Job Tracker is a bottleneck: it is responsible for resource management, scheduling and monitoring (MapReduce does both processing and cluster resource management). | YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done by different processing models. YARN achieves efficient cluster utilisation.
MapReduce slots are static. A given slot can run either a Map task or a Reduce task only. | Works on the concept of containers, which can run generic tasks.
Only one namespace for managing HDFS. | Multiple namespaces for managing HDFS.
The single NameNode is a single point of failure (SPOF), and a NameNode failure needs manual intervention. | The SPOF is overcome with a standby NameNode, which can be configured for automatic failover when the NameNode fails.

4. What are the Job Tracker and Task Tracker? How are they used in a Hadoop cluster?
Answer:- The Job Tracker is a daemon that runs on a master node (in small clusters, often the same machine as the Namenode) and is responsible for accepting and tracking MapReduce jobs in Hadoop. Some typical tasks of the Job Tracker are:
- It accepts jobs from clients.
- It talks to the NameNode to determine the location of the data.
- It locates Task Tracker nodes with available slots at or near the data.
- It submits the work to the chosen Task Tracker nodes and monitors the progress of each task by receiving heartbeat signals from the Task Trackers.
The Task Tracker is a daemon that runs on Datanodes. It accepts tasks (Map, Reduce and Shuffle operations) from the Job Tracker and manages the execution of individual tasks on its slave node. When a client submits a job, the Job Tracker initialises the job, divides the work and assigns it to different Task Trackers to perform the MapReduce tasks. While performing this work, each Task Tracker keeps communicating with the Job Tracker by sending heartbeats. If the Job Tracker does not receive a heartbeat from a Task Tracker within a specified time, it assumes that the Task Tracker has crashed and assigns its tasks to another Task Tracker in the cluster.
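
The heartbeat-and-reassign behaviour described above can be sketched in a few lines of Python. This is only a toy model and not how the Job Tracker is actually implemented: the function names, the timestamps in seconds and the round-robin reassignment are all assumptions made for illustration.

```python
def find_failed_trackers(last_heartbeat, now, timeout):
    """Return the trackers whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        tracker for tracker, seen in last_heartbeat.items() if now - seen > timeout
    )


def reassign_tasks(assignments, failed, healthy):
    """Move every task held by a failed tracker to a healthy one (round-robin)."""
    if not healthy:
        raise ValueError("no healthy task tracker available")
    new_assignments = {}
    moved = 0
    for task, tracker in sorted(assignments.items()):
        if tracker in failed:
            tracker = healthy[moved % len(healthy)]
            moved += 1
        new_assignments[task] = tracker
    return new_assignments


if __name__ == "__main__":
    # Hypothetical tracker names and heartbeat times (seconds).
    last_heartbeat = {"tt1": 100, "tt2": 95, "tt3": 40}
    failed = find_failed_trackers(last_heartbeat, now=105, timeout=30)
    tasks = {"map-0": "tt1", "map-1": "tt3", "reduce-0": "tt3"}
    print(failed)
    print(reassign_tasks(tasks, failed, healthy=["tt1", "tt2"]))
```

Tasks on healthy trackers keep their assignment; only tasks on the silent tracker are handed to someone else.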

5. What is the relationship between Jobs and Tasks in Hadoop?
Answer:- In Hadoop, Jobs are submitted by the client, and each Job is split into multiple tasks such as Map, Reduce and Shuffle.

6. What is HDFS (Hadoop Distributed File System)? Why is HDFS termed a block-structured file system? What is the default HDFS block size?
Answer:- HDFS is a file system designed for storing very large files. It is highly fault-tolerant, offers high throughput, suits applications with large data sets and streaming access to file system data, and can be built out of commodity hardware. (Commodity hardware is an inexpensive system without high-availability traits.)
HDFS is termed a block-structured file system because individual files are broken into blocks of a fixed size (the default HDFS block size is 128 MB). These blocks are stored across a cluster of one or more machines with data storage capacity. Changing the dfs.blocksize property in hdfs-site.xml changes the default block size for files subsequently placed into HDFS.
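
As a rough illustration of how a file is broken into fixed-size blocks, the following Python sketch computes the block lengths for a given file size. The helper name split_into_blocks is hypothetical, and the sketch only does the arithmetic; it does not talk to HDFS.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the list of block lengths (in bytes) for a file of `file_size` bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds only the leftover bytes
    return blocks


if __name__ == "__main__":
    # A 300 MB file with the default 128 MB block size.
    print([size // MB for size in split_into_blocks(300 * MB)])
```

Only the last block can be shorter than the configured block size; all others are full.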

7. Why are HDFS blocks large compared to disk blocks (the HDFS default block size is 128 MB, whereas a disk block in Unix/Linux is typically a few kilobytes, for example 4096 or 8192 bytes)?
Answer:- HDFS is better suited to a large amount of data in a single file than to small amounts of data spread across many files. To minimise the share of time spent seeking during read operations, files are stored in large chunks of the HDFS block size.
If a file is smaller than 128 MB, its block occupies only the file's own size on disk; the remaining space is not reserved and stays available for other data.
If a particular file is 110 MB, will the HDFS block still consume 128 MB as the default size?
No, only 110 MB will be consumed by the HDFS block, and the remaining 18 MB stays free to store something else.
Note:- In Hadoop 1 the default block size is 64 MB, and in Hadoop 2 it is 128 MB.
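
The seek-time argument above can be checked with simple arithmetic. The sketch below is in Python, and the disk figures in it (10 ms seek time, 100 MB/s transfer rate) are illustrative assumptions rather than measurements; the function names are made up for this example.

```python
MB = 1024 * 1024


def seek_overhead(block_size, seek_time_s, transfer_rate):
    """Fraction of the total read time for one block that is spent seeking."""
    transfer_time_s = block_size / transfer_rate
    return seek_time_s / (seek_time_s + transfer_time_s)


def block_size_for_ratio(seek_time_s, transfer_rate, seek_to_transfer_ratio):
    """Block size at which seek time is the given fraction of transfer time."""
    return (seek_time_s / seek_to_transfer_ratio) * transfer_rate


if __name__ == "__main__":
    seek, rate = 0.010, 100 * MB  # assumed: 10 ms seek, 100 MB/s transfer
    print(seek_overhead(4096, seek, rate))      # small disk-sized block
    print(seek_overhead(128 * MB, seek, rate))  # HDFS-sized block
    print(block_size_for_ratio(seek, rate, 0.01) / MB)
```

Under these assumed figures, reading a 4096-byte block is almost entirely seek time, while for a 128 MB block seeking is under one percent of the total, which is why large blocks favour streaming reads.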

8. What is the significance of fault tolerance and high throughput in HDFS?
Answer:- Fault tolerance:- In Hadoop, when we store a file, it is automatically replicated at two other locations as well. So even if one or two of the systems fail, the file is still available on the third system. The chance of data loss is therefore minimised, and the data can be recovered if a node fails.
Throughput:- Throughput is the amount of work done in a unit of time. In Hadoop, when a client submits a job, it is divided and shared among different systems. All the systems execute the tasks assigned to them independently and in parallel, so the work is completed in a much shorter time. In this way, HDFS provides good throughput.

9. What does "replication factor" mean in Hadoop? What is the default replication factor in HDFS? How can the default replication factor be modified?
Answer:- The number of copies of a file (more precisely, of each of its blocks) that HDFS keeps is termed the replication factor.
The default replication factor in HDFS is 3. Changing the dfs.replication property in hdfs-site.xml changes the default replication for all files subsequently placed in HDFS.
The actual replication factor can be specified when the file is created. The default is used if replication is not specified at create time.
We can also change the replication factor for a single file, or for all files in a directory, using the Hadoop fs shell:
$ hadoop fs -setrep -w 3 /MyDir/file
$ hadoop fs -setrep -R -w 3 /RootDir/Mydir
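
To see what a replication factor means for storage, here is a small Python sketch that estimates how many block replicas and how much raw disk space one file needs. The helper name replica_footprint is hypothetical, and the defaults simply mirror the figures quoted in this post (128 MB blocks, replication factor 3).

```python
MB = 1024 * 1024


def replica_footprint(file_size, block_size=128 * MB, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for one file."""
    blocks = (file_size + block_size - 1) // block_size  # integer ceiling division
    return blocks, blocks * replication, file_size * replication


if __name__ == "__main__":
    blocks, replicas, raw = replica_footprint(300 * MB)
    print(blocks, replicas, raw // MB)
```

Lowering the replication factor saves raw space but also reduces how many node failures a block can survive.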

10. What are the Datanode and Namenode in HDFS?
Answer:- Datanodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.
The Namenode is the master node. It stores metadata about where the data blocks are actually stored, so that it can manage the blocks present on the Datanodes; in Hadoop 1 the Job Tracker often runs on the same machine. Because the entire HDFS relies on the Namenode, it should be a reliable, high-availability machine rather than ordinary commodity hardware.

11. Can the Namenode and Datanode systems have the same hardware configuration?
Answer:- In a single-node cluster there is only one machine, so the Namenode and Datanode can be on the same machine. However, in a production environment the Namenode and Datanodes are on different machines, and the Namenode should be a high-end, high-availability machine.

12. What is the fundamental difference between a traditional RDBMS and Hadoop?
Answer:- A traditional RDBMS is used for transactional systems, whereas Hadoop is an approach to storing huge amounts of data in a distributed file system and processing it.
RDBMS | Hadoop
Data size is of the order of gigabytes | Data size is of the order of petabytes or zettabytes
Access method: interactive and batch | Access method: batch only
Static schema | Dynamic schema
Nonlinear scaling | Linear scaling
High integrity | Low integrity
Suitable for reading and writing many times | Suitable for write once, read many times

12. What is the secondary Namenode and what is its significance in Hadoop?
Answer:- In Hadoop 1, the Namenode was a single point of failure. To keep the Hadoop system up and running, it was important to make the Namenode resilient to failure and able to recover from it. If the Namenode fails, no data on the Datanodes can be accessed, because the Namenode stores the metadata about the data blocks stored on the Datanodes.
The main file written by the Namenode is called fsimage. When the Namenode starts, this file is read into memory, and all later modifications to the filesystem are applied to this in-memory representation. The Namenode does not write out new versions of fsimage as changes are applied while it is running; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.
The secondary Namenode periodically merges the namespace image with the edit log to prevent the edit log from becoming too large. It usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the Namenode to perform the merge. It maintains a copy of the merged namespace image, which can be used if the Namenode fails. However, the state of the secondary Namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.
Note:- The secondary Namenode is not a standby for the primary Namenode, so it is not a substitute for it.
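
The merge of fsimage and edits described above can be pictured with a toy model in Python. The data layout here is an assumption made purely for illustration (a dict for the namespace and a list of tuples for the edit log); the real on-disk formats and operation set in HDFS are different and much richer.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to the in-memory namespace."""
    operation = edit[0]
    if operation == "create":
        _, path, replication = edit
        namespace[path] = {"replication": replication}
    elif operation == "delete":
        namespace.pop(edit[1], None)
    elif operation == "rename":
        _, source, destination = edit
        namespace[destination] = namespace.pop(source)
    else:
        raise ValueError(f"unknown edit operation: {operation}")


def checkpoint(fsimage, edits):
    """Replay the edit log over a copy of fsimage; return (new_fsimage, empty_log)."""
    merged = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edits:
        apply_edit(merged, edit)
    return merged, []


if __name__ == "__main__":
    fsimage = {"/a": {"replication": 3}}
    edits = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
    print(checkpoint(fsimage, edits))
```

After a checkpoint the edit log starts again from empty, which is what keeps it from growing without bound.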

13. What is the importance of heartbeats in HDFS?
Answer:- A heartbeat is a signal that a node sends to indicate that it is alive. A Datanode sends heartbeats to the Namenode, and a Task Tracker sends heartbeats to the Job Tracker.
If the Namenode or Job Tracker does not receive a heartbeat, it concludes that there is a problem with the Datanode, or that the Task Tracker is unable to perform its assigned task.

14. What is an HDFS cluster?
Answer:- HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. In other words, the collection of commodity Datanode machines and the high-availability Namenode is collectively termed an HDFS cluster.

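14. What is the communication channel between the client and the Namenode/Datanode?
Answer:- All HDFS communication is layered on top of TCP/IP. The client talks to the Namenode through Hadoop's RPC mechanism, and it reads and writes block data by streaming directly to and from the Datanodes over TCP. SSH is not used for this traffic; it is used only by the cluster start and stop scripts to launch daemons on remote machines.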

15. What is a rack? What is the Replica Placement Policy?
Answer:- A rack is a physical collection of Datanodes kept at a single location. There can be multiple racks in a single location.
When a client wants to load a file into the cluster, the content of the file is divided into blocks, and for every block the Namenode provides the addresses of 3 Datanodes, indicating where the copies of that block should be stored.
While placing the replicas, the key rule followed is “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
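
The rule quoted above can be sketched as follows in Python. This is a simplified, deterministic illustration and not the actual HDFS placement code: it assumes the writer runs on a Datanode, puts the first copy there and the other two on a single remote rack, which is one arrangement that satisfies the rule. The function name, rack names and node names are hypothetical, and a real cluster would also take node load and free space into account.

```python
def place_replicas(topology, writer_node):
    """Pick three Datanodes: the writer's node plus two nodes on one other rack."""
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer_node not in node_rack:
        raise ValueError("writer node is not part of the topology")
    local_rack = node_rack[writer_node]
    # Look for a remote rack that has at least two Datanodes.
    for rack in sorted(topology):
        if rack != local_rack and len(topology[rack]) >= 2:
            second, third = sorted(topology[rack])[:2]
            return [writer_node, second, third]
    raise ValueError("need a second rack with at least two Datanodes")


if __name__ == "__main__":
    topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
    print(place_replicas(topology, "dn1"))
```

Keeping two copies on one rack limits cross-rack traffic during the write, while the copy on the other rack protects the block if a whole rack goes down.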
