Dec 18, 2016

Simulate Login and Registration of New User using Python data structures

Python comes in very handy when dealing with data: data crunching, machine learning,
and interactions on the web.
Today we aim to build a simple authentication mechanism in Python,
using Python dictionaries to simulate a database of user information.
Give the user two options:
a. Add a new user
b. Log in with existing credentials

a. For new user registration, the input parameters are
NAME,ROLL,EMAIL,PASSWORD (comma-separated values).
Validate the entries as follows:
- Email ID must be an acceptable address of the form SOMEONE@students.devinline.ac.in
- Password must be at least 8 characters long
- Roll number must be an integer [optional if you do not know Python regex]
If the entry is valid, print "Success"
Else print "Registration Failed"
b. For logging in, the input parameters are ROLL/EMAIL,PASSWORD.
The login ID can be either a roll number or an email; both are valid.
If authentication is successful, print "Login Success"
Else print "Login Fails"

Sample code - Simulate Login and Registration using Dict


import re

class Students:
    def __init__(self, name, rollNo, email, password):
        self.name = name
        self.rollNo = rollNo
        self.email = email
        self.password = password 


def doDBOperation(choice,studentDict1,studentDict2):
    if choice == 1:
        print "Enter name, rollNo, password, email(Separated by comma)"
        inputArr = map(str,raw_input().strip().split(','))
        name=inputArr[0]
        rollNo=inputArr[1]
        email=inputArr[2]
        password=inputArr[3]
        #print name, rollNo, password, email
        stat = adduser(name, rollNo, email,password) 
        if stat == "SUCCESS":
            student = Students(name, rollNo, email, password)
            studentDict1[rollNo] = student
            studentDict2[email] = student
            print "Success"
        else:
            print "Registration Failed"
    elif choice == 2:
        print "Enter loginId(Roll or Email) and password"
        inputArr = raw_input().strip().split(',')
        rollorEmail = inputArr[0]
        password = inputArr[1]
        stat=loginService(studentDict1,studentDict2, rollorEmail, password)
        if stat == "SUCCESS":
            print "Login Success!!"
        else:
            print "Login Fails !!"
    print "Enter your choice(1|2)"
def adduser(name, rollNo, email, password):
    if validateEmail(email) and validatePassword(password) and validateRollNo(rollNo) :
        return "SUCCESS"
    else:
        return "FAILURE"

def loginService(studentDict1, studentDict2, rollorEmail, pwd):
    # Look up the student by roll number or by email
    if rollorEmail in studentDict1:
        studentObj = studentDict1[rollorEmail]
    elif rollorEmail in studentDict2:
        studentObj = studentDict2[rollorEmail]
    else:
        return "FAILURE"
    if studentObj.password == pwd:
        return "SUCCESS"
    return "FAILURE"

def validateEmail(email):
    # Dots escaped and pattern anchored so only the exact domain matches
    pattern = r"^[a-z]+[.]?[a-z]*@students\.devinline\.ac\.in$"
    return re.match(pattern, email) is not None
    

def validateRollNo(rollNo):
    rollNo = str(rollNo)
    return rollNo.isdigit()

def validatePassword(password):
    if len(password) >= 8:
        return True
    else:
        return  False



if __name__ == "__main__" :
    studentDict1 = { }
    studentDict2 = { }
    print "Welcome to Devinline authentication system"\
        "\nChoose:\
        \n1-Register New User\
        \n2-Log In \
        \nAny other - Exit program"
    while True:
        try:
            choice = int(raw_input())
        except ValueError:
            # non-numeric input falls through to "Any other - Exit"
            choice = 0
        if choice == 1  or choice == 2: 
            doDBOperation(choice,studentDict1,studentDict2)
        else :
            print "OK Bye"
            break


Sample Output:-

>>>
Welcome to Devinline authentication system
Choose:      
1-Register New User      
2-Log In      
Any other - Exit program
1
Enter name, rollNo, email, password (separated by comma)
alia,20160047,alia@students.devinline.ac.in,**alia**
Success
Enter your choice(1|2)
2
Enter loginId(Roll or Email) and password
20160047,**alia**
Login Success!!

Nov 27, 2016

Merge list of Iterators into a single Iterator using Google Guava collection framework

This is one of the advanced-level Java interview questions: devise an algorithm to merge a list of Iterators into a single Iterator. Google Guava's collections framework does it in a couple of lines.

The lines below show how to merge multiple collections and obtain a single Iterator over all of them.

Iterable<Integer> iterable = Iterables.concat(list1, list2, list3, list4);
Iterator<Integer> iterator = iterable.iterator();

Sample program demonstrating the above concept:

package javacore;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.google.common.collect.Iterables;

public class IteratorOverAlIteratorsGuava {

 /**
  * @param args
  */
 public static void main(String[] args) {
  Iterator<Integer> iterator = iterators();
  while (iterator.hasNext()) {
   System.out.println(iterator.next());
  }
 }

 public static Iterator<Integer> iterators() {
  List<Integer> list1 = new ArrayList<>();
  list1.add(0);
  list1.add(1);
  list1.add(2);
  List<Integer> list2 = Arrays.asList(3, 4);
  List<Integer> list3 = new ArrayList<>();
  List<Integer> list4 = Arrays.asList(5, 6, 7, 8, 9);

  final Iterable<Integer> iterable = Iterables.concat(list1, list2,
    list3, list4);
  final Iterator<Integer> iterator = iterable.iterator();
  return iterator;
 }
}

Sample Output:-
0
1
2
3
4
5
6
7
8
9
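
If you literally start from a list of Iterator objects rather than Iterables, Guava also provides Iterators.concat, which accepts iterators directly. A minimal sketch of that variant is shown below; the class name is chosen here only for illustration, assuming a recent Guava version on the classpath.

package javacore;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.google.common.collect.Iterators;

public class IteratorOverIteratorListGuava {

 public static void main(String[] args) {
  List<Integer> list1 = Arrays.asList(0, 1, 2);
  List<Integer> list2 = Arrays.asList(3, 4);
  List<Integer> list3 = Arrays.asList(5, 6, 7, 8, 9);

  // Build a list of Iterators and merge them into a single Iterator
  List<Iterator<Integer>> iterators = Arrays.asList(
    list1.iterator(), list2.iterator(), list3.iterator());
  Iterator<Integer> merged = Iterators.concat(iterators.iterator());

  while (merged.hasNext()) {
   System.out.println(merged.next());
  }
 }
}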

Oct 9, 2016

Python os module - File operations in Python using the os module

Python provides the os module, which can be used to perform file operations such as create, read, write, and delete. The program below performs the following operations:
1. Check for a directory and create it if it does not exist
2. Create a file
3. Write into the file
4. Rename the file
5. Delete files
6. Delete the directory

Sample program to illustrate various file operations using Python's os module:-


import os
import sys
import tempfile
def dofileoperation():
    dirExist = 0
    tempdir = tempfile.gettempdir()
    if os.path.isdir(tempdir):
        print "temp directory exists!!"
        dirExist = 1
    else:
        print "No temp directory exists, so create a user-defined directory"
        # raw string so the backslash is not treated as an escape sequence
        tempdir = r"C:\tempPython"
        create_dir(tempdir)
        dirExist = 1
    if dirExist==1:
        os.chdir(tempdir)
        current_working_dir = os.getcwd()
        print "Current working directory is " + current_working_dir
        print "Create 'example' directory if does not exist"
        if os.path.isdir('example'):
            pass
        else:
            create_dir('example')

        #change directory to example
        os.chdir('example')
        current_working_dir=os.getcwd()
        print "\nNew working directory is " + current_working_dir

        print "Creating three file in current working directory "
        for i in range(1,3):
            file_handle=create_file_writemode('input'+str(i))
            write_into_file(file_handle,'Hello file\nI \
                am first line\nI am second line')

        print "\nCurent directory listing are " 
        current_dir_listing(current_working_dir)

        print "\nDisplay fully qualified of all \
            file in current directory"
        full_qualified_name(current_working_dir)
        
        first_file= os.path.join(current_working_dir,\
            os.listdir(current_working_dir)[0])
        print "\nDisplaying file contents of file: " + first_file
        read_file_content(first_file)

        print "Rename first file in current directory "
        file_rename(first_file,first_file+'_new')
        print "\nCurent directory listing after file rename " 
        current_dir_listing(current_working_dir)

        print "\nDeleting all files in current directory"
        delete_files_indir(current_working_dir)
        
        print "\nCurent directory listing are " 
        current_dir_listing(current_working_dir)

        print "\nDeleting working directory 'example' "
        delete_dir('example')

        print "\nChecking existance of example dir" 
        if os.path.isdir('example'):
            print 'example directory exists!!'
        else:
            print 'example directory deleted successfully !!'

def create_dir(dir):
    os.mkdir(dir)

def delete_dir(dir):
    #print os.pardir
    os.chdir(os.pardir)
    #print os.getcwd()
    os.rmdir(dir)

def create_file_writemode(filename):
    return open(filename,'w')

def get_filehandle(filename):
    return open(filename)

def write_into_file(file_handle,text):
    file_handle.write(text)
    file_handle.close()

def full_qualified_name(current_working_dir):
    for filename in os.listdir(current_working_dir):
            print os.path.join(current_working_dir,filename)

def current_dir_listing(current_working_dir):
    print os.listdir(current_working_dir)

def read_file_content(filename):
    file_handle = open(filename)
    for eachline in file_handle.readlines():
        # strip the trailing newline so print does not add a blank line
        print eachline.rstrip('\n')
    file_handle.close()

def delete_files_indir(current_working_dir):
    for filename in os.listdir(current_working_dir):
        os.remove(os.path.join(current_working_dir,filename))

def file_rename(oldname,newname):
    os.rename(oldname,newname)

if __name__ == '__main__':
    dofileoperation()

Sample output:-

C:\Python27\sample->python fileop.py
temp directory exists!!
Current working directory is c:\users\nranjan\appdata\local\temp
Create 'example' directory if it does not exist

New working directory is c:\users\nranjan\appdata\local\temp\example
Creating two files in current working directory

Current directory listing is
['input1', 'input2']

Display fully qualified names of all files in current directory
c:\users\nranjan\appdata\local\temp\example\input1
c:\users\nranjan\appdata\local\temp\example\input2

Displaying file contents of file: c:\users\nranjan\appdata\local\temp\example\input1
Hello file
I am first line
I am second line
Rename first file in current directory

Current directory listing after file rename
['input1_new', 'input2']

Deleting all files in current directory

Current directory listing is
[]

Deleting working directory 'example'

Checking existence of example dir
example directory deleted successfully !!

Oct 8, 2016

Median of stream of numbers in Java

Problem statement: Find the median of a stream of numbers. Reference: GeeksForGeeks

Median is defined as the middle element of the sorted elements (odd count of numbers), or the mean of the two middle elements (even count of numbers).
Consider the following array of numbers as a stream:
5, 15, 1, 3, 2, 8, 7, 9, 10, 6, 11, 4
After reading 1st element of stream - 5 -> median - 5
After reading 2nd element of stream - 5, 15 -> median - 10
After reading 3rd element of stream - 5, 15, 1 -> median - 5

Sample program to find median of numbers using Min and Max heap:-

We need to maintain two heaps: a left heap as a max heap (lower half of the numbers) and a right heap as a min heap (upper half). Refer to how to implement max and min heaps.
public class MedianOfNumberStream {
public static void findMedianOfStream(int[] input) { // stream is passed as
              // an array
 Heap left = new MaxHeap();
 Heap right = new MinHeap();
 float median = 0;
 int size = input.length;
 for (int i = 0; i < size; i++) {
  System.out.print("Current median of " + (i + 1) + " elements is: ");
  median = computeCurrentMedian(input[i], median, left, right);
  System.out.print(median);
  System.out.println();
 }
}

private static float computeCurrentMedian(int currentElement,
  float median, Heap left, Heap right) {
 int stat = Util.LeftOrRight(left.getSize(), right.getSize());
 switch (stat) {
 case 1: // Number of elements in left (max) heap > right (min) heap
  if (currentElement < median) {
   // Remove top element from left heap and
   // insert into right heap
   right.insert(left.remove());

   // current element fits in left (max) heap
   left.insert(currentElement);
  } else {
   // current element fits in right (min) heap
   right.insert(currentElement);
  }

  // Both heaps are balanced
  median = Util.average(left.getTop(), right.getTop());

  break;

 case 0: // Number of elements in left (max) heap = right (min) heap

  if (currentElement < median) {
   left.insert(currentElement);
   median = left.getTop();
  } else {
   // current element fits in right (min) heap
   right.insert(currentElement);
   median = right.getTop();
  }

  break;

 case -1: // Number of elements in left (max) heap < right (min) heap

  if (currentElement < median) {
   left.insert(currentElement);
  } else {
   // Remove top element from right heap and
   // insert into left heap
   left.insert(right.remove());
   // current element fits in right (min) heap
   right.insert(currentElement);
  }
  // Both heaps are balanced
  median = Util.average(left.getTop(), right.getTop());
  break;
 }
 return median;
}

public static void main(String[] args) {
 //int A[] = { 5, 15, 1, 3, 2, 8, 7, 9, 10, 6, 11, 4 };
 int B[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
 findMedianOfStream(B);
}
}

class MaxHeap extends Heap {
public MaxHeap() {
 super(Integer.MAX_VALUE/1000, HeapType.MAX_HEAP.getValue());
}
}

class MinHeap extends Heap {
public MinHeap() {
 super(Integer.MAX_VALUE/1000, HeapType.MIN_HEAP.getValue());
}
}

class Util {
public static int LeftOrRight(int a, int b) {
 if (a == b)
  return 0;

 return a < b ? -1 : 1;
}

public static float average(int a, int b) {
 return ((float) a + (float) b) / 2;
}
}

enum HeapType {
MAX_HEAP(0), MIN_HEAP(2);

private final int value;

HeapType(final int newValue) {
 value = newValue;
}

public int getValue() {
 return value;
}
}

class Heap {
int[] heap;
int size;
int minMaxFlag;

public Heap() {
}

public Heap(int max, int minMaxFlag) {
 heap = new int[max];
 size = 0;
 this.minMaxFlag = minMaxFlag;
}

public int getSize() {
 return size;
}

int getTop() {
 int max = Integer.MAX_VALUE;

 if (size > 0) {
  max = heap[0];
 }

 return max;
}

public int parentIndex(int index) {
 return (index - 1) / 2;
}

public int leftChildIndex(int index) {
 return (2 * index) + 1;
}

public int rightChildIndex(int index) {
 return (2 * index) + 2;
}

public void swap(int index1, int index2) {
 // simple temp-variable swap (the XOR trick fails when index1 == index2)
 int temp = heap[index1];
 heap[index1] = heap[index2];
 heap[index2] = temp;
}

public void insert(int element) {
 if (size == 0) {
  heap[size++] = element;
 } else {
  heap[size] = element;
  percolateUp(size++);
 }
}

// max/min heap based on flag
public void percolateUp(int index) {
 int temp = heap[index];
 int parent = parentIndex(index);
 if (this.minMaxFlag == 0) {
  while (index > 0 && heap[parent] < temp) {
   heap[index] = heap[parent];
   index = parent;
   parent = parentIndex(index);

  }
 } else {
  while (index > 0 && heap[parent] > temp) {
   heap[index] = heap[parent];
   index = parent;
   parent = parentIndex(index);

  }
 }

 heap[index] = temp;
}

public int remove() {
 int temp = heap[0];
 heap[0] = heap[--size];
 percolateDown(0);
 return temp;
}

public void percolateDown(int index) {
 int lcIndex;
 int rcIndex;
 int temp = heap[index];
 int largeChildIndex;
 int smallChilIndex;
 if (minMaxFlag == 0) {
  while (index < (size / 2)) {
   lcIndex = leftChildIndex(index);
   rcIndex = rightChildIndex(index);
   if (rcIndex < size && heap[lcIndex] < heap[rcIndex]) {
    largeChildIndex = rcIndex;
   } else {
    largeChildIndex = lcIndex;
   }
   if (heap[largeChildIndex] <= temp)
    break;
   heap[index] = heap[largeChildIndex];
   index = largeChildIndex;
  }
 } else {
  while (index < (size / 2)) {
   lcIndex = leftChildIndex(index);
   rcIndex = rightChildIndex(index);
   if (rcIndex < size && heap[lcIndex] > heap[rcIndex]) {
    smallChilIndex = rcIndex;
   } else {
    smallChilIndex = lcIndex;
   }
   if (heap[smallChilIndex] >= temp)
    break;
   heap[index] = heap[smallChilIndex];
   index = smallChilIndex;
  }
 }
 heap[index] = temp;
}
}

Sample Output:-
Current median of 1 elements is: 1.0
Current median of 2 elements is: 1.5
Current median of 3 elements is: 2.0
Current median of 4 elements is: 2.5
Current median of 5 elements is: 3.0
Current median of 6 elements is: 3.5
Current median of 7 elements is: 4.0
Current median of 8 elements is: 4.5
Current median of 9 elements is: 5.0
Current median of 10 elements is: 5.5
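
For comparison, the same two-heap technique can be sketched with java.util.PriorityQueue from the standard library instead of the hand-rolled Heap class above; RunningMedian is a name chosen for this sketch, not part of the original program.

import java.util.Collections;
import java.util.PriorityQueue;

public class RunningMedian {

 // lower half of the numbers, largest on top (max heap)
 private final PriorityQueue<Integer> left = new PriorityQueue<>(Collections.reverseOrder());
 // upper half of the numbers, smallest on top (min heap)
 private final PriorityQueue<Integer> right = new PriorityQueue<>();

 public void add(int value) {
  left.offer(value);
  right.offer(left.poll());          // keep every element of left <= every element of right
  if (right.size() > left.size()) {  // rebalance so sizes differ by at most one
   left.offer(right.poll());
  }
 }

 public double median() {
  if (left.size() > right.size()) {
   return left.peek();
  }
  return (left.peek() + right.peek()) / 2.0;
 }

 public static void main(String[] args) {
  RunningMedian rm = new RunningMedian();
  int[] stream = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
  for (int i = 0; i < stream.length; i++) {
   rm.add(stream[i]);
   System.out.println("Current median of " + (i + 1) + " elements is: " + rm.median());
  }
 }
}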

Sep 15, 2016

Big Data: An overview of big data technologies, learning them, and their importance (need-of-the-hour technologies)

Do You Want to Become a Big Data Hadoop Expert?

Hadoop is one of the most sought-after skills today. Big data professionals are asked to prove their skills with the tools and techniques of the Hadoop stack. Market analysts have observed that, although there are many Hadoop professionals in the market, a huge number of skilled and trained professionals are still required.

We are living in the midst of a Big Data revolution, and every organization relies on collecting, analyzing, organizing, and synthesizing huge amounts of data in order to survive in a competitive market. Every organization, whether government or private, uses Big Data to manage its products and services and to attract new customers. In this blog, let's learn more about the career path of a Hadoop expert.

Introduction to Big Data
The term Big Data basically refers to collections of datasets that are especially large and complex. Sets that cannot be processed and stored with traditional data management tools are handled with Hadoop technology. The main data processing challenges observed include searching, sharing, curating, capturing, analyzing, and transferring stored data.

The Following 5 Vs Characterize the Data Processing Challenges:
1. Volume: Volume refers to the huge amount of data, which keeps growing day by day and becomes ever harder to process.

2. Variety: Various data sources contribute to Big Data, from legacy databases to social media, and the data may be structured or unstructured.

3. Velocity: The pace at which different sources generate data and traffic can vary widely; Big Data systems must be able to handle this traffic and the massive amount of data.

4. Veracity: Data may be inconsistent or incomplete, and this uncertainty about data availability and quality is referred to as veracity.

5. Value: Although a massive amount of data is available across data sources, not all of it is valuable; turning it into valuable data that can benefit the organization is important, and that is what Big Data and Hadoop do.

Hadoop and Its Architecture

Hadoop's architecture consists mainly of two components, the NameNode and the DataNode; both are described below:

NameNode
The NameNode is the master daemon responsible for managing files, the file-system hierarchy, file permissions, and every change made to the file system. As soon as a change is made, for example when a file is deleted, it is immediately recorded in the EditLog. The NameNode also receives a block report and a heartbeat from all DataNodes to make sure that they are live.

DataNode
DataNodes are slave daemons that run on the slave machines. The actual data is stored on the DataNodes, and they serve read and write requests from clients. Based on the NameNode's decisions, the DataNodes delete or replicate blocks. Cluster resource management, in turn, is handled by YARN (Yet Another Resource Negotiator), described next.

YARN Resource Manager
The ResourceManager works at the cluster level and runs on the master machine. Resource management and application scheduling are its two main responsibilities, and both are handled through the YARN ResourceManager.

YARN Node Manager
The NodeManager component of YARN works at the node level and runs on every slave machine. Its main responsibilities include managing and monitoring containers; it also manages logs and keeps track of node health. The NodeManager continuously communicates with the ResourceManager.

MapReduce
MapReduce is a core Hadoop component and provides the processing logic. This software framework helps in writing applications that process large data sets using parallel, distributed algorithms in a Hadoop environment. The map function performs operations such as filtering, grouping, and sorting of the input records, while the reduce function aggregates and summarizes the intermediate results to produce the final output.
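
As an illustration, below is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. The class names (WordCountSketch, TokenMapper, SumReducer) and the command-line input/output paths are placeholders chosen for this sketch, not part of any particular distribution.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

 // Map phase: split each input line into words and emit (word, 1)
 public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
   for (String token : value.toString().split("\\s+")) {
    if (!token.isEmpty()) {
     word.set(token);
     context.write(word, ONE);
    }
   }
  }
 }

 // Reduce phase: sum the counts emitted for each word
 public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
   int sum = 0;
   for (IntWritable value : values) {
    sum += value.get();
   }
   context.write(key, new IntWritable(sum));
  }
 }

 public static void main(String[] args) throws Exception {
  Job job = Job.getInstance(new Configuration(), "word count sketch");
  job.setJarByClass(WordCountSketch.class);
  job.setMapperClass(TokenMapper.class);
  job.setReducerClass(SumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
  FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}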

Career Path of a Hadoop Developer

There can be many challenges when starting a career as a Hadoop developer. Here we have summarized some key factors that will help you shape a successful career as a Hadoop professional.

Required Educational and Technical Skills
The course can also be joined by candidates from non-technical backgrounds such as graduates of Statistics, Electronics, Physics, Mathematics, Business Analytics, and Material Processing. As far as experience is concerned, newcomers or professionals with 1-2 years of experience can become Hadoop developers. Employers mainly judge candidates on their knowledge and their zeal to learn new concepts.

On the technical side, knowledge of Java concepts is required, though Hadoop does not demand advanced Java expertise. Professionals from other streams also learn Hadoop and switch their careers to this in-demand platform.

Hadoop Certifications and Learning Path
One of the most common questions among Hadoop developers is "what is the value of certification for this profession?" A certification helps employers judge and verify a candidate's credentials. Various Hadoop certifications are available for developers, from IBM, EMC, MapR, and many more, and one can apply and get certified in the technology easily.

As far as the learning path is concerned, candidates who fulfill the basic educational requirements, with or without relevant experience, can apply for the position of Hadoop developer.

A number of companies hire Hadoop professionals and offer excellent salary packages. As demand for Hadoop professionals is higher than availability, it has become one of the most sought-after skills.

Research shows that a Hortonworks Hadoop professional can earn around $170,472, while Walmart offers an average package of $128K to Hadoop professionals in California. In countries like the USA, Hadoop professionals earn an average salary of around $145K. So you can sense the popularity of the Hadoop profession these days.

Final Words

If you have a passion for data analytics and statistics, Hadoop is a great choice. You can dive deep into the technology, and it can prove to be a lucrative career option. Good luck!!


About Author
Manchun Pandit loves pursuing excellence through writing and has a passion for technology. He has successfully managed and run personal technology magazines and websites. He currently writes for JanBaskTraining.com, a global training company that provides e-learning and professional certification training.

Sep 13, 2016

Apache Hadoop Interview Questions - Set 1

1. What is Hadoop and how is it related to Big Data?
Answer:- In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity systems using programming models like MapReduce. (Commodity hardware is an inexpensive system without high-availability traits.)
As businesses expand, the volume of data also grows, and unstructured data gets dumped onto different machines for analysis. The major challenge is not storing the large data but retrieving and analysing it, especially when the data is spread across geographically distributed machines.
The Hadoop framework comes to the rescue here. Hadoop can analyse data present on different machines at different locations very quickly and in a very cost-effective way. It uses the MapReduce programming model, which enables it to process data sets in parallel.

2. What is the Hadoop ecosystem and what are its building blocks?
Answer:- The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects, and to the ways that they work together.
Core components of Hadoop are:
1. MapReduce - a framework for parallel processing of vast amounts of data.
2. Hadoop Distributed File System (HDFS) - a sophisticated distributed file system.
3. YARN - a Hadoop resource manager.
In addition to these core elements of Hadoop, Apache has also delivered other accessories and complementary tools for developers. These include Apache Hive, a data analysis tool; Apache Spark, a general engine for processing big data; Apache Pig, a data flow language; HBase, a database tool; and Ambari, which can be considered a Hadoop ecosystem manager, as it helps to administer the use of these various Apache resources together.

3. What is the fundamental difference between classic Hadoop 1.x and Hadoop 2.x?
Answer:-
- Cluster size: Hadoop 1.x is limited to around 4000 nodes per cluster; Hadoop 2.x scales to potentially 10000 nodes per cluster.
- Processing models: Hadoop 1.x supports only the MapReduce processing model; Hadoop 2.x adds support for other (non-MR) distributed computing models such as Spark, Hama, Giraph, MPI (Message Passing Interface) and HBase co-processors.
- Resource management: In Hadoop 1.x the Job Tracker is a bottleneck, being responsible for resource management, scheduling and monitoring (MapReduce does both processing and cluster resource management); in Hadoop 2.x, YARN (Yet Another Resource Negotiator) does cluster resource management while processing is done by different processing models, giving more efficient cluster utilisation.
- Slots vs containers: In Hadoop 1.x, MapReduce slots are static; a given slot can run either a Map task or a Reduce task only. Hadoop 2.x works on the concept of containers, which can run generic tasks.
- Namespaces: Hadoop 1.x has only one namespace for managing HDFS; Hadoop 2.x supports multiple namespaces.
- NameNode availability: In Hadoop 1.x the single NameNode is a single point of failure, and recovery from a NameNode failure needs manual intervention; Hadoop 2.x overcomes the SPOF with a standby NameNode and can be configured for automatic recovery.

4. What are the Job Tracker and Task Tracker, and how are they used in a Hadoop cluster?
Answer:- Job Tracker is a daemon that runs on the NameNode machine for submitting and tracking MapReduce jobs in Hadoop. Some typical tasks of the Job Tracker are:
- Accepts jobs from clients.
- Talks to the NameNode to determine the location of the data.
- Locates TaskTracker nodes with available slots at or near the data.
- Submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers.
Task Tracker is a daemon that runs on DataNodes. It accepts tasks (Map, Reduce, and Shuffle operations) from the Job Tracker and manages the execution of individual tasks on slave nodes. When a client submits a job, the Job Tracker initialises the job, divides the work, and assigns it to different Task Trackers. While performing the work, a Task Tracker communicates with the Job Tracker by sending heartbeats; if the Job Tracker does not receive a heartbeat within the specified time, it assumes the Task Tracker has crashed and assigns its tasks to another Task Tracker in the cluster.

5. What is the relationship between jobs and tasks in Hadoop?
Answer:- In Hadoop, jobs are submitted by clients, and each job is split into multiple tasks such as Map, Reduce, and Shuffle.

6. What is HDFS (Hadoop Distributed File System)? Why is HDFS termed a block-structured file system? What is the default HDFS block size?
Answer:- HDFS is a file system designed for storing very large files. HDFS is highly fault-tolerant, offers high throughput, is suitable for applications with large data sets and streaming access to file system data, and can be built out of commodity hardware (commodity hardware is an inexpensive system without high-availability traits).
HDFS is termed a block-structured file system because individual files are broken into blocks of a fixed size (the default size of an HDFS block is 128 MB). These blocks are stored across a cluster of one or more machines with data storage capacity. Changing the dfs.blocksize property in hdfs-site.xml changes the default block size for all files subsequently placed into HDFS.
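
As a small illustration (not part of the original question set), the same property can be read or overridden from client code through the HDFS FileSystem API; BlockSizeDemo and the file path below are placeholders for this sketch, assuming a Hadoop 2.x client on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // Override the client-side default block size (value in bytes); this is the
  // same dfs.blocksize property normally set in hdfs-site.xml
  conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

  FileSystem fs = FileSystem.get(conf);
  // Placeholder path used only for this sketch
  FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
  System.out.println("Block size of file: " + status.getBlockSize() + " bytes");
 }
}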

7. Why are HDFS blocks large compared to disk blocks (the HDFS default block size is 128 MB, while the disk block size in Unix/Linux is 8192 bytes)?
Answer:- HDFS is better suited to a large amount of data in a single file than to small amounts of data spread across many files. In order to minimise seek time during read operations, files are stored in large chunks on the order of the HDFS block size.
If a file is smaller than 128 MB, it only occupies its own size within a block; the remaining space is not wasted and stays available for other data.
If a particular file is 110 MB, will the HDFS block still consume 128 MB as the default size?
No, only 110 MB will be consumed by the HDFS block and the remaining 18 MB will be free to store something else.
Note:- In Hadoop 1 the default block size is 64 MB; in Hadoop 2 the default block size is 128 MB.

8. What is the significance of fault tolerance and high throughput in HDFS?
Answer:- Fault tolerance:- In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on a third system; the chance of data loss is minimised and data can be recovered if there is a failure at one node.
Throughput:- Throughput is the amount of work done in unit time. In HDFS, when a client submits a job, it is divided and shared among different systems. All the systems execute the tasks assigned to them independently and in parallel, so the work is completed in a very short period of time. In this way, HDFS provides good throughput.

9. What does "Replication factor" mean in Hadoop? What is default replication factor in HDFS ? How to modify default replication factor in HDFS ?
Answer
:- The number of times a file needs to be replicated in HDFS is termed as replication factor.
Default replication factor in HDFS is 3. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
We can change the replication factor on a per-file basis and on all files in the directory using hadoop FS shell.
$ hadoop fs –setrep –w 3 /MyDir/file
$ hadoop fs –setrep –w 3 -R /RootDir/Mydir
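
The replication factor of an existing file can also be changed programmatically via the FileSystem API; a minimal sketch follows, with ReplicationDemo and the file path as placeholders chosen for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  // Change the replication factor of an existing file (placeholder path);
  // returns true if the request was accepted by the NameNode
  boolean changed = fs.setReplication(new Path("/MyDir/file"), (short) 3);
  System.out.println("Replication change requested: " + changed);

  fs.close();
 }
}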

10. What are the DataNode and NameNode in HDFS?
Answer:- DataNodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.
The NameNode is the master node (on which the Job Tracker runs) and stores metadata about where data blocks are actually stored, so that it can manage the blocks present on the DataNodes. The NameNode should never be commodity hardware, because the entire HDFS relies on it; it has to be a high-availability machine.

11. Can the NameNode and DataNode have the same hardware configuration?
Answer:- In a single-node cluster there is only one machine, so the NameNode and DataNode run on the same machine. In a production environment, however, the NameNode and DataNodes are on different machines, and the NameNode should be a high-end, highly available machine.

12. What is the fundamental difference between a traditional RDBMS and Hadoop?
Answer:- A traditional RDBMS is used for transactional systems, whereas Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it.
- Data size: RDBMS handles data on the order of gigabytes; Hadoop handles data on the order of petabytes or zettabytes.
- Access: RDBMS supports interactive and batch access; Hadoop supports batch access only.
- Schema: RDBMS uses a static schema; Hadoop uses a dynamic schema.
- Scaling: RDBMS scales nonlinearly; Hadoop scales linearly.
- Integrity: RDBMS offers high integrity; Hadoop offers low integrity.
- Workload: RDBMS suits read-and-write-many-times workloads; Hadoop suits write-once, read-multiple-times workloads.

13. What is the secondary NameNode and what is its significance in Hadoop?
Answer:- In Hadoop 1, the NameNode was a single point of failure. To keep the Hadoop system up and running, it is important to make the NameNode resilient to failure and able to recover from it. If the NameNode fails, no data access is possible from the DataNodes, because the NameNode stores the metadata about the data blocks stored on the DataNodes.
The main file written by the NameNode is called fsimage; this file is read into memory, and all future modifications to the filesystem are applied to this in-memory representation. The NameNode does not write out new versions of fsimage as changes are applied; instead, it writes another file called edits, which is a list of the changes made since the last version of fsimage was written.
The secondary NameNode is used to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large. The secondary NameNode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the NameNode to perform the merge. It maintains a copy of the merged namespace image, which can be used in the event of the NameNode failing. However, the state of the secondary NameNode lags that of the primary, so in the event of total failure of the primary, some data loss is almost certain.
Note:- The secondary NameNode is not a standby for the primary NameNode, so it is not a substitute for it. Read in detail about NameNode, DataNode and secondary NameNode, and the internals of read and write operations in Hadoop.

14. What is the importance of the heartbeat in HDFS?
Answer:- A heartbeat is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and a Task Tracker sends heartbeats to the Job Tracker.
If the NameNode or Job Tracker does not receive a heartbeat, it decides that there is some problem with the DataNode, or that the Task Tracker is unable to perform the assigned task.

15. What is an HDFS cluster?
Answer:- HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. In other words, the collection of DataNode commodity machines and the high-availability NameNode machine is collectively termed the HDFS cluster. Read in detail about NameNode, DataNode and secondary NameNode.

16. What is the communication channel between the client and the NameNode/DataNode?
Answer:- Clients communicate with the NameNode using Hadoop RPC over TCP/IP, and read and write block data on DataNodes over a TCP-based streaming data-transfer protocol. (SSH is used by the cluster start-up and shutdown scripts, not for client communication.)

17. What is a rack? What is the Replica Placement Policy?
Answer:- A rack is a physical collection of DataNodes stored at a single location. There can be multiple racks in a single location.
When a client wants to load a file into the cluster, the content of the file is divided into blocks, and the NameNode provides information about three DataNodes for every block of the file, indicating where each block should be stored.
While placing the replicas, the key rule followed is "for every block of data, two copies will exist in one rack, and the third copy in a different rack". This rule is known as the "Replica Placement Policy".