Oct 4, 2015


Apache Hadoop installation on Ubuntu - Single Node Cluster setup (Apache hadoop 2.6.1 and ubuntu 13.04)

There are various Hadoop installation guides spread over the internet. I initially struggled to install Hadoop because no single post was sufficient for a successful installation, so I referred to several and compiled this post with all the minute details required, for my own reference and for others too.
Hadoop installation as a single node cluster can be divided into the following subtasks. The first three subtasks are prerequisites for the Hadoop installation.
1. Java installation - Remember, here I struggled a bit. Many guides instruct you to install Sun Java 6; however, Oracle (Sun) Java is no longer distributed by Ubuntu due to license issues, so it cannot be installed the way those guides describe. We have two options now: install OpenJDK or install Oracle JDK. I preferred Oracle JDK because I found it less buggy than OpenJDK, and it is free too; we just need to install it manually.
2. Create a dedicated Hadoop user:- In order to keep Hadoop separate from normal user accounts, it is recommended to create a new user with limited access and do the Hadoop installation under it. Unix is known for giving only as much access as is required.
3. SSH configuration and setup:- Hadoop requires SSH to manage its nodes and communicate with them, so we need to configure SSH.
4. Apache Hadoop distribution setup:- Download the Hadoop package and set up its configuration files (hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml).

Let's start with the Java installation, followed by the other subtasks. At the end of this post I believe we will be in a position to run "The most famous word count sample program" on our single node Hadoop cluster.

Java installation in Ubuntu:-

Java installation in Ubuntu has been discussed separately in the post "Oracle JDK installation in Ubuntu". Follow that post and install Java on your system. If Java is installed correctly, executing the command "java -version" should display output like the following (the JDK version and build numbers may differ).
zytham@ubuntu:~$ java -version 
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
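Besides "java -version", it can help to confirm that the JDK directory you plan to use as JAVA_HOME really contains an executable java binary. The following is a minimal sketch; the function name check_jdk is made up for illustration, and the path in the example comment is the one used later in this post, so adjust it to your own install.

```shell
# Hypothetical helper: succeed only if DIR looks like a usable JDK home,
# i.e. it contains an executable bin/java.
check_jdk() {
  dir="$1"
  if [ -x "$dir/bin/java" ]; then
    echo "OK: $dir/bin/java is executable"
  else
    echo "MISSING: no executable java under $dir/bin" >&2
    return 1
  fi
}

# Example (path assumed from the JAVA_HOME used in this post):
#   check_jdk /usr/local/java/jdk1.8.0_60
```

The function only looks at the file system; it does not run java, so it is safe to call before any environment variables are set.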

Create a dedicated Hadoop user:- 

It is recommended to use a dedicated Hadoop user account for running Hadoop, so that the Hadoop installation is segregated from other software applications and user accounts running on the same machine. (Tom White, in "Hadoop: The Definitive Guide", advocates creating a separate UNIX user for each Hadoop process and service: HDFS, MapReduce, YARN.)
Here we will create a new user and run all services in that user's context. First we create a group, then we add a new user to that group. Execute the following commands in sequence. Remember, creating a user and a group requires a privileged user (hence sudo).
zytham@ubuntu:~$ sudo addgroup hadoop
[sudo] password for zytham: 
Adding group `hadoop' (GID 1001) ...
Done.
zytham@ubuntu:~$ sudo adduser --ingroup hadoop hduser1
Adding user `hduser1' ...
Adding new user `hduser1' (1001) with group `hadoop' ...
Creating home directory `/home/hduser1' ...
Copying files from `/etc/skel' ...
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully
Changing the user information for hduser1
Enter the new value, or press ENTER for the default
 Full Name []: HDUSER1
 Room Number []: 2
 Work Phone []: NA
 Home Phone []: NA
 Other []: NA
Is the information correct? [Y/n] Y
Here hduser1 is just the name of the user; it can be anything. The funny part is that all pages over the internet create the Hadoop user and name it hduser. I wanted to break this pseudo rule and named my Hadoop user hduser1.  (:P)
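To double-check that the new user really landed in the new group, the output of "id -nG" can be inspected. A small sketch follows; in_group is a hypothetical helper name, and hduser1 and hadoop are simply the names chosen above.

```shell
# Hypothetical helper: succeed if USER is a member of GROUP.
# `id -nG USER` prints the user's group names separated by spaces.
in_group() {
  user="$1"
  group="$2"
  for g in $(id -nG "$user" 2>/dev/null); do
    if [ "$g" = "$group" ]; then
      return 0
    fi
  done
  return 1
}

# Example (names as used in this post):
#   in_group hduser1 hadoop && echo "hduser1 is in group hadoop"
```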

SSH configuration and setup SSH certificates:- 

Hadoop's control scripts require SSH to manage nodes and perform cluster-wide operations. Cluster operations can also be performed without SSH (using a dedicated shell or a dedicated Hadoop application); however, with SSH, generating a public/private key pair and storing it in a file system that is shared across the cluster gives the hdfs and yarn users seamless password-less login.
Since we have a single node set-up, we have to configure SSH access to localhost for the hduser1 user (created in the previous section). SSH consists of two components: the SSH client and the SSH daemon.
ssh : The command we use to connect to remote machines - the client.
sshd : The daemon that is running on the server and allows clients to connect to the server.
SSH installation:- 
For the sshd daemon to work, we need to install SSH. The ssh client is generally already present on Ubuntu; we just need to install the server side. Execute the following command to install ssh.
zytham@ubuntu:~$ sudo apt-get install ssh
Reading package lists... Done
Building dependency tree    
......

Package 'ssh' has no installation candidate
If the above command does not work, execute the commands below in sequence. (For me it did not work, so I had to add a repository and update first.)
zytham@ubuntu:~$ sudo apt-add-repository "deb http://archive.ubuntu.com/ubuntu precise main restricted"
zytham@ubuntu:~$ sudo apt-get update
zytham@ubuntu:~$ sudo apt-get install openssh-client=1:5.9p1-5ubuntu1
......
Setting up openssh-client (1:5.9p1-5ubuntu1) ...
Installing new version of config file /etc/ssh/moduli ...

zytham@ubuntu:~$ sudo apt-get install openssh-server
Reading package lists... Done
.....
ssh start/running, process 12267
Setting up ssh-import-id (2.10-0ubuntu1) ...
Processing triggers for ureadahead ...
Processing triggers for ufw ...

Verify that ssh and sshd are in place by executing the following commands:-
zytham@ubuntu:~$ which sshd
/usr/sbin/sshd
zytham@ubuntu:~$ which ssh
/usr/bin/ssh

Certificate generation:- 

An RSA key pair (public and private key) is generated and the public key is added to the authorized keys file; by doing so, we enable SSH access to our local machine with this newly created key. Execute the following commands to generate the public/private key pair using the RSA algorithm. First switch to hduser1, then generate the keys with an empty pass-phrase (the "" at the end of the second command).
zytham@ubuntu:~$ su hduser1
password:
hduser1@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser1/.ssh/id_rsa): 
Your identification has been saved in /home/hduser1/.ssh/id_rsa.
Your public key has been saved in /home/hduser1/.ssh/id_rsa.pub.
The key fingerprint is:
43:8b:a7:70:fc:d8:58:6c:56:a4:b6:58:4f:67:d1:3e hduser1@ubuntu
The key's randomart image is:
+--[ RSA 2048]----+
|          . ..   |
|         o   ..  |
|        = o o.   |
|     . B * o  E  |
|    . = S .    . |
|     o @ .       |
|      + o        |
|                 |
|                 |
+-----------------+
While executing the second command, a file name is asked for; just press Enter and it will take the default file names for both the public and the private key. (In id_rsa.pub, pub indicates the public key file.)
Now we need to add the newly created public key to the list of authorized keys so that Hadoop can use ssh without prompting for a password. Execute the following command for that:
hduser1@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
It's time to test the SSH setup by connecting to our local machine (localhost) as hduser1. This step also adds localhost's host key fingerprint to hduser1's known_hosts file. It is a one-time activity. Execute the following command:
hduser1@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 56:28:c8:c1:22:af:05:75:df:25:3a:89:6c:e1:72:b4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hduser1@localhost's password: 
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)
.......
If you get something like the above, you are ready for the Hadoop installation. If not, please debug with the following command (run as hduser1):
hduser1@ubuntu:~$ ssh -vvv localhost
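One thing worth knowing while debugging: with an empty pass-phrase and the public key in authorized_keys, "ssh localhost" is expected to log in without asking for a password. If a password prompt still shows up, sshd probably did not accept the key, and a common cause is permissions that are too open on ~/.ssh or authorized_keys (sshd checks these when its StrictModes option is on, which is the default). The sketch below tightens them; fix_ssh_perms is a hypothetical helper name, not part of OpenSSH.

```shell
# Hypothetical helper: tighten permissions on an SSH directory so that
# sshd (with its default StrictModes setting) will accept the key.
fix_ssh_perms() {
  ssh_dir="$1"
  chmod 700 "$ssh_dir" || return 1
  if [ -f "$ssh_dir/authorized_keys" ]; then
    chmod 600 "$ssh_dir/authorized_keys" || return 1
  fi
}

# Example, run as hduser1:
#   fix_ssh_perms "$HOME/.ssh"
#   ssh -o BatchMode=yes localhost true && echo "passwordless login works"
```

BatchMode=yes tells the ssh client never to prompt, so the second command fails immediately instead of asking for a password when key authentication is not working.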

Hadoop distribution set-up/Installation :-

We have fulfilled all the prerequisites and are in a position to install Hadoop. Hadoop installation is all about
1. Downloading the Hadoop distribution package (for example, hadoop-2.6.1.tar.gz) and
2. Configuring the various XML files which drive the functioning of the Hadoop cluster.

Download Hadoop distribution package:- Download the Hadoop distribution from the Hadoop core distribution download page; I have downloaded hadoop-2.6.1.tar.gz for this installation. Another way to download the distribution is to execute the following command:
zytham@ubuntu:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz
For me, hadoop-2.6.1.tar.gz was downloaded into the Downloads folder. Go to the Downloads folder and extract the distribution. Like the Java installation, we will set Hadoop up under /usr/local, in /usr/local/hadoop2.6.1. Execute the following commands to extract the archive, move the extracted directory to /usr/local/hadoop2.6.1 and verify that there is a directory "hadoop2.6.1" under /usr/local.
zytham@ubuntu:~$ cd Downloads
zytham@ubuntu:~/Downloads$ sudo tar -xvzf hadoop-2.6.1.tar.gz
zytham@ubuntu:~/Downloads$ sudo mv ./hadoop-2.6.1 /usr/local/hadoop2.6.1
zytham@ubuntu:~/Downloads$ cd /usr/local/
zytham@ubuntu:/usr/local$ ls
Now transfer the ownership of the hadoop2.6.1 directory to hduser1 and the hadoop group. Execute the following command for the same:
zytham@ubuntu:/usr/local$ sudo chown -R  hduser1:hadoop /usr/local/hadoop2.6.1/

Set-up Configuration Files:- In order to complete the Hadoop installation, we need to update ~/.bashrc and four Hadoop configuration files. Let's start with .bashrc, followed by the Hadoop files.
Open .bashrc with the following command.
zytham@ubuntu:/usr/local$ gedit ~/.bashrc
Copy the following lines, append them at the end of the file and save it.
#added for hadoop installation
#HADOOP VARIABLES START
export JAVA_HOME=/usr/local/java/jdk1.8.0_60
export HADOOP_INSTALL=/usr/local/hadoop2.6.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
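Changes to .bashrc only take effect in a new terminal or after the file is re-read (". ~/.bashrc"), and they only apply to the user whose .bashrc was edited. A quick way to see whether the current shell picked them up is sketched below; missing_hadoop_vars is a hypothetical helper that simply lists the variables from the block above that are still unset or empty.

```shell
# Hypothetical helper: print the names of the expected variables that are
# unset or empty in the current shell (prints nothing if all are set).
missing_hadoop_vars() {
  for name in JAVA_HOME HADOOP_INSTALL HADOOP_MAPRED_HOME \
              HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Example:
#   . ~/.bashrc
#   missing_hadoop_vars
```

If the function prints any name after re-reading .bashrc, the corresponding export line is missing or misspelled.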

Now update hadoop configuration files.
1. Update hadoop-env.sh : Open hadoop-env.sh and update JAVA_HOME. Hadoop reads the JAVA_HOME setting when it starts.
zytham@ubuntu:~$ cd /usr/local/hadoop2.6.1/etc/hadoop
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ gedit hadoop-env.sh
Update JAVA_HOME with your JDK directory (for me it is /usr/local/java/jdk1.8.0_60). After the update, hadoop-env.sh should look like the following:
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/local/java/jdk1.8.0_60
2. Update the *-site.xml files under etc/hadoop:- core-site.xml, mapred-site.xml, hdfs-site.xml
Update core-site.xml :- Create a directory that Hadoop will use as the base for its temporary data and assign hduser1 as the owner of this directory. The path of this newly created directory is configured in core-site.xml. Execute the following commands and update core-site.xml.
zytham@ubuntu:~$ sudo mkdir -p /app/hadoop2.6.1/tmp
zytham@ubuntu:~$ sudo chown hduser1:hadoop /app/hadoop2.6.1/tmp
zytham@ubuntu:~$ sudo gedit /usr/local/hadoop2.6.1/etc/hadoop/core-site.xml
Update the <configuration></configuration> node with the following, and note that "/app/hadoop2.6.1/tmp" is set as the value of the property hadoop.tmp.dir.
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop2.6.1/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>

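After editing a *-site.xml file by hand, it is easy to end up with a typo in a property name or value. The sketch below reads a single value back out of such a file so it can be checked from the shell. get_prop is a hypothetical helper, not a Hadoop command, and it assumes that each <name> and <value> element sits on one line, as in the listing above. Also worth knowing: in Hadoop 2.x the key fs.default.name is a deprecated alias of fs.defaultFS, so either name may appear in other guides.

```shell
# Hypothetical helper: print the value of property NAME from a Hadoop
# *-site.xml FILE. Assumes <name> and <value> are each on a single line.
get_prop() {
  awk -v want="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>/, "", n)
      sub(/<\/name>.*/, "", n)
    }
    /<value>/ {
      v = $0
      sub(/.*<value>/, "", v)
      sub(/<\/value>.*/, "", v)
      if (n == want) { print v; exit }
    }
  ' "$1"
}

# Example (path as used in this post):
#   get_prop /usr/local/hadoop2.6.1/etc/hadoop/core-site.xml fs.default.name
```

This is a line-based check, not an XML parser, so it is only meant for files laid out like the ones in this post.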
Update mapred-site.xml:- We need to create mapred-site.xml from mapred-site.xml.template, present in the /usr/local/hadoop2.6.1/etc/hadoop/ directory. Execute the following commands to create mapred-site.xml.
zytham@ubuntu:~$ cd /usr/local/hadoop2.6.1/etc/hadoop/
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ sudo cp mapred-site.xml.template mapred-site.xml
Now open mapred-site.xml and update the configuration node. Execute the following command for the same.
zytham@ubuntu:/usr/local/hadoop2.6.1/etc/hadoop$ sudo gedit mapred-site.xml
Use the following to update the <configuration></configuration> node.
<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>

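A side note, offered as an assumption rather than as part of the setup above: mapred.job.tracker is a Hadoop 1 style setting, and in Hadoop 2.x MapReduce jobs are only submitted to YARN when mapreduce.framework.name is set to yarn (with the NodeManager's mapreduce_shuffle auxiliary service enabled in yarn-site.xml); otherwise they run in local mode. The sketch below writes example fragments into a scratch directory so they can be compared with, or merged by hand into, the real files. The helper name and the scratch path are made up for illustration.

```shell
# Hypothetical helper: write example mapred-site.xml and yarn-site.xml
# fragments into OUT_DIR (a scratch directory, not the live config directory).
write_yarn_mr_examples() {
  out_dir="$1"
  mkdir -p "$out_dir" || return 1
  cat > "$out_dir/mapred-site.xml" <<'EOF'
<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
</configuration>
EOF
  cat > "$out_dir/yarn-site.xml" <<'EOF'
<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>
</configuration>
EOF
}

# Example:
#   write_yarn_mr_examples /tmp/hadoop-conf-examples
```

The setup in this post works without these properties; they only matter if you want jobs to show up in the YARN ResourceManager.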
Update hdfs-site.xml:- hdfs-site.xml has to be configured on every host in the cluster (NameNode and DataNodes). It contains information about the directories which will be used by the NameNode and the DataNode on that host. Since we are working on a single node setup, we need to create two directories, one for the NameNode and another for the DataNode, and transfer their ownership to hduser1 and the hadoop group.
zytham@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
zytham@ubuntu:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
zytham@ubuntu:~$ sudo chown -R hduser1:hadoop /usr/local/hadoop_store
Now open hdfs-site.xml:
zytham@ubuntu:~$ sudo gedit /usr/local/hadoop2.6.1/etc/hadoop/hdfs-site.xml
Use the following to update the <configuration></configuration> node:
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>
Now we are almost done with the Hadoop set-up; we just need to format the Hadoop file system, which is implemented on top of the local file system. Formatting of the HDFS file system is done via the NameNode.
Note:- It is a one-time activity. If we format the file system again later, all data stored until that time will be lost.
Format the HDFS file system using the following commands:
zytham@ubuntu:~$ su hduser1
password: 
hduser1@ubuntu:~$ cd /usr/local/hadoop2.6.1/bin
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/10/04 03:33:34 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.1
.....
.......
15/10/04 03:33:34 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Note:-
1. Use ./ if you are running the command from inside the bin directory; otherwise it gives the error: hadoop: command not found
2. If any error occurs, we can go to the logs (/usr/local/hadoop2.6.1/logs) and inspect "hadoop-hduser1-namenode-ubuntu.log"; the rest Google knows  :)
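Since formatting a second time wipes HDFS, a small guard before the format command can prevent accidents. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file; namenode_is_formatted is a hypothetical helper name, and the directory in the example is the dfs.namenode.name.dir configured earlier in this post.

```shell
# Hypothetical helper: succeed if NAME_DIR already holds a formatted NameNode.
# A formatted name directory contains a current/VERSION file.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Example, run as hduser1:
#   if namenode_is_formatted /usr/local/hadoop_store/hdfs/namenode; then
#     echo "already formatted, skipping"
#   else
#     /usr/local/hadoop2.6.1/bin/hdfs namenode -format
#   fi
```

The example calls hdfs rather than hadoop, which is what the deprecation message in the output above recommends.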

Start services :-  
Setup is done; time to test our Hadoop installation.
Now start all services in the pseudo Hadoop cluster (a single node, that's why it is a pseudo cluster) using the following command (or use the individual commands start-dfs.sh and start-yarn.sh). For the time being we will use one command to start all services. Go to the sbin directory inside the Hadoop installation and execute start-all.sh.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ cd /usr/local/hadoop2.6.1/sbin
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/10/04 03:41:25 WARN util.NativeCodeLoader: 
Unable to load native-hadoop library for your platform...
 using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode,
 logging to /usr/local/hadoop2.6.1/logs/hadoop-hduser1-namenode-ubuntu.out
localhost: starting datanode, 
logging to /usr/local/hadoop2.6.1/logs/hadoop-hduser1-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
.....
starting yarn daemons
starting resourcemanager, 
logging to /usr/local/hadoop2.6.1/logs/yarn-hduser1-resourcemanager-ubuntu.out
localhost: starting nodemanager,
logging to /usr/local/hadoop2.6.1/logs/yarn-hduser1-nodemanager-ubuntu.out

We can verify whether the services started using the netstat command. (jps can also be used. It ships with the JDK, Oracle JDK included, so if the command is not found, the JDK's bin directory is probably missing from PATH.) Execute the following command and look at the port numbers in its output.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ netstat -plten | grep java 
You should see several java processes in LISTEN state, among them the ports used later in this post (50070, 50075, 50090 and 54310). It indicates that the Hadoop services are running.
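Instead of scanning the netstat output by eye, a specific port can be checked with a small filter. This is only a sketch: listening_on is a hypothetical helper that reads netstat-style lines on standard input and looks for the port at the end of the local address column (the fourth column).

```shell
# Hypothetical helper: read `netstat -tln` style lines on stdin and succeed
# if some socket is listening on PORT (matched at the end of the 4th column,
# the local address).
listening_on() {
  awk -v port="$1" '
    $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Example:
#   for p in 50070 50075 50090 54310; do
#     netstat -tln | listening_on "$p" && echo "port $p is up"
#   done
```

The port list in the example is taken from the URLs and the fs.default.name value used in this post.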

Stop services:- 
If services can be started, there should be some way to stop them. Executing ./stop-all.sh stops all services in one shot (or use stop-dfs.sh and stop-yarn.sh to stop the HDFS and YARN daemons individually).
Do not execute it now; believe me, it works. We will stop the services after viewing the Hadoop web user interface and verifying that the daemons are running. For reference, the following command stops all services.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-all.sh

Hadoop Web interfaces:- 

Open your browser and hit following urls in different tabs:
http://localhost:50070   http://localhost:50075   http://localhost:50090/
http://localhost:50070 displays the NameNode information.
Please note that localhost:54310, shown on that page as the NameNode address, is not a randomly assigned port. While configuring core-site.xml, the value assigned to the property "fs.default.name" was "hdfs://localhost:54310".

Now stop all services by executing the following commands and refresh the tabs opened earlier; they should stop displaying the NameNode, DataNode and Secondary NameNode information.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-dfs.sh
15/10/04 05:02:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library
 for your platform... 
using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
15/10/04 05:03:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where 
applicable
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
Once you refresh the tabs now, you should see the most famous web error:
Unable to connect
  Firefox can't establish a connection to the server at localhost:50070.
This is all about Apache Hadoop 2.6.1 installation on Ubuntu 13.04.

We are not done yet!! Only the installation is complete. In the next post we will run the world famous "Map-reduce word count" sample program on our single node pseudo cluster and view the processed output.

References :-
1. Michael G. Noll, http://www.michael-noll.com/
2. Tom White, Hadoop: The Definitive Guide
