Oct 5, 2015

Hadoop MapReduce word count example - execute Wordcount jar on single node cluster

In the previous post we installed Apache Hadoop 2.6.1 on Ubuntu 13.04. This post runs the well-known MapReduce word count sample program on that single-node Hadoop cluster. Word count is the "Hello world" program of the MapReduce world. Before executing the sample program, we need to download the input files and upload them to the Hadoop file system (HDFS).

Download input data from the following URLs:-

Download each of the text files from the URLs below and store them in a directory of your choice. I downloaded them to /home/zytham/Downloads/hadoop_data.
1. http://www.gutenberg.org/cache/epub/20417/pg20417.txt
2. http://www.gutenberg.org/files/5000/5000-8.txt
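
If you prefer to fetch the files from the terminal, the download step can be sketched as a small shell function. This is only a sketch: it assumes wget is installed, the function name download_inputs and the variable INPUT_URLS are made up for illustration, and the URLs are simply the ones listed above, which Project Gutenberg may have moved since.

```shell
#!/bin/sh
# Input URLs copied from the list above (they may have moved since).
INPUT_URLS="http://www.gutenberg.org/cache/epub/20417/pg20417.txt
http://www.gutenberg.org/files/5000/5000-8.txt"

# Hypothetical helper: downloads every URL in INPUT_URLS into the
# directory passed as the first argument.
download_inputs() {
    dest=$1
    mkdir -p "$dest"
    for url in $INPUT_URLS; do
        # -nc skips files that already exist; -P sets the target directory.
        wget -nc -P "$dest" "$url"
    done
}
```

Call it as: download_inputs /home/zytham/Downloads/hadoop_data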

Upload input files to HDFS:-

Switch to hduser1 if you are not already in that context. Remember that during the Hadoop 2.6.1 installation on Ubuntu 13.04 we created hduser1 and set up Hadoop under that user.
Start hadoop services:- First start the Hadoop cluster using the following commands.
hduser1@ubuntu:~$ cd /usr/local/hadoop2.6.1/sbin
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./start-all.sh
Note:- If you do not start the services and then try to upload files to HDFS, you will get an error like the following:
Call From ubuntu/127.0.1.1 to localhost:54310 failed on connection exception: java.net.ConnectException: Connection refused;
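
A quick way to avoid the error above is to check which Hadoop daemons are running before uploading. The sketch below parses the output of jps (the JDK tool that lists running Java processes) and reports the daemons that are absent. The daemon list is an assumption based on what start-all.sh normally launches on a Hadoop 2.x single-node setup, and the function name missing_daemons is made up for illustration.

```shell
#!/bin/sh
# Daemons that start-all.sh normally launches on a Hadoop 2.x single-node
# setup (assumed list; adjust it if your configuration differs).
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Reads jps output ("<pid> <ClassName>" per line) on stdin and prints
# the name of every expected daemon that is not listed.
missing_daemons() {
    running=$(awk '{print $2}')
    for daemon in $EXPECTED_DAEMONS; do
        # -x matches whole lines only, so "NameNode" does not match
        # "SecondaryNameNode".
        if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
            echo "$daemon"
        fi
    done
}
```

Run it as: jps | missing_daemons. Empty output means all the expected daemons are up. For the upload step only NameNode and DataNode matter.
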
Copy local files to HDFS:- Copy the downloaded files from /home/zytham/Downloads/hadoop_data to the Hadoop file system (HDFS, a file system managed by Hadoop). Execute the following commands to create an HDFS directory and copy the files from the local file system into it.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hdfs dfs -mkdir -p /user/hduser1/hdfsdata/hadoop_data
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop dfs -copyFromLocal /home/zytham/Downloads/hadoop_data/* /user/hduser1/hdfsdata/hadoop_data
To verify that all three files were copied to HDFS, execute the following command; it should list all three files.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop dfs -ls /user/hduser1/hdfsdata/hadoop_data
Want to verify that an HDFS directory is different from the local file system? Try the following command: you will not be able to access the directory from the shell, because it is visible only to Hadoop and managed by it.
hduser1@ubuntu:/usr/local/hadoop2.6.1$ cd /user/hduser1/hdfsdata/hadoop_data

bash: cd: /user/hduser1/hdfsdata/hadoop_data: No such file or directory

Run map-reduce Hadoop word count example:- 

For convenience I have created a word count sample program jar. Download the jar and save it in a directory of your choice; I placed it at "/home/zytham/hadoop_poc/WordcountSample.jar". Now execute the jar on the single-node (pseudo-distributed) Hadoop cluster. The general form of the command is shown first, followed by the actual command and its output.
./hadoop jar <word_count_sample_jar> <classNameOfSampleJar> <Input_files_location> <Output_directory_location>

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop jar /home/zytham/hadoop_poc/WordcountSample.jar WordCountExample /user/hduser1/hdfsdata/hadoop_data /user/hduser1/wordcountOuput

15/10/04 15:29:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/04 15:29:36 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/10/04 15:29:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/10/04 15:29:37 INFO input.FileInputFormat: Total input paths to process : 3
..........................
..........................
15/10/04 15:29:43 INFO mapred.LocalJobRunner: reduce task executor complete.
15/10/04 15:29:43 INFO mapreduce.Job:  map 100% reduce 100%
15/10/04 15:29:43 INFO mapreduce.Job: Job job_local884144492_0001 completed successfully
15/10/04 15:29:44 INFO mapreduce.Job: Counters: 38
 File System Counters
  FILE: Number of bytes read=4011472
  FILE: Number of bytes written=8420485
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=11928267
  HDFS: Number of bytes written=883509
  HDFS: Number of read operations=37
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=6
 Map-Reduce Framework
  Map input records=78578
  Map output records=629920
  Map output bytes=6083556
  Map output materialized bytes=1462980
  Input split bytes=397
  Combine input records=629920
  Combine output records=101397
  Reduce input groups=82616
  Reduce shuffle bytes=1462980
  Reduce input records=101397
  Reduce output records=82616
  Spilled Records=202794
  Shuffled Maps =3
  Failed Shuffles=0
  Merged Map outputs=3
  GC time elapsed (ms)=180
  CPU time spent (ms)=0
  Physical memory (bytes) snapshot=0
  Virtual memory (bytes) snapshot=0
  Total committed heap usage (bytes)=807419904
 Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
 File Input Format Counters 
  Bytes Read=3676562
 File Output Format Counters 
  Bytes Written=883509
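
The counters above fit together as follows: Map output records (629920) equals Combine input records, Combine output records (101397) equals Reduce input records, and Reduce input groups equals Reduce output records (82616). The sketch below imitates that map, combine and reduce flow on three tiny made-up splits using standard Unix tools. It illustrates the idea only and is not the code inside WordcountSample.jar; the sample sentences, the file names and the whitespace tokenisation are assumptions.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Three tiny stand-in input splits (hypothetical data, one per mapper).
printf 'to be or not to be\n' > "$work/split1.txt"
printf 'to do is to be\n'     > "$work/split2.txt"
printf 'do be do\n'           > "$work/split3.txt"

map_records=0
combine_records=0
for split in "$work"/split*.txt; do
    # Map: emit one record per whitespace-separated token.
    tr -s '[:space:]' '\n' < "$split" | sed '/^$/d' > "$split.map"
    # Combine: pre-aggregate the counts within this mapper's output only.
    sort "$split.map" | uniq -c | awk '{print $2 "\t" $1}' > "$split.combined"
    map_records=$((map_records + $(wc -l < "$split.map")))
    combine_records=$((combine_records + $(wc -l < "$split.combined")))
done

# Reduce: sum the partial counts for each word across all mappers.
cat "$work"/*.combined |
    awk -F'\t' '{sum[$1] += $2} END {for (w in sum) print w "\t" sum[w]}' |
    sort > "$work/part-r-00000"
reduce_records=$(($(wc -l < "$work/part-r-00000")))

echo "Map output records=$map_records"
echo "Combine output records=$combine_records"
echo "Reduce output records=$reduce_records"
```

For this sample data it reports 14 map output records, 10 combine output records and 6 reduce output records. The combiner shrinks what has to be shuffled to the reducer without changing the final counts.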

If your output is similar to the above, you are on the right track: the output of this MapReduce program is stored in "/user/hduser1/wordcountOuput". Let us now look at the output produced by Hadoop.
First verify the output directory and see which files it contains by executing the following command.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop dfs -ls /user/hduser1/wordcountOuput
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/10/04 15:33:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser1 supergroup          0 2015-10-04 15:29 /user/hduser1/wordcountOuput/_SUCCESS
-rw-r--r--   1 hduser1 supergroup     883509 2015-10-04 15:29 /user/hduser1/wordcountOuput/part-r-00000

Now execute the following command to see the processed output in the terminal (the output shown below is only partial; scroll to see the complete output):
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop dfs -cat /user/hduser1/wordcountOuput/part-r-00000
........
.......
worst 10
worst. 1
worsted 2
worsted! 1
worsting 1
worth 36
worth. 5
worth._ 2
worthful 1
worthier 1
worthless. 1
worthy 21
worthy, 1


æsthetic 1
è 3
état_. 1
� 5
�: 1
�crit_ 1
�pieza; 1

Using the "getmerge" command we can download the MapReduce output to the local file system. Use the following command to merge the output files present in the HDFS output folder into a single local file.
hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop dfs -getmerge /user/hduser1/wordcountOuput /tmp/wordCountLocal
getmerge writes the merged output to a local file, so before executing this command make sure the parent directory exists, or change the destination accordingly (here "/tmp/wordCountLocal" is my local output file).
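
Once the merged file is on the local file system, ordinary Unix tools can be used to inspect it. As a minimal sketch, the function below prints the most frequent words. It assumes each line has the form word, whitespace, count (the default text output of a word count job) and that words contain no whitespace; the function name top_words is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints the N most frequent words (default 10) from
# a "word count" file such as the one produced by getmerge.
top_words() {
    file=$1
    n=${2:-10}
    # Sort by the count column, numerically and descending; ties are
    # broken by the word itself.
    sort -k2,2nr -k1,1 "$file" | head -n "$n"
}
```

For example: top_words /tmp/wordCountLocal 20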

