Jan 9, 2016

Textual description of firstImageUrl

Development and deployment of Spark applications using SBT(Scala build tool)

Apache spark is an in-memory computation framework in Hadoop ecosystem. Apache spark allows developer to write application code in Scala,Python,R and Java.The main agenda of this post is to write a spark application in Scala and deploy using SBT(Scala build tool).

Prerequisite :- Apache spark and Scala should be installed. Here I am using Spark-1.5.2 and Scala 2.10.6. First we will install SBT and followed by configuring assembly plugin required for build and then create sample spark application.Internet connection is mandatory for packaging project first time.

How to check spark and Scala is set up or not ? 
zytham@ubuntu:~$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
Note:- If SPARK_HOME/bin is not in path variable, go to SPARK_HOME/bin and execute spark-shell command. If you do not get prompt, first install Scala and Apache spark, then follow this tutorial.

1. SBT installation:-
SBT is an open source build tool for Scala and Java projects, similar to Java's Maven or Ant.SBT is the de facto build tool for the Scala community.Execute following command to download SBT tar ball and extract it.
zytham@ubuntu:~$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.8/sbt-0.13.8.tgz
.....
Length: 1059183 (1.0M) [application/unknown]
Saving to: ‘sbt-0.13.8.tgz’

100%[======================================>] 10,59,183   17.0KB/s   in 26s    

2016-01-09 21:49:11 (39.5 KB/s) - ‘sbt-0.13.8.tgz’ saved [1059183/1059183]
Extract tar ball using following command.
zytham@ubuntu:~$ tar -xvf sbt-0.13.8.tgz 
sbt/
sbt/conf/
sbt/conf/sbtconfig.txt
sbt/conf/sbtopts
sbt/bin/
sbt/bin/sbt.bat
sbt/bin/sbt
sbt/bin/sbt-launch.jar
sbt/bin/sbt-launch-lib.bash
Move extracted files at some location and verify using all SBT files are in place.
zytham@ubuntu:~$ sudo mv sbt /opt
zytham@ubuntu:~$ cd /opt/
zytham@ubuntu:/opt$ ls
data                drill    eclipse.desktop  sbt        spark        zookeeper
datastax-ddc-3.2.1  eclipse  gnuplot-5.0.1    scala2.10  spark-1.5.2
In order to create and build projects from any directory using sbt, we nee do add sbt executable into the PATH shell variable. Add sbt bin in path variable in bashrc file using.
zytham@ubuntu:/opt/spark-1.5.2/bin$ gedit ~/.bashrc
Add these two lines at the end  of the file.
export SBT_HOME=/opt/sbt/
export PATH=$SBT_HOME/bin:$PATH

2. Install sbt assembly plugin:- 
sbt-assembly is an sbt plugin that creates a JAR out of our project with all of its dependencies except Hadoop and spark dependencies(These are termed as provided dependencies and provided by cluster itself at runtime).SBT manages a plugin definition file and we need to make entry in that file for any new entry(similar to pom.xml in Maven).
There are two ways to add sbt-assembly to plugin definition file (if existing or create one if doesn’t exist). we can use either :
  • the global file (for version 0.13 and up) at ~/.sbt/0.13/plugins/plugins.sbt
    OR 
  • the project-specific file at PROJECT_HOME/project/plugins.sbt 
Here we are using global definition file. Since plugin definition file does not exist, create a new file plugin.sbt and add sbt-assembly entry in it and press <Ctrl+D> to exit.
zytham@ubuntu:/opt$ mkdir -p ~/.sbt/0.13/plugins
zytham@ubuntu:/opt$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

3. Creating sample spark application:- Word count example
Load an input file and create an RDD. Count all words and display on console using collect() method.
  1. Create a project directory name it as "WordCountExample" followed by directory structure /src/main/scala/
    zytham@ubuntu:~$ mkdir WordCountExample
    zytham@ubuntu:~$ cd WordCountExample/
    zytham@ubuntu:~/WordCountExample$ mkdir -p src/main/scala
    
  2. Create a scala file with following code lines.
    zytham@ubuntu:~$ cd src/main/scala
    zytham@ubuntu:~/WordCountExample/src/main/$ gedit Wordcount.scala
    
    Copy below sample code lines in Wordcount.scala
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
    object WordCount {
      def main(args: Array[String]) = {
    
        //Start the Spark context
        val conf = new SparkConf()
          .setAppName("WordCount")
          .setMaster("local")
        val sc = new SparkContext(conf)
    
        //Read some example file to a test RDD
        val test = sc.textFile("input.txt")
    
        test.flatMap { line => //for each line
          line.split(" ") //split the line in word by word.
        }
          .map { word => //for each word
            (word, 1) //Return a key/value tuple, with the word as key and 1 as value
          }
          .reduceByKey(_ + _) //Sum all of the value with same key
          .saveAsTextFile("output.txt") //Save to a text file
    
        //Stop the Spark context
        sc.stop
      }
    }
    
  3. In project home directory create a .sbt configuration file with following lines.
    zytham@ubuntu:~/WordCountExample/src/main/scala$ cd ~/WordCountExample/
    zytham@ubuntu:~/WordCountExample$ gedit WordcountExample.sbt
    
    Configuration file lines
    name := "WordCount Spark Application"
    version := "1.0"
    scalaVersion := "2.10.6"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
  4. View project directory structure and files.
    zytham@ubuntu:~/WordCountExample$ find .
    .
    ./WordcountExample.sbt
    ./src
    ./src/main
    ./src/main/scala
    ./src/main/scala/Wordcount.scala
    
4.Build/package using sbt:- 
zytham@ubuntu:~/WordCountExample$ sbt package
[info] Loading global plugins from /home/zytham/.sbt/0.13/plugins
.....
[info] Compiling 1 Scala source to /home/zytham/WordCountExample/target/scala-2.10/classes...
[info] Packaging /home/zytham/WordCountExample/target/scala-2.10/wordcount-spark-application_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 101 s, completed Jan 31, 2016 11:42:25 AM
Note:- It may take some time, since it downloads some jar files and internet connection is mandatory. On successful build it creates a jar file(wordcount-spark-application_2.10-1.0.jar) at location "<Project_ome>/target/scala-2.10". (Name of directory and jar file might be different depending on what we have configured in configuration file Wodcountexample.sbt)

5. Deploy generated jar/Submit job to spark cluster:- 
spark-submit(present in <SPARK_HOME>/bin) executable is used to submit job in spark cluster.Use following command. Download input file from here and place it in home directory.
zytham@ubuntu:~/WordCountExample$ spark-submit --class "WordCount" --master local[2] target/scala-2.10/wordcount-spark-application_2.10-1.0.jar 
On successful execution, an output directory is created with name "ouput.txt" and file part-00000 contains (word and count) pairs.Execute following command to see output and verify the same.
zytham@ubuntu:~/WordCountExample$ cd output.txt/
zytham@ubuntu:~/WordCountExample/output.txt$ ls
part-00000  _SUCCESS
zytham@ubuntu:~/WordCountExample/output.txt$ head -10 part-00000 
(spark,2)
(is,1)
(Learn,1)
(This,1)
(time,1)

Explanation of word count example 
:- On applying flatmap unction on RDD test, each line is split with respect to space and array of string is obtained. This string array is converted into map with each word of list as key and 1 as value (collection of tuple is produced).Finally, reduceByKey is applied on for each tuple and aggregated output (unique word and corresponding count) is written to file. Lets take an example and understand the flow of method used in the above program unit.Suppose input.txt has two lines :
This is spark time
Learn spark
Flow of method's used in word count example 
An eclipse project can also be created using sbt, just we need to add an entry for sbt-eclipse in plugin configuration file in ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
and using "sbt eclipse" command  instead of "sbt package" eclipse project can be created.
zytham@ubuntu:~/WordCountExample$ sbt eclipse
[info] Loading global plugins from /home/zytham/.sbt/0.13/plugins
[info] Set current project to WordCount Spark Application (in build file:/home/zytham/WordCountExample/)
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] WordCount Spark Application
Now in scala IDE, we can import this spark application and execute it from there too.

Download scala IDE
:-
Execute following commands to download and extract tar ball.
zytham@ubuntu:~/Downloads$ wget http://downloads.typesafe.com/scalaide-pack/4.1.1-vfinal-luna-211-20150728/scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
zytham@ubuntu:~/Downloads$ tar -xvf scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar
For running eclipse IDE, execute following command form the directory where it has been extracted.
zytham@ubuntu:~/Downloads$ ~/eclipse/eclipse

Location: Hyderabad, Telangana, India