Jan 1, 2016

Textual description of firstImageUrl

Setup Apache Spark in eclipse(Scala IDE) : Word count example using Apache spark in Scala IDE

Apache spark - a very known in memory computing engine to process big data workloads. Scala IDE(an eclipse project) can be used to develop spark application. The main agenda of this post is to setup development environment for spark application in scala IDE and run word count example.

Download Scala IDE:- 
Scala IDE is an eclipse project which provides a very intuitive development environment for Scala and Spark application. Download Scala IDE and install it.  

Create a Maven project:-
Maven is a popular package management tool for Java-based languages that allows us to link libraries present in public repositories.We can use Maven itself to build our project, or use other tools like Scala’s sbt tool or Gradle.
1. Go to: File-> New -> Project -> Maven project  and create a maven project.Fill Group Id and Artifact Id & click finish.
Group Id = com.devinline.spark and Artifact Id = SparkSample

 Update pom.xml:- Download pom.xml sample and update it in above maven project. It has spark dependency jar entry which will be downloaded while building. 

3. Add Scala Nature to this project :- 
Right click on project -> configure - > Add Scala Nature. 

4. Update Scala compiler version for Spark:- 
Scala IDE by default uses latest version(2.11) of Scala compiler, however Spark uses version 2.10.So we need to update appropriate version for IDE. 
Right click on project- > Go to properties -> Scala compiler -> update Scala installation version to 2.10.5
5. Remove Scala Library Container from build path :- (Optional)
Jars required in already added via spark core(via pom.xml), so multiple jars is not required.
Right click on the project -> Build path -> Configure build path  and remove Scala Library Container.

6. Update source folder src/main/java to src/main/scala (Right click -> Refactor -> Rename  to scala).Now create a package under this name it as com.devinline.spark.

7. Create a Scala object under package created above name it as WordCount.scala
Right click on package -> New -> Scala Object  and add WordCount at the end of Name.

8. Update WordCount.scala with following code lines
package com.devinline.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
  def main(args: Array[String]) = {

    //Start the Spark context
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    //Read some example file to a test RDD
    val test = sc.textFile("input.txt")

    test.flatMap { line => //for each line
      line.split(" ") //split the line in word by word.
      .map { word => //for each word
        (word, 1) //Return a key/value tuple, with the word as key and 1 as value
      .reduceByKey(_ + _) //Sum all of the value with same key
      .saveAsTextFile("output.txt") //Save to a text file

    //Stop the Spark context
Explanation:- On applying flatmap unction on RDD test, each line is split with respect to space and array of string is obtained. This string array is converted into map with each word of list as key and 1 as value (collection of tuple is produced).Finally, reduceByKey is applied on for each tuple and aggregated output (unique word and corresponding count) is written to file. Lets take an example and understand the flow of method used in the above program unit.Suppose input.txt has two lines :
 This is spark time
 Learn spark
Flow of method's used in word count example  

9. Download sample input file and place is at some location as per your convenience. Modify location of input.txt in above sample code accordingly(sc.textFile("<Your_input.txt_Location>")).

10. Execute wordcount program :-  Right click on WordCount.scala - > Run as -> Scala application. It should create an output directory output.txt  and it should contain two file : part-00000 and _SUCCESS.
Sample output in part-00000 is :-

Location: Hyderabad, Telangana, India


  1. Good job! Fruitful article. I like this very much. It is very useful for my research. It shows your interest in this topic very well. I hope you will post some more information about the software. Please keep sharing!!
    Hadoop Training in Chennai
    Big Data Training in Chennai
    Blue Prism Training in Chennai
    CCNA Course in Chennai
    Cloud Computing Training in Chennai
    Data Science Course in Chennai
    Big Data Training in Chennai Annanagar
    Hadoop Training in Velachery


  2. I admire this article for the well-researched content and excellent wording. I got so involved in this material that I couldn’t stop reading. I am impressed with your work and skill. Thank you so much.
    WordPress website development Chennai

  3. Home buying mistakes costs the investors more. Thanks for sharing an informative post. This helps me to rectify the mistakes when buying a home.
    2BHK apartments in Chennai
    Properties in Chennai
    Luxury flats in Chennai
    New projects in Chennai
    Luxury apartments in Chennai

  4. This is an awesome post. Really very informative and creative contents.
    WordPress website development company in Chennai

  5. Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
    apache spark online training

  6. Learned a lot from your post and it is really good. Share more tech updates regularly.awesome blog it's very nice and useful i got many more information it's really nice i like your blog styleweb design company in velachery

  7. I m Really looking forward to read more. Your site is very helpful for us .. This is one of the awesome post i got the best information through your site and Visit also this site
    Black satta king
    disawar satta king
    gaziabad satta king
    faridabad satta king
    gali satta king

  8. The article was up to the point and described the information very effectively. Thanks to blog author for wonderful and informative post.
    website development company pakistan