Jan 1, 2016

Setup Apache Spark in Eclipse (Scala IDE): Word count example using Apache Spark in Scala IDE

Apache Spark is a well-known in-memory computing engine for processing big data workloads. Scala IDE (an Eclipse project) can be used to develop Spark applications. The main agenda of this post is to set up a development environment for Spark applications in Scala IDE and run a word count example.

Download Scala IDE:- 
Scala IDE is an Eclipse project which provides a very intuitive development environment for Scala and Spark applications. Download Scala IDE and install it.

Create a Maven project:-
Maven is a popular package management tool for Java-based languages that allows us to link libraries from public repositories. We can use Maven itself to build our project, or use other tools like Scala's sbt or Gradle.
1. Go to: File -> New -> Project -> Maven Project and create a Maven project. Fill in the Group Id and Artifact Id and click Finish.
Group Id = com.devinline.spark and Artifact Id = SparkSample

2. Update pom.xml:- Download the pom.xml sample and use it to update the pom.xml of the Maven project created above. It contains the Spark dependency entries, which are downloaded when the project is built.
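The exact contents depend on the downloaded sample, but at a minimum the pom.xml needs the spark-core dependency built for Scala 2.10. A minimal sketch (the version shown here is an assumption; use whatever version the downloaded sample specifies):

  <dependencies>
    <dependency>
      <!-- Spark core built for Scala 2.10; the version is an assumption, match it to the sample pom.xml -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
  </dependencies>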

3. Add Scala Nature to the project:-
Right click on the project -> Configure -> Add Scala Nature.

4. Update the Scala compiler version for Spark:-
Scala IDE uses the latest version (2.11) of the Scala compiler by default; however, this Spark setup is built against Scala 2.10, so we need to select the matching version in the IDE.
Right click on the project -> Properties -> Scala Compiler -> set the Scala installation version to 2.10.5.
  
5. Remove the Scala Library Container from the build path:- (Optional)
The required jars are already added via spark-core (through pom.xml), so duplicate Scala library jars are not needed.
Right click on the project -> Build Path -> Configure Build Path and remove the Scala Library Container.

6. Rename the source folder src/main/java to src/main/scala (Right click -> Refactor -> Rename to scala). Now create a package under it and name it com.devinline.spark.

7. Create a Scala object named WordCount.scala under the package created above:
Right click on the package -> New -> Scala Object and enter WordCount as the name.

8. Update WordCount.scala with the following code:
package com.devinline.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
  def main(args: Array[String]) = {

    //Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    //Read the example input file into an RDD of lines
    val test = sc.textFile("input.txt")

    test.flatMap { line => //for each line
      line.split(" ") //split the line into words
    }
      .map { word => //for each word
        (word, 1) //return a key/value tuple, with the word as key and 1 as value
      }
      .reduceByKey(_ + _) //sum the values with the same key
      .saveAsTextFile("output.txt") //save the result as a text file

    //Stop the Spark context
    sc.stop()
  }
}
Explanation:- Applying the flatMap function on the RDD test splits each line on spaces and produces a flattened collection of words. The map step then turns each word into a tuple, with the word as key and 1 as value. Finally, reduceByKey aggregates the tuples by key, and the result (each unique word with its count) is written to the output location. Let's take an example and walk through the flow of the methods used in the program above. Suppose input.txt has two lines:
 This is spark time
 Learn spark
Flow of the methods used in the word count example:
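The sketch below (using the two input lines above) shows what each transformation produces; the intermediate names words, pairs and counts are introduced here only for illustration:

  val words = test.flatMap(line => line.split(" "))
  // words: "This", "is", "spark", "time", "Learn", "spark"

  val pairs = words.map(word => (word, 1))
  // pairs: ("This",1), ("is",1), ("spark",1), ("time",1), ("Learn",1), ("spark",1)

  val counts = pairs.reduceByKey(_ + _)
  // counts: ("spark",2), ("is",1), ("Learn",1), ("This",1), ("time",1)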

9. Download the sample input file and place it at a location of your convenience. Modify the location of input.txt in the sample code above accordingly (sc.textFile("<Your_input.txt_Location>")).

10. Execute the word count program:- Right click on WordCount.scala -> Run As -> Scala Application. It should create an output directory named output.txt containing two files: part-00000 and _SUCCESS.
Sample output in part-00000 is:-
(spark,2)
(is,1)
(Learn,1)
(This,1)
(time,1)
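For a quick check without writing files, the same pipeline can also be collected and printed to the console (a minimal sketch; collect() brings all results to the driver, so this is only suitable for small inputs in local mode):

  test.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .collect()
    .foreach(println)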
