Jan 1, 2016


Set up Apache Spark in Eclipse (Scala IDE): Word count example using Apache Spark in Scala IDE

Apache Spark is a widely used in-memory computing engine for processing big data workloads. Scala IDE (an Eclipse-based project) can be used to develop Spark applications. The goal of this post is to set up a development environment for Spark applications in Scala IDE and run a word count example.

Download Scala IDE:- 
Scala IDE is an Eclipse-based project that provides a development environment for Scala and Spark applications. Download Scala IDE and install it.

Create a Maven project:-
Maven is a popular build and dependency management tool for JVM languages that lets us pull in libraries published in public repositories. We can use Maven itself to build our project, or use other tools such as Scala's sbt or Gradle.
1. Go to File -> New -> Project -> Maven Project and create a Maven project. Fill in the Group Id and Artifact Id and click Finish.
Group Id = com.devinline.spark and Artifact Id = SparkSample

2. Update pom.xml:- Download the sample pom.xml and use it to replace the pom.xml of the Maven project created above. It contains the Spark dependency entry, and the corresponding jars are downloaded when the project is built.
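
For readers who do not have the sample pom.xml at hand, the Spark dependency entry would look roughly like the fragment below. This is only a sketch: the artifact coordinates are the standard ones for Spark core built for Scala 2.10, but the version number 1.5.2 is an assumption and should be replaced with the Spark release you actually intend to use.

```xml
<!-- Sketch of the Spark dependency entry; the version number is an assumption -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <!-- The _2.10 suffix is the Scala binary version the artifact was built for -->
    <artifactId>spark-core_2.10</artifactId>
    <version>1.5.2</version>
  </dependency>
</dependencies>
```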

3. Add Scala Nature to this project:-
Right click on the project -> Configure -> Add Scala Nature.

4. Update the Scala compiler version for Spark:-
Scala IDE uses the latest Scala compiler version (2.11) by default, whereas the Spark release used here is built against Scala 2.10, so the IDE must be switched to a matching version.
Right click on the project -> Properties -> Scala Compiler -> set the Scala installation version to 2.10.5.
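
To confirm that the project really runs on the Scala version selected above, a small check like the following can help. It is a minimal sketch, not part of the original setup: the object name ScalaVersionCheck and the helper binaryVersion are hypothetical names introduced here; only scala.util.Properties.versionNumberString comes from the Scala standard library.

```scala
object ScalaVersionCheck {
  // Reduce a full version such as "2.10.5" to its binary version "2.10"
  // (binaryVersion is a hypothetical helper, not a library function)
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library the program is actually running on
    val running = scala.util.Properties.versionNumberString
    println("Scala version in use: " + running)
    println("Binary version: " + binaryVersion(running))
  }
}
```

The binary version printed here should match the suffix of the Spark artifact on the classpath (for example the _2.10 in spark-core_2.10).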
  
5. Remove Scala Library Container from the build path (optional):-
The required Scala jars are already pulled in through spark-core (via pom.xml), so a second copy of the library is not needed.
Right click on the project -> Build Path -> Configure Build Path and remove Scala Library Container.

6. Rename the source folder src/main/java to src/main/scala (Right click -> Refactor -> Rename, and change java to scala). Now create a package under it and name it com.devinline.spark.

7. Create a Scala object named WordCount (file WordCount.scala) under the package created above:-
Right click on the package -> New -> Scala Object and add WordCount at the end of the Name field.

8. Update WordCount.scala with the following code:
package com.devinline.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
// Brings reduceByKey into scope for RDDs of key/value pairs
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object WordCount {
  def main(args: Array[String]): Unit = {

    // Start the Spark context, running locally
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Read the example file into a test RDD (one element per line)
    val test = sc.textFile("input.txt")

    test.flatMap { line => // for each line
      line.split(" ") // split the line into words
    }
      .map { word => // for each word
        (word, 1) // return a key/value tuple, with the word as key and 1 as value
      }
      .reduceByKey(_ + _) // sum all of the values that share the same key
      .saveAsTextFile("output.txt") // save the result as text files in the directory output.txt

    // Stop the Spark context
    sc.stop()
  }
}
Explanation:- Applying flatMap to the RDD test splits each line on spaces, and the resulting arrays of words are flattened into a single collection of words. The map step then turns each word into a key/value tuple with the word as key and 1 as value. Finally, reduceByKey sums the values of all tuples that share the same key, and the aggregated output (each unique word and its count) is written to a file. Let's take an example to understand the flow of the methods used in the above program. Suppose input.txt has two lines:
 This is spark time
 Learn spark
Flow of methods used in the word count example
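
The same flow can be traced without Spark by applying the equivalent operations to an ordinary Scala collection. The sketch below is an illustration only: it uses the two sample lines as input, and it replaces reduceByKey with a local groupBy followed by a sum, since reduceByKey exists on Spark pair RDDs and not on plain collections. The object name WordCountFlow is hypothetical.

```scala
object WordCountFlow {
  // The two sample lines from input.txt
  val lines: Seq[String] = Seq("This is spark time", "Learn spark")

  // flatMap: split every line on spaces and flatten the results into one sequence of words
  val words: Seq[String] = lines.flatMap(_.split(" "))

  // map: pair each word with the count 1
  val pairs: Seq[(String, Int)] = words.map(word => (word, 1))

  // Local stand-in for reduceByKey(_ + _): group the pairs by word and sum their counts
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(words.mkString(", "))
    counts.foreach(println)
  }
}
```

After the flatMap step the six words are This, is, spark, time, Learn, spark; after the final step spark maps to 2 and every other word maps to 1.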

9. Download the sample input file and place it at a location of your choice. Update the path to input.txt in the sample code above accordingly (sc.textFile("<Your_input.txt_Location>")).
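
Instead of editing the path in the source each time, one option is to read it from the program arguments and fall back to input.txt. This is a sketch under that assumption, not something the original program does: the object InputPath and its resolve method are hypothetical names, and the resolved value would be what gets passed to sc.textFile.

```scala
import java.io.File

object InputPath {
  // Use the first program argument as the input path, else fall back to the default
  // (resolve is a hypothetical helper introduced for this sketch)
  def resolve(args: Array[String], default: String = "input.txt"): String =
    args.headOption.getOrElse(default)

  def main(args: Array[String]): Unit = {
    val path = resolve(args)
    // Check up front that the file exists, so a wrong location is noticed early
    val exists = new File(path).isFile
    println(s"Input path: $path (exists: $exists)")
    // In WordCount, this value would be passed on as sc.textFile(path)
  }
}
```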

10. Execute the word count program:- Right click on WordCount.scala -> Run As -> Scala Application. It should create an output directory named output.txt containing two files: part-00000 and _SUCCESS.
Sample output in part-00000:-
(spark,2)
(is,1)
(Learn,1)
(This,1)
(time,1)
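
One practical point when running the program a second time: by default, saveAsTextFile fails if the output directory already exists, so output.txt has to be removed between runs. The helper below is a minimal sketch of one way to do that with plain java.io.File before the Spark job starts; the object name OutputCleanup and the method deleteRecursively are hypothetical, and deleting the directory by hand works just as well.

```scala
import java.io.File

object OutputCleanup {
  // Recursively delete a file or directory.
  // Returns true if nothing is left at that path afterwards.
  def deleteRecursively(f: File): Boolean = {
    if (f.isDirectory) {
      // listFiles can return null on I/O errors, so guard with Option
      Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    !f.exists || f.delete()
  }

  def main(args: Array[String]): Unit = {
    // "output.txt" is the output directory name used in the word count example
    val removed = deleteRecursively(new File("output.txt"))
    println("Output directory cleared: " + removed)
  }
}
```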

Location: Hyderabad, Telangana, India

41 comments:

  9. I am getting the error below. Can you please assist?

    java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(Unknown Source)
    at java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.io.ObjectInputStream.readObject(Unknown Source)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    19/11/24 15:36:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(Unknown Source)
    at java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.io.ObjectInputStream.readObject(Unknown Source)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
