Apache Spark Setup on Windows 7 - Standalone Mode

Apache Spark is a general-purpose cluster computing system for processing big data workloads. It can work with Hadoop HDFS, run on Amazon EC2, and use other persistent storage systems, including the local file system. For learning Apache Spark, the easiest route is to set it up in standalone mode and start executing the Spark APIs from the Scala, Python or R shell. In this post we will set up Spark on Windows 7 and execute a few Spark APIs.

Download Apache Spark:-
Download a pre-built version of Apache Spark and unzip it to some directory. I have placed it at the following location: E:\spark-1.5.2-bin-hadoop2.6.
Note:- It is also possible to download the source code and build it with Maven or SBT. Refer to this for other download options.

Download and install Scala:-
Download the Scala installer and run it. Scala is a prerequisite for working with Apache Spark, since Spark itself is written in Scala. In my case Scala is installed at "C:\Program Files (x86)\scala"; running scala -version in cmd is a quick way to verify the installation.

Set up SCALA_HOME and HADOOP_HOME :-
Once we are done with the installation of Spark and Scala, configure the SCALA_HOME and HADOOP_HOME environment variables.
SCALA_HOME  =  C:\Program Files (x86)\scala

As of now we do not want to set up a full Hadoop installation; we just want to learn Apache Spark. Spark on Windows does, however, need winutils.exe, so download the winutils.exe bundle, unzip it, and set HADOOP_HOME to the directory that contains the bin folder (not the bin folder itself).
HADOOP_HOME = E:\dev\hadoop\hadoop-common-2.2.0-bin-master
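For example, both variables can be set from cmd with setx; the paths below are the ones used in this post, and newly set values only take effect in cmd windows opened afterwards. Make sure winutils.exe ends up at %HADOOP_HOME%\bin\winutils.exe.

setx SCALA_HOME "C:\Program Files (x86)\scala"
setx HADOOP_HOME "E:\dev\hadoop\hadoop-common-2.2.0-bin-master"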

Update the PATH environment variable :-
Add Spark's bin directory to the PATH environment variable so that the Scala or Python shell can be started without changing into the bin directory every time.
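For example, the user PATH can be extended from cmd as shown below; note that setx stores the expanded value of %PATH%, so if your PATH is long, editing it through the Environment Variables dialog is the safer option.

setx PATH "%PATH%;E:\spark-1.5.2-bin-hadoop2.6\bin"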

Start Spark's shells (Scala or Python version) :-
Python version : pyspark
Scala version : spark-shell
Start cmd, type pyspark and press Enter. If we have followed the steps properly, it should open the Python version of the Spark shell, as shown below.

Similarly, we can start the Scala version of the Spark shell by typing spark-shell and pressing Enter in cmd.
Note:- Here we may get some errors on the console regarding hive directory write permissions; we can ignore them and start executing Spark APIs to learn Apache Spark.
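As a quick sanity check, the shell pre-creates a SparkContext and exposes it as the variable sc; asking it for its version (shown here in the Python shell) should echo the Spark release that was downloaded:

>>> sc.version
u'1.5.2'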

Sample API execution in the Python or Scala shell :-
Create an RDD, display the total number of lines in a file, and then the first line of that file. Start the shell from the Spark home directory so that the relative path README.md resolves.
In the Python version of the Spark shell:
>>> lines = sc.textFile("README.md") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
98
>>> lines.first() # First item in this RDD, i.e. first line of README.md
u'# Apache Spark'
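Building on the same RDD, here is a small sketch of a transformation as well: filter keeps only the lines that contain "Spark" (the resulting count depends on the README version, but the first match is the heading we already saw above).

>>> sparkLines = lines.filter(lambda line: "Spark" in line) # keep lines containing "Spark"
>>> sparkLines.first()
u'# Apache Spark'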

Note:- If you execute the same set of commands, the console will be flooded with INFO log lines. I have suppressed them by changing the log level to WARN.
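One way to do this from within the shell itself is the SparkContext's setLogLevel method (available since Spark 1.4); alternatively, copy conf/log4j.properties.template to conf/log4j.properties and change log4j.rootCategory to WARN, console.

>>> sc.setLogLevel("WARN") # suppress INFO output for this session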
