Dec 13, 2015


Apache Spark setup on Windows 7 - standalone mode

Apache Spark is a general-purpose cluster computing system for processing big data workloads. Spark can run alongside Hadoop HDFS or on Amazon EC2, and it can read from other persistent storage systems, including the local file system. For learning Apache Spark, it is enough to set it up in standalone mode and start calling Spark APIs from the Scala, Python or R shell. In this post we will set up Spark and execute some Spark APIs.

Download Apache Spark :-
Download a pre-built version of Apache Spark and unzip it into a directory. I have placed it at E:\spark-1.5.2-bin-hadoop2.6.
Note:- It is also possible to download the source code and build it using Maven or SBT. The Spark download page lists the other download options.

Download and install Scala :-
Download the Scala installer and run it. It is a prerequisite for this setup, since Spark is written in Scala. I installed Scala at "C:\Program Files (x86)\scala".

Once Spark and Scala are installed, configure the SCALA_HOME and HADOOP_HOME environment variables.
SCALA_HOME  =  C:\Program Files (x86)\scala

For now we do not want to set up Hadoop; we just want to learn Apache Spark. On Windows we still need winutils.exe, so download it, unzip it, and set HADOOP_HOME to the directory that contains the bin directory (that is, the folder just above bin).
HADOOP_HOME = E:\dev\hadoop\hadoop-common-2.2.0-bin-master
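
Before launching the shells, it can help to confirm that both variables are set as described above. The following is only a minimal sketch: check_spark_env is a hypothetical helper name (not part of Spark), and it assumes that winutils.exe sits in the bin directory under HADOOP_HOME.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the given environment mapping."""
    env = os.environ if env is None else env
    problems = []
    for name in ("SCALA_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(name + " is not set")
        elif not os.path.isdir(value):
            problems.append(name + " does not point to a directory: " + value)
    # Assumption: winutils.exe lives in %HADOOP_HOME%\bin.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and os.path.isdir(hadoop_home):
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append("winutils.exe not found at " + winutils)
    return problems


if __name__ == "__main__":
    # Prints one line per problem; prints nothing when everything looks fine.
    for problem in check_spark_env():
        print(problem)
```

Run it from a freshly opened cmd window, because environment variable changes are not visible to windows that were already open.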

Update PATH environment variable :-
Add the Spark bin directory to the PATH environment variable so that the Scala or Python shell can be started without changing to the bin directory every time.
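
To check that the PATH change took effect, one option is to ask Python where the two launchers resolve. This is a small sketch rather than an official Spark check; find_spark_launchers is a hypothetical helper name, and it relies on the standard library function shutil.which (Python 3.3 or later).

```python
import shutil


def find_spark_launchers(path=None):
    """Map each launcher name to its resolved location, or None if not found."""
    names = ("pyspark", "spark-shell")
    # path=None makes shutil.which search the PATH environment variable.
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in find_spark_launchers().items():
        print(name, "->", location or "not found on PATH")
```

If either launcher is reported as not found, recheck the PATH entry and open a new cmd window before trying again.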

Start Spark's shells (Scala or Python version) :-
Python version : pyspark
Scala version : spark-shell
Open cmd, type pyspark and press Enter. If we have followed the steps properly, it should open the Python version of the Spark shell.

Similarly, we can start the Scala version of the Spark shell by typing spark-shell in cmd and pressing Enter.
Note:- Here we will get an error on the console about write permission on the hive directory. We can ignore it and start executing Spark APIs to learn Apache Spark.

Sample API execution in the Python or Scala shell :-
Create an RDD, then display the total number of lines in the file, followed by the first line of that file.
In the Python version of the Spark shell:
>>> lines = sc.textFile("") # Create an RDD called lines
>>> lines.count() # Count the number of items in this RDD
>>> lines.first() # First item in this RDD, i.e. first line of
u'# Apache Spark'

Note:- If you execute the same set of commands, the console will be flooded with log lines. I have suppressed them by changing the log level to warning.
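
The post does not say how the log level was changed. One common approach, assumed here, is to copy conf\log4j.properties.template in the Spark directory to conf\log4j.properties and change the root logger level from INFO to WARN. The sketch below does that rewrite on the file's text; lower_root_log_level is a hypothetical helper name, not a Spark API.

```python
def lower_root_log_level(template_text, level="WARN"):
    """Return log4j properties text with the root logger level replaced."""
    out = []
    for line in template_text.splitlines():
        if line.startswith("log4j.rootCategory="):
            # Keep the appender list (e.g. "console") and swap only the level.
            _, _, rest = line.partition("=")
            appenders = rest.split(",")[1:]
            line = "log4j.rootCategory=" + ",".join([level] + appenders)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "log4j.rootCategory=INFO, console"
    print(lower_root_log_level(sample), end="")
```

To apply it, read the template file, pass its text to the function, and write the result to conf\log4j.properties. Alternatively, calling sc.setLogLevel("WARN") inside the running shell should have a similar effect on Spark 1.4 and later, although it only applies after the shell has started.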
