Mar 30, 2015


Internals of read and write operations in Hadoop

In the previous post we learned about the HDFS building blocks: namenode, datanode and secondary namenode. In this post I will discuss read and write operations in HDFS. The HDFS client (a library that exports the HDFS filesystem interface) communicates with the namenode on behalf of the user or user application. In HDFS, data is stored in files, and each file is split into blocks that are stored on datanodes.
Note: Hadoop went through a substantial architectural change from Hadoop 1.x to Hadoop 2.x; Hadoop 2.x modified some concepts and added new features to improve cluster performance. The explanation below is most relevant to Hadoop 1.x. Read Hadoop 2.0 vs Hadoop 1.0.
Before going into read and write operations in detail, let's first understand how data blocks are arranged in a Hadoop cluster: what network topology is adopted, and what is the data block placement policy?
By default Hadoop assumes that the network is flat: a single-level hierarchy (all nodes on a single rack in a single data centre). A large Hadoop cluster has many nodes and keeps growing (commodity hardware is added to improve storage and fault tolerance), so a flat (switched) network topology is highly inefficient because:
  1. In a flat topology a broadcast packet interrupts the CPU on every node, so the cost of the operation is of order N (the number of nodes in the network or broadcast domain).
  2. In a non-hierarchical network the load on each node's CPU also increases, because traffic has to pass through many routers and switches to reach the correct datanode.
A hierarchical/modular topology minimizes the number of hops through routers/switches, shortens the distance between nodes (within and across racks) and increases bandwidth utilization. In a Hadoop cluster, datanodes are grouped into racks: all datanodes in a rack are connected to a rack switch, and all racks are connected by one or more top-level switches. A Hadoop cluster with two racks and 3 datanodes can be depicted as:
Hierarchical topology in hadoop cluster: data-nodes, racks and switches inter-connectivity 
Note: HDFS lets you configure a script that decides whether the cluster uses multiple racks or treats all datanodes as being on a single rack. The namenode resolves the rack location of each datanode and stores its whereabouts. When a datanode registers with the namenode, the namenode runs that script to decide which rack the datanode belongs to.
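A rack-resolution script of the kind described in the note can be sketched as follows. Hadoop calls the configured script with one or more datanode addresses as arguments and reads one rack path per address from its output. The address table and the rack names below are illustrative assumptions, not values from a real cluster.

```python
import sys

# Hypothetical address-to-rack table; a real script would usually load this from a file.
RACK_BY_ADDRESS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"  # assumed fallback for addresses not in the table


def resolve_racks(addresses):
    """Return one rack path per address, in the same order as the input."""
    return [RACK_BY_ADDRESS.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == "__main__":
    # Addresses arrive as command-line arguments; print one rack path per argument.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

If every address maps to the same rack, the cluster behaves like the flat, single-rack topology described earlier.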

Replica placement: In a Hadoop cluster, data blocks are replicated to handle failures. There is a trade-off between reliability and read/write bandwidth utilization. If all replicas of a block are stored on a single node, no redundancy is achieved and the data is lost when that node fails. On the other hand, if replicas are placed in different data centres, redundancy is maximized but the read/write bandwidth penalty is huge.
What is the default strategy for placing data blocks on racks in a Hadoop cluster?
  • The first replica of a data block is placed on the node where the writer program is running. The second replica is placed on a different rack than the first, and the third replica is placed on the same rack as the second but on a different node.
  • If the replication factor is more than 3, the remaining replicas are placed on random nodes, with the constraint that no more than one replica is placed on any one node and no more than two replicas are placed on the same rack.
The default value of dfs.replication is 3, so unless a different value is configured, each block is replicated three times. Now we move on to the main objective of this post: read and write operations in a Hadoop cluster.
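The default three-replica placement described above can be sketched in a few lines. This is a simplified illustration, not Hadoop's actual placement code: the function name, the cluster layout and the node names are hypothetical, and it assumes the writer runs on a datanode and that some other rack has at least two datanodes.

```python
import random


def choose_replica_nodes(writer, nodes_by_rack, rng=random):
    """Pick three datanodes following the default placement described above.

    nodes_by_rack maps a rack name to the list of datanodes on that rack.
    """
    rack_of = {n: rack for rack, nodes in nodes_by_rack.items() for n in nodes}
    first = writer  # replica 1: the node the writer runs on
    # Replica 2: a node on a different rack than the first replica.
    remote_racks = sorted(r for r in nodes_by_rack if r != rack_of[first])
    second_rack = rng.choice(remote_racks)
    second = rng.choice(sorted(nodes_by_rack[second_rack]))
    # Replica 3: same rack as the second replica, but a different node.
    third = rng.choice(sorted(n for n in nodes_by_rack[second_rack] if n != second))
    return [first, second, third]


cluster = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
picked = choose_replica_nodes("dn1", cluster, random.Random(0))
```

With this layout the first replica is always dn1 and the other two are dn3 and dn4 in some order, so losing either rack still leaves at least one copy.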

Write operation:

In a Hadoop cluster, data is stored in a file (a new file is created, or data is appended to an existing one) and the file is broken into multiple blocks. Once data is written to a file it cannot be modified or removed; only appending new data is possible. HDFS follows a single-writer, multiple-reader model. Let's understand how Hadoop uses this model to perform a write operation.
  1. When the HDFS client requests the namenode to create a new file (say input.txt), the namenode checks permissions and creates the appropriate inode.
  2. The HDFS client that creates/opens a file for writing is granted a lease for the file. The file is locked for that client and no other client has write access to it (remember the single-writer, multiple-reader model). The client has to renew the lease before it expires by sending a lightweight message. The duration of a lease is bounded by a soft limit and a hard limit.
    Soft limit: the time after which, if the client has not renewed the lease, another client can take over write access from this client.
    Hard limit: the time (1 hour) after which write access is revoked automatically if the lease has not been renewed.
  3. The HDFS client asks the namenode for a block in which the file data will be stored (remember that in HDFS a file is stored in blocks).
  4. The namenode returns a unique block ID and the list of datanodes on which the block replicas are to be stored.
  5. The client writes the first block to a datanode (using the block ID returned by the namenode) and passes the list of datanodes (provided by the namenode for block replication) to the datanode where the first replica is written. Along with the actual data, the client also sends a checksum for that block to the datanode.
    The client sends the data block to the first datanode as a collection of packets, and the datanodes form a pipeline ordered to minimize the total network distance from the client to the last datanode. Consider the following diagram to understand how a block is split into packets and how one block is written to multiple datanodes:
    Block  transmission from client to data-node (Diagram reference:
    First the pipeline is set up, then data is streamed by sending packets, with an acknowledgement sent back to the client for each packet, and finally the pipeline connection is torn down.
  6. The datanodes write the data block to their file system and also store checksum metadata in a separate file, which is used while reading to verify whether the data is valid or corrupted.
  7. The datanode acknowledges to the namenode and the client that the block has been persisted to the file system.
  8. After the cycle for one block completes, the client again asks the namenode for a block ID and datanodes, and the process above (steps 4 to 7) repeats for the remaining blocks.
  9. After all blocks have been written, the namenode checks that at least one copy of each block has been written.
  10. If the check is successful, the namenode releases the lease and sends a success message to the client.
The write operation can be summarized in the following diagram:
Write operation in Hadoop
Here the client wants to create input.txt in HDFS and requests a block from the namenode. The client transfers the first block to a datanode, and the datanode persists it in its file system. The second block is written in the same way, and finally the lease is released and a success ACK is sent to the client.
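The packet and checksum idea from steps 5 and 6 can be sketched as below. This is only an illustration: the packet size is an assumed value, CRC32 from Python's zlib stands in for whatever checksum the cluster is configured with, and HDFS's real packet layout and checksum granularity differ.

```python
import zlib

PACKET_SIZE = 64 * 1024  # assumed packet size in bytes, for illustration only


def crc(data):
    """CRC32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def split_into_packets(block, packet_size=PACKET_SIZE):
    """Split one block (bytes) into a list of (data, checksum) packets."""
    return [(block[i:i + packet_size], crc(block[i:i + packet_size]))
            for i in range(0, len(block), packet_size)]


def packet_is_valid(packet):
    """Recompute the checksum, as a datanode or reader would, and compare."""
    data, checksum = packet
    return crc(data) == checksum


block = bytes(range(256)) * 1000           # 256,000 bytes of sample data
packets = split_into_packets(block)
corrupted = (b"x" + packets[0][0][1:], packets[0][1])  # first byte altered in transit
```

Every packet produced by split_into_packets passes packet_is_valid, while the corrupted one fails, which is how a bad replica is detected on read.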

Read operation:

In Hadoop, data is stored in files, so to read data we have to read the data blocks of a file. When a user application/HDFS client wants to read a file present in HDFS, the HDFS client performs the following tasks:
  1. The HDFS client asks the namenode for the list of datanodes that store replicas of the blocks of the given file (a file consists of blocks).
  2. For each block, the namenode returns the addresses of the datanodes that have a copy of that block (metadata information).
  3. Once the client receives the metadata about the blocks present on the datanodes, it requests the datanodes to transfer the blocks.
  4. When the HDFS client receives the metadata from the namenode, the locations of each block are ordered by their distance from the reader. The client tries to read from the closest replica first; if that fails, it tries the next replica, and so on. Data is streamed from the datanode back to the client. The diagram below gives an overview of the read operation in HDFS.
Read operation in HDFS
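Ordering replicas by distance from the reader, as in step 4, can be sketched as follows. Distance is taken here as the number of hops from each node up to their closest common ancestor in the rack hierarchy. The location paths such as "/rack1/dn1" and both function names are hypothetical.

```python
def distance(a, b):
    """Hops from each node up to the closest common ancestor, summed.

    Locations are paths such as "/rack1/dn1".
    """
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def order_replicas(reader, replicas):
    """Closest replica first; the client falls back to the next one on failure."""
    return sorted(replicas, key=lambda loc: distance(reader, loc))


ordered = order_replicas("/rack1/dn1", ["/rack2/dn3", "/rack1/dn2", "/rack1/dn1"])
```

Under this measure a replica on the reader's own node has distance 0, one on the same rack has distance 2, and one on another rack has distance 4.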

Note :- 
1. The namenode does not ask a datanode directly to transfer a block. The client contacts the datanodes for blocks and is guided by the namenode (the best datanode for each block); this design allows HDFS to scale to a large number of concurrent clients. Following block diagram gives an overview of read operation in hadoop.
2. A read operation may fail when a datanode is unavailable or corrupted, or when the checksum of a block does not match when the block is read by the client.
Watch the Hadoop and NoSQL Downfall Parody; I bet only a Hadoop learner can appreciate it.

Next: MapReduce - An efficient data processing technique

Mar 28, 2015

Python Input and Output

To accept user input via the console and display results, Python has built-in methods such as raw_input() and print(). We have seen uses of print() in an earlier post; reading input from the console is the focus of this one. Along with console reading, Python also supports reading and writing text files. In this post we will see both in detail.

Console Input and Output:-

To acquire information from the user console, Python has the built-in method raw_input() (input() in Python 3.0 or later). Let's write a sample program that calculates the mean of the numbers in a range provided as user input, i.e. the user provides the range (start and end) and the output is displayed via print().
Open Python IDLE and copy the following code into a file. When we execute it, a prompt appears and asks for user input. calculateMean(start, end) is then called to calculate the mean, which is displayed using print().
# Definition of calculateMean function
def calculateMean(start, end):
    total = 0
    n = (end - start) + 1
    while start <= end:
        total = total + start
        start = start + 1
    return float(total) / n

# Read user input via console
start = int(raw_input("Enter start of range: "))
end = int(raw_input("Enter end of range: "))

# Call calculateMean function
mean = calculateMean(start, end)
print "mean of numbers between %d and %d is %f" % (start, end, mean)
Sample outputs are:
Enter start of range: 8
Enter end of range: 10
mean of numbers between 8 and 10 is 9.000000
Enter start of range: 1
Enter end of range: 10
mean of numbers between 1 and 10 is 5.500000
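As a cross-check on the loop above: for consecutive integers the mean is simply the midpoint of the range, (start + end) / 2. The helper name below is hypothetical, and the snippet runs unchanged on Python 2 and Python 3.

```python
def mean_of_range(start, end):
    """Mean of the consecutive integers start..end, using the closed form."""
    return (start + end) / 2.0


print(mean_of_range(8, 10))  # 9.0
print(mean_of_range(1, 10))  # 5.5
```

Both values agree with the sample outputs shown above.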

Files input and output:-

Like other programming languages, Python has built-in methods for reading and writing files. The very first step is to open the specified file in read or write mode. open() returns a proxy for interactions with the underlying file. Once we have the handle (proxy) for the file, we can perform the required operation using read(), readline() or write(). Finally, once we are done, we close the file using the close() method. This can be summarized as follows:
# Open for writing
f = open('input.txt', 'w')
# Write text to file
f.write("input string")
# Close the file
f.close()
Let's write sample code to demonstrate file operations: read the content of an input file (displaying it on the console), append a block of lines to the existing file, and read it again.
Open Python IDLE, create a file and copy the following code into it. Let's walk through the code, which is divided into three sections:
Read operation: input_block, a block of lines to be written to the file, is created first. The file input.txt is opened in read mode (read is the default mode if none is specified), read line by line using readline() and displayed on the console. The loop terminates once the length of a line is zero, and the file is closed using close().
Append operation: the file is then opened in append mode (writing at the end of the file without disturbing the previous content). The block of lines is written using f.write(input_block) and the file is closed.
Read operation after append: the file is opened in read mode once again and its content is read and displayed on the console. We can spot the difference between the output of section 1 and section 3.
input_block = """
1.hello writing!!
111.hello writing!!
11111.hello writing!!
"""

# Section 1: read operation
# Read the file content and display on console
print 'Read the file content and display on console before append'
# open file and get proxy(handle)
f = open('input.txt')
while True:
    line = f.readline()
    # Zero length indicates EOF
    if len(line) == 0:
        break
    print line,
# Close the file
f.close()

# Section 2: append operation
# Open for appending
f = open('input.txt', 'a')
# Write text to file
f.write(input_block)
# Close the file
f.close()

# Section 3: read operation after append
# Read the file content and display on console
print 'Read the file content and display on console after append'
# open file and get proxy(handle)
f = open('input.txt')
while True:
    line = f.readline()
    # Zero length indicates EOF
    if len(line) == 0:
        break
    print line,
# Close the file
f.close()
Sample output:
Read the file content and display on console before append
Read first line.
Append your test below
----Do not delete me------

Read the file content and display on console after append
Read first line.
Append your test below
----Do not delete me------

1.hello writing!!
111.hello writing!!
11111.hello writing!!
---End of sample output---
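The same append-then-read flow can also be written with the with statement, which closes the file automatically when the block ends, so no explicit close() call is needed. This is a minimal sketch: the helper name and the temporary demo file are hypothetical, and it runs on both Python 2 and Python 3.

```python
import os
import tempfile


def append_and_read(path, text):
    """Append text to the file at path, then return the whole file content."""
    # 'with' closes the file when the block exits, even if an error occurs inside it.
    with open(path, "a") as f:
        f.write(text)
    with open(path) as f:
        return f.read()


# Usage on a throwaway file created just for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(demo_path, "w") as f:
    f.write("Read first line.\n")
content = append_and_read(demo_path, "1.hello writing!!\n")
```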

Previous: Python Functions  Next:Exception handling in python

Mar 22, 2015


Oracle JDK installation in Ubuntu

Oracle's Sun Java JDK packages have been removed from the Ubuntu partner repositories, so we are left with two options: OpenJDK or Oracle JDK. The fundamental difference is that OpenJDK is the open-source reference implementation, while Oracle JDK is Oracle's build based on OpenJDK and is not open source. I preferred Oracle JDK, which I found more stable and better supported. Installing Oracle JDK takes a little longer than running a single command, but it is a one-time activity.

Java installation in Ubuntu:-

Follow the steps below to install Oracle JDK. Refer to Install-Oracle-Java-on-Ubuntu-Linux (the diagrams might be helpful).
  1. First check whether your machine is a 32-bit or 64-bit Linux box (32-bit and 64-bit refer to the architecture of the operating system).
    Open a terminal, execute the command <file /sbin/init> and verify whether it reports 32-bit or 64-bit.
    zytham@ubuntu:/usr/lib$ file /sbin/init
    /sbin/init: ELF 64-bit LSB  shared object, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=7a4c688d009fc1f06ffc692f5f42ab09e68582b2, stripped
    64-bit LSB indicates 64 bit OS architecture.
  2. Remove any other Java installation. Execute the following command to remove OpenJDK or other installations; it is better to remove them to avoid conflicts.
    zytham@ubuntu:~$ sudo apt-get purge openjdk-\*
  3. Download the latest Oracle JDK from the official Oracle website; it will be saved in the Downloads folder of the Linux box. Note: I downloaded the 64-bit Oracle JDK and the file name is "jdk-8u60-linux-x64.tar.gz"; select the compressed binaries that match your OS architecture.
    Now create a new folder under /usr/local, move the downloaded JDK into that folder and verify that it has been moved successfully. Execute the following commands in sequence.
    zytham@ubuntu:~$ sudo mkdir -p /usr/local/java
    zytham@ubuntu:~$ sudo mv /home/zytham/Downloads/jdk-8u60-linux-x64.tar.gz /usr/local/java/
    zytham@ubuntu:/usr/local/java$ ls /usr/local/java/
  4. Unpack the compressed Java binaries in /usr/local/java using the following commands and verify that a new folder jdk1.x.x has been created.
    zytham@ubuntu:~$ cd /usr/local/java/
    zytham@ubuntu:/usr/local/java$ sudo tar -xvzf jdk-8u60-linux-x64.tar.gz 
    zytham@ubuntu:/usr/local/java$ ls
    jdk1.8.0_60  jdk-8u60-linux-x64.tar.gz
  5. Update system variables: open /etc/profile with vi or gedit using the following command.
    zytham@ubuntu:~$ sudo gedit /etc/profile
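    Now add the following lines to the opened file and save it. Before the export lines, JAVA_HOME has to be set to the unpacked JDK directory (/usr/local/java/jdk1.8.0_60 in this walkthrough), JRE_HOME to its jre subdirectory, and their bin directories added to PATH.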
    export JAVA_HOME
    export JRE_HOME
    export PATH
    Test whether the update is correct: source reloads the file into the current shell, which is why the recently updated value becomes visible. If we skip the first command, echo $JAVA_HOME will not return any output.
    zytham@ubuntu:~$ source /etc/profile
    zytham@ubuntu:~$ echo $JAVA_HOME
  6. Now we are 2 steps away from completing the Java installation: if we have multiple Java versions on the system, or OpenJDK is installed, we need to tell the Ubuntu Linux system where the recently installed Oracle JDK is located and that it should be the default Java. Execute the following commands:
    1. Inform the Ubuntu Linux system where our Oracle Java JDK is located.
    zytham@ubuntu:/usr/bin$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.8.0_60/jre/bin/java" 1
    [sudo] password for zytham: 
    update-alternatives: using /usr/local/java/jdk1.8.0_60/jre/bin/java to provide /usr/bin/java (java) in auto mode

    2. Inform Ubuntu Linux system that Oracle JDK must be the default Java.
    zytham@ubuntu:/usr/bin$ sudo update-alternatives --set java /usr/local/java/jdk1.8.0_60/jre/bin/java
  7. It's time to test whether Java is in place on our system. Execute the following command, paste the sample Java program into the opened file and save it.
    zytham@ubuntu:~/Desktop$ gedit
    Copy the following sample code:
    class Sample{
      public static void main(String []strArg){
        System.out.println("Installed successfully!!!");
      }
    }
    Compile the Java program above (javac takes the source file name as its argument) and run it using the following commands. If you have followed the steps correctly you will see the success message.
    zytham@ubuntu:~/Desktop$ javac
    zytham@ubuntu:~/Desktop$ java Sample 
    Installed successfully!!!
    zytham@ubuntu:~/Desktop$ gedit
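The environment check from step 5 can also be scripted. The sketch below only inspects the JAVA_HOME variable and looks for a bin/java file beneath it. The helper name and the messages are hypothetical, and it does not replace actually running java as in step 7.

```python
import os


def java_home_problem(env):
    """Return None if JAVA_HOME looks usable, else a short description of the problem."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set (was /etc/profile re-sourced?)"
    if not os.path.isfile(os.path.join(java_home, "bin", "java")):
        return "no bin/java under " + java_home
    return None


# Check the environment of the current process.
print(java_home_problem(os.environ) or "JAVA_HOME looks fine")
```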

Mar 21, 2015


NameNode, DataNode and Secondary NameNode

In the previous post we learned that the NameNode (NN) and DataNode (DN) are the main building blocks of HDFS (Hadoop Distributed File System). In this post we will look at the NN and DN in detail, how they interact, and where they fit in the HDFS architecture.

NameNode and DataNode :

An HDFS cluster has two types of nodes operating in a master-slave pattern: a namenode (the master) and a number of datanodes (slaves/workers). The namenode stores metadata and the datanodes deal with actual storage. Datanodes are the workhorses of the filesystem and store files as smaller blocks (default size 128 MB in Hadoop 2.x; 64 MB in Hadoop 1.x).
The namenode manages the file-system namespace. The namespace/name-system consists of files and directories, which are represented on the namenode by inodes (an inode is a data structure that represents a file system object, a file or a directory, in a tree structure). Inodes store all information about files and directories, such as permissions, modification and access times, and namespace and disk space quotas. The namenode maintains the file-system tree and the metadata for all files and directories in the tree.
The inodes and the list of blocks that make up the namespace metadata are called the image (fsimage). The namenode keeps the image in RAM; the persistent record of the image stored in the namenode's local file system is called a checkpoint. Along with the checkpoint, the namenode maintains a write-ahead log called the journal in its local file system.
  • Any change made in HDFS is tracked in the journal, and the journal keeps growing until the changes are persisted and merged with the checkpoint.
  • Block locations are not stored persistently, because this information is obtained from the datanodes on system start-up.
  • The namenode never modifies an existing checkpoint file. When the namenode starts, the namespace image is loaded from the checkpoint, and a new checkpoint and journal are created in the native file system.
  • To increase reliability and availability, replicated copies of the checkpoint and journal are stored on local servers.
Datanodes manage data blocks (default size 128 MB). Each block on a datanode consists of two files: one contains the actual data and the other contains metadata such as the checksum of the stored data and the generation stamp. Datanodes are assigned two identifiers, called the namespace ID and the storage ID.
  • When a datanode initializes and joins a namenode, a namespace ID is assigned to it, and the namespace ID is stored persistently on all datanodes. The namespace ID is used to maintain file system integrity.
  • When the system starts, every datanode in the cluster sends a handshake message to the namenode to prove that it belongs to that namenode's namespace. If the namenode finds that the namespace ID of a datanode is not correct, that datanode is not allowed to join the cluster and shuts down automatically. Thus the namespace ID prevents a datanode with a different namespace ID from joining the cluster. On a successful handshake, the datanode is registered with the namenode and becomes part of the Hadoop cluster.
  • A storage ID is assigned to each datanode when it joins a namenode for the very first time, and it never changes after that. The storage ID is an internal identifier of the datanode.
  • Since the namenode does not store block locations, datanodes send a block report to the namenode when they register with it. A block report contains the block ID, generation stamp and block length for each block replica the datanode holds. Datanodes send block reports to the namenode periodically (every hour) and so provide an up-to-date view of the blocks in the cluster.
  • Along with block reports, datanodes send a lightweight message (called a heartbeat) to the namenode to signal that they, and the blocks they hold, are still available in the cluster. The default heartbeat interval is three seconds.
    What happens if the namenode does not receive a heartbeat from some datanode?
    If the namenode does not receive a heartbeat from a datanode for ten minutes, it marks the datanode as dead and the block replicas hosted by that datanode as unavailable. The namenode then initiates replication of all blocks that were present on the dead datanode.
  • A heartbeat message contains information about the total storage capacity of the datanode and the fraction of space in use, so heartbeats help the namenode with load balancing and block allocation decisions.
  • The namenode does not send messages to datanodes directly. It uses its replies to heartbeats to instruct a datanode to provide a block report, replicate blocks, shut down, etc.
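The dead-datanode rule described above (no heartbeat for ten minutes) can be sketched as a small bookkeeping class. This is an illustration only: the class and method names are hypothetical, and timestamps are passed in explicitly as seconds so the example stays deterministic.

```python
DEAD_AFTER = 10 * 60  # seconds without a heartbeat before a datanode is marked dead


class HeartbeatMonitor(object):
    """Tracks the last heartbeat time of each datanode (hypothetical helper)."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode, now):
        """Record that a heartbeat from the datanode arrived at time 'now'."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Datanodes whose last heartbeat is older than DEAD_AFTER seconds."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > DEAD_AFTER)


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", 0)
monitor.heartbeat("dn2", 0)
monitor.heartbeat("dn2", 597)  # dn2 keeps reporting, dn1 goes silent
```

Once dn1 has been silent for more than ten minutes it shows up in dead_nodes, at which point the namenode would schedule re-replication of its blocks.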
A Hadoop cluster consists of many datanodes but a single namenode. To deal with this single point of failure, Hadoop suggests two approaches:
  1. Maintain a standby node, called the secondary namenode, on a server other than the one where the namenode runs.
  2. Copy the image and log files to a remote server periodically; when a failure occurs, read them from that location and recover.
The NameNode, DataNode and Secondary NameNode can be represented as follows:
Namenode, DataNode and secondary namenode (checkpoint node) block diagram
How does the secondary namenode act as a standby node for the namenode? / What is the use of the secondary namenode in a Hadoop cluster? / What is a checkpoint node?
Any change to HDFS is first logged in the journal, which grows until it is merged with the checkpoint and a new journal is created. Consider a situation where the Hadoop cluster has not been restarted for months and the log file has become very large. When the cluster restarts, the namenode needs to restore the image (the file system metadata) from the checkpoint created earlier and merge it with the journal. The time taken to restart depends on the size of the journal.
Here the secondary namenode comes to the rescue. The secondary namenode runs on a server other than the namenode. Its main purpose is to periodically merge the journal with the checkpoint and create a new journal and checkpoint. The steps the secondary namenode follows are:
  1. The secondary namenode downloads both the checkpoint and the journal from the namenode.
  2. It merges them locally and creates a new checkpoint and an empty journal.
  3. The updated checkpoint is returned to the namenode.
The secondary namenode avoids the overhead of merging logs at the moment the cluster restarts, so the namenode restores faster. Generally, the secondary namenode repeats the steps above every hour and updates the checkpoint.
Note: The secondary namenode does not act as the namenode when a namenode failure occurs. It is just the naming convention that creates confusion. The secondary namenode is also called the checkpoint node.
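The merge performed by the secondary namenode can be sketched as replaying journal entries over a copy of the checkpoint. The data layout here (a dict of paths and a list of operation tuples) and the three operation names are simplifying assumptions for illustration, not the actual fsimage or edit log formats.

```python
def merge_checkpoint(checkpoint, journal):
    """Replay journal entries over a checkpoint; return (new_checkpoint, empty_journal).

    checkpoint: dict mapping a path to its metadata
    journal:    list of (operation, path, argument) tuples in the order they were logged
    """
    image = dict(checkpoint)  # work on a copy, like the downloaded files
    for op, path, arg in journal:
        if op == "create":
            image[path] = arg            # arg is the metadata of the new file
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[arg] = image.pop(path)  # arg is the new path
        else:
            raise ValueError("unknown journal operation: %r" % (op,))
    return image, []


old_checkpoint = {"/a.txt": {"blocks": 1}}
journal = [
    ("create", "/b.txt", {"blocks": 2}),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt", None),
]
new_checkpoint, new_journal = merge_checkpoint(old_checkpoint, journal)
```

The longer the journal, the longer this replay takes, which is why merging periodically keeps namenode restarts fast.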


Git Commands Recap : Tagging, Branching, Merging

 Git Tagging : git tag <tag_version>

The git tag command tags a given commit with a version/tag name. The command below opens the default editor and tags the current commit as v1.0 (which can be validated with the "git log" command).
➜  custom_cloned_repo git:(master) ✗ git tag -a v1.0
  latest commit is tagged with tag: v1.0

What is the significance of the "-a" flag?
The "-a" flag tells Git to create an annotated tag. If we don't provide the flag (i.e. git tag v1.0), Git creates a lightweight tag.

Annotated tags are recommended because they include a lot of extra information such as:
  • the person who made the tag
  • the date the tag was made
  • a message for the tag
List all tags that are in the repository:
➜  custom_cloned_repo git:(master) ✗ git tag v2.0   
➜  custom_cloned_repo git:(master) ✗ git tag

Deleting A Tag : Delete a given tag in repository using flag "-d" or "--delete" 
➜  custom_cloned_repo git:(master) ✗ git tag
➜  custom_cloned_repo git:(master) ✗ git tag -d v2.0
Deleted tag 'v2.0' (was 67d8efc)
➜  custom_cloned_repo git:(master) ✗ git tag        
Using --delete also, we can delete tag like "git tag --delete v2.0"

Adding A Tag To A Past Commit: Appending the (abbreviated) commit hash after the tag name tags that specific commit instead of the current one. The command below adds the tag to the commit with hash 59142f.
➜  custom_cloned_repo git:(master) ✗ git tag -a v3.0 59142f
   v3.0 is added in second commit from top.

Git Branching : git branch <branch_name>

Create branch in repository: The command "git branch" lists all branches in the given repository.
If we append a branch name to the command, it creates a new branch. An asterisk appears next to the name of the active branch.

➜  custom_cloned_repo git:(master) ✗ git branch
* master
➜  custom_cloned_repo git:(master) ✗ git branch fron_end_branch
➜  custom_cloned_repo git:(master) ✗ git branch                
* master

Switch between branches : git checkout <branch_name>
git checkout <branch_name> is used to switch between branches. On executing "git checkout br_name", Git:
  • removes from the working directory all files and directories that Git is tracking.
  • goes into the repository and pulls out all of the files and directories of the commit that the branch points to.
➜  custom_cloned_repo git:(master) ✗ git status         
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)


nothing added to commit but untracked files present (use "git add" to track)
➜  custom_cloned_repo git:(master) ✗ git checkout fron_end_branch  
Switched to branch 'fron_end_branch'
➜  custom_cloned_repo git:(fron_end_branch) ✗ git status
On branch fron_end_branch
Untracked files:
  (use "git add <file>..." to include in what will be committed)


nothing added to commit but untracked files present (use "git add" to track)

Delete a branch : A branch can be deleted using the "-d" flag. To force deletion (for example, of a branch with unmerged commits), use the capital D flag, as in "git branch -D fron_end_branch".
➜  custom_cloned_repo git:(master) ✗ git branch
* master
➜  custom_cloned_repo git:(master) ✗ git branch -d fron_end_branch  
Deleted branch fron_end_branch (was 67d8efc).
➜  custom_cloned_repo git:(master) ✗ git branch                    
* master 

 Git Merging : git merge 

Combining branches is called merging. There are two main types of merges in Git:
  • Regular merge : This combines two divergent branches, and a new commit is made. The merge commit is created on whichever branch the HEAD pointer is pointing at.
  • Fast-forward merge: A fast-forward merge just moves the currently checked out branch forward until it points to the same commit as the other branch (in the example below, front_end).
Fast-forward merge:
➜  custom_cloned_repo git:(master) ✗ git branch                    
* master
➜  custom_cloned_repo git:(master) ✗ git branch front_end
➜  custom_cloned_repo git:(master) ✗ git checkout front_end  
Switched to branch 'front_end'
➜  custom_cloned_repo git:(front_end) ✗ vi front.js
➜  custom_cloned_repo git:(front_end) ✗ git add front.js 
➜  custom_cloned_repo git:(front_end) ✗ git commit -m "commit js file change"
[front_end c8188dd] commit js file change
 1 file changed, 3 insertions(+)
 ➜  custom_cloned_repo git:(front_end) ✗ git checkout master  
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
➜  custom_cloned_repo git:(master) ✗ git merge front_end  
Updating 67d8efc..c8188dd
 front.js | 3 +++
 1 file changed, 3 insertions(+)

Regular merge: Command for regular merge is same as above, just a new commit will be created.
➜  custom_cloned_repo git:(master) ✗ git merge forked_front_end  
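Whether a merge is a fast-forward or a regular merge depends only on how the two branch tips are related in the commit graph. The sketch below makes that rule explicit. It models the graph as a plain dict from a commit to its parents; the two hashes are taken from the output above, the commit name "other1" is hypothetical, and this is not how Git is implemented internally.

```python
def is_ancestor(parents, ancestor, commit):
    """True if 'ancestor' is reachable from 'commit' by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False


def merge_kind(parents, current_tip, other_tip):
    """Classify what merging other_tip into current_tip would do."""
    if is_ancestor(parents, other_tip, current_tip):
        return "already up to date"
    if is_ancestor(parents, current_tip, other_tip):
        return "fast-forward"
    return "regular merge (new merge commit)"


# c8188dd was committed on top of 67d8efc; "other1" is a hypothetical
# commit made on a second branch that also started from 67d8efc.
parents = {"c8188dd": ["67d8efc"], "other1": ["67d8efc"], "67d8efc": []}
```

Merging c8188dd into a branch still at 67d8efc is a fast-forward, as in the front_end example, while merging "other1" into a branch at c8188dd needs a merge commit because the two histories have diverged.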

Mar 15, 2015


Big data evolution - Relational model to Hadoop ecosystem

Database technologies have gone through a dramatic transformation, from early flat-file systems to relational database management systems (RDBMS). Undoubtedly, RDBMS has long been the de facto database system for both small and large-scale industries because of its relational model (structured/tabular storage of data).
As a business grows, so does its volume of data, and an RDBMS becomes a less efficient and less cost-effective database technology. That is why RDBMS is now sometimes considered a declining database technology.
RDBMS is not effective at providing a scalable solution for high volumes of data arriving at high velocity.
What does it mean when we say RDBMS is not scalable? Horizontal scaling (distributing data across multiple nodes/servers) is hard to achieve in a relational database. Vertical scaling (more processing power and storage on one machine) is feasible, but there is an upper limit to it.

To deal with high volumes of data, a new class of database technology evolved, termed "NoSQL". NoSQL is usually read as "not only SQL": unlike relational databases, which deal with structured data only, NoSQL databases also handle unstructured and semi-structured data and centre around the concept of distributed databases. Data is distributed across various processing nodes/servers, trading consistency (BASE instead of ACID) for speed and agility. Because of this distributed architecture, NoSQL is horizontally scalable: as data continues to explode, just add more (commodity) hardware to keep the system up and running.
The performance of relational vs NoSQL databases can be depicted as follows: in relational databases performance decreases as the volume of data increases, while in NoSQL databases it varies far less with the volume of data.
For small volumes of data, relational databases outperform NoSQL databases.
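One simple way to picture distributing data across nodes is hash partitioning: hash each record's key and use the result to pick a node. The sketch below uses that scheme purely as an illustration. The node names and key format are hypothetical, and real NoSQL systems use more elaborate schemes so that adding a node does not move most of the existing keys.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the node responsible for a key by hashing the key (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


nodes = ["node-a", "node-b", "node-c"]
keys = ["user:%d" % i for i in range(100)]
placement = distribute(keys, nodes)
```

Each key always maps to the same node, and the total load is spread over however many nodes are in the list.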
MongoDB, CouchDB, Neo4j, HBase and Cassandra are well-known NoSQL databases. Let's leap ahead and understand what Big data is and how the Hadoop ecosystem came into the picture.

High volumes of data (in TB or PB) flowing at high velocity gave rise to the fancy term Big data. Big data is basically characterized by 4 V's: Volume, Velocity, Veracity and Variety.

  • Volume - Amount of data to dealt are in TB or PB. 
  • Velocity- It describes the frequency at which data is generated, captured and shared.
  • Variety - A large volume of data is present in data storage in form of documents, emails ,video and voices. These are unstructured data and emerging data types include geo-spatial and location, log data, machine data, metrics, mobile, RFIDs,search, streaming data, social, text and so on. 
  • Veracity - As mentioned earlier, we are extending benefit of distributed database at cost of consistency.So, there are chances data retrieved or received is not intended one. So, reliability is always a concern in such system.  
In 2003, Google published the concept of the Google File System, followed in 2004 by MapReduce (a data processing technique for large clusters). Inspired by these, Doug Cutting and Michael J. Cafarella developed the hadoop file system and hadoop mapreduce, and Yahoo played an important role in nurturing it.
Hadoop is a software ecosystem that allows for massively parallel computing using HDFS and MapReduce. Over time various software systems have been added to the hadoop ecosystem, such as Pig, Hive and HBase. Please note that hadoop has an abstract notion of a filesystem, of which HDFS is just one implementation; others include WebHDFS, Azure and Local.
Hadoop Distributed File System (HDFS): It is a highly fault-tolerant, high-throughput distributed file system designed to run on commodity hardware (affordable and easy to obtain). The hadoop ecosystem assumes that hardware failure is the norm rather than the exception, so data is distributed across multiple servers (the hadoop cluster) and replicated to handle node failures. HDFS stores data in fixed-size blocks, and each commodity machine in the hadoop cluster where data is stored is called a data node.
  • HDFS is designed for applications that do batch processing rather than interactive operations.
  • HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified.
  • The focus of HDFS is on high throughput rather than low latency.
    Latency : Amount of time needed to complete a task or produce output.
    Throughput : Number of tasks completed per unit of time.
  • HDFS stores filesystem metadata and application data separately and follows a master-slave architecture. Filesystem metadata is stored in the NameNode and application data is stored in DataNodes. The name node acts as master and the data nodes act as slaves.
    NameNode: It manages the file system namespace (the NameNode inserts the file name into the file system tree and allocates a data block for it), determines the mapping of blocks to data nodes and maintains metadata about the blocks stored in data nodes. It executes file system namespace operations like opening, closing, and renaming files and directories. Each hadoop cluster has only one name node, which acts as master.
    DataNode: It provides the actual storage of data and is responsible for serving read and write requests from the file system’s clients. DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Refer to this post for more detail about namenode and datanode.
  • In a multi-node cluster the name node and data nodes run on different machines, whereas in a single-node cluster both run on the same machine. A multi-node cluster has only one name node and multiple data nodes, which is why the name node is often called the single point of failure of a hadoop cluster.
  • To deal with this single point of failure, hadoop maintains a SecondaryNameNode on a different machine (in a multi-node cluster), which stores images (metadata of data blocks) at certain checkpoints and is used as a backup to restore the NameNode.
    Read and write operation in hadoop gives a broader picture of namenode and datanode.
Figure: Data Nodes storing blocks B1 and B2, while the Name Node keeps track of Data Node whereabouts.
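
To make the fixed-size block idea concrete, here is a small arithmetic sketch in shell. It assumes a 64 MB block size and a replication factor of 3 (the usual Hadoop 1.x defaults); the 200 MB file size is a made-up value for illustration only.

```shell
# Assumed values: a hypothetical 200 MB file, 64 MB blocks, replication factor 3
file_mb=200
block_mb=64
replication=3

# Number of blocks is the file size divided by the block size, rounded up
blocks=$(( (file_mb + block_mb - 1) / block_mb ))

# The last block only holds whatever is left over
last_block_mb=$(( file_mb - (blocks - 1) * block_mb ))

# Every block is stored on 'replication' different data nodes
stored_replicas=$(( blocks * replication ))

echo "blocks=$blocks last_block_mb=$last_block_mb stored_replicas=$stored_replicas"
```

Under these assumptions the NameNode would track 4 blocks for the file, and the DataNodes together would hold 12 block replicas.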

In the next post we will revisit NameNode and DataNode and discuss them in detail.

Mar 3, 2015


Git Commands Recap : Add, rm, commit, diff

Git Add : git add <file_name1> <file_name2>

If a file needs to be committed to the repo, it must first be brought from the working directory to the staging area (staging index). The "git add" command moves files from the working directory to the staging index (the "Changes to be committed" area). Here we have created three files and use the git add command to stage them, which can be verified using the git status command.

➜  custom_cloned_repo git:(master) ✗ git add index.jsp
➜  custom_cloned_repo git:(master) ✗ git add .

The git add command expects one or more arguments: either a list of file names separated by spaces, or a period (.) to stage all the files under the current directory.
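
As a minimal sketch of staging several files at once, the commands below build a throwaway repository in a temporary directory (an assumption made only to keep the example self-contained) and stage two of three files by listing them separated by spaces.

```shell
# Throwaway repository in a temporary directory (illustrative setup only)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Three empty files in the working directory
touch front.css front.js index.jsp

# Stage two of them; file names are separated by spaces, not commas
git add front.css front.js

# Staged files are marked "A", the untracked one is marked "??"
git status --short
staged_files=$(git diff --cached --name-only)
```

Running `git add .` afterwards would stage index.jsp as well.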

How to remove a file from the staging area (un-stage):
Using "git rm --cached <file_name>", a file can be moved from the staging index back to the working area (un-staged). The rm command below un-stages front.css and moves it to the untracked section.
➜  custom_cloned_repo git:(master) ✗ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

 new file:   front.css
 new file:   front.js
 new file:   index.jsp

➜  custom_cloned_repo git:(master) ✗ git rm --cached front.css 
rm 'front.css'
➜  custom_cloned_repo git:(master) ✗ git status               
On branch master
Your branch is up-to-date with 'origin/master'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

 new file:   front.js
 new file:   index.jsp

Untracked files:
  (use "git add <file>..." to include in what will be committed)

 front.css
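
The status output also hints at "git reset HEAD <file>" as a way to un-stage. The sketch below tries both approaches on newly added files in a throwaway repository; the temporary directory, the README.txt file and the commit identity are assumptions for illustration only.

```shell
# Throwaway repository with one initial commit so that HEAD exists
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "readme" > README.txt
git add README.txt
git -c user.name="Example User" -c user.email="example@example.com" commit -q -m "initial commit"

# Stage two new files
touch front.css front.js
git add front.css front.js

# Option 1: remove from the index only; the file stays on disk
git rm -q --cached front.css

# Option 2: the command suggested by git status
git reset -q HEAD front.js

# Both files are now reported as untracked ("??")
git status --short
```

For files that were never committed, both commands leave the file on disk and simply drop it from the staging index.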


Git Commit : git commit /git commit -m "<commit_message>"

The git commit command takes files from the staging index and saves them in the repository. Running "git commit" without arguments opens the configured editor to capture the commit message; if none is configured, git falls back to the editor named in $VISUAL or $EDITOR, and finally to its built-in default (usually vi).
The command "git config --list" displays the git configuration. Refer to this for associating a text editor with git.

git commit with the -m flag specifies the commit message inline (the editor does not open to capture the commit message).
➜  custom_cloned_repo git:(master) ✗ git commit -m "added js and index jsp"
[master 67d8efc] added js and index jsp
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 front.js
 create mode 100644 index.jsp
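
The following sketch puts the two ideas together: it sets an editor for one repository through the core.editor setting and then commits with an inline message. The temporary directory, the user identity and the choice of vim are assumptions for illustration only.

```shell
# Throwaway repository (illustrative setup only)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Repository-local settings; the identity and editor are example values
git config user.name "Example User"
git config user.email "example@example.com"
git config core.editor "vim"

# Stage two files and commit with an inline message, so no editor opens
touch front.js index.jsp
git add front.js index.jsp
git commit -q -m "added js and index jsp"

# Read back the last commit subject and the configured editor
subject=$(git log -1 --format=%s)
editor=$(git config --get core.editor)
echo "last commit: $subject (editor: $editor)"
```

Using `git config --global core.editor` instead would apply the editor to every repository of the current user.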

Git Diff : git diff

git status tells us which files have changed, but does not convey what those changes are. The git diff command is used to find out what those changes are.

The git diff command can be used to see changes that have been made but haven't been committed yet (by default, changes in the working directory that are not yet staged). It also displays:

  • the files that have been modified
  • the location of the lines that have been added/removed
  • the actual changes that have been made

Note: Outcome of "git log -p" is similar to "git diff".
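
A small sketch of how staging affects git diff, using a throwaway repository; the file name notes.txt, the temporary directory and the commit identity are assumptions for illustration only.

```shell
# Throwaway repository with one committed file
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "line one" > notes.txt
git add notes.txt
git -c user.name="Example User" -c user.email="example@example.com" commit -q -m "add notes"

# Modify the tracked file: plain "git diff" reports it
echo "line two" >> notes.txt
unstaged=$(git diff --name-only)

# After staging, plain "git diff" is empty and "git diff --staged" reports it
git add notes.txt
after_add=$(git diff --name-only)
staged=$(git diff --staged --name-only)

echo "unstaged='$unstaged' after_add='$after_add' staged='$staged'"
```

The --name-only flag is used here only to keep the output short; without it git diff prints the full patch.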

Mar 2, 2015


Git Commands Recap : init, clone, status, log (--oneline, --stat, --patch), show

Git Init : git init

Create/Initialise git repository : Command <git init>:
The "init" command is short for "initialise"; it is the command that does all of the initial setup of a repository. Steps to follow for the initial setup:
  1. Create a directory called "devinline-git-recap"
  2. Under this, create another directory called "java-git-project"
Execute the following command to create the directories and go inside.
➜  ~ mkdir -p devinline-git-recap/java-git-project && cd $_
➜  java-git-project pwd

Running the git init command sets up all of the necessary files and directories that Git will use to keep track of everything.
➜  java-git-project git init
Initialized empty Git repository in /Users/n0r0082/devinline-git-recap/java-git-project/.git/

All of these files are stored in a directory called .git (a hidden directory on Unix systems). This .git directory is the "repo": it holds all of the configuration files and directories and is where all of the commits are stored.
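
To peek at what git init creates, the sketch below initialises a repository in a temporary directory (an assumption, so that nothing in a real project is touched) and lists the contents of .git.

```shell
# Initialise a repository in a throwaway directory
repo=$(mktemp -d)
cd "$repo"
git init -q

# List the hidden .git directory (HEAD, config, objects, refs and so on)
ls -A .git

# Ask git whether the current directory is inside a working tree
inside=$(git rev-parse --is-inside-work-tree)
echo "inside work tree: $inside"
```

Deleting the .git directory would turn the folder back into a plain directory with no history.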

Git Clone : git clone <path-to-repository-to-clone>

If we have an existing repo, we can clone it. The git clone command is used to create an identical copy of an existing repository. It is useful when we want to start a project with initial files (index.html, css, js, database jdbc files, etc.) that already live in a repository.

➜  java-git-project cd ..
➜  devinline-git-recap git clone
Cloning into 'JSONPlayWithJackson'...
remote: Counting objects: 31, done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 31 (delta 6), reused 30 (delta 5), pack-reused 0
Unpacking objects: 100% (31/31), done.
➜  devinline-git-recap ls -ltr
total 0
drwxr-xr-x  2 n0r0082  74715970   64 Jun 19 20:32 java-git-project
drwxr-xr-x  8 n0r0082  74715970  256 Jun 19 20:52 JSONPlayWithJackson
  • The git clone command expects the path or URL of a git repository and by default creates a directory with the same name as the repository being cloned.
  • The default directory name can be overridden by providing a second argument to "git clone". The command below creates the cloned project with the name provided in the git clone command.
➜  devinline-git-recap git clone custom_cloned_repo
Cloning into 'custom_cloned_repo'...
remote: Counting objects: 31, done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 31 (delta 6), reused 30 (delta 5), pack-reused 0
Unpacking objects: 100% (31/31), done.
➜  devinline-git-recap ls -ltr                                                                   
total 0
drwxr-xr-x  2 n0r0082  74715970   64 Jun 19 20:32 java-git-project
drwxr-xr-x  8 n0r0082  74715970  256 Jun 19 20:52 JSONPlayWithJackson
drwxr-xr-x  8 n0r0082  74715970  256 Jun 19 21:40 custom_cloned_repo
Reference : Git Clone
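
The two forms of git clone can be tried without any network access by cloning a local repository. In the sketch below the source repository remote/sample_repo, its index.html file and the commit identity are hypothetical and exist only for this illustration.

```shell
# Build a small local repository to clone from (illustrative setup only)
work=$(mktemp -d)
cd "$work"
mkdir -p remote/sample_repo
(
  cd remote/sample_repo
  git init -q
  echo "hello" > index.html
  git add index.html
  git -c user.name="Example User" -c user.email="example@example.com" commit -q -m "initial commit"
)

# One argument: the clone directory takes the repository's name (sample_repo)
git clone -q remote/sample_repo

# Two arguments: the second one overrides the directory name
git clone -q remote/sample_repo custom_cloned_repo

ls
```

Both clones contain the same files and history; only the directory name differs.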

Git Status : git status

The "git status" command will display the current status of the repository.
➜  custom_cloned_repo git:(master) git status
On branch master
Your branch is up-to-date with 'origin/master'.

nothing to commit, working tree clean

This command will tell us about:
  • new files that have been created in the working directory that Git hasn't started tracking yet
  • files that Git is tracking that have been modified

Git Log : git log (--oneline, --stat, --patch/-p)

The git log command is used to display all of the commits of a repository.
➜   git log

By default, "git log" command displays following values of every commit in the repository.
  • the SHA
  • the author
  • the date
  • and the message
Git uses the command line pager, Less, to page through all of the information. The important keys for Less are:
  • to scroll down by a line, use j or ↓
  • to scroll up by a line, use k or ↑
  • to scroll down by a page, use the spacebar or the Page Down button
  • to scroll up by a page, use b or the Page Up button
  • to quit, use q

git log --oneline 

The --oneline flag is used to alter how git log displays information.
➜   git log --oneline

This command:
  • lists one commit per line
  • shows the first 7 characters of the commit's SHA
  • shows the commit's message

git log --stat 

The --stat flag is used to display a per-commit summary of what changed. It:
  • displays the file(s) that have been modified
  • displays the number of lines that have been added/removed
  • displays a summary line with the total number of modified files and lines that have been added/removed
➜   git log --stat

git log --patch / git log -p

The git log command has a flag (--patch/-p) that can be used to display the actual changes made to a file.
➜   git log --patch
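
The log flags above can be tried on a throwaway repository with two commits. In this sketch the file name, commit messages and identity are assumptions for illustration; --no-pager is used so the output is printed directly instead of opening Less.

```shell
# Throwaway repository with two commits (illustrative setup only)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"
echo "first" > notes.txt
git add notes.txt
git commit -q -m "add notes"
echo "second" >> notes.txt
git commit -q -am "extend notes"

# One line per commit: short SHA followed by the commit message
git --no-pager log --oneline

# File-level summary for the most recent commit only
git --no-pager log --stat -1

# Actual line changes for the most recent commit only
git --no-pager log --patch -1

commit_count=$(git --no-pager log --oneline | wc -l | tr -d ' ')
latest_subject=$(git --no-pager log --oneline -1 | cut -d' ' -f2-)
```

The -1 argument limits the log to the latest commit, which keeps the --stat and --patch output short.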

Git Show : git show <Hash_of_commit>

Shows a specific commit. The output of the above command is similar to that of "git log --patch", but without scrolling through history, since only the details of one hash are shown.
➜   git show 59142f8d2dc39584f6b1998a747d1c5e4bfb554d

By default, git show displays:

  • the commit
  • the author
  • the date
  • the commit message
  • the patch information
However, git show can be combined with other flags like:
  • --stat : to show how many files were changed and the number of lines that were added/removed
  • -p or --patch : this is the default, but if --stat is used, the patch won't display, so pass -p to add it again
  • -w : to ignore changes to whitespace
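
A short sketch of how --stat and -p interact in git show, again on a throwaway repository; the file name, commit messages and identity are assumptions for illustration, and HEAD is used in place of a full commit hash.

```shell
# Throwaway repository with two commits (illustrative setup only)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"
echo "first" > notes.txt
git add notes.txt
git commit -q -m "add notes"
echo "second" >> notes.txt
git commit -q -am "extend notes"

# --stat alone: summary of changed files, no patch
stat_output=$(git --no-pager show --stat HEAD)

# --stat together with -p: summary plus the patch
patch_output=$(git --no-pager show --stat -p HEAD)

echo "$stat_output"
echo "$patch_output"
```

HEAD refers to the most recent commit on the current branch, so the same commands work with any commit hash taken from git log.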