Nov 13, 2015

Textual description of firstImageUrl

Apache Hive - Introduction, Installation and database operations

Apache Hive:- It is a data warehouse infrastructure built on top of HDFS and MapReduce(HDFS provides distributed storage and MapReduce is execution model) and facilitates querying & managing large datasets residing in distributed storage.It provides an SQL-like language called HiveQL which transparently converts queries to map/reduce, Apache Tez and Spark jobs. It allows programmers to plug in their custom mappers and reducers for sophisticated analysis that may not be supported by the built-in capabilities of the language. Apache hive does not restrict for any particular file format, works equally well with even customized file format. Read more about Apache hive here and a nice discussion about it.
The main agenda of this post is to set-up hive in Linux(Ubuntu) already having hadoop setup running and execute SQL like query in hive environment.Since apace Hive built on top of Apache Hadoop, hadoop setup is prerequisite for running Hive.
Step 1:- Download stable release of Hive from here. I have downloaded apache-hive-1.2.1-bin.tar.gz, and unzip it, unzipped folder it creates is apache-hive-1.2.1-bin.
zytham@ubuntu:~/Downloads$ tar -xvzf apache-hive-1.2.1-bin.tar.gz
Step 2:- Move unzipped at some convenient location. I moved it at location : /usr/local/hive
zytham@ubuntu:~/Downloads$ sudo mv apache-hive-1.2.1-bin /usr/local/hive
Step 3:- Open bashrc file and update it with following entries.
#for Hive
export HADOOP_HOME=/usr/local/hadoop2.6.1
export HIVE_HOME=/usr/local/hive1.2
export PATH=$HIVE_HOME/bin:$PATH
Step 3:- Start hadoop services.Hive internally uses hdoop(or other computation engine too). Here we are relying on hadoop.
hduser1@ubuntu:/usr/local/hadoop2.6.1/sbin$ ./start-all.sh
Step 4:- Create two directories which is used by Hive for "location of default database for the warehouse" and "to restore the HBase table snapshot"- default location of directories is specified in configuration file hive-default.xml.template under $HIVE_HOME/conf/(search for property names hive.metastore.warehouse.dir and hive.hbase.snapshot.restoredir.)
Execute following command to create two directories.
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hadoop fs -mkdir -p /user/hive/warehouse
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hadoop fs -mkdir -p /tmp
Note:- Above command executed from hduser1 context, since I have installed hadoop under user hduser1.
Step 5:- Hadoop is installed in the context of user hduser1,give write permission to other users so that hive query can be executed from other user context. Execute following command for the same.
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hadoop fs -chmod -R a+w /user/hive/warehouse
hduser1@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hadoop fs -chmod -R a+w /tmp
Step 6:- Start apache Hive prompt, just type hive in terminal and press enter.
zytham@ubuntu:~$ hive
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 0.099 seconds, Fetched: 1 row(s)
Note:- 1. Error : "[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected"
Reason behind it is Hive has upgraded to Jline2 but jline 0.94 exists in the Hadoop lib.
Solution:- Execute following command in terminal followed by hive shell command.
zytham@ubuntu:~$ export HADOOP_USER_CLASSPATH_FIRST=true
zytham@ubuntu:~$ hive
Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
Refer this for more error and resolution in Hive.
2. Press Ctril + C to exit form Hive prompt and return to terminal.

Database operations - HiveQL(Hive query language)  

HiveQL is a dialect of SQL,heavily influenced by MYSQL. Although, SQL does not provide a solution for all big data problems but it provides an efficient and powerful tool to analyst with strong SQL background and mere Java knowledgeto for analysing high volume of data.
Internally, HiveQL is translated into series of MapReduce job or other jobs depending on property name "hive.execution.engine.property" configuredOther possible execution engines that can be used are Apache Tez, Spark.

Hive Shell and interaction with Hive:- Start Hive shell (Just type hive in terminal and press enter), First create a new schema (database), get inside it(use that schema) and followed by create a table.
hive> create database retails;
OK
Time taken: 1.219 seconds
hive> use retails;
OK
Time taken: 0.144 seconds
hive> create table customers(name string, company string, id int);
OK
Time taken: 0.341 seconds
hive> show tables;
OK
customers
Time taken: 0.137 seconds, Fetched: 1 row(s)
hive> show databases;
OK
default
retails
Time taken: 0.168 seconds, Fetched: 2 row(s)
Above table creation query is nothing more than a plain SQL query. Since, the main purpose of using Hive is to deal with high volume of data(log files,retails/sells data,forecasting data) and analyse these raw data with HiveQL,which usually done by writing mapreduce programs.
Lets create a table employee and load data in it from a text file followed by execute HiveQL query on employee table.
hive> create table employee(employee_id int, first_name string, email string) ROW FORMAT DELIMITED FIELDS TERMINATED by '|';
OK
Time taken: 0.377 seconds
hive> show tables;
OK
customers
employee
Time taken: 0.137 seconds, Fetched: 2 row(s)
hive> LOAD DATA LOCAL INPATH '/home/zytham/Documents/employee_record.txt' OVERWRITE INTO TABLE employee;
Loading data to table retails.employee
Table retails.employee stats: [numFiles=1, numRows=0, totalSize=257, rawDataSize=0]
OK
Time taken: 0.551 seconds
hive> select * from employee;
OK
16 john john@devinline.com
17 robert robert@devinline.com
18 andrew andrew@devinline.com
19 katty katty@devinline.com
27 edward edward@devinline.com
29 alan alan@devinline.com
31 kerry kerry@devinline.com
34 tom tom@devinline.com
35 zack zack@devinline.com
Time taken: 0.263 seconds, Fetched: 9 row(s)
Above create table query creates a table employee with 3 column and ROW FORMAT clause tells Hive that each row in the data file is | (pipe-delimited) text. Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by | (pipe) and rows by newlines.
Load DATA ... query tells Hive to put the specified local file(employee_record.txt) in its warehouse directory named employee (Remember, Hive does not mandate any particular file format so whatever file input will be provided it will be simply copied to warehouse directory, even if it does not matches definition of table columns, if there is any mismatch it will appear while querying). OVERWRITE keyword wipe out existing files in warehouse and add the specified one.
Note:-
1. Along with normal text files we can also load Map file,JSON files  in warehouse directory and analyses using known UDF(user defined function) - get_json_object, json_tuple, JSON SerDe.
2. If no schema name is specified(if we do not execute use retails and try to create table), table is created in default schema.
3. Hive creates one directory for each table under /user/hive/warehouse, execute following command outside hive shell and see the directory structure.
4. Interactive VS non-interactive mode:- Until now we executed queries in Hive shell and result displayed at same place, this mode of execution is termed as interactive mode execution.
With -f option we can run hive shell in non-interactive mode and here we specify SQL in a script file.Create a script file say,script.hive and copy following SQL query in that file.
show databases;
use retails;
show tables;
select * from employee;
LOAD DATA LOCAL INPATH '/home/zytham/Documents/employee_record.txt' INTO TABLE employee;
select * from employee;
From terminal execute following command- execute script file in non-interactive mode in Hive.
zytham@ubuntu:~$ hive -f script.hive 

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
default
retails
Time taken: 1.063 seconds, Fetched: 2 row(s)
OK
Time taken: 0.086 seconds
OK
customers
employee
Time taken: 0.132 seconds, Fetched: 2 row(s)
OK
16 john john@devinline.com
17 robert robert@devinline.com
18 andrew andrew@devinline.com
19 katty katty@devinline.com
27 edward edward@devinline.com
29 alan alan@devinline.com
31 kerry kerry@devinline.com
34 tom tom@devinline.com
35 zack zack@devinline.com
Time taken: 1.091 seconds, Fetched: 9 row(s)
Loading data to table retails.employee
Table retails.employee stats: [numFiles=2, numRows=0, totalSize=514, rawDataSize=0]
OK
Time taken: 1.014 seconds
OK
16 john john@devinline.com
17 robert robert@devinline.com
18 andrew andrew@devinline.com
19 katty katty@devinline.com
27 edward edward@devinline.com
29 alan alan@devinline.com
31 kerry kerry@devinline.com
34 tom tom@devinline.com
35 zack zack@devinline.com
16 john john@devinline.com
17 robert robert@devinline.com
18 andrew andrew@devinline.com
19 katty katty@devinline.com
27 edward edward@devinline.com
29 alan alan@devinline.com
31 kerry kerry@devinline.com
34 tom tom@devinline.com
35 zack zack@devinline.com
Time taken: 0.203 seconds, Fetched: 18 row(s)
 
Note:- Command hive -f script.hive instruct hive to execute file script.hive in non-interactive mode and result is displayed for each execution.One important point that need special mention here is,in script file OVERWRITE keyword is missing from LOAD DATA... query and new file is copied to warehouse directory employee(without deleting existing one) and on querying select * from employee; results from both file are retrieved and displayed. We can verify the same with following(since file names are same, so stored with suffix copy_1)
zytham@ubuntu:/usr/local/hadoop2.6.1$ ./bin/hadoop dfs -ls /user/hive/warehouse/retails.db/employee
Found 2 items
-rwxrwxrwx   1 zytham supergroup        257 2015-11-14 09:31 /user/hive/warehouse/retails.db/employee/employee_record.txt
-rwxrwxrwx   1 zytham supergroup        257 2015-11-14 10:40 /user/hive/warehouse/retails.db/employee/employee_record_copy_1.txt

Managed tables VS External tables

By default,tables created in Hive are managed by Hive - On executing LOAD ... query,  Hive moves data(files) in warehouse directory of that table and deletes data & metadata on executing Drop command.As shown earlier following commands creates table and copy file in warehouse directory.
create table employee(employee_id int, first_name string, email string) ROW FORMAT DELIMITED FIELDS TERMINATED by '|';
LOAD DATA LOCAL INPATH '/home/zytham/Documents/employee_record.txt' OVERWRITE INTO TABLE employee;
On executing Drop command - data along with metadata are deleted.
DROP Table employee;
Hive provides flexibility to creates an external table and allows user to manages data itself.While creating external table - location of external table is specified and Hive does not care whether location specified exist or not(Even if location does not exist, table will be created successfully).
While executing drop command, only metadata is wiped out not actual data.This feature allows data creation lazily after creating table.
hive> CREATE EXTERNAL TABLE external_customers (customer_id STRING, Company String, Email String, Place string) ROW FORMAT DELIMITED FIELDS TERMINATED by '|'  LOCATION '/tmp/hive/externalTable' ;
OK
Time taken: 0.275 seconds
hive> show tables;
OK
customers
employee
external_customers
Time taken: 0.253 seconds, Fetched: 3 row(s)

hive> LOAD DATA LOCAL INPATH '/home/zytham/Documents/customers_record.txt' INTO TABLE external_customers;
Loading data to table retails.external_customers
Table retails.external_customers stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.464 seconds

hive> select * from external_customers;
OK
16 john john@devinline.com Maricopa County
17 robert robert@devinline.com Taos County
18 andrew andrew@devinline.com Pleasant Valley
19 katty katty@devinline.com Kit Carson County
27 edward edward@devinline.com Barry County
29 alan alan@devinline.com Lakeview
31 kerry kerry@devinline.com Bledsoe County
34 tom tom@devinline.com Kit Carson County
35 zack zack@devinline.com Taos County
Time taken: 0.197 seconds, Fetched: 18 row(s)
Here we have created external table(External Keyword) and loaded data locally. Since it is external table no file is copied by Hive in table location "/tmp/hive/externalTable"(as contrast to earlier table creation associated with data file copying in warehouse directory).
zytham@ubuntu:/usr/local/hadoop2.6.1$ ls -l /tmp/hive/externalTable
total 0

ALTER, DROP, TRUNCATE, Create table as SELECT :- 

A table's definition can be changed after table has been created- table directory structure can be changed(External table underlying directory cannot be changed). New column can be added but data associated with it cannot be updated(Hive does not allow updating existing record, instead always recommended to create new table with columns and populate it with select statement).
hive> create table dummy(id string);
OK
Time taken: 0.203 seconds

hive> show tables;
OK
customers
dummy
employee
external_customers
Time taken: 0.139 seconds, Fetched: 4 row(s)

hive> alter table dummy RENAME TO dummy_1;
OK
Time taken: 0.657 seconds
hive> show tables;
OK
customers
dummy_1
employee
external_customers
Time taken: 0.158 seconds, Fetched: 4 row(s)

hive> alter table dummy_1 ADD COLUMNS(place String);
OK
Time taken: 0.264 seconds

Using Drop command data and metadata both wiped out in managed tables and only metadata data is wiped out in external table.In managed tables, using Truncate command we can keep definitions of table and delete data of table. Truncate does not work with external tables, instead we need to use dfs -rmr to remove external table directory.
hive> Truncate table dummy_1;
OK
Time taken: 0.401 seconds

hive> Truncate external_customers;
FAILED: ParseException line 1:9 missing TABLE at 'external_customers' near '<EOF>'
hive> Truncate table external_customers;
FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table external_customers.

hive> dfs -rmr  /tmp/hive/externalTable;
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /tmp/hive/externalTable

Create a table with same schema as another existing table using Like keyword.
hive> create table employee_duplicate Like employee;
OK
Time taken: 0.412 seconds
hive> show tables;
OK
customers
dummy_1
employee
employee_duplicate
external_customers
Time taken: 0.169 seconds, Fetched: 5 row(s)

Debugging Hive - Error and its solution

  • Exception in thread "main" java.lang.RuntimeException: java.net.ConnectException: Call From ubuntu/127.0.1.1 to localhost:54310 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    Solution: Do verify Hadoop services are running.
    Go to <Hadoop_Home>/sbin/ and execute ./start-all.sh.
  • [ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
    Solution:- Hive has upgraded to Jline2 but jline 0.9x exists in the Hadoop lib.Execute following command in terminal before executing Hive.
    zytham@ubuntu:/tmp$ export HADOOP_USER_CLASSPATH_FIRST=true


References :-
1. Apache hive confluence
2. Apache Hadoop by Tom White
Location: Hyderabad, Telangana, India