Nov 20, 2015

Textual description of firstImageUrl

Apache Drill installation in embedded mode (in Ubuntu 13.04) and SQL query execution in drill prompt


A distributed SQL engine designed for data-intensive distributed applications for interactive analysis of large-scale datasets.It follows Schema-free JSON document model similar to MongoDB and has pluggable architecture that enables connectivity to multiple datastores. Apache drill project inspired form Google's Dremel system which is available as an infrastructure service called Google BigQuery. key features of Apache drill(Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage) are-
  • Drill supports a variety of NoSQL databases and file systems like HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. 
  • Data from multiple datastores(NoSQL,HDFS or HBase) can be joined using single query.
  • Drill processes the data in-situ(in place) without requiring users to define schemas or transform data.
  • Drill support industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
The main agenda of this post is to install/set-up Apache drill in Linux(Ubuntu) machine and execute sample SQL like query on non-relational data storage  -JSON/csv file. Apache drill can be installed in Embedded mode or Distributed mode.In this post, we are going to install in embedded mode.Read Apache drill installation in distributed mode and JDBC client connection.
Note:- Oracle Java SE Development (JDK) Kit 7 + must be available in our system, if not available refer Oracle JDK installation in Ubuntu.

Apache drill installation in embedded mode 

1. Download a stable distribution of Apache drill from here  OR execute one of the following commands and get specified version of Apache drill.
zytham@ubuntu:~$ wget http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz

zytham@ubuntu:~$ curl -o apache-drill-1.1.0.tar.gz http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
2. Unzip downloaded distribution and move it to any location as per your convenience.Let's say we have placed apache drill at location "/opt/drill".Now Apache drill installation directory is /opt/drill.
zytham@ubuntu:~$ tar -xvzf ./Downloads/apache-drill-1.1.0.tar.gz
zytham@ubuntu:~$ mv apache-drill-1.1.0 /opt/drill
3. Go to drill home directory /opt/drill and start Apache drill in embedded mode using following command.On success,drill prompt appears.
zytham@ubuntu:/opt/drill$ ./bin/drill-embedded 
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Nov 20, 2015 9:54:10 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0 
"say hello to my little drill"
0: jdbc:drill:zk=local> 
Error occurs while staring drill shell in embedded mode :- This error occurs, if no sufficient memory is available with JVM to allocate.(I am running Apache drill in VM with less memory configuration,)
zytham@ubuntu:/opt/drill$ ./bin/drill-embedded 
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006c0000000, 1431633920, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1431633920 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /opt/drill/hs_err_pid3122.log
Solution:- In order to resolve this issue, we need to modify values assigned to properties DRILL_MAX_DIRECT_MEMORY and DRILL_HEAP in <DRILL_HOME>/conf/drill-env.sh.
  • Open drill-env.sh and update these two fields :- DRILL_MAX_DIRECT_MEMORY and DRILL_HEAP so that memory allocated to starting script is reduced.
    DRILL_MAX_DIRECT_MEMORY="1G"
    DRILL_HEAP="512M"
  • Save it and again start drill shell in embedded mode. 

SQL query execution in drill prompt

From drill prompt, we can execute SQL query on flat files (JSON, CSV, Praquet,etc.). Here we are using sample JSON file employee.json shipped with Apache drill, packaged in the Foodmart data JAR in Drill's classpath availale at <DRILL_HOME>/jars/3rdparty/foodmart-data-json-0.4.jar.
Note:- Below cp.'employee.json' refers to employee json file available in class path.
0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| sys                 |
+---------------------+
6 rows selected (3.015 seconds)
0: jdbc:drill:zk=local> select employee_id, first_name,last_name,position_id,salary FROM cp.`employee.json` where salary > 30000;
+--------------+-------------+------------+--------------+----------+
| employee_id  | first_name  | last_name  | position_id  |  salary  |
+--------------+-------------+------------+--------------+----------+
| 1            | Sheri       | Nowmer     | 1            | 80000.0  |
| 2            | Derrick     | Whelply    | 2            | 40000.0  |
| 4            | Michael     | Spence     | 2            | 40000.0  |
| 5            | Maya        | Gutierrez  | 2            | 35000.0  |
| 10           | Darren      | Stanz      | 5            | 50000.0  |
| 21           | Pedro       | Castillo   | 2            | 35000.0  |
| 22           | Laurie      | Borges     | 2            | 35000.0  |
| 36           | Donna       | Arnold     | 7            | 45000.0  |
+--------------+-------------+------------+--------------+----------+
8 rows selected (0.231 seconds)
0: jdbc:drill:zk=local> select employee_id, first_name,last_name,position_id,salary FROM cp.`employee.json` where salary > 30000 and position_id=2;
+--------------+-------------+------------+--------------+----------+
| employee_id  | first_name  | last_name  | position_id  |  salary  |
+--------------+-------------+------------+--------------+----------+
| 2            | Derrick     | Whelply    | 2            | 40000.0  |
| 4            | Michael     | Spence     | 2            | 40000.0  |
| 5            | Maya        | Gutierrez  | 2            | 35000.0  |
| 21           | Pedro       | Castillo   | 2            | 35000.0  |
| 22           | Laurie      | Borges     | 2            | 35000.0  |
+--------------+-------------+------------+--------------+----------+
5 rows selected (0.276 seconds)
Note:- Here we have not specified any meta-data information(of columns), it is one of the brightest features of Apache drill.With Apache drill we can directly query self-describing data (eg, JSON, Parquet) without having to create and manage schemas. Drill is the world's only query engine that compiles and re-compiles queries at runtime.Drill's flexible JSON data model and on-the-fly schema discovery enable it to query self-describing data. This allows Drill to achieve high performance without knowing the structure of the data in advance.(Source: Apache drill faq).

Database Join operation:- In continuation of that, lets perform join operation on two json files(employee.json and department.json).First execute a query with aliasing followed by join two two files and get department_description from department_id.
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name,emp.salary,emp.department_id FROM cp.`employee.json` emp where emp.salary <40000 and emp.salary>21000;
+--------------+-------------+----------+----------------+
| employee_id  | first_name  |  salary  | department_id  |
+--------------+-------------+----------+----------------+
| 5            | Maya        | 35000.0  | 1              |
| 6            | Roberta     | 25000.0  | 2              |
| 20           | Beverly     | 30000.0  | 1              |
| 21           | Pedro       | 35000.0  | 1              |
| 22           | Laurie      | 35000.0  | 1              |
+--------------+-------------+----------+----------------+
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name,emp.salary,emp.department_id,dept.department_description FROM cp.`employee.json` emp , cp.`department.json` dept where emp.salary <40000 and emp.salary>21000 and emp.department_id = dept.department_id;
+--------------+-------------+----------+----------------+-------------------------+
| employee_id  | first_name  |  salary  | department_id  | department_description  |
+--------------+-------------+----------+----------------+-------------------------+
| 5            | Maya        | 35000.0  | 1              | HQ General Management   |
| 6            | Roberta     | 25000.0  | 2              | HQ Information Systems  |
| 20           | Beverly     | 30000.0  | 1              | HQ General Management   |
| 21           | Pedro       | 35000.0  | 1              | HQ General Management   |
| 22           | Laurie      | 35000.0  | 1              | HQ General Management   |
+--------------+-------------+----------+----------------+-------------------------+

SQL query on CSV file:-
Create a csv(comma separated values) file at location /tmp/sample.csv and copy following data in it.(Physical attributes of four people.)
1, NIKS, 79 , 5.8
2, CKM , 83 , 5.7
3, RITZ, 85 ,5.10
4, YUNNUS, 71, 5.11
Individual columns from CSV file is extracted - considering each row of csv file as zero indexed based array like columns[0], columns[1] and so on.
0: jdbc:drill:zk=local> select * FROM dfs.`/tmp/sample.csv`;
+---------------------------------+
|             columns             |
+---------------------------------+
| ["1"," NIKS"," 79 "," 5.8 "]    |
| ["2"," CKM "," 83 "," 5.7"]     |
| ["3"," RITZ"," 85 ","5.10"]     |
| ["4"," YUNNUS"," 71"," 5.11 "]  |
+---------------------------------+
4 rows selected (0.189 seconds)
0: jdbc:drill:zk=local> select columns[0] as id, columns[1] name, columns[2] as weight, columns[3] as height FROM dfs.`/tmp/sample.csv`;
+-----+----------+---------+---------+
| id  |   name   | weight  | height  |
+-----+----------+---------+---------+
| 1   |  NIKS    |  79     |  5.8    |
| 2   |  CKM     |  83     |  5.7    |
| 3   |  RITZ    |  85     | 5.10    |
| 4   |  YUNNUS  |  71     |  5.11   |
+-----+----------+---------+---------+
4 rows selected (0.288 seconds)

Querying multiple data source in single query :-

In Apache drill multiple data source(CSV,JSON, etc) can be accessed and joined in single query. Below query access employee.json, department.json and sample.csv file and display employee details with physical attributes.
0: jdbc:drill:zk=local> select emp.employee_id, phy.columns[1] as Name ,dept.department_description,phy.columns[2] as Weight , phy.columns[3] as Height FROM cp.`employee.json` emp , cp.`department.json` dept, dfs.`/tmp/sample.csv` phy where CAST(emp.employee_id AS INT) =  CAST(phy.columns[0] AS INT) and emp.department_id = dept.department_id;
+--------------+----------+-------------------------+---------+---------+
| employee_id  |   Name   | department_description  | Weight  | Height  |
+--------------+----------+-------------------------+---------+---------+
| 1            |  NIKS    | HQ General Management   |  79     |  5.8    |
| 2            |  CKM     | HQ General Management   |  83     |  5.7    |
| 4            |  YUNNUS  | HQ General Management   |  71     |  5.11   |
+--------------+----------+-------------------------+---------+---------+
3 rows selected (0.679 seconds)
In above result obtained from the given query, Name,Weight and Height is from csv file and other two fields are from two different JSON file.
Explicit type casting is mandatory when we are dealing with different sources,otherwise we will get error some thing like this.
Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts between 1. Numeric data 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: BIGINT, Right type: VARCHAR. Add explicit casts to avoid this error

Apache drill web console:- Apache drill provides web Console, where we can customize data source and add new Storage Plugin(i.e:- add support for MySQL database). Below is the web console url with default port 8047.
http://<IP address or host name>:8047
In terminal type ifconfig and get ip(inet) address, update above url accordingly and open it in browser.
Apache drill web console and configuration window 
In above web console diagram-1, cp refers to classpath, dfs refers to file system(local or others).When we click on update button of any storage plugin we get configuration window - where we can update connection properties,supported file format and its extensions, workspaces(where files are present physically).
While querying CSV file in above example,we created sample.csv at location /tmp/ because there is a entry for workspace location /temp/ and write permission is enabled(Refer diagram-2.). We can modify these settrings as per our convenience.
Note:-
1. By convention this /tmp/ directory is used to create "view" and it is created locally.
CREATE or REPLACE view dfs.tmp.<VIEW_NAME> as <SLLECT_QUERY>
2. It is not mandatory to use configured location in SQL query, we can place data source file at some location and in SQL query we can specify the same as follows:
SELECT * FROM dfs.`/users/zytham/Downloads/data/<data_source_json>`;
Stopping Apache drill:- 
Execute following command to exit prompt and reach at terminal.
0: jdbc:drill:zk=local> !quit

In next post, we will see how to connect Apache drill with JDBC client and query in the same way we did form Apache drill prompt.

References :- Apache drill official documentation
Location: Hyderabad, Telangana, India