Nov 23, 2015

Textual description of firstImageUrl

Schema on write, Schema before read and Schema on the fly - Different level of schema adherence/enforcement

Organizations dealing with large volume of data are moving with rapid pace so as the their volume of data.In past 10 years,they have shown more inclination towards non-relational database over relational database.With advancement of NoSQL databases(MongoDB,Cassandra) and big data technologies(Hive, ApacheDrill, HBase) data analysis can be performed as efficiently as carried out with relational databases. The main agenda of this post is to discuss - how does schema enforcement is adopted in various relational/non-relational databases technologies and pros/cons associated with each of them.Schema declaration or finding for any database/data-source can be broadly classified in two category :
  1. Schema declared in advance - Schema on write and Schema before read
  2. Schema discovered on the fly - Schema on the fly 
Overview of schema declaration/discovery paradigm

Schema declared in advance

Traditionally relational databases like Oracle, DB2(IBM),SQL server(Microsoft),Vertica(HP), etc enforces the very first thing to do - create schema of database objects(tables, views,etc) and then perform database operation like data loading.This design methodology is termed as schema on write(data structuring is verified against schema when it is written to database).Sometimes RDBMS modelling is termed as- prescriptive data modelling (1.create static DB schema, 2.transform data into RDBMS and 3.Query data in RDBMS format).

In order deal with large volume of data various databases technologies emerged adopting SQL like paradigm.Apache Hive is one of them,which supports SQL like language(HiveQL) and uses HDFS & Mapreduce internally. Hive does not enforce schema check while data load operation, just load the data and Done!!.When a query is issued for read operation schema check is performed.This design methodology is termed as schema before read(data structuring is verified against schema when it is read) and it follows prescriptive data modelling(1.Copy data in its native format, 2.Create schema & parser and 3.Query Data in its native format).
Note:- In NoSQL databases partial schema enforced by the DBMS on Write and schema is fully enforced on Read.

Schema discovered on the fly

Apache drill, a distributed SQL engine designed for data-intensive distributed applications for interactive analysis of large-scale datasets.Drill supports a variety of NoSQL(HBase, MongoDB) databases and file systems (MapR-DB,HDFS,MapR-FS,local files).In Apache drill it is not mandatory to define schema of database/data-source(It works with fixed schema, evolving schema or schema less data sources) and it can determine schema at execution time i.e: schema on the fly(schema is determined at query execution time).

Pro and cons of Schema on write and Schema before read

1. Schema before read make initial load is very fast because data need not be read, parsed and serialized to disk in database internal format, load operation is just a plain copy operation. However, in Schema on write in relational databases are expensive and comparatively slow, since data is parsed and serialized to disk in database internal format.
2. Schema before read provides more flexibility. An enterprise dealing with large volume of data and having various different user category(differing roles and interests who want to get different insights from that data).With schema-on-read we can present data in a schema that is adapted best to the queries being issued.We are not forced to deal with a one-size-fits-all schema and avoids upfront modelling exercise. However, with Schema on write approach, we have to foresee all of these user prospect in advance and define a schema that has something for everyone, it is highly impractical and inefficient.
3. Schema on write approach minimizes query execution time since database columns are indexed and indexed column retrieval is relatively faster. However, schema on read approach, indexing is not that much efficient(Hive support two index types: bitmap and compact).

Location: Hyderabad, Telangana, India