Hive – A Petabyte Scale Data Warehouse using Hadoop

Web has been growing rapidly in size as well as scale during the last 10 years and shows no signs of slowing down. Statistics shows us that every year more data is produced than all of the previous years combined. Apache Hadoop, inspired by Google’s MapReduce and Google File System (GFS) papers, is a framework that allows for the distributed processing of such large data sets across clusters of machines. Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks & distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

Using Hadoop was not easy for end users, especially for the ones who were not familiar with MapReduce framework. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis.

Enter Hive!

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies.

Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on the huge volumes of data to extract patterns and meaningful information. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce.

A HiveQL statement is submitted via the Command Line Interface (CLI), the web UI or an external client using the Apache Thrift, ODBC or JDBC interfaces. The Driver first passes the query to the Compiler where it goes through the typical parse, type check and semantic analysis phases, using the metadata stored in the Metastore. An optimized plan in the form of a directed acyclic graph of map-reduce tasks and HDFS tasks is generated. The Execution Engine then executes these tasks in the order of their dependencies, using Hadoop. In short, any HiveQL query is divided into MapReduce tasks which run on the robust Hadoop framework.

Hive has access to files stored either directly in HDFS or in other data storage systems such as HBase. HBase tables can be accessed as if they were native Hive tables.  As a result, a single Hive query can perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. But the recommended approach is to pump the data from HBase (or any other storage) into Hive and then analyze it there. The benefit is 2-fold: (a) once the data is in Hive, queries against it will typically run faster (since Hive is optimized for warehousing) and (b) no risk of excessive load originating from cluster nodes onto the external storage system

Loading bulk data into Hadoop from production systems or accessing it from map-reduce applications running on large clusters can be a challenging task. Transferring data using scripts is inefficient and time consuming. How do we efficiently move data from an external storage into Hive? Meet Apache Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision the data from an external system on to HDFS, and populate tables in Hive and HBase. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.

This entry was posted in Technical and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s