Hadoop with Hive

September 19th, 2011 (Guest)

Nowadays, lots of Hadoop distributions are emerging. By “lots of Hadoop distributions”, I mean companies releasing their own versions of Hadoop (e.g. Cloudera) by building a layer over the original Apache Hadoop distribution. We can also call these “customized” versions of Apache Hadoop. But the core part remains the same across the different Hadoop flavors. The Apache Software Foundation (ASF) focuses on improving Hadoop by bringing many smaller sub-projects under it to facilitate open-source tool development around Hadoop. Hive happens to be one of Hadoop’s more prominent child projects.

Hive is a data warehouse infrastructure, initially developed by Facebook. The Hadoop-with-Hive combination gives us the advantages of a distributed file system, Map-Reduce, and SQL. As we know, to process huge amounts of data in Hadoop, we have to write a new Map-Reduce program (job) for each and every process/operation. For users with a limited number of operations, or sequences of the same operation, this task is an easy one. But for those whose requirements are more prone to change, the challenge is that they have to write a new Map-Reduce program for every new requirement. Unfortunately, this is the only way to deal with unstructured data.

But for structured data, like logging (log4j) files, relational-type data, and other similar, more predictable sets, the data can be stored in table-like structures. This is the area where Apache Hive really shines. Hive is a layer running on top of Hadoop that helps process the data in Hadoop using SQL-like queries written in Hive Query Language (HQL). While loading data into HDFS through Hive as a table, Hive also stores metadata describing the structure of the input data. Note that Hive must be installed on the Hadoop master node. Hive converts an input query into a Map-Reduce job and submits it to Hadoop, making it easy for users to analyze and process data. A minimal sketch of this workflow appears below.
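
To make this concrete, here is a minimal HQL sketch of the workflow just described. The table name, column names, and file path are hypothetical choices of mine, not anything prescribed by Hive:

    -- Define a table over tab-separated log records (names are hypothetical).
    CREATE TABLE access_logs (
      host         STRING,
      request_time STRING,
      request      STRING,
      status       INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load a local file into the table; Hive records the table structure
    -- as metadata and places the data in HDFS.
    LOAD DATA LOCAL INPATH '/tmp/access_log.tsv' INTO TABLE access_logs;

    -- Hive compiles this query into a Map-Reduce job and submits it to Hadoop.
    SELECT status, COUNT(1) AS hits
    FROM access_logs
    GROUP BY status;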

Hive Prerequisites

  • Hadoop 0.20 and above
  • Java 1.6 and above
  • A lightweight database such as MySQL or Derby on the master node, used only to store Hive metadata
[Diagram: Hadoop with Hive architecture]

Advantages of Hive

  • Supports rich data types like List, Map, and Struct, apart from basic data types (see the first sketch after this list).
  • Provides a Web UI and a Command Line Interface for querying data. These are helpful tools for developers and learners when testing and debugging their queries.
  • The Thrift server that comes with Hive supports JDBC and ODBC connections, so any application can use Hadoop as a backend database through Hive. Thrift takes care of language conversion, which allows programs written in virtually any language to interact with Hadoop.
  • Even for complex structured input data, we can write our own SerDe (serializer/deserializer) programs for parsing input data, storing its table structure in the metadata repository, and loading the data onto the Hadoop Distributed File System (HDFS).
  • Supports queries with SQL filters, Joins, Group By, Order By, inner tables, functions, and other SQL-like operators. Using HQL, we can also redirect query output to a new table (see the second sketch after this list). Along with all these SQL features, we can attach our own functions and Map-Reduce programs as part of an HQL query.
  • Partitions and buckets: partitioning splits data into different chunks based on input value ranges, which allows queries to skip unwanted data; bucketing splits data based on a hash function. Both help to improve query performance (see the third sketch after this list).
  • Apache continues to develop optimizers for Hive for better performance. We can also improve our Hadoop and Hive performance by tuning a few configuration parameters based on our application requirements. To learn more, read my recent article on Hadoop and Hive Performance Tuning.
  • Hive is used by major companies like Facebook, Yahoo, and Amazon. Hadoop and Hive play a major role in the proliferation of cloud computing. Amazon provides S3 (Simple Storage Service) and Elastic MapReduce as services in its cloud environment; Elastic MapReduce is essentially a cloud server pre-installed with Hadoop and Hive. It allows us to load our data into Hadoop (Elastic MapReduce) and execute queries on it with the help of Hive. Amazon Elastic MapReduce is a successful product that uses Hadoop and Hive jointly. Click here to learn more about how this technology works.
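
The first sketch shows Hive's rich data types. The table and field names are hypothetical, and the ROW FORMAT clause simply selects Hive's built-in delimited-text SerDe:

    -- Hypothetical table using Hive's complex types (ARRAY, MAP, STRUCT).
    CREATE TABLE user_profiles (
      user_id     BIGINT,
      emails      ARRAY<STRING>,
      preferences MAP<STRING, STRING>,
      address     STRUCT<city:STRING, zip:STRING>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

    -- Complex fields are addressed with [index], ['key'], and dot notation.
    SELECT user_id, emails[0], preferences['lang'], address.city
    FROM user_profiles;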
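The second sketch illustrates HQL's SQL-like operators and redirecting query output into a new table. Again, the tables and columns (page_views, users, and so on) are examples I made up for illustration:

    -- Join, filter, group, and order, storing the result in a new table.
    CREATE TABLE top_pages AS
    SELECT pv.page_url, COUNT(1) AS views
    FROM page_views pv
    JOIN users u ON (pv.user_id = u.user_id)
    WHERE u.country = 'US'
    GROUP BY pv.page_url
    ORDER BY views DESC;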
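The third sketch shows partitioning and bucketing, once more with hypothetical names:

    -- Partition by date and bucket by a hash of user_id.
    CREATE TABLE events (
      user_id BIGINT,
      action  STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    -- A filter on the partition column lets Hive read only the matching
    -- HDFS directories instead of scanning the whole table.
    SELECT action, COUNT(1) AS actions
    FROM events
    WHERE dt = '2011-09-19'
    GROUP BY action;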

With more and more Hadoop distributions appearing in the “wild”, it’s clear that this project isn’t going anywhere anytime soon. If anything, it will only gain momentum as more and more companies switch to Hadoop to handle their large data repositories. Hive is a relatively mature Hadoop sub-project that facilitates easy data analysis, ad-hoc queries, and manipulation of large datasets stored in Hadoop. These two are a “match made in heaven”!

If you enjoyed this article or if you have any other thoughts or questions on Hive and Hadoop, I encourage you to leave a comment below. I welcome your feedback and would be more than happy to respond!


About VenkataHari Shankar

VenkataHari Shankar is an experienced developer and architect with a rich background in Java, Cloud Computing, Amazon Web Services, and Hadoop. His latest project, Big-Data, uses Hadoop and Hive as its backbone. Venkat is in charge of all aspects of performance tuning and maintenance of Big-Data, which runs on a Hadoop cluster consisting of hundreds of nodes. He single-handedly handled the successful migration of the Big-Data project into the cloud (Amazon Web Services). VenkataHari holds a Bachelor's degree in Information Technology from Anna University, India. He regularly shares his project experience and knowledge on his Cloud Computing blog.

10 comments

  1. Vijayakumar says:

    Venkat,
    It's a nice article.
    Can we access Hadoop through ODBC drivers?

    Thanks and Regards
    Vijay

  2. Dear Venkat,

    This is a great article but I’m still confused about something: I want to develop a web analytics platform in order to create aggregated data about web traffic (page views, visits, visitors, etc).

    Is it an overkill or a “must” to use Hive?

    Can I do it only with pure Map/Reduce jobs?

  3. Bhavesh says:

    I have configured Hadoop successfully on Windows 7 through Cygwin. Now I want to retrieve and analyze data from a very large database.
    Is it possible with the help of Hadoop and Hive? How?

  4. Saurabh says:

    This is one of the better introductions on the web for this topic. I'm just a newcomer to this Big Data world, out of curiosity. We traditionally use Informatica and Teradata/Oracle for maintaining our DWH.
    1. Can you tell me if Hive will give me faster SQL results than a traditional Teradata/MPP RDBMS query? (Beyond what amount of data does performance increase?)

    2. It takes me 5 hours to bring a set of data (approx. 100 GB) from OLTP to staging to Teradata, so we need to wait at least 5 hours before the data is available for reports. Most times it is available the next day because batches run at night. How long would it take to load the same set of data into Hadoop from OLTP, and is it possible to load data continuously into Hadoop (making it an OLTP DB as well)?

  5. Saurabh says:

    It would be amazing if we could get reports on our data in real time, as things are happening in OLTP systems. Is there anything (or any combination of tools) in the Hadoop ecosystem which can help us achieve this?

  6. Nagarajan says:

    Good explanation of Hive, and it's easy for beginners to understand 🙂
