Intro to Hadoop

June 10th, 2011
Hadoop Overview

Hadoop. A seemingly nonsensical word that keeps getting thrown around whenever you’re in a meeting. What does it mean? What does it do? Let’s read on and find out!

What is Hadoop?

Hadoop is an Apache Software Foundation project for handling large data-processing jobs. It was originally conceived by Doug Cutting, the creator of Apache Lucene (who, incidentally, named it after his son's stuffed toy elephant). He was inspired to build it after hearing about Google's MapReduce and GFS projects, which were Google's way of handling very large amounts of data at once; Hadoop is an open-source solution to the same problem, and in fact is built around a very similar paradigm. It was designed as a framework for handling large amounts of data in distributed applications on standard (or not-so-standard) hardware. The biggest advantage of this design, obviously, is that it scales out very well; machines can be added seamlessly and easily to increase the amount of data the cluster can handle, enabling hundreds of gigabytes (or even terabytes and petabytes, potentially) to be processed and stored as efficiently as possible.

Hadoop Distributed File System (HDFS)

Hadoop, by its very nature, is designed to run on more than a single node. It can be run on a single node, but that is very inefficient because of the coordination overhead it carries, which assumes it is running on multiple machines. It runs on multiple nodes by implementing something called the "Hadoop Distributed File System", or HDFS, a file system whose contents are spread across the nodes of the cluster.

HDFS works by assigning roles to the nodes in the cluster. One node is designated the name node, and the rest become data nodes. A file is not stored on any one node; it is split into chunks of 64MB (the default HDFS chunk size) and distributed across multiple data nodes. Each chunk is replicated redundantly on three nodes, according to Hadoop's default replication setting.
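To make those defaults concrete, here is a minimal sketch (not from the original article) of writing a file to HDFS through the Java FileSystem API, with the chunk size and replication factor spelled out explicitly. The property names are the 1.x-era ones and the path is purely illustrative; in practice you would normally leave these at their defaults in the cluster configuration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Restating the defaults described above (1.x-era property names):
    conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64MB chunks
    conf.setInt("dfs.replication", 3);                 // three copies of each chunk

    // The FileSystem client asks the name node where chunks should go,
    // then streams the data directly to the data nodes.
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt")); // hypothetical path
    out.writeBytes("This file will be split into 64MB chunks, each stored on three data nodes.\n");
    out.close();
  }
}
```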

All the metadata for this system is handled by the name node, and it includes everything you'd expect it to: file names, permissions, modification times, and so on. The name node also keeps metadata unique to Hadoop that is crucial to its operation: it knows which data node holds which chunk, and across which nodes each chunk is replicated. It also notices when nodes fail and takes care of redistributing their chunks. Though this sounds like a great deal of information, it is actually relatively small, so all of it is kept in the name node's memory, enabling extremely fast access.

As a file system, HDFS is a write-once, read-many file system, which makes it exceedingly useful in terms of data throughput and distribution. The data nodes each process their own chunks in parallel, without routing the data through the name node for computation; Hadoop uses this model to cut down on network bandwidth. Moving the data back and forth between nodes would cause excessive overhead, and by avoiding it Hadoop achieves very high performance when processing large amounts of data.

MapReduce

A core component of how Hadoop works is MapReduce. HDFS may be the storage/memory of the Hadoop system, but MapReduce is its brain. MapReduce is a programming framework that lets you run computations across the large amounts of data on the nodes; essentially, it is the way you interact with HDFS to perform computational tasks. It's very simple at its core: you need only write two methods, "map" and "reduce", and then define a JobConf. The JobConf defines the essential wiring between your functions and HDFS, including the map and reduce classes, the input/output formats, and the input/output paths. While there are other parts to MapReduce, an application can be run with just these three components; the rest are optional added functionality.
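As a concrete illustration, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API from this era, wired up as a simple word-count job. The class names (WordCountDriver, WordCountMapper, WordCountReducer) and the HDFS paths are hypothetical placeholders; the mapper and reducer themselves are sketched in the sections below.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // The "map" and "reduce" classes (sketched later in this article).
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // The key/value types the job emits.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Input/output formats: how records are read from and written back to HDFS.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input/output paths on HDFS (hypothetical locations).
    FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

    JobClient.runJob(conf); // submit the job and wait for it to complete
  }
}
```

That is the whole JobConf: the map and reduce classes, the formats, and the paths; everything else falls back to Hadoop's defaults.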

A MapReduce job is coordinated by a JobTracker and executed by TaskTrackers. The JobTracker sends work to the available TaskTrackers in the cluster, and each TaskTracker monitors the tasks running on its own node. Because Hadoop spreads its data over potentially thousands of nodes, errors are expected; to counter this, the JobTracker keeps the work as close to the data as possible, and if an error is detected on one of the nodes, the JobTracker restarts the task on another data node that holds a replica of the same chunk.

A MapReduce job typically works like this: the InputFormat defines how to create key/value pairs from the data. Once a key/value pair is retrieved, the map function processes it and sends the resulting key/value pairs on toward the reduce function. As mentioned before, a node can be running multiple map and reduce tasks; the map functions run in parallel with one another on every data node that contains a chunk of that particular file, and the lack of communication within this parallelized structure means decreased overhead and speedier execution.
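Continuing the word-count sketch from above, a map function for it might look like the following; TextInputFormat hands each line to the mapper as a (byte offset, line) pair, and the mapper emits a (word, 1) pair for every word it finds.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  // Called once per input record, in parallel across the data nodes
  // that hold chunks of the input file.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE); // emit (word, 1) for every word on the line
    }
  }
}
```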

Once the map tasks have finished, the output keys are sorted on each mapper. The output of the mappers is stored in partitions, which become the input data to the reduce function. The Partitioner takes care of this process, making sure everything goes to its place: regardless of which mapper or node produced an intermediate pair, all pairs sharing the same key are sent to the same partition. Once this is done, the completed partitions are handed over to the reduce function.
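This "same key, same partition" guarantee is essentially what Hadoop's built-in HashPartitioner provides by default; the hand-rolled equivalent below, using the old Partitioner interface, is only a sketch to show where the guarantee comes from.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

  // Identical keys always hash to the same partition (and therefore the
  // same reducer), no matter which mapper or node produced them.
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public void configure(JobConf job) {
    // Nothing to configure in this sketch.
  }
}
```

A custom partitioner like this would be registered on the JobConf with setPartitionerClass(); for most jobs the default hashing behavior is exactly what you want.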

The reducer's job is simple: for each key in its partition, it calls the reduce function with that key and all the values linked to it, and the output of the task is then written back to the file system.
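Completing the word-count sketch, a reducer for the job wired up earlier could look like this: it receives each word together with an iterator over all of its 1s and writes the summed count back to the file system.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per key with every value that was linked to that key.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum)); // written to HDFS by the OutputFormat
  }
}
```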

Conclusion

Hadoop is not an easy technology to grasp right away, especially considering its rather cutting-edge nature. Despite this, however, it is an interesting technology whose basic fundamentals are well within reach; hopefully this introduction to Hadoop gave you the foundation you need to work with it, or to decide whether it can be used successfully in your production environment! If you are looking for a fast track to learning Hadoop, consider enrolling in one of our Hadoop training courses. You will really enjoy taking one!
