Hadoop is an open-source software platform by the Apache Foundation for building clusters of servers for use in distributed computing. Server clustering is really nothing new or revolutionary but Hadoop is designed specifically for mass-scale computing, which involves thousands of servers. Based on a paper originally written by Google about their MapReduce system, Hadoop leverages concepts from functional programming to solve large computing problems. Hadoop is an ideal solution for working with large volumes of data in a variety of applications from scientific to searching through web pages.
Leveraging the Power of Functional Programming
Functional programming is a style of programming that has its roots in lambda calculus. Functional programming is based largely on the idea of applying functions to a set of data. In this style of programming there is no state or mutable data. This makes functional programming a natural style of development for systems that analyze large sets of data. In functional programming, a function is applied to each member of a set of data and the result is the output. However, the data that was input remains unchanged. This is exactly what you need when you’re working through very large set of data to analyze or transform them in some way.
MapReduce: Splitting a Big Problem into Little Pieces
Google applied the ideas of functional programming to the problem of how to manipulate large amounts of data about the contents of pages on the world wide web. The system they developed was called MapReduce and was described in a paper published by Google.
MapReduce is what allows Google to search and index the huge volume of web pages that their bot spiders. Maps are an element of functional programming languages in which a function is applied to a set of data items. The number of parameters in the original input is immaterial as the function will be applied to each one.
Google’s system took it a step further though. Their MapReduce system consists of thousands of servers. The process of applying a function to each element of the input can all happen on one machine or it could be split over as many servers as needed. This allows Google to leverage the computing power of many less expensive machines to sort, analyze and transform large sets of data.
Hadoop: A Java Based Open Source Implementation of MapReduce
While Google’s system is proprietary, they did publish the concepts behind it. This lead to the creation of Hadoop. Hadoop is implemented in the Java programming language and contains a suite of tools for creating mass-scale computer clusters. Hadoop consists of a number of components including a distributed file system, a scalable database, and a programming framework for building code that uses the MapReduce concepts to solve problems based on large sets of data. The system allows one to build highly reliable and scalable architectures for distributed computing.
Hadoop: Who’s Using It and Types of Applications
While it is pretty easy to see that Hadoop is a powerful suite of tools, it may not be readily apparent as to why it might be useful or the kinds of tasks that would benefit from it. You can get some ideas by looking at who is using Hadoop and the kinds of tasks they are running on Hadoop.
Log File & Web Analytics
Large web sites generate tremendous amounts of data about their visitors. These log files can grow to be hundreds of gigabytes in size. Log files contain important information, however, about how users interact with a site, where they are coming from and can even help detect attacks and suspicious activity. Analyzing that data is a great job for Hadoop and that is exactly what companies like Facebook and Rackspace are using it for.
The amount of data that is analyzed to decide which ads should display on your favorite web site is staggering. Large advertising networks also have to collect data on millions of clicks and provide useful information to their clients about how users are responding to those ads. Analyzing this data to determine how to best serve ads is another ideal application for Hadoop. Advertising networks like Adknowledge use Hadoop to analyze the millions of clicks and determine which ads to display and when.
There are more applications that require analyzing large volumes of data beside web logs and advertising. There are a number of practical scientific applications for a system like Hadoop. Physics, biochemistry and genetics research all require analyzing each item in a large set of data. One company, Spadac.com, is using Hadoop to power geospatial processing.
The finance industry is another segment that generates large volumes of data. Hadoop can help solve financial problems by analyzing large sets of transactions or stock prices. It can be used to spot trends that might suggest fraud, find ways to improve the bottom line or discover trends in the stock market to help investors pick better investments. Pronux is a company that uses Hadoop to analyze transactions being posted by the bookkeeping department of large organizations. Another company uses it to do technical analysis of various stocks.
Of course, searching a large data set, like Google does, lead to the concepts that inspired Hadoop. Hadoop is a powerful tool for indexing large amounts of data and searching through that data. It is used for this exact purpose by BaiDu, China’s leading Internet search engine; Amazon, for searching the many products they carry; and by LinkedIn, for suggesting people you might know and fun facts.
Based on the idea that a large problem could be split into smaller pieces and tackled by many computers, Hadoop provides an open source, scalable system to build clusters of thousands of servers. It is primarily designed to apply the concepts of functional programming to the analysis of large volumes of data. As such, Hadoop powers numerous systems in the web, search, finance, and scientific market segments.
Help us spread the word!
If you liked this article, consider enrolling in one of these related courses:
|Jan 16-18||Hadoop Administration|
|- Classroom - Online|