Just for posterity,
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Since you have arrived at this page, I’ll assume that you have some idea of what Hadoop is and what it is used for. This tutorial will walk you through an installation of Hadoop on your workstation so you can begin exploring some of its powerful features.
Hadoop has traditionally been a royal pain to set up and configure properly. With Cloudera’s recent distribution releases, this process has gotten simpler, but it is still a far cry from straightforward. We’ll try to, if not simplify it, at least document it thoroughly so you can follow clear, step-by-step instructions to get your first Hadoop cluster up and running locally. Let’s dive in!
This tutorial requires the following two hefty installers downloaded to your workstation:
- Oracle VirtualBox, in order to run Virtual Machine Images (VMs) on your machine. Here is the link to the VirtualBox download page:
- An Ubuntu 10 Image that will house our Hadoop installation. You can grab one from here:
NOTE: as of this writing, Cloudera’s Hadoop distribution was not compatible with Ubuntu 11. Just pick the version 10.04 LTS from the downloads drop-down menu to avoid any issues with your installation.
Install Oracle VirtualBox
- Download the VirtualBox installation package for your operating system (Windows or Mac OS X recommended).
- Close all applications and run the installation package, following the on-screen instructions.
NOTE: The current tested version is 4.0.8 (08/05/2011).
Install Ubuntu 10 Image
- Download Ubuntu OS Version 10.04 LTS.
- Start VirtualBox from your application menu.
- Click the New button to create a new virtual machine, then click Continue.
- Provide a name for your VM and select Linux and Ubuntu in OS options
- Keep the rest of the settings as defaults and continue with instructions
- Once the VM has been created, start it by selecting it in the left pane and clicking the Start button.
- Select the downloaded Ubuntu installation package as the installation media.
- Proceed with default settings during the installation.
NOTE: the user hadoop is reserved and should not be selected as your user.
- Restart your VM OS after the installation has been completed. You should see the following screen:
Install Java JDK and Hadoop
- Open a new terminal by going to Applications => Accessories => Terminal.
- Check the release codename of your Ubuntu installation by running the following command:
lsb_release -c
The expected output should show the codename lucid.
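If you plan to script the rest of the installation, a small guard can stop early on an unsupported release. The following is only a minimal sketch and not part of the original steps: the function name is_supported_codename is hypothetical, and it assumes lsb_release accepts -s to print the bare codename.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the Ubuntu codename this tutorial targets.
is_supported_codename() {
    [ "$1" = "lucid" ]
}

# Only query the system when lsb_release is actually installed.
if command -v lsb_release >/dev/null 2>&1; then
    codename=$(lsb_release -cs)   # -s prints just the codename
    if is_supported_codename "$codename"; then
        echo "Ubuntu $codename: OK for this tutorial"
    else
        echo "Ubuntu $codename: this tutorial was written for lucid (10.04 LTS)"
    fi
fi
```

On any other release the script only prints a warning; it does not change anything on the system.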
- Inside the Terminal, create an empty file /etc/apt/sources.list.d/cloudera.list by running the following command:
sudo nano /etc/apt/sources.list.d/cloudera.list
NOTE: We used nano as the editor here, but obviously, you are free to use the editor you’re most comfortable with (Vi, gEdit, Emacs, whatever).
- Paste the following two lines into the file, save it by pressing Ctrl-O, and exit with Ctrl-X:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
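If you would rather not use an interactive editor, the same file can be written from the command line. This is a minimal sketch, assuming the two repository entries given in this step. The function name cloudera_sources is hypothetical, and it only prints the lines, so nothing is written until you pipe it into sudo tee yourself.

```shell
#!/bin/sh
# Hypothetical helper: print the two APT source lines for Cloudera's CDH3 repository.
cloudera_sources() {
    printf '%s\n' \
        'deb http://archive.cloudera.com/debian lucid-cdh3 contrib' \
        'deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib'
}

# Preview what would be written.
cloudera_sources

# To actually create the file (requires root), uncomment:
# cloudera_sources | sudo tee /etc/apt/sources.list.d/cloudera.list >/dev/null
```

Keeping the write step commented out lets you inspect the output first, which is handy when adapting the lines for a different release.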
- Run the following commands in the terminal window:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo apt-get install hadoop-0.20
- Install Hadoop components:
sudo apt-get install hadoop-0.20-namenode
sudo apt-get install hadoop-0.20-datanode
sudo apt-get install hadoop-0.20-jobtracker
sudo apt-get install hadoop-0.20-tasktracker
- Install the configuration for a pseudo-distributed cluster:
sudo apt-get install hadoop-0.20-conf-pseudo
- Start services by running the following command in the terminal window:
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
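To confirm from the terminal that the daemons actually started, you can look at the running Java processes. This is a minimal sketch, assuming the JDK's jps tool is on your path and that each daemon is listed under its class name (NameNode, DataNode, JobTracker, TaskTracker). The function name missing_daemons is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `jps` output on stdin and print every expected
# daemon that does not appear in it. Prints nothing when all are present.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode JobTracker TaskTracker; do
        # Compare the second column exactly so that "SecondaryNameNode"
        # is not mistaken for "NameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# The daemons may run under dedicated service accounts (an assumption here),
# so jps may need sudo to see them. Uncomment to check a live system:
# sudo jps | missing_daemons
```

Empty output means all four daemons were found. Any name that is printed is a daemon worth investigating in its log file.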
- Check your installation by opening the following links in your internet browser:
You should see the following screens in your browser:
Hadoop NameNode Administration
Hadoop Map/Reduce Administration
If your screens look similar to mine, congratulations, you now have a Hadoop cluster running locally!
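If you prefer to check the two web interfaces without a browser, something like the following may help. It is a sketch under stated assumptions: it assumes the stock Hadoop 0.20 ports (50070 for the NameNode page, 50030 for the Map/Reduce page) and that curl is installed. The function names hadoop_ui_urls and ui_is_up are hypothetical.

```shell
#!/bin/sh
# Assumed defaults for Hadoop 0.20: NameNode UI on 50070, JobTracker UI on 50030.
# Adjust these if your configuration uses different ports.
hadoop_ui_urls() {
    host=${1:-localhost}
    echo "http://$host:50070/"
    echo "http://$host:50030/"
}

# Hypothetical helper: succeed if the URL answers an HTTP request within 5 seconds.
ui_is_up() {
    curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Uncomment to probe a running cluster (requires curl):
# for url in $(hadoop_ui_urls); do
#     if ui_is_up "$url"; then echo "UP   $url"; else echo "DOWN $url"; fi
# done
```

A DOWN result right after starting the services is not necessarily a failure, since the daemons can take a little while to begin listening.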
We hope this tutorial has made some sense of the esoteric Hadoop documentation found on the project homepage at the Apache Software Foundation. Use this simple installation as your Hadoop playground before moving on to bigger and better Hadoop “elephants”. Hadoop’s real power lies in its ability to scale well across multiple machines and use distributed hardware to provide I/O speeds that far exceed traditional methods. Your next step should be to try a multi-node configuration, which builds on the same installation steps used in this tutorial. Good luck and happy Hadooping!