Hadoop Installation Tutorial

August 5th, 2011

Just for posterity,

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Since you arrived at this page, I’ll assume that you have some idea of what Hadoop is and what it is used for. This tutorial will walk you through an installation of Hadoop on your workstation so you can begin exploring some of its powerful features.

Hadoop has traditionally been a royal pain to set up and configure properly. With Cloudera’s recent distribution releases, this process has gotten simpler, but it is still a far cry from straightforward. We’ll try to, if not simplify it, at least document it thoroughly so you can follow clear, step-by-step instructions to get your first Hadoop cluster up and running locally. Let’s dive in!

Prerequisites

This tutorial requires the following two hefty installers downloaded to your workstation:

  1. Oracle VirtualBox, in order to run virtual machine (VM) images on your machine. Here is the link to the VirtualBox download page:
  2. An Ubuntu 10 image that will house our Hadoop installation. You can grab one from here: NOTE: as of this writing, Cloudera’s Hadoop distribution is not compatible with Ubuntu 11. Pick version 10.04 LTS from the downloads drop-down menu to avoid any issues with your installation.

Install VirtualBox

  1. Download the installation package for your operating system (Windows or Mac OS X recommended).
  2. Close all applications and run the installation package, following the on-screen instructions.
    NOTE: The current tested version is 4.0.8 (08/05/2011).

Install Ubuntu 10 Image

  1. Download Ubuntu OS version 10.04 LTS.
  2. Start VirtualBox from your application menu:
Oracle VirtualBox Welcome Screen
  3. Click the New button to create a new virtual machine, then click Continue.
  4. Provide a name for your VM and select Linux and Ubuntu in the OS options.
VirtualBox VM Name and OS Type
  5. Keep the rest of the settings at their defaults and continue with the instructions.
  6. Once the VM has been created, start it by selecting it in the left pane and clicking the Start button.
Oracle VM VirtualBox Manager
  7. Select the downloaded Ubuntu installation package as the installation media.
VirtualBox Select Installation Media
  8. Proceed with the default settings during the installation.
    NOTE: the user name hadoop is reserved and should not be chosen as your user.
  9. Restart the VM after the installation has completed. You should see the following screen:
Ubuntu Fresh Install

Install Java JDK and Hadoop

  1. Open a new terminal by going to Applications => Accessories => Terminal.
Ubuntu Terminal Start
  1. Check the Ubuntu release codename by running the following command:
    lsb_release -c
    
    The output should report the codename lucid.
Hadoop Release Version
  1. Inside the Terminal, create an empty file /etc/apt/sources.list.d/cloudera.list by running the following command:
    sudo nano /etc/apt/sources.list.d/cloudera.list
    
    NOTE: We used nano as the editor here, but you are free to use whichever editor you’re most comfortable with (Vi, gEdit, Emacs, and so on).
  1. Paste the following two lines into the file, save it with Ctrl+O, and exit with Ctrl+X:
    deb http://archive.cloudera.com/debian lucid-cdh3 contrib
    deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
    
  1. Run the following commands in the terminal window:
    sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
    sudo apt-get update
    sudo apt-get install sun-java6-jdk
    sudo apt-get install hadoop-0.20
    
  1. Install Hadoop components:
    sudo apt-get install hadoop-0.20-namenode
    sudo apt-get install hadoop-0.20-datanode
    sudo apt-get install hadoop-0.20-jobtracker
    sudo apt-get install hadoop-0.20-tasktracker
    
  1. Install the configuration for a pseudo-distributed cluster:
    sudo apt-get install hadoop-0.20-conf-pseudo
    
  1. Start services by running the following command in the terminal window:
    for x in /etc/init.d/hadoop-* ; do sudo $x start; done
    
  1. Check your installation by opening the following links in your web browser:

http://localhost:50070


http://localhost:50030
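
If you expect to rebuild the VM more than once, the repository and package steps above can be scripted. The following is only a minimal sketch under a few assumptions, not a tested installer: the function names require_lucid, cloudera_sources and install_hadoop are hypothetical helpers made up for this example, tee stands in for the manual nano edit, and the key import line is taken from a reader's report in the comments (it needs curl and may not be required on every setup).

```shell
#!/bin/sh
# Sketch: non-interactive version of the repository and package steps above.
# All function names here are hypothetical helpers, not part of Hadoop or CDH.

# Succeeds only when the given codename is the one this tutorial targets.
require_lucid() {
    [ "$1" = "lucid" ]
}

# Prints the two Cloudera repository lines for the given Ubuntu codename.
cloudera_sources() {
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$1"
    printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$1"
}

# Runs the installation steps; needs sudo rights and network access.
install_hadoop() {
    codename="$(lsb_release -cs)"
    if ! require_lucid "$codename"; then
        echo "Expected lucid, found $codename; stopping." >&2
        return 1
    fi

    # Write the repository file instead of editing it by hand.
    cloudera_sources "$codename" |
        sudo tee /etc/apt/sources.list.d/cloudera.list > /dev/null

    # Reader-reported extra step: import the Cloudera archive key.
    curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

    # Same package commands as in the steps above.
    sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
    sudo apt-get update
    sudo apt-get install sun-java6-jdk
    sudo apt-get install hadoop-0.20 \
        hadoop-0.20-namenode hadoop-0.20-datanode \
        hadoop-0.20-jobtracker hadoop-0.20-tasktracker \
        hadoop-0.20-conf-pseudo
}
```

Nothing runs until you call install_hadoop yourself. Starting the services afterwards is left to the loop shown in the steps above, so you can inspect the installed packages first.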

Verifying Installation

You should see the following screens in your browser:

Hadoop NameNode Administration

Hadoop Map/Reduce Administration

If your screens look similar to mine, congratulations, you now have a Hadoop cluster running locally!
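
If one of the pages does not load, it can help to check from the terminal which daemons are actually running. The JDK ships a tool called jps that lists running Java processes. The helper below is a rough sketch, not part of the original walkthrough: missing_daemons is a hypothetical function name, and it assumes the four daemons installed earlier appear in the jps output under the class names NameNode, DataNode, JobTracker and TaskTracker.

```shell
#!/bin/sh
# Sketch: report which of the four Hadoop daemons are absent from jps output.
# missing_daemons is a hypothetical helper; it reads "<pid> <ClassName>" lines
# from standard input and prints the name of each daemon it cannot find.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode JobTracker TaskTracker; do
        # Compare the second field exactly, so SecondaryNameNode is not
        # mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Assumed usage (sudo because the daemons may run under other accounts):
#   sudo jps | missing_daemons
```

An empty result means all four daemons were found. If a name is printed, restarting that service through its script in /etc/init.d is a reasonable first thing to try.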

Conclusion

We hope this tutorial has made some sense of the esoteric Hadoop documentation found on the project homepage at the Apache Software Foundation. Use this simple installation as your Hadoop playground before moving on to bigger and better Hadoop “elephants”. Hadoop’s real power lies in its ability to scale across multiple machines and use distributed hardware to provide I/O throughput well beyond what a single machine can deliver. Your next step should be to try a multi-node configuration, which starts from the same package installation on each machine but needs its own cluster configuration in place of the pseudo-distributed one. Good luck and happy Hadooping!


2 comments

  1. Ahmed Kamal says:

    Hey thanks for this guide, however I used an Ubuntu cloud technology called Ensemble to setup hadoop in one minute. You can check it out here
    http://cloud.ubuntu.com/2011/08/ensemble-meets-hadoop-on-the-cloud/

    More technical details at:
    http://cloud.ubuntu.com/2011/08/hadoop-cluster-with-ubuntu-server-and-ensemble/

    Thanks

  2. Rio says:

    Very helpful. Worked well for me.

    There was no natty repo, had to use maverick.

    And had to add the key..

    curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
