Running Hadoop: Your First Single-Node Cluster

July 22nd, 2011

Hadoop can be a very powerful resource if used correctly; it’s a system designed to work with vast amounts of data while taking full advantage of the hardware it runs on, low-end or not. It can be a bit difficult to set up, however, and as a result many people never take advantage of it. Let’s take a look at how to set up your first single-node cluster to give you a sense of how Hadoop can help your business! This tutorial uses the Apache distribution, but you can just as easily follow along with Cloudera’s.

NOTE: This tutorial assumes you’re running Linux; though there are steps available for running Hadoop on Windows platforms, the vast majority of servers running it are Linux and thus this tutorial caters to that set. Onward!

Step 1: Install the Java JDK

The first step is to install the Java JDK. Though there are open-source alternatives to the Sun/Oracle JDK, it’s best to use the official one to avoid any problems caused by subtle differences between it and the open-source implementations. If you’re running an RPM- or APT-based system, you can probably grab the official binaries from your repositories. If not, head over to:

http://www.oracle.com/technetwork/java/index.html

and grab a binary for your distribution and install it.
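
If you do go the repository route on an APT-based system, the install might look something like the following; the package name here is an assumption (it varies by distribution and by which repositories you have enabled), so check what your package manager actually offers. Either way, java -version is a quick sanity check that the JDK landed:

# Package name is distribution-dependent -- adjust to match your repositories.
sudo apt-get install sun-java6-jdk

# Confirm the JDK is installed and on the PATH.
java -version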

Step 2: Create a Hadoop user

Though this isn’t strictly necessary, it’s always a good idea to create a dedicated user to run everything Hadoop-related; it makes permissions, security, and so on a whole lot easier in the long run:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop

Step 3: Give said hadoop user SSH permissions

Hadoop interacts with its nodes via SSH, so you’ll need to set up key-based authentication for the hadoop user on this machine (if you don’t already have an SSH server set up, you’ll have to do that before following these steps):

su - hadoop
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

For those of you curious about these commands: the first switches to the hadoop user in the shell, and the second generates an RSA key pair so the hadoop user can log in without a password. You’ll notice that we left the passphrase blank; this is so we don’t have to enter it every time Hadoop tries to connect. Finally, the last command appends the public key to the authorized_keys file so that the system accepts hadoop’s login when it tries to SSH in! You can test the login from the hadoop user’s prompt by trying to ssh to localhost:

ssh localhost

It should let you in without a fuss. If it doesn’t, make sure PubkeyAuthentication is set to yes in /etc/ssh/sshd_config, and that, if AllowUsers is active, the hadoop user is on the list!

Step 4: Installing Hadoop

You can grab Hadoop binaries from the Apache website:

http://www.apache.org/dyn/closer.cgi/hadoop/core/

Once you’ve got the tarball, extract it wherever you like (most people put it in /usr/hadoop or /usr/local/hadoop). Once extracted, make sure to change the permissions on that folder so the hadoop user has read/write/execute access, as in the sketch below.
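
As a rough sketch, assuming you downloaded hadoop-0.20.2.tar.gz and want it under /usr/local/hadoop (substitute your actual version and target directory):

# Unpack the release and move it into place.
tar xzf hadoop-0.20.2.tar.gz
sudo mv hadoop-0.20.2 /usr/local/hadoop

# Hand ownership (and therefore read/write/execute) to the hadoop user.
sudo chown -R hadoop:hadoop /usr/local/hadoop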

In the Hadoop user’s home directory, edit the $HOME/.bashrc to include this:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

We’re assuming you use bash in this tutorial; if not, just edit whichever shell config file you do use!
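
To pick up the new variables in your current session without logging out and back in, re-read the file:

source $HOME/.bashrc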

Step 5: Configuring Hadoop

First, open your $HADOOP_HOME/conf/hadoop-env.sh and change the following line:

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

Uncomment that line and point JAVA_HOME at wherever your Java SDK actually lives.
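
For example, with the Sun JDK installed under /usr/lib/jvm (that path is just an assumption; use whatever location your JDK actually lives in), the uncommented line might read:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun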

Hadoop also needs somewhere on disk to store its working files; you’ll point a configuration variable at that location in just a bit. For now, let’s use /var/hadoop/tmp as our directory. Create it and give the hadoop user read and write permissions on it.
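
One way to do that, assuming the hadoop user and group created in Step 2:

# Create the working directory and hand it to the hadoop user.
sudo mkdir -p /var/hadoop/tmp
sudo chown hadoop:hadoop /var/hadoop/tmp
sudo chmod 750 /var/hadoop/tmp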

Now that we’ve done that, open up $HADOOP_HOME/conf/core-site.xml and add (or update) the hadoop.tmp.dir property between the <configuration> tags:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/hadoop/tmp</value>
</property>

Set the value to the directory we just created; this is where Hadoop is going to store all its temporary and job data. The rest of the configuration can be left at its defaults. If you’d like to know more about how Hadoop’s configuration works, definitely check out the Hadoop documentation on Apache’s website; you’ll need it when setting up a multi-node cluster, for example.

Step 6: Starting Your Node Up

First you’ll have to format the namenode. To do this, change to $HADOOP_HOME/bin and run this command (or, since that directory is now on the hadoop user’s PATH, drop the ./ and run it from anywhere):

./hadoop namenode -format

This initializes the HDFS filesystem. It can take a moment, so be a little patient. Once it’s done, you’re ready to fire up your Hadoop node with the following command:

./start-all.sh
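
To confirm the daemons actually came up, the JDK’s jps tool lists the running Java processes; on a single-node setup from the 0.20/1.x line you’d expect to see a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:

jps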

And that’s it! Your Hadoop cluster is officially up and running. You can now run local jobs on it; in fact, the Hadoop install comes with a few test jobs that you can run on sample sets if you so desire.
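
For instance, the examples jar bundled with the release includes a small pi estimator, which makes for a quick smoke test. The jar’s exact file name differs between releases, so adjust the wildcard or spell out the name from your install:

# Run the bundled pi example with 2 map tasks and 10 samples per map.
hadoop jar $HADOOP_HOME/hadoop-*examples*.jar pi 2 10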

Conclusion

And so there it is: your first single-node Hadoop cluster install. Of course, you’re not going to do much with a single-node cluster; Hadoop’s power lies in its ability to scale well across multiple machines and take advantage of distributed hardware to deliver I/O throughput far beyond what traditional setups can manage. A single-node install, however, is the stepping stone to a multi-node configuration, and the time spent on it gives you the familiarity with Hadoop you’ll need to move on to bigger, more powerful multi-node clusters!
