Hadoop, for those of you not in the know, is a scalable framework for running data-intensive distributed applications. If you’re reading this article, though, you probably already know that: you’re intrigued by Hadoop’s performance potential and its proclaimed ability to run on commodity hardware. In this article, we’ll talk a bit about the hardware required to run Hadoop and which configuration gives you the most bang for your buck!
First things first: don’t think you can just run out there, grab hundreds of bargain-bin PCs, and call your job done. Hadoop’s big draw is that it doesn’t need beefy servers costing tens of thousands of dollars; it can run on much less powerful hardware, letting you buy more computers to act as nodes and truly take advantage of Hadoop’s distributed power. That said, you’re still looking at servers in the range of $2,000–3,000 a pop; Hadoop can be memory intensive, and you’re going to need drive space to store all that Big Data you plan on using.
This is not to say, however, that you can’t supplement a Hadoop cluster with some older machines that aren’t being put to good use. While some Hadoop best-practice guides and administrators out there strongly discourage the use of older machines in Hadoop clusters, plenty of clusters incorporate them into the setup without any problem. If you have a few 4GB servers that are being taken out of production, there is nothing wrong with repurposing them into a Hadoop cluster; Hadoop will take advantage of them, and they’ll be worth more to you than they would be sitting in a dusty closet.
Where the caution against old machines comes in, however, is at the initial buying phase. When faced with the choice between 25 average machines with the specs listed below and 50 cheaper ones at roughly the same total price, the decision isn’t obvious: Hadoop claims to use (and does use) older hardware very well, and the cost/performance ratio will initially be very close between the two types of setups.
Where the 25-box setup wins out is in real cost and administrative overhead; quite simply, 25 boxes are easier to take care of from an administrative point of view, and since there are fewer parts, there will be fewer headaches with failing parts and maintenance. At the other extreme, having just five beefy servers won’t really use Hadoop’s distributed model to its fullest, and you’ll actually see worse cost/performance than you would with more average-sized boxes.
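To make the trade-off concrete, here is a small back-of-the-envelope sketch. All of the prices, per-node performance figures, and per-node administrative costs below are purely hypothetical assumptions for illustration, not benchmarks; the point is only that once you fold in a fixed admin cost per box, the smaller cluster can come out ahead even when raw hardware spend and raw throughput are identical:

```python
# Hypothetical cost/performance comparison of two cluster layouts.
# Every number here is an illustrative assumption, not a benchmark.

def cluster_score(nodes, price_per_node, perf_per_node, admin_cost_per_node):
    """Return (total_cost, total_perf, perf_per_dollar) for a layout."""
    total_cost = nodes * (price_per_node + admin_cost_per_node)
    total_perf = nodes * perf_per_node  # arbitrary throughput units
    return total_cost, total_perf, total_perf / total_cost

# 25 mid-range boxes vs. 50 bargain boxes at the same total hardware outlay,
# with the same assumed admin cost per box either way.
mid = cluster_score(nodes=25, price_per_node=2500,
                    perf_per_node=2.0, admin_cost_per_node=400)
cheap = cluster_score(nodes=50, price_per_node=1250,
                      perf_per_node=1.0, admin_cost_per_node=400)

print("25 mid-range boxes:", mid)
print("50 bargain boxes:  ", cheap)
```

Under these assumptions both layouts deliver the same raw throughput, but the 50-box cluster carries twice the per-node administrative overhead, so its performance per dollar ends up lower.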
What, then, is the best commodity hardware to run Hadoop on? What will give you the best cost / performance ratio as well as ease administrative support and maintenance costs? A solid, generic Hadoop setup should look something like this:
- 2 dual-core CPUs
- 8–12GB RAM
- 2 × 250GB SATA drives
These are not your run-of-the-mill desktop PC specs, but they are substantially lower than those of most high-end server machines and can be had for $2,000–2,500 per machine. These are the sorts of servers you should be aiming for when building a Hadoop cluster; they will offer you the best cost/performance ratio, both over time and in terms of real cost.
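As a rough illustration of how a node with those specs might be tuned, here is a classic (Hadoop 1.x) `mapred-site.xml` fragment. The property names are from the standard 1.x MapReduce configuration, but the values are assumptions sized to the hardware above — treat them as a starting point to adjust for your workload, not official tuning guidance:

```xml
<!-- mapred-site.xml: illustrative slot/heap sizing for a
     2x dual-core, 8-12GB RAM node (values are assumptions) -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value> <!-- roughly one map slot per core, leaving headroom for daemons -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value> <!-- ~1GB per task JVM keeps 5 slots comfortably under 8GB -->
  </property>
</configuration>
```

The two SATA drives would similarly both be listed in `dfs.data.dir` in `hdfs-site.xml`, so the DataNode spreads its blocks across both spindles.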
And so there you have it: the best type of hardware to run Hadoop on to get the best cost/performance ratio out of your servers. Good luck with your Hadoop clusters, and happy Hadooping!