Hadoop has become something of a buzzword in recent months; everyone and their boss is recommending it for everything, books are being written about it left and right, and it is hard to say anything about Hadoop without running into someone who holds a different opinion on that very subject. Hadoop is regarded as the new way of doing things, and many corporations and enterprise IT departments are researching just how to fit Hadoop into their infrastructure.
With all this buzz about Hadoop, you may be tempted to run off and implement it in your own organization. After all, you’ve heard wonders about its ability to handle large data sets and run on commodity hardware: it is a tantalizing fruit on the cutting-edge IT tree, and it would seem like a no-brainer to have your corporation jump on the bandwagon.
As it turns out, however, you may want to wait: depending on your setup, what you have running, and what you want Hadoop to do, it may very well be the case that you don’t want to include Hadoop in your system architecture and that you’ll actually be better off without it.
So how do you know if Hadoop is for you?
Well, first off: Hadoop is very effective at using multiple commodity machines to work on very large data sets. The idea behind Hadoop is that it scales well: the more commodity hardware you throw in, the more power you’ll have to sift through all that data. If you have terabytes of data, then a multiple-node Hadoop cluster is going to be able to aggregate that data very quickly.
This makes it extraordinarily effective at processing huge amounts of data quickly. As an example of a large Internet company that uses Hadoop, let’s take Twitter. Twitter handles an enormous volume of data: its users generate 150 million tweets a day, give or take. While each tweet is small, the data adds up: in total, Twitter takes in almost 12 terabytes per day!
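A quick back-of-the-envelope check in Python shows what the two figures quoted above imply per tweet. The only assumption here is that a terabyte means 10^12 bytes. The result suggests that the daily total covers far more than the short text of each tweet, though the figures alone don't say what else is included.

```python
TWEETS_PER_DAY = 150_000_000   # figure quoted above
BYTES_PER_DAY = 12 * 10**12    # 12 TB, assuming decimal terabytes

# Average amount of data attributable to each tweet.
bytes_per_tweet = BYTES_PER_DAY / TWEETS_PER_DAY

print(f"{bytes_per_tweet / 1000:.0f} KB of data per tweet on average")
```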
Most likely, those of you reading aren’t handling nearly that much data every day. This is, however, a perfect case of where Hadoop shines: processing that much data effectively and efficiently requires the kind of mass scaling that Hadoop was designed to do. Using MapReduce and Pig, Twitter’s engineers are able to analyze Twitter’s ecosystem and enable functions like People Search that need to process all that data quickly and conveniently.
This makes a strong case for Hadoop. Given Twitter’s success with it, and Twitter’s status as a successfully run business from an IT standpoint, it may seem only natural to start looking into ways to use Hadoop as an alternative to your current database setup. For example, let’s say you are a small business running MySQL; you’ve got a database with about 5 million rows or so. Would it be more efficient to keep your database on the one machine using MySQL, or to switch to Hadoop and use 2 commodity servers to run the same process?
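To make the MapReduce model mentioned above a little more concrete, here is a minimal single-process sketch in Python of its map, shuffle, and reduce phases, counting hashtags in a handful of made-up tweets. The sample data and the function names (map_phase, shuffle, reduce_phase, count_hashtags) are illustrative assumptions, not Hadoop’s API or Twitter’s code; on a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical sample input; a real job would read large files from a cluster.
TWEETS = [
    "learning #hadoop today",
    "#hadoop or #mysql for small data?",
    "sticking with #mysql",
]

def map_phase(tweet):
    """Emit a (hashtag, 1) pair for every hashtag in one tweet."""
    for word in tweet.split():
        if word.startswith("#"):
            yield word.strip("?!.,").lower(), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def count_hashtags(tweets):
    """Run map, shuffle, and reduce in sequence on a list of tweets."""
    pairs = (pair for tweet in tweets for pair in map_phase(tweet))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(count_hashtags(TWEETS))
```

Because each tweet is mapped independently and each key is reduced independently, the work can be split across as many machines as are available, which is where the scaling comes from.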
As it turns out, the Hadoop developers have had this very same thought and have run some benchmarks on this particular use case: they ran a query on a MySQL database with that same setup against a Hadoop cluster with two nodes.
The results were quite dramatic: the MySQL single-server setup obliterated the Hadoop 2-node cluster. In a query against almost 6 million rows, the MySQL database took just 4.43 seconds to complete the query. The Hadoop cluster, on the other hand, took 172.30 seconds. That’s about 39 times slower!
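Dividing the two timings quoted above gives the slowdown factor directly. This short Python snippet uses only those two numbers; the names are chosen here for illustration.

```python
MYSQL_SECONDS = 4.43      # single-server MySQL time quoted above
HADOOP_SECONDS = 172.30   # two-node Hadoop time quoted above

def slowdown(slow_seconds, fast_seconds):
    """Return how many times longer the slower run took."""
    return slow_seconds / fast_seconds

ratio = slowdown(HADOOP_SECONDS, MYSQL_SECONDS)
print(f"Hadoop took about {ratio:.1f}x as long as MySQL")
```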
So what makes Hadoop so slow in this case? Why is it lightning fast when processing huge amounts of data but slower than an ordinary RDBMS when processing smaller amounts of data?
The reason for this is that relational databases, like MySQL, are built to handle this sort of operation: they use indexes and query optimizers to find the rows a query needs without reading the whole table. Hadoop, on the other hand, has no such thing: a MapReduce job reads its entire input and pays a fixed cost to start up and coordinate tasks across machines, and that overhead only pays off when there is a vast amount of data to process on many machines at a time. In short, Hadoop is built with scaling to meet demand as its primary function; a relational database is not. Hadoop’s power only really shines with many commodity machines and lots of data to process.
The above should make it quite clear that Hadoop isn’t a replacement for relational databases: far from it, in fact. Relational databases still have their place in an enterprise or small business network, and often they perform quite admirably at their function. Hadoop’s use is very specialized, and often it doesn’t belong in a smaller network. If your queries are running fine and there are no bottlenecks to speak of, most likely there’s no reason to move to Hadoop. You’ll be better off with your current database simply because there are no scaling issues to be dealt with and there would be no benefits (and indeed, many drawbacks and hassles) to implementing Hadoop in an environment that simply does not need it at the moment.
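The effect of an index can be seen in a small sketch. The example below uses Python’s built-in sqlite3 module as a stand-in for MySQL (an assumption made so the snippet is self-contained; this is not the benchmark described above), with a made-up orders table. It asks the database for its query plan before and after an index is created.

```python
import sqlite3

def query_plan(conn, sql):
    """Return SQLite's plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the description

# Hypothetical table and data, built in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    ((i % 1000, i * 0.5) for i in range(100_000)),
)

QUERY = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index the database has to scan every row.
plan_without_index = query_plan(conn, QUERY)

# With an index it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, QUERY)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plans varies between SQLite versions, but the first should describe a scan of the table and the second a search using idx_orders_customer. A plain MapReduce job has no equivalent shortcut and always takes the first route.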
So there you have it! Hadoop is, by all measures and standards, an amazing tool: using multiple commodity machines, it can process terabytes of data per day, and it scales to accommodate rapidly growing infrastructures that would otherwise be untenable. It’s powerful and robust, and should absolutely be used where the volume of data calls for it.
It is, however, a tool, and a specialized one at that. It is important to use it only in places where it would really help. You must think long and hard about your enterprise or small business network before migrating to Hadoop; chances are, your network doesn’t handle enough data to warrant the change to Hadoop, and you’d be better off sticking to whatever database back-end is currently working for you.
Hopefully this article has helped you understand where and why to use (and not use) Hadoop; check out your network infrastructure today and see if Hadoop is right for you! If you think that it is, check out our Hadoop training course offerings. Our hands-on, lab-intensive Hadoop courses will get you up-to-speed on Hadoop in no time!