Hadoop is a fairly new technology, but some are already predicting its downfall. Recent rumblings on the Internet claim that Hadoop and HDFS will fade away in the near future, even though they have only just become a staple of enterprise development and deployment. The reason the pundits give is the rising trend of real-time data processing, something Hadoop doesn’t do well at all; as a result, they argue, more and more enterprises will turn away from Hadoop and move to real-time data processing platforms instead.
That there is a trend towards real-time data processing is not in doubt: recent moves by major tech players show that it is becoming more and more important to them. Google, for example, has been experimenting with new real-time processing technologies like Percolator and Dremel. It is also true that Hadoop and HDFS are spectacularly unsuited to this task: if you’re looking to get real-time data processing done, you’re not going to get it done well with Hadoop’s disk-based, batch-oriented processing.
That Hadoop’s disk-based approach isn’t great for real-time data processing, however, cuts to the heart of this discussion: Hadoop, and by extension HDFS, was never designed for real-time data processing. What it was designed for is storing and accessing data in batches, almost like a data warehouse, which it does exceedingly well. It will never be good at real-time processing, which is why tech giants like Google and other companies have been looking for something that is.
That anyone would predict Hadoop’s demise based on this distinction is quite surprising to me, as Hadoop has clear uses separate from real-time processing systems. In-memory indexing and real-time processing systems, like GridGain or Percolator, simply do not have the capacity for petabyte- or exabyte-scale storage and processing that Hadoop does. Conversely, Hadoop is terrible at real-time data processing and should not be used for it. The evidence that the two types of data processing are different lies in how they are deployed: Hadoop and GridGain, for example, work near-seamlessly side by side to accomplish different enterprise-level goals, namely storage and processing.
Like the LAMP stack before it, Hadoop is now being unfairly subjected to rumors of an early demise solely because it’s specialized. I would use a LAMP stack instead of Hadoop for a small-to-medium-sized website, just as I would not use a LAMP stack for a large scalable deployment. Each of these tools is meant for a specific purpose, and Hadoop, predictably, performs poorly when operating outside of what it was meant to do. Until in-memory storage hits the point where it can hold as much as disk storage, I don’t see Hadoop / HDFS going away any time soon: it will stick around as a data warehouse while the real-time processing is done separately, with different tools more suited to the job!