Hadoop is a great idea for a framework, and it’s been one of a few game-changers in the open source world in the past few years. It’s designed to distribute the processing of large datasets across a cluster of machines so that each dataset can be processed in parallel. The fact that it’s open source and free is another bonus: there’s no cost to try out the software and see if it fits your needs, and it’s enabled many companies to sift through large datasets that would otherwise have required expensive proprietary software.
As might be expected, then, Hadoop has made a fairly prominent entrance into the buzzword world, and it’s often unclear how Hadoop might be able to help you or your company. For this reason, many IT employees are keen to learn the ins and outs of Hadoop so they can decide whether it’s the right tool for processing their large datasets on a distributed platform. Others already know that they want to use it and need a guide to help them become competent in it. For many of them, a book is the quickest and most efficient way to do this.
Hadoop: The Definitive Guide is a great answer to this need. Tom White’s book sets out to provide everything that a Hadoop book should give its readers. The book is extremely comprehensive: it takes you from what Hadoop is, and how it differs from similar-sounding frameworks and tools, to in-depth explanations of Hadoop’s core functions and features. White also doesn’t shy away from design philosophy, something that too many authors skip: he’s willing to show you not just how something works, but also why it works that way and how it fits into the overarching design rationale of the framework itself.
As far as the book’s contents go, the chapters are fairly well organized. Chapter one, for example, starts out with an introduction to Hadoop: what it is and its history. It also details the benefits of Hadoop versus other tools for certain parallel processing tasks over large datasets, and it begins to introduce the concepts of map, combiner, and reduce functions, all of which are steadily built on throughout the book. White is very good at taking things slowly: for the most part you don’t feel lost in too much new material, and throughout the book he logically, patiently, and thoroughly takes you through Hadoop and how to use it, though he occasionally strays and a chapter can become a bit confusing to follow.
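For readers who haven’t met the map, combiner, and reduce functions mentioned above, here is a minimal in-memory sketch of how the three fit together for a word count. It is not taken from White’s book and does not use Hadoop’s API; the function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_job`) and the idea of passing "splits" as plain lists are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combine_fn(pairs):
    """Combine: pre-sum counts on the mapper side to shrink what gets shuffled."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_fn(word, counts):
    """Reduce: sum every partial count collected for a single key."""
    return word, sum(counts)


def run_job(splits):
    """Run map -> combine -> shuffle -> reduce over a list of input splits.

    Each split is a list of lines and stands in for the chunk of input
    that one mapper would process (hypothetical, in-memory only).
    """
    combined = []
    for split in splits:
        mapped = [pair for line in split for pair in map_fn(line)]
        combined.extend(combine_fn(mapped))

    # Shuffle: group the partial counts by key before reducing.
    grouped = defaultdict(list)
    for word, count in combined:
        grouped[word].append(count)

    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    splits = [["the cat sat", "the dog"], ["the end"]]
    print(run_job(splits))
```

The combiner is optional in this sketch: removing it changes how much data reaches the shuffle step, not the final counts, which is roughly the role it plays in a real cluster.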
One thing in the later chapters that I really liked is that White doesn’t just teach you about Hadoop’s functions and features: he also has ideas about how to use them more effectively. For example, he takes the time not only to show you how to create and run a MapReduce application, but also to show you how MapReduce jobs can be tested, configured, and tuned to make them more efficient. While this is something that many Hadoop developers would gradually learn on their own, White’s inclusion of it here makes it easier for new Hadoop developers to get up to speed on best practices and sound design for MapReduce jobs.
Another brief section that I found quite illuminating covers Amazon’s EC2 service and running Hadoop on EC2 virtual machines. While it’s possible that other Hadoop books have covered this subject, none come to mind. White’s writeup of using EC2 clusters to lift Hadoop into the cloud is quite interesting, and not something I had personally thought of before. It’s a great little section for anyone interested in running their Hadoop install in the cloud, and White even goes into setting up Hadoop on EC2 in case you need a primer to get started!
White also provides case studies for Hadoop. Some people might not like these, as they’re not strictly programming-related, but I’ve always found that case studies of practical applications of whatever language or framework you’re using are one of the best ways to get a clear idea of which situations call for the framework and how best to apply it to the task at hand. His case studies are clear and level-headed, and they give a good overview of how Hadoop has been used in practice to overcome parallel processing problems that companies have encountered in the past.
I have a few minor quibbles with White’s book, the biggest being that it can be a bit dry at times. That’s understandable given the density of the subject matter, but I definitely found myself losing focus or dozing off a few times while working through the prose. There are other Hadoop books that cover the material with a bit more style to capture and maintain reader interest, and I think White could have benefited from an editor with an eye for making the prose more readable and engaging in certain parts. As it stands, the book can sometimes be frustrating and a little confusing to get through.
That said, the book is quite good and definitely worth a read if you’re looking to get into Hadoop. It’s an in-depth, comprehensive tome that will leave you with a very good knowledge of Hadoop and how to apply it to the specific parallel processing problems in your workplace!