- The 3rd edition of "Hadoop The Definitive Guide" covers the latest changes to the 1.x and 2.x APIs.
- The book extensively discusses the new distributed resource management system named YARN.
- The new edition also covers the new features of HDFS, including high availability and federation.
- "Hadoop The Definitive Guide" is a comprehensive reference book for Hadoop, covering virtually anything you want to know about the system.
- The book is recommended as an essential part of your Hadoop bookshelf and serves as a great reference for Hadoop's ecosystem projects.
My original review of Hadoop The Definitive Guide (TDG) covered the 2nd edition. The 3rd edition was released recently, and I reread the book in its entirety.
The new edition covers the latest changes to the 1.x (0.20) and 2.x (0.23) release lines. The book’s examples now use the 2.x API throughout. Those still using the 1.x API won’t be left in the lurch, because it is still discussed in the book.
There is extensive discussion of the runtime changes that come with Hadoop 2.0, chiefly the new distributed resource management system named YARN (Yet Another Resource Negotiator). While YARN is not yet recommended for production clusters, it is the future of Hadoop and very important to keep an eye on. TDG shows the new programming model for acquiring resources for a job. It also shows how YARN will make Hadoop more extensible for running other types of jobs.
Another addition is coverage of the new HDFS features. The first is high availability (HA). HA addresses the single point of failure that comes from running a single NameNode daemon. With the new HA feature, an active NameNode and a standby NameNode run side by side. TDG shows how this new failover mechanism works and which settings it needs. The second feature is federation, which allows a filesystem to have multiple NameNodes, each managing a different part of the filesystem. Once again, TDG tells you how to set this up and how it improves scalability.
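To give a feel for the kind of HA settings involved, here is a minimal sketch in plain Java that assembles a few client-side properties. The property names are the ones I know from the Hadoop 2.x HDFS documentation and may differ in early 0.23/2.0 releases; the nameservice `mycluster`, the NameNode IDs `nn1` and `nn2`, the host names, and the class `HaConfigSketch` are hypothetical. It is only a subset of the settings: a real deployment also needs shared edit log storage and fencing configured.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HaConfigSketch {
    // Builds a subset of the client-side HA properties for one nameservice.
    // The NameNode IDs nn1/nn2 and port 8020 are illustrative choices.
    static Map<String, String> haProperties(String ns, String host1, String host2) {
        Map<String, String> p = new LinkedHashMap<>();
        // Clients address the logical nameservice, not a single NameNode host.
        p.put("fs.defaultFS", "hdfs://" + ns);
        p.put("dfs.nameservices", ns);
        // The active/standby pair that backs the nameservice.
        p.put("dfs.ha.namenodes." + ns, "nn1,nn2");
        p.put("dfs.namenode.rpc-address." + ns + ".nn1", host1 + ":8020");
        p.put("dfs.namenode.rpc-address." + ns + ".nn2", host2 + ":8020");
        // Class the client uses to work out which NameNode is currently active.
        p.put("dfs.client.failover.proxy.provider." + ns,
              "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical nameservice and hosts.
        Map<String, String> props =
            haProperties("mycluster", "master1.example.com", "master2.example.com");
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In a real cluster these values would live in `core-site.xml` and `hdfs-site.xml` rather than in code; the map is just a compact way to show how the names hang together.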
The word “Definitive” in the book’s title is well founded. You can find virtually anything you want to know about Hadoop in this book. If you need to find that elusive parameter for changing the spill size, TDG has it. A quick search will give you the parameter name, its default value, and what it changes. If you need to know how an HDFS client reads a file and which remote procedure calls (RPCs) it makes, TDG has it. In a distributed system, these sorts of calls aren’t straightforward to trace.
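As an illustration of the spill lookup mentioned above, here is a small sketch of how the map-side spill threshold follows from two settings. The property names and defaults (`mapreduce.task.io.sort.mb` = 100, `mapreduce.map.sort.spill.percent` = 0.80) are the Hadoop 2.x ones as I know them; in 1.x they were `io.sort.mb` and `io.sort.spill.percent`. The class `SpillThreshold` is hypothetical, and the arithmetic is a simplification, since the real buffer also holds record metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpillThreshold {
    // Assumed Hadoop 2.x property names; verify against mapred-default.xml
    // for the release you actually run.
    static final String SORT_MB = "mapreduce.task.io.sort.mb";
    static final String SPILL_PERCENT = "mapreduce.map.sort.spill.percent";

    // Approximate number of buffered bytes at which a spill to disk begins.
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        long bufferBytes = (long) sortMb * 1024L * 1024L;
        return Math.round(bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        // Stand-in for a job configuration, filled with the default values.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(SORT_MB, "100");
        conf.put(SPILL_PERCENT, "0.80");

        long threshold = spillThresholdBytes(
            Integer.parseInt(conf.get(SORT_MB)),
            Double.parseDouble(conf.get(SPILL_PERCENT)));
        System.out.println("Spill starts at about " + threshold + " bytes");
    }
}
```

With the defaults, the threshold works out to 80 MB of a 100 MB buffer, so raising either value lets a map task buffer more output before it spills.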
A cover-to-cover read may not be for everyone, but TDG serves as a great reference. I recommend getting the PDF version because it makes searching much quicker. The smaller chapters on Hadoop’s ecosystem projects are very handy. If you don’t use Hive or Pig on a daily basis, TDG can refresh your memory.
I highly recommend TDG; it is an essential part of any Hadoop bookshelf. As an instructor and curriculum developer for Cloudera, I refer to the book extensively.