Finding The Optimal Minimum Split Count For Your Hadoop Job

December 11th, 2012 Leave a comment
Like the article?
Optimal Split Count For Your Hadoop

Figuring out ways to optimize Hadoop isn’t always easy, and one part of the job that’s often overlooked is the split size / split count of a Hadoop job. Most people often leave it alone to the preset defaults, but are the preset defaults right for you? Let’s find out!

When you’re looking at minimum split count, you want to look at a great deal of things: one of the most important ones is the map task capacity of the cluster for the particular job at head. Let’s say, for example, that a particular cluster has a map task capacity of 32. This means that it can run 32 jobs in parallel- you can see the inefficiency if, for example, we used a maximum split count of 5. Using a split count of 96 will show better results, but it too comes with a drawback- it’s going to create a great deal of overhead with regards to map task communication and size.

Some people advocate setting the number of splits equal to the map task capacity, but there’s a bit of difficulty here too: while this eliminates issues of the cluster’s capacity being used inefficiently, it makes for some fairly hefty splits. These big splits will lead to much longer-running tasks, and those tasks will block other Hadoop tasks in the queue, which is hardly the kind of optimization that many shops want when it comes to their Hadoop processes running.

A better idea when it comes to split counts is to split the difference (no pun intended!) by making the split count some multiple of the map task capacity. This means two things: the cluster’s scaling optimally, which is good, and the split counts are smaller, which greatly reduces the time the cluster is blocked by a task.

In general, here’s an equation you’ll want to use when calculating your split counts:

input / ( multiplier * mapTaskCapacity )

The input is the total size of the input for that particular job, the map task capacity is the maximum jobs your cluster can run. Using these two, you can figure out the optimal split size for your particular job. To find the multiplier, just use this equation:

input / (max_split_size * mapTaskCapacity)

That will give you the multiplier, and using those two together you can correctly calculate the optimal split size and split count for your particular job. It’s also possible that your scenario doesn’t need this type of optimization- some shops, for example, may not have many Hadoop tasks to perform and are perfectly content with making their split count the size of their map task capacity. In general, however, for most typical Hadop use case scenarios this strategy will lead to the best balance between optimal efficiency of both the cluster and the task queue that you have!

Help us spread the word!
  • Twitter
  • Facebook
  • LinkedIn
  • Pinterest
  • Delicious
  • DZone
  • Reddit
  • Sphinn
  • StumbleUpon
  • Google Plus
  • RSS
  • Email
  • Print
Don't miss another post! Receive updates via email!