Tuesday, January 26, 2010

Hadoop code efficiency hint

Creating a new key object in the mapper for every transformed record is not the most efficient pattern. Most key classes provide a set() method, which sets the current value of the key. The context.write() method uses the current value of the key, and once write() returns, the key object and the value object are free to be reused.
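
As a rough illustration of the reuse pattern, here is a minimal sketch that runs without Hadoop on the classpath. MutableLongKey and SerializingSink are hypothetical stand-ins for a key class such as LongWritable and for the output collector; they are not Hadoop classes. The sketch assumes the sink serializes the key inside write(), which is what makes reuse safe.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Hadoop key class such as LongWritable.
final class MutableLongKey {
    private long value;

    void set(long value) { this.value = value; }

    void write(DataOutputStream out) throws IOException { out.writeLong(value); }
}

// Hypothetical stand-in for the output collector. It serializes the key
// inside write(), so it keeps no reference to the key afterwards.
final class SerializingSink {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);

    void write(MutableLongKey key) throws IOException { key.write(out); }

    byte[] toByteArray() { return buffer.toByteArray(); }
}

final class KeyReusingMapper {
    // Allocated once per mapper instance and reused for every record.
    private final MutableLongKey outKey = new MutableLongKey();

    void map(long input, SerializingSink sink) throws IOException {
        outKey.set(input * 2); // overwrite the old value instead of calling new
        sink.write(outKey);    // after this returns, outKey may be reused
    }
}

final class KeyReuseDemo {
    // Maps every input, then decodes the serialized output back into longs.
    static long[] run(long[] inputs) {
        try {
            KeyReusingMapper mapper = new KeyReusingMapper();
            SerializingSink sink = new SerializingSink();
            for (long input : inputs) {
                mapper.map(input, sink);
            }
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(sink.toByteArray()));
            long[] decoded = new long[inputs.length];
            for (int i = 0; i < decoded.length; i++) {
                decoded[i] = in.readLong();
            }
            return decoded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(new long[] {1, 2, 3})));
    }
}
```

Every record goes through the same outKey instance, yet each serialized value is distinct, because the value is copied out before the next set() call overwrites it.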

If the job is configured to multithread the map method, via conf.setMapRunnerClass(MultithreadedMapRunner.class), the map method will be called by multiple threads. Extreme care must be taken when using the mapper class member variables. A ThreadLocal LongWritable object could be used to ensure thread safety.

The following sample snippet demonstrates a common pattern for per-job management of map task parallelism. The choice of 100 was made for demonstration purposes and is not suitable for a CPU-intensive map task.

if (conf.getInt("mapred.tasktracker.map.tasks.maximum", 2) == 1) {
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    conf.setInt("mapred.map.multithreadedrunner.threads", 100);
}
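
The ThreadLocal suggestion above might look roughly like the following sketch, which again runs without Hadoop. LongHolder is a hypothetical stand-in for LongWritable, and the AtomicLong and the set of holders are illustrative instrumentation in place of a real output collector. The only point is that each thread calling map() gets its own reusable holder.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for LongWritable.
final class LongHolder {
    private long value;

    void set(long value) { this.value = value; }

    long get() { return value; }
}

final class ThreadLocalKeyMapper {
    // Each thread that calls map() lazily gets its own holder.
    private final ThreadLocal<LongHolder> outKey =
        ThreadLocal.withInitial(LongHolder::new);

    // Illustrative stand-ins for the output collector and for instrumentation.
    final AtomicLong outputSum = new AtomicLong();
    final Set<LongHolder> holdersSeen = ConcurrentHashMap.newKeySet();

    void map(long input) {
        LongHolder key = outKey.get();   // this thread's private holder
        key.set(input + 1);              // safe: no other thread can see it
        holdersSeen.add(key);            // record which instance was used
        outputSum.addAndGet(key.get());  // value is copied out at "write" time
    }
}

final class ThreadLocalKeyDemo {
    // Returns {sum of all emitted values, number of distinct holders used}.
    static long[] run(int threadCount, int recordsPerThread) {
        ThreadLocalKeyMapper mapper = new ThreadLocalKeyMapper();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (long i = 0; i < recordsPerThread; i++) {
                    mapper.map(i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return new long[] {mapper.outputSum.get(), mapper.holdersSeen.size()};
    }

    public static void main(String[] args) {
        long[] result = run(4, 1000);
        System.out.println("sum=" + result[0] + " holders=" + result[1]);
    }
}
```

With a plain member field instead of the ThreadLocal, two threads could interleave set() and the read, and one thread could emit the other's value. The per-thread holder avoids that without locking and still allocates only one object per thread.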



Object churn is a significant performance issue in a map method, and to a lesser extent, in the reduce method. Object reuse can provide a significant performance gain.
