I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy would be the word count example from the Hadoop tutorial, except that one particular word appears a very large number of times.
I want to have a partition function where this one key is mapped to multiple reducers and the remaining keys follow their usual hash partitioning. Is this possible?
Thanks in advance.
I don't think the same key can be mapped to multiple reducers in Hadoop. However, the keys can be partitioned so that the reducers are more or less evenly loaded. For this, the input data should be sampled and the keys partitioned appropriately. Check the Yahoo paper for more details on the custom partitioner. The Yahoo sort code is in the org.apache.hadoop.examples.terasort package.
Let's say key A has 10 rows, B has 20, C has 30, and D has 60 in the input. Then keys A, B, and C can be sent to reducer 1 and key D to reducer 2, so that the load on the reducers is evenly distributed. To partition the keys this way, input sampling has to be done first to learn how the keys are distributed.
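The A/B/C/D example above can be sketched as a range-based partition function in the spirit of the TeraSort partitioner. This is plain Java rather than a real Hadoop `Partitioner` subclass, and the split points are hard-coded for illustration; in practice they would come from an input-sampling pass.

```java
import java.util.Arrays;

// Sketch of a range-based partition function driven by sampled split
// points. With split point "C", keys up to and including "C" (A, B, C)
// land in bucket 0 and everything after (D) lands in bucket 1,
// matching the example in the answer.
public class RangePartitioner {
    // Illustrative split points; a real job would compute these by
    // sampling the input key distribution.
    private static final String[] SPLIT_POINTS = { "C" };

    public static int getPartition(String key, int numPartitions) {
        // binarySearch returns the index of a matching split point, or
        // the negative insertion point for keys between split points;
        // both cases map to a bucket index.
        int pos = Arrays.binarySearch(SPLIT_POINTS, key);
        int bucket = pos >= 0 ? pos : -(pos + 1);
        return Math.min(bucket, numPartitions - 1);
    }
}
```

In a real job this logic would live in the `getPartition` method of a class implementing Hadoop's `Partitioner` interface, set on the job configuration.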
Here are some more suggestions to make the job complete faster.
Specify a combiner on the JobConf to reduce the number of records sent to the reducers. This also reduces the network traffic between the mapper and reducer tasks. Note, though, that there is no guarantee the Hadoop framework will invoke the combiner.
Also, since the data is skewed (some keys are repeated over and over, let's say 'tools'), you might want to increase the number of reduce tasks so the job completes faster. This ensures that while one reducer is processing 'tools', the other keys are being processed by other reducers in parallel.
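Both suggestions translate to a couple of lines on the JobConf. This is a configuration fragment using the classic `org.apache.hadoop.mapred` API; `WordCount` and `Reduce` stand in for whatever driver and reducer classes your job actually uses.

```java
// Configuration sketch (classic mapred API); class names are
// placeholders for your own job's classes.
JobConf conf = new JobConf(WordCount.class);

// Reuse the reducer as a combiner to do a map-side partial reduce.
conf.setCombinerClass(Reduce.class);

// Bump the number of reduce tasks so skewed keys don't serialize
// the rest of the work behind them.
conf.setNumReduceTasks(8);
```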
answered Oct 25 ’11 at 14:26
I don't understand how even distribution is related to unnecessary processing in the reducer task – evenly distributing the load on the reducers makes sure that the job completes faster. Otherwise, the total time of the job is bounded by the reducer that takes the longest. For this reason, Hadoop supports speculative execution, which is not efficient. – Praveen Sripati Oct 27 '11 at 13:14
If you split your data over multiple reducers for performance reasons, then you need a second reduce pass to aggregate the data into the final result set.
Hadoop has a feature built in that does something like that: the combiner.
The combiner is a "reducer"-like piece of functionality. It ensures that a partial reduce of the data can be done within the map task, which reduces the number of records that need to be processed later on.
In the basic word count example the combiner is exactly the same class as the reducer. Note that for some algorithms you will need different implementations for the two. I've also had a project where a combiner was not possible at all because of the algorithm.
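The effect described above can be shown in plain Java: because word counts are sums, a map-side partial reduce emits the same totals as summing raw 1s reducer-side, only with fewer records crossing the network. This is a stand-in for the Hadoop classes, not real MapReduce code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates why a combiner shrinks map output for word count:
// collapsing repeated words into partial counts map-side yields
// fewer records while preserving the final sums.
public class CombinerSketch {
    // Map-side partial reduce: one (word, count) record per distinct
    // word instead of one (word, 1) record per occurrence.
    public static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }
}
```

For three input words ("tools", "tools", "ant") the combiner emits only two records, and the reducer still receives the correct counts to sum. This works because addition is associative and commutative; algorithms without that property are exactly the ones where, as noted above, a combiner needs a different implementation or is not possible.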
answered Oct 25 ’11 at 12:16