Sort the output of a MapReduce job globally

This forum covers the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie, ZooKeeper) and Hadoop distributions such as Cloudera and Hortonworks.

Post by mohit123 » Sun Sep 21, 2014 12:28 am

Is the output of a MapReduce job globally sorted when it runs with more than one reducer? If not, how can I globally sort the output of a MapReduce job?


Re: Sort the output of a MapReduce job globally

Post by Guest » Mon Sep 22, 2014 4:16 am

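Not by default. With more than one reducer, each reducer's output file is sorted by key, but the files taken together are not in one global order. You can get globally sorted output (which is basically what you want) in any of these ways:

Use just one reducer. This is a bad idea for large data, because all of the sorting and reducing work lands on one machine.
Write a custom partitioner. The Partitioner is the class that divides the key space among the reducers. The default partitioner (HashPartitioner) assigns each key to a reducer by its hash code, which spreads keys fairly evenly but scatters neighbouring keys across reducers. A custom partitioner that sends contiguous key ranges to reducers in order (lowest range to reducer 0, next range to reducer 1, and so on) makes the concatenated output files globally sorted. Check out this example for writing a custom partitioner.

Use Pig or Hive to do the sort (both provide ORDER BY).

Use the TotalOrderPartitioner class instead of the default HashPartitioner. It partitions by key range using split points read from a partition file, so the job's output comes out in globally sorted order.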
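
To make the range-partitioning idea concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API: the class name RangePartitionSketch, the split points and the simulate helper are all made up for illustration. In a real job the same lookup would live in a subclass of org.apache.hadoop.mapreduce.Partitioner, inside getPartition(key, value, numPartitions), and the split points would normally come from sampling the input rather than being hard-coded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: mimics what a range partitioner does, without Hadoop.
class RangePartitionSketch {

    // Returns the index of the (simulated) reducer that should receive `key`.
    // splitPoints must be sorted ascending; k split points give k + 1 partitions.
    // Partition 0 holds keys below splitPoints[0]; partition i holds keys in
    // [splitPoints[i-1], splitPoints[i]); the last holds everything else.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at index i: the key equals split point i and starts partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Simulates the shuffle: route every key to its partition, sort within each
    // partition, then concatenate the partitions in index order.
    // A TreeSet stands in for the per-reducer sort and key grouping,
    // so duplicate keys collapse into one entry.
    static List<String> simulate(String[] keys, String[] splitPoints) {
        List<TreeSet<String>> reducers = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reducers.add(new TreeSet<>());
        }
        for (String key : keys) {
            reducers.get(getPartition(key, splitPoints)).add(key);
        }
        List<String> output = new ArrayList<>();
        for (TreeSet<String> reducer : reducers) {
            output.addAll(reducer);
        }
        return output;
    }

    public static void main(String[] args) {
        String[] keys = {"pear", "apple", "zebra", "mango", "fig", "kiwi"};
        String[] splitPoints = {"g", "n"}; // hypothetical split points, 3 partitions
        // prints [apple, fig, kiwi, mango, pear, zebra]
        System.out.println(simulate(keys, splitPoints));
    }
}
```

Because every key in partition i is smaller than every key in partition i + 1, reading the part files in order gives one sorted sequence. The catch is choosing split points that balance the load, which is the part TotalOrderPartitioner handles through its sampled partition file.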

