Following on from Bright HDInsight Part 1, which dealt with consuming a compressed input stream in a Map/Reduce job, we’ll now see how to extend this scenario so that your Map/Reduce job also emits compressed output.
Scenario
Again, the scenario is very much one of reducing overhead on the management, security and storage of your data. If you are to leave your resulting work at rest in a remote system, you should reduce its footprint as much as possible.
Reiteration of tradeoff
Remember that you are trading an IO-bound problem for a compute-bound one: compressed data has to be inflated before it can be used. You should run metrics on this to see whether you’re truly saving what you think you are.
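As a rough way of gathering those metrics, the sketch below gzips a sample payload in memory and times both the compression and the inflation. It is a minimal illustration only: the CompressionProbe class, its Measure method and the sample data are hypothetical names made up for this example, and timings taken on your own machine will not match what a cluster node sees. Run it against a representative slice of your real job output before drawing any conclusions.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for this post; not part of the HDInsight SDK.
public static class CompressionProbe
{
    // Gzips the payload, inflates it again, and reports sizes and timings.
    public static (long RawBytes, long CompressedBytes, TimeSpan Deflate, TimeSpan Inflate)
        Measure(byte[] payload)
    {
        var timer = Stopwatch.StartNew();
        byte[] compressed;
        using (var sink = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final gzip block into the sink.
            using (var gzip = new GZipStream(sink, CompressionMode.Compress))
            {
                gzip.Write(payload, 0, payload.Length);
            }
            compressed = sink.ToArray();
        }
        var deflate = timer.Elapsed;

        timer.Restart();
        long inflatedLength;
        using (var source = new MemoryStream(compressed))
        using (var gzip = new GZipStream(source, CompressionMode.Decompress))
        using (var restored = new MemoryStream())
        {
            gzip.CopyTo(restored);
            inflatedLength = restored.Length;
        }
        var inflate = timer.Elapsed;

        // Sanity check: the round trip must give back every byte.
        if (inflatedLength != payload.Length)
        {
            throw new InvalidDataException("Round trip changed the payload length.");
        }

        return (payload.Length, compressed.Length, deflate, inflate);
    }

    public static void Main()
    {
        // Made-up sample data: repetitive tab-separated lines, similar in
        // shape to typical reducer output.
        var sample = new StringBuilder();
        for (var i = 0; i < 100000; i++)
        {
            sample.Append("key").Append(i % 100).Append('\t').Append(i).Append('\n');
        }

        var result = Measure(Encoding.UTF8.GetBytes(sample.ToString()));
        Console.WriteLine("Raw bytes:        {0}", result.RawBytes);
        Console.WriteLine("Compressed bytes: {0}", result.CompressedBytes);
        Console.WriteLine("Deflate time:     {0} ms", result.Deflate.TotalMilliseconds);
        Console.WriteLine("Inflate time:     {0} ms", result.Inflate.TotalMilliseconds);
    }
}
```

If the byte saving is small, or the inflate time dominates how long your downstream consumers spend on the data, compressed output may not be worth it for that particular job.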
Command Line
Again, this is achieved by using an argument on the command line:
mapred.output.compression.codec=org.apache.hadoop.io.compress.
Using the C# SDK for HDInsight
To do this, in your configure method, simply append to AdditionalGenericArguments as below:
config.AdditionalGenericArguments.Add("-D \"mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec\"");
config.AdditionalGenericArguments.Add("-D \"mapred.output.compress=true\"");
UPDATE: Alternatively, you can set the “CompressOutput” property to true on the config object, and the SDK will take care of this for you.
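If you stay with the explicit arguments rather than the CompressOutput property, and several jobs need the same settings, it can help to build the two -D strings in one place. The sketch below is just one way of doing that: CompressionArguments and ForCodec are hypothetical names invented for this example, not part of the SDK, and the only Hadoop settings assumed are the two property names already used above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for this post; not part of the HDInsight SDK.
public static class CompressionArguments
{
    // Builds the two generic arguments that switch on output compression
    // for the given codec class name.
    public static IReadOnlyList<string> ForCodec(string codecClassName)
    {
        if (string.IsNullOrWhiteSpace(codecClassName))
        {
            throw new ArgumentException("A codec class name is required.", nameof(codecClassName));
        }

        return new[]
        {
            "-D \"mapred.output.compress=true\"",
            "-D \"mapred.output.compression.codec=" + codecClassName + "\""
        };
    }
}
```

In your configure method you would then loop over CompressionArguments.ForCodec("org.apache.hadoop.io.compress.GzipCodec") and pass each string to config.AdditionalGenericArguments.Add, which keeps the codec name in a single spot if you later want to try a different one.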
Happy clouding,
Andy
