Bright HDInsight Part 2: Smart compressed output

Following on from Bright HDInsight Part 1, which dealt with consuming a compressed input stream in a Map/Reduce job, we’ll now see how to extend this scenario to emit compressed data from the output of your Map/Reduce job.

Scenario

Again, the scenario is very much one of reducing overhead on the management, security and storage of your data. If you are to leave your resulting work at rest in a remote system, you should reduce its footprint as much as possible.

Reiteration of tradeoff

Remember that you are shifting an IO-bound problem to a compute-bound one: compressed data must be inflated before it can be used. You should measure this to see whether you’re truly saving what you think you are.
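One rough way to gather such metrics locally is to gzip a sample of your output and time both directions. The sketch below is only an illustration: the generated sample data and the class name CompressionTradeoff are made up, and numbers from a single machine will not match what the cluster sees, so treat the result as a first indication only.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressionTradeoff
{
    static void Main()
    {
        // Hypothetical sample: repetitive tab-separated lines standing in for real job output.
        var sample = new StringBuilder();
        for (int i = 0; i < 200000; i++)
        {
            sample.Append("key").Append(i % 100).Append('\t').Append(i).Append('\n');
        }
        byte[] raw = Encoding.UTF8.GetBytes(sample.ToString());

        // Time the compression step (the extra compute paid when writing).
        var timer = Stopwatch.StartNew();
        byte[] compressed;
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray is still valid after the stream has been closed.
            compressed = buffer.ToArray();
        }
        long compressMs = timer.ElapsedMilliseconds;

        // Time the inflation step (the extra compute paid by every later reader).
        timer.Restart();
        byte[] restored;
        using (var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            restored = output.ToArray();
        }
        long inflateMs = timer.ElapsedMilliseconds;

        Console.WriteLine("Raw bytes:        {0}", raw.Length);
        Console.WriteLine("Compressed bytes: {0}", compressed.Length);
        Console.WriteLine("Ratio:            {0:P1}", (double)compressed.Length / raw.Length);
        Console.WriteLine("Compress ms:      {0}", compressMs);
        Console.WriteLine("Inflate ms:       {0}", inflateMs);
        Console.WriteLine("Round trip ok:    {0}", restored.Length == raw.Length);
    }
}
```

Run it against a representative slice of your real output rather than the synthetic sample; highly repetitive data compresses far better than most real data sets.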

Command Line

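Again, this is achieved by passing generic arguments on the command line, one to switch output compression on and one to choose the codec:

-D "mapred.output.compress=true"

-D "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"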

Using the C# SDK for HDInsight

To do this, in your Configure method, simply add to AdditionalGenericArguments as below:

config.AdditionalGenericArguments.Add("-D \"mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec\"");

config.AdditionalGenericArguments.Add("-D \"mapred.output.compress=true\"");

UPDATE: Alternatively, you can set the “CompressOutput” property to true on the config object, and the SDK will take care of this for you.
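For context, here is a minimal sketch of how that property might sit inside a complete job definition. It assumes the Microsoft.Hadoop.MapReduce types as they appear in the SDK's own samples (HadoopJob, MapperBase, HadoopJobConfiguration, ExecutorContext); the class names PassThroughMapper and CompressedOutputJob and both paths are hypothetical, and the code needs the SDK package referenced in order to compile.

```csharp
using Microsoft.Hadoop.MapReduce;

// Hypothetical pass-through mapper, present only so the job has something to run.
public class PassThroughMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit each input line unchanged.
        context.EmitLine(inputLine);
    }
}

// Hypothetical job that asks the SDK to compress its output.
public class CompressedOutputJob : HadoopJob<PassThroughMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration();
        config.InputPath = "input/sample";     // hypothetical input path
        config.OutputFolder = "output/sample"; // hypothetical output folder

        // Used in place of the two AdditionalGenericArguments entries.
        config.CompressOutput = true;
        return config;
    }
}
```

If you need a codec other than the one the SDK picks, the explicit generic arguments remain the way to state it.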

Happy clouding,

Andy
