Bright HDInsight Part 1: Consume Compressed Input

This is the first of a series of quick tips on HDInsight (Hadoop on Azure), and deals with how to consume a compressed input stream.

The scenario

The HDInsight product, and Big Data solutions in general, by definition deal with large amounts of data. Data at rest incurs a cost, whether in managing it, securing it or simply storing it. Where possible, it is good practice to reduce the weight of that data losslessly, without compromising data quality.

The standard way to do this is to compress the static data: repeated sequences are deflated out of the file and indexed so that they can be put back in place during a later inflation stage.
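As a quick illustration (not part of the original workflow), a log file can be deflated with .NET's built-in GZipStream before it is uploaded to the cluster's storage; the helper name and paths below are mine.

using System.IO;
using System.IO.Compression;

internal static class LogCompressor
{
    // Deflate a log file to sourcePath + ".gz" before uploading it to the cluster's storage.
    // Keeping the .gz extension is important (see "Filename dependency" below).
    public static void CompressLog(string sourcePath)
    {
        using (FileStream source = File.OpenRead(sourcePath))
        using (FileStream target = File.Create(sourcePath + ".gz"))
        using (GZipStream gzip = new GZipStream(target, CompressionMode.Compress))
        {
            source.CopyTo(gzip);
        }
    }
}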

The problem with this for HDInsight, or Hadoop in general, is that the input becomes a binary stream which your map/reduce code cannot consume directly.

Configuring a Hadoop Job to read a Compressed input file

In order to configure a Hadoop job to read compressed input, you simply specify a flag on the job command line. That flag is:

-D "io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec"

This causes an additional map task to be undertaken, which loads the input as a Gzip stream and inflates it before your map/reduce job begins. NOTE: this can be a time-consuming activity. If you plan to load this file many times for parsing, the efficiency saving will be limited.

A 2 GB gzipped example file of mine inflated to 11 GB, taking 20 minutes.
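For orientation, here is roughly where that flag sits in a manual streaming invocation (shown wrapped over several lines for readability; the jar location, paths and mapper/reducer names are placeholders rather than taken from the original post). Because -D is a generic Hadoop option, it must appear before the streaming-specific arguments:

hadoop jar lib/hadoop-streaming.jar
    -D "io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec"
    -input /input/data.log.gz
    -output /output/run1
    -mapper MyMapper.exe
    -reducer MyReducer.exe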

Filename dependency

If you try this approach, you might find that the input strings reaching your Map job still appear to be binary. This is a sign of the stream not being inflated by the infrastructure. There is one last thing you must ensure in order to trigger the inflation process: the filename must have the relevant extension. For example, to use Gzip the filename must end with .gz, such as "mylogfile.log.gz".
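To make this concrete in terms of the C# SDK used in the next section: the file you upload must itself carry the extension, and the job's input path then points at that name (the path below is illustrative).

// The uploaded file must be named with the codec's extension, e.g. mylogfile.log.gz.
// Pointing the job at a gzipped file without the .gz suffix means no codec is selected
// and your mapper receives raw binary input.
config.InputPath = "/input/mylogfile.log.gz";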

Using the HDInsight C# Streaming SDK to control Input Compression

In order to use the C# Streaming SDK with this flag, one simply modifies the Job.Configure override to add an additional generic argument specifying the codec.

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    HadoopJobConfiguration config = new HadoopJobConfiguration();
    config.InputPath = "/input/data.log.gz";
    config.OutputFolder = "/output/output" + DateTime.Now.ToString("yyyyMMddhhmmss");

    // Declare the Gzip codec so the infrastructure inflates the input before it reaches the mapper.
    config.AdditionalGenericArguments.Add("-D \"io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec\"");

    return config;
}
You will find an additional task specified in your JobTracker, which takes the stream and inflates it for your runtime code.
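Pulled together, a complete job might look something like the sketch below. It assumes the usual HadoopJob/MapperBase pattern of the C# Streaming SDK; the class names and the trivial mapper body are illustrative rather than taken from the original post.

using System;
using Microsoft.Hadoop.MapReduce;

public class CompressedInputJob : HadoopJob<LineLengthMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "/input/data.log.gz";
        config.OutputFolder = "/output/output" + DateTime.Now.ToString("yyyyMMddhhmmss");
        config.AdditionalGenericArguments.Add("-D \"io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec\"");
        return config;
    }
}

public class LineLengthMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // By the time Map is called the gzip stream has already been inflated,
        // so inputLine arrives as plain text.
        context.EmitKeyValue("length", inputLine.Length.ToString());
    }
}

You submit this just as you would any other Streaming SDK job; only the Configure override changes.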

A better way

If you can control your input stream (i.e. it is not provided by a third party), you should look at a better compression algorithm than Gzip, the one shown here. A better approach is LZO, a splittable algorithm that allows you to better distribute your work. Gzip must be processed sequentially, which makes it very hard to distribute the workload; LZO (configured by using com.hadoop.compression.lzo.LzoCodec) is splittable, allowing better workload distribution.
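In SDK terms the change is just the codec declaration, as in the minimal sketch below. This assumes the hadoop-lzo libraries are installed on your cluster; they do not ship with Hadoop itself.

// Same Configure override as before, but declaring the LZO codec rather than Gzip.
// (Requires the hadoop-lzo libraries on the cluster; the filename-extension rule above
// still applies, using the extension the LZO codec registers.)
config.AdditionalGenericArguments.Add("-D \"io.compression.codecs=com.hadoop.compression.lzo.LzoCodec\"");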

Happy Hadoopy Cloudy Times :-)
Andy

 
