This is the first of a series of quick tips on HDInsight (Hadoop on Azure), and deals with how to consume a compressed input stream.
The HDInsight product, and Big Data solutions in general, by definition deal with large amounts of data. Data at rest incurs a cost, whether in managing it, securing it or simply storing it. Where possible, it is good practice to reduce the weight of data losslessly, without compromising data quality.
The standard way to do this is to compress the static data: repeated byte sequences are deflated out of the file and replaced with back-references, so that they can later be restored in an inflation stage.
The problem with this for HDInsight, or Hadoop in general, is that the input becomes a binary stream which the framework cannot read directly.
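To see why a GZip stream is opaque to the framework, consider this small Python sketch (illustrative only – Hadoop itself does this in Java). A gzip member can only be inflated sequentially from its header, so a worker handed an arbitrary byte range of the file cannot decompress its slice:

```python
import gzip
import zlib

data = b"some log line\n" * 100_000
blob = gzip.compress(data)

# Inflating from the start of the stream works:
assert gzip.decompress(blob) == data

# Inflating from an arbitrary midpoint does not -- the DEFLATE
# back-references point at bytes this "worker" never saw, and the
# slice has no gzip header:
try:
    zlib.decompress(blob[len(blob) // 2:], 16 + zlib.MAX_WBITS)
    mid_ok = True
except zlib.error:
    mid_ok = False
assert not mid_ok
```

This is the underlying reason a gzipped input cannot simply be chopped into splits and handed to parallel map tasks.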
Configuring a Hadoop Job to read a Compressed input file
In order to configure a Hadoop job to read the compressed input, you simply have to specify a flag on the job command line. That flag is:
This causes an additional map task to be undertaken which inflates the input as a GZip stream before your map/reduce job begins. NOTE: this can be a time-consuming activity – if you plan on loading this file many times for parsing, the efficiency saving will be limited.
A 2 GB GZipped example file of mine inflated to 11 GB, taking 20 minutes.
If you try this approach, you might find that the input strings to your Map job still appear to be binary. This is a sign that the stream is not being inflated by the infrastructure. There is one last thing you must ensure in order to trigger the inflation: the filename must have the relevant extension. For example, to use GZip, the filename must end with .gz, such as “mylogfile.log.gz”.
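The codec is chosen purely from the path's suffix – in Hadoop the real lookup lives in `CompressionCodecFactory` on the Java side; the hypothetical Python sketch below only mirrors the idea (the two codec class names are real Hadoop classes, the lookup function is mine):

```python
# Hypothetical sketch of Hadoop's suffix-based codec selection
# (the real logic is CompressionCodecFactory in Java).
CODEC_BY_SUFFIX = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
}

def codec_for(path: str):
    """Return the codec class name implied by the file extension."""
    for suffix, codec in CODEC_BY_SUFFIX.items():
        if path.endswith(suffix):
            return codec
    return None  # no suffix match: the bytes reach your mapper raw

# "mylogfile.log.gz" triggers inflation; "mylogfile.log" does not:
assert codec_for("mylogfile.log.gz") == "org.apache.hadoop.io.compress.GzipCodec"
assert codec_for("mylogfile.log") is None
```

This is why a correctly compressed file with the wrong name silently arrives in your mapper as raw binary.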
Using the HDInsight C# Streaming SDK to control Input Compression
In order to use the C# Streaming SDK with this flag, one simply modifies the Job.Configure override in order to add an additional generic argument specifying it.
public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    HadoopJobConfiguration config = new HadoopJobConfiguration();
    config.InputPath = "/input/data.log.gz";
    config.OutputFolder = "/output/output" + DateTime.Now.ToString("yyyyMMddhhmmss");
    return config;
}
A better way
If you can control your input stream (i.e. it is not provided by a third party), you should look at a better compression algorithm than the one shown here, GZip. A better approach is to use LZO, a splittable format that allows you to better distribute your work. A GZip stream must be processed sequentially from the start, which makes it very hard to distribute the workload. LZO (which is configured by using com.hadoop.compression.lzo.LzoCodec) compresses in independent blocks and, once the file has been indexed, is splittable, allowing better workload distribution.
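The difference can be sketched in a few lines of Python (illustrative only – this uses zlib, not the LZO codec itself): a block-compressed file with an index of block offsets lets any map task seek to a block and inflate it alone, which a single monolithic stream cannot do.

```python
import zlib

# Four chunks compressed independently, in the spirit of LZO's
# block format (zlib stands in for LZO purely for illustration):
chunks = [(b"%d:line\n" % i) * 10_000 for i in range(4)]
blocks = [zlib.compress(c) for c in chunks]

# An index of block offsets is what makes the file splittable:
# a map task can jump straight to block 2 and inflate only that.
assert zlib.decompress(blocks[2]) == chunks[2]

# By contrast, one monolithic stream only inflates from the front:
mono = zlib.compress(b"".join(chunks))
assert zlib.decompress(mono) == b"".join(chunks)
```

With GZip the whole file is one such monolithic stream, so one mapper does all the inflation; with indexed LZO the blocks fan out across the cluster.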