
Bright HDInsight Part 3: Using custom Counters in HDInsight

Telemetry is life! I repeat this mantra over and over: with any system, but especially with remote systems, the state of the system is difficult or impossible to ascertain without metrics on its internal processing. Computer systems operate outside the scope of human comprehension; they’re too quick, complex and transient for us ever to know their state confidently while they are running. The best we can do is emit metrics and provide a way to view them, so that we can judge a general sense of progress.

Metrics in HDInsight

The metrics I present here relate to HDInsight and the Hadoop Streaming API, with the examples written in C#. Because counters are a core Hadoop feature, the same counters can be accessed from other programmatic interfaces to HDInsight.

These metrics shouldn’t be used for data gathering, as that is not their purpose. Use them to track the state of the system, not its results. The line between the two is a thin one, however ;-) For instance, if we know there are 100 million rows of data pertaining to “France” and 100 million rows pertaining to “UK”, spread across multiple files and partitions, then we might want a metric that reports progress through each of the two. In practice, however, this type of scenario (progress through a job) is better measured without resorting to measuring the data directly.

Typically we also want the ability to group similar metrics together in a category for easier reporting, and I shall show an example of that.

Scenario

The example data used here consists of randomly generated rows, each with a data identifier, an action type and a country of origin. This slim three-field file will be mapped to select the country of origin as the key, and reduced to count everything by country of origin.

823708 rz=q UK
806439 rz=q UK
473709 sf=21 France
713282 wt.p=n UK
356149 sf=1 UK
595722 wt.p=n France
238589 sf=1 France
478163 sf=21 France
971029 rz=q France
……10000 rows…..
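To make the shape of the job concrete, here is a minimal local sketch of the same map and count logic in plain C# with LINQ, with no Hadoop involved. The class and method names (LocalCountryCount, MapToCountry, CountByCountry) are hypothetical, and the sketch assumes the three fields are tab-separated, which is what the mapper in this post splits on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local stand-in for the job: not part of the HDInsight SDK.
public static class LocalCountryCount
{
    // "Map" step: pick the country (third tab-separated field) as the key.
    // Returns null for malformed lines that have fewer than three fields.
    public static string MapToCountry(string inputLine)
    {
        var segments = inputLine.Split(new[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return segments.Length >= 3 ? segments[2] : null;
    }

    // "Reduce" step: count the lines seen for each country key.
    public static Dictionary<string, int> CountByCountry(IEnumerable<string> lines)
    {
        return lines
            .Select(line => MapToCountry(line))
            .Where(country => country != null)
            .GroupBy(country => country)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Main()
    {
        // Assumed tab-separated versions of a few sample rows, plus one malformed row.
        var sample = new[]
        {
            "823708\trz=q\tUK",
            "473709\tsf=21\tFrance",
            "713282\twt.p=n\tUK",
            "not a valid row"
        };

        foreach (var pair in CountByCountry(sample))
        {
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
        }
    }
}
```

The malformed row is simply skipped here; in the real job that is exactly the kind of event worth surfacing through a counter rather than silently dropping.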

Mapper

This example shows how to add the counters to the Map class of the Map/Reduce job.

using System;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.SampleDataMapReduceJob
{
    public class SampleMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                // Counter by name only: it is given no category of its own.
                context.IncrementCounter("Line Processed");
                var segments = inputLine.Split("\t".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

                // Counter with an explicit category ("country"), named after the country itself.
                context.IncrementCounter("country", segments[2], 1);

                // Emit the country of origin as the key and the whole line as the value.
                context.EmitKeyValue(segments[2], inputLine);
                // Counter by name with a custom increment.
                context.IncrementCounter("Text chars processed", inputLine.Length);
            }
            catch (IndexOutOfRangeException ex)
            {
                // We still allow other exceptions to throw and set an error state on the task, but
                // we are confident this exception type is due to the input having fewer than 3
                // separated segments.
                context.IncrementCounter("Logged recoverable error", "Input Format Error", 1);
                context.Log(string.Format("Input Format Error on line {0} in {1} - {2} was {3}", inputLine, context.InputFilename,
                    context.InputPartitionId, ex.ToString()));
            }
        }
    }
}

Here we use the “context” parameter to interact with the underlying Hadoop runtime. context.IncrementCounter is the key operation we are calling; underneath, it writes a line to stderr in the format:
"reporter:counter:{0},{1},{2}", category, counterName, increment

We are using this method’s overloads in three different ways: to increment a counter simply by name, to increment a counter by category and name, and to increment a counter by a custom amount.
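As a rough illustration of what those three overloads boil down to, here is a minimal, self-contained sketch that writes the same reporter lines to stderr directly. CounterReporter is a hypothetical helper, not the SDK's implementation, and using "Custom" as the default category is an assumption based on where uncategorised counters turn up in the job tracker.

```csharp
using System;

// Hypothetical helper that mimics the counter overloads used in the mapper.
public static class CounterReporter
{
    // Assumption: "Custom" stands in for the default category of uncategorised counters.
    private const string DefaultCategory = "Custom";

    // Builds the Hadoop Streaming counter line: reporter:counter:<category>,<name>,<increment>
    public static string FormatCounter(string category, string counterName, long increment)
    {
        return string.Format("reporter:counter:{0},{1},{2}", category, counterName, increment);
    }

    // By name only: default category, increment of 1.
    public static void IncrementCounter(string counterName)
    {
        IncrementCounter(DefaultCategory, counterName, 1);
    }

    // By name with a custom increment: default category.
    public static void IncrementCounter(string counterName, long increment)
    {
        IncrementCounter(DefaultCategory, counterName, increment);
    }

    // By category and name with an explicit increment.
    public static void IncrementCounter(string category, string counterName, long increment)
    {
        // Hadoop Streaming watches the task's stderr for lines in this format.
        Console.Error.WriteLine(FormatCounter(category, counterName, increment));
    }

    public static void Main()
    {
        IncrementCounter("Line Processed");
        IncrementCounter("country", "UK", 1);
        IncrementCounter("Text chars processed", 14);
    }
}
```

Run outside Hadoop, this just prints three lines to stderr; inside a streaming task, lines of this shape are what the framework picks up as counter updates.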

We are free to add these counters at any point in the map/reduce program to gain telemetry on our job as it progresses.

Viewing the counters

To view the counters, visit the Hadoop Job Tracking portal. The Hadoop console output contains the tracking URL for your streaming job, including its job id. For me it was http://10.174.120.28:50030/jobdetails.jsp?jobid=job_201212091755_0001, reported in the output as:

12/12/09 18:01:37 INFO streaming.StreamJob: Tracking URL: http://10.174.120.28:50030/jobdetails.jsp?jobid=job_201212091755_0001
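If you capture the console output programmatically, a small sketch like the following could pull the tracking URL and job id out of that line. TrackingUrlParser and its regular expression are hypothetical, and they assume the line always has the shape shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts the tracking URL and job id from a StreamJob log line.
public static class TrackingUrlParser
{
    // Assumes the URL ends in "jobid=job_<digits>_<digits>", as in the line above.
    private static readonly Regex TrackingLine =
        new Regex(@"Tracking URL:\s*(?<url>\S+jobid=(?<jobId>job_\d+_\d+))");

    public static bool TryParse(string consoleLine, out string url, out string jobId)
    {
        var match = TrackingLine.Match(consoleLine);
        url = match.Success ? match.Groups["url"].Value : null;
        jobId = match.Success ? match.Groups["jobId"].Value : null;
        return match.Success;
    }

    public static void Main()
    {
        var line = "12/12/09 18:01:37 INFO streaming.StreamJob: Tracking URL: " +
                   "http://10.174.120.28:50030/jobdetails.jsp?jobid=job_201212091755_0001";

        string url, jobId;
        if (TryParse(line, out url, out jobId))
        {
            Console.WriteLine("Job {0} is tracked at {1}", jobId, url);
        }
    }
}
```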

Since I used two counters that were not given a category, they appear in the “Custom” category.

Another of my counters was given a custom category, and thus it appears in a separate section of the counters table.

In my next posts, I will focus on error handling, status and more data-centric operations.

Happy big-data-cloudifying ;-)
Andy
