Author Archives: Andy Cross

About Andy Cross

Andy Cross (@andybareweb) is the author of the popular Windows Azure blog at blog.elastacloud.com. He is a software consultant, cloud architect and co-owner of Elastacloud. His passion for embedded software and high performance compute clusters gives him a unique sphere of computation, from the very small and resource-constrained to the massively scalable, limitless potential of the cloud. Andy’s specialism in the Azure realm is runtime diagnostics and service elasticity. He is a Windows Azure MVP and Insider, co-founder of the UK London Windows Azure User Group and a Microsoft DevPro Community Leader.

Global Windows Azure Bootcamp

On April 27th 2013, I’m helping to run the London instance of the Global Windows Azure Bootcamp.

It’s a whole day of Windows Azure training, labs, talks and hands on stuff that’s provided free by the community for the community. You can read more about the type of stuff we’re doing here http://magnusmartensson.com/globalwindowsazure-a-truly-global-windows-azure-community-event 

We’re covering .NET, Java and other open stacks on Azure, as well as participating in one of the largest single uses of Azure compute horsepower in the global community’s history. This is going to rock, significantly.

If London is too far, look for a closer event here http://globalwindowsazure.azurewebsites.net/?page_id=151

Happy clouding,

Andy


HDInsight: Workaround error Could not find or load main class

Sometimes when running the C# SDK for HDInsight, you can come across the following error:

The system cannot find the batch label specified – jar
Error: Could not find or load main class c:\apps\dist\hadoop-1.1.0-SNAPSHOT\lib\hadoop-streaming.jar

To get around this, close the command shell that you are currently in and open up a new hadoop shell, and try your command again. It should work immediately.

This tends to occur after killing a Hadoop job, so I am assuming that something this activity does changes the context of the command shell in such a way that it can no longer find the Hadoop jar files. I’ve yet to get to the bottom of it, so if anyone has any bright ideas, let me know in the comments ;-)

Good Hadoopification,

Andy

HDInsight: Workaround bug when killing Jobs

When running a Streaming Job from the console in HDInsight, you are given a message which describes how to kill the job:

13/01/09 14:52:07 INFO streaming.StreamJob: To kill this job, run:
13/01/09 14:52:07 INFO streaming.StreamJob: C:\apps\dist\hadoop-1.1.0-SNAPSHOT/bin/hadoop job -Dmapred.job.tracker=10.186.136.26:9010 -kill job_201301081702_0001

Unfortunately there is an error in this and it will not work:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop job -Dmapred.job.tracker=10.186.136.26:9010 -kill job_201301081702_0014
Usage: JobClient <command> <args>
 [-submit <job-file>]
 [-status <job-id>]
.....

This is because there is an error in the command as written out by the hadoop streaming console. There should be a space between the -D and mapred.job.tracker=ipAddressJobTracker, and furthermore the mapred.job.tracker parameter should be quoted:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop job -D "mapred.job.tracker=10.186.136.26:9010" -kill job_201301081702_0001
Killed job job_201301081702_0001

Et voila.

Happy big-dataification ;-)
Andy

Bright HDInsight Part 6: Take advantage of your console context

The Microsoft.Hadoop.MapReduce SDK for HDInsight (Hadoop on Azure) requires you to package your Map/Reduce job in a Windows console application that is then used by the Hadoop Streaming API in order to run your logic. This gives a very natural way to configure and orient your application at startup, of which this blog will give a simple example.

Console Arguments

The natural requirement to explore is the supply of runtime arguments to the Map/Reduce console application (hereafter, the package), which can be used for configuration purposes. We may use them to manually specify additional generic parameters, input or output locations, or to control the type of job being run. Consider the input location case: we may want to write a single Map/Reduce job, but then be able to run that job against different datasets, differentiated by their storage location.

A very natural way to do this, for anyone familiar with a command shell, is to run:

mypackage.exe /path/to/input.txt

This would specify a path to an input location on which the application could work. In order to achieve this with the SDK, given that we have parsed and verified the input, we can then include the arguments that we are interested in as a parameter to the HadoopJobExecutor.ExecuteJob(string[] args) method.
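
As a minimal sketch – assuming a hypothetical job class called SampleJob deriving from HadoopJob, and the generic ExecuteJob overload that takes the string[] arguments mentioned above – the package’s Main method simply forwards its arguments so that they surface later in the ExecutorContext:

static void Main(string[] args)
{
    // args[0] is expected to carry the input path, e.g. /path/to/input.txt
    // forwarding args makes them available via ExecutorContext.Arguments
    HadoopJobExecutor.ExecuteJob<SampleJob>(args);
}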

Once we have done that, in our TJobClass (subclass of HadoopJob) Configure method, we are provided with an ExecutorContext instance. This ExecutorContext has a single property, “Arguments”.

This Property is the string[] parameter that we passed into the ExecuteJob method. We can then use this to configure the Job during setup:

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    var hadoopConfiguration = new HadoopJobConfiguration();
    // the first console argument supplies the input location
    hadoopConfiguration.InputPath = context.Arguments[0];
    hadoopConfiguration.OutputFolder = "asv://output/" + DateTime.Now.Ticks.ToString(CultureInfo.InvariantCulture);

    return hadoopConfiguration;
}

Conclusion

Use your console context well. Further examples of what you could do: capture the output of the package host to a text file with > or >> commands; pipe the results to another program using |; chain execution together with a batch file; schedule execution with the Windows Scheduler.
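
For instance – taking the hypothetical mypackage.exe from above, with illustrative file names – you might capture the host’s output or chain runs like so:

mypackage.exe /path/to/input.txt >> jobruns.log
mypackage.exe /path/to/input.txt | findstr "error"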

Happy cloudy-big-dataification (and happy new year),
Andy


Bright HDInsight Part 5: Better status reporting in HDInsight

Following on from earlier blogs in this series, in this blog I will show another way of getting telemetry out of HDInsight. We have previously tackled counters and logging to the correct context, and now I will show an additional type of logging that is useful while a job is in process: the reporting of a job’s status as it runs.

This status is a transient record of the state of a Map/Reduce job – it is not persisted beyond the execution of a job and is volatile. Its volatility is evident in that any subsequent set of this status will overwrite the previous value. Consequently, any inspection of this value will only reflect the state of the system when it was inspected, and may already have changed by the time the user gets access to the value it holds.

This does not make the status irrelevant; in fact it gives a glimpse of the in-process system state and allows the viewer to immediately determine a recent state. This is useful in its own right.

Controlling the Status

The status of a Map/Reduce job is reported at the Map, Combine and Reduce stages of execution, and the individual Task can report its status.

This status control is not available in the Hadoop C# SDK but is simple to achieve using Console.Error – see Part 2 of this series for why this stream is used!

In order that we can remain in sync with the general approach taken by the C# Hadoop SDK, we will wrap the call into an extension method on ContextBase, the common base class for Mapper, Combiner and Reducer. With this new method – which we will call SetStatus(string status) – we will write a well known format to the output that Hadoop picks up for its Streaming API, and Hadoop will then use this as the status of the job.

The format of the string is:
reporter:status:Hello, World!

This will set the executing task’s state to “Hello, World!”.

Implementing the Extension method

The extension method is very simple:

public static class ContextBaseExtension
{
    public static void SetStatus(this ContextBase context, string statusMessage)
    {
        // Hadoop's Streaming API watches stderr for reporter:status: lines
        Console.Error.WriteLine("reporter:status:{0}", statusMessage);
    }
}

Once we have this method referenced and included as a using include to our map/reduce/combine class, we can use the context parameter of our task (in my case I chose a Map), and set a status!

I chose a trivial and unrealistic sleep statement, so that I can see the message easily in the portal :-)

context.SetStatus("Sleeping During Map!");
Thread.Sleep(180000);

Viewing the output

To view this, you can enter the Hadoop portal and check the state of the running task. In this screenshot you can see that the status is set to “Sleeping During Map!” – and you can see why I used Thread.Sleep – to give me a few moments to log onto the RDP server and take the screenshot. I chose 3 minutes so I could also have time to get a coffee …. ;-)

Production use

A better example, and a very useful use case for this, is to report whether a Task is running successfully or whether it is encountering errors that have to be recovered from. In order to achieve this, we will catch any exceptions that can be recovered from, report that we are doing so, and then continue execution.

Notice that this is a very similar map to our previous example with Counters, but that we are also setting a state of our Task that is in process.

using System;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.SampleDataMapReduceJob
{
    public class SampleMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                context.IncrementCounter("Line Processed");
                var segments = inputLine.Split("\t".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                context.IncrementCounter("country", segments[2], 1);
                context.EmitKeyValue(segments[2], inputLine);
                context.IncrementCounter("Text chars processed", inputLine.Length);
            }
            catch (IndexOutOfRangeException ex)
            {
                //we still allow other exceptions to throw and set an error state on the task, but this
                //exception type we are confident is due to the input not having at least 3 \t separated segments
                context.IncrementCounter("Logged recoverable error", "Input Format Error", 1);
                context.Log(string.Format("Input Format Error on line {0} in {1} - {2} was {3}", inputLine, context.InputFilename,
                    context.InputPartitionId, ex.ToString()));
                context.SetStatus("Running with recoverable errors");
            }
        }
    }

    public static class ContextBaseExtension
    {
        public static void SetStatus(this ContextBase context, string statusMessage)
        {
            Console.Error.WriteLine("reporter:status:{0}", statusMessage);
        }
    }
}

Happy big-data-cloudarama ;-)
Andy


Bright HDInsight Part 4: Blob Storage in Azure Storage Vault (examples in C#)

When using an HDInsight cluster, a key concern is sourcing the data on which you will run your Map/Reduce jobs. The load time to transfer the data for your Job into your Mapper or Reducer context must be as low as possible; the quality of a source can be ranked by this load wait – the data latency before a Map/Reduce job can commence.

Typically, you tend to store the data for your jobs within Hadoop’s Distributed FileSystem, HDFS. With HDFS, you are limited by the size of the storage attached to your HDInsight nodes. For instance, in the HadoopOnAzure.com preview, you are limited to 1.5TB of data. Alternatively, you can use Azure Storage Vault and access up to 150TB in an Azure Storage account.

Azure Storage Vault

The Azure Storage Vault is a storage location, backed by Windows Azure Blob Storage, that can be addressed and accessed by Hadoop’s native processes. This is achieved by specifying a protocol scheme for the URI of the assets you are trying to access, ASV://. This ASV:// scheme is analogous to other storage access schemes such as file://, hdfs:// and s3://.

With ASV, you configure access to a given account and key in the HDInsight portal Manage section.

Once in this section, one simply begins the configuration process by entering the Account Name and Key and clicking Save:

Using ASV

Once you have configured the Account Name and Key, you can use the new storage provided by ASV by addressing it with the ASV:// scheme. The format of these URIs is:

ASV://container/path/to/blob.ext

As you can see, the ASV:// is hardwired to the configured account name (and key) so that you don’t have to specify the account or authentication to access it. It is a shortcut to http(s)://myaccount.blob.core.windows.net. Furthermore, since it encapsulates security, you don’t need to worry about the access to these blobs.
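
To illustrate the shortcut (using a hypothetical account named myaccount as the configured account), the following two addresses refer to the same blob:

ASV://container/path/to/blob.ext
http://myaccount.blob.core.windows.net/container/path/to/blob.ext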

Benefits of ASV

There are many benefits to ASV, from exceptionally low storage costs (lower than S3 at the time of writing) to the ability to seamlessly provide geo-located redundancy on the files for added resilience. For me, as a user of the time-limited (at the time of writing, again!) HadoopOnAzure.com clusters, a really big benefit is that I don’t lose the files when the cluster is released, as I would do if they were stored on HDFS. Additional benefits to me include the ability to read and write to ASV and access those files immediately off the cluster in a variety of tools that I have got to know very well over the past few years, such as Cerebrata’s Cloud Storage Studio.

How to use ASV

It is exceptionally easy to configure your C# Map/Reduce tasks to use ASV, due to the way it has been designed. The approach is also equivalent and compatible with any other Streaming, Native or ancillary job creation technique.

To use ASV in a C# application, first configure ASV in the HDInsight portal as above, and then configure the HadoopJob to access that resource as the InputPath or OutputFolder locations.

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    try
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "asv://container/inputpath/";
        config.OutputFolder = "asv://container/outputpath" + DateTime.Now.ToString("yyyyMMddhhmmss");
        return config;
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
    return null;
}

As you can see, this configuration commands Hadoop to load from the configured ASV:// storage account container “container”, find any files in the /inputpath/ folder location and include them all as input files. You can also specify an individual file.

Similarly, the OutputFolder is specified as the location that the HDInsight job should write the output of the Map/Reduce job to.

As a nice additional benefit, using ASV adds counters for the number of bytes read and written by the Hadoop system through ASV, allowing you to track your usage of the data in your Storage Account.

All very simple stuff, but amazingly powerful.

Happy big-data-cloudification ;-)

Andy


Bright HDInsight Part 3: Using custom Counters in HDInsight

Telemetry is life! I repeat this mantra over and over; with any system, but especially with remote systems, the state of the system is difficult or impossible to ascertain without metrics on its internal processing. Computer systems operate outside the scope of human comprehension – they’re too quick, complex and transient for us to ever be able to confidently know their state in process. The best we can do is emit metrics and provide a way to view these metrics to judge a general sense of progress.

Metrics in HDInsight

The metrics I will present here relate to HDInsight and the Hadoop Streaming API, presented in C#. It is possible to access the same counters from other programmatic interfaces to HDInsight as they are a core Hadoop feature.

These metrics shouldn’t be used for data gathering as that is not their purpose. You should use them to track system state and not system result. However, this is a thin line ;-) For instance, if we know there are 100 million rows for data pertaining to “France” and 100 million rows for data pertaining to “UK”, and these are across multiple files and partitions, then we might want a metric which reports the progress across these two data aspects. In practice however, this type of scenario (progress through a job) is better measured without resorting to measuring data directly.

Typically we also want the ability to group similar metrics together in a category for easier reporting, and I shall show an example of that.

Scenario

The example data used here is randomly generated strings with a data identifier, action type and a country of origin. This slim 3 field file will be mapped to select the country of origin as the key and reduced to count everything by country of origin.

823708 rz=q UK
806439 rz=q UK
473709 sf=21 France
713282 wt.p=n UK
356149 sf=1 UK
595722 wt.p=n France
238589 sf=1 France
478163 sf=21 France
971029 rz=q France
……10000 rows…..

Mapper

This example shows how to add the counters to the Map class of the Map/Reduce job.

using System;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.SampleDataMapReduceJob
{
    public class SampleMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                context.IncrementCounter("Line Processed");
                var segments = inputLine.Split("\t".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                context.IncrementCounter("country", segments[2], 1);
                context.EmitKeyValue(segments[2], inputLine);
                context.IncrementCounter("Text chars processed", inputLine.Length);
            }
            catch (IndexOutOfRangeException ex)
            {
                //we still allow other exceptions to throw and set an error state on the task, but this
                //exception type we are confident is due to the input not having at least 3 \t separated segments
                context.IncrementCounter("Logged recoverable error", "Input Format Error", 1);
                context.Log(string.Format("Input Format Error on line {0} in {1} - {2} was {3}", inputLine, context.InputFilename,
                    context.InputPartitionId, ex.ToString()));
            }
        }
    }
}

Here we can see that we are using the parameter “context” to interact with the underlying Hadoop runtime. context.IncrementCounter is the key operation we are calling, using its underlying stderr output to write out in the format:
“reporter:counter:{0},{1},{2}”, category, counterName, increment

We are using this method’s overloads in different ways: to increment a counter simply by name, to increment a counter by name and category, and to increment with a custom increment value.
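
As a quick sketch of what this produces (taking the country counter from the sample above), a call such as the one below causes the SDK to write a reporter line to stderr, which Hadoop aggregates into its counter table:

context.IncrementCounter("country", "UK", 1);
// written to stderr as: reporter:counter:country,UK,1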

We are free to add these counters at any point of the map/reduce program so that we can gain telemetry on our job as it progresses.

Viewing the counters

In order to view the counters, visit the Hadoop Job Tracking portal. The Hadoop console output will contain the details for your Streaming Job job id, for example for me it was http://10.174.120.28:50030/jobdetails.jsp?jobid=job_201212091755_0001, reported in the output as:

12/12/09 18:01:37 INFO streaming.StreamJob: Tracking URL: http://10.174.120.28:50030/jobdetails.jsp?jobid=job_201212091755_0001

Since I used two counters that were not given a category, they appear in the “Custom” category:

Another of my counters was given a custom category, and thus it appears in a separate section of the counters table:

In my next posts, I will focus on error handling, status and more data centric operations.

Happy big-data-cloudifying ;-)
Andy

Bright HDInsight Part 2: Smart compressed output

Following on from Bright HDInsight Part 1, which dealt with consuming a compressed input stream in a Map/Reduce job, we’ll now see how to extend this scenario to emit compressed data from the output of your Map/Reduce job.

Scenario

Again, the scenario is very much one of reducing overhead on the management, security and storage of your data. If you are to leave your resulting work at rest in a remote system, you should reduce its footprint as much as possible.

Reiteration of tradeoff

Remember that you are shifting an IO-bound problem to a compute-bound one – compression requires inflation prior to utilisation of the data. You should run metrics on this to see if you’re truly saving what you think you might be.

Command Line

Again, this is achieved by using an argument on the command line:

mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Using the C# SDK for HDInsight

To do this, in your Configure method, simply add to the AdditionalGenericArguments collection as below:

config.AdditionalGenericArguments.Add("-D \"mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec\"");

config.AdditionalGenericArguments.Add("-D \"mapred.output.compress=true\"");

UPDATE: Alternatively, you can set the “CompressOutput” property to true on the config object, and the SDK will take care of this for you.
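
A minimal sketch of that alternative (assuming the InputPath and OutputFolder are set as in the other posts in this series):

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    HadoopJobConfiguration config = new HadoopJobConfiguration();
    // InputPath and OutputFolder set as in the earlier posts in this series
    config.CompressOutput = true; // per the UPDATE above, the SDK emits the compression arguments for you
    return config;
}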

Happy clouding,

Andy

Bright HDInsight Part 1: Consume Compressed Input

This is the first of a series of quick tips on HDInsight (Hadoop on Azure), and deals with how to consume a compressed input stream.

The scenario

The HDInsight product and Big Data solutions in general by definition deal with large amounts of data. Data at rest incurs a cost, whether in terms of management of that data, security of it or simply the storing of it. Where possible, it is good practice to reduce the weight of data in a lossless way, without losing or compromising the data quality.

The standard way to do this is by using compression on the static data, where common strings are deflated from the static file and indexed so that they can be later repositioned in an inflation stage.

The problem with this for HDInsight or Hadoop in general is that the stream becomes a binary stream which it cannot access directly.

Configuring a Hadoop Job to read a Compressed input file

In order to configure a Hadoop Job to read the Compressed input, you simply have to specify a flag on the job command line. That flag is:

-D "io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec"

This causes an additional map task to be undertaken which loads the input as a GzipStream before your map/reduce job begins. NOTE: This can be a time-consuming activity – if you’re planning on loading this file many times for parsing, your efficiency saving by doing this will be limited.

A 2GB gzipped example file of mine inflated to 11GB, taking 20 minutes.

Filename dependency

If you try this approach, you might find a problem where the input strings to your Map job still seem to be binary. This is a sign of the stream not being inflated by the infrastructure. There is one last thing you must ensure in order to trigger the system to begin the inflation process – the filename needs to have the relevant extension. As an example, to use Gzip, the filename must end with .gz, such as “mylogfile.log.gz”.

Using the HDInsight C# Streaming SDK to control Input Compression

In order to use the C# Streaming SDK with this flag, one simply modifies the Job’s Configure override to add an additional generic argument specifying this flag.

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    HadoopJobConfiguration config = new HadoopJobConfiguration();
    config.InputPath = "/input/data.log.gz";
    config.OutputFolder = "/output/output" + DateTime.Now.ToString("yyyyMMddhhmmss");
    // instruct Hadoop to inflate the Gzip input before the map tasks run
    config.AdditionalGenericArguments.Add("-D \"io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec\"");
    return config;
}
You will find an additional Task specified in your JobTracker, which takes the stream and inflates it for your runtime code.

A better way

If you can control your input stream (i.e. it is not provided by a third party), you should look at a better compression algorithm than the one shown here, Gzip. A better approach is to use LZO, a splittable algorithm that allows you to better distribute your work. The Gzip algorithm must be processed in a sequential way, which makes it very hard to distribute the workload. LZO (which is configured by using com.hadoop.compression.lzo.LzoCodec) is splittable, allowing better workload distribution.
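
By way of a sketch only – this assumes the LZO codec libraries are actually installed and configured on your cluster, which I haven’t verified on the current HDInsight preview – the flag mirrors the Gzip one used earlier:

config.AdditionalGenericArguments.Add("-D \"io.compression.codecs=com.hadoop.compression.lzo.LzoCodec\"");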

Happy Hadoopy Cloudy Times :-)
Andy

 


Unit Testing Hadoop on Azure (HDInsight) – Part 2: Logging Context

Hello again!

In this short post I’ll give a brief run-through of tying together the debugging options available in Hadoop, HDInsight and locally when using Microsoft’s HDInsight SDK.

Debugging and Logging

Telemetry is life – diagnostics are existence. Without telemetry on operational status, your applications, whether local or remote, could be in any state. It is not enough to assume that the lack of an error message is equivalent to the existence of a success message. To this end, Microsoft’s HDInsight SDK provides logging and debugging techniques so that your Hadoop jobs can be monitored.

The most straightforward debugging technique is to utilise StreamingUnit, a debugging framework that exists within the Microsoft.Hadoop.MapReduce SDK. The framework allows in-process debugging: step-through, breakpoints, immediates – the whole bag.

This debugging is an essential first step when building using this framework; but what do we do when we move beyond the local debugging environment and into the remote production (etc) hadoop environment?

Output Streams

The Microsoft.Hadoop.MapReduce SDK works as a Streaming Job in Hadoop. This means that Hadoop loads an executable via a shell invocation, and gathers the executable’s output from the Standard Output stream (stdout) and its logging and errors from the Standard Error stream (stderr).

This is equivalent to (in C#) writing your results with Console.WriteLine and your logs with Console.Error.WriteLine.

Given this point, it becomes very important that the consumer of the HDInsight SDK – the author of the MapReduce job – does not write spurious output to either of these streams. Indeed, a misplaced error handler such as try { int i = 10/0; } catch (Exception ex) { Console.WriteLine(ex.ToString()); } can result in erroneous, multiline output being written where only results should be emitted.

In fact, in the story of HDInsight and Hadoop Stream Jobs in general, writing to the output streams should not be done directly at all. So how can it be done?

MapperContext and ReducerCombinerContext

MapperContext and ReducerCombinerContext are two classes that are passed in to the MapperBase.Map and ReducerCombinerBase.Reduce methods respectively. The classes derive from ContextBase, and this base class provides the correct way to write a log from the map/reduce job.

If we take our previous example of HelloWorldJob, we can expand our Mapper to emit a log entry when the conditional statement detects that the inputLine does not start with the expected string.

We would change:

//example input: Hello, Andy
if (!inputLine.StartsWith("Hello, ")) return;

to

//example input: Hello, Andy
if (!inputLine.StartsWith("Hello, "))
{
    context.Log(string.Format("The inputLine {0} is not in the correct format", inputLine));
    return;
}

Using StreamingUnit, we can then modify our program to have invalid input (to cause a log), and then check that the log was correctly output by inspecting the StreamingUnit output:

using System;
using Elastacloud.Hadoop.StreamingUnitExample.Job.Map;
using Elastacloud.Hadoop.StreamingUnitExample.Job.Reduce;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    class Program
    {
        static void Main(string[] args)
        {
            var inputArray = new[]
            {
                "Hello, Andy",
                "Hello, andy",
                "Hello, why doesn't this work!",
                "Hello, Andy",
                "Hello, chickenface",
                "Hello, Andy",
                "Goodbye, Andy"
            };

            var output =
                StreamingUnit.Execute<HelloWorldMapper, HelloWorldReducer>(inputArray);

            Console.WriteLine("Map");
            foreach (var mapperResult in output.MapperResult)
            {
                Console.WriteLine(mapperResult);
            }

            Console.WriteLine("MapLog");
            foreach (var mapperLog in output.MapperLog)
            {
                Console.WriteLine(mapperLog);
            }

            Console.WriteLine("Reduce");
            foreach (var reducerResult in output.ReducerResult)
            {
                Console.WriteLine(reducerResult);
            }

            Console.ReadLine();
        }
    }
}

The output can then be seen here

Map
Andy 1
Andy 1
Andy 1
andy 1
chickenface 1
why doesn't this work! 1
MapLog
The inputLine Goodbye, Andy is not in the correct format
Reduce
Andy 3
andy 1
chickenface 1
why doesn't this work! 1

Counters

It is also possible to increment Hadoop counters by using the ContextBase-derived classes and calling IncrementCounter. This provides a way of interfacing with the Hadoop intrinsic “hadoop:counter:…..” format – these counters are then summed across the execution of the Hadoop job and reported on at job completion. This telemetry is exceptionally useful, and is the definitive state information that is so critical to understanding the operational status of a remote system.

In order to increment these counters, one simply calls the “IncrementCounter” method on the context. In our example, we can add a counter call to follow these formatting exceptions.

We will use a category of “RecoverableError”, a counter of “InputFormatIncorrect” and increment by 1.

using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample.Job.Map
{
    public class HelloWorldMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            //example input: Hello, Andy
            if (!inputLine.StartsWith("Hello, "))
            {
                context.Log(string.Format("The inputLine {0} is not in the correct format", inputLine));
                context.IncrementCounter("RecoverableError", "InputFormatIncorrect", 1);
                return;
            }

            var key = inputLine.Substring(7);
            if (key.EndsWith(".")) key = key.Trim('.');

            context.EmitKeyValue(key, "1"); //we are going to count instances, the value is irrelevant
        }
    }
}

Happy big data-y clouding!

Andy