BreakpointHit

Unit Testing in Hadoop on Azure (HDInsight) – Part 1: StreamingUnit

Hey all,

Over the past months I’ve been working hard on Hadoop on Azure – now known as HDInsight – with our customers. Bringing existing practices to this new world of Big Data is important to me, especially ones I’d never abandon in the Big Compute and standard software engineering worlds. One such practice is unit testing and continuous integration. I’m going to start a series of posts on the tools available to aid unit testing in Hadoop, and how you can perfect your map/reduce jobs in isolation from your Hadoop cluster and guarantee their integrity during development.

This tutorial covers the Microsoft HDInsight .NET SDK, and the examples are written in C#.

HDInsight testing

The basic premise of testing an HDInsight job is that, given a known set of inputs, we are presented with a mapped set whose integrity we can verify and a reduced result we expect. Ideally we want to be able to do this in-process during development, so that we can debug as easily as possible and not have to rely on the infrastructure of Hadoop – the configuration and availability of which is not a develop-time concern. Furthermore, we want to be able to step through code and use the excellent tooling built into Visual Studio to inspect runtime conditions. Finally, we want to control the inputs to the system in a very robust way, so that the tests are guaranteed consistent input from which to assert consistent results; we want to be able to submit a job to a framework with known literal input.

We could use NUnit or MSTest to provide this testing framework, but we would be testing the Mapper and Reducer classes in isolation from one another. This has merit in its own right, but there is a complication: a MapperBase or ReducerCombinerBase method does not return a result value; instead it writes to a “context”. In reality this context wraps the stdout and stderr console streams, and to test it we would need to write code that interacts with those streams. We could abstract our logic into methods that return values, which the Map and Reduce classes simply marshal to the context – indeed this is a valid abstraction – but the further our abstraction moves from the runtime operation of the software, the greater the need becomes for an integration test: a test that mimics a real runtime environment.
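To illustrate that abstraction, here is a minimal sketch of the parsing logic hoisted into a pure, directly assertable method. HelloWorldParser and ExtractName are hypothetical names for this post, not part of the SDK:

```csharp
using System;

namespace Elastacloud.Hadoop.StreamingUnitExample.Tests
{
    // Hypothetical helper: the Map body's parsing logic lifted into a pure method,
    // so it can be asserted on directly without a context or console streams.
    public static class HelloWorldParser
    {
        public static string ExtractName(string inputLine)
        {
            if (!inputLine.StartsWith("Hello, ")) return null;
            return inputLine.Substring(7).TrimEnd('.');
        }
    }

    public static class ParserChecks
    {
        public static void Main()
        {
            // A plain NUnit or MSTest test would assert on return values in the same way
            Console.WriteLine(HelloWorldParser.ExtractName("Hello, Andy."));    // Andy
            Console.WriteLine(HelloWorldParser.ExtractName("Goodbye") == null); // True
        }
    }
}
```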

In the Microsoft.Hadoop.MapReduce framework (NuGet: http://nuget.org/packages/Microsoft.Hadoop.MapReduce/0.0.0.4), Microsoft provides StreamingUnit – a lightweight framework for testing a Hadoop job written in .NET, in-process, on a development machine. StreamingUnit allows a developer to write MapReduce jobs and test their execution from within Visual Studio.

Example Use of StreamingUnit

Firstly, we will start out with a blank Console Application. This is the project type used for creating Hadoop Jobs with the Microsoft HDInsight SDK.

Once we have this vanilla Console Application, we can add in the required assemblies through NuGet. Use the Package Manager window or the Manage NuGet Packages project context menu to add Microsoft.Hadoop.MapReduce (the version I am using is 0.0.0.4).
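If you prefer the Package Manager Console, the equivalent command is:

```
Install-Package Microsoft.Hadoop.MapReduce -Version 0.0.0.4
```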

Once you have added the package, you can go right ahead and create a Mapper class, a Reducer class and a Job class. You might already have these if you have already begun development of your Hadoop Job, but I will produce some and include them as an attachment at the end of this post.

Once you have done this, your solution will contain the Mapper, Reducer and Job classes alongside Program.cs.

In our example, we take the output from all the different example applications I’ve ever written and run a Hadoop query over them to count all the outputs. It’s very common for example applications to follow the HelloWorld pattern: after you write the simplest “print(‘Hello world’)”, the next example is almost always a method with a signature like “helloworld(string data)”, which outputs “Hello, ‘data’”. So my data sample will be “Hello, Andy”, “Hello, andy”, “Hello, why doesn’t this stupid thing work”, etc. The output from the job will be a count of the different strings.

Let’s implement that Map and Reduce logic.

Our Map extracts the Name of who said hello:

using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample.Job.Map
{
    public class HelloWorldMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // example input: Hello, Andy
            if (!inputLine.StartsWith("Hello, ")) return;

            var key = inputLine.Substring(7);
            if (key.EndsWith(".")) key = key.TrimEnd('.'); // TrimEnd, not Trim: only strip the trailing full stop

            context.EmitKeyValue(key, "1"); // we are going to count instances; the value is irrelevant
        }
    }
}

Our reducer will simply count the inputs to it.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample.Job.Reduce
{
    public class HelloWorldReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            context.EmitKeyValue(key, values.Count().ToString()); // count instances of this key
        }
    }
}

Additionally, we’ll build a Job class that links the two together:

using Elastacloud.Hadoop.StreamingUnitExample.Job.Map;
using Elastacloud.Hadoop.StreamingUnitExample.Job.Reduce;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample.Job
{
    public class HelloWorldJob : HadoopJob<HelloWorldMapper, HelloWorldReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            return new HadoopJobConfiguration(); // here you would normally set up some input ;-)
        }
    }
}

Normally we would expect a little more work to be undertaken in the Job class – it should define input and output locations – however for our demo this is not required.
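For reference, a fuller Configure might look something like the sketch below. I believe InputPath and OutputFolder are the configuration properties the SDK exposes for this; the paths themselves are purely illustrative.

```csharp
public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    var config = new HadoopJobConfiguration();
    // Illustrative paths - on a real cluster these would point at your data
    config.InputPath = "input/helloworld";
    config.OutputFolder = "output/helloworld";
    return config;
}
```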

Now we will use the Program class to define some simple input and execute the job with StreamingUnit.

using System;
using Elastacloud.Hadoop.StreamingUnitExample.Job.Map;
using Elastacloud.Hadoop.StreamingUnitExample.Job.Reduce;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    class Program
    {
        static void Main(string[] args)
        {
            var inputArray = new[]
            {
                "Hello, Andy",
                "Hello, andy",
                "Hello, why doesn't this work!",
                "Hello, Andy",
                "Hello, chickenface",
                "Hello, Andy"
            };

            var output =
                StreamingUnit.Execute<HelloWorldMapper, HelloWorldReducer>(inputArray);

            Console.WriteLine("Map");
            foreach (var mapperResult in output.MapperResult)
            {
                Console.WriteLine(mapperResult);
            }

            Console.WriteLine("Reduce");
            foreach (var reducerResult in output.ReducerResult)
            {
                Console.WriteLine(reducerResult);
            }

            Console.ReadLine();
        }
    }
}

The output shows first the values produced (sent to the context) by the Mapper, then those produced by the Reducer. More importantly, the job can be stepped through by adding breakpoints and running the executable.

The Program continues to list all Map output and then Reduce output by writing to the console.

Map
Andy 1
Andy 1
Andy 1
andy 1
chickenface 1
why doesn't this work! 1
Reduce
Andy 3
andy 1
chickenface 1
why doesn't this work! 1
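As a sanity check on that expected output, note that the whole job is conceptually a group-and-count; the same result can be computed in-process with plain LINQ. This is a conceptual equivalent for checking expectations, not how StreamingUnit works internally:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountCheck
{
    // Conceptual equivalent of the whole job: map each line to a key, group, count
    public static Dictionary<string, int> CountGreetings(IEnumerable<string> lines)
    {
        return lines
            .Where(line => line.StartsWith("Hello, "))
            .Select(line => line.Substring(7).TrimEnd('.'))
            .GroupBy(name => name)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var input = new[]
        {
            "Hello, Andy", "Hello, andy", "Hello, why doesn't this work!",
            "Hello, Andy", "Hello, chickenface", "Hello, Andy"
        };

        foreach (var pair in CountGreetings(input).OrderBy(p => p.Key, StringComparer.Ordinal))
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}
```

Running this prints the same four key/count pairs as the Reduce section above.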

Next time, my blog focuses on more advanced approaches to unit testing, before a final post on Mocking and Unit Testing.

Here’s the source: Elastacloud.Hadoop.StreamingUnitExample (note, packages had to be deleted)

Happy cloudy big data-ing ;-)

Andy
