Over the past months I’ve been working hard with our customers on Hadoop on Azure – now known as HDInsight. Bringing existing practices to this new world of Big Data is important to me, especially ones I’d never abandon in the Big Compute and standard software engineering worlds. One such practice is unit testing and continuous integration. I’m going to start a series of posts on the tools available to aid unit testing in Hadoop, showing how you can perfect your map/reduce jobs in isolation from your Hadoop cluster and guarantee their integrity during development.
This tutorial covers the Microsoft HDInsight .NET SDK, and the examples are written in C#.
The basic premise of testing an HDInsight job is that, given a known set of input, we’ll be presented with a mapped set that is integral and a reduced result that is expected. Ideally we want to do this in process during development, so that we can debug as easily as possible and not rely on the infrastructure of Hadoop – the configuration and availability of which is not a development-time concern. Furthermore, we want to be able to step through code and use the excellent tooling built into Visual Studio to inspect runtime conditions. Finally, we want to control the inputs to the system in a very robust way, so that the test is guaranteed a consistent input from which to assert consistent results; we want to be able to submit a job to a framework with known, literal input.
We could use NUnit or MSTest to provide this testing framework, but we would be testing the Mapper and Reducer classes in isolation from each other. This has merit in its own right, but there is a complication: a MapperBase or ReducerCombinerBase method does not return a result value; instead it writes to a “context”. In reality this context is the StdOut and StdErr console streams, and to test it we would need to write code that interacts with those streams. We could abstract our logic into methods that return values, which the Map and Reduce classes simply marshal to the context – indeed this is a valid abstraction – but the further our abstraction moves from the runtime operation of the software, the greater the need becomes for an integration test: a test that mimics a real runtime environment.
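As a sketch of that abstraction (the class and method names here are hypothetical, not part of the SDK), the testable logic can be pulled into a pure method with a return value, which the Map override merely marshals to the context:

```csharp
using Microsoft.Hadoop.MapReduce;

public class AbstractedMapper : MapperBase
{
    // Pure logic with a return value: trivially testable with NUnit or MSTest,
    // with no Hadoop context or console streams involved.
    public static string ExtractKey(string inputLine)
    {
        return inputLine.Split(',')[0].Trim();
    }

    public override void Map(string inputLine, MapperContext context)
    {
        // The override only marshals the result to the context; that marshalling
        // itself is exercised only by an integration-style test.
        context.EmitKeyValue(ExtractKey(inputLine), "1");
    }
}
```

The pure method can be asserted against directly, but notice that the context interaction is left uncovered – which is exactly the gap StreamingUnit fills.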
In the Microsoft.Hadoop.MapReduce framework (NuGet: http://nuget.org/packages/Microsoft.Hadoop.MapReduce/0.0.0.4), Microsoft provides StreamingUnit – a lightweight framework for testing a Hadoop Job written in .NET, in process, on a development machine. StreamingUnit allows a developer to write Map/Reduce jobs and test their execution in Visual Studio.
Example Use of StreamingUnit
Firstly, we will start out with a blank Console Application. This is the project type used for creating Hadoop Jobs with the Microsoft HDInsight SDK.
Once we have this vanilla Console Application, we can add the required assemblies through NuGet. Use the Package Manager Console or the Manage NuGet Packages project context menu to add Microsoft.Hadoop.MapReduce (the version I am using is 0.0.0.4).
Once you have added the package, you can go right ahead and create a Mapper class, a Reducer class and a Job class. You might already have these if you have already begun development of your Hadoop Job, but I will produce some and include them as an attachment at the end of this post.
Once you have done this, your solution will look similar to this:
In our example, we take the output from all the different example applications I’ve ever written and run a Hadoop query over them to count those outputs. It’s very common for example applications to follow the HelloWorld pattern: after the simplest “print(‘Hello world’)”, the next example is almost always a method with a signature “helloworld(string data)” that outputs “Hello, ‘data’”. So my data sample will be “Hello, Andy”, “Hello, andy”, “Hello, why doesn’t this stupid thing work” and so on. The output from the job will be a count of the different strings.
Let’s implement that Map and Reduce logic.
Our Map extracts the Name of who said hello:
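A sketch of the Mapper, built on the SDK’s MapperBase (the class name and the “Hello, ” parsing are my assumptions for this sample data):

```csharp
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    /// <summary>
    /// Extracts the name from a "Hello, {name}" input line and emits it as a key.
    /// </summary>
    public class HelloWorldMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // Everything after "Hello, " is treated as the name.
            const string prefix = "Hello, ";
            if (inputLine.StartsWith(prefix))
            {
                var name = inputLine.Substring(prefix.Length).Trim();

                // Emit the name as the key, with a count of 1 as the value.
                context.EmitKeyValue(name, "1");
            }
        }
    }
}
```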
Our reducer will simply count the inputs to it.
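A sketch of the Reducer, built on the SDK’s ReducerCombinerBase (again, the class name is my own):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    /// <summary>
    /// Counts the number of values emitted for each key.
    /// </summary>
    public class HelloWorldReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            // Each mapped value represents one occurrence; emit the total per key.
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }
}
```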
Additionally, we’ll build a Job class that links the two together:
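The Job class ties the two together through the SDK’s generic HadoopJob base class, along these lines:

```csharp
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    /// <summary>
    /// Links the Mapper and Reducer together as a single Hadoop Job.
    /// </summary>
    public class HelloWorldJob : HadoopJob<HelloWorldMapper, HelloWorldReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            // Input and output locations are omitted; StreamingUnit
            // supplies the input directly in process.
            return new HadoopJobConfiguration();
        }
    }
}
```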
Normally we might expect a little more work to be undertaken in the Job class – it should define input and output locations, for example – but for our demo this is not required.
Now we will use the Program class to define some simple input and execute the job with StreamingUnit.
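A sketch of the Program, assuming the SDK’s StreamingUnit.Execute<TMapper, TReducer> returns an object exposing the Mapper and Reducer results as string collections:

```csharp
using System;
using Microsoft.Hadoop.MapReduce;

namespace Elastacloud.Hadoop.StreamingUnitExample
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Known, literal input: a consistent set from which
            // to assert consistent results.
            var input = new[]
            {
                "Hello, Andy",
                "Hello, andy",
                "Hello, why doesn't this stupid thing work",
                "Hello, Andy"
            };

            // Execute the whole job in process; breakpoints set in the
            // Mapper and Reducer are hit as normal.
            var output = StreamingUnit.Execute<HelloWorldMapper, HelloWorldReducer>(input);

            // List all Map output, then all Reduce output, on the console.
            Console.WriteLine("Mapper output:");
            foreach (var line in output.MapperResult)
            {
                Console.WriteLine(line);
            }

            Console.WriteLine("Reducer output:");
            foreach (var line in output.ReducerResult)
            {
                Console.WriteLine(line);
            }

            Console.ReadLine();
        }
    }
}
```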
The output shows first the values produced (sent to the context) by the Mapper and then those produced by the Reducer. More importantly, the job can be debugged by adding breakpoints and then running the executable.
The Program continues to list all Map output and then Reduce output by writing to the console.
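Illustratively, for the sample input above the console output takes a shape like this (the exact key/value separator and ordering depend on the SDK version):

```
Mapper output:
Andy	1
andy	1
why doesn't this stupid thing work	1
Andy	1
Reducer output:
Andy	2
andy	1
why doesn't this stupid thing work	1
```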
Next time, my blog focuses on more advanced approaches to unit testing, before a final post on Mocking and Unit Testing.
Here’s the source: Elastacloud.Hadoop.StreamingUnitExample (note, packages had to be deleted)
Happy cloudy big data-ing