HDInsight: Workaround for the “Could not find or load main class” error

Sometimes when running the C# SDK for HDInsight, you can come across the following error:

The system cannot find the batch label specified – jar
Error: Could not find or load main class c:\apps\dist\hadoop-1.1.0-SNAPSHOT\lib\hadoop-streaming.jar

To get around this, close the command shell you are currently in, open a new Hadoop shell, and try your command again. It should work immediately.

This tends to occur after killing a Hadoop job, so I am assuming that something about killing a job changes the context of the command shell in such a way that it can no longer find the Hadoop jar files. I’ve yet to get to the bottom of it, so if anyone has any bright ideas, let me know in the comments ;-)

Good Hadoopification,

Andy

HDInsight: Workaround for a bug when killing jobs

When running a Streaming Job from the console in HDInsight, you are given a message which describes how to kill the job:

13/01/09 14:52:07 INFO streaming.StreamJob: To kill this job, run:
13/01/09 14:52:07 INFO streaming.StreamJob: C:\apps\dist\hadoop-1.1.0-SNAPSHOT/bin/hadoop job -Dmapred.job.tracker=10.186.136.26:9010 -kill job_201301081702_0001

Unfortunately there is an error in this command, and it will not work:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop job -Dmapred.job.tracker=10.186.136.26:9010 -kill job_201301081702_0014
Usage: JobClient <command> <args>
 [-submit <job-file>]
 [-status <job-id>]
.....

This is because of an error in the command as written out by the Hadoop streaming console: there should be a space between the -D and mapred.job.tracker=ipAddressJobTracker, and the mapred.job.tracker parameter should be quoted:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop job -D "mapred.job.tracker=10.186.136.26:9010" -kill job_201301081702_0001
Killed job job_201301081702_0001

Et voilà.
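
As an aside, if you find yourself killing jobs regularly, you may prefer to script the corrected command rather than retype it. Here is a minimal sketch of shelling out to hadoop.cmd from C#; the hadoop.cmd path, job tracker address and job id are the illustrative values from the transcript above, not anything the SDK supplies:

using System.Diagnostics;

class KillJob
{
    static void Main()
    {
        var psi = new ProcessStartInfo
        {
            // Illustrative values from the transcript above; substitute
            // your own cluster path, job tracker address and job id.
            FileName = @"C:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd",
            // Note the space after -D and the quoted parameter, per the fix above.
            Arguments = "job -D \"mapred.job.tracker=10.186.136.26:9010\" -kill job_201301081702_0001",
            UseShellExecute = false
        };

        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    }
}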

Happy big-dataification ;-)
Andy

Bright HDInsight Part 6: Take advantage of your console context

The Microsoft.Hadoop.MapReduce SDK for HDInsight (Hadoop on Azure) requires you to package your MapReduce job in a Windows console application, which the Hadoop Streaming API then uses to run your logic. This gives a very natural way to configure and orient your application at startup, of which this post gives a simple example.

Console Arguments

The natural requirement to explore is the supply of runtime arguments to the MapReduce console application (hereafter, the package), which can be used for configuration purposes. We may use them to manually specify additional generic parameters or input and output locations, or to control the type of job being run. Consider the input location: we may want to write a single MapReduce job, but then be able to run it against different datasets, differentiated by their storage location.

For anyone familiar with a command shell, a very natural way to do this is to run:

mypackage.exe /path/to/input.txt

This would specify a path to an input location on which the application could work. To achieve this with the SDK, once we have parsed and verified the input, we include the arguments we are interested in as the parameter to the HadoopJobExecutor.ExecuteJob(string[] args) method.
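
As a rough sketch, assuming a HadoopJob subclass named MyJob and a generic ExecuteJob<TJob>(string[] args) overload (check the overloads in your SDK version), the package’s entry point might look like this:

using System;
using Microsoft.Hadoop.MapReduce;

public class Program
{
    public static void Main(string[] args)
    {
        // Validate the arguments before handing them to the SDK;
        // here we require a single input path.
        if (args.Length < 1)
        {
            Console.Error.WriteLine("Usage: mypackage.exe <inputPath>");
            Environment.Exit(1);
        }

        // MyJob is our HadoopJob subclass; its Configure method will
        // receive this array via ExecutorContext.Arguments.
        HadoopJobExecutor.ExecuteJob<MyJob>(args);
    }
}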

Once we have done that, in the Configure method of our job class (a subclass of HadoopJob), we are provided with an ExecutorContext instance. This ExecutorContext has a single property, “Arguments”.

This property is the string[] parameter that we passed into the ExecuteJob method. We can then use it to configure the job during setup:

public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    var hadoopConfiguration = new HadoopJobConfiguration();

    // The first console argument supplies the input path.
    hadoopConfiguration.InputPath = context.Arguments[0];

    // Write output to a unique, timestamped folder in blob storage.
    hadoopConfiguration.OutputFolder = "asv://output/" + DateTime.Now.Ticks.ToString(CultureInfo.InvariantCulture);

    return hadoopConfiguration;
}
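
One caveat on the design: indexing context.Arguments[0] assumes an argument was actually supplied, so validating args before calling ExecuteJob (as in the sketch above) keeps an IndexOutOfRangeException out of job setup.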

Conclusion

Use your console context well. Further examples of what you could do: capture the output of the package host to a text file with the > or >> redirection operators; pipe the results to another program using |; chain execution together with a batch file; schedule execution with the Windows Task Scheduler.

Happy cloudy-big-dataification (and happy new year),
Andy