When using an HDInsight cluster, a key concern is sourcing the data on which you will run your Map/Reduce jobs. The time taken to load a job's data into your Mapper or Reducer context must be as low as possible; the quality of a source can be ranked by this load wait, that is, the data latency before a Map/Reduce job can commence.
Typically, you store the data for your jobs in the Hadoop Distributed File System, HDFS. With HDFS, you are limited by the size of the storage attached to your HDInsight nodes. For instance, in the HadoopOnAzure.com preview, you are limited to 1.5 TB of data. Alternatively, you can use Azure Storage Vault and access up to 150 TB in an Azure Storage account.
Azure Storage Vault
The Azure Storage Vault is a storage location, backed by Windows Azure Blob Storage, that can be addressed and accessed by Hadoop’s native processes. This is achieved by specifying a protocol scheme, ASV://, in the URI of the assets you are trying to access. The ASV:// scheme is analogous to other storage access schemes such as file://, hdfs:// and s3://.
With ASV, you configure access to a given account and key in the Manage section of the HDInsight portal.
Once in this section, begin the configuration by entering the Account Name and Key and clicking Save.
Once you have configured the Account Name and Key, you can use the new storage provided by ASV by addressing it with the ASV:// scheme. These URIs take the form asv://container/path, for example asv://container/inputpath/.
As you can see, the ASV:// scheme is hardwired to the configured account name (and key), so you don’t have to specify the account or authentication details to access it. It is a shortcut to http(s)://myaccount.blob.core.windows.net. Furthermore, since it encapsulates security, you don’t need to worry about managing access to these blobs.
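To make the shortcut concrete, here is a minimal sketch of how a short-form ASV URI lines up with the full blob URL. The AsvUri class and its ToBlobUrl method are hypothetical names invented for illustration; they are not part of the HDInsight SDK, and the sketch assumes the short asv://container/path form described above and an account named myaccount.

```csharp
using System;

// Hypothetical helper, for illustration only: not part of any SDK.
public static class AsvUri
{
    // Expands asv://container/path into the blob URL for the given account.
    public static string ToBlobUrl(string asvUri, string accountName)
    {
        const string scheme = "asv://";
        if (asvUri == null || !asvUri.StartsWith(scheme, StringComparison.OrdinalIgnoreCase))
        {
            throw new ArgumentException("Expected a URI starting with asv://", "asvUri");
        }

        // Everything after the scheme is "container/path/to/blob".
        string containerAndPath = asvUri.Substring(scheme.Length);
        return "https://" + accountName + ".blob.core.windows.net/" + containerAndPath;
    }
}
```

For example, AsvUri.ToBlobUrl("asv://container/inputpath/", "myaccount") yields https://myaccount.blob.core.windows.net/container/inputpath/. In a real job you never perform this expansion yourself; the cluster resolves the account and key from the portal configuration.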
Benefits of ASV
There are many benefits of ASV, from exceptionally low storage costs (lower than S3 at the time of writing) to the ability to seamlessly provide geo-redundancy on the files for added resilience. For me, as a user of the time-limited (at the time of writing again!) HadoopOnAzure.com clusters, a really big benefit is that I don’t lose the files when the cluster is released, as I would if they were stored on HDFS. Another benefit is that I can read and write to ASV and access those files immediately, off the cluster, in a variety of tools that I have gotten to know very well over the past few years, such as Cerebrata’s Cloud Storage Studio.
How to use ASV
It is exceptionally easy to configure your C# Map/Reduce tasks to use ASV, due to the way it has been designed. The approach is also equivalent to, and compatible with, any other Streaming, Native or ancillary job creation technique.
To use ASV in a C# application, first configure ASV in the HDInsight portal as above, and then configure the HadoopJob to use that resource as the InputPath or OutputFolder location.
public override HadoopJobConfiguration Configure(ExecutorContext context)
{
    HadoopJobConfiguration config = new HadoopJobConfiguration();
    config.InputPath = "asv://container/inputpath/";
    // HH (24-hour clock) keeps morning and evening runs from sharing a folder name.
    config.OutputFolder = "asv://container/outputpath" + DateTime.Now.ToString("yyyyMMddHHmmss");
    return config;
}
As you can see, this configuration tells Hadoop to load from the container “container” in the configured ASV:// storage account, find any files in the /inputpath/ folder and include them all as input files. You can also specify an individual file.
Similarly, OutputFolder specifies the location to which the HDInsight job should write the output of the Map/Reduce job.
As a nice additional benefit, using ASV adds counters for the number of bytes read and written by the Hadoop system through ASV, allowing you to track your usage of the data in your Storage Account.
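The timestamp suffix on the output location gives each run its own folder, which matters because Hadoop jobs generally refuse to write into an output directory that already exists. A minimal sketch of that idea as a standalone helper follows; OutputFolders.Timestamped is a hypothetical name, and the sketch assumes a 24-hour HH format so that runs twelve hours apart cannot produce the same folder name.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only: not part of any SDK.
public static class OutputFolders
{
    // Appends a sortable, 24-hour timestamp to a base output URI.
    public static string Timestamped(string baseUri, DateTime now)
    {
        // InvariantCulture keeps the digits stable regardless of machine locale.
        return baseUri + now.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);
    }
}
```

In a job configuration this could be used as config.OutputFolder = OutputFolders.Timestamped("asv://container/outputpath", DateTime.Now). Passing the time in as a parameter, rather than reading the clock inside the helper, makes the naming easy to check in isolation.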
All very simple stuff, but amazingly powerful.