
Azure CDN Presentation from @maartenballiauw

I had a conversation recently with @maartenballiauw about the new Azure CDN and how much of an improvement it can make to serving web content; primarily an improvement in the last mile to your customers.

He’s put together an awesome slide deck that he’s given at TechEd, and I thought I’d share it again. On slide 14 there’s a quote from me :)

The quote “The Internet sucks and so does your server” is a simple way to frame the problem domain. Geographic distribution causes unavoidable speed-of-light latency, compounded by the multiple TCP round trips underlying an HTTP request; cookies embedded in each request bloat its weight; browsers limit the number of concurrent requests. For reasons like these, the Internet is weak over distance. A CDN addresses this by moving some data (often static or semi-static) close to the requestor. And always remember: even if the Internet didn’t suck (a closer datacenter, no cookies, future browsers), sending traffic to a web server increases the load on expensive, application-capable servers. You can use a CDN to lower the cost of your app by shifting that load from important, functional servers onto content delivery servers.

Happy clouding
Andy 

Bright HDInsight 8: Use Hive Variables with PowerShell

This post is part best practice, part workaround.

When submitting Hive jobs to an HDInsight cluster with PowerShell, one might build a common script, control and secure it in a version control system, and then use variable substitution to provide the values required. This post explores how to supply those variables.

Firstly, one can submit a Hive job with Invoke-AzureHDInsightHiveJob, Invoke-Hive or simply hive; the last two are aliases for the first cmdlet. I will be using “hive” throughout. When I write “hive” I am executing remotely from PowerShell, via Templeton, against a remote HDInsight cluster. POWER!
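Before any of this works, the session needs pointing at a cluster. A minimal sketch, assuming the Azure subscription is already configured and “mycluster” stands in for your cluster name:

# Select the cluster that subsequent hive/Invoke-Hive calls will target
Use-AzureHDInsightCluster -Name "mycluster"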

Hive allows you to supply variables straight off the command line, but when using PowerShell we’ll have to wrap them up. We want to get as close as possible to the following style of argument passing.

hive -hiveconf state='California' -f /hive.hql

In this example, we pass a hive config value of “California” for the key “state”. We can imagine a query like the one below using this setting:

select count(*) from hivesampletable where state='${hiveconf:state}'

As you can see, to load the value out of the -hiveconf flag, we use ${hiveconf:keyname} as a convention.

The Hive cmdlet is described by Get-Help hive as:

NAME
    Invoke-AzureHDInsightHiveJob

SYNTAX
    Invoke-AzureHDInsightHiveJob [[-Query] <string>] [-Arguments <string[]>] [-Defines <hashtable>] [-File <string>] [-Files <string[]>] [-JobName <string>] [-StatusFolder <string>] [<CommonParameters>]

ALIASES
    Invoke-Hive
    hive

REMARKS
    None

It’s clear that only “-Arguments” or “-Defines” can really affect the command line; -File and -Files supply scripts, while -JobName and -StatusFolder provide metadata about job execution.

Without clear guidance in the PowerShell help, another Hive command comes to the rescue. The “set” command lets us see all the arguments defined for the executing script. Using it in conjunction with both the -Defines and -Arguments flags lets us see how each affects the runtime environment:
PS Z:\src\powershell> hive "set" -Arguments @("argument=123") -Defines @{ "defines"="abc" }
Submitting Hive query..
Started Hive query with jobDetails Id : job_201405061100_0084
Hive query completed Successfully
*snip*
defines=abc
*snip*
env:_arguments=C:\apps\dist\hive-0.11.0.1.3.7.1-01293\lib\hive-cli-0.11.0.1.3.7.1-01293.jar org.apache.hadoop.hive.cli.CliDriver -hiveconf hive.querylog.location C:\apps\dist\hive-0.11.0.1.3.7.1-01293\logs\history -hiveconf hive.log.dir C:\apps\dist\hive-0.11.0.1.3.7.1-01293\logs --hiveconf "mapreduce.job.credentials.binary=c:/apps/temp/hdfs/mapred/local/taskTracker/ecommerce/jobcache/job_201405061100_0084/jobToken" --hiveconf "hive.metastore.local=false" --hiveconf "hive.metastore.uris=thrift://headnodehost:9083" --hiveconf "hive.metastore.warehouse.dir=/apps/hive/warehouse" --hiveconf "defines=abc" --hiveconf "hdInsightJobName=Hive: 451d5737109c47e79bfd" "argument=123" -f 451d5737109c47e79bfda2f5bcb9da23.hql
*snip*
system:defines=abc

The important things to notice here are that the -Arguments value is appended after the job name without a --hiveconf prefix (attempting to add one causes an error), whereas -Defines creates a system-level variable which can then be accessed. One caveat: there are many reserved system variables, and you must be careful not to clash with them.
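One way to guard against a clash, sketched below, is to dump the runtime settings first and search for the key you plan to define; assuming the hive alias is connected to a cluster as above, any hits suggest a name to avoid.

# Dump the runtime settings and search for the candidate key name
$settings = hive "set"
$settings | Select-String "state="   # no output means "state" is free to use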

So if we run either of these two queries, we can achieve what we want:

hive "select count(*) from hivesampletable where state='`${system:state}'" -Defines @{"state"="California"}
hive "select count(*) from hivesampletable where state='`${hiveconf:state}'" -Defines @{"state"="California"}
Submitting Hive query..
Started Hive query with jobDetails Id : job_201405061100_0085
Hive query completed Successfully
6881
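As an aside on those backticks: in a double-quoted PowerShell string, the ` escapes the $ so that ${hiveconf:state} reaches Hive literally instead of being expanded by PowerShell. A single-quoted string avoids the escape at the cost of doubling the embedded single quotes; a sketch of the equivalent call:

hive 'select count(*) from hivesampletable where state=''${hiveconf:state}''' -Defines @{"state"="California"}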

Beyond this, we can also abstract that Hive statement into a file, upload it to the root of our HDInsight container, and then submit a much more terse statement, reusing our logic externalised into a file.
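For completeness, a sketch of getting the file into place: hive.hql contains the same select statement as above, and $storageAccount, $storageKey and $containerName are assumed to already describe the cluster’s default storage container.

# hive.hql contains: select count(*) from hivesampletable where state='${hiveconf:state}'
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
Set-AzureStorageBlobContent -File .\hive.hql -Container $containerName -Blob "hive.hql" -Context $context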

PS Z:\src\powershell> hive -File /hive.hql -Defines @{"state"="California"}
Submitting Hive query..
Started Hive query with jobDetails Id : job_201405061100_0087
Hive query completed Successfully
6881

The statement hive -File /hive.hql -Defines @{"state"="California"} is quite close to the original use case, hive -f /hive.hql -hiveconf "state=California"!

Bright HDInsight 7: Customise Your Cluster

HDInsight, as a clustered, cloud-backed resource, offloads the management overhead of cluster maintenance, tuning, patching and other sysadmin-type operations. The flipside is that for the management overhead you shed, you pay the price of a lower level of control. You cannot elevate your user context to administrative privileges. You cannot expect tweaks applied after a cluster has been created to be permanent: certain operations inside Azure cause “reimaging” events, where cluster members are returned to their initial state. This is in line with all clouds offering PaaS compute. So that this loss of control does not hamper users of HDInsight, Microsoft has supplied a comprehensive cluster customization capability at provision time.

Using this capability, it is possible to affect core Hadoop configuration settings in:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml

These should be used in strict preference to making manual changes after connecting over RDP.

The code for achieving this is as straightforward as the previous examples: simply create objects representing the configuration settings required at provision time, and submit them with the provisioning request. Again using PowerShell as our exemplar language, we can achieve this with a handful of simple commands.

# core-site.xml settings
$coreConfig = @{
    "io.compression.codec" = "org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec";
    "io.sort.mb" = "2048"
}

# mapred-site.xml settings
$mapredConfig = New-Object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightMapReduceConfiguration'
$mapredConfig.Configuration = @{
    "mapred.tasktracker.map.tasks.maximum" = "4"
}

# Build the cluster configuration, attach the custom values and provision
$clusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 64
$clusterConfig = $clusterConfig | Add-AzureHDInsightConfigValues -Core $coreConfig -MapReduce $mapredConfig
$clusterConfig | New-AzureHDInsightCluster -Credential $clusterCreds -Location $location -Name $clusterName -Verbose -ErrorAction Stop
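
Once the cluster is up, a quick sanity check (a sketch, reusing the hive alias from the previous post) is to echo one of the customised values back:

# Point the session at the new cluster and read a value back
Use-AzureHDInsightCluster -Name $clusterName
hive "set io.sort.mb"   # expect io.sort.mb=2048 if the customisation took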

Happy Hadooping!