Bright HDInsight 8: Use Hive Variables with Powershell

This post is part best practice, part workaround.

When submitting Hive jobs to a HDInsight cluster with Powershell, one might consider building a common script, controlling or securing this in a version control system and then using variable substitution to provide the values required. This post explores how to supply these variables.

Firstly, one can either submit a hive job with Invoke-AzureHDInsightHiveJob, Invoke-Hive or simply hive. These last two are simply alias for the first commandlet. I will be using “hive” throughout. When I write “hive” I am remotely executing from powershell, via templeton, to a remote HDInsight cluster. POWER!

Hive allows you to supply variables straight off the command line, but when using Powershell we’ll have to wrap them up. We’ll want to get to something as close as possible to the below type of argument passing.

hive -hiveconf state='California' -f /hive.hql

In this example, we pass a hive config value of “California” for the key “state”. We can imagine a query like the below using this setting:

select count(*) from hivesampletable where state='${hiveconf:state}'

As you can see, to load the value out of the -hiveconf flag, we use ${hiveconf:keyname} as a convention.

Hive’s powershell is described with Get-Help hive as:

 NAME Invoke-AzureHDInsightHiveJob SYNTAX Invoke-AzureHDInsightHiveJob [[-Query] ] [-Arguments ] [-Defines ] [-File ] [-Files ] [-JobName ] [-StatusFolder ] [] ALIASES Invoke-Hive hive REMARKS None 

It’s clear that we can only really use “-Arguments” or “-Defines” in order to really affect the command line; -File and -Files supply scripts, -JobName and -StatusFolder provide metadata regarding job execution.

Without clear guidance in the powershell help, it took another Hive command to come to the rescue. The “set” command lets us see all the arguments defined for the executing script. Using this in conjunction with both the -Defines and -Arguments flags allow us to see how this affects the runtime environment:
PS Z:\src\powershell> hive "set" -Arguments @("argument=123") -Defines @{ "defines"="abc" } Submitting Hive query.. Started Hive query with jobDetails Id : job_201405061100_0084 Hive query completed Successfully *snip* defines=abc *snip* env:_arguments=C:\apps\dist\hive-0.11.0.1.3.7.1-01293\lib\hive-cli-0.11.0.1.3.7.1-01293.jar org.apache.hadoop.hive.cli.CliDriver -hiveconf hive.querylog.location C:\apps\dist\hive-0.11.0.1.3.7.1-01293\logs\history -hiveconf hive.log.dir C:\apps\dist\hive-0.11.0.1.3.7.1-01293\logs --hiveconf "mapred uce.job.credentials.binary=c:/apps/temp/hdfs/mapred/local/taskTracker/ecommerce/jobcache/job_201405061100_0084/jobToken" --hiveconf "hive.metastore.local=false" --hiveconf "hive.metastore.uris=thrift://headnodehost:9083" --hiveconf "hive.metastore.warehouse.dir=/apps/hive/warehouse" --hiveconf "def ines=abc" --hiveconf "hdInsightJobName=Hive: 451d5737109c47e79bfd" "argument=123" -f 451d5737109c47e79bfda2f5bcb9da23.hql *snip* system:defines=abc

The important parts to notices here are that the -Arguments flag is appended after the Job name and without a –hiveconf prefix (attempting to add this causes an error). However, the -Defines creates a system level variable which can be accessed. There is a caveat here that there are many reserved system variables, you must be careful not to clash with those. 

So if we run either of these two queries, we can achieve what we want:

hive "select count(*) from hivesampletable where state='`${system:state}'" -Defines @{"state"="California"}
hive "select count(*) from hivesampletable where state='`${hiveconf:state}'" -Defines @{"state"="California"}
Submitting Hive query.. Started Hive query with jobDetails Id : job_201405061100_0085 Hive query completed Successfully 6881 

Beyond this we can also abstract that hive statement into a file, upload this to the root of our HDInsight container and then submit a much more terse statement, and reuse out logic, externalised into a file.

PS Z:\src\powershell> hive -File /hive.hql -Defines @{"state"="California"} Submitting Hive query.. Started Hive query with jobDetails Id : job_201405061100_0087 Hive query completed Successfully 6881

The statement:  hive -File /hive.hql -Defines @{“state”=”California”} is quite close to the original use case, hive -f /hive.hql -hiveconf “state=California”! 

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>