Running Apache Hive on hServer

ScaleOut hServer can be used as an execution engine for Apache Hive. Hive translates each query into a series of MapReduce jobs, and those jobs can be configured to run on ScaleOut hServer. Running Hive queries through ScaleOut hServer provides significant performance improvements by eliminating intermediate disk I/O for MapReduce and by reusing JVMs. ScaleOut hServer is designed to accelerate queries on datasets that can be processed in memory.

ScaleOut hServer supports existing distributions of Apache Hive, so the only change required is to configure Hadoop and Hive to use ScaleOut hServer as the execution engine.

Follow the procedure below to configure Hive to use ScaleOut hServer as the MapReduce execution engine:

  1. Install and configure ScaleOut hServer as described in Installation of the IMDG.
  2. Configure ScaleOut hServer to run as an execution engine for the YARN cluster, as described in Running existing Hadoop applications.
  3. Set the mapred.job.tracker configuration property to none.

The following example directs the Hive query to run using ScaleOut hServer:

$ hive --hiveconf mapreduce.framework.name=hserver-yarn \
    --hiveconf mapred.job.tracker=none \
    -e 'SELECT SUBSTR(sourceIP, 1, 12), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 12);'
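
When queries are kept in a script file, the same two properties can in principle be set with Hive's SET command at the top of the script instead of on the command line. The following is a minimal sketch rather than a documented hServer procedure: the file name hserver_query.hql is hypothetical, and it assumes that ScaleOut hServer honors session-level SET statements in the same way as the --hiveconf options shown above.

```shell
#!/bin/sh
# Write a Hive script that selects ScaleOut hServer before running the query.
# The file name is hypothetical; the property values are those used above.
cat > hserver_query.hql <<'EOF'
SET mapreduce.framework.name=hserver-yarn;
SET mapred.job.tracker=none;
SELECT SUBSTR(sourceIP, 1, 12), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, 12);
EOF

# Run the script only if the hive CLI is available on the PATH.
if command -v hive >/dev/null 2>&1; then
    hive -f hserver_query.hql
fi
```

Keeping the SET statements in the script makes the choice of execution engine travel with the query, at the cost of having to edit the file to run it on a different engine.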

Note

You can switch between ScaleOut hServer and standard YARN by changing the value of mapreduce.framework.name. This is helpful for large queries that do not fit in memory and therefore cannot run on ScaleOut hServer.
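
As an illustration of switching engines per query, the sketch below wraps the hive command in a small shell function. The function name run_query and the HIVE_CMD override are hypothetical and exist only for this example. The value yarn is the standard Hadoop setting for mapreduce.framework.name, and the sketch assumes the cluster's default mapred.job.tracker value is appropriate when running on standard YARN.

```shell
#!/bin/sh
# Hypothetical wrapper: run_query ENGINE QUERY
#   ENGINE is "hserver" (ScaleOut hServer) or "yarn" (standard YARN).
#   HIVE_CMD may be set to another command to substitute for hive (dry runs).
run_query() {
    engine=$1
    query=$2
    case "$engine" in
        hserver)
            # Same properties as the documented hServer example.
            "${HIVE_CMD:-hive}" \
                --hiveconf mapreduce.framework.name=hserver-yarn \
                --hiveconf mapred.job.tracker=none \
                -e "$query"
            ;;
        yarn)
            # Fall back to standard YARN, e.g. for queries too large for memory.
            "${HIVE_CMD:-hive}" \
                --hiveconf mapreduce.framework.name=yarn \
                -e "$query"
            ;;
        *)
            echo "unknown engine: $engine" >&2
            return 1
            ;;
    esac
}
```

For example, run_query hserver 'SELECT COUNT(*) FROM uservisits;' would submit the query through ScaleOut hServer, while run_query yarn with the same query would run it on standard YARN.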

Note

Apache Hive installations configured to run with Apache Tez are not currently supported.