Querying a NamedMap with Apache Hive

ScaleOut hServer implements the Apache Hive storage handler to provide a read-only Hive table view of the NamedMap. It maps Java object properties stored in the NamedMap to Hive table columns, allowing HQL queries on the objects stored in the NamedMap. Hive cannot modify the NamedMap or its data.

To create the table view of the NamedMap, use a Hive CREATE TABLE statement with the following requirements:

  1. The column names should correspond to the Java object property getter name (e.g., getFoo() getter should correspond to column named foo ). It is not required to define columns for all properties; unmapped properties will be ignored.
  2. The column type should match the corresponding property type.
  3. The NamedMap name should be associated with the Hive table by setting the hserver.map.name table property.
  4. If custom serialization is necessary, the CustomSerializer should be set via hserver.value.serializer and hserver.value.type (see the following section).

To run distributed queries on the NamedMap, each SOSS node should have the necessary class definitions for value objects and custom serializers. This can be achieved by adding the JARs containing these definitions to the --auxpath property in the Hive command line:

$ hive --auxpath /home/hiveuser/myjar.jar

Example: Shopping Cart

To illustrate the concept of querying a NamedMap through Hive, let’s create a Hive table representing a customer.

First, we define the sample Java class representing the customer with properties customerId (int), firstName (String), lastName (String), login (String), and respective getter methods:

public class Customer implements Serializable
{
    private int customerId;
    private String firstName;
    private String lastName;
    private String login;

    public Customer(int customerId, String firstName, String lastName, String login) {
        this.customerId = customerId;
        this.firstName = firstName;
        this.lastName = lastName;
        this.login = login;
    }

    public int getCustomerId() {
        return customerId;
    }

    public String getFirstName() {
        return firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public String getLogin() {
        return login;
    }
}

If the Customer instances are stored in a NamedMap with the name customers, we can use the following statement to create the Hive table view:

hive> CREATE TABLE
customers (customerid int, firstname string, lastname string, login string)
STORED BY 'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorageHandler'
TBLPROPERTIES ("hserver.map.name" = "customers");
OK
Time taken: 0.508 seconds

Hive now has a table view of the NamedMap and can run queries against it. Example:

hive> SELECT * FROM customers;
..............................
1       Eduardo     Hazelrigg       ehazelrigg
13      Serena      Sadberry        ssadberry
9       Ermelinda   Manganaro       emanganaro
5       Edda        Speir           espeir
17      Tomeka      Stovall         tstovall
21      Luciano     Perkinson       lperkinson
25      Jacob       Garrow          jgarrow
33      Quincy      Kreutzer        qkreutzer
37      Iona        Speir           ispeir
41      Ermelinda   Thielen         ethielen
Time taken: 0.475 seconds, Fetched: 100 row(s)

When finished querying the NamedMap, destroy the table view by calling DROP TABLE. This command only removes the associated table from the metastore; it does not clear the NamedMap. Example:

hive> DROP TABLE customers
OK
Time taken: 0.18 seconds

To illustrate joins between two NamedMaps, let’s create another table based on a Java class which represents a shopping cart with the properties customerId (int), totalPrice (int), and a map of items in the shopping cart with each item’s quantity as the map value:

public class ShoppingCart implements Serializable {
    private int customerId;
    private Map<String, Integer> items;
    private int totalPrice;

    public ShoppingCart(Integer customerName, Map<String, Integer> items, int totalPrice) {
        this.customerId = customerName;
        this.items = items;
        this.totalPrice = totalPrice;
    }

    public int getCustomerId() {
        return customerId;
    }

    public Map<String, Integer> getItems() {
        return items;
    }

    public int getTotalPrice() {
        return totalPrice;
    }
}

Create the Hive table view of the class with the following CREATE TABLE statement. Notice that the Java Map type corresponds to the Hive MAP type:

hive>CREATE TABLE shoppingcarts(customerid int, totalprice int, items MAP<string, int>)
STORED BY 'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorageHandler'
TBLPROPERTIES ("hserver.map.name" = "shoppingcarts");
OK
Time taken: 0.402 seconds

After the table is created, we can run exploratory queries on the NamedMap. For example, this query lists all shopping carts containing more than 2 CPUs:

hive> SELECT * FROM shoppingcarts WHERE items["CPU"]>2;
..............................
33      812     {"Motherboard":4,"CPU":3,"Printer":3,"Router":1,"Monitor":4}
56      212     {"Mouse":4,"Router":3,"CPU":4}
40      1284    {"Motherboard":1,"CPU":3,"Hard drive":1,"Keyboard":4,"Monitor":4}
46      689     {"Motherboard":2,"CPU":3,"Printer":1,"Mouse":2,"Router":1,"Monitor":3}
64      567     {"Printer":3,"CPU":4,"Router":2,"Monitor":4}
41      865     {"Monitor":1,"Motherboard":3,"Printer":3,"CPU":4,"Mouse":4,"Keyboard":4,"Router":2}
52      166     {"CPU":3,"Keyboard":1}
97      730     {"CPU":3,"Mouse":3,"Keyboard":1,"Monitor":2}
29      671     {"Motherboard":1,"CPU":3,"Mouse":1,"Hard drive":1,"Keyboard":1,"Router":4}
38      485     {"Printer":2,"CPU":4,"Hard drive":2,"Keyboard":3,"Router":1}
36      797     {"Motherboard":2,"Monitor":4,"CPU":4}
21      151     {"CPU":4}
12      1044    {"Printer":4,"CPU":3,"Mouse":3,"Hard drive":4,"Keyboard":3,"Monitor":2}
7       788     {"Hard drive":4,"Monitor":1,"Motherboard":3,"Printer":1,"CPU":4,"Mouse":1,"Keyboard":4}
19      1144    {"Motherboard":1,"CPU":4,"Printer":2,"Router":3,"Monitor":2}
63      284     {"CPU":4,"Printer":3}
Time taken: 7.428 seconds, Fetched: 209 row(s)

Finally, this query pulls data from both NamedMaps, joining them on the customer ID to find customers who have more than $5,000 in their shopping cart(s). We assume that a customer may have multiple shopping carts, so the total prices are summed:

hive>SELECT firstname, lastname
FROM shoppingcarts LEFT JOIN customers ON shoppingcarts.customerid=customers.customerid
GROUP BY firstname, lastname, shoppingcarts.customerid
HAVING SUM(totalprice)>5000;
..............................
Shelton Burgener
Iona    Speir
Paulita Liptak
Quincy  Neher
Jacob   Liptak
Luciano Garrow
Paulita Perkinson
Lavada  Manganaro
Edda    Thielen
Edda    Garrow
Lindsy  Knights
Time taken: 34.216 seconds, Fetched: 42 row(s)