Submitting MapReduce jobs to ScaleOut hServer

To construct a MapReduce job using ScaleOut hServer, use the HServerJob class instead of Job to configure the job. (In addition, when running under YARN, Hadoop MapReduce jobs can be run unchanged, as described in the following section.) HServerJob supports constructor signatures identical to those of Job, and because it extends the Job class, the methods for configuring job parameters are unchanged. For example, to apply this change to the WordCount example:

Using the Hadoop Job Tracker. 

//This job will run using the Hadoop job tracker:
public static void main(String[] args)
                        throws Exception {

    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount"); //Change this line!

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(
         TextInputFormat.class);
    job.setOutputFormatClass(
         TextOutputFormat.class);
    FileInputFormat.addInputPath(
         job, new Path(args[0]));
    FileOutputFormat.setOutputPath(
         job, new Path(args[1]));

    job.waitForCompletion(true);
 }

Using ScaleOut hServer. 

//This job will run using ScaleOut hServer:
public static void main(String[] args)
                        throws Exception {

    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount"); //This line changed!

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(
         TextInputFormat.class);
    job.setOutputFormatClass(
         TextOutputFormat.class);
    FileInputFormat.addInputPath(
         job, new Path(args[0]));
    FileOutputFormat.setOutputPath(
         job, new Path(args[1]));

    job.waitForCompletion(true);
 }

HServerJob also provides a constructor that takes an additional boolean parameter to control whether the reducer input keys are sorted:

public HServerJob(Configuration conf, String jobName, boolean sortEnabled)

This parameter allows sorting of the reducer input keys for each partition to be disabled, maximizing performance when the reducers do not depend on key ordering.
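
For example, a minimal sketch (reusing the WordCount configuration shown above) that disables sorting of the reducer input keys:

//This job will run with reducer input key sorting disabled:
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "wordcount", false); //false disables sorting

//...set the mapper, reducer, and input/output formats as shown above...

job.waitForCompletion(true);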

Running MapReduce jobs from the command line (without a Hadoop distribution installed)

As a full replacement for the Hadoop MapReduce execution engine, ScaleOut hServer does not require a Hadoop distribution to be installed. To run this example from the command line without an installed distribution, add the ScaleOut hServer library JARs and the Hadoop distribution JARs to the Java classpath. The individual worker JVMs in the invocation grid automatically receive all necessary JAR dependencies specified by the application’s configuration.

Note: For your convenience, the Hadoop distribution JARs required for running MapReduce jobs are located in the ScaleOut hServer installation directory.

If the previous WordCount example is packaged as wordcount-hserver.jar, you can run it via the command line as follows:

Windows

java -classpath "C:\Program Files\ScaleOut_Software\StateServer\JavaAPI\lib\*;C:\Program Files\ScaleOut_Software\StateServer\JavaAPI\*;C:\Program Files\ScaleOut_Software\StateServer\JavaAPI\hslib\hadoop-1.2.1\*" org.myorg.WordCount input/ output/

Linux

$ java -classpath "/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/hadoop-1.2.1/*" org.myorg.WordCount input/ output/

If your application uses HDFS for input or output, the Hadoop configuration directory must also be added to the Java classpath when running the ScaleOut hServer job from the command line. For example (using CDH 5 on Linux):

$ java -classpath "/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.0.2/*:/etc/hadoop/conf" org.myorg.WordCount input/ output/

Running a MapReduce job from the Hadoop command line

Optionally, if a Hadoop distribution is installed on the IMDG’s cluster, you can use the Hadoop command line to run a MapReduce job within the IMDG. To do this, make sure the ScaleOut hServer library JARs are present in the classpath of the invocation JVM by setting the HADOOP_CLASSPATH variable in conf/hadoop-env.sh in the Hadoop installation directory, as follows:

Apache Hadoop 1.2.1

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/hadoop-1.2.1/*

Apache Hadoop 2.4.1

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/hadoop-2.4.1/*

CDH 4.4.0

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh4.4.0/*

CDH 5

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.0.2/*

CDH 5 (YARN)

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.0.2-yarn/*

CDH 5.2

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.2.1/*

CDH 5.2 (YARN)

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/cdh5.2.1-yarn/*

HDP 2.1 (YARN)

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/hdp2.1-yarn/*

HDP 2.2 (YARN)

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/hdp2.2-yarn/*

IBM BigInsights

$ export HADOOP_CLASSPATH=/usr/local/soss/java_api/*:/usr/local/soss/java_api/lib/*:/usr/local/soss/java_api/hslib/ibm-bi-3.0.0/*

Note: Running a MapReduce job from the Hadoop command line does not require adding the Hadoop distribution-specific JARs to the classpath; the Hadoop command line handles this for you, using the default JARs for your distribution.

This small change is sufficient to run a MapReduce application from the Hadoop command line. For example, if the WordCount example is modified as described in the previous section and packaged as wordcount-hserver.jar, it can be run from the command line as follows:

$ hadoop jar wordcount-hserver.jar org.myorg.WordCount inputdir/ outputdir/