Hello ScaleOut hServer™ V2.
Hadoop for Operational Intelligence
Welcome to real-time analytics for Hadoop! ScaleOut hServer™ V2 is the world's first in-memory execution engine for Hadoop MapReduce. Now you can analyze live data using standard Hadoop MapReduce code, in memory and in parallel without the need to install and manage the Hadoop stack of software. (Only one small change is needed to your Hadoop program.) Gone are disk I/O latencies, slow start-up times, and software environment management headaches. Benchmark tests have demonstrated 20x faster execution time over the Apache Hadoop distribution. Now you can use Hadoop MapReduce in live applications in financial services, e-commerce, logistics, and countless other scenarios where results are needed in seconds instead of minutes or hours.
ScaleOut hServer V2 builds on the low latency data access and live, in-memory data storage for Hadoop introduced by ScaleOut hServer V1. ScaleOut hServer V2 adds a full MapReduce execution engine that runs standard Hadoop MapReduce code to provide real-time performance for continuous live data analysis. It also provides blazingly fast analysis of large, static data sets. Best of all, you don't need to learn anything new – if you know Hadoop MapReduce, you can use ScaleOut hServer right away.
ScaleOut hServer easily installs on a cluster of commodity servers as shown below:
ScaleOut hServer is not a Hadoop distribution. It is designed to complement popular Hadoop distributions. It can be installed either on your Hadoop cluster or another set of servers. (When used to analyze data hosted in the Hadoop Distributed File System (HDFS), it can be installed directly on the HDFS cluster for maximum performance.) Because ScaleOut hServer has its own Hadoop MapReduce execution engine, you can avoid the tedious process of installing Hadoop to run MapReduce applications. This dramatically speeds up development time and enables fast prototyping of MapReduce applications.
When using ScaleOut hServer, MapReduce applications can input data from either ScaleOut's in-memory data grid (IMDG) or from other data sources, such as HDFS. When using external data sources, data set size is limited only by the combined intermediate data set output by the mappers, which ScaleOut hServer buffers within the IMDG. By adding servers to the cluster, the IMDG can hold intermediate data sets in the terabytes. This enables ScaleOut hServer to analyze very large, static data sets in addition to fast-changing, "live" data hosted in the IMDG.
In-Memory Hadoop for Live Data
Instead of storing "live" data on disk within HDFS, ScaleOut hServer uses a fast, scalable in-memory data grid (IMDG) that enables data to be continuously updated and analyzed using ScaleOut hServer's new Hadoop MapReduce engine. ScaleOut hServer's IMDG middleware stores key/value pairs across an elastic set of networked servers, ensuring fast data access, linear scalability, and high availability. ScaleOut hServer's integrated MapReduce engine executes standard Hadoop MapReduce programs directly in the IMDG, delivering results in seconds so you can spot important trends in your data as they occur. At the same time, your live application can easily create, read, update and delete fast-changing data in the IMDG with easy-to-use Java APIs. Together, these capabilities enable you to bring the power of Hadoop's analytics to live, operational systems.
Consider these advantages to using ScaleOut hServer for real-time analysis:
- The Hadoop MapReduce engine executes standard Hadoop code with a 20x performance increase in benchmark tests.
- ScaleOut hServer eliminates Hadoop's batch scheduling overhead, resulting in very fast (sub-second) start-up times.
- ScaleOut hServer's IMDG stores data at in-memory speed, reducing data access times, and is designed to hold "live," fast-changing data.
- Optional sorting and optimized combining and data shuffling between the mappers and reducers, all using in-memory storage, streamline processing and minimize execution time.
- For memory-based data sets, automatic setting of key MapReduce parameters, such as splits, partitions, and slots, simplifies development and makes MapReduce execution self-tuning.
- Performance scales linearly just by adding servers to increase memory capacity and throughput; ScaleOut hServer automatically rebalances the workload.
- The IMDG ensures that stored data is highly available to protect from server or network failures.
- The IMDG's key/value storage and associated Java APIs match the object-oriented architecture of your Hadoop application. Optimized data storage for large data sets with very small key/value pairs ensures efficient memory usage and maximum MapReduce performance.
- ScaleOut hServer automatically detects and optimizes applications that produce a single, combined result instead of a key/value space. This is particularly useful in real-time applications.
Streamline Hadoop Development
ScaleOut hServer can be used for rapid analysis of large, static data sets, even those that don't fit in memory. Its MapReduce engine can efficiently read and process HDFS data. In addition, it can be configured to transparently cache HDFS data in the IMDG for subsequent runs.
In yet another usage model, ScaleOut hServer enables fast, easy "what-if" simulations of static data that is held in the grid. This can be very useful in applications, such as financial modeling, which require running multiple "what-if" simulations. For example, ScaleOut hServer's fast execution time enables stock trading strategies to be easily tested and honed with multiple simulations over price histories held in-memory.
Finally, ScaleOut hServer provides a simple, easy-to-use debugging environment for developing MapReduce applications. After installing ScaleOut hServer in minutes, you can load a subset of your data into memory and execute Hadoop jobs in seconds. This means that you can rapidly iterate on your MapReduce code until you're getting the results you want.
ScaleOut hServer's technology has been developed over nearly a decade of research and customer experience. ScaleOut hServer brings together ScaleOut's proven in-memory data grid (IMDG) with a highly scalable Hadoop MapReduce engine implemented using ScaleOut's parallel computing technology. The stack of integrated ScaleOut analytics technologies that make up ScaleOut hServer can be visualized as follows:
On each server within the IMDG's cluster, the ScaleOut Grid Service provides scalable, highly available data storage and management, and the ScaleOut Analytics Engine provides the highly optimized execution platform that powers the Hadoop MapReduce engine.
Like Hadoop, ScaleOut hServer performs data-parallel computation in which application code is sent to every node in the grid. However, unlike standard Hadoop, which stores data on disk and moves it multiple times during processing, ScaleOut hServer's integrated in-memory data grid minimizes data motion by enabling input and output data sets, as well as intermediate data, to be stored in the IMDG. Customized grid record readers and writers for IMDG data efficiently pipeline key/value pairs to the mappers and from the reducers, resulting in significant performance gains.
ScaleOut hServer's Hadoop MapReduce engine leverages ScaleOut Analytics Server's in-memory, parallel computing technology to minimize start-up times and efficiently distribute tasks to the IMDG's servers. This execution engine emulates the functionality of Hadoop MapReduce to execute Hadoop code and output the same results as standard Hadoop MapReduce.
ScaleOut hServer automatically creates an "invocation grid" of JVMs to pre-stage the execution environment for a MapReduce application on all ScaleOut hServer nodes and automatically deploy all necessary executable programs and libraries. This eliminates a significant amount of setup work necessary with standard Hadoop. This execution environment optionally can be managed by the user to avoid the overhead of repeated setup across multiple MapReduce runs.
ScaleOut hServer's combination of parallel computing technologies, including in-memory data storage, optimized record reading and writing, and integrated MapReduce code execution, enables it to dramatically reduce execution time and thereby enable Hadoop MapReduce to be used for real-time applications. ScaleOut hServer has demonstrated 20x faster execution time over the Apache Hadoop distribution for the familiar Hadoop WordCount benchmark program.
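To make the WordCount computation concrete, here is a minimal sketch of its map, shuffle, and reduce phases running entirely in memory. It uses only plain Java streams; it does not use Hadoop's or ScaleOut hServer's APIs, and the class and method names (`InMemoryWordCount`, `wordCount`) are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: plain-Java word count, not the Hadoop or ScaleOut API.
class InMemoryWordCount {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.parallelStream()
                // Map phase: emit each lowercase word found in a line.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                // Shuffle and reduce phases: group identical words, then count them.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog", "The fox");
        System.out.println(wordCount(lines));
    }
}
```

Because every phase works on data already in memory, there is no disk I/O between the map and reduce steps, which is the same property the text above credits for the reduced execution time.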
Optimized, In-Memory Data Grid
After you install ScaleOut hServer, it will automatically discover and self-aggregate into an in-memory data grid spanning the cluster of servers. Using ScaleOut hServer's Java APIs, your application can create, read, update, and delete key/value pairs in the IMDG to manage fast-changing data within your live application, keeping the data in the grid up to date as changes occur in your application.
IMDGs traditionally host complex objects with rich semantics in the grid. However, Hadoop jobs often require storing and analyzing huge numbers of very small objects, such as sensor data or tweet streams. To handle these divergent requirements, ScaleOut hServer supports two object formats in its IMDG. The Named Cache is designed for large, complex objects and supports rich functionality such as property-oriented query, dependencies, timeouts, pessimistic locking, remote store access, and more.
With the new Named Map, ScaleOut hServer adds Java ConcurrentMap semantics to efficiently organize large populations of small data objects and minimize the amount of metadata associated with each. Objects stored in a named map can be queried in parallel and can be cached in the client using user-adjustable coherency policies. For fast loading and updating of key/value data, the Named Map provides bulk insert and bulk update functions. In both named caches and named maps, applications can create, read, update, and delete objects to manage live data. The Hadoop developer now has the choice to store and analyze heavyweight objects with rich metadata or lightweight objects depending on the type of data being analyzed.
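For readers less familiar with ConcurrentMap semantics, the sketch below walks through bulk insert, create, read, update, and delete on a map of small objects. It uses the JDK's `ConcurrentHashMap` as a stand-in; it is not the Named Map API, and the class name and sample keys are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: the JDK's ConcurrentHashMap stands in for a named map.
class ConcurrentMapSemantics {

    static ConcurrentMap<String, Double> demo() {
        ConcurrentMap<String, Double> prices = new ConcurrentHashMap<>();

        // Bulk insert of several small key/value pairs at once.
        prices.putAll(Map.of("AAPL", 100.0, "MSFT", 50.0, "IBM", 75.0));

        // Create: only stores the value if the key is not already present.
        prices.putIfAbsent("ORCL", 30.0);
        prices.putIfAbsent("AAPL", 1.0); // no effect, AAPL already exists

        // Read.
        double ibm = prices.get("IBM");

        // Update: succeeds only if the current value still equals the one we read.
        prices.replace("IBM", ibm, ibm + 1.0);

        // Update: atomic read-modify-write that adds 2.5 to the stored value.
        prices.merge("MSFT", 2.5, Double::sum);

        // Delete.
        prices.remove("ORCL");
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The conditional `replace` and `merge` calls are what distinguish ConcurrentMap semantics from a plain map: concurrent writers can update the same key without losing each other's changes.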
ScaleOut hServer includes a Grid Record Reader to input key/value pairs to Hadoop's mappers with minimum latency. Its input format automatically creates splits of the specified input key/value collection to avoid network overhead when retrieving key/value pairs on all worker nodes. The Grid Record Reader works with both named caches and named maps. Likewise, a Grid Record Writer enables pipelined output of results from Hadoop's reducers back to a named cache or named map in the grid.
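The idea of creating splits that avoid network overhead can be sketched as grouping keys by the server that holds them, so each mapper reads only local data. The example below is a plain-Java illustration under a stated assumption: the ownership rule (key hash modulo host count) and the names `LocalSplits`, `ownerOf`, and `createSplits` are hypothetical and do not describe how ScaleOut hServer actually places data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: one split per host, containing the keys that host owns.
class LocalSplits {

    // Assumed ownership rule for this sketch: hash of the key modulo the host count.
    static int ownerOf(String key, int hostCount) {
        return Math.floorMod(key.hashCode(), hostCount);
    }

    // Groups the input keys into splits keyed by owning host index.
    static Map<Integer, List<String>> createSplits(List<String> keys, int hostCount) {
        Map<Integer, List<String>> splits = new TreeMap<>();
        for (String key : keys) {
            splits.computeIfAbsent(ownerOf(key, hostCount), h -> new ArrayList<>()).add(key);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("sensor-1", "sensor-2", "sensor-3", "sensor-4");
        System.out.println(createSplits(keys, 2));
    }
}
```

A mapper assigned the split for host `h` would then retrieve every one of its key/value pairs from the service process on the same machine.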
Streaming Data from HDFS
With ScaleOut hServer, MapReduce applications can be connected to an HDFS data source and input data sets for processing by the MapReduce engine. This enables ScaleOut hServer to be used for analytics on data sets hosted within HDFS. Likewise, output from MapReduce applications can be sent directly to HDFS instead of to the IMDG for storage.
As data streams in from HDFS, it is fed to the mappers and processed, and the output of the mappers is stored as intermediate data within the IMDG before being sent to the reducers. As long as this intermediate data set fits within the IMDG's memory, ScaleOut hServer can process very large data sets that otherwise would not fit within the IMDG.
HDFS Distributed Cache
In some cases you may not need to analyze live data, but you would like to decrease the access time for reading data into Hadoop from HDFS. ScaleOut hServer can speed access times with its unique ability to capture and store data from HDFS during MapReduce processing. This feature is intended for use with HDFS data sets which fit within the memory of the IMDG. Here's how it works.
ScaleOut hServer includes a Dataset Record Reader that "wraps" your HDFS record reader after you make a simple, two-line code change to your Hadoop program. When the Hadoop mappers request key/value pairs, the Dataset Record Reader automatically stores the key/value pairs generated by the HDFS record reader in ScaleOut hServer's IMDG. On subsequent Hadoop MapReduce runs, ScaleOut hServer verifies that the HDFS data set has not changed and then supplies key/value pairs to the mappers directly from the IMDG. Results can be written back to HDFS just as they normally would or optionally to the ScaleOut hServer IMDG using the Grid Record Writer. The diagram below shows how the Dataset Record Reader is used:
The Dataset Record Reader is designed to store key/value pairs in the IMDG with minimum network overhead and maximum storage efficiency. Using splits defined for the HDFS file, it creates "chunks" of key/value pairs in the IMDG using overlapped updates to the IMDG while each HDFS record reader reads from the HDFS file and supplies key/value pairs to its mapper. These chunks are stored as highly available objects within the IMDG.
Likewise, on subsequent Hadoop MapReduce runs in which key/value pairs are available in the IMDG, the Dataset Record Reader bypasses the underlying HDFS record reader and supplies key/value pairs from the IMDG. ScaleOut hServer uses the same set of splits to efficiently retrieve the key/value chunks from the IMDG in an overlapped manner that minimizes latency. To minimize network overhead, chunks are served from the ScaleOut hServer service process running on the same Hadoop worker node as the requesting mapper.
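The wrap-and-cache behavior described in this section can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not ScaleOut hServer's implementation: the class `CachingSplitReader`, the in-process `HashMap` standing in for the IMDG, and the use of a numeric version stamp (for example, a file modification time) to detect a changed data set are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: caches the records of each split on the first read and
// serves later reads from the cache while the data set version is unchanged.
class CachingSplitReader {

    // Stand-in for the grid: split id -> cached chunk of records.
    private final Map<Integer, List<String>> chunkCache = new HashMap<>();
    private long cachedVersion = -1; // assumed change marker, e.g. modification time
    private int underlyingReads = 0; // counts how often the wrapped reader was used

    // 'underlying' simulates the wrapped record reader for a given split id.
    List<String> read(int splitId, long datasetVersion,
                      Function<Integer, List<String>> underlying) {
        if (datasetVersion != cachedVersion) {
            // The data set changed, so previously cached chunks are stale.
            chunkCache.clear();
            cachedVersion = datasetVersion;
        }
        return chunkCache.computeIfAbsent(splitId, id -> {
            underlyingReads++;
            return new ArrayList<>(underlying.apply(id));
        });
    }

    int underlyingReads() {
        return underlyingReads;
    }

    public static void main(String[] args) {
        CachingSplitReader reader = new CachingSplitReader();
        Function<Integer, List<String>> hdfs = id -> List.of("record-" + id + "-a", "record-" + id + "-b");
        reader.read(0, 1L, hdfs); // first run: reads through and caches
        reader.read(0, 1L, hdfs); // second run: served from the cache
        System.out.println("underlying reads: " + reader.underlyingReads());
    }
}
```

Keying the cache by split mirrors the text's point that the same set of splits is used both to store and to retrieve the chunks.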
Available as a Community Edition
ScaleOut hServer is available in both a free Community Edition and several commercial editions. The Community Edition can be used either for evaluation or production and supports up to a four-server IMDG and a maximum data set of 256GB. Support for the Community Edition is provided via the ScaleOut Community Forum, where you can ask questions and exchange ideas with other users and ScaleOut experts.
A range of commercial editions is available to fit your needs. Commercial editions are licensed on an annual subscription or perpetual basis and include support and maintenance.
ScaleOut hServer Partners
ScaleOut is a proud member of the Cloudera Connect Partner Program. Through this program, ScaleOut provides Cloudera customers with the ability to capture perishable business opportunities by running MapReduce on live, fast-changing data.