Dataset Input Format Optimization for Text Key/Value Pairs of Known Length

If the keys and values produced by the underlying input format are Text objects of a known, fixed size, both server memory consumption and deserialization overhead can be reduced by storing the Text data contiguously in a buffer, which avoids storing the length of each individual object.

To take advantage of this optimization, set the key and value sizes, in bytes, in the job configuration using the setTextKeyValueSize(…) method of DatasetInputFormat:

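// Use DatasetInputFormat on top of the underlying TextInputFormat.
job.setInputFormatClass(DatasetInputFormat.class);
DatasetInputFormat.setUnderlyingInputFormat(job, TextInputFormat.class);
// Fixed sizes in bytes; the arguments are presumably the key size (10) followed by the value size (90).
DatasetInputFormat.setTextKeyValueSize(job, 10, 90);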
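The idea behind the optimization can be illustrated with a small, self-contained sketch. It is not the DatasetInputFormat implementation; the class FixedSizeRecordBuffer and its methods are hypothetical names introduced only for this example, and UTF-8 encoding is assumed. Because every key and value has a known byte length, records are packed back to back with no per-object length field, and the i-th record is located by offset arithmetic alone.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the DatasetInputFormat API.
final class FixedSizeRecordBuffer {
    private final int keySize;      // fixed key length in bytes
    private final int valueSize;    // fixed value length in bytes
    private final ByteBuffer buffer;
    private int count = 0;

    FixedSizeRecordBuffer(int keySize, int valueSize, int capacity) {
        this.keySize = keySize;
        this.valueSize = valueSize;
        // One contiguous heap buffer; no length prefix is stored per record.
        this.buffer = ByteBuffer.allocate((keySize + valueSize) * capacity);
    }

    void append(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        if (k.length != keySize || v.length != valueSize) {
            throw new IllegalArgumentException("record does not match the configured sizes");
        }
        buffer.put(k).put(v);
        count++;
    }

    String key(int i) {
        checkIndex(i);
        // The offset is computed from the fixed record size.
        return new String(buffer.array(), i * (keySize + valueSize), keySize,
                StandardCharsets.UTF_8);
    }

    String value(int i) {
        checkIndex(i);
        return new String(buffer.array(), i * (keySize + valueSize) + keySize, valueSize,
                StandardCharsets.UTF_8);
    }

    int bytesUsed() {
        return count * (keySize + valueSize);
    }

    private void checkIndex(int i) {
        if (i < 0 || i >= count) {
            throw new IndexOutOfBoundsException("no record at index " + i);
        }
    }

    public static void main(String[] args) {
        FixedSizeRecordBuffer records = new FixedSizeRecordBuffer(2, 3, 4);
        records.append("k1", "abc");
        records.append("k2", "def");
        System.out.println(records.key(1) + " -> " + records.value(1)); // prints "k2 -> def"
        System.out.println(records.bytesUsed());                        // prints "10"
    }
}
```

In this sketch two records of 2 + 3 bytes occupy exactly 10 bytes. A length-prefixed layout would add a length field to every key and every value, which is the overhead that the fixed-size setting avoids.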