Here are some of the best ways to increase performance:
1. Use a Combiner, which can greatly reduce network traffic.
2. Define and configure a RawComparator to avoid deserializing objects.
3. Prefer StringUtils.split over String.split (see the sketch after this list).
4. Use data compression, which can minimize network traffic.
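To illustrate the StringUtils.split tip, here is a minimal sketch (assuming Apache Commons Lang's StringUtils, which ships with Hadoop; the sample record is made up). String.split is regex-based, while StringUtils.split scans for a plain delimiter character, which adds up when a Mapper is invoked once per record:
- import org.apache.commons.lang.StringUtils;
-
- public class SplitComparison {
-   public static void main(String[] args) {
-     String record = "10017\t44.50\t2014-03-01";   // made-up tab-delimited record
-     // String.split goes through the regular-expression machinery.
-     String[] viaRegex = record.split("\t");
-     // StringUtils.split does a simple character scan -- no regex involved.
-     String[] viaStringUtils = StringUtils.split(record, '\t');
-     System.out.println(viaRegex.length + " fields vs " + viaStringUtils.length + " fields");
-   }
- }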
mapreduce.task.io.sort.mb -- The amount of memory allocated to the MapOutputBuffer. The default value is 200MB.
mapreduce.map.sort.spill.percent -- A percentage of mapreduce.task.io.sort.mb that, when exceeded, causes records to be spilled to disk. The default is 80%.
mapreduce.task.io.sort.factor -- The number of partitions to merge at one time. The default is 10.
mapreduce.map.speculative -- When set to true, the MapReduce framework may start another instance of a straggling map task, in case the poorly-performing task eventually fails or can be completed more quickly on a different node. The default value of this property is false.
mapreduce.job.jvm.numtasks -- The number of tasks to run per JVM. The default is 1, meaning each task runs in a new JVM process. Set this value higher than 1 to reuse a JVM for multiple tasks, which saves the overhead of tearing down and starting up JVM processes.
mapreduce.map.output.compress -- Defaults to false; set it to true if you want the output of the mapper to be compressed before being sent across the network. You can specify a codec using mapreduce.map.output.compress.codec.
mapreduce.reduce.shuffle.input.buffer.percent -- The percentage of the Reducer's memory allocated for storing map outputs during the shuffle. The default is 70%.
mapreduce.reduce.shuffle.merge.percent -- The percentage of mapreduce.reduce.shuffle.input.buffer.percent that, when exceeded, causes merging to occur on the Reducer. The default is 66%.
mapreduce.reduce.merge.inmem.threshold -- When this threshold is reached, a merge is triggered and a spill to disk occurs on the reducer. The default is 1,000 map outputs. Setting this value to 0 lets merges occur based solely on the value of mapreduce.reduce.shuffle.merge.percent.
mapreduce.reduce.input.buffer.percent -- The percentage of the Reducer's memory that allows map outputs to remain in memory, rather than being written to disk, during the reduce phase.
mapreduce.reduce.shuffle.parallelcopies -- The number of threads that a reducer uses to retrieve output from mappers. The default is 5, but if you have hundreds of mappers this can become a bottleneck.
mapreduce.reduce.speculative -- If a reduce task is straggling, the MapReduce framework will start another instance of the same task on a different node. The default value of this property is false.
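To make the list above concrete, here is a minimal sketch of setting a few of these properties programmatically on a job's Configuration (the values shown are arbitrary examples, not tuning recommendations; the same properties can also be set in mapred-site.xml or on the command line):
- Configuration conf = job.getConfiguration();
- // Give the map-side sort buffer more room and spill later.
- conf.setInt("mapreduce.task.io.sort.mb", 400);
- conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
- // Merge more spill files per pass.
- conf.setInt("mapreduce.task.io.sort.factor", 50);
- // Reuse each JVM for up to 10 tasks.
- conf.setInt("mapreduce.job.jvm.numtasks", 10);
- // Let the reducer pull from more mappers in parallel.
- conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20);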
Data compression often has two benefits:
1. Less network traffic between nodes
2. Less space needed on the filesystem
The commonly used algorithms and codecs available in Hadoop are:
Snappy -- org.apache.hadoop.io.compress.SnappyCodec
gzip -- org.apache.hadoop.io.compress.GzipCodec
bzip2 -- org.apache.hadoop.io.compress.BZip2Codec
LZO -- com.hadoop.compression.lzo.LzopCodec
DEFLATE -- org.apache.hadoop.io.compress.DefaultCodec
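To show how these codec classes are used directly (outside of a MapReduce job), here is a hedged sketch that writes and re-reads a gzip-compressed file on HDFS using CompressionCodecFactory; the path is hypothetical, and the factory resolves GzipCodec from the .gz extension:
- Configuration conf = new Configuration();
- FileSystem fs = FileSystem.get(conf);
- Path path = new Path("/tmp/example.gz");                  // hypothetical path
- CompressionCodecFactory factory = new CompressionCodecFactory(conf);
- CompressionCodec codec = factory.getCodec(path);          // resolves to GzipCodec
- // Wrap the HDFS streams with the codec's compression/decompression streams.
- try (OutputStream out = codec.createOutputStream(fs.create(path))) {
-   out.write("hello, compressed world\n".getBytes("UTF-8"));
- }
- try (InputStream in = codec.createInputStream(fs.open(path))) {
-   IOUtils.copyBytes(in, System.out, conf, false);
- }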
Data compression has several trade-offs that you need to consider, including:
Space vs. time -- While there is a gain in filesystem space or smaller network traffic, it will take additional time to compress and decompress the data.
Splittable vs. Non-splittable -- Most of the compression
algorithms do not support the splitting of files, which is a major concern in MapReduce.
If the codec utilized does not support splitting, then a map task cannot
take advantage of data locality. For example, if a large file is chunked across 10 DataNodes but uses Snappy compression (which is not splittable), then only one map task can process this file, which means 90% of the file needs to be transferred across the
network to a single NodeManager for processing.
Only bzip2 and LZO support splitting.
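As a small, hedged illustration (the file names are made up): a codec advertises splittability by implementing Hadoop's SplittableCompressionCodec interface, which is what file-based input formats consult when deciding whether a compressed input file can be split:
- Configuration conf = new Configuration();
- CompressionCodecFactory factory = new CompressionCodecFactory(conf);
- for (String name : new String[] { "data.snappy", "data.gz", "data.bz2" }) {
-   CompressionCodec codec = factory.getCodec(new Path(name));
-   // Only codecs implementing SplittableCompressionCodec (e.g. BZip2Codec) are splittable.
-   boolean splittable = codec instanceof SplittableCompressionCodec;
-   System.out.println(name + " -> " + codec.getClass().getSimpleName()
-       + ", splittable = " + splittable);
- }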
Configuring Data Compression
Enable compression and configure the codec using configuration properties, which can be defined at several levels, including:
The DataNode level -- in mapred-site.xml
The application level -- by setting the properties using the Configuration instance
The runtime level -- by using command-line arguments (a sketch follows this list)
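For the runtime level, one common pattern (a hedged sketch; the driver class name is hypothetical) is to run the driver through ToolRunner, which parses generic -D options from the command line and merges them into the job's Configuration before run() is called:
- public class CompressedJobDriver extends Configured implements Tool {
-   @Override
-   public int run(String[] args) throws Exception {
-     // getConf() already contains any -D properties supplied on the command line.
-     Job job = Job.getInstance(getConf(), "compressed-job");
-     // ...remaining job setup...
-     return job.waitForCompletion(true) ? 0 : 1;
-   }
-
-   public static void main(String[] args) throws Exception {
-     System.exit(ToolRunner.run(new Configuration(), new CompressedJobDriver(), args));
-   }
- }
The job could then be submitted with, for example, hadoop jar myjob.jar CompressedJobDriver -D mapreduce.map.output.compress=true -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec <input> <output>.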
Here are the properties to enable and configure compression in a MapReduce job:
mapreduce.map.output.compress: set to true or false to enable or disable compression of data output by the Mapper.
mapreduce.map.output.compress.codec: defines the codec to use for the compressed map output.
mapreduce.output.fileoutputformat.compress: set to true or false to enable or disable compression of data output by the job. The default is false.
mapreduce.output.fileoutputformat.compress.codec: defines the codec to use for the compressed job output.
mapreduce.output.fileoutputformat.compress.type: if the job output is compressed SequenceFiles, this property determines how they are compressed. Valid values are RECORD, BLOCK, or NONE.
The following example turns on compression for both the map output and the job output, in both cases using Snappy compression:
- Configuration conf = job.getConfiguration();
- // Compress the intermediate map output sent across the network.
- conf.setBoolean(MRJobConfig.MAP_OUTPUT_COMPRESS, true);
- conf.setClass(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC,
-               SnappyCodec.class,
-               CompressionCodec.class);
- // Compress the final job output written by FileOutputFormat.
- conf.setBoolean(FileOutputFormat.COMPRESS, true);
- conf.setClass(FileOutputFormat.COMPRESS_CODEC,
-               SnappyCodec.class,
-               CompressionCodec.class);
Providing a RawComparator can greatly improve the performance of a large MapReduce application:
• When a Mapper writes out a <key, value> pair using the context.write method, the key and value are immediately serialized.
• During the shuffle/sort phase, these keys need to be sorted, and the ordering is determined by the compareTo method of the key class.
• Because the keys are serialized, they must first be deserialized before they can be compared to each other.
You can avoid this deserialization by writing a compare method that compares the keys in their serialized state.
You can also define a group RawComparator for controlling which keys are grouped together for a single call to the reduce method of a Reducer.
Defining a RawComparator
Write a class that implements the org.apache.hadoop.io.RawComparator interface. The easiest way to implement the RawComparator interface is to extend the WritableComparator class.
- public class CustomerComparator extends WritableComparator {
-   protected CustomerComparator() {
-     super(CustomerKey.class);
-   }
-   @Override
-   public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
-     // Read the customerId directly from the serialized bytes of each key,
-     // avoiding deserialization of the whole CustomerKey object.
-     int customerId1 = readInt(b1, s1);
-     int customerId2 = readInt(b2, s2);
-     // Integer.compare avoids the overflow risk of subtracting the two ids.
-     return Integer.compare(customerId1, customerId2);
-   }
- }
Let the MapReduce job know that your particular data type is to use your RawComparator.
You have two options for configuring a sort RawComparator:
1. Define a static initializer in the class definition.
- public class CustomerKey
-     implements WritableComparable<CustomerKey> {
-   // Register CustomerComparator as the default comparator for CustomerKey.
-   static {
-     WritableComparator.define(CustomerKey.class,
-                               new CustomerComparator());
-   }
-   private int customerId;
-   private String zipCode;
-   // remainder of class definition...
- }
2. Use the setSortComparatorClass method when configuring the Job.
- job.setSortComparatorClass(CustomerComparator.class);
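For the group RawComparator mentioned earlier, the Job exposes a separate setter; the comparator class shown here is hypothetical and would, for example, compare only the customerId portion of the key:
- // CustomerGroupComparator is hypothetical; it would also extend WritableComparator.
- job.setGroupingComparatorClass(CustomerGroupComparator.class);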