- Counters (values are definitive only once the job has completed successfully)
- Task Counters
- Filesystem Counters
- Job Counters (maintained by the application master, so they do not need to be sent across the network; they record job-level statistics such as the number of tasks launched)
- FileInputFormat Counters
- FileOutputFormat Counters
- User-defined counters
- by enum
context.getCounter(Temperature.MALFORMED).increment(1);
- by group name and counter name (dynamic counters)
public Counter getCounter(String groupName, String counterName)
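To show how the two forms of user-defined counter relate, here is a minimal plain-Java sketch that uses no Hadoop classes. It assumes that an enum counter is addressed by the enum's class name as the group and the constant's name as the counter, which is how Hadoop resolves enum counters as far as I know. `CounterSketch`, `Temperature.MISSING`, and the group name `"Quality"` are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for a task's counter store; not a Hadoop class.
final class CounterSketch {
    // Hypothetical enum mirroring the Temperature.MALFORMED counter in the notes.
    enum Temperature { MISSING, MALFORMED }

    // group name -> (counter name -> value)
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Dynamic form: address a counter by group name and counter name.
    void increment(String groupName, String counterName, long amount) {
        groups.computeIfAbsent(groupName, g -> new TreeMap<>())
              .merge(counterName, amount, Long::sum);
    }

    // Enum form: the enum type supplies the group, the constant supplies the name.
    void increment(Enum<?> key, long amount) {
        increment(key.getDeclaringClass().getName(), key.name(), amount);
    }

    // Returns 0 for a counter that was never incremented.
    long value(String groupName, String counterName) {
        Map<String, Long> group = groups.get(groupName);
        if (group == null) {
            return 0L;
        }
        Long current = group.get(counterName);
        return current == null ? 0L : current;
    }

    long value(Enum<?> key) {
        return value(key.getDeclaringClass().getName(), key.name());
    }

    public static void main(String[] args) {
        CounterSketch counters = new CounterSketch();
        counters.increment(Temperature.MALFORMED, 1);   // enum form
        counters.increment("Quality", "q1", 1);         // dynamic form
        System.out.println(counters.value(Temperature.MALFORMED));
        System.out.println(counters.value("Quality", "q1"));
    }
}
```

The enum form is checked at compile time, while the string form allows counter names that are only known at run time.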
- Sorting
- Partial sort (the default: each reduce task's output is sorted by key, but with multiple reduce tasks there is no ordering across the output files)
- Total sort
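// Partition by key range so the reducer outputs concatenate into one sorted sequence
job.setPartitionerClass(TotalOrderPartitioner.class);
// Sample the input keys to choose balanced partition boundaries:
// sampling frequency 0.1, at most 10000 samples, drawn from at most 10 splits
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);
// Share the partition file with every task through the distributed cache
Configuration conf = job.getConfiguration();
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI partitionUri = new URI(partitionFile);
job.addCacheFile(partitionUri);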
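The idea behind the partition file can be sketched in plain Java: sample some keys, choose split points from the sorted sample, then send each key to the range it falls in. This is only an illustration of range partitioning under the assumption of integer keys and a sample at least as large as the number of partitions; `RangePartitionSketch` and its methods are hypothetical names, and Hadoop's own implementation differs in detail.

```java
import java.util.Arrays;

// Plain-Java sketch of range partitioning from sampled keys; not Hadoop code.
final class RangePartitionSketch {
    // Pick numPartitions - 1 split points at evenly spaced positions in the sorted sample.
    // Assumes sample.length >= numPartitions.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return splits;
    }

    // Keys below splits[0] go to partition 0, keys in [splits[0], splits[1]) to partition 1, and so on.
    static int partition(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found: the key equals the split point at index pos and belongs to partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the insertion point is the partition.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        int[] splits = splitPoints(new int[] {10, 30, 20, 50, 40, 60}, 3);
        System.out.println(Arrays.toString(splits));
        for (int key : new int[] {5, 30, 49, 99}) {
            System.out.println(key + " -> partition " + partition(key, splits));
        }
    }
}
```

Because the partition index never decreases as the key grows, every key in partition i sorts before every key in partition i + 1, so the sorted reducer outputs form a total order when read in partition order.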
- Secondary sort
- Make the key a composite of the natural key and the natural value.
- The sort comparator should order by the composite key (i.e., the natural key and natural value).
- The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
job.setPartitionerClass(FirstPartitioner.class);
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);
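The three roles above can be simulated in plain Java without Hadoop. The sketch below assumes a composite key of year (natural key) and temperature (natural value), sorted by year ascending and temperature descending, so the first key in each group carries that year's maximum. `SecondarySortSketch`, `YearTemp`, and the sample data are hypothetical; they only mirror the roles of `KeyComparator`, `GroupComparator`, and `FirstPartitioner`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of secondary sort on a single reducer; not Hadoop code.
final class SecondarySortSketch {
    // Composite key: natural key (year) plus natural value (temperature).
    static final class YearTemp {
        final int year;
        final int temp;
        YearTemp(int year, int temp) { this.year = year; this.temp = temp; }
    }

    // Sort comparator: the whole composite key, year ascending then temperature descending.
    static final Comparator<YearTemp> KEY_COMPARATOR =
        Comparator.comparingInt((YearTemp k) -> k.year)
                  .thenComparing(Comparator.comparingInt((YearTemp k) -> k.temp).reversed());

    // Grouping comparator: natural key only.
    static final Comparator<YearTemp> GROUP_COMPARATOR =
        Comparator.comparingInt((YearTemp k) -> k.year);

    // Partitioner: natural key only, so all records for a year reach the same reducer.
    static int partition(YearTemp key, int numPartitions) {
        return Math.floorMod(key.year * 127, numPartitions);
    }

    // Reduce side: sort by the composite key, then keep the first key of each group.
    static List<YearTemp> maxPerYear(List<YearTemp> keys) {
        List<YearTemp> sorted = new ArrayList<>(keys);
        sorted.sort(KEY_COMPARATOR);
        List<YearTemp> result = new ArrayList<>();
        YearTemp previous = null;
        for (YearTemp key : sorted) {
            if (previous == null || GROUP_COMPARATOR.compare(previous, key) != 0) {
                result.add(key); // first key of a new group
            }
            previous = key;
        }
        return result;
    }

    public static void main(String[] args) {
        List<YearTemp> keys = new ArrayList<>();
        keys.add(new YearTemp(1900, 35));
        keys.add(new YearTemp(1901, 36));
        keys.add(new YearTemp(1900, 10));
        keys.add(new YearTemp(1901, 40));
        for (YearTemp k : maxPerYear(keys)) {
            System.out.println(k.year + "\t" + k.temp);
        }
    }
}
```

If the partitioner looked at the full composite key instead, records for one year could land on different reducers and the per-year grouping would break.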
- Join
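- Map-side join (strict requirements: every source must be partitioned and sorted in the same way, so that all records for a given key fall in the same partition of each source)
- Reduce-side join (more general, because the inputs need no particular structure, but less efficient, because both datasets go through the shuffle)
- Multiple inputs -> a separate mapper for each source, which tags each record with its source
- Secondary sort -> orders the records for each key so that those from one source reach the reducer before those from the other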
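A reduce-side join built from these two pieces can be simulated in plain Java. The sketch assumes two sources, station metadata (tag 0) and weather readings (tag 1), joined on a station ID; sorting by ID and then by tag guarantees that the metadata record arrives first in each group. `ReduceSideJoinSketch`, `Tagged`, and the sample values are hypothetical, and the sketch performs an inner join that drops readings with no matching station.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of a reduce-side join on a single reducer; not Hadoop code.
final class ReduceSideJoinSketch {
    // One record as a per-source mapper would emit it, tagged with its origin.
    static final class Tagged {
        final String stationId; // join key
        final int tag;          // 0 = station metadata, 1 = weather reading
        final String value;
        Tagged(String stationId, int tag, String value) {
            this.stationId = stationId;
            this.tag = tag;
            this.value = value;
        }
    }

    // Sort by (join key, tag), group by join key, and pair each reading with its station name.
    static List<String> join(List<Tagged> shuffled) {
        List<Tagged> sorted = new ArrayList<>(shuffled);
        sorted.sort(Comparator.comparing((Tagged t) -> t.stationId)
                              .thenComparingInt(t -> t.tag));
        List<String> out = new ArrayList<>();
        String currentId = null;
        String stationName = null;
        for (Tagged t : sorted) {
            if (!t.stationId.equals(currentId)) {
                // A new group starts: forget the previous station's name.
                currentId = t.stationId;
                stationName = null;
            }
            if (t.tag == 0) {
                stationName = t.value; // metadata sorts first within the group
            } else if (stationName != null) {
                out.add(t.stationId + "\t" + stationName + "\t" + t.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> records = new ArrayList<>();
        records.add(new Tagged("S1", 1, "22"));
        records.add(new Tagged("S1", 0, "Alpha Station"));
        records.add(new Tagged("S2", 1, "17"));
        for (String line : join(records)) {
            System.out.println(line);
        }
    }
}
```

Without the ordering on the tag, the reducer would have to buffer every reading for a key until it found the metadata record, which may not fit in memory for large groups.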
- Side data distribution
- Small data in the job configuration -> keep it to a few kilobytes because, each time the configuration is read, all of its entries are read into memory, even those that are not used.
- Distributed cache -> files passed with -files, -archives, and -libjars are copied to each node once per job