MapReduce Workflow

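Job submission checks the output specification: if the output directory is not specified or already exists, the job is not submitted.

The client then computes the input splits for the job; if they cannot be computed (for example, because the input paths do not exist), the job is not submitted.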
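The split calculation for file-based input can be sketched as below. This is a minimal illustration, assuming the clamp formula max(minSize, min(maxSize, blockSize)) that FileInputFormat applies (the bounds come from mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize). The class and method names are hypothetical, and the split count is a plain ceiling division that ignores format-specific details.

```java
final class SplitSizeSketch {
    /** Clamp the block size between the configured minimum and maximum split sizes. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Simplified split count for one file: ceiling division (an assumption of this sketch). */
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With the default bounds the split size is simply the block size.
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println("split size (MB): " + splitSize / mb);
        System.out.println("splits for a 300 MB file: " + countSplits(300 * mb, splitSize));
    }
}
```

Because one map task is created per split, raising the minimum above the block size gives fewer, larger map tasks, while lowering the maximum below it gives more, smaller ones.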

The application master (AM) receives progress and completion reports from the tasks. It also requests containers for the map and reduce tasks from the resource manager. Once a container has been assigned to a task, the AM starts it by contacting the node manager.

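If uber tasks are enabled (mapreduce.job.ubertask.enable), a small job runs inside the application master's own JVM instead of in separate containers. By default a job counts as small if it has fewer than 10 mappers, at most one reducer, and an input size that fits within one HDFS block (see mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces and mapreduce.job.ubertask.maxbytes).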
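The small-job test described above amounts to a simple predicate. The sketch below is illustrative only: the class, method and parameter names are hypothetical, and the thresholds (fewer than 10 maps, at most one reduce, input within one block) are taken from the note rather than read from a real job configuration.

```java
final class UberDecisionSketch {
    /** Returns true if the job is small enough to run inside the application master. */
    static boolean isUberCandidate(boolean uberEnabled, int numMaps, int numReduces,
                                   long inputBytes, long blockSizeBytes) {
        return uberEnabled
                && numMaps < 10                   // fewer than 10 mappers
                && numReduces <= 1                // at most one reducer
                && inputBytes <= blockSizeBytes;  // input fits within one block
    }

    public static void main(String[] args) {
        long block = 128L * 1024L * 1024L;
        System.out.println(isUberCandidate(true, 4, 1, 64L * 1024L * 1024L, block));
        System.out.println(isUberCandidate(true, 12, 1, 64L * 1024L * 1024L, block));
    }
}
```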

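All map tasks must complete before the sort phase of the reduce can start, so requests for map tasks are made first and with a higher priority than those for reduce tasks.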

Resource requirements for tasks are configurable on a per-job basis: see mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.

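When the job completes, the application master and the task containers clean up their working state (temporary files are deleted), the OutputCommitter's commitJob() method is called, and the job information is archived by the job history server.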


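- task failure (tasks send heartbeats to the AM)

Causes: a runtime exception thrown by user code in the map or reduce function, or the task JVM exiting unexpectedly. A task that stops reporting progress is also marked as failed.

A task is attempted at most four times by default. The limit is configured with mapreduce.map.maxattempts and mapreduce.reduce.maxattempts.

mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent set the maximum percentage of tasks that are allowed to fail without failing the whole job.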
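The two limits above interact as follows: a task is retried until its attempts run out, and the job fails once too large a share of tasks has failed permanently. The sketch below is a simplified model with hypothetical class and method names; it assumes a strict "greater than" comparison for the percentage check, so a threshold of 0 means any permanently failed task fails the job.

```java
final class TaskFailurePolicySketch {
    /** A task may be rescheduled while it has used fewer attempts than the limit. */
    static boolean canRetry(int failedAttempts, int maxAttempts) {
        return failedAttempts < maxAttempts;
    }

    /** The job fails once the share of permanently failed tasks exceeds the allowed percentage. */
    static boolean jobFails(long failedTasks, long totalTasks, long maxFailuresPercent) {
        return failedTasks * 100 > maxFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        System.out.println("retry after 3 failed attempts of 4: " + canRetry(3, 4));
        System.out.println("retry after 4 failed attempts of 4: " + canRetry(4, 4));
        System.out.println("job fails, 5 of 100 tasks failed, 10% allowed: " + jobFails(5, 100, 10));
    }
}
```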

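- application master failure (the AM sends heartbeats to the RM)

An application master is attempted at most twice by default. The limit is set by mapreduce.am.max-attempts, while yarn.resourcemanager.am.max-attempts imposes a cluster-wide upper bound on any YARN application master, so both must be raised to allow more attempts.

The new application master uses the job history to recover the state of tasks that had already completed, so they do not have to be rerun.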

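- node manager failure (node managers send heartbeats to the RM)

A node manager can also be blacklisted by the application master if the number of task failures for the application on that node reaches the configured maximum, mapreduce.job.maxtaskfailures.per.tracker (3 by default). The AM then tries to schedule tasks on other nodes.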
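Per-node blacklisting can be modelled as a failure counter per host. The sketch below uses hypothetical class and method names and assumes that a node is blacklisted as soon as its failure count reaches the threshold; the exact comparison used by the real application master is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

final class NodeBlacklistSketch {
    private final int maxFailuresPerNode;
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    NodeBlacklistSketch(int maxFailuresPerNode) {
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    /** Record one task failure on a node and report whether the node is now blacklisted. */
    boolean recordFailure(String node) {
        int count = failuresByNode.merge(node, 1, Integer::sum);
        return count >= maxFailuresPerNode;
    }

    /** A blacklisted node should no longer receive tasks for this application. */
    boolean isBlacklisted(String node) {
        return failuresByNode.getOrDefault(node, 0) >= maxFailuresPerNode;
    }
}
```

With a threshold of 3, the third call to recordFailure for the same host returns true, and a scheduler built on this model would skip that host from then on.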

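- resource manager failure (handled by HA with an active and a standby resource manager)

Information about all running applications is persisted in a highly available state store (ZooKeeper or a shared filesystem).

After a failover, the new active resource manager reads this state and restarts the application masters of all applications that were running.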


- shuffle and sort

  • Map

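The number of partitions is the same as the number of reduce tasks.

Map output is buffered in memory. Each time the buffer reaches the spill threshold, a background thread writes a new spill file, so one map task can produce multiple spill files. Before writing, the thread sorts the data in memory by key within each partition, and the combiner function, if there is one, runs on the sorted output.

Before the map task finishes, the spill files are merged into a single partitioned and sorted output file.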

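  • Reduce

Map outputs are copied into memory first and spilled to disk when the buffer exceeds a threshold. The outputs from the different map tasks are then merged, preserving their sort order, into a single sorted input for the reduce.

  • Configuration

Tuning covers several parameters: the sort buffer size (mapreduce.task.io.sort.mb), the spill percentage (mapreduce.map.sort.spill.percent), the number of background copier threads (mapreduce.reduce.shuffle.parallelcopies), among others.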
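The three mechanical pieces of the shuffle (partition assignment, the spill threshold, and merging sorted runs) can be sketched in plain Java. The partition formula mirrors the usual hash partitioning scheme, with the hash masked to be non-negative and taken modulo the number of reducers. The class and method names are hypothetical, records are reduced to plain strings, and the 100 MB buffer and 0.80 threshold in main are used as example values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class ShuffleSketch {
    /** Hash partitioning: one partition per reduce task, never negative. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Buffer fill level (in bytes) at which a background spill starts. */
    static long spillThresholdBytes(int sortBufferMb, double spillPercent) {
        return Math.round(sortBufferMb * 1024L * 1024L * spillPercent);
    }

    /** K-way merge of already sorted runs (spill files or fetched map outputs). */
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Each queue entry is {runIndex, positionInRun}, ordered by the record it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("partition of \"apple\" with 4 reducers: " + partitionFor("apple", 4));
        System.out.println("spill threshold (bytes): " + spillThresholdBytes(100, 0.80));
        System.out.println(mergeSortedRuns(List.of(List.of("a", "d"), List.of("b", "c"))));
    }
}
```

The same merge applies on both sides: the map task merges its spill files into one output file, and the reduce task merges the outputs it fetched from many map tasks.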


- task execution

  • speculative task
  • output commit
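    // Excerpt of org.apache.hadoop.mapreduce.OutputCommitter
    public abstract class OutputCommitter {
      // Job-level setup, run before any task starts
      public abstract void setupJob(JobContext jobContext) throws IOException;
      // Called once when the job succeeds
      public void commitJob(JobContext jobContext) throws IOException { }
      // Called when the job fails or is killed
      public void abortJob(JobContext jobContext, JobStatus.State state)
          throws IOException { }
      // Task-level setup, run before each task attempt
      public abstract void setupTask(TaskAttemptContext taskContext)
          throws IOException;
      // Returns false to skip the commit phase for this task attempt
      public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
          throws IOException;
      // Promotes the output of a successful task attempt
      public abstract void commitTask(TaskAttemptContext taskContext)
          throws IOException;
      // Discards the output of a failed or killed task attempt
      public abstract void abortTask(TaskAttemptContext taskContext)
          throws IOException;
    }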
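The commit protocol behind these methods can be illustrated with a local-filesystem sketch: each task attempt writes into its own temporary directory, only the committed attempt has its files moved into the final output directory, and aborted attempts are deleted. This is a simplified model, not Hadoop's FileOutputCommitter. The class name is hypothetical, attempt ids are plain strings, and the _temporary directory and _SUCCESS marker are used here only to echo the familiar on-disk layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TempDirCommitSketch {
    private final Path outputDir;

    TempDirCommitSketch(Path outputDir) {
        this.outputDir = outputDir;
    }

    private Path attemptDir(String attemptId) {
        return outputDir.resolve("_temporary").resolve(attemptId);
    }

    /** Creates the private working directory for one task attempt. */
    Path setupTask(String attemptId) throws IOException {
        return Files.createDirectories(attemptDir(attemptId));
    }

    /** Only attempts that still have a working directory need a commit. */
    boolean needsTaskCommit(String attemptId) {
        return Files.isDirectory(attemptDir(attemptId));
    }

    /** Moves the attempt's files into the final output directory. */
    void commitTask(String attemptId) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(attemptDir(attemptId))) {
            files = listing.collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, outputDir.resolve(file.getFileName()), StandardCopyOption.ATOMIC_MOVE);
        }
        Files.delete(attemptDir(attemptId));
    }

    /** Discards everything the attempt wrote. */
    void abortTask(String attemptId) throws IOException {
        deleteRecursively(attemptDir(attemptId));
    }

    /** Removes the temporary area and marks the job as successful. */
    void commitJob() throws IOException {
        deleteRecursively(outputDir.resolve("_temporary"));
        Files.createFile(outputDir.resolve("_SUCCESS"));
    }

    private static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

This also shows why speculative execution is safe with such a protocol: two attempts of the same task can both write a part file, but only one is committed and the other is aborted, so the output directory never contains duplicate or partial results.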
