How Hadoop submits a job: an analysis based on Hadoop 1.0.0

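Script analysis

The Hadoop bin directory looks like this:

When you run the hadoop jar XXX.jar command, the hadoop script executes the following branch:

From this we can see that hadoop starts the job through the org.apache.hadoop.util.RunJar class.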

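RunJar class analysis

Overview

RunJar is a launcher class with a main function. It contains two static methods, main and unJar:

Analysis of the main function: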

 public static void main(String[] args) throws Throwable {
    String usage = "RunJar jarFile [mainClass] args...";

    if (args.length < 1) {
      System.err.println(usage);
      System.exit(-1);
    }

    int firstArg = 0;
    String fileName = args[firstArg++]; // the first argument is the jar file name
    File file = new File(fileName);
    String mainClassName = null;

    JarFile jarFile;
    try {
      jarFile = new JarFile(fileName);
    } catch(IOException io) {
      throw new IOException("Error opening job jar: " + fileName)
        .initCause(io);
    }

    Manifest manifest = jarFile.getManifest();
    if (manifest != null) {
      mainClassName = manifest.getMainAttributes().getValue("Main-Class");
    }
    jarFile.close();

    if (mainClassName == null) {
      if (args.length < 2) {
        System.err.println(usage);
        System.exit(-1);
      }
      mainClassName = args[firstArg++]; // otherwise the second argument is the main class
    }
    mainClassName = mainClassName.replaceAll("/", ".");

    File tmpDir = new File(new Configuration().get("hadoop.tmp.dir"));  // hadoop.tmp.dir is used as a local path here (java.io.File), not an HDFS path
    tmpDir.mkdirs();
    if (!tmpDir.isDirectory()) {
      System.err.println("Mkdirs failed to create " + tmpDir);
      System.exit(-1);
    }
    final File workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
    workDir.delete();
    workDir.mkdirs();
    if (!workDir.isDirectory()) {
      System.err.println("Mkdirs failed to create " + workDir);
      System.exit(-1);
    }

    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {  // on JVM shutdown, delete the temporary working directory
          try {
            FileUtil.fullyDelete(workDir);
          } catch (IOException e) {
          }
        }
      });

    unJar(file, workDir);  // unpack the contents of the jar into the working directory

    ArrayList<URL> classPath = new ArrayList<URL>();  // build the classpath for the job's dependencies
    classPath.add(new File(workDir+"/").toURL());
    classPath.add(file.toURL());
    classPath.add(new File(workDir, "classes/").toURL());
    File[] libs = new File(workDir, "lib").listFiles();
    if (libs != null) {
      for (int i = 0; i < libs.length; i++) {
        classPath.add(libs[i].toURL());
      }
    }

    ClassLoader loader =
      new URLClassLoader(classPath.toArray(new URL[0]));

    Thread.currentThread().setContextClassLoader(loader);
    Class<?> mainClass = Class.forName(mainClassName, true, loader);  // load the user's main class through the new class loader (no instance is created)
    Method main = mainClass.getMethod("main", new Class[] {
      Array.newInstance(String.class, 0).getClass()
    });
    String[] newArgs = Arrays.asList(args)  // the remaining arguments are passed to the user's main
      .subList(firstArg, args.length).toArray(new String[0]);
    try {
      main.invoke(null, new Object[] { newArgs }); // call the static main method via reflection
    } catch (InvocationTargetException e) {
      throw e.getTargetException();
    }
  }

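To summarize, RunJar mainly does the following:

Creates a temporary working directory under hadoop.tmp.dir (on the local filesystem, since the code uses java.io.File)
Unpacks the jar into that directory and adds the directory, the jar itself, classes/ and lib/* to the classpath of a new class loader
Starts the user's main function through reflection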
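
The reflection step can be tried in isolation. The following is a minimal sketch, not RunJar itself: RunJarSketch, Hello and the argument values are hypothetical names, and the URLClassLoader is given an empty URL list so that the class resolves through the parent loader (RunJar would list the unpacked directory and the lib jars there).

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;

class RunJarSketch {

    /** Hypothetical stand-in for the user's job class. */
    public static class Hello {
        static String lastArgs = "";

        public static void main(String[] args) {
            lastArgs = String.join(",", args);
            System.out.println("Hello.main called with " + lastArgs);
        }
    }

    /** Manifest values may use '/' separators; RunJar turns them into dots. */
    static String normalize(String mainClassName) {
        return mainClassName.replace('/', '.');
    }

    /** Loads className through loader and calls its static main(String[]). */
    static void invokeMain(ClassLoader loader, String className, String[] args) {
        try {
            Class<?> mainClass = Class.forName(normalize(className), true, loader);
            Method main = mainClass.getMethod("main", String[].class);
            // The cast stops the array from being spread as varargs.
            main.invoke(null, (Object) args);
        } catch (ReflectiveOperationException e) {
            // RunJar rethrows the target exception; this sketch just wraps it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Empty URL list (assumption for the demo): lookups fall through
        // to the parent loader, which already knows Hello.
        ClassLoader loader = new URLClassLoader(new URL[0],
                RunJarSketch.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(loader);

        // Same layout as "hadoop jar": jar name, main class, then job arguments.
        String[] all = {"job.jar", Hello.class.getName(), "in", "out"};
        String[] rest = Arrays.copyOfRange(all, 2, all.length);
        invokeMain(loader, all[1], rest);
    }
}
```

Passing the arguments as a single Object is the modern equivalent of the new Object[] { newArgs } wrapping in the original code.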

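Analysis of the user's main function

Analyzing the client-side architecture through the protocol between the client and the JobTracker

From the earlier articles we know that the client and the JobTracker communicate through JobSubmissionProtocol, and that jobs are submitted with the submitJob method. Let's see which classes implement this protocol:

Next, let's see where the protocol is used:

Clearly, ① JobClient is the client side, and the RPC proxy call at ④ is where the remote connection is obtained.
Let's look at ④: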

  private static JobSubmissionProtocol createRPCProxy(InetSocketAddress addr,
      Configuration conf) throws IOException {
    return (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
        JobSubmissionProtocol.versionID, addr, 
        UserGroupInformation.getCurrentUser(), conf,
        NetUtils.getSocketFactory(conf, JobSubmissionProtocol.class));
  }

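Next we trace where this method is used. The second usage is clearly a temporary variable that is discarded after use; the jobSubmitClient field is the real handle to the server.

Finally, combining this with the earlier code gives an architecture diagram:

The user holds a Job instance, Job holds a JobClient instance, and JobClient talks to the JobTracker through its jobSubmitClient field.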
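
To make the role of the proxy concrete, here is a minimal sketch using a JDK dynamic proxy, which is the mechanism Hadoop 1.x's RPC.getProxy builds on. Everything in it is an assumption for illustration: SubmissionProtocol is a much-reduced stand-in for JobSubmissionProtocol, the address string is made up, and the handler records the call in a list instead of sending it over the network.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProxySketch {

    /** Hypothetical, much-reduced stand-in for JobSubmissionProtocol. */
    interface SubmissionProtocol {
        String submitJob(String jobId, String jobSubmitDir);
    }

    /** Records what a real RPC layer would have sent over the wire. */
    static final List<String> sent = new ArrayList<>();

    /** Builds a proxy whose interface calls all land in one handler. */
    static SubmissionProtocol createProxy(String address) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!method.getName().equals("submitJob")) {
                throw new UnsupportedOperationException(method.getName());
            }
            // A real RPC layer would serialize the method name and arguments,
            // send them to the address, and block until the reply arrives.
            sent.add(address + " <- " + method.getName() + Arrays.toString(args));
            return "SUBMITTED:" + args[0];
        };
        return (SubmissionProtocol) Proxy.newProxyInstance(
                SubmissionProtocol.class.getClassLoader(),
                new Class<?>[] {SubmissionProtocol.class},
                handler);
    }

    /** Holds the proxy in a field, in the way JobClient holds jobSubmitClient. */
    static class Client {
        private final SubmissionProtocol jobSubmitClient;

        Client(String address) {
            this.jobSubmitClient = createProxy(address);
        }

        String submit(String jobId, String dir) {
            return jobSubmitClient.submitJob(jobId, dir);
        }
    }

    public static void main(String[] args) {
        Client client = new Client("jobtracker:9001"); // hypothetical address
        System.out.println(client.submit("job_0001", "/staging/job_0001"));
        System.out.println(sent);
    }
}
```

The caller only ever sees the interface, which is why JobClient can treat the remote JobTracker as if it were a local object.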

Overview of the user's main function

Below is a MapReduce program written by a user:

public class WCRunner {

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();
		
		Job wcjob = new Job(conf);
		
		wcjob.setJarByClass(WCRunner.class);
		wcjob.setMapperClass(WCMapper.class);
		wcjob.setReducerClass(WCReducer.class);
		wcjob.setOutputKeyClass(Text.class);
		wcjob.setOutputValueClass(LongWritable.class);
		wcjob.setMapOutputKeyClass(Text.class);
		wcjob.setMapOutputValueClass(LongWritable.class);
		FileInputFormat.setInputPaths(wcjob, new Path(args[0]));
		FileOutputFormat.setOutputPath(wcjob, new Path(args[1]));

		wcjob.waitForCompletion(true);
	}
}

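This program shows the basic elements of a MapReduce program:

Input (InputFormat)
map function
reduce function
Output (OutputFormat)
For the basic MapReduce programming model, see the separate article: MapReduce programming model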
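
WCMapper and WCReducer are not shown in this article. As a rough idea of what such a pair computes, here is a plain-Java sketch of word count that uses no Hadoop classes at all: map, reduce and run are hypothetical names, and the grouping step in run stands in for the shuffle that the framework performs between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {

    /** map: one input line becomes a list of (word, 1) pairs. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    /** reduce: one word plus all of its counts becomes a total. */
    static long reduce(String word, List<Long> counts) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs map, groups the pairs by key (the "shuffle"), then runs reduce. */
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello world")));
    }
}
```

In the real job the key and value types would be Text and LongWritable, as configured in WCRunner above.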

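End-to-end walkthrough of the user's main function

Users do not need to understand the complex execution machinery behind a MapReduce program. Once they have written the mapper and reducer functions they need, everything else is handed to the Job class.

Job is a subclass of JobContext and has several important fields:

private JobState state : its value comes from an enum defined inside Job: public static enum JobState {DEFINE, RUNNING}. The job stays in DEFINE until it is submitted and is in RUNNING afterwards.
private JobClient jobClient : the class that interacts with the cluster's JobTracker
private RunningJob info : RunningJob is an interface; the object actually used is NetworkedJob, an inner class of JobClient, which monitors the running job
protected UserGroupInformation ugi : this field is important. Job inherits it from JobContext, where it is assigned during initialization. It is part of Hadoop's permission checking. When Job wants to communicate with the cluster (establish a connection), the check must pass before the run method inside ugi.doAs(run) is executed. JobClient has a ugi field as well, so its interactions with the cluster are usually wrapped in doAs too.

When the main function reaches waitForCompletion() on the Job instance, job submission begins: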

  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
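      // 1. connect() sets up the connection to the cluster and assigns the client handle to jobClient
      // 2. the job is submitted, and the returned monitoring instance is assigned to info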
      submit();
    }
    if (verbose) {
      jobClient.monitorAndPrintJob(conf, info);
    } else {
      info.waitForCompletion();
    }
    return isSuccessful();
  }
  public void submit() throws IOException, InterruptedException, 
                              ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    
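    // Connect to the cluster's JobTracker; the client handle is jobClient
    connect();
    // This is where the job is actually submitted:
    // jobClient submits it and returns a NetworkedJob instance that monitors the job's status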
    info = jobClient.submitJobInternal(conf); 
    super.setJobID(info.getID());
    state = JobState.RUNNING;
   }
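
The state guard in submit() is what keeps a job from being submitted twice. A minimal sketch of that pattern, assuming a stripped-down class with none of the real Job's fields (JobStateSketch and its submissions counter are hypothetical, and the counter stands in for connect() plus submitJobInternal()):

```java
class JobStateSketch {

    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;
    int submissions = 0;

    /** Rejects the call unless the job is in the expected state. */
    private void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                    "Job in state " + state + " instead of " + expected);
        }
    }

    void submit() {
        ensureState(JobState.DEFINE);
        submissions++;              // stands in for connect() + submitJobInternal()
        state = JobState.RUNNING;
    }

    /** Submits only if the job has not been submitted yet. */
    boolean waitForCompletion() {
        if (state == JobState.DEFINE) {
            submit();
        }
        return true;                // monitoring is omitted in this sketch
    }

    JobState getState() {
        return state;
    }

    public static void main(String[] args) {
        JobStateSketch job = new JobStateSketch();
        job.waitForCompletion();
        job.waitForCompletion();    // second call does not resubmit
        System.out.println(job.getState() + " after " + job.submissions + " submission(s)");
        try {
            job.submit();           // explicit resubmission is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```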

Finally, inside submitJobInternal, the client calls the following code to communicate with the JobTracker:

          status = jobSubmitClient.submitJob(
              jobId, submitJobDir.toString(), jobCopy.getCredentials());

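This call sends the jobId, the location of the submitted files, and the credentials, but nothing about the input data or its splits. So where is that information kept? Tracing further through the code, we find: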

          // Write job file to JobTracker's fs        
          FSDataOutputStream out = 
            FileSystem.create(fs, submitJobFile,
                new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));

          try {
            jobCopy.writeXml(out);
          } finally {
            out.close();
          }

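Clearly, the job's details are stored as XML on the JobTracker's file system (HDFS in a typical cluster). The JobTracker therefore has to read the job file itself once it has received the job.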
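
The idea of "write the configuration as XML, let the other side read it back" can be sketched with the JDK alone. This uses java.util.Properties as a stand-in for Hadoop's Configuration, so the XML layout differs from a real job.xml; JobFileSketch, the byte array that replaces the file on the shared file system, and the property values are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

class JobFileSketch {

    /** Client side: serialize the job settings as XML. */
    static byte[] writeJobFile(Properties conf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            conf.storeToXML(out, "job file");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Server side: rebuild the settings from the XML bytes. */
    static Properties readJobFile(byte[] xml) {
        Properties conf = new Properties();
        try {
            conf.loadFromXML(new ByteArrayInputStream(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("mapred.input.dir", "/data/in");   // hypothetical values
        conf.setProperty("mapred.reduce.tasks", "2");

        byte[] jobFile = writeJobFile(conf);                 // what the client stores
        Properties seenByServer = readJobFile(jobFile);      // what the server loads
        System.out.println(seenByServer.getProperty("mapred.input.dir"));
    }
}
```

Only the location of the file travels in the submitJob call; the settings themselves are picked up from storage by whoever needs them.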
