【HDFS】How does HDFS fsck actually work?

Once, running hadoop fsck from a Hadoop client, the command failed with a socket error: it couldn't connect. An RD colleague teased me for not having configured the HTTP address, and lectured that fsck is a tool that works over HTTP. Fine, lesson learned. Let's see how fsck is actually implemented.

elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
As before, checking the hadoop launcher script shows the fsck entry point is org.apache.hadoop.hdfs.tools.DFSck:
  public static void main(String[] args) throws Exception {
    // -files option is also used by GenericOptionsParser
    // Make sure that is not the first argument for fsck
    int res = -1;
    if ((args.length == 0 ) || ("-files".equals(args[0]))) 
      printUsage();
    else
      res = ToolRunner.run(new DFSck(new Configuration()), args);
    System.exit(res);
  }
From main we can see that fsck is implemented by DFSck, which implements the Tool interface and overrides run. From its constructor, it needs the HDFS configuration (which tells it where the NameNode is) plus the user's arguments. Enough preamble; let's step inside run().

String proto = "http://";
          if(UserGroupInformation.isSecurityEnabled()) { 
             System.setProperty("https.cipherSuites", Krb5AndCertsSslSocketConnector.KRB5_CIPHER_SUITES.get(0));
             proto = "https://";
          } // pick http or https depending on whether security is enabled -- confirming fsck really goes over the web
          
          final StringBuffer url = new StringBuffer(proto);
          url.append(NameNode.getInfoServer(getConf())).append("/fsck?ugi=").append(ugi.getShortUserName()).append("&path=");
	// the "/fsck" path shows the NameNode implements fsck as a servlet; this command-line tool merely submits requests to that endpoint
          String dir = "/";
          // find top-level dir first
          for (int idx = 0; idx < args.length; idx++) {
            if (!args[idx].startsWith("-")) { dir = args[idx]; break; }
          }
          url.append(URLEncoder.encode(dir, "UTF-8"));
          for (int idx = 0; idx < args.length; idx++) {
            if (args[idx].equals("-move")) { url.append("&move=1"); }
            else if (args[idx].equals("-delete")) { url.append("&delete=1"); }
            else if (args[idx].equals("-files")) { url.append("&files=1"); }
            else if (args[idx].equals("-openforwrite")) { url.append("&openforwrite=1"); }
            else if (args[idx].equals("-blocks")) { url.append("&blocks=1"); }
            else if (args[idx].equals("-locations")) { url.append("&locations=1"); }
            else if (args[idx].equals("-racks")) { url.append("&racks=1"); }
          }
          URL path = new URL(url.toString());
          SecurityUtil.fetchServiceTicket(path);
          URLConnection connection = path.openConnection();
          InputStream stream = connection.getInputStream();
As the code above shows, the tool class is simple: it calls the NameNode's fsck servlet endpoint, assembling the user's arguments into a URL, submitting the request, and printing the response to the screen. No wonder it couldn't run back when the HTTP address wasn't configured! A look at getInfoServer makes this clear.

  public static String getInfoServer(Configuration conf) {
    String http = UserGroupInformation.isSecurityEnabled() ? "dfs.https.address"
        : "dfs.http.address";
    return NetUtils.getServerAddress(conf, "dfs.info.bindAddress",
        "dfs.info.port", http);
  }

This NameNode "info server" is the same Jetty HTTP server that serves the web UI -- in fact it's the only HTTP server the NameNode starts. OK, so "what does fsck do" now turns into "what does the fsck servlet do".

To see which servlets the NameNode provides, look at the NameNode startup code where the HTTP server is brought up. There you can see the servlet behind fsck is called FsckServlet:

httpServer.addInternalServlet("fsck", "/fsck", FsckServlet.class, true);
httpServer.addInternalServlet("getimage", "/getimage",
    GetImageServlet.class, true);
httpServer.addInternalServlet("listPaths", "/listPaths/*",
    ListPathsServlet.class, false);
httpServer.addInternalServlet("data", "/data/*",
    FileDataServlet.class, false);
httpServer.addInternalServlet("checksum", "/fileChecksum/*",
    FileChecksumServlets.RedirectServlet.class, false);
httpServer.addInternalServlet("contentSummary", "/contentSummary/*",
    ContentSummaryServlet.class, false);

This also shows that HDFS has no servlet plugin mechanism here, unlike Azkaban. Now let's see what FsckServlet does:

public Object run() throws Exception {
  final NameNode nn = (NameNode) context.getAttribute("name.node");
  final int totalDatanodes = nn.namesystem.getNumberOfDatanodes(DatanodeReportType.LIVE);
  final short minReplication = nn.namesystem.getMinReplication();

  new NamenodeFsck(conf, nn, nn.getNetworkTopology(), pmap, out,
      totalDatanodes, minReplication, remoteAddress).fsck();
  return null;
}

Also simple: grab the NameNode object from the servlet context, count the live DataNodes, and read the system's minimum replication -- all of which the FSNamesystem object manages. Then construct a NamenodeFsck with those values plus the request parameters and the remote address, and call its fsck() method. So NamenodeFsck is the class dedicated to the actual checking.
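On the servlet side, the &move=1 / &files=1 style flags arrive as ordinary request parameters (the pmap handed to NamenodeFsck above). As a toy illustration only -- this is not the real servlet API, and QueryParams is a made-up name -- a fsck query string decodes like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryParams {
    // Toy stand-in for the parameter map the servlet receives.
    // Assumes every pair has the form name=value (true for fsck URLs).
    static Map<String, String> parse(String query) {
        Map<String, String> pmap = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            pmap.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return pmap;
    }

    public static void main(String[] args) {
        // The same parameters DFSck appends to the URL.
        System.out.println(parse("ugi=hadoop&path=/tmp&files=1&blocks=1"));
        // -> {ugi=hadoop, path=/tmp, files=1, blocks=1}
    }
}
```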

Before reading further into the code, let's run an fsck and keep the output next to us to see what NamenodeFsck does.

[Figure 1: sample fsck output -- the screenshot is missing from the original post.]

Call it Figure 1. Set it aside for now -- it's all familiar output -- and let's look at NamenodeFsck's fsck() method.


  public void fsck() {
    final long startTime = System.currentTimeMillis();
    try {
      String msg = "FSCK started by " + UserGroupInformation.getCurrentUser()
          + " from " + remoteAddress + " for path " + path + " at " + new Date();
      LOG.info(msg); // the user's fsck run is recorded in the NameNode log
      out.println(msg); // and this exact line does appear in Figure 1
      namenode.getNamesystem().logFsckEvent(path, remoteAddress); // record an audit-log event
      Result res = new Result(conf);
      final HdfsFileStatus file = namenode.getFileInfo(path); // resolve the path to its inode
      // (path-to-inode resolution was covered earlier: http://blog.csdn.net/tracymkgld/article/details/17553173)
      // HdfsFileStatus is just a wrapper around the inode's attributes.
      if (file != null) {
        check(path, file, res);
	// check() writes its findings into the Result object.
        out.println(res);
        out.println(" Number of data-nodes:\t\t" + totalDatanodes); // live DataNode count
        out.println(" Number of racks:\t\t" + networktopology.getNumOfRacks()); // rack count
        // so fsck also reports the number of live DataNodes and racks
        out.println("FSCK ended at " + new Date() + " in "
            + (System.currentTimeMillis() - startTime + " milliseconds"));

        // DFSck client scans for the string HEALTHY/CORRUPT to check the status
        // of file system and return appropriate code. Changing the output string
        // might break testcases. Also note this must be the last line 
        // of the report.
        if (res.isHealthy()) {
          out.print("\n\nThe filesystem under path '" + path + "' " + HEALTHY_STATUS);
        }  else {
          out.print("\n\nThe filesystem under path '" + path + "' " + CORRUPT_STATUS);
        }
      } else {
        out.print("\n\nPath '" + path + "' " + NONEXISTENT_STATUS);
      }
    } catch (Exception e) {
      String errMsg = "Fsck on path '" + path + "' " + FAILURE_STATUS;
      LOG.warn(errMsg, e);
      out.println("FSCK ended at " + new Date() + " in "
          + (System.currentTimeMillis() - startTime + " milliseconds"));
      out.println(e.getMessage());
      out.print("\n\n"+errMsg);
    } finally {
      out.close();
    }
  }
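Note the comment above: the DFSck client determines its exit code by scanning the streamed report for the HEALTHY/CORRUPT marker, which must be the last line. A minimal client-side sketch of that scan (plain Java; FsckStatusScan is my name, and the status strings are assumed to match the "is HEALTHY" / "is CORRUPT" text printed by the code above, not copied from the real constants):

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class FsckStatusScan {
    // Assumed values of NamenodeFsck's HEALTHY_STATUS / CORRUPT_STATUS.
    static final String HEALTHY = "is HEALTHY";
    static final String CORRUPT = "is CORRUPT";

    // Scan the streamed report; the status marker is on the last line,
    // so the last match wins.
    static int exitCode(BufferedReader report) throws Exception {
        int code = -1; // unknown / failure, like DFSck's default res
        String line;
        while ((line = report.readLine()) != null) {
            if (line.contains(HEALTHY)) code = 0;
            else if (line.contains(CORRUPT)) code = 1;
        }
        return code;
    }

    public static void main(String[] args) throws Exception {
        String sample = "FSCK started ...\n.\nStatus: HEALTHY\n"
                      + "The filesystem under path '/' is HEALTHY\n";
        System.out.println(exitCode(new BufferedReader(new StringReader(sample))));
        // -> 0
    }
}
```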
Here is what HdfsFileStatus carries:

   private static HdfsFileStatus createFileStatus(byte[] path, INode node) {
    // length is zero for directories
    return new HdfsFileStatus(
        node.isDirectory() ? 0 : ((INodeFile)node).computeContentSummary().getLength(), 
        node.isDirectory(), 
        node.isDirectory() ? 0 : ((INodeFile)node).getReplication(), 
        node.isDirectory() ? 0 : ((INodeFile)node).getPreferredBlockSize(),
        node.getModificationTime(),
        node.getAccessTime(),
        node.getFsPermission(),
        node.getUserName(),
        node.getGroupName(),
        path);
  }
  
OK, now for check(). It's quite long, but the idea is simple: figure out how many blocks the file has, each block's location info, and whether any blocks are corrupt. All of this comes from DataNode block reports to the NameNode, so the NameNode retrieves it trivially; block reports themselves are a topic for another day.

private void check(String parent, HdfsFileStatus file, Result res) throws IOException {
    String path = file.getFullName(parent); // just rebuilds the full path you asked to fsck (it was converted to bytes earlier for the inode lookup)
    boolean isOpen = false;

    if (file.isDir()) { // the fsck target is a directory
      byte[] lastReturnedName = HdfsFileStatus.EMPTY_NAME;
      DirectoryListing thisListing;
      if (showFiles) {
        out.println(path + " <dir>");
      }
      res.totalDirs++;
      do {
        assert lastReturnedName != null;
        thisListing = namenode.getListing(path, lastReturnedName);
        if (thisListing == null) {
          return;
        }
        HdfsFileStatus[] files = thisListing.getPartialListing();
        for (int i = 0; i < files.length; i++) {
          check(path, files[i], res);
        }
        lastReturnedName = thisListing.getLastName();
      } while (thisListing.hasMore());
      return;
    }
    long fileLen = file.getLen();
    // Get block locations without updating the file access time 
    // and without block access tokens
    LocatedBlocks blocks = namenode.getNamesystem().getBlockLocations(path, 0,
        fileLen, false, false);
    if (blocks == null) { // the file is deleted
      return;
    }
    isOpen = blocks.isUnderConstruction();
    if (isOpen && !showOpenFiles) {
      // We collect these stats about open files to report with default options
      res.totalOpenFilesSize += fileLen;
      res.totalOpenFilesBlocks += blocks.locatedBlockCount();
      res.totalOpenFiles++;
      return;
    }
    res.totalFiles++;
    res.totalSize += fileLen;
    res.totalBlocks += blocks.locatedBlockCount();
    if (showOpenFiles && isOpen) {
      out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s), OPENFORWRITE: ");
    } else if (showFiles) {
      out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s): ");
    } else {
      out.print('.');
    }
    if (res.totalFiles % 100 == 0) { out.println(); out.flush(); }
    int missing = 0;
    int corrupt = 0;
    long missize = 0;
    int underReplicatedPerFile = 0;
    int misReplicatedPerFile = 0;
    StringBuffer report = new StringBuffer();
    int i = 0;
    for (LocatedBlock lBlk : blocks.getLocatedBlocks()) {
      Block block = lBlk.getBlock();
      boolean isCorrupt = lBlk.isCorrupt();
      String blkName = block.toString();
      DatanodeInfo[] locs = lBlk.getLocations();
      res.totalReplicas += locs.length;
      short targetFileReplication = file.getReplication();
      if (locs.length > targetFileReplication) {
        res.excessiveReplicas += (locs.length - targetFileReplication);
        res.numOverReplicatedBlocks += 1;
      }
      // Check if block is Corrupt
      if (isCorrupt) {
        corrupt++;
        res.corruptBlocks++;
        out.print("\n" + path + ": CORRUPT block " + block.getBlockName()+"\n");
      }
      if (locs.length >= minReplication)
        res.numMinReplicatedBlocks++;
      if (locs.length < targetFileReplication && locs.length > 0) {
        res.missingReplicas += (targetFileReplication - locs.length);
        res.numUnderReplicatedBlocks += 1;
        underReplicatedPerFile++;
        if (!showFiles) {
          out.print("\n" + path + ": ");
        }
        out.println(" Under replicated " + block +
                    ". Target Replicas is " +
                    targetFileReplication + " but found " +
                    locs.length + " replica(s).");
      }
      // verify block placement policy
      int missingRacks = ReplicationTargetChooser.verifyBlockPlacement(
                    lBlk, targetFileReplication, networktopology);
      if (missingRacks > 0) {
        res.numMisReplicatedBlocks++;
        misReplicatedPerFile++;
        if (!showFiles) {
          if(underReplicatedPerFile == 0)
            out.println();
          out.print(path + ": ");
        }
        out.println(" Replica placement policy is violated for " + 
                    block +
                    ". Block should be additionally replicated on " + 
                    missingRacks + " more rack(s).");
      }
      report.append(i + ". " + blkName + " len=" + block.getNumBytes());
      if (locs.length == 0) {
        report.append(" MISSING!");
        res.addMissing(block.toString(), block.getNumBytes());
        missing++;
        missize += block.getNumBytes();
      } else {
        report.append(" repl=" + locs.length);
        if (showLocations || showRacks) {
          StringBuffer sb = new StringBuffer("[");
          for (int j = 0; j < locs.length; j++) {
            if (j > 0) { sb.append(", "); }
            if (showRacks)
              sb.append(NodeBase.getPath(locs[j]));
            else
              sb.append(locs[j]);
          }
          sb.append(']');
          report.append(" " + sb.toString());
        }
      }
      report.append('\n');
      i++;
    }
    if ((missing > 0) || (corrupt > 0)) {
      if (!showFiles && (missing > 0)) {
        out.print("\n" + path + ": MISSING " + missing
            + " blocks of total size " + missize + " B.");
      }
      res.corruptFiles++;
      switch(fixing) {
      case FIXING_NONE:
        break;
      case FIXING_MOVE:
        if (!isOpen)
          lostFoundMove(parent, file, blocks);
        break;
      case FIXING_DELETE:
        if (!isOpen)
          namenode.delete(path, true);
      }
    }
    if (showFiles) {
      if (missing > 0) {
        out.print(" MISSING " + missing + " blocks of total size " + missize + " B\n");
      }  else if (underReplicatedPerFile == 0 && misReplicatedPerFile == 0) {
        out.print(" OK\n");
      }
      if (showBlocks) {
        out.print(report.toString() + "\n");
      }
    }
  }
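Stripped of the per-block bookkeeping, check() boils down to a recursion over the namespace: directories recurse into their children, files accumulate totals into Result. A toy model of just that shape (plain Java; Node and Result here are hypothetical stand-ins for HdfsFileStatus and NamenodeFsck.Result):

```java
import java.util.List;

public class CheckShape {
    // Toy stand-in for HdfsFileStatus: a file has a length, a dir has children.
    static class Node {
        final String name; final long len; final List<Node> children;
        Node(String name, long len) { this.name = name; this.len = len; this.children = null; }
        Node(String name, List<Node> children) { this.name = name; this.len = 0; this.children = children; }
        boolean isDir() { return children != null; }
    }

    // Toy stand-in for NamenodeFsck.Result.
    static class Result { long totalDirs, totalFiles, totalSize; }

    // Same recursion shape as NamenodeFsck.check(): dirs recurse, files count.
    static void check(Node node, Result res) {
        if (node.isDir()) {
            res.totalDirs++;
            for (Node child : node.children) check(child, res);
            return;
        }
        res.totalFiles++;
        res.totalSize += node.len;
    }

    public static void main(String[] args) {
        Node root = new Node("/", List.of(
            new Node("a", 100),
            new Node("dir", List.of(new Node("b", 50)))));
        Result res = new Result();
        check(root, res);
        System.out.println(res.totalDirs + " dirs, " + res.totalFiles
            + " files, " + res.totalSize + " bytes");
        // -> 2 dirs, 2 files, 150 bytes
    }
}
```

The real check() differs mainly in that directory listings are fetched page by page from the NameNode (the lastReturnedName / hasMore loop) and each file's blocks are inspected individually.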
Summary:

fsck is an interface exposed by the NameNode itself, accessed through a servlet. Any client works, as long as it submits a request to that endpoint: the hadoop shell command does it through a small Java tool class, but you can also just paste the URL into a browser, for example:

http://10.4.19.42:50070/fsck?ugi=hadoop&path=/tmp/hadoop/wordcountjavain&files=1&blocks=1&locations=1&racks=1

which is equivalent to: hadoop fsck /tmp/hadoop/wordcountjavain -files -blocks -locations -racks (the -racks flag is optional; the rack info shows up even without it).
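That equivalence is easy to reproduce without any Hadoop dependency. The sketch below mirrors DFSck's argument loop from earlier (FsckUrl and its method names are mine; only the query-parameter scheme comes from the DFSck source above):

```java
import java.net.URLEncoder;

public class FsckUrl {
    // Mirrors DFSck: the first non-flag argument is the path to check,
    // and each recognized flag becomes a "&name=1" query parameter.
    static String build(String infoServer, String user, String[] args) throws Exception {
        StringBuilder url = new StringBuilder("http://")
            .append(infoServer).append("/fsck?ugi=").append(user).append("&path=");
        String dir = "/";
        for (String a : args) {
            if (!a.startsWith("-")) { dir = a; break; }
        }
        url.append(URLEncoder.encode(dir, "UTF-8"));
        for (String a : args) {
            if (a.equals("-move")) url.append("&move=1");
            else if (a.equals("-delete")) url.append("&delete=1");
            else if (a.equals("-files")) url.append("&files=1");
            else if (a.equals("-openforwrite")) url.append("&openforwrite=1");
            else if (a.equals("-blocks")) url.append("&blocks=1");
            else if (a.equals("-locations")) url.append("&locations=1");
            else if (a.equals("-racks")) url.append("&racks=1");
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // Note DFSck URL-encodes the path, unlike the hand-typed browser URL above.
        System.out.println(build("10.4.19.42:50070", "hadoop",
            new String[]{"/tmp/hadoop/wordcountjavain", "-files", "-blocks", "-locations", "-racks"}));
        // -> http://10.4.19.42:50070/fsck?ugi=hadoop&path=%2Ftmp%2Fhadoop%2Fwordcountjavain&files=1&blocks=1&locations=1&racks=1
    }
}
```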

The essence of fsck is that it reads the NameNode's in-memory state: the namespace managed by the NameNode's "housekeeper" FSNamesystem (via FSDirectory), together with the block information reported up by DataNodes. The inode attributes, block counts, replica counts, and corruption flags are all ready-made results; the NameNode does not go out to the DataNodes and ask them to re-check blocks. The flip side of these "ready-made" results -- i.e., whatever is in NameNode memory -- is that fsck output can be stale: after you change a file, it takes a while before fsck reflects the change accurately. For example, if I swap a block file on a DataNode for a corrupt one, the NameNode has no block report about it yet, so fsck will not immediately show the block as corrupt.

Deletes, on the other hand, do show up in fsck right away. As covered earlier, whether through trash or skipTrash, a delete marks the files for removal (keeping a ledger, with a cleanup thread periodically issuing RPCs to the relevant DataNodes to delete the blocks), and it directly updates the NameNode's in-memory namespace -- including the block information that came from block reports, which is also modified by the delete.

