【HDFS】How does HDFS fsck actually work?

Once, running hadoop fsck from a Hadoop client, the command failed with a socket error: it couldn't connect. An RD colleague teased me for not having configured the HTTP address, and lectured that fsck is a tool that works over HTTP. Fine, lesson learned. Let's see how fsck is actually implemented.

elif [ "$COMMAND" = "fsck" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSck
As before, checking the hadoop launcher script shows the fsck entry point is org.apache.hadoop.hdfs.tools.DFSck:
  public static void main(String[] args) throws Exception {
    // -files option is also used by GenericOptionsParser
    // Make sure that is not the first argument for fsck
    int res = -1;
    if ((args.length == 0 ) || ("-files".equals(args[0]))) 
      printUsage();
    else
      res = ToolRunner.run(new DFSck(new Configuration()), args);
    System.exit(res);
  }
From main we can see that fsck is implemented by DFSck, which implements the Tool interface and overrides run. From its constructor, it needs the HDFS configuration (which tells it where the NameNode is) plus the user's arguments. Enough preamble; let's step inside run().

String proto = "http://";
          if(UserGroupInformation.isSecurityEnabled()) { 
             System.setProperty("https.cipherSuites", Krb5AndCertsSslSocketConnector.KRB5_CIPHER_SUITES.get(0));
             proto = "https://";
          } // pick http or https depending on whether security is enabled -- confirming fsck really goes over the web
          
          final StringBuffer url = new StringBuffer(proto);
          url.append(NameNode.getInfoServer(getConf())).append("/fsck?ugi=").append(ugi.getShortUserName()).append("&path=");
	// the "/fsck" path shows the NameNode implements fsck as a servlet; this command-line tool merely submits requests to that endpoint
          String dir = "/";
          // find top-level dir first
          for (int idx = 0; idx < args.length; idx++) {
            if (!args[idx].startsWith("-")) { dir = args[idx]; break; }
          }
          url.append(URLEncoder.encode(dir, "UTF-8"));
          for (int idx = 0; idx < args.length; idx++) {
            if (args[idx].equals("-move")) { url.append("&move=1"); }
            else if (args[idx].equals("-delete")) { url.append("&delete=1"); }
            else if (args[idx].equals("-files")) { url.append("&files=1"); }
            else if (args[idx].equals("-openforwrite")) { url.append("&openforwrite=1"); }
            else if (args[idx].equals("-blocks")) { url.append("&blocks=1"); }
            else if (args[idx].equals("-locations")) { url.append("&locations=1"); }
            else if (args[idx].equals("-racks")) { url.append("&racks=1"); }
          }
          URL path = new URL(url.toString());
          SecurityUtil.fetchServiceTicket(path);
          URLConnection connection = path.openConnection();
          InputStream stream = connection.getInputStream();
As the code above shows, the tool class is simple: it calls the NameNode's fsck servlet endpoint, assembling the user's arguments into a URL, submitting the request, and printing the response to the screen. No wonder it couldn't run back when the HTTP address wasn't configured! A look at getInfoServer makes this clear.

  public static String getInfoServer(Configuration conf) {
    String http = UserGroupInformation.isSecurityEnabled() ? "dfs.https.address"
        : "dfs.http.address";
    return NetUtils.getServerAddress(conf, "dfs.info.bindAddress",
        "dfs.info.port", http);
  }

This NameNode "info server" is the same Jetty HTTP server that serves the web UI -- in fact it's the only HTTP server the NameNode starts. OK, so "what does fsck do" now turns into "what does the fsck servlet do".

To see which servlets the NameNode provides, look at the NameNode startup code where the HTTP server is brought up. There you can see the servlet behind fsck is called FsckServlet:

httpServer.addInternalServlet("fsck", "/fsck", FsckServlet.class, true);
httpServer.addInternalServlet("getimage", "/getimage",
    GetImageServlet.class, true);
httpServer.addInternalServlet("listPaths", "/listPaths/*",
    ListPathsServlet.class, false);
httpServer.addInternalServlet("data", "/data/*",
    FileDataServlet.class, false);
httpServer.addInternalServlet("checksum", "/fileChecksum/*",
    FileChecksumServlets.RedirectServlet.class, false);
httpServer.addInternalServlet("contentSummary", "/contentSummary/*",
    ContentSummaryServlet.class, false);

This also shows that HDFS has no servlet plugin mechanism here, unlike Azkaban. Now let's see what FsckServlet does:

public Object run() throws Exception {
  final NameNode nn = (NameNode) context.getAttribute("name.node");
  final int totalDatanodes = nn.namesystem.getNumberOfDatanodes(DatanodeReportType.LIVE);
  final short minReplication = nn.namesystem.getMinReplication();

  new NamenodeFsck(conf, nn, nn.getNetworkTopology(), pmap, out,
      totalDatanodes, minReplication, remoteAddress).fsck();
  return null;
}

Also simple: grab the NameNode object from the servlet context, count the live DataNodes, and read the system's minimum replication -- all of which the FSNamesystem object manages. Then construct a NamenodeFsck with those values plus the request parameters and the remote address, and call its fsck() method. So NamenodeFsck is the class dedicated to the actual checking.
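On the servlet side, the &move=1 / &files=1 style flags arrive as ordinary request parameters (the pmap handed to NamenodeFsck above). As a toy illustration only -- this is not the real servlet API, and QueryParams is a made-up name -- a fsck query string decodes like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryParams {
    // Toy stand-in for the parameter map the servlet receives.
    // Assumes every pair has the form name=value (true for fsck URLs).
    static Map<String, String> parse(String query) {
        Map<String, String> pmap = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            pmap.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return pmap;
    }

    public static void main(String[] args) {
        // The same parameters DFSck appends to the URL.
        System.out.println(parse("ugi=hadoop&path=/tmp&files=1&blocks=1"));
        // -> {ugi=hadoop, path=/tmp, files=1, blocks=1}
    }
}
```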

Before reading further into the code, let's run an fsck and keep the output next to us to see what NamenodeFsck does.

[Figure 1: sample fsck output -- the screenshot is missing from the original post.]

Call it Figure 1. Set it aside for now -- it's all familiar output -- and let's look at NamenodeFsck's fsck() method.


  public void fsck() {
    final long startTime = System.currentTimeMillis();
    try {
      String msg = "FSCK started by " + UserGroupInformation.getCurrentUser()
          + " from " + remoteAddress + " for path " + path + " at " + new Date();
      LOG.info(msg); // the user's fsck run is recorded in the NameNode log
      out.println(msg); // and this exact line does appear in Figure 1
      namenode.getNamesystem().logFsckEvent(path, remoteAddress); // record an audit-log event
      Result res = new Result(conf);
      final HdfsFileStatus file = namenode.getFileInfo(path); // resolve the path to its inode
      // (path-to-inode resolution was covered earlier: http://blog.csdn.net/tracymkgld/article/details/17553173)
      // HdfsFileStatus is just a wrapper around the inode's attributes.
      if (file != null) {
        check(path, file, res);
	// check() writes its findings into the Result object.
        out.println(res);
        out.println(" Number of data-nodes:\t\t" + totalDatanodes); // live DataNode count
        out.println(" Number of racks:\t\t" + networktopology.getNumOfRacks()); // rack count
        // so fsck also reports the number of live DataNodes and racks
        out.println("FSCK ended at " + new Date() + " in "
            + (System.currentTimeMillis() - startTime + " milliseconds"));

        // DFSck client scans for the string HEALTHY/CORRUPT to check the status
        // of file system and return appropriate code. Changing the output string
        // might break testcases. Also note this must be the last line 
        // of the report.
        if (res.isHealthy()) {
          out.print("\n\nThe filesystem under path '" + path + "' " + HEALTHY_STATUS);
        }  else {
          out.print("\n\nThe filesystem under path '" + path + "' " + CORRUPT_STATUS);
        }
      } else {
        out.print("\n\nPath '" + path + "' " + NONEXISTENT_STATUS);
      }
    } catch (Exception e) {
      String errMsg = "Fsck on path '" + path + "' " + FAILURE_STATUS;
      LOG.warn(errMsg, e);
      out.println("FSCK ended at " + new Date() + " in "
          + (System.currentTimeMillis() - startTime + " milliseconds"));
      out.println(e.getMessage());
      out.print("\n\n"+errMsg);
    } finally {
      out.close();
    }
  }
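Note the comment above: the DFSck client determines its exit code by scanning the streamed report for the HEALTHY/CORRUPT marker, which must be the last line. A minimal client-side sketch of that scan (plain Java; FsckStatusScan is my name, and the status strings are assumed to match the "is HEALTHY" / "is CORRUPT" text printed by the code above, not copied from the real constants):

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class FsckStatusScan {
    // Assumed values of NamenodeFsck's HEALTHY_STATUS / CORRUPT_STATUS.
    static final String HEALTHY = "is HEALTHY";
    static final String CORRUPT = "is CORRUPT";

    // Scan the streamed report; the status marker is on the last line,
    // so the last match wins.
    static int exitCode(BufferedReader report) throws Exception {
        int code = -1; // unknown / failure, like DFSck's default res
        String line;
        while ((line = report.readLine()) != null) {
            if (line.contains(HEALTHY)) code = 0;
            else if (line.contains(CORRUPT)) code = 1;
        }
        return code;
    }

    public static void main(String[] args) throws Exception {
        String sample = "FSCK started ...\n.\nStatus: HEALTHY\n"
                      + "The filesystem under path '/' is HEALTHY\n";
        System.out.println(exitCode(new BufferedReader(new StringReader(sample))));
        // -> 0
    }
}
```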
Here is what HdfsFileStatus carries:

   private static HdfsFileStatus createFileStatus(byte[] path, INode node) {
    // length is zero for directories
    return new HdfsFileStatus(
        node.isDirectory() ? 0 : ((INodeFile)node).computeContentSummary().getLength(), 
        node.isDirectory(), 
        node.isDirectory() ? 0 : ((INodeFile)node).getReplication(), 
        node.isDirectory() ? 0 : ((INodeFile)node).getPreferredBlockSize(),
        node.getModificationTime(),
        node.getAccessTime(),
        node.getFsPermission(),
        node.getUserName(),
        node.getGroupName(),
        path);
  }
  
OK, now for check(). It's quite long, but the idea is simple: figure out how many blocks the file has, each block's location info, and whether any blocks are corrupt. All of this comes from DataNode block reports to the NameNode, so the NameNode retrieves it trivially; block reports themselves are a topic for another day.

private void check(String parent, HdfsFileStatus file, Result res) throws IOException {
    String path = file.getFullName(parent); // just rebuilds the full path you asked to fsck (it was converted to bytes earlier for the inode lookup)
    boolean isOpen = false;

    if (file.isDir()) { // the fsck target is a directory
      byte[] lastReturnedName = HdfsFileStatus.EMPTY_NAME;
      DirectoryListing thisListing;
      if (showFiles) {
        out.println(path + " <dir>");
      }
      res.totalDirs++;
      do {
        assert lastReturnedName != null;
        thisListing = namenode.getListing(path, lastReturnedName);
        if (thisListing == null) {
          return;
        }
        HdfsFileStatus[] files = thisListing.getPartialListing();
        for (int i = 0; i < files.length; i++) {
          check(path, files[i], res);
        }
        lastReturnedName = thisListing.getLastName();
      } while (thisListing.hasMore());
      return;
    }
    long fileLen = file.getLen();
    // Get block locations without updating the file access time 
    // and without block access tokens
    LocatedBlocks blocks = namenode.getNamesystem().getBlockLocations(path, 0,
        fileLen, false, false);
    if (blocks == null) { // the file is deleted
      return;
    }
    isOpen = blocks.isUnderConstruction();
    if (isOpen && !showOpenFiles) {
      // We collect these stats about open files to report with default options
      res.totalOpenFilesSize += fileLen;
      res.totalOpenFilesBlocks += blocks.locatedBlockCount();
      res.totalOpenFiles++;
      return;
    }
    res.totalFiles++;
    res.totalSize += fileLen;
    res.totalBlocks += blocks.locatedBlockCount();
    if (showOpenFiles && isOpen) {
      out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s), OPENFORWRITE: ");
    } else if (showFiles) {
      out.print(path + " " + fileLen + " bytes, " +
        blocks.locatedBlockCount() + " block(s): ");
    } else {
      out.print('.');
    }
    if (res.totalFiles % 100 == 0) { out.println(); out.flush(); }
    int missing = 0;
    int corrupt = 0;
    long missize = 0;
    int underReplicatedPerFile = 0;
    int misReplicatedPerFile = 0;
    StringBuffer report = new StringBuffer();
    int i = 0;
    for (LocatedBlock lBlk : blocks.getLocatedBlocks()) {
      Block block = lBlk.getBlock();
      boolean isCorrupt = lBlk.isCorrupt();
      String blkName = block.toString();
      DatanodeInfo[] locs = lBlk.getLocations();
      res.totalReplicas += locs.length;
      short targetFileReplication = file.getReplication();
      if (locs.length > targetFileReplication) {
        res.excessiveReplicas += (locs.length - targetFileReplication);
        res.numOverReplicatedBlocks += 1;
      }
      // Check if block is Corrupt
      if (isCorrupt) {
        corrupt++;
        res.corruptBlocks++;
        out.print("\n" + path + ": CORRUPT block " + block.getBlockName()+"\n");
      }
      if (locs.length >= minReplication)
        res.numMinReplicatedBlocks++;
      if (locs.length < targetFileReplication && locs.length > 0) {
        res.missingReplicas += (targetFileReplication - locs.length);
        res.numUnderReplicatedBlocks += 1;
        underReplicatedPerFile++;
        if (!showFiles) {
          out.print("\n" + path + ": ");
        }
        out.println(" Under replicated " + block +
                    ". Target Replicas is " +
                    targetFileReplication + " but found " +
                    locs.length + " replica(s).");
      }
      // verify block placement policy
      int missingRacks = ReplicationTargetChooser.verifyBlockPlacement(
                    lBlk, targetFileReplication, networktopology);
      if (missingRacks > 0) {
        res.numMisReplicatedBlocks++;
        misReplicatedPerFile++;
        if (!showFiles) {
          if(underReplicatedPerFile == 0)
            out.println();
          out.print(path + ": ");
        }
        out.println(" Replica placement policy is violated for " + 
                    block +
                    ". Block should be additionally replicated on " + 
                    missingRacks + " more rack(s).");
      }
      report.append(i + ". " + blkName + " len=" + block.getNumBytes());
      if (locs.length == 0) {
        report.append(" MISSING!");
        res.addMissing(block.toString(), block.getNumBytes());
        missing++;
        missize += block.getNumBytes();
      } else {
        report.append(" repl=" + locs.length);
        if (showLocations || showRacks) {
          StringBuffer sb = new StringBuffer("[");
          for (int j = 0; j < locs.length; j++) {
            if (j > 0) { sb.append(", "); }
            if (showRacks)
              sb.append(NodeBase.getPath(locs[j]));
            else
              sb.append(locs[j]);
          }
          sb.append(']');
          report.append(" " + sb.toString());
        }
      }
      report.append('\n');
      i++;
    }
    if ((missing > 0) || (corrupt > 0)) {
      if (!showFiles && (missing > 0)) {
        out.print("\n" + path + ": MISSING " + missing
            + " blocks of total size " + missize + " B.");
      }
      res.corruptFiles++;
      switch(fixing) {
      case FIXING_NONE:
        break;
      case FIXING_MOVE:
        if (!isOpen)
          lostFoundMove(parent, file, blocks);
        break;
      case FIXING_DELETE:
        if (!isOpen)
          namenode.delete(path, true);
      }
    }
    if (showFiles) {
      if (missing > 0) {
        out.print(" MISSING " + missing + " blocks of total size " + missize + " B\n");
      }  else if (underReplicatedPerFile == 0 && misReplicatedPerFile == 0) {
        out.print(" OK\n");
      }
      if (showBlocks) {
        out.print(report.toString() + "\n");
      }
    }
  }
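Stripped of the per-block bookkeeping, check() boils down to a recursion over the namespace: directories recurse into their children, files accumulate totals into Result. A toy model of just that shape (plain Java; Node and Result here are hypothetical stand-ins for HdfsFileStatus and NamenodeFsck.Result):

```java
import java.util.List;

public class CheckShape {
    // Toy stand-in for HdfsFileStatus: a file has a length, a dir has children.
    static class Node {
        final String name; final long len; final List<Node> children;
        Node(String name, long len) { this.name = name; this.len = len; this.children = null; }
        Node(String name, List<Node> children) { this.name = name; this.len = 0; this.children = children; }
        boolean isDir() { return children != null; }
    }

    // Toy stand-in for NamenodeFsck.Result.
    static class Result { long totalDirs, totalFiles, totalSize; }

    // Same recursion shape as NamenodeFsck.check(): dirs recurse, files count.
    static void check(Node node, Result res) {
        if (node.isDir()) {
            res.totalDirs++;
            for (Node child : node.children) check(child, res);
            return;
        }
        res.totalFiles++;
        res.totalSize += node.len;
    }

    public static void main(String[] args) {
        Node root = new Node("/", List.of(
            new Node("a", 100),
            new Node("dir", List.of(new Node("b", 50)))));
        Result res = new Result();
        check(root, res);
        System.out.println(res.totalDirs + " dirs, " + res.totalFiles
            + " files, " + res.totalSize + " bytes");
        // -> 2 dirs, 2 files, 150 bytes
    }
}
```

The real check() differs mainly in that directory listings are fetched page by page from the NameNode (the lastReturnedName / hasMore loop) and each file's blocks are inspected individually.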
Summary:

fsck is an interface exposed by the NameNode itself, accessed through a servlet. Any client works, as long as it submits a request to that endpoint: the hadoop shell command does it through a small Java tool class, but you can also just paste the URL into a browser, for example:

http://10.4.19.42:50070/fsck?ugi=hadoop&path=/tmp/hadoop/wordcountjavain&files=1&blocks=1&locations=1&racks=1

which is equivalent to: hadoop fsck /tmp/hadoop/wordcountjavain -files -blocks -locations -racks (the -racks flag is optional; the rack info shows up even without it).
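That equivalence is easy to reproduce without any Hadoop dependency. The sketch below mirrors DFSck's argument loop from earlier (FsckUrl and its method names are mine; only the query-parameter scheme comes from the DFSck source above):

```java
import java.net.URLEncoder;

public class FsckUrl {
    // Mirrors DFSck: the first non-flag argument is the path to check,
    // and each recognized flag becomes a "&name=1" query parameter.
    static String build(String infoServer, String user, String[] args) throws Exception {
        StringBuilder url = new StringBuilder("http://")
            .append(infoServer).append("/fsck?ugi=").append(user).append("&path=");
        String dir = "/";
        for (String a : args) {
            if (!a.startsWith("-")) { dir = a; break; }
        }
        url.append(URLEncoder.encode(dir, "UTF-8"));
        for (String a : args) {
            if (a.equals("-move")) url.append("&move=1");
            else if (a.equals("-delete")) url.append("&delete=1");
            else if (a.equals("-files")) url.append("&files=1");
            else if (a.equals("-openforwrite")) url.append("&openforwrite=1");
            else if (a.equals("-blocks")) url.append("&blocks=1");
            else if (a.equals("-locations")) url.append("&locations=1");
            else if (a.equals("-racks")) url.append("&racks=1");
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // Note DFSck URL-encodes the path, unlike the hand-typed browser URL above.
        System.out.println(build("10.4.19.42:50070", "hadoop",
            new String[]{"/tmp/hadoop/wordcountjavain", "-files", "-blocks", "-locations", "-racks"}));
        // -> http://10.4.19.42:50070/fsck?ugi=hadoop&path=%2Ftmp%2Fhadoop%2Fwordcountjavain&files=1&blocks=1&locations=1&racks=1
    }
}
```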

The essence of fsck is that it reads the NameNode's in-memory state: the namespace managed by the NameNode's "housekeeper" FSNamesystem (via FSDirectory), together with the block information reported up by DataNodes. The inode attributes, block counts, replica counts, and corruption flags are all ready-made results; the NameNode does not go out to the DataNodes and ask them to re-check blocks. The flip side of these "ready-made" results -- i.e., whatever is in NameNode memory -- is that fsck output can be stale: after you change a file, it takes a while before fsck reflects the change accurately. For example, if I swap a block file on a DataNode for a corrupt one, the NameNode has no block report about it yet, so fsck will not immediately show the block as corrupt.

Deletes, on the other hand, do show up in fsck right away. As covered earlier, whether through trash or skipTrash, a delete marks the files for removal (keeping a ledger, with a cleanup thread periodically issuing RPCs to the relevant DataNodes to delete the blocks), and it directly updates the NameNode's in-memory namespace -- including the block information that came from block reports, which is also modified by the delete.

