Hadoop Source Code Analysis Notes (7): HDFS Non-RPC Interfaces

HDFS Non-RPC Interfaces

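The Network File System (NFS) was developed by Sun on top of remote procedure calls (RPC). All of its file operations, both the file/directory API and the reading and writing of file data, are implemented as remote procedure calls. A client accesses the file system through the system calls of its local operating system. When the virtual file system sees that a call needs to reach NFS, for example to create a directory on a remote server or to read a file, it hands the operation to the NFS client component, which contacts the corresponding NFS server over RPC.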

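As the previous note showed, particularly in the analysis of ClientProtocol, HDFS follows the NFS approach for file and directory operations: the file/directory API is sent to the name node for execution through Hadoop's remote procedure calls. For reading and writing file data, however, HDFS takes a completely different approach from NFS.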

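HDFS does not use remote procedure calls to read and write file data, and the reason is simple: HDFS has to support very large files, and file I/O built on Hadoop IPC would not reach the efficiency the system design calls for. HDFS also offers streaming access to data, and a TCP-based streaming interface suits batch processing and raises data throughput.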

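The secondary name node merges the namespace image and the edit log held on the name node according to a checkpoint policy. NamenodeProtocol, however, has no remote method for fetching the original image and edit log, and none for uploading the merged image. Both steps go through an HTTP-based streaming interface provided by the name node. The secondary name node uses HTTP GET against the name node's embedded HTTP server to fetch the original namespace image and edit log. After merging, it notifies the name node over HTTP, and the name node then uses HTTP GET to download the new namespace image from the secondary name node. This interface between the name node and the secondary name node is therefore also a non-IPC interface.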

Non-IPC interfaces on the data node

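A data node provides read and write access to HDFS block data: it writes HDFS file data into files on the local Linux file system and reads HDFS file data back from those files. The external interface for these reads and writes is a TCP-based non-IPC interface. Besides block reads and writes, the data node also offers TCP-based interfaces for block replacement, block copy, and block checksum retrieval. In total the data node exposes five operations through its streaming interface, each with its own request frame layout and opcode.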

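The opcodes of the data node's streaming interface are defined in the DataTransferProtocol interface (in the org.apache.hadoop.hdfs.protocol package). The interface is defined as follows: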

public interface DataTransferProtocol {
  // Opcodes are defined below
  
  /** Version for data transfers between clients and datanodes
   * This should change when serialization of DatanodeInfo, not just
   * when protocol changes. It is not very obvious. 
   */
  /*
   * Version 18:
   *    Change the block packet ack protocol to include seqno,
   *    numberOfReplies, reply0, reply1, ...
   */
  public static final int DATA_TRANSFER_VERSION = 17;

  // Processed at datanode stream-handler: opcodes for reading and writing blocks
  public static final byte OP_WRITE_BLOCK = (byte) 80;
  public static final byte OP_READ_BLOCK = (byte) 81;
  /**
   * @deprecated As of version 15, OP_READ_METADATA is no longer supported
   */
  @Deprecated public static final byte OP_READ_METADATA = (byte) 82;
  public static final byte OP_REPLACE_BLOCK = (byte) 83;
  public static final byte OP_COPY_BLOCK = (byte) 84;
  public static final byte OP_BLOCK_CHECKSUM = (byte) 85;
  
  public static final int OP_STATUS_SUCCESS = 0;  
  public static final int OP_STATUS_ERROR = 1;  
  public static final int OP_STATUS_ERROR_CHECKSUM = 2;  
  public static final int OP_STATUS_ERROR_INVALID = 3;  
  public static final int OP_STATUS_ERROR_EXISTS = 4;  
  public static final int OP_STATUS_ERROR_ACCESS_TOKEN = 5;
  public static final int OP_STATUS_CHECKSUM_OK = 6;
  ......
}
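
Since every request on this interface is identified by a single opcode byte, a receiver or a log formatter could map that byte back to a readable name. The following is a minimal sketch under that assumption: the class OpCodes and its method name are hypothetical and not part of Hadoop, and only the numeric values are taken from the interface quoted above.

```java
/** Hypothetical helper (not part of Hadoop) that names the opcodes quoted above. */
final class OpCodes {

    /** Returns the constant name for an opcode byte, or a marker for unknown values. */
    static String name(byte op) {
        switch (op) {
            case 80: return "OP_WRITE_BLOCK";
            case 81: return "OP_READ_BLOCK";
            case 82: return "OP_READ_METADATA (deprecated)";
            case 83: return "OP_REPLACE_BLOCK";
            case 84: return "OP_COPY_BLOCK";
            case 85: return "OP_BLOCK_CHECKSUM";
            default: return "UNKNOWN(" + op + ")";
        }
    }

    public static void main(String[] args) {
        // Print the name of every opcode in the range used by the interface.
        for (byte op = 80; op <= 85; op++) {
            System.out.println(op + " -> " + name(op));
        }
    }
}
```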

The data node's read and write flows are described next.

1. Reading data

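Reading data means reading a range of file data from a block on a data node. From the opcode definitions above, its opcode is OP_READ_BLOCK (81). When a client needs to read data, it sends a request over its TCP connection to the data node. TCP is a byte stream with no notion of message boundaries, so a frame format has to be defined on top of the stream, and the two sides exchange information by reading frames. The client-side code that reads a block (org.apache.hadoop.hdfs.DFSClient.BlockReader) is excerpted below: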

/********************************************************
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 *
 ********************************************************/
public class DFSClient implements FSConstants, java.io.Closeable {
  // Block reader class
  /** This is a wrapper around connection to datanode
   * and understands checksum, offset etc
   */
  public static class BlockReader extends FSInputChecker {
	public static BlockReader newBlockReader( Socket sock, String file,
                                       long blockId, 
                                       Token<BlockTokenIdentifier> accessToken,
                                       long genStamp,
                                       long startOffset, long len,
                                       int bufferSize, boolean verifyChecksum,
                                       String clientName)
                                       throws IOException {
      // in and out will be closed when sock is closed (by the caller)
      DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(NetUtils.getOutputStream(sock,HdfsConstants.WRITE_TIMEOUT)));
      // Request fields sent to read a block
      //write the header.
      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
      out.write( DataTransferProtocol.OP_READ_BLOCK );
      out.writeLong( blockId );
      out.writeLong( genStamp );
      out.writeLong( startOffset );
      out.writeLong( len );
      Text.writeString(out, clientName);
      accessToken.write(out);
      out.flush();
       DataInputStream in = new DataInputStream(
          new BufferedInputStream(NetUtils.getInputStream(sock), 
                                  bufferSize));
      
      short status = in.readShort();
      if (status != DataTransferProtocol.OP_STATUS_SUCCESS) {
        if (status == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
          throw new InvalidBlockTokenException(
              "Got access token error for OP_READ_BLOCK, self="
                  + sock.getLocalSocketAddress() + ", remote="
                  + sock.getRemoteSocketAddress() + ", for file " + file
                  + ", for block " + blockId + "_" + genStamp);
        } else {
          throw new IOException("Got error for OP_READ_BLOCK, self="
              + sock.getLocalSocketAddress() + ", remote="
              + sock.getRemoteSocketAddress() + ", for file " + file
              + ", for block " + blockId + "_" + genStamp);
        }
      }
}}
  ......
}

In the code above, the part to focus on is where newBlockReader writes the request header.

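The first two fields of the request are the interface version (DataTransferProtocol.DATA_TRANSFER_VERSION) and the opcode (DataTransferProtocol.OP_READ_BLOCK). The version number ensures that both sides interpret the frame the same way, and the one-byte opcode that follows states the purpose of the request. These two fields appear in every request on the data node's streaming interface.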

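Next come the block ID (blockId) and the block's generation stamp (genStamp, a version number for the block rather than a wall-clock creation time). With these two parameters the data node can identify the target block. Just as an ordinary file read has to state where to start and how much to read, a block read supplies an offset (startOffset) and a length (len). These four parameters tell the data node exactly which data the client is asking for.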

The client name (clientName) that follows is a string used only in log output, and the access token (accessToken) is there for Hadoop's security checks.
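
The field order of the read request can be reproduced with plain java.io streams. The sketch below is an illustration only: the class ReadRequestFrame is hypothetical, writeUTF stands in for Hadoop's Text.writeString, and the access token is left out, so the bytes are not wire-compatible with a real data node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical model of the OP_READ_BLOCK request header (illustration only). */
final class ReadRequestFrame {
    static final short DATA_TRANSFER_VERSION = 17; // value quoted in DataTransferProtocol
    static final byte OP_READ_BLOCK = (byte) 81;

    final long blockId;
    final long genStamp;
    final long startOffset;
    final long len;
    final String clientName;

    ReadRequestFrame(long blockId, long genStamp, long startOffset, long len, String clientName) {
        this.blockId = blockId;
        this.genStamp = genStamp;
        this.startOffset = startOffset;
        this.len = len;
        this.clientName = clientName;
    }

    /** Serializes the fields in the same order in which newBlockReader writes them. */
    byte[] encode() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeShort(DATA_TRANSFER_VERSION);
            out.write(OP_READ_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeLong(startOffset);
            out.writeLong(len);
            out.writeUTF(clientName); // assumption: stand-in for Text.writeString
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses a frame the way a receiver would: version first, then the opcode, then the fields. */
    static ReadRequestFrame decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            short version = in.readShort();
            if (version != DATA_TRANSFER_VERSION) {
                throw new IllegalArgumentException("Version mismatch: " + version);
            }
            byte op = in.readByte();
            if (op != OP_READ_BLOCK) {
                throw new IllegalArgumentException("Unexpected opcode: " + op);
            }
            long blockId = in.readLong();
            long genStamp = in.readLong();
            long startOffset = in.readLong();
            long len = in.readLong();
            String clientName = in.readUTF();
            return new ReadRequestFrame(blockId, genStamp, startOffset, len, clientName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the receiver knows the width of every fixed field and the string carries its own length prefix, it can tell where the frame ends without any delimiter on the TCP stream.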

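The response to a read request is framed as well: first a reply header (for example OP_STATUS_SUCCESS), then a series of data packets. To protect data integrity, HDFS keeps checksum information for every block. Checksums are computed per chunk, and the chunk size defaults to 512 bytes, so starting from the beginning of the block every 512 bytes yield one 4-byte checksum. Reading and writing a block therefore has to reconcile stream-oriented data access with chunk-oriented checksums.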
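
The relationship between a byte range and its chunk checksums comes down to a little arithmetic. The sketch below works it through under stated assumptions: ChunkChecksums is a hypothetical helper rather than Hadoop's own class, CRC32 is assumed as the checksum algorithm, and rounding a read offset down to a chunk boundary is shown only to illustrate why a stream offset and a chunk boundary have to be reconciled.

```java
import java.util.zip.CRC32;

/** Hypothetical helper illustrating chunk-based checksums (not Hadoop's own class). */
final class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // default chunk size named in the text
    static final int CHECKSUM_SIZE = 4;        // one 4-byte checksum per chunk

    /** Number of chunks needed to cover dataLen bytes; the last chunk may be partial. */
    static long chunkCount(long dataLen) {
        return (dataLen + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    }

    /** Total size of the checksum data that accompanies dataLen bytes. */
    static long checksumBytes(long dataLen) {
        return chunkCount(dataLen) * CHECKSUM_SIZE;
    }

    /** Rounds an offset down to the start of its chunk, because a checksum covers a whole chunk. */
    static long chunkAlignedStart(long offset) {
        return offset - (offset % BYTES_PER_CHECKSUM);
    }

    /** Computes one CRC32 value per chunk (assumption: CRC32 as the checksum algorithm). */
    static int[] checksums(byte[] data) {
        int n = (int) chunkCount(data.length);
        int[] sums = new int[n];
        CRC32 crc = new CRC32();
        for (int i = 0; i < n; i++) {
            int start = i * BYTES_PER_CHECKSUM;
            int length = Math.min(BYTES_PER_CHECKSUM, data.length - start);
            crc.reset();
            crc.update(data, start, length);
            sums[i] = (int) crc.getValue(); // keep the low 32 bits as the 4-byte checksum
        }
        return sums;
    }
}
```

For a 64 MB block (67,108,864 bytes), chunkCount gives 131,072 chunks and checksumBytes gives 524,288 bytes, that is, 512 KB of checksum data.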

2. Writing data

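Writing is far more complex than reading. The operation appends data to a block on a data node, and its opcode is OP_WRITE_BLOCK (80). Before looking at the write itself, consider the data pipeline that HDFS uses for writes.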

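The data pipeline was already introduced by Google in its distributed file system (GFS). Its goal is to use the bandwidth of every machine in the cluster when writing multiple replicas of the same data, avoiding network bottlenecks and high-latency links and minimizing the time needed to push all the data. The Hadoop file system implements a data pipeline as well.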

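Suppose the replication factor of the file being written is 3, so three data nodes in the HDFS cluster will each hold one replica. The client does not write to the three data nodes at once. It sends the data to the first data node, which stores it locally and at the same time pushes it on to data node 2, and so on until the last data node in the pipeline. Acknowledgement packets originate at the last data node and travel back upstream toward the client; each node along the way passes the acknowledgement upstream only after confirming that its local write succeeded. Compared with having the client write to several data nodes at once, each node in the pipeline carries part of the network traffic of the write, which reduces the load that sending multiple copies would put on the client's network link. The client-side code for writing a block (org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.ResponseProcessor) is excerpted below: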

/********************************************************
 * DFSClient can connect to a Hadoop Filesystem and 
 * perform basic file tasks.  It uses the ClientProtocol
 * to communicate with a NameNode daemon, and connects 
 * directly to DataNodes to read/write block data.
 *
 * Hadoop DFS users should obtain an instance of 
 * DistributedFileSystem, which uses DFSClient to handle
 * filesystem tasks.
 *
 ********************************************************/
public class DFSClient implements FSConstants, java.io.Closeable {
   ......
     /****************************************************************
   * DFSOutputStream creates files from a stream of bytes.
   *
   * The client application writes data that is cached internally by
   * this stream. Data is broken up into packets, each packet is
   * typically 64K in size. A packet comprises of chunks. Each chunk
   * is typically 512 bytes and has an associated checksum with it.
   *
   * When a client application fills up the currentPacket, it is
   * enqueued into dataQueue.  The DataStreamer thread picks up
   * packets from the dataQueue, sends it to the first datanode in
   * the pipeline and moves it from the dataQueue to the ackQueue.
   * The ResponseProcessor receives acks from the datanodes. When an
   * successful ack for a packet is received from all datanodes, the
   * ResponseProcessor removes the corresponding packet from the
   * ackQueue.
   *
   * In case of error, all outstanding packets and moved from
   * ackQueue. A new pipeline is setup by eliminating the bad
   * datanode from the original pipeline. The DataStreamer now
   * starts sending packets from the dataQueue.
  ****************************************************************/
  class DFSOutputStream extends FSOutputSummer implements Syncable {
    //
    // Processes reponses from the datanodes.  A packet is removed 
    // from the ackQueue when its response arrives.
    //
    private class ResponseProcessor extends Thread {
   ......
   // connects to the first datanode in the pipeline
    // Returns true if success, otherwise return failure.
    //
    private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                    boolean recoveryFlag) {
      short pipelineStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
      String firstBadLink = "";
      if (LOG.isDebugEnabled()) {
        for (int i = 0; i < nodes.length; i++) {
          LOG.debug("pipeline = " + nodes[i].getName());
        }
      }

      // persist blocks on namenode on next flush
      persistBlocks = true;

      boolean result = false;
      try {
        LOG.debug("Connecting to " + nodes[0].getName());
        InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
        s = socketFactory.createSocket();
        timeoutValue = 3000 * nodes.length + socketTimeout;
        NetUtils.connect(s, target, timeoutValue);
        s.setSoTimeout(timeoutValue);
        s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        LOG.debug("Send buf size " + s.getSendBufferSize());
        long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                            datanodeWriteTimeout;

        //
        // Xmit header info to datanode
        //
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), 
                                     DataNode.SMALL_BUFFER_SIZE));
        blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
        // Write the request header: version, opcode, block ID, generation stamp
        out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
        out.write( DataTransferProtocol.OP_WRITE_BLOCK );
        out.writeLong( block.getBlockId() );
        out.writeLong( block.getGenerationStamp() );
        out.writeInt( nodes.length );
        out.writeBoolean( recoveryFlag );       // recovery flag
        Text.writeString( out, client );
        out.writeBoolean(false); // Not sending src node information
        out.writeInt( nodes.length - 1 );
        for (int i = 1; i < nodes.length; i++) {
          nodes[i].write(out);
        }
        accessToken.write(out);
        checksum.writeHeader( out );
        out.flush();

        // receive ack for connect
        pipelineStatus = blockReplyStream.readShort();
        firstBadLink = Text.readString(blockReplyStream);
        if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
            throw new InvalidBlockTokenException(
                "Got access token error for connect ack with firstBadLink as "
                    + firstBadLink);
          } else {
            throw new IOException("Bad connect ack with firstBadLink as "
                + firstBadLink);
          }
        }

        blockStream = out;
        result = true;     // success

      } catch (IOException ie) {

        LOG.info("Exception in createBlockOutputStream " + nodes[0].getName() +
            " " + ie);

        // find the datanode that matches
        if (firstBadLink.length() != 0) {
          for (int i = 0; i < nodes.length; i++) {
            if (nodes[i].getName().equals(firstBadLink)) {
              errorIndex = i;
              break;
            }
          }
        }
        hasError = true;
        setLastException(ie);
        blockReplyStream = null;
        result = false;
      } finally {
        if (!result) {
          IOUtils.closeSocket(s);
          s = null;
        }
      }
      return result;
    }
   ......
}}
   ......
}

The code with which one DataNode writes a block to another DataNode (org.apache.hadoop.hdfs.server.datanode.DataNode.DataTransfer) is excerpted below:

/**********************************************************
 * DataNode is a class (and program) that stores a set of
 * blocks for a DFS deployment.  A single deployment can
 * have one or many DataNodes.  Each DataNode communicates
 * regularly with a single NameNode.  It also communicates
 * with client code and other DataNodes from time to time.
 *
 * DataNodes store a series of named blocks.  The DataNode
 * allows client code to read these blocks, or to write new
 * block data.  The DataNode may also, in response to instructions
 * from its NameNode, delete blocks or copy blocks to/from other
 * DataNodes.
 *
 * The DataNode maintains just one critical table:
 *   block-> stream of bytes (of BLOCK_SIZE or less)
 *
 * This info is stored on a local disk.  The DataNode
 * reports the table's contents to the NameNode upon startup
 * and every so often afterwards.
 *
 * DataNodes spend their lives in an endless loop of asking
 * the NameNode for something to do.  A NameNode cannot connect
 * to a DataNode directly; a NameNode simply returns values from
 * functions invoked by a DataNode.
 *
 * DataNodes maintain an open server socket so that client code 
 * or other DataNodes can read/write data.  The host/port for
 * this server is reported to the NameNode, which then sends that
 * information to clients or other DataNodes that might be interested.
 *
 **********************************************************/
public class DataNode extends Configured 
{ 
 ......
/**
   * Used for transferring a block of data.  This class
   * sends a piece of data to another DataNode.
   */
  class DataTransfer implements Runnable {
    DatanodeInfo targets[];
    Block b;
    DataNode datanode;

     /**
     * Do the deed, write the bytes
     */
    public void run() {
      xmitsInProgress.getAndIncrement();
      Socket sock = null;
      DataOutputStream out = null;
      BlockSender blockSender = null;
      
      try {
        InetSocketAddress curTarget = 
          NetUtils.createSocketAddr(targets[0].getName());
        sock = newSocket();
        NetUtils.connect(sock, curTarget, socketTimeout);
        sock.setSoTimeout(targets.length * socketTimeout);

        long writeTimeout = socketWriteTimeout + 
                            HdfsConstants.WRITE_TIMEOUT_EXTENSION * (targets.length-1);
        OutputStream baseStream = NetUtils.getOutputStream(sock, writeTimeout);
        out = new DataOutputStream(new BufferedOutputStream(baseStream, 
                                                            SMALL_BUFFER_SIZE));

        blockSender = new BlockSender(b, 0, b.getNumBytes(), false, false, false, 
            datanode);
        DatanodeInfo srcNode = new DatanodeInfo(dnRegistration);

        //
        // Header info
        //
        out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
        out.writeByte(DataTransferProtocol.OP_WRITE_BLOCK);
        out.writeLong(b.getBlockId());
        out.writeLong(b.getGenerationStamp());
        out.writeInt(0);           // no pipelining
        out.writeBoolean(false);   // not part of recovery
        Text.writeString(out, ""); // client
        out.writeBoolean(true); // sending src node information
        srcNode.write(out); // Write src node DatanodeInfo
        // write targets
        out.writeInt(targets.length - 1);
        for (int i = 1; i < targets.length; i++) {
          targets[i].write(out);
        }
        Token<BlockTokenIdentifier> accessToken = BlockTokenSecretManager.DUMMY_TOKEN;
        if (isBlockTokenEnabled) {
          accessToken = blockTokenSecretManager.generateToken(null, b,
              EnumSet.of(BlockTokenSecretManager.AccessMode.WRITE));
        }
        accessToken.write(out);
        // send data & checksum
        blockSender.sendBlock(out, baseStream, null);

        // no response necessary
        LOG.info(dnRegistration + ":Transmitted block " + b + " to " + curTarget);

      } catch (IOException ie) {
        LOG.warn(dnRegistration + ":Failed to transfer " + b + " to " + targets[0].getName()
            + " got " + StringUtils.stringifyException(ie));
        // check if there are any disk problem
        datanode.checkDiskError();
        
      } finally {
        xmitsInProgress.getAndDecrement();
        IOUtils.closeStream(blockSender);
        IOUtils.closeStream(out);
        IOUtils.closeSocket(sock);
      }
    }

}
   ......
}

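As the source above shows, a write request starts like a read request: the first two fields are the interface version and the opcode, followed by the block ID and generation stamp. Unlike a read request, a write request has no offset field, which means a user can only append data to the end of a block and cannot modify file content that has already been written. Next comes the size of the data pipeline (nodes.length), that is, the number of data nodes the data has to be written to, and then a flag (isRecovery) that says whether this write is part of error recovery. If the data source is a client, the request then carries the client's name. If the data source is a data node, the client string is empty and the hasSrcDataNode flag is set to true; in that case the next field is srcDataNode, the serialized DatanodeInfo object of the source data node. A write request whose source is a data node indicates that the data node is performing block replication.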

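To support the data pipeline, a write request includes the fields numTargets and targets: targets is the list of downstream destinations and numTargets is the size of that list. If numTargets is zero, the current data node is the last node in the pipeline. If numTargets is greater than zero, the first entry in the list is the downstream node that the current data node pushes data to. Taking the pipeline above as an example, the client, data node 1 and data node 2 each send a write request to their downstream node. In the request sent by the client, numTargets is 2 and the target list contains data node 2 and data node 3. Using numTargets and targets in this way, the TCP connections between the nodes of the pipeline are established, and the channel for the subsequent data writes is ready.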
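
How the targets list shrinks from hop to hop can be sketched in a few lines. This is a simplified model under stated assumptions: the class PipelineTargets is hypothetical, nodes are represented by plain strings instead of DatanodeInfo objects, and no sockets are opened.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of how numTargets and targets change along the pipeline. */
final class PipelineTargets {

    /** Returns the node to connect to next, or null if this node is the last one. */
    static String nextHop(List<String> targets) {
        return targets.isEmpty() ? null : targets.get(0);
    }

    /** Returns the targets list that is forwarded to the next hop (everything after the first entry). */
    static List<String> forwardedTargets(List<String> targets) {
        if (targets.isEmpty()) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(targets.subList(1, targets.size()));
    }

    public static void main(String[] args) {
        // The client connects to datanode1 itself and sends the rest of the pipeline as targets.
        List<String> targets = Arrays.asList("datanode2", "datanode3");
        String current = "datanode1";
        while (current != null) {
            System.out.println(current + " receives numTargets=" + targets.size()
                + " targets=" + targets);
            current = nextHop(targets);
            targets = forwardedTargets(targets);
        }
    }
}
```

Each node only needs the first entry to know where to connect, and it passes the remainder on unchanged, so the node that receives an empty list knows it is the end of the pipeline.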

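Each node in the pipeline returns the result of the write to its upstream node, in reverse order. If every data node in the pipeline writes to disk successfully, the final result is DataTransferProtocol.OP_STATUS_SUCCESS; otherwise it is DataTransferProtocol.OP_STATUS_ERROR.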
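
The rule that success is reported only when every node succeeded can be modelled as a fold from the last node back to the client. The sketch below is a deliberate simplification: the class PipelineAck is hypothetical, and it collapses the acknowledgement to a single status value, whereas the version comment in DataTransferProtocol indicates that the real ack carries a sequence number and one reply per node.

```java
/** Hypothetical model of how a write status travels upstream through the pipeline. */
final class PipelineAck {
    static final int OP_STATUS_SUCCESS = 0; // values quoted from DataTransferProtocol
    static final int OP_STATUS_ERROR = 1;

    /**
     * Status a node reports upstream: success only if its own write succeeded
     * and everything downstream of it also reported success.
     */
    static int upstreamStatus(boolean localWriteOk, int downstreamStatus) {
        if (localWriteOk && downstreamStatus == OP_STATUS_SUCCESS) {
            return OP_STATUS_SUCCESS;
        }
        return OP_STATUS_ERROR;
    }

    /** Folds the results from the last node back to the first; localOk[i] is node i's local result. */
    static int statusSeenByClient(boolean[] localOk) {
        int status = OP_STATUS_SUCCESS; // the last node has no downstream node to wait for
        for (int i = localOk.length - 1; i >= 0; i--) {
            status = upstreamStatus(localOk[i], status);
        }
        return status;
    }
}
```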

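The data pipeline is an optimization HDFS applies to writes of very large data sets. Before any data is written, the nodes on the pipeline set it up from the write requests sent by their upstream nodes, and the pipeline is then used for the actual writes that follow. The data packets that travel down the pipeline during a write reuse the packet format of the read operation. Every data packet has a matching acknowledgement packet, which ensures the data reaches each node and lets the data source learn the outcome on every node.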

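The read and write interfaces are the most important streaming interfaces on an HDFS data node. Besides these two, there are other TCP stream interfaces such as block replacement and block copy.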

Non-IPC interfaces on the name node and secondary name node

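The fsimage file (the namespace image) and the edits file (the edit log) produced by the name node have to be transferred to and from the secondary name node. Because both files are fairly large, this exchange uses neither Hadoop IPC nor the data node's TCP-based mechanism, but an HTTP-based streaming interface.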
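
The benefit of a streaming HTTP transfer for large files is that the receiver never has to hold the whole file in memory. The sketch below illustrates that idea with the standard java.net and java.nio APIs only. HttpStreamFetcher is a hypothetical class, not Hadoop's transfer code, and the URL is left as an input because the servlet path and query parameters of the name node's HTTP server are not covered in this note.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: streams an HTTP GET response body into a local file. */
final class HttpStreamFetcher {

    /** Copies a stream in fixed-size buffers, so memory use does not depend on the file size. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Issues an HTTP GET and writes the body to target; returns the number of bytes stored. */
    static long download(URL source, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) source.openConnection();
        conn.setRequestMethod("GET");
        try {
            int code = conn.getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + code + " from " + source);
            }
            try (InputStream in = conn.getInputStream();
                 OutputStream out = Files.newOutputStream(target)) {
                return copy(in, out);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```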

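Copyright notice: parts of this article are excerpted from the book 【Hadoop技術內幕深入解析Hadoop Common和HDFS架構設計與實現原理】 (Hadoop Internals: an in-depth analysis of the architecture, design and implementation of Hadoop Common and HDFS) by Cai Bin (蔡斌) and Chen Xiangping (陳湘萍). They are kept here only as study notes for technical exchange, and the commercial rights remain with the original authors. Readers are encouraged to buy the book. When reposting, please credit the original authors. Thank you!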

劍邑龍泉
