Ozone Data Read Process Analysis

Preface

In the previous article, Ozone Data Write Process Analysis, I walked through how Ozone writes data. In this article, I cover the corresponding process: how Ozone reads data. Overall, the read and write paths share some common ground; both involve the concepts of Blocks, Chunks, and buffers. In terms of complexity, however, the read path is simpler and easier to follow than the write path.

Ozone's Data Read Process: Reading by Block and Chunk Offsets


If you read my previous article on Ozone's write path closely, you know that an Ozone key's data is divided into Blocks, and each Block is in turn written out in units of Chunks. Each Chunk corresponds to one chunk file on disk. A Block is an internal, virtual concept, but the Datanode Container stores the mapping from each Block to its list of Chunks.

Within a key, the data is split into multiple Blocks, and each Block's starting position naturally has a different offset in the key's global address space. For example, the second Block's offset equals the length of the first Block; more generally, a Block's offset is the sum of the lengths of all preceding Blocks. The Chunks within a Block are organized the same way.

Besides relying on offsets, the client must also fetch the Block and Chunk metadata from other services, since it does not know this information in advance. Three lookups are involved:

  • The client sends a key lookup request to OzoneManager; the returned key info contains the info of all the Blocks under that key
  • The Block stream queries the Datanode for the Block data stored in the Container db; the Block info contains the info of its Chunks
  • The Chunk stream queries the Datanode for the actual chunk data file, then loads it into its own buffer for callers to read

Putting it all together, the overall process is shown below:

[Figure: overall Ozone data read flow]

Code Analysis of the Ozone Read Path


Below, we analyze the implementation of some of the key read-related methods.

First, the client queries the OM service for the key's info:

  public OzoneInputStream readFile(String volumeName, String bucketName,
      String keyName) throws IOException {
    OmKeyArgs keyArgs = new OmKeyArgs.Builder()
        .setVolumeName(volumeName)
        .setBucketName(bucketName)
        .setKeyName(keyName)
        .setSortDatanodesInPipeline(topologyAwareReadEnabled)
        .build();
    // 1. The client queries OM for the given key's metadata, which includes
    // the key's block info, then builds an input stream from the returned key info.
    OmKeyInfo keyInfo = ozoneManagerClient.lookupFile(keyArgs);
    return createInputStream(keyInfo);
  }

Execution then reaches KeyInputStream's initialize method, which creates one Block stream object per block:

  private synchronized void initialize(String keyName,
      List<OmKeyLocationInfo> blockInfos,
      XceiverClientManager xceiverClientManager,
      boolean verifyChecksum) {
    this.key = keyName;
    this.blockOffsets = new long[blockInfos.size()];
    long keyLength = 0;
    // 2. KeyInputStream constructs a BlockInputStream for each block returned by the key lookup
    for (int i = 0; i < blockInfos.size(); i++) {
      OmKeyLocationInfo omKeyLocationInfo = blockInfos.get(i);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Adding stream for accessing {}. The stream will be " +
            "initialized later.", omKeyLocationInfo);
      }
      // 3. Construct the BlockInputStream and add it to the block stream list
      addStream(omKeyLocationInfo, xceiverClientManager,
          verifyChecksum);
      // 4. Record this BlockInputStream's offset within the whole key
      this.blockOffsets[i] = keyLength;
      // 5. Update the running key length; it becomes the next BlockInputStream's starting offset
      keyLength += omKeyLocationInfo.getLength();
    }
    this.length = keyLength;
  }
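Steps 4 and 5 above amount to a prefix-sum over the block lengths. As a standalone sketch (BlockOffsets and compute are hypothetical names, not Ozone code), the same accumulation looks like this:

```java
import java.util.Arrays;
import java.util.List;

public class BlockOffsets {

    // offsets[i] = sum of the lengths of blocks 0..i-1, i.e. block i's
    // starting position in the key's global address space
    static long[] compute(List<Long> blockLengths) {
        long[] offsets = new long[blockLengths.size()];
        long keyLength = 0;
        for (int i = 0; i < blockLengths.size(); i++) {
            offsets[i] = keyLength;           // step 4: record the global offset
            keyLength += blockLengths.get(i); // step 5: grow the running key length
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Three blocks of 256, 256 and 100 bytes
        System.out.println(Arrays.toString(compute(List.of(256L, 256L, 100L))));
        // prints [0, 256, 512]
    }
}
```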

Next is the offset-based read across blocks:

  public synchronized int read(byte[] b, int off, int len) throws IOException {
    checkOpen();
    if (b == null) {
      throw new NullPointerException();
    }
    if (off < 0 || len < 0 || len > b.length - off) {
      throw new IndexOutOfBoundsException();
    }
    if (len == 0) {
      return 0;
    }
    int totalReadLen = 0;
    // While there is still data left to read, keep reading block data
    while (len > 0) {
      // If the current index already points at the last block stream and that
      // stream has no unread data left, the whole key has been read; return.
      if (blockStreams.size() == 0 ||
          (blockStreams.size() - 1 <= blockIndex &&
              blockStreams.get(blockIndex)
                  .getRemaining() == 0)) {
        return totalReadLen == 0 ? EOF : totalReadLen;
      }

      // 1. Get the BlockInputStream we are about to read from
      BlockInputStream current = blockStreams.get(blockIndex);
      // 2. Compute how much to read next: the smaller of the remaining requested
      // length and the unread data left in the current BlockInputStream.
      int numBytesToRead = Math.min(len, (int)current.getRemaining());
      // 3. Read data from the BlockInputStream into the byte array
      int numBytesRead = current.read(b, off, numBytesToRead);
      if (numBytesRead != numBytesToRead) {
        // This implies that there is either data loss or corruption in the
        // chunk entries. Even EOF in the current stream would be covered in
        // this case.
        throw new IOException(String.format("Inconsistent read for blockID=%s "
                        + "length=%d numBytesToRead=%d numBytesRead=%d",
                current.getBlockID(), current.getLength(), numBytesToRead,
                numBytesRead));
      }
      // 4. Update the counters: total bytes read, destination offset, and remaining length
      totalReadLen += numBytesRead;
      off += numBytesRead;
      len -= numBytesRead;
      // 5. If the current block has been fully read, advance the index to the next block
      if (current.getRemaining() <= 0 &&
          ((blockIndex + 1) < blockStreams.size())) {
        blockIndex += 1;
      }
    }
    return totalReadLen;
  }
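To see the loop's mechanics without a cluster, here is a simplified, self-contained model (MultiBlockRead and readAll are hypothetical names) that reads across a list of in-memory "blocks" the same way: take the smaller of the remaining request and the current block's remaining bytes, and advance the block index once a block is exhausted.

```java
import java.io.ByteArrayInputStream;
import java.util.List;

public class MultiBlockRead {

    static byte[] readAll(List<byte[]> blocks, int totalLen) {
        byte[] out = new byte[totalLen];
        int off = 0;
        int len = totalLen;
        int blockIndex = 0;
        ByteArrayInputStream current = new ByteArrayInputStream(blocks.get(0));
        while (len > 0) {
            // Step 2 above: read at most min(remaining request, remaining in block)
            int toRead = Math.min(len, current.available());
            int n = current.read(out, off, toRead);
            off += n;
            len -= n;
            // Step 5 above: block exhausted, move to the next one (or stop at the last)
            if (current.available() == 0) {
                if (blockIndex + 1 >= blocks.size()) {
                    break;
                }
                blockIndex++;
                current = new ByteArrayInputStream(blocks.get(blockIndex));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "blocks" holding "abc" and "de"; a 5-byte read spans both
        byte[] r = readAll(List.of("abc".getBytes(), "de".getBytes()), 5);
        System.out.println(new String(r)); // prints abcde
    }
}
```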

The Block stream read invoked above in turn delegates to the Chunk stream's read, whose logic is essentially the same as the method shown here.

The other read-path method is seek:

  public synchronized void seek(long pos) throws IOException {
    checkOpen();
    if (pos == 0 && length == 0) {
      // It is possible for length and pos to be zero in which case
      // seek should return instead of throwing exception
      return;
    }
    if (pos < 0 || pos > length) {
      throw new EOFException(
          "EOF encountered at pos: " + pos + " for key: " + key);
    }

    // 1. Update the block index
    if (blockIndex >= blockStreams.size()) {
      // If the index is past the last block, binary-search the whole blockOffsets array
      blockIndex = Arrays.binarySearch(blockOffsets, pos);
    } else if (pos < blockOffsets[blockIndex]) {
      // If the target position is before the current block's offset, narrow the
      // search to the index range [0, blockIndex)
      blockIndex =
          Arrays.binarySearch(blockOffsets, 0, blockIndex, pos);
    } else if (pos >= blockOffsets[blockIndex] + blockStreams
        .get(blockIndex).getLength()) {
      // Otherwise, search the remaining index range [blockIndex + 1, blockStreams.size())
      blockIndex = Arrays
          .binarySearch(blockOffsets, blockIndex + 1,
              blockStreams.size(), pos);
    }
    if (blockIndex < 0) {
      // Binary search returns -insertionPoint - 1  if element is not present
      // in the array. insertionPoint is the point at which element would be
      // inserted in the sorted array. We need to adjust the blockIndex
      // accordingly so that blockIndex = insertionPoint - 1
      blockIndex = -blockIndex - 2;
    }

    // 2. Reset the position of the BlockInputStream that the previous seek landed on
    blockStreams.get(blockIndexOfPrevPosition).resetPosition();

    // 3. Reset the positions of all blocks after the current block index
    for (int index =  blockIndex + 1; index < blockStreams.size(); index++) {
      blockStreams.get(index).seek(0);
    }
    // 4. Seek the current block to (target position - this block's global offset)
    blockStreams.get(blockIndex).seek(pos - blockOffsets[blockIndex]);
    blockIndexOfPrevPosition = blockIndex;
  }
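The -blockIndex - 2 adjustment is easy to verify in isolation. This hypothetical sketch (blockIndexFor is not an Ozone method) applies Arrays.binarySearch to the sorted block start offsets the same way seek() does:

```java
import java.util.Arrays;

public class SeekIndex {

    // Find the index of the block containing pos, given sorted block start offsets
    static int blockIndexFor(long[] blockOffsets, long pos) {
        int idx = Arrays.binarySearch(blockOffsets, pos);
        if (idx < 0) {
            // binarySearch returns -(insertionPoint) - 1 when pos is not an
            // exact block start; the containing block is insertionPoint - 1
            idx = -idx - 2;
        }
        return idx;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 256, 512}; // three blocks starting at these positions
        System.out.println(blockIndexFor(offsets, 0));   // prints 0 (exact block start)
        System.out.println(blockIndexFor(offsets, 300)); // prints 1 (inside the second block)
        System.out.println(blockIndexFor(offsets, 512)); // prints 2 (start of the third block)
    }
}
```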

Because the Block stream's internal read logic is largely the same as the Key stream's, I skip it here and move straight to how the Chunk stream reads data from its buffers.

The Chunk stream's read operation:

  public synchronized int read(byte[] b, int off, int len) throws IOException {
    // According to the JavaDocs for InputStream, it is recommended that
    // subclasses provide an override of bulk read if possible for performance
    // reasons.  In addition to performance, we need to do it for correctness
    // reasons.  The Ozone REST service uses PipedInputStream and
    // PipedOutputStream to relay HTTP response data between a Jersey thread and
    // a Netty thread.  It turns out that PipedInputStream/PipedOutputStream
    // have a subtle dependency (bug?) on the wrapped stream providing separate
    // implementations of single-byte read and bulk read.  Without this, get key
    // responses might close the connection before writing all of the bytes
    // advertised in the Content-Length.
    if (b == null) {
      throw new NullPointerException();
    }
    if (off < 0 || len < 0 || len > b.length - off) {
      throw new IndexOutOfBoundsException();
    }
    if (len == 0) {
      return 0;
    }
    checkOpen();
    int total = 0;
    while (len > 0) {
      // 1. Prepare up to len bytes of data in the buffers
      int available = prepareRead(len);
      if (available == EOF) {
        // There is no more data in the chunk stream. The buffers should have
        // been released by now
        Preconditions.checkState(buffers == null);
        return total != 0 ? total : EOF;
      }
      // 2. Copy data from the buffer into the destination array; this advances the buffer's position by available bytes
      buffers.get(bufferIndex).get(b, off + total, available);
      // 3. Update the remaining length and the running total
      len -= available;
      total += available;
    }

    // 4. If the end of the chunk has been reached, release the buffers
    if (chunkStreamEOF()) {
      // smart consumers determine EOF by calling getPos()
      // so we release buffers when serving the final bytes of data
      releaseBuffers();
    }

    return total;
  }

prepareRead loads chunk data from the Datanode into the buffers:

  private synchronized int prepareRead(int len) throws IOException {
    for (;;) {
      if (chunkPosition >= 0) {
        if (buffersHavePosition(chunkPosition)) {
          // The current buffers have the seeked position. Adjust the buffer
          // index and position to point to the chunkPosition.
          adjustBufferPosition(chunkPosition - bufferOffset);
        } else {
          // Read a required chunk data to fill the buffers with seeked
          // position data
          readChunkFromContainer(len);
        }
      }
      // If no position was seeked on this chunk, check whether the current buffers hold data
      if (buffersHaveData()) {
        // Data is available from buffers
        ByteBuffer bb = buffers.get(bufferIndex);
        return len > bb.remaining() ? bb.remaining() : len;
      } else if (dataRemainingInChunk()) {
        // The current buffers hold no data but the chunk still has unread
        // data, so read chunk data into the buffers
        readChunkFromContainer(len);
      } else {
        // All available input from this chunk stream has been consumed.
        return EOF;
      }
    }
  }

After the read loop, the chunkStreamEOF method seen above checks whether the read position has reached the end of the chunk:

  /**
   * Checks whether the end of the chunk has been reached.
   */
  private boolean chunkStreamEOF() {
    if (!allocated) {
      // Chunk data has not been read yet
      return false;
    }

    // The read position has reached the end of the chunk only if both hold:
    // 1) the buffers hold no more data
    // 2) there is no unread data remaining in the chunk
    if (buffersHaveData() || dataRemainingInChunk()) {
      return false;
    } else {
      Preconditions.checkState(bufferOffset + bufferLength == length,
          "EOF detected, but not at the last byte of the chunk");
      return true;
    }
  }

The Chunk stream uses ByteBuffers to cut down on frequent small I/O reads and improve efficiency.
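The copy in step 2 of read() is a standard relative bulk get on a ByteBuffer: it transfers bytes into the destination array and advances the buffer's position, which is what lets consecutive read() calls resume where the previous one stopped. A minimal illustration (plain JDK, not Ozone code):

```java
import java.nio.ByteBuffer;

public class BufferReadDemo {
    public static void main(String[] args) {
        // Pretend this buffer holds a chunk's data, as loaded by readChunkFromContainer
        ByteBuffer buf = ByteBuffer.wrap("chunkdata".getBytes());
        byte[] dest = new byte[9];

        // Relative bulk get: copies 5 bytes and moves the position forward by 5
        buf.get(dest, 0, 5);
        System.out.println(buf.position() + " " + buf.remaining()); // prints 5 4

        // A second get picks up exactly where the first stopped
        buf.get(dest, 5, 4);
        System.out.println(new String(dest)); // prints chunkdata
    }
}
```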

That concludes the analysis of Ozone's data read process. The core idea is reading data across Blocks and Chunks based on data offsets.
