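Introduction
In the previous article, "Ozone數據寫入過程分析" (an analysis of the Ozone data write path), I walked through how Ozone writes data. This article covers the other direction: how data is read. The read and write paths have a fair amount in common, since both revolve around the concepts of Block, Chunk, and buffer. In terms of complexity, though, the read path is simpler and easier to follow than the write path.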
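The Ozone read path: reading data by Block and Chunk offsets
If you read my previous article on the Ozone write path closely, you will know that the data of an Ozone key is divided into Blocks, and that each Block is in turn written out in units of Chunks. Each Chunk corresponds to one chunk file. A Block, by contrast, is an internal, virtual concept, although the Datanode Container does store the mapping from each Block to the list of Chunks that belong to it.
Within a key, the data is split into segments, i.e. multiple Blocks, so each Block starts at a different offset in the key's global address space. The first Block starts at offset 0, the second Block's offset equals the length of the first, and in general a Block's offset is the sum of the lengths of all Blocks before it. The Chunks inside a Block are organized in the same way.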
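To make the offset rule concrete, here is a minimal sketch of computing per-Block start offsets from Block lengths. The class name `BlockOffsets`, the method `computeOffsets`, and the sample lengths are my own illustrative assumptions, not Ozone code.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Ozone.
final class BlockOffsets {

  /** Returns the start offset of each block inside the key (a running sum of lengths). */
  static long[] computeOffsets(long[] blockLengths) {
    long[] offsets = new long[blockLengths.length];
    long running = 0;
    for (int i = 0; i < blockLengths.length; i++) {
      offsets[i] = running;        // block i starts where the previous blocks end
      running += blockLengths[i];  // becomes the start offset of block i + 1
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Assumed sample: three blocks of 256, 256 and 100 bytes.
    long[] lengths = {256L, 256L, 100L};
    System.out.println(Arrays.toString(computeOffsets(lengths))); // prints [0, 256, 512]
  }
}
```

The same running-sum idea applies one level down, to the Chunks inside a single Block.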
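Besides relying on offsets, a read also has to fetch Block and Chunk information from other services, since the client does not know any of it up front. There are three main operations:
- The client asks OzoneManager for the key's information; the returned key info contains the information of all Blocks under that key.
- Each Block Stream queries the Datanode for the Block's record in the Container DB; that record contains the information of the Chunks belonging to the Block.
- Each Chunk Stream asks the Datanode for the actual chunk file data and loads it into its own buffer for callers to read.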
Putting it all together, the overall flow is shown in the figure below:
[Figure: overall Ozone data read flow]
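Code analysis of the Ozone read path
Let's look at the implementation of some of the key read-related methods.
First, the client looks up the key information from the OM service: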
public OzoneInputStream readFile(String volumeName, String bucketName,
    String keyName) throws IOException {
  OmKeyArgs keyArgs = new OmKeyArgs.Builder()
      .setVolumeName(volumeName)
      .setBucketName(bucketName)
      .setKeyName(keyName)
      .setSortDatanodesInPipeline(topologyAwareReadEnabled)
      .build();
  // 1. The client asks OM for the metadata of the given key, which includes
  //    the block information of that key. The client then uses the returned
  //    key info to construct the input stream object.
  OmKeyInfo keyInfo = ozoneManagerClient.lookupFile(keyArgs);
  return createInputStream(keyInfo);
}
Execution then reaches the initialization method of KeyInputStream, which creates one Block Stream object per Block:
private synchronized void initialize(String keyName,
    List<OmKeyLocationInfo> blockInfos,
    XceiverClientManager xceiverClientManager,
    boolean verifyChecksum) {
  this.key = keyName;
  this.blockOffsets = new long[blockInfos.size()];
  long keyLength = 0;
  // 2. KeyInputStream builds a BlockInputStream for each block listed in the
  //    key info returned by the lookup
  for (int i = 0; i < blockInfos.size(); i++) {
    OmKeyLocationInfo omKeyLocationInfo = blockInfos.get(i);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Adding stream for accessing {}. The stream will be " +
          "initialized later.", omKeyLocationInfo);
    }
    // 3. Construct the BlockInputStream and add it to the block stream list
    addStream(omKeyLocationInfo, xceiverClientManager,
        verifyChecksum);
    // 4. Record the offset of this BlockInputStream within the whole key
    this.blockOffsets[i] = keyLength;
    // 5. Advance the running key length; this value becomes the start
    //    offset of the next BlockInputStream
    keyLength += omKeyLocationInfo.getLength();
  }
  this.length = keyLength;
}
Next comes the read operation driven by Block offsets:
public synchronized int read(byte[] b, int off, int len) throws IOException {
  checkOpen();
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || len < 0 || len > b.length - off) {
    throw new IndexOutOfBoundsException();
  }
  if (len == 0) {
    return 0;
  }
  int totalReadLen = 0;
  // Keep reading from blocks while there is still data left to read
  while (len > 0) {
    // If the current block index already points at the last block stream and
    // that stream has no unread data left, the whole key has been read, so
    // return.
    if (blockStreams.size() == 0 ||
        (blockStreams.size() - 1 <= blockIndex &&
            blockStreams.get(blockIndex)
                .getRemaining() == 0)) {
      return totalReadLen == 0 ? EOF : totalReadLen;
    }
    // 1. Get the BlockInputStream that is about to be read
    BlockInputStream current = blockStreams.get(blockIndex);
    // 2. Work out how many bytes to read next: the smaller of the bytes still
    //    requested and the bytes still unread in the current BlockInputStream.
    int numBytesToRead = Math.min(len, (int)current.getRemaining());
    // 3. Read from the BlockInputStream into the byte array
    int numBytesRead = current.read(b, off, numBytesToRead);
    if (numBytesRead != numBytesToRead) {
      // This implies that there is either data loss or corruption in the
      // chunk entries. Even EOF in the current stream would be covered in
      // this case.
      throw new IOException(String.format("Inconsistent read for blockID=%s "
              + "length=%d numBytesToRead=%d numBytesRead=%d",
          current.getBlockID(), current.getLength(), numBytesToRead,
          numBytesRead));
    }
    // 4. Update the bookkeeping: total bytes read, the array offset, and the
    //    number of bytes still requested
    totalReadLen += numBytesRead;
    off += numBytesRead;
    len -= numBytesRead;
    // 5. If the current block is fully read, move the index to the next block
    if (current.getRemaining() <= 0 &&
        ((blockIndex + 1) < blockStreams.size())) {
      blockIndex += 1;
    }
  }
  return totalReadLen;
}
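The loop above can be reproduced in miniature with plain JDK streams. The following is a simplified sketch, not Ozone code: each `ByteArrayInputStream` stands in for a BlockInputStream, `available()` plays the role of `getRemaining()`, and the class name `MultiSegmentReader` is made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for KeyInputStream; segments stand in for block streams.
final class MultiSegmentReader {
  private static final int EOF = -1;
  private final List<ByteArrayInputStream> segments = new ArrayList<>();
  private int index = 0; // index of the segment currently being read

  MultiSegmentReader(byte[]... blocks) {
    for (byte[] block : blocks) {
      segments.add(new ByteArrayInputStream(block));
    }
  }

  int read(byte[] b, int off, int len) {
    if (len == 0) {
      return 0;
    }
    int total = 0;
    while (len > 0) {
      // Last segment reached and nothing left in it: everything has been read.
      if (segments.isEmpty()
          || (index >= segments.size() - 1 && segments.get(index).available() == 0)) {
        return total == 0 ? EOF : total;
      }
      ByteArrayInputStream current = segments.get(index);
      // Never ask a segment for more than it still holds.
      int toRead = Math.min(len, current.available());
      if (toRead > 0) {
        int n = current.read(b, off, toRead);
        total += n;
        off += n;
        len -= n;
      }
      // Current segment exhausted: move on to the next one, if any.
      if (current.available() == 0 && index + 1 < segments.size()) {
        index++;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    MultiSegmentReader reader =
        new MultiSegmentReader(new byte[] {1, 2, 3}, new byte[] {4, 5}, new byte[] {6});
    byte[] buf = new byte[4];
    System.out.println(reader.read(buf, 0, 4)); // first call spans two segments
  }
}
```

Unlike the real method, this sketch does not treat a short read as corruption, because the in-memory segments cannot lose data.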
The Block Stream read invoked above in turn delegates to the Chunk Stream read, and its logic is essentially the same as the method above.
The other read-related method is seek:
public synchronized void seek(long pos) throws IOException {
  checkOpen();
  if (pos == 0 && length == 0) {
    // It is possible for length and pos to be zero in which case
    // seek should return instead of throwing exception
    return;
  }
  if (pos < 0 || pos > length) {
    throw new EOFException(
        "EOF encountered at pos: " + pos + " for key: " + key);
  }
  // 1. Update the block index
  if (blockIndex >= blockStreams.size()) {
    // If the index is past the last stream, binary-search the whole
    // blockOffsets array for the index
    blockIndex = Arrays.binarySearch(blockOffsets, pos);
  } else if (pos < blockOffsets[blockIndex]) {
    // If the target position is before the current block's offset, narrow
    // the search to indices [0, blockIndex)
    blockIndex =
        Arrays.binarySearch(blockOffsets, 0, blockIndex, pos);
  } else if (pos >= blockOffsets[blockIndex] + blockStreams
      .get(blockIndex).getLength()) {
    // Otherwise, if the target lies past the end of the current block,
    // search the remaining indices [blockIndex + 1, blockStreams.size())
    blockIndex = Arrays
        .binarySearch(blockOffsets, blockIndex + 1,
            blockStreams.size(), pos);
  }
  if (blockIndex < 0) {
    // Binary search returns -insertionPoint - 1 if element is not present
    // in the array. insertionPoint is the point at which element would be
    // inserted in the sorted array. We need to adjust the blockIndex
    // accordingly so that blockIndex = insertionPoint - 1
    blockIndex = -blockIndex - 2;
  }
  // 2. Reset the position of the BlockInputStream used by the previous seek
  blockStreams.get(blockIndexOfPrevPosition).resetPosition();
  // 3. Reset the position of every block after the current block index
  for (int index = blockIndex + 1; index < blockStreams.size(); index++) {
    blockStreams.get(index).seek(0);
  }
  // 4. Seek the current block to the target position, i.e. the given
  //    position minus this block's global offset
  blockStreams.get(blockIndex).seek(pos - blockOffsets[blockIndex]);
  blockIndexOfPrevPosition = blockIndex;
}
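The index arithmetic around `Arrays.binarySearch` is the subtle part of seek, so here is a small standalone sketch of it. `BlockLocator` and `findBlockIndex` are hypothetical names for this example; the sketch assumes the offsets array is sorted, starts at 0, and has no duplicates (no zero-length Blocks), and that `pos` is non-negative.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Ozone.
final class BlockLocator {

  /** Returns the index of the block whose range contains pos. */
  static int findBlockIndex(long[] blockOffsets, long pos) {
    int idx = Arrays.binarySearch(blockOffsets, pos);
    if (idx < 0) {
      // Not an exact block start: binarySearch returned -(insertionPoint) - 1,
      // and the containing block is the one just before the insertion point.
      idx = -idx - 2;
    }
    return idx;
  }

  public static void main(String[] args) {
    long[] offsets = {0L, 256L, 512L};              // assumed sample offsets
    long pos = 300L;
    int idx = findBlockIndex(offsets, pos);
    long posInBlock = pos - offsets[idx];           // position relative to the block start
    System.out.println(idx + " " + posInBlock);     // prints 1 44
  }
}
```

A position that falls exactly on a Block boundary resolves to the Block that starts there, which matches step 4 of seek, where the in-block position is `pos - blockOffsets[blockIndex]`.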
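Since the read logic inside Block Stream is largely the same as in Key Stream, I will skip it here and go straight to how Chunk Stream reads data through its buffers.
The Chunk Stream read operation looks like this: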
public synchronized int read(byte[] b, int off, int len) throws IOException {
  // According to the JavaDocs for InputStream, it is recommended that
  // subclasses provide an override of bulk read if possible for performance
  // reasons. In addition to performance, we need to do it for correctness
  // reasons. The Ozone REST service uses PipedInputStream and
  // PipedOutputStream to relay HTTP response data between a Jersey thread and
  // a Netty thread. It turns out that PipedInputStream/PipedOutputStream
  // have a subtle dependency (bug?) on the wrapped stream providing separate
  // implementations of single-byte read and bulk read. Without this, get key
  // responses might close the connection before writing all of the bytes
  // advertised in the Content-Length.
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || len < 0 || len > b.length - off) {
    throw new IndexOutOfBoundsException();
  }
  if (len == 0) {
    return 0;
  }
  checkOpen();
  int total = 0;
  while (len > 0) {
    // 1. Make sure up to len bytes are available in the buffers
    int available = prepareRead(len);
    if (available == EOF) {
      // There is no more data in the chunk stream. The buffers should have
      // been released by now
      Preconditions.checkState(buffers == null);
      return total != 0 ? total : EOF;
    }
    // 2. Copy from the buffer into the caller's array; this advances the
    //    buffer's position by `available` bytes
    buffers.get(bufferIndex).get(b, off + total, available);
    // 3. Update the remaining length
    len -= available;
    total += available;
  }
  // 4. If the end of the chunk has been reached, release the buffers
  if (chunkStreamEOF()) {
    // smart consumers determine EOF by calling getPos()
    // so we release buffers when serving the final bytes of data
    releaseBuffers();
  }
  return total;
}
The prepareRead operation reads chunk data from the Datanode and loads it into the buffers:
private synchronized int prepareRead(int len) throws IOException {
  for (;;) {
    if (chunkPosition >= 0) {
      if (buffersHavePosition(chunkPosition)) {
        // The current buffers have the seeked position. Adjust the buffer
        // index and position to point to the chunkPosition.
        adjustBufferPosition(chunkPosition - bufferOffset);
      } else {
        // Read a required chunk data to fill the buffers with seeked
        // position data
        readChunkFromContainer(len);
      }
    }
    // If no seek position is pending for this chunk, check whether the
    // current buffers already hold data
    if (buffersHaveData()) {
      // Data is available from buffers
      ByteBuffer bb = buffers.get(bufferIndex);
      return len > bb.remaining() ? bb.remaining() : len;
    } else if (dataRemainingInChunk()) {
      // The current buffers hold no data but the chunk still has unread
      // data, so read chunk data into the buffers
      readChunkFromContainer(len);
    } else {
      // All available input from this chunk stream has been consumed.
      return EOF;
    }
  }
}
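To show why the buffer saves round trips, here is a stripped-down sketch of the same refill-then-serve pattern. It is not the Ozone implementation: `BufferedChunkReader`, `remoteChunk`, `fetchSize`, and `fetchCount` are names I made up, a byte array stands in for the chunk file on the Datanode, seek handling is omitted, and an array copy stands in for the network call.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-in for a chunk stream; illustration only.
final class BufferedChunkReader {
  private static final int EOF = -1;
  private final byte[] remoteChunk; // stands in for the chunk data held by a Datanode
  private final int fetchSize;      // how many bytes one simulated fetch brings back
  private ByteBuffer buffer;        // null until the first fetch
  private int nextFetchOffset = 0;  // offset of the first byte not fetched yet
  int fetchCount = 0;               // number of simulated Datanode round trips

  BufferedChunkReader(byte[] remoteChunk, int fetchSize) {
    this.remoteChunk = remoteChunk;
    this.fetchSize = fetchSize;
  }

  // Ensures the buffer holds data and returns how many bytes can be served now.
  private int prepareRead(int len) {
    for (;;) {
      if (buffer != null && buffer.hasRemaining()) {
        return Math.min(len, buffer.remaining());
      } else if (nextFetchOffset < remoteChunk.length) {
        // Buffer is empty but the chunk has more data: simulate one fetch.
        int n = Math.min(fetchSize, remoteChunk.length - nextFetchOffset);
        buffer = ByteBuffer.wrap(
            Arrays.copyOfRange(remoteChunk, nextFetchOffset, nextFetchOffset + n));
        nextFetchOffset += n;
        fetchCount++;
      } else {
        return EOF;
      }
    }
  }

  int read(byte[] b, int off, int len) {
    if (len == 0) {
      return 0;
    }
    int total = 0;
    while (len > 0) {
      int available = prepareRead(len);
      if (available == EOF) {
        return total != 0 ? total : EOF;
      }
      buffer.get(b, off + total, available); // advances the buffer position
      len -= available;
      total += available;
    }
    return total;
  }

  public static void main(String[] args) {
    BufferedChunkReader reader = new BufferedChunkReader(new byte[10], 4);
    byte[] one = new byte[1];
    for (int i = 0; i < 6; i++) {
      reader.read(one, 0, 1);
    }
    System.out.println("fetches: " + reader.fetchCount);
  }
}
```

In this sketch, six single-byte reads cost only two simulated fetches, because the bytes between fetches are served straight from the buffer.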
Once the read loop has finished, the chunkStreamEOF method called above checks how far the stream has been read:
/**
 * Checks whether the end of the chunk has been reached.
 */
private boolean chunkStreamEOF() {
  if (!allocated) {
    // Chunk data has not been read yet
    return false;
  }
  // Two conditions decide whether the read position has reached the end of
  // the chunk:
  // 1) whether the buffers still hold data
  // 2) whether the chunk's full length has been reached
  if (buffersHaveData() || dataRemainingInChunk()) {
    return false;
  } else {
    Preconditions.checkState(bufferOffset + bufferLength == length,
        "EOF detected, but not at the last byte of the chunk");
    return true;
  }
}
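Chunk Stream uses ByteBuffer buffers to avoid issuing frequent I/O reads, which improves efficiency.
That wraps up the analysis of the Ozone read path. The core idea is that reads are driven by data offsets across Blocks and Chunks.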