Elasticsearch Internals Series: A Deep Dive into Shard Recovery

We are the Infrastructure Department, the backend team behind Tencent Cloud's CES/CTSDB products. With professional ES development and operations expertise, we provide stable, high-performance services. Users with such needs are welcome to get on board, and we are always happy to discuss Elasticsearch and Lucene!

1. Introduction

    In production, large ES clusters frequently hit node failures caused by network partitions, machine faults, cluster overload, and so on. Once the external problem is resolved, the node rejoins the cluster, and ES's rebalancing strategy then needs to recover certain shards onto the newly joined node. How does ES's shard recovery actually work? How do you deal with its pitfalls? (One of our production users hit this: with the recovery concurrency tuned too high, an ES bug was triggered that caused a distributed deadlock.) And how are the completeness and consistency of the recovered shard guaranteed? This article answers these questions by walking through the ES source code. Note: ES has several shard recovery scenarios; this article only dissects the most complex one, peer recovery.

2. Overall Shard Recovery Flow

    Replica shard recovery in ES involves a target node and a source node: the target node is the recovering (previously failed) node, and the source node is the one providing the data. The target node sends a shard recovery request to the source node, which handles it in two phases. In phase 1, the source takes a snapshot of the shard to be recovered and compares it with the metadata carried in the request: if the sync ids match and the doc counts are equal, the file copy is skipped; otherwise the segment files of the two shards are diffed and the differing segment files are sent to the target node. In phase 2, to guarantee the completeness of the target node's data, the source sends its local translog to the target node, which replays the received operations. The overall flow is shown in the figure below.

Figure: overall shard recovery flow

    That is the recovery flow at a high level; the implementation details are analyzed below against the source code.
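
    Before diving into the source code, here is a minimal, self-contained sketch of the phase 1 decision described above (FileMeta, ShardFiles and filesToSend are our own illustrative types and names, not ES classes; the real Store.recoveryDiff additionally treats each segment's files as an all-or-nothing group):

import java.util.*;

// Simplified model of the phase 1 decision: either skip the file copy entirely
// (matching sync ids and doc counts) or select the files that are missing or
// different on the target node.
public class Phase1DecisionSketch {

    record FileMeta(String name, String checksum, long length) {}
    record ShardFiles(String syncId, long numDocs, Map<String, FileMeta> files) {}

    static List<FileMeta> filesToSend(ShardFiles source, ShardFiles target) {
        if (source.syncId() != null && source.syncId().equals(target.syncId())) {
            if (source.numDocs() != target.numDocs()) {
                throw new IllegalStateException("same sync id but different doc counts");
            }
            return List.of(); // same commit point on both sides: nothing to copy
        }
        List<FileMeta> toSend = new ArrayList<>();
        for (FileMeta meta : source.files().values()) {
            FileMeta onTarget = target.files().get(meta.name());
            if (onTarget == null || !onTarget.checksum().equals(meta.checksum())) {
                toSend.add(meta); // missing or different on the target node
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        ShardFiles source = new ShardFiles("sync-1", 100,
                Map.of("_0.cfs", new FileMeta("_0.cfs", "abc", 10_240)));
        ShardFiles target = new ShardFiles("sync-2", 90, Map.of());
        System.out.println("files to send: " + filesToSend(source, target));
    }
}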

3. Replica Shard Recovery Flow

3.1 Target Node Requests Recovery

    In this section we walk through the detailed recovery flow of a replica shard in the source code. ES drives each module through changes in the cluster metadata. The entry point for replica shard recovery is IndicesClusterStateService.createOrUpdateShards. It first checks whether the local node appears in routingNodes: if it does, the local node has shards to create or update; otherwise the method simply returns. The logic is as follows:

private void createOrUpdateShards(final ClusterState state) {
    RoutingNode localRoutingNode = state.getRoutingNodes().node(state.nodes().getLocalNodeId());
    if (localRoutingNode == null) {
        return;
    }
    DiscoveryNodes nodes = state.nodes();
    RoutingTable routingTable = state.routingTable();
    for (final ShardRouting shardRouting : localRoutingNode) {
        ShardId shardId = shardRouting.shardId();
        if (failedShardsCache.containsKey(shardId) == false) {
            AllocatedIndex<? extends Shard> indexService = indicesService.indexService(shardId.getIndex());
            Shard shard = indexService.getShardOrNull(shardId.id());
            if (shard == null) { // the shard does not exist locally, so create it
                createShard(nodes, routingTable, shardRouting, state);
            } else { // it exists, so update it
                updateShard(nodes, shardRouting, shard, routingTable, state);
            }
        }
    }
}

    Replica shard recovery goes down the createShard branch. This method first checks the shardRouting's recovery type: if it is PEER, the shard has to be fetched from a remote node, so the source node must be located before calling IndicesService.createShard:

private void createShard(DiscoveryNodes nodes, RoutingTable routingTable, ShardRouting shardRouting, ClusterState state) {
    DiscoveryNode sourceNode = null;
    if (shardRouting.recoverySource().getType() == Type.PEER)  {
        sourceNode = findSourceNodeForPeerRecovery(logger, routingTable, nodes, shardRouting); // for peer recovery, locate the source node holding the shard to recover from
        if (sourceNode == null) {
            return;
        }
    }
    RecoveryState recoveryState = new RecoveryState(shardRouting, nodes.getLocalNode(), sourceNode);
    indicesService.createShard(shardRouting, recoveryState, recoveryTargetService, new RecoveryListener(shardRouting), repositoriesService, failedShardHandler);
    ... ...
}

private static DiscoveryNode findSourceNodeForPeerRecovery(Logger logger, RoutingTable routingTable, DiscoveryNodes nodes, ShardRouting shardRouting) {
    DiscoveryNode sourceNode = null;
    if (!shardRouting.primary()) {
        ShardRouting primary = routingTable.shardRoutingTable(shardRouting.shardId()).primaryShard();
        if (primary.active()) {
            sourceNode = nodes.get(primary.currentNodeId()); // resolve the node hosting the primary shard
        }
    } else if (shardRouting.relocatingNodeId() != null) {
        sourceNode = nodes.get(shardRouting.relocatingNodeId()); // resolve the node the shard is relocating from
    } else {
         ... ...
    }
    return sourceNode;
}

    The source node is determined in two ways: if the current shard is not a primary shard, the source node is the one hosting the primary shard; otherwise, if the shard is being relocated (from another node to this one), the source node is the node the data is moving from. Once the source node is known, IndicesService.createShard is called, which in turn calls IndexShard.startRecovery to kick off recovery. For a recovery of type PEER, the real work is done by PeerRecoveryTargetService.doRecovery. This method first obtains the shard's metadataSnapshot, which contains segment-level information such as the sync id, checksums and doc counts, wraps it into a StartRecoveryRequest, and sends it to the source node over RPC:

... ...
metadataSnapshot = recoveryTarget.indexShard().snapshotStoreMetadata();
... ...
// build the recovery request
request = new StartRecoveryRequest(recoveryTarget.shardId(), recoveryTarget.indexShard().routingEntry().allocationId().getId(), recoveryTarget.sourceNode(), clusterService.localNode(), metadataSnapshot, recoveryTarget.state().getPrimary(), recoveryTarget.recoveryId());
... ...
// send the start-recovery request to the source node
cancellableThreads.execute(() -> responseHolder.set(
        transportService.submitRequest(request.sourceNode(), PeerRecoverySourceService.Actions.START_RECOVERY, request,
                new FutureTransportResponseHandler<RecoveryResponse>() {
                    @Override
                    public RecoveryResponse newInstance() {
                        return new RecoveryResponse();
                    }
                }).txGet()));

    Note that the request is sent asynchronously, but PlainTransportFuture.txGet() is then called to wait for the peer's reply, so the calling thread blocks until the response arrives. At this point the target node has sent its request to the source node; the source node's handling is analyzed in detail next.

3.2 Source Node Handles the Recovery Request

    Upon receiving the request, the source node invokes the recovery entry point, recover:

class StartRecoveryTransportRequestHandler implements TransportRequestHandler<StartRecoveryRequest> {
    @Override
    public void messageReceived(final StartRecoveryRequest request, final TransportChannel channel) throws Exception {
        RecoveryResponse response = recover(request);
        channel.sendResponse(response);
    }
}

    The recover method resolves the shard from the request, constructs a RecoverySourceHandler, and then calls handler.recoverToTarget to enter the recovery body:

public RecoveryResponse recoverToTarget() throws IOException { // recovery runs in two phases
    try (Translog.View translogView = shard.acquireTranslogView()) { 
        final IndexCommit phase1Snapshot;
        try {
            phase1Snapshot = shard.acquireIndexCommit(false);
        } catch (Exception e) {
            IOUtils.closeWhileHandlingException(translogView);
            throw new RecoveryEngineException(shard.shardId(), 1, "Snapshot failed", e);
        }
        try {
            phase1(phase1Snapshot, translogView);  // phase 1: compare sync ids and segment files, compute the diff, and push the differing data to the requester
        } catch (Exception e) {
            throw new RecoveryEngineException(shard.shardId(), 1, "phase1 failed", e);
        } finally {
            try {
                shard.releaseIndexCommit(phase1Snapshot);
            } catch (IOException ex) {
                logger.warn("releasing snapshot caused exception", ex);
            }
        }
        // engine was just started at the end of phase 1
        if (shard.state() == IndexShardState.RELOCATED) {
            throw new IndexShardRelocatedException(request.shardId());
        }
        try {
            phase2(translogView.snapshot()); // phase 2: send the translog
        } catch (Exception e) {
            throw new RecoveryEngineException(shard.shardId(), 2, "phase2 failed", e);
        }
        finalizeRecovery();
    }
    return response;
}

    As the code above shows, recovery consists of two phases: phase 1 recovers the segment files, phase 2 sends the translog. One key detail: before recovery starts, a translog view and a segment snapshot (index commit) are acquired. The translog view guarantees that translog entries written between this point and the end of recovery are not deleted; the segment snapshot guarantees that the segment files existing at this point are not deleted. Now let's look at the two phases in detail. phase1:

public void phase1(final IndexCommit snapshot, final Translog.View translogView) {
    final Store store = shard.store(); // get the shard's store
    recoverySourceMetadata = store.getMetadata(snapshot); // get the metadata of the snapshot
    String recoverySourceSyncId = recoverySourceMetadata.getSyncId();
    String recoveryTargetSyncId = request.metadataSnapshot().getSyncId();
    final boolean recoverWithSyncId = recoverySourceSyncId != null && recoverySourceSyncId.equals(recoveryTargetSyncId);
    if (recoverWithSyncId) { // sync ids match: also compare doc counts; if they match too, no segment files need to be copied
        final long numDocsTarget = request.metadataSnapshot().getNumDocs();
        final long numDocsSource = recoverySourceMetadata.getNumDocs();
        if (numDocsTarget != numDocsSource) {
            throw new IllegalStateException("... ...");
        }
    } else {
        final Store.RecoveryDiff diff = recoverySourceMetadata.recoveryDiff(request.metadataSnapshot()); // compute the segment files that differ between source and target
        List<StoreFileMetaData> phase1Files = new ArrayList<>(diff.different.size() + diff.missing.size());
        phase1Files.addAll(diff.different);
        phase1Files.addAll(diff.missing);
        ... ...
        final Function<StoreFileMetaData, OutputStream> outputStreamFactories =
            md -> new BufferedOutputStream(new RecoveryOutputStream(md, translogView), chunkSizeInBytes);
        sendFiles(store, phase1Files.toArray(new StoreFileMetaData[phase1Files.size()]), outputStreamFactories); // push the files that need recovery to the target node
        ... ...
    }
    prepareTargetForTranslog(translogView.totalOperations(), shard.segmentStats(false).getMaxUnsafeAutoIdTimestamp());
    ... ...
}

    As the code above shows, phase1 works as follows: take the metadataSnapshot of the shard to be recovered to get recoverySourceSyncId, take recoveryTargetSyncId from the request, and compare the two sync ids. If they match, also compare the doc counts of source and target; if those match too, then the segments of the source and target shards up to the current commit point are identical and no segment files need to be copied. If the sync ids differ, the segment files differ, and every differing file must be recovered. Comparing recoverySourceMetadata with recoveryTargetSnapshot yields all of the differing segment files. That logic looks like this:

public RecoveryDiff recoveryDiff(MetadataSnapshot recoveryTargetSnapshot) {
    final List<StoreFileMetaData> identical = new ArrayList<>();  // files identical on the target
    final List<StoreFileMetaData> different = new ArrayList<>();  // files that differ
    final List<StoreFileMetaData> missing = new ArrayList<>();   // files missing on the target
    final Map<String, List<StoreFileMetaData>> perSegment = new HashMap<>();
    final List<StoreFileMetaData> perCommitStoreFiles = new ArrayList<>();
    ... ...
    for (List<StoreFileMetaData> segmentFiles : Iterables.concat(perSegment.values(), Collections.singleton(perCommitStoreFiles))) {
        identicalFiles.clear();
        boolean consistent = true;
        for (StoreFileMetaData meta : segmentFiles) {
            StoreFileMetaData storeFileMetaData = recoveryTargetSnapshot.get(meta.name());
            if (storeFileMetaData == null) {
                consistent = false;
                missing.add(meta); // the file does not exist on the target node, add it to missing
            } else if (storeFileMetaData.isSame(meta) == false) {
                consistent = false;
                different.add(meta); // it exists but differs, add it to different
            } else {
                identicalFiles.add(meta);  // it exists and is identical
            }
        }
        if (consistent) {
            identical.addAll(identicalFiles);
        } else {
            // make sure all files are added - this can happen if only the deletes are different
            different.addAll(identicalFiles);
        }
    }
    RecoveryDiff recoveryDiff = new RecoveryDiff(Collections.unmodifiableList(identical), Collections.unmodifiableList(different), Collections.unmodifiableList(missing));
    return recoveryDiff;
}

    Here all segment files are grouped into three categories: identical, different, and missing (absent on the target). The different and missing files are what phase 1 sends to the target node. After sending the segment files, the source node sends a message telling the target node to clean up its temporary files, and then another message telling it to open its engine and get ready to receive the translog. Note that both of these network calls also block on PlainTransportFuture.txGet() waiting for the peer's reply. That concludes phase 1 on the source side.

    Phase 2 is comparatively simple: all translog operations recorded between the point the translog view was acquired and the present are sent to the target node (see the sketch below).
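
    In practice the operations are not shipped one at a time but accumulated into size-bounded batches before each RPC. The following is a rough, self-contained sketch of that batching idea only (Operation, BATCH_SIZE_IN_BYTES and sendBatch are illustrative assumptions, not the ES API):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified model of phase 2: stream translog operations to the target in size-bounded batches.
public class TranslogBatchSenderSketch {
    record Operation(String id, byte[] payload) {}          // hypothetical translog operation
    static final long BATCH_SIZE_IN_BYTES = 512 * 1024;     // illustrative threshold

    static int sendOperations(Iterator<Operation> snapshot) {
        List<Operation> batch = new ArrayList<>();
        long batchSize = 0;
        int totalSent = 0;
        while (snapshot.hasNext()) {
            Operation op = snapshot.next();
            batch.add(op);
            batchSize += op.payload().length;
            if (batchSize >= BATCH_SIZE_IN_BYTES) {          // flush once the batch is large enough
                sendBatch(batch);
                totalSent += batch.size();
                batch.clear();
                batchSize = 0;
            }
        }
        if (!batch.isEmpty()) {                              // flush the final partial batch
            sendBatch(batch);
            totalSent += batch.size();
        }
        return totalSent;
    }

    static void sendBatch(List<Operation> batch) {
        // stand-in for the RPC that ships a batch of operations to the target node
        System.out.println("sending " + batch.size() + " operations");
    }

    public static void main(String[] args) {
        List<Operation> ops = List.of(new Operation("1", new byte[300_000]),
                                      new Operation("2", new byte[300_000]),
                                      new Operation("3", new byte[100]));
        System.out.println("total sent: " + sendOperations(ops.iterator()));
    }
}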

3.3 Target Node Performs Recovery

  • Receiving segments

    This corresponds to phase 1 on the source node described in the previous section: the source sends every differing segment file, and the target node writes the received segment files to disk. The write path is RecoveryTarget.writeFileChunk:

public void writeFileChunk(StoreFileMetaData fileMetaData, long position, BytesReference content, boolean lastChunk, int totalTranslogOps) throws IOException {
    final Store store = store();
    final String name = fileMetaData.name();
    ... ...
    if (position == 0) {
        indexOutput = openAndPutIndexOutput(name, fileMetaData, store); // first chunk: open an output for a temporary file (the name gets a recovery prefix)
    } else {
        indexOutput = getOpenIndexOutput(name); // later chunks: reuse the already-open temporary file output
    }
    ... ...
    while((scratch = iterator.next()) != null) { 
        indexOutput.writeBytes(scratch.bytes, scratch.offset, scratch.length); // append the chunk to the temporary file
    }
    ... ...
    store.directory().sync(Collections.singleton(temporaryFileName));  // fsync the temporary file to disk
}
  • Opening the engine

    With the step above, the target node has finished the first part of catching up. After receiving the segments, the target node opens the engine for the shard to get ready to receive the translog. Note that once the engine is open, the recovering shard can already serve index and delete operations (both requests replicated from the primary shard and operations replayed from the translog). The engine is opened as follows:

private void internalPerformTranslogRecovery(boolean skipTranslogRecovery, boolean indexExists, long maxUnsafeAutoIdTimestamp) throws IOException {
    ... ...
    recoveryState.setStage(RecoveryState.Stage.TRANSLOG);
    final EngineConfig.OpenMode openMode;
    if (indexExists == false) {
        openMode = EngineConfig.OpenMode.CREATE_INDEX_AND_TRANSLOG;
    } else if (skipTranslogRecovery) {
        openMode = EngineConfig.OpenMode.OPEN_INDEX_CREATE_TRANSLOG;
    } else {
        openMode = EngineConfig.OpenMode.OPEN_INDEX_AND_TRANSLOG;
    }
    final EngineConfig config = newEngineConfig(openMode, maxUnsafeAutoIdTimestamp);
    // we disable deletes since we allow for operations to be executed against the shard while recovering
    // but we need to make sure we don't lose deletes until we are done recovering
    config.setEnableGcDeletes(false); // keep delete tombstones around while recovering
    Engine newEngine = createNewEngine(config); // create the engine
    ... ...
}
  • Receiving and replaying the translog

    Once the engine is open, the operations in the translog can be replayed. Replay works much like a normal index or delete: the operation type and payload are reconstructed from the translog entry, the corresponding operation object is built, and the engine opened in the previous step executes it. The logic looks like this:

private void performRecoveryOperation(Engine engine, Translog.Operation operation, boolean allowMappingUpdates, Engine.Operation.Origin origin) throws IOException {
    switch (operation.opType()) { // reconstruct the operation type and data, then have the engine execute the corresponding action
        case INDEX:
            Translog.Index index = (Translog.Index) operation;           
            // ... build the engineIndex object from index ...
            maybeAddMappingUpdate(engineIndex.type(), engineIndex.parsedDoc().dynamicMappingsUpdate(), engineIndex.id(), allowMappingUpdates);
            index(engine, engineIndex); // execute the index (write) operation
            break;
        case DELETE:
            Translog.Delete delete = (Translog.Delete) operation;
            // ... build the engineDelete object from delete ...
            delete(engine, engineDelete); // execute the delete operation
            break;
        default:
            throw new IllegalStateException("No operation defined for [" + operation + "]");
    }
}

    With the steps above, translog replay is complete. What remains is some finishing work: a refresh so that the newly replayed data becomes visible, and re-enabling GC of deletes:

public void finalizeRecovery() {
    recoveryState().setStage(RecoveryState.Stage.FINALIZE);
    Engine engine = getEngine();
    engine.refresh("recovery_finalization"); 
    engine.config().setEnableGcDeletes(true);
}

    At this point both phases of replica shard recovery are done. Since the shard is still in the INITIALIZING state, the master node must be notified to start the recovered shard:

private class RecoveryListener implements PeerRecoveryTargetService.RecoveryListener {
    @Override
    public void onRecoveryDone(RecoveryState state) {
        if (state.getRecoverySource().getType() == Type.SNAPSHOT) {
            SnapshotRecoverySource snapshotRecoverySource = (SnapshotRecoverySource) state.getRecoverySource();
            restoreService.indexShardRestoreCompleted(snapshotRecoverySource.snapshot(), shardRouting.shardId());
        }
        shardStateAction.shardStarted(shardRouting, "after " + state.getRecoverySource(), SHARD_STATE_ACTION_LISTENER);
    }
}

    With that, the entire shard recovery flow is complete.

4. Questions and Answers

    With the source code walkthrough behind us, this section answers the questions raised at the beginning of the article, to deepen the understanding of shard recovery.

  • Distributed deadlock     As emphasized at the end of sections 3.1 and 3.2, both the source node and the target node call PlainTransportFuture.txGet() to block a thread while synchronously waiting for the peer's result, and this is the key ingredient of the deadlock. A detailed problem description and fix can be found at https://cloud.tencent.com/developer/article/1370318; combined with the source analysis in this article, the root cause should now be clear. A local sketch of the blocking pattern follows.
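
    The underlying mechanism is the classic one of blocking a bounded thread pool on work that must itself run on that pool: if every thread is parked in a synchronous wait for a response whose handling also needs a thread from the pool, nothing can make progress. The following single-process sketch (a plain fixed-size pool, not ES code) reproduces the same pattern locally:

import java.util.concurrent.*;

// Sketch of the deadlock pattern: tasks on a bounded pool block while waiting
// for sub-tasks that can only run on the very same pool.
public class PoolDeadlockSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // think: a small "recovery" thread pool

        Callable<String> outer = () -> {
            // the outer task occupies a pool thread, then blocks on a sub-task
            Future<String> inner = pool.submit(() -> "inner done");
            return inner.get(); // analogous to PlainTransportFuture.txGet()
        };

        // with 2 threads and 2 outer tasks, both threads end up blocked in get()
        // and the inner tasks can never be scheduled -> deadlock
        Future<String> f1 = pool.submit(outer);
        Future<String> f2 = pool.submit(outer);
        try {
            System.out.println(f1.get(3, TimeUnit.SECONDS) + ", " + f2.get(3, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("deadlocked: all pool threads are blocked waiting on the same pool");
        } finally {
            pool.shutdownNow();
        }
    }
}

    In ES the waiting threads and the handlers live on different nodes and different thread pools, but the effect is analogous: once the recovery concurrency exceeds what the pools can absorb, both sides can end up waiting on each other indefinitely.
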
  • Completeness     Phase 1 guarantees that the existing historical data is copied to the replica. Once phase 1 completes, the replica's engine is open and can handle index and delete requests normally, and because the translog view covers the whole of phase 1, every index/delete executed during phase 1 is also recorded. During phase 2, the replica's normal index and delete operations run in parallel with the translog replay. Together this guarantees that both the data that existed before recovery started and the data written during recovery end up on the replica shard, i.e. the data is complete. This is illustrated below:
Figure: replica shard recovery timeline
  • Consistency

    Once phase 1 completes, the replica can serve writes normally, and those writes run in parallel with the phase 2 translog replay. If the replay lags behind the live writes, an older document could be written after a newer one, leaving the data inconsistent. To guarantee consistency, ES compares the version of the incoming operation against the version of the document already in Lucene when applying a write; if the incoming version is not newer, the operation is stale and the document is not written to Lucene. The relevant code:

final OpVsLuceneDocStatus opVsLucene = compareOpToLuceneDocBasedOnVersions(index); // compare the operation's version with the version of the doc already in Lucene
if (opVsLucene == OpVsLuceneDocStatus.OP_STALE_OR_EQUAL) {
    plan = IndexingStrategy.skipAsStale(false, index.version()); // the operation is stale, do not write it to Lucene
}
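
    As a toy illustration of this rule (entirely our own example, not ES code): if the replica has already applied a newer write for a document, a replayed translog entry carrying an older version is simply dropped:

import java.util.HashMap;
import java.util.Map;

// Toy model of version-based conflict resolution during translog replay.
public class VersionConflictSketch {
    record Doc(long version, String value) {}
    static final Map<String, Doc> lucene = new HashMap<>(); // stands in for the documents in Lucene

    // apply an index operation only if it is newer than what is already stored
    static void applyIndexOp(String id, long opVersion, String value) {
        Doc existing = lucene.get(id);
        if (existing != null && opVersion <= existing.version()) {
            System.out.println("skip stale op: id=" + id + " version=" + opVersion);
            return; // analogous to IndexingStrategy.skipAsStale
        }
        lucene.put(id, new Doc(opVersion, value));
    }

    public static void main(String[] args) {
        applyIndexOp("doc1", 2, "written by a live request during recovery");
        applyIndexOp("doc1", 1, "replayed from the translog");  // dropped as stale
        System.out.println(lucene.get("doc1"));                 // version 2 is kept
    }
}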

5. Summary

    This article walked through the replica shard recovery flow in detail against the ES source code and used that understanding to answer the questions posed at the beginning. More Elasticsearch articles are on the way; we welcome your attention and feedback.
