Elasticsearch Internals Series: Write Path Analysis

1. Introduction

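Elasticsearch (ES) is a near-real-time distributed storage, search, and analytics system built on Lucene. It is used for log analysis, full-text search, structured data analysis, and similar workloads, and can serve either as a NoSQL database or as a search engine. These capabilities have led to adoption by organizations such as Wikipedia, GitHub, and Stack Overflow.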

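For writes, the two properties we care about most are how quickly written data becomes visible and how reliably it is stored. This article walks through the source code to trace the write path in detail.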

2. Distributed write flow

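ES's write model is based on Microsoft's PacificA protocol. A write must complete on the primary shard before it is replicated to the corresponding replica shards, as shown in the figure below: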

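A write typically passes through three kinds of nodes: the coordinating node, the node holding the primary shard, and the nodes holding the replica shards. In the figure above, NODE1 acts as the coordinating node. After receiving the request, it determines that the document belongs to shard 0 and forwards the request to NODE3, which holds the primary of shard 0. Once NODE3 has completed the write, it forwards the request to NODE1 and NODE2, which hold the replicas of shard 0. When all replicas have written successfully, NODE3 considers the write complete and reports the result to the coordinating node, which returns it to the client.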

That is the overall flow. The sections below go through the details with reference to the source code.

3. Write path source analysis

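ES supports two ways of writing: indexing a single document (index) and writing multiple documents in a batch (bulk). Internally, ES converts both into a bulk write. This section therefore uses the bulk write as the example and follows the main line of code execution.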

3.1 Bulk request dispatch

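ES generally handles a user request in two layers: the Rest layer and the Transport layer. The Rest layer mainly parses request parameters, while the Transport layer does the actual request processing. Each layer performs a dispatch step before handling the request, as shown in the figure below: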

An HTTP request from the client is first handled by HttpServerTransport and then enters RestController, where the actual dispatch takes place:

public void dispatchRequest(RestRequest request, RestChannel channel, ThreadContext threadContext) {
        if (request.rawPath().equals("/favicon.ico")) {
            handleFavicon(request, channel);
            return;
        }
        RestChannel responseChannel = channel;
        try {
            final int contentLength = request.hasContent() ? request.content().length() : 0;
            assert contentLength >= 0 : "content length was negative, how is that possible?";
            final RestHandler handler = getHandler(request);
        ... ...
}

void dispatchRequest(final RestRequest request, final RestChannel channel, final NodeClient client, ThreadContext threadContext,
                         final RestHandler handler) throws Exception {
            ... ...
            final RestHandler wrappedHandler = Objects.requireNonNull(handlerWrapper.apply(handler));
            wrappedHandler.handleRequest(request, channel, client);
            ... ...
        }
}

As the code shows, the first dispatchRequest looks up the handler for the request, and the second dispatchRequest calls that handler's handleRequest method to process it. How does getHandler find the handler for a request? The logic is as follows:

public void registerHandler(RestRequest.Method method, String path, RestHandler handler) {
        PathTrie<RestHandler> handlers = getHandlersForMethod(method);
        if (handlers != null) {
            handlers.insert(path, handler);
        } else {
            throw new IllegalArgumentException("Can't handle [" + method + "] for path [" + path + "]");
        }
}

private RestHandler getHandler(RestRequest request) {
        String path = getPath(request);
        PathTrie<RestHandler> handlers = getHandlersForMethod(request.method());
        if (handlers != null) {
            return handlers.retrieve(path, request.params());
        } else {
            return null;
        }
}

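ES registers handlers in advance through RestController's registerHandler method, which inserts each handler into the handler table (a PathTrie) for its HTTP method (GET, PUT, POST, DELETE, and so on). When a user request arrives, RestController's getHandler can then retrieve the matching handler by HTTP method and path. For bulk operations the handler is RestBulkAction, which registers itself with RestController in its constructor: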

public RestBulkAction(Settings settings, RestController controller) {
        super(settings);
        controller.registerHandler(POST, "/_bulk", this);
        controller.registerHandler(PUT, "/_bulk", this);
        controller.registerHandler(POST, "/{index}/_bulk", this);
        controller.registerHandler(PUT, "/{index}/_bulk", this);
        controller.registerHandler(POST, "/{index}/{type}/_bulk", this);
        controller.registerHandler(PUT, "/{index}/{type}/_bulk", this);
        this.allowExplicitIndex = MULTI_ALLOW_EXPLICIT_INDEX.get(settings);
}
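
The register-then-look-up pattern described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the class HandlerRegistry is hypothetical, handlers are modelled as plain functions, and paths are matched exactly with a HashMap, whereas ES uses a PathTrie per HTTP method that also resolves placeholders such as {index}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified stand-in for RestController's handler tables.
final class HandlerRegistry {
    // HTTP method -> (exact path -> handler). ES keeps a PathTrie per method instead.
    private final Map<String, Map<String, Function<String, String>>> handlers = new HashMap<>();

    // Called up front, e.g. from a handler's constructor.
    void registerHandler(String method, String path, Function<String, String> handler) {
        handlers.computeIfAbsent(method, m -> new HashMap<>()).put(path, handler);
    }

    // Called per request; returns null when nothing is registered for method + path.
    Function<String, String> getHandler(String method, String path) {
        Map<String, Function<String, String>> byPath = handlers.get(method);
        return byPath == null ? null : byPath.get(path);
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        Function<String, String> bulkHandler = body -> "bulk:" + body;
        registry.registerHandler("POST", "/_bulk", bulkHandler);
        registry.registerHandler("PUT", "/_bulk", bulkHandler);

        System.out.println(registry.getHandler("POST", "/_bulk").apply("doc"));
        System.out.println(registry.getHandler("GET", "/_bulk"));
    }
}
```

Keying the outer map by HTTP method mirrors getHandlersForMethod in the excerpt: the same path can map to different handlers for GET and POST.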

RestBulkAction parses the RestRequest and converts it into a BulkRequest, then processes the BulkRequest. This logic lives in the prepareRequest method; part of the code is shown below:

    public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
        // Build a BulkRequest from the RestRequest
        ... ...
        // Process the bulkRequest
        return channel -> client.bulk(bulkRequest, new RestStatusToXContentListener<>(channel));
    }

When NodeClient handles the BulkRequest, it maps the request's action to the corresponding Transport-layer action, which then processes the BulkRequest. The mapping code is as follows:

    public <Request extends ActionRequest, Response extends ActionResponse>
    Task executeLocally(GenericAction<Request, Response> action, Request request, TaskListener<Response> listener) {
        return transportAction(action).execute(request, listener);
    }

    private <Request extends ActionRequest, Response extends ActionResponse>
    TransportAction<Request, Response> transportAction(GenericAction<Request, Response> action) {
        ... ...
        // actions maps each action to its TransportAction; the map is initialized when the node starts
        TransportAction<Request, Response> transportAction = actions.get(action);
        ... ...
        return transportAction;
    }

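TransportAction runs the request through a filter chain. If a plugin has registered a filter for the action, the plugin's logic runs first, and only then does execution enter TransportAction's own logic. The filter chain works as follows: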

public void proceed(Task task, String actionName, Request request, ActionListener<Response> listener) {
    int i = index.getAndIncrement();
    try {
        if (i < this.action.filters.length) {
            this.action.filters[i].apply(task, actionName, request, listener, this); // apply the plugin's filter logic
        } else if (i == this.action.filters.length) {
            this.action.doExecute(task, request, listener);  // run the TransportAction's own logic
        } else ... ...
    } catch(Exception e) { ... ... }
}
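
To make the index-based chaining concrete, here is a minimal sketch of the same idea. The names FilterChainSketch, Filter, and Chain are hypothetical, requests are plain strings, and the two filters ("security", "audit") are invented for illustration; only the proceed logic follows the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of a request filter chain; not the ES classes themselves.
final class FilterChainSketch {
    interface Filter {
        // A filter does its own work and then decides whether to continue the chain.
        void apply(String request, List<String> trace, Chain chain);
    }

    static final class Chain {
        private final List<Filter> filters;
        private final AtomicInteger index = new AtomicInteger();

        Chain(List<Filter> filters) {
            this.filters = filters;
        }

        void proceed(String request, List<String> trace) {
            int i = index.getAndIncrement();
            if (i < filters.size()) {
                filters.get(i).apply(request, trace, this);   // next filter in line
            } else if (i == filters.size()) {
                trace.add("doExecute:" + request);            // all filters passed: run the action
            } else {
                throw new IllegalStateException("proceed was called too many times");
            }
        }
    }

    // Runs a request through two invented filters and returns the order of execution.
    static List<String> run(String request) {
        List<String> trace = new ArrayList<>();
        List<Filter> filters = new ArrayList<>();
        filters.add((req, t, chain) -> { t.add("security"); chain.proceed(req, t); });
        filters.add((req, t, chain) -> { t.add("audit"); chain.proceed(req, t); });
        new Chain(filters).proceed(request, trace);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("bulk"));
    }
}
```

Because each filter must call proceed itself, a filter can also stop the request (for example, to reject it) simply by not continuing the chain.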

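For a bulk request, the concrete TransportAction is an instance of TransportBulkAction. This completes the hand-off from the Rest layer to the Transport layer. The next section describes TransportBulkAction's processing logic in detail.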

3.2 Write steps

3.2.1 Creating the index

If an index does not yet exist when a bulk write arrives, ES creates it automatically. The logic is in TransportBulkAction's doExecute method:

for (String index : indices) {
    boolean shouldAutoCreate;
    try {
        shouldAutoCreate = shouldAutoCreate(index, state);
    } catch (IndexNotFoundException e) {
        shouldAutoCreate = false;
        indicesThatCannotBeCreated.put(index, e);
    }
    if (shouldAutoCreate) {
        autoCreateIndices.add(index);
    }
}
... ...
for (String index : autoCreateIndices) {
 createIndex(index, bulkRequest.timeout(), new ActionListener<CreateIndexResponse>() {
   ... ...
}

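The first for loop iterates over all indices referenced by the bulk request and checks whether each one should be auto-created. Indices that do not exist are added to the auto-create set, and createIndex is then called for each of them. Index creation is controlled by the master: it uses the shard allocation and balancing algorithms to decide on which data nodes to create the index's shards, then publishes that information to the data nodes, which carry out the actual creation. The detailed index creation flow will be covered in a later article and is not expanded on here.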

3.2.2 The coordinating node processes and forwards the request

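Once the index has been created and its shards are established on the data nodes, the coordinating node forwards each write request to the primary shard of the document it targets. The entry point for forwarding a bulk request on the coordinating node is the executeBulk method: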

void executeBulk(Task task, final BulkRequest bulkRequest, final long startTimeNanos, final ActionListener<BulkResponse> listener,
        final AtomicArray<BulkItemResponse> responses, Map<String, IndexNotFoundException> indicesThatCannotBeCreated) {
    new BulkOperation(task, bulkRequest, listener, responses, startTimeNanos, indicesThatCannotBeCreated).run();
}

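The real work happens in BulkOperation's doRun method. It first iterates over all sub-requests of the BulkRequest and runs the logic for each operation type. For a write, it uses the IndexMetaData to resolve the routing of each IndexRequest and, during process, generates the _id field when needed: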

for (int i = 0; i < bulkRequest.requests.size(); i++) {
    DocWriteRequest docWriteRequest = bulkRequest.requests.get(i);
    ... ...
    Index concreteIndex = concreteIndices.resolveIfAbsent(docWriteRequest);
    try {
        switch (docWriteRequest.opType()) {
            case CREATE:
            case INDEX:
                ... ...
                indexRequest.resolveRouting(metaData); // assign the IndexRequest's routing based on metaData
                indexRequest.process(mappingMd, allowIdGeneration, concreteIndex.getName()); // if the user did not specify a doc id, one is generated here
                break;
            ... ...
        }
    } catch (... ...) { ... ... }
}

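Next, the target shard id is computed from each request's routing (if no routing is specified at write time, ES uses the doc id as the routing value by default). The DocWriteRequest is wrapped in a BulkItemRequest and appended to the request list for that shardId: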

for (int i = 0; i < bulkRequest.requests.size(); i++) {
  DocWriteRequest request = bulkRequest.requests.get(i); // each doc write request in the bulk request
  // Use the routing to find the target shard id for this doc
  ShardId shardId = clusterService.operationRouting().indexShards(clusterState, concreteIndex, request.id(), request.routing()).shardId();
  // requestsByShard: key is the shard id, value is the list of single-doc write requests (wrapped as BulkItemRequest) for that shard
  List<BulkItemRequest> shardRequests = requestsByShard.computeIfAbsent(shardId, shard -> new ArrayList<>());
  shardRequests.add(new BulkItemRequest(i, request));
}
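
The routing-then-grouping step can be illustrated with a small, self-contained sketch. It rests on assumptions that are not in the excerpt: the class ShardRoutingSketch is hypothetical, shard ids are plain integers, and String.hashCode() stands in for the Murmur3 hash that ES applies to the routing value. What it does show is the shape of the computation: hash the routing (falling back to the doc id), take it modulo the number of primary shards, and bucket bulk items by the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of routing a document to a shard and grouping bulk items per shard.
final class ShardRoutingSketch {
    // Assumption: String.hashCode() replaces ES's Murmur3 hash purely for illustration.
    static int shardFor(String id, String routing, int numShards) {
        String effectiveRouting = routing != null ? routing : id; // doc id is the default routing
        return Math.floorMod(effectiveRouting.hashCode(), numShards);
    }

    // ids.get(i) and routings.get(i) describe bulk item i; the result maps
    // shard number -> positions of the items that must be written to that shard.
    static Map<Integer, List<Integer>> groupByShard(List<String> ids, List<String> routings, int numShards) {
        Map<Integer, List<Integer>> requestsByShard = new TreeMap<>();
        for (int i = 0; i < ids.size(); i++) {
            int shard = shardFor(ids.get(i), routings.get(i), numShards);
            requestsByShard.computeIfAbsent(shard, s -> new ArrayList<>()).add(i);
        }
        return requestsByShard;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("doc-1", "doc-2", "doc-3");
        List<String> routings = Arrays.asList(null, "user-7", "user-7");
        // Items 1 and 2 share a routing value, so they always land in the same bucket.
        System.out.println(groupByShard(ids, routings, 5));
    }
}
```

Keeping the item position (rather than the item itself) in each bucket matches the BulkItemRequest(i, request) call above: the position lets the per-shard responses be put back in the original bulk order.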

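The previous step established which doc write requests each shard must execute, which in effect splits the bulk request by shard. Next, all requests for a shard are wrapped in a BulkShardRequest and handed to TransportShardBulkAction: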

for (Map.Entry<ShardId, List<BulkItemRequest>> entry : requestsByShard.entrySet()) {
    final ShardId shardId = entry.getKey();
    final List<BulkItemRequest> requests = entry.getValue();
    // For each shard id and its list of BulkItemRequests, build one BulkShardRequest
    BulkShardRequest bulkShardRequest = new BulkShardRequest(shardId, bulkRequest.getRefreshPolicy(),
            requests.toArray(new BulkItemRequest[requests.size()]));
    shardBulkAction.execute(bulkShardRequest, new ActionListener<BulkShardResponse>() {
       ... ...
    });
}

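Execution eventually reaches another doRun method (the reroute phase of TransportReplicationAction, not the BulkOperation.doRun above). It uses the ClusterState to obtain the routing entry of the primary shard and, from that, the node hosting the primary. If that node is the current coordinating node, the request is executed locally; otherwise it is sent to the remote node: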

protected void doRun() {
    ......
    final ShardRouting primary = primary(state); // get the routing entry of the primary shard
    ... ...
    // Find the node that hosts the primary
    final DiscoveryNode node = state.nodes().get(primary.currentNodeId());
    if (primary.currentNodeId().equals(state.nodes().getLocalNodeId())) {
        // The primary is on the local node, so execute locally
        performLocalAction(state, primary, node, indexMetaData);
    } else {
        // Otherwise send the request to the remote node
        performRemoteAction(state, primary, node);
    }
}

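performAction calls TransportService's sendRequest to send the request. If the remote side returns an exception, for example because the target node has failed or the primary shard is down, the coordinating node retries. The retry waits for a newer cluster state and then re-runs the doRun logic above against it (the new state carries the new primary shard information). If the wait for a cluster state update times out, one final retry (another doRun) is performed. The code is as follows: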

void retry(Exception failure) {
    assert failure != null;
    if (observer.isTimedOut()) {
        // The final attempt was already made when the wait timed out, so do not retry again
        finishAsFailed(failure);
        return;
    }
    setPhase(task, "waiting_for_retry");
    request.onRetry();
    request.primaryTerm(0L);
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            run(); // calls doRun
        }
        @Override
        public void onClusterServiceClose() {
            finishAsFailed(new NodeClosedException(clusterService.localNode()));
        }
        @Override
        public void onTimeout(TimeValue timeout) { // timed out: make one last attempt
            run();  // calls doRun
        }
    });
}
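
The retry policy above (retry on every new cluster state, plus exactly one last attempt after a timeout) is easier to see in a synchronous simulation. Everything in this sketch is an assumption made for illustration: the class RetrySketch is hypothetical, successive cluster states are reduced to a list of primary node ids, node failure is a set lookup, and "no further state in the list" plays the role of the observer timing out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, synchronous simulation of the reroute-and-retry policy.
final class RetrySketch {
    // primaryPerState.get(v) is the node holding the primary in cluster state version v.
    static String sendWithRetry(List<String> primaryPerState, Set<String> deadNodes) {
        int version = 0;
        boolean timedOut = false;
        while (true) {
            String primaryNode = primaryPerState.get(version);   // re-resolve the primary each attempt
            if (!deadNodes.contains(primaryNode)) {
                return "written on " + primaryNode;
            }
            if (timedOut) {
                return "failed";                                 // the last attempt was already made
            }
            if (version + 1 < primaryPerState.size()) {
                version++;                                       // a new cluster state arrived
            } else {
                timedOut = true;                                 // no new state: one final attempt
            }
        }
    }

    public static void main(String[] args) {
        Set<String> dead = new HashSet<>(Arrays.asList("node3"));
        // The primary moves from the failed node3 to node1 in the next cluster state.
        System.out.println(sendWithRetry(Arrays.asList("node3", "node1"), dead));
        // No newer state ever arrives, so the request fails after the final attempt.
        System.out.println(sendWithRetry(Arrays.asList("node3"), dead));
    }
}
```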

3.2.3 Primary node

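When the node holding the primary receives the write request from the coordinating node, the actual write begins. The entry point is the execute method of the ReplicationOperation class, which performs two key steps: it first writes to the primary shard and, if that succeeds, sends the write request to the nodes holding the replica shards.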

public void execute() throws Exception {
    ......
    // Key step: write to the primary shard
    primaryResult = primary.perform(request);
    final ReplicaRequest replicaRequest = primaryResult.replicaRequest();
    if (replicaRequest != null) {
        ......
        // Key step: after the primary is written, forward the request to the replicas
        performOnReplicas(replicaRequest, shards);
    }
    successfulShards.incrementAndGet();
    decPendingAndFinishIfNeeded();
}

Now let's look at the key code for writing the primary. The entry function is TransportShardBulkAction.shardOperationOnPrimary:

public WritePrimaryResult<BulkShardRequest, BulkShardResponse> shardOperationOnPrimary(
            BulkShardRequest request, IndexShard primary) throws Exception {
        ... ...
        Translog.Location location = null;
        for (int requestIndex = 0; requestIndex < request.items().length; requestIndex++) {
            if (isAborted(request.items()[requestIndex].getPrimaryResponse()) == false) {
                location = executeBulkItemRequest(metaData, primary, request, preVersions, preVersionTypes, location, requestIndex);
            }
        }
      ... ...
  }

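When writing the primary, ES iterates over the items of the bulk request and executes each write in turn. It calls InternalEngine.index to write the data into Lucene and then appends the whole write operation to the translog, as shown below: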

final IndexResult indexResult;
if (plan.earlyResultOnPreFlightError.isPresent()) {
    indexResult = plan.earlyResultOnPreFlightError.get();
    assert indexResult.hasFailure();
} else if (plan.indexIntoLucene) {
    // Write the data into Lucene; this ends up calling Lucene's document indexing API
    indexResult = indexIntoLucene(index, plan);
} else {
    assert index.origin() != Operation.Origin.PRIMARY;
    indexResult = new IndexResult(plan.versionForIndexing, plan.currentNotFoundOrDeleted);
}
if (indexResult.hasFailure() == false &&
    plan.indexIntoLucene && // if we didn't store it in lucene, there is no need to store it in the translog
    index.origin() != Operation.Origin.LOCAL_TRANSLOG_RECOVERY) {
    Translog.Location location =
        translog.add(new Translog.Index(index, indexResult)); // append to the translog
    indexResult.setTranslogLocation(location);
}

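As the code shows, an ES write goes to Lucene first: the data is written into Lucene's in-memory buffer and only then appended to the translog. This differs from a traditional WAL, which writes the log first and memory second. The likely reason is that Lucene performs further checks on the data while indexing, so a write to Lucene can fail. If the translog were written first, ES would have to deal with operations that were logged successfully but can never be applied to Lucene, so it writes to Lucene first.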
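
The consequence of this ordering can be shown with a toy engine. The sketch below is an assumption-laden illustration, not ES code: WriteOrderSketch is a hypothetical class, a HashMap stands in for Lucene's in-memory buffer, a list of strings stands in for the translog, and "empty source is rejected" is an invented validation rule. The point is only that an operation rejected by the index never reaches the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy engine showing "index first, log second".
final class WriteOrderSketch {
    final Map<String, String> index = new HashMap<>();   // stand-in for Lucene's in-memory buffer
    final List<String> translog = new ArrayList<>();     // stand-in for the translog

    // Returns true when the document was indexed and logged.
    boolean indexDoc(String id, String source) {
        try {
            indexIntoLucene(id, source);                  // step 1: may reject the document
        } catch (IllegalArgumentException e) {
            return false;                                 // rejected: nothing is appended to the log
        }
        translog.add("index " + id + " " + source);       // step 2: log only what was applied
        return true;
    }

    // Invented validation rule: an empty source is treated as an indexing failure.
    private void indexIntoLucene(String id, String source) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("document rejected by the index");
        }
        index.put(id, source);
    }

    public static void main(String[] args) {
        WriteOrderSketch engine = new WriteOrderSketch();
        System.out.println(engine.indexDoc("1", "{\"field\":\"value\"}"));
        System.out.println(engine.indexDoc("2", ""));
        System.out.println(engine.translog);
    }
}
```

With the order reversed, the rejected operation would already be in the log, and replaying the log during recovery would hit the same failure again.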

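After the primary has been written, the request is forwarded to the replicas. If a replica shard is unassigned, it is skipped. Otherwise the request is sent to the replica shard and, if that shard is relocating to another node, also to the relocation target. The copy on the local node, which is the primary that has just been written, is not sent to again. The code is as follows: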

private void performOnReplicas(ReplicaRequest replicaRequest, List<ShardRouting> shards) {
    final String localNodeId = primary.routingEntry().currentNodeId();
    // If the index gets deleted after primary operation, we skip replication
    for (final ShardRouting shard : shards) {
        if (executeOnReplicas == false || shard.unassigned()) {
            if (shard.primary() == false) {
                totalShards.incrementAndGet();
            }
            continue;
        }
        if (shard.currentNodeId().equals(localNodeId) == false) {
            performOnReplica(shard, replicaRequest);
        }
        if (shard.relocating() && shard.relocatingNodeId().equals(localNodeId) == false) {
            performOnReplica(shard.getTargetRelocatingShard(), replicaRequest);
        }
    }
}
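
The target selection in performOnReplicas can be reduced to a pure function, which makes the three cases (unassigned, normal, relocating) easy to check. This is a sketch under assumptions: ReplicaTargetsSketch and Copy are hypothetical names, a shard copy is described only by two node ids, and the executeOnReplicas flag and shard counters from the excerpt are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reduction of the replica target selection to a pure function.
final class ReplicaTargetsSketch {
    static final class Copy {
        final String currentNodeId;     // null when the copy is unassigned
        final String relocatingNodeId;  // null when the copy is not relocating

        Copy(String currentNodeId, String relocatingNodeId) {
            this.currentNodeId = currentNodeId;
            this.relocatingNodeId = relocatingNodeId;
        }
    }

    // Returns the node ids that should receive the replica request, in order.
    static List<String> targets(String localNodeId, List<Copy> copies) {
        List<String> targets = new ArrayList<>();
        for (Copy copy : copies) {
            if (copy.currentNodeId == null) {
                continue;                                   // unassigned copy: skip
            }
            if (!copy.currentNodeId.equals(localNodeId)) {
                targets.add(copy.currentNodeId);            // ordinary copy on another node
            }
            if (copy.relocatingNodeId != null && !copy.relocatingNodeId.equals(localNodeId)) {
                targets.add(copy.relocatingNodeId);         // relocation target also gets the write
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Copy> copies = Arrays.asList(
            new Copy("node3", null),     // the primary itself, on the local node
            new Copy("node1", null),     // a started replica
            new Copy("node2", "node4"),  // a replica relocating from node2 to node4
            new Copy(null, null));       // an unassigned replica
        System.out.println(targets("node3", copies));
    }
}
```

Sending to both the source and the target of a relocation means the target does not miss writes that arrive while the move is in progress.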

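performOnReplica forwards the request to the target node. If an exception occurs, such as the remote node going down or the shard write failing, the primary treats that replica shard as failed and unavailable, reports it to the master, and has the replica removed. Shard-not-available exceptions are the one case that is simply ignored. The code is as follows: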

private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest) {
    
    totalShards.incrementAndGet();
    pendingActions.incrementAndGet();
    replicasProxy.performOn(shard, replicaRequest, new ActionListener<TransportResponse.Empty>() {
        @Override
        public void onResponse(TransportResponse.Empty empty) {
            successfulShards.incrementAndGet();
            decPendingAndFinishIfNeeded();
        }
        @Override
        public void onFailure(Exception replicaException) {
            if (TransportActions.isShardNotAvailableException(replicaException)) {
                decPendingAndFinishIfNeeded();
            } else {
                RestStatus restStatus = ExceptionsHelper.status(replicaException);
                shardReplicaFailures.add(new ReplicationResponse.ShardInfo.Failure(
                    shard.shardId(), shard.currentNodeId(), replicaException, restStatus, false));
                replicasProxy.failShard(shard, message, replicaException,
                    ReplicationOperation.this::decPendingAndFinishIfNeeded,
                    ReplicationOperation.this::onPrimaryDemoted,
                    throwable -> decPendingAndFinishIfNeeded()
                );
            }
        }
    });
}
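
The pendingActions and successfulShards counters used above implement a simple "finish when the last response arrives" scheme. The sketch below isolates that scheme. It is a simplification with hypothetical names (PendingCounterSketch, replicaDispatched): the counter starts at one to represent the primary's own pending completion, failures are not reported anywhere, and finishing just flips a flag instead of answering a listener.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical isolation of the pending-counter scheme used to detect completion.
final class PendingCounterSketch {
    private final AtomicInteger pending = new AtomicInteger(1);   // 1 = the primary's own step
    private final AtomicInteger successful = new AtomicInteger();
    private final AtomicBoolean finished = new AtomicBoolean();

    // Called once per replica request that is sent out.
    void replicaDispatched() {
        pending.incrementAndGet();
    }

    // Called when the primary or a replica reports success.
    void onSuccess() {
        successful.incrementAndGet();
        decPendingAndFinishIfNeeded();
    }

    // Called when a replica reports failure; the operation still has to be counted down.
    void onFailure() {
        decPendingAndFinishIfNeeded();
    }

    private void decPendingAndFinishIfNeeded() {
        if (pending.decrementAndGet() == 0) {
            finished.set(true);   // the last outstanding response has arrived
        }
    }

    boolean isFinished() {
        return finished.get();
    }

    int successfulShards() {
        return successful.get();
    }

    public static void main(String[] args) {
        PendingCounterSketch op = new PendingCounterSketch();
        op.replicaDispatched();   // replica A
        op.replicaDispatched();   // replica B
        op.onSuccess();           // primary done
        op.onSuccess();           // replica A done
        System.out.println("finished before last replica: " + op.isFinished());
        op.onFailure();           // replica B failed
        System.out.println("finished after last replica: " + op.isFinished());
        System.out.println("successful shards: " + op.successfulShards());
    }
}
```

Because the operation finishes only when the counter reaches zero, the response time seen by the client is bounded below by the slowest copy to respond.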

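The write logic on a replica is similar to that on the primary and is not covered here. To avoid losing data if the primary fails, ES waits for all replicas to finish writing before returning the result to the client, so write latency is determined by the slowest replica. This completes the walkthrough of the ES write path.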

4. Summary

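This article analyzed the main flow of a write through ES's distributed framework and left many details unexplored. Later articles will look at those details more closely. Discussion is welcome.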
