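1. Introduction
Elasticsearch (ES) is a near-real-time distributed storage, search, and analytics system built on Lucene. It is used for log analysis, full-text search, structured data analysis, and more, and can serve both as a NoSQL database and as a search engine. These capabilities have led to wide adoption, including at Wikipedia, GitHub, and Stack Overflow.
For writes, the main concerns are how quickly written data becomes available and how reliably it is stored. This article walks through the source code to trace the ES write path.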
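2. Distributed write flow
The ES write model is based on Microsoft's PacificA protocol. A write must complete on the primary shard before it is replicated to the corresponding replica shards, as shown in the figure below:
A write typically passes through three kinds of nodes: the coordinating node, the node holding the primary shard, and the nodes holding the replica shards. In the figure, NODE1 acts as the coordinating node. On receiving the request, it determines that the document belongs to shard 0 and forwards the request to NODE3, which holds the primary of shard 0. After NODE3 completes the write, it forwards the request to NODE1 and NODE2, which hold the replicas of shard 0. Once all replicas have written successfully, NODE3 considers the whole write successful and reports the result to the coordinating node, which returns it to the client.
That is the overall flow. The sections below go through the details with the source code.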
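3. Write path source analysis
ES supports two kinds of writes: single-document writes (index) and multi-document batch writes (bulk). ES converts both into bulk writes internally, so this section follows the main execution path of a bulk write.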
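3.1 Dispatching the bulk request
ES generally handles a user request in two layers: the Rest layer and the Transport layer. The Rest layer mainly parses request parameters, while the Transport layer does the actual request processing. Each layer is preceded by a dispatch step, as shown in the figure below:
An HTTP request from the client is first handled by HttpServerTransport and then enters RestController, which does the actual dispatching: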
    public void dispatchRequest(RestRequest request, RestChannel channel, ThreadContext threadContext) {
        if (request.rawPath().equals("/favicon.ico")) {
            handleFavicon(request, channel);
            return;
        }
        RestChannel responseChannel = channel;
        try {
            final int contentLength = request.hasContent() ? request.content().length() : 0;
            assert contentLength >= 0 : "content length was negative, how is that possible?";
            // look up the handler registered for this request
            final RestHandler handler = getHandler(request);
            ... ...
    }

    void dispatchRequest(final RestRequest request, final RestChannel channel, final NodeClient client,
                         ThreadContext threadContext, final RestHandler handler) throws Exception {
        ... ...
            final RestHandler wrappedHandler = Objects.requireNonNull(handlerWrapper.apply(handler));
            // hand the request to the handler
            wrappedHandler.handleRequest(request, channel, client);
        ... ...
        }
    }
As the code shows, the first dispatchRequest finds the handler for the request, and the second dispatchRequest calls the handler's handleRequest method to process it. How does getHandler find the handler for a request? The logic is as follows:
    public void registerHandler(RestRequest.Method method, String path, RestHandler handler) {
        PathTrie<RestHandler> handlers = getHandlersForMethod(method);
        if (handlers != null) {
            handlers.insert(path, handler);
        } else {
            throw new IllegalArgumentException("Can't handle [" + method + "] for path [" + path + "]");
        }
    }

    private RestHandler getHandler(RestRequest request) {
        String path = getPath(request);
        PathTrie<RestHandler> handlers = getHandlersForMethod(request.method());
        if (handlers != null) {
            return handlers.retrieve(path, request.params());
        } else {
            return null;
        }
    }
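ES registers handlers in advance through RestController's registerHandler method, storing each one in the handlers structure (a PathTrie) kept for its HTTP method (GET, PUT, POST, DELETE, and so on). When a request arrives, RestController's getHandler uses the HTTP method and path to retrieve the matching handler. For bulk operations, the handler is RestBulkAction, which registers itself with RestController in its constructor: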
    public RestBulkAction(Settings settings, RestController controller) {
        super(settings);
        controller.registerHandler(POST, "/_bulk", this);
        controller.registerHandler(PUT, "/_bulk", this);
        controller.registerHandler(POST, "/{index}/_bulk", this);
        controller.registerHandler(PUT, "/{index}/_bulk", this);
        controller.registerHandler(POST, "/{index}/{type}/_bulk", this);
        controller.registerHandler(PUT, "/{index}/{type}/_bulk", this);
        this.allowExplicitIndex = MULTI_ALLOW_EXPLICIT_INDEX.get(settings);
    }
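The registration and lookup described above can be illustrated with a small standalone sketch. This is not the RestController implementation: HandlerRegistrySketch, Route, and the string handler names are hypothetical, and a linear scan over path templates stands in for the PathTrie. It only shows the idea of keeping handlers per HTTP method and resolving {placeholder} segments into request parameters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for per-method handler registration and lookup.
class HandlerRegistrySketch {
    // One route: a path template split into segments, plus a handler name.
    private static final class Route {
        final String[] segments;
        final String handler;

        Route(String[] segments, String handler) {
            this.segments = segments;
            this.handler = handler;
        }
    }

    // HTTP method -> registered routes (assumption: a list replaces the PathTrie).
    private final Map<String, List<Route>> routesByMethod = new HashMap<>();

    void registerHandler(String method, String template, String handler) {
        routesByMethod.computeIfAbsent(method, m -> new ArrayList<>())
                .add(new Route(split(template), handler));
    }

    // Returns the handler name or null; copies {placeholder} values into params on a match.
    String getHandler(String method, String path, Map<String, String> params) {
        String[] parts = split(path);
        for (Route route : routesByMethod.getOrDefault(method, new ArrayList<>())) {
            Map<String, String> captured = new HashMap<>();
            if (matches(route.segments, parts, captured)) {
                params.putAll(captured);
                return route.handler;
            }
        }
        return null;
    }

    private static boolean matches(String[] template, String[] parts, Map<String, String> captured) {
        if (template.length != parts.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            String t = template[i];
            if (t.startsWith("{") && t.endsWith("}")) {
                captured.put(t.substring(1, t.length() - 1), parts[i]); // wildcard segment
            } else if (!t.equals(parts[i])) {
                return false;                                           // literal mismatch
            }
        }
        return true;
    }

    private static String[] split(String path) {
        // Drop the leading slash so "/a/b" becomes ["a", "b"].
        return (path.startsWith("/") ? path.substring(1) : path).split("/");
    }
}
```

With "/_bulk" and "/{index}/_bulk" registered for POST, a lookup for POST "/logs/_bulk" returns the bulk handler and records index=logs in the parameter map, while a GET on the same path finds nothing.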
RestBulkAction parses the RestRequest into a BulkRequest and then processes the BulkRequest. This logic is in the prepareRequest method; part of the code is shown below:
    public RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) throws IOException {
        // build a BulkRequest from the RestRequest
        ... ...
        // process the bulkRequest
        return channel -> client.bulk(bulkRequest, new RestStatusToXContentListener<>(channel));
    }
When NodeClient handles the BulkRequest, it maps the request's action to the corresponding Transport-layer action, which then processes the BulkRequest. The mapping code is:
    public <Request extends ActionRequest, Response extends ActionResponse>
    Task executeLocally(GenericAction<Request, Response> action, Request request, TaskListener<Response> listener) {
        return transportAction(action).execute(request, listener);
    }

    private <Request extends ActionRequest, Response extends ActionResponse>
    TransportAction<Request, Response> transportAction(GenericAction<Request, Response> action) {
        ... ...
        // actions maps each action to its TransportAction; the mapping is initialized at node startup
        TransportAction<Request, Response> transportAction = actions.get(action);
        ... ...
        return transportAction;
    }
TransportAction runs the request through a filter chain. If a plugin defines a filter for this action, the plugin's logic runs first, and then TransportAction's own logic runs. The filter chain logic is:
    public void proceed(Task task, String actionName, Request request, ActionListener<Response> listener) {
        int i = index.getAndIncrement();
        try {
            if (i < this.action.filters.length) {
                // apply the plugin's filter logic
                this.action.filters[i].apply(task, actionName, request, listener, this);
            } else if (i == this.action.filters.length) {
                // run the TransportAction's own logic
                this.action.doExecute(task, request, listener);
            } else ... ...
        } catch (Exception e) {
            ... ...
        }
    }
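The counter-driven chain above can be reproduced in a few lines. The following is a minimal sketch under stated assumptions: FilterChainSketch, Filter, Chain, and the "auth" and "audit" filters are hypothetical names, and a list of strings records the call order in place of real request handling. What it shows is that each filter decides when to call proceed, and that the action's own logic runs only after the last filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class FilterChainSketch {
    // Hypothetical filter: does its work, then hands control back to the chain.
    interface Filter {
        void apply(String request, List<String> log, Chain chain);
    }

    static final class Chain {
        private final List<Filter> filters;
        private final AtomicInteger index = new AtomicInteger();

        Chain(List<Filter> filters) {
            this.filters = filters;
        }

        void proceed(String request, List<String> log) {
            int i = index.getAndIncrement();
            if (i < filters.size()) {
                filters.get(i).apply(request, log, this); // next filter in line
            } else if (i == filters.size()) {
                log.add("doExecute:" + request);          // the action's own logic
            } else {
                // Assumption: this sketch simply throws if the chain is advanced too far.
                throw new IllegalStateException("chain advanced past its end");
            }
        }
    }

    // Runs a request through two example filters and returns the call order.
    static List<String> run(String request) {
        List<Filter> filters = new ArrayList<>();
        filters.add((req, log, chain) -> { log.add("auth"); chain.proceed(req, log); });
        filters.add((req, log, chain) -> { log.add("audit"); chain.proceed(req, log); });
        List<String> callOrder = new ArrayList<>();
        new Chain(filters).proceed(request, callOrder);
        return callOrder;
    }
}
```

Calling FilterChainSketch.run("bulk") records "auth", then "audit", then "doExecute:bulk", which mirrors the order of plugin filters followed by doExecute.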
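For bulk requests, the TransportAction here is an instance of TransportBulkAction. This completes the handoff from the Rest layer to the Transport layer. The next section covers TransportBulkAction's processing logic in detail.
3.2 Write steps
3.2.1 Creating the index
If the target index does not exist at bulk write time, ES creates it automatically. The logic is in TransportBulkAction's doExecute method: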
    for (String index : indices) {
        boolean shouldAutoCreate;
        try {
            shouldAutoCreate = shouldAutoCreate(index, state);
        } catch (IndexNotFoundException e) {
            shouldAutoCreate = false;
            indicesThatCannotBeCreated.put(index, e);
        }
        if (shouldAutoCreate) {
            autoCreateIndices.add(index);
        }
    }
    ... ...
    for (String index : autoCreateIndices) {
        createIndex(index, bulkRequest.timeout(), new ActionListener<CreateIndexResponse>() {
            ... ...
    }
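The first for loop iterates over all indices in the bulk request and checks whether each one needs to be auto-created. Indices that do not exist and may be auto-created are added to the auto-create set, and createIndex is then called for each of them. Index creation is controlled by the master: it uses the shard allocation and balancing algorithms to decide on which data nodes to create the index's shards, then publishes this information to the data nodes, which perform the actual creation. The detailed index creation flow will be analyzed in a later article and is not covered here.
3.2.2 Coordinating node: processing and forwarding the request
Once the index is created, its shards exist on the data nodes, and the coordinating node forwards each write request to the primary shard that owns the document. The entry point for forwarding a bulk request on the coordinating node is the executeBulk method: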
    void executeBulk(Task task, final BulkRequest bulkRequest, final long startTimeNanos,
                     final ActionListener<BulkResponse> listener, final AtomicArray<BulkItemResponse> responses,
                     Map<String, IndexNotFoundException> indicesThatCannotBeCreated) {
        new BulkOperation(task, bulkRequest, listener, responses, startTimeNanos, indicesThatCannotBeCreated).run();
    }
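The real work happens in BulkOperation's doRun method. It first iterates over all sub-requests in the BulkRequest and runs logic according to each request's operation type. For write requests, it uses the IndexMetaData to resolve routing information for each IndexRequest, and generates the _id field during process if needed: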
    for (int i = 0; i < bulkRequest.requests.size(); i++) {
        DocWriteRequest docWriteRequest = bulkRequest.requests.get(i);
        ... ...
        Index concreteIndex = concreteIndices.resolveIfAbsent(docWriteRequest);
        try {
            switch (docWriteRequest.opType()) {
                case CREATE:
                case INDEX:
                    ... ...
                    // set the routing of indexRequest based on metaData
                    indexRequest.resolveRouting(metaData);
                    // if the user did not specify a doc id, one is generated here
                    indexRequest.process(mappingMd, allowIdGeneration, concreteIndex.getName());
                    break;
                ... ...
            }
        } catch (... ...) {
            ... ...
        }
    }
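Next, the routing information of each IndexRequest determines the target shard id (if no routing is specified at write time, ES uses the doc id as the routing value by default). Each DocWriteRequest is wrapped in a BulkItemRequest and added to the request list for that shardId: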
    for (int i = 0; i < bulkRequest.requests.size(); i++) {
        // get each doc write request from the bulk request
        DocWriteRequest request = bulkRequest.requests.get(i);
        // use the routing to find the target shard id for this doc
        ShardId shardId = clusterService.operationRouting()
            .indexShards(clusterState, concreteIndex, request.id(), request.routing()).shardId();
        // requestsByShard: key is the shard id, value is the list of single-doc write requests
        // (each wrapped in a BulkItemRequest) for that shard
        List<BulkItemRequest> shardRequests = requestsByShard.computeIfAbsent(shardId, shard -> new ArrayList<>());
        shardRequests.add(new BulkItemRequest(i, request));
    }
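The previous step maps each shard to the list of doc write requests it must execute, which effectively splits the bulk request by shard. Next, all requests for a shard are wrapped in a BulkShardRequest and handed to TransportShardBulkAction: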
    for (Map.Entry<ShardId, List<BulkItemRequest>> entry : requestsByShard.entrySet()) {
        final ShardId shardId = entry.getKey();
        final List<BulkItemRequest> requests = entry.getValue();
        // wrap each shard id and its BulkItemRequest list in one BulkShardRequest
        BulkShardRequest bulkShardRequest = new BulkShardRequest(shardId, bulkRequest.getRefreshPolicy(),
            requests.toArray(new BulkItemRequest[requests.size()]));
        shardBulkAction.execute(bulkShardRequest, new ActionListener<BulkShardResponse>() {
            ... ...
        });
    }
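The split-by-shard step can be sketched in isolation. This is a minimal sketch, not the OperationRouting code: ShardGroupingSketch, shardFor, and groupByShard are hypothetical names, and String.hashCode is an assumed stand-in for the hash function Elasticsearch actually uses. It shows the two ideas from the text: the doc id is used when no routing is given, and items are grouped by the resulting shard number while keeping their position in the bulk.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShardGroupingSketch {
    // Assumption: String.hashCode replaces the real routing hash for illustration only.
    static int shardFor(String id, String routing, int numberOfShards) {
        String effectiveRouting = routing != null ? routing : id; // default routing is the doc id
        return Math.floorMod(effectiveRouting.hashCode(), numberOfShards);
    }

    // ids[i] and routings[i] describe item i of the bulk; the result maps shard -> item positions.
    static Map<Integer, List<Integer>> groupByShard(String[] ids, String[] routings, int numberOfShards) {
        Map<Integer, List<Integer>> requestsByShard = new LinkedHashMap<>();
        for (int i = 0; i < ids.length; i++) {
            int shard = shardFor(ids[i], routings[i], numberOfShards);
            requestsByShard.computeIfAbsent(shard, s -> new ArrayList<>()).add(i);
        }
        return requestsByShard;
    }
}
```

Two items that share a routing value always land in the same group, whatever their ids are, which is why a custom routing value can be used to co-locate related documents on one shard.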
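Execution eventually reaches the doRun method of the reroute phase (ReroutePhase in TransportReplicationAction). Here the primary shard's routing information is obtained from ClusterState, which gives the node holding the primary shard. If that node is the current coordinating node, the request is executed locally; otherwise it is sent to the remote node: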
    protected void doRun() {
        ......
        // get the routing information of the primary shard
        final ShardRouting primary = primary(state);
        ... ...
        // get the node that holds the primary
        final DiscoveryNode node = state.nodes().get(primary.currentNodeId());
        if (primary.currentNodeId().equals(state.nodes().getLocalNodeId())) {
            // the primary is on the local node, so execute locally
            performLocalAction(state, primary, node, indexMetaData);
        } else {
            // otherwise, send the request to the remote node
            performRemoteAction(state, primary, node);
        }
    }
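The performAction method calls TransportService's sendRequest to send the request. If the remote side returns an exception, for example because the remote node failed or the primary shard went down, the coordinating node retries: it waits for a new cluster state and then re-runs the doRun logic above against it (the new cluster state carries the new primary shard information). If waiting for the cluster state update times out, one last retry is performed (doRun again). The code is: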
    void retry(Exception failure) {
        assert failure != null;
        if (observer.isTimedOut()) {
            // the last attempt was already made on timeout, so do not retry again
            finishAsFailed(failure);
            return;
        }
        setPhase(task, "waiting_for_retry");
        request.onRetry();
        request.primaryTerm(0L);
        observer.waitForNextChange(new ClusterStateObserver.Listener() {
            @Override
            public void onNewClusterState(ClusterState state) {
                run(); // calls doRun
            }

            @Override
            public void onClusterServiceClose() {
                finishAsFailed(new NodeClosedException(clusterService.localNode()));
            }

            @Override
            public void onTimeout(TimeValue timeout) {
                // timed out: make one last retry
                run(); // calls doRun
            }
        });
    }
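The retry policy can be modelled synchronously to make the control flow easier to follow. This is a minimal sketch with loud assumptions: RetrySketch and execute are hypothetical, cluster states are plain strings, an empty queue of future states stands in for the observer timing out, and the real code is asynchronous and listener-based. It only shows the rule "retry on every new cluster state, and retry exactly once more after a timeout".

```java
import java.util.Deque;
import java.util.function.Predicate;

class RetrySketch {
    // attempt returns true when the operation succeeds against the given cluster state.
    static String execute(Deque<String> futureStates, Predicate<String> attempt, String initialState) {
        String state = initialState;
        boolean timedOut = false;
        while (true) {
            if (attempt.test(state)) {
                return "success on " + state;
            }
            if (timedOut) {
                return "failed";             // the one retry after the timeout is used up
            }
            if (futureStates.isEmpty()) {
                timedOut = true;             // no new state arrived in time: one last retry
            } else {
                state = futureStates.poll(); // a new cluster state arrived: retry against it
            }
        }
    }
}
```

If the operation only succeeds on the third state, the sketch retries twice and succeeds; if no new state ever arrives and the operation keeps failing, it makes exactly two attempts and then reports failure.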
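3.2.3 Primary node
When the node holding the primary receives the write request from the coordinating node, the actual write begins. The entry point is the execute method of the ReplicationOperation class, which performs two key steps: it first writes the primary shard and, if that succeeds, sends the write request to the nodes holding the replica shards.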
    public void execute() throws Exception {
        ......
        // key step: write the primary shard
        primaryResult = primary.perform(request);
        final ReplicaRequest replicaRequest = primaryResult.replicaRequest();
        if (replicaRequest != null) {
            ......
            // key step: after the primary is written, forward the request to the replicas
            performOnReplicas(replicaRequest, shards);
        }
        successfulShards.incrementAndGet();
        decPendingAndFinishIfNeeded();
    }
Now for the key code that writes the primary. The entry function is TransportShardBulkAction.shardOperationOnPrimary:
    public WritePrimaryResult<BulkShardRequest, BulkShardResponse> shardOperationOnPrimary(
            BulkShardRequest request, IndexShard primary) throws Exception {
        ... ...
        Translog.Location location = null;
        for (int requestIndex = 0; requestIndex < request.items().length; requestIndex++) {
            if (isAborted(request.items()[requestIndex].getPrimaryResponse()) == false) {
                location = executeBulkItemRequest(metaData, primary, request, preVersions, preVersionTypes,
                    location, requestIndex);
            }
        }
        ... ...
    }
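On the primary, the bulk task is traversed and each write request is executed in turn. ES calls InternalEngine.index to write the data into Lucene and then appends the whole write operation to the translog, as shown below: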
    final IndexResult indexResult;
    if (plan.earlyResultOnPreFlightError.isPresent()) {
        indexResult = plan.earlyResultOnPreFlightError.get();
        assert indexResult.hasFailure();
    } else if (plan.indexIntoLucene) {
        // write the data into Lucene; this ends up calling Lucene's document write API
        indexResult = indexIntoLucene(index, plan);
    } else {
        assert index.origin() != Operation.Origin.PRIMARY;
        indexResult = new IndexResult(plan.versionForIndexing, plan.currentNotFoundOrDeleted);
    }
    if (indexResult.hasFailure() == false &&
        plan.indexIntoLucene && // if we didn't store it in lucene, there is no need to store it in the translog
        index.origin() != Operation.Origin.LOCAL_TRANSLOG_RECOVERY) {
        // write the translog
        Translog.Location location = translog.add(new Translog.Index(index, indexResult));
        indexResult.setTranslogLocation(location);
    }
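As the code shows, ES writes to Lucene first and writes the translog only after the data is in Lucene's memory, which differs from traditional WAL, where the log is written before memory. The main reason is probably that Lucene performs further checks on the data as it is written, so the Lucene write can fail. If the translog were written first, ES would have to handle operations that were logged successfully but keep failing in Lucene, so it writes Lucene first.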
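The ordering argument can be made concrete with a toy model. This is a minimal sketch, not engine code: WriteOrderSketch, its HashMap "index", its list "translog", and the empty-document check are all hypothetical stand-ins. It shows why indexing first keeps the log clean: a document that the indexing step rejects never reaches the log, so there is nothing to undo or skip during replay.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WriteOrderSketch {
    final Map<String, String> index = new HashMap<>(); // stands in for Lucene's in-memory buffer
    final List<String> translog = new ArrayList<>();   // stands in for the translog

    // Returns true if the document was indexed and then logged.
    boolean write(String id, String source) {
        try {
            indexIntoStore(id, source);                // step 1: index first; this may reject the doc
        } catch (IllegalArgumentException e) {
            return false;                              // nothing was logged, so nothing to clean up
        }
        translog.add("index " + id + " " + source);    // step 2: log only operations that succeeded
        return true;
    }

    // Assumption: rejecting empty documents models the checks done at indexing time.
    private void indexIntoStore(String id, String source) {
        if (source == null || source.isEmpty()) {
            throw new IllegalArgumentException("empty document: " + id);
        }
        index.put(id, source);
    }
}
```

With the opposite order, the rejected document would already be in the log, and every replay of that log would hit the same failure.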
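After writing the primary, ES writes the replicas by forwarding the request to the nodes that hold them. If a replica shard is unassigned, it is skipped. Otherwise the request is forwarded to the replica shard, and if that shard is relocating to another node, the request is also forwarded to the relocation target. The code is: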
    private void performOnReplicas(ReplicaRequest replicaRequest, List<ShardRouting> shards) {
        final String localNodeId = primary.routingEntry().currentNodeId();
        // If the index gets deleted after primary operation, we skip replication
        for (final ShardRouting shard : shards) {
            if (executeOnReplicas == false || shard.unassigned()) {
                if (shard.primary() == false) {
                    totalShards.incrementAndGet();
                }
                continue;
            }
            if (shard.currentNodeId().equals(localNodeId) == false) {
                performOnReplica(shard, replicaRequest);
            }
            if (shard.relocating() && shard.relocatingNodeId().equals(localNodeId) == false) {
                performOnReplica(shard.getTargetRelocatingShard(), replicaRequest);
            }
        }
    }
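The target selection in this loop can be isolated into a small function. This is a minimal sketch: ReplicaTargetsSketch, Copy, and targetNodes are hypothetical, and a shard copy is reduced to two node ids (null meaning unassigned or not relocating). It mirrors the three cases in the loop: unassigned copies are skipped, copies on other nodes receive the request, and a relocation target on another node receives it as well.

```java
import java.util.ArrayList;
import java.util.List;

class ReplicaTargetsSketch {
    // Hypothetical, minimal view of one shard copy's routing entry.
    static final class Copy {
        final String currentNodeId;    // null when the copy is unassigned
        final String relocatingNodeId; // null when the copy is not relocating

        Copy(String currentNodeId, String relocatingNodeId) {
            this.currentNodeId = currentNodeId;
            this.relocatingNodeId = relocatingNodeId;
        }
    }

    // Returns the node ids that should receive the replica request, in iteration order.
    static List<String> targetNodes(List<Copy> copies, String localNodeId) {
        List<String> targets = new ArrayList<>();
        for (Copy copy : copies) {
            if (copy.currentNodeId == null) {
                continue;                              // unassigned: nothing to send
            }
            if (!copy.currentNodeId.equals(localNodeId)) {
                targets.add(copy.currentNodeId);       // a copy on another node
            }
            if (copy.relocatingNodeId != null && !copy.relocatingNodeId.equals(localNodeId)) {
                targets.add(copy.relocatingNodeId);    // its relocation target, if any
            }
        }
        return targets;
    }
}
```

For a primary on the local node n1, a replica on n2 that is relocating to n3, and one unassigned replica, the targets are n2 and n3.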
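The performOnReplica method forwards the request to the target node. If an exception occurs, such as the remote node going down or the shard write failing, the primary treats the replica shard as failed and unavailable, and reports it to the master so that the replica is removed. The one case handled differently in the code is a shard-not-available exception, which is ignored without failing the shard. The code is: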
    private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest) {
        totalShards.incrementAndGet();
        pendingActions.incrementAndGet();
        replicasProxy.performOn(shard, replicaRequest, new ActionListener<TransportResponse.Empty>() {
            @Override
            public void onResponse(TransportResponse.Empty empty) {
                successfulShards.incrementAndGet();
                decPendingAndFinishIfNeeded();
            }

            @Override
            public void onFailure(Exception replicaException) {
                if (TransportActions.isShardNotAvailableException(replicaException)) {
                    // the shard is not available: ignore the failure
                    decPendingAndFinishIfNeeded();
                } else {
                    RestStatus restStatus = ExceptionsHelper.status(replicaException);
                    shardReplicaFailures.add(new ReplicationResponse.ShardInfo.Failure(
                        shard.shardId(), shard.currentNodeId(), replicaException, restStatus, false));
                    // report the failed replica to the master
                    // (message is defined in code omitted from this excerpt)
                    replicasProxy.failShard(shard, message, replicaException,
                        ReplicationOperation.this::decPendingAndFinishIfNeeded,
                        ReplicationOperation.this::onPrimaryDemoted,
                        throwable -> decPendingAndFinishIfNeeded()
                    );
                }
            }
        });
    }
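The replica write logic is similar to the primary's and is not covered in detail here. To avoid losing data if the primary fails, ES waits until all replicas have completed the write before returning the result to the client, so the write latency is determined by the slowest replica. This completes the walkthrough of the ES write path.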
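The "wait for every copy" behaviour rests on a pending counter, which the excerpts above touch through pendingActions and decPendingAndFinishIfNeeded. The following is a minimal sketch of that idea only: PendingCounterSketch and its method names other than decPendingAndFinishIfNeeded are hypothetical, the counter is assumed to start at 1 to account for the primary operation, and failures are left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

class PendingCounterSketch {
    // Assumption: starts at 1 so the primary's own operation counts as pending.
    private final AtomicInteger pending = new AtomicInteger(1);
    private final AtomicInteger successful = new AtomicInteger();
    private volatile boolean finished = false;

    // Called once per replica request that is sent out.
    void replicaStarted() {
        pending.incrementAndGet();
    }

    void primarySucceeded() {
        successful.incrementAndGet();
        decPendingAndFinishIfNeeded();
    }

    void replicaSucceeded() {
        successful.incrementAndGet();
        decPendingAndFinishIfNeeded();
    }

    // The response is released only when the last outstanding operation completes.
    private void decPendingAndFinishIfNeeded() {
        if (pending.decrementAndGet() == 0) {
            finished = true;
        }
    }

    boolean isFinished() {
        return finished;
    }

    int successfulShards() {
        return successful.get();
    }
}
```

With two replicas in flight, the operation is still unfinished after the primary and the first replica complete, and finishes only when the second replica responds. That is why the slowest copy sets the overall write latency.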
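4. Summary
This article analyzed the main flow of the ES distributed write path and left many of its details unexplored. Later articles will examine those details. Discussion is welcome.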