Synchronizing Data Across Multiple ES Clusters

1. Introduction

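I searched Google and put together a rough summary of how a node in one cluster can access data held by nodes in a remote cluster while keeping that data consistent and stable. For example, suppose there are three clusters, A, B, and C, each with three nodes, nine nodes in total. The business data needed by node1 in cluster A has to be fetched from an index on node1 in cluster C (that is, part of the data cluster A needs is split across the other two clusters). At that point you have to think about synchronizing data from the remote clusters. I am using ES 5.6 here.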

2. Main topic

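There are two options here: the tribe node and cross-cluster search (stability still to be tested). The tribe node has two drawbacks: with many clusters and nodes it becomes a bottleneck, and when several clusters contain an index with the same name it simply picks one of them. Cross-cluster search resolves both problems. Both options are set up mainly through configuration files.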

3. Discovery mechanism

Discovery

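This module is mainly responsible for automatic discovery of the nodes in a cluster and for electing the master node. Nodes communicate directly with each other peer-to-peer, so there is no single point of failure. In Elasticsearch, the master node maintains the global cluster state, for example reallocating shards when nodes join or leave.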

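1. Azure discovery: provided as a plugin; finds nodes through the Azure API rather than multicast

2. EC2 discovery: provided as a plugin; finds nodes through the EC2 API rather than multicast

3. Google Compute Engine (GCE) discovery: provided as a plugin; finds nodes through the GCE API rather than multicast

4. zen discovery: the default implementation; multicast/unicast in old versions, unicast only from ES 5.0 on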

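With multicast, a node sends a multicast request to the cluster and the other nodes respond when they receive it. Multicast became a plugin in ES 2.0 and was removed in 5.0, so the parameters below only apply to older versions:

    discovery.zen.ping.multicast.group: 224.2.2.4   # group address
    discovery.zen.ping.multicast.port: 54328        # port
    discovery.zen.ping.multicast.ttl: 3             # TTL of the multicast message
    discovery.zen.ping.multicast.address: null      # bind address; null binds to all available network interfaces
    discovery.zen.ping.multicast.enabled: true      # switch that enables or disables multicast discovery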

With unicast, a node sends unicast requests to the specified hosts. The configuration is:

discovery.zen.ping.unicast.hosts: ["host1:port1", "host2:port2"]

3. The tribe approach

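When the data nodes are under heavy write pressure, heartbeat communication between nodes may exceed the configured timeout, which can trigger a master re-election. This time, three new instances are added across the three servers to act as dedicated master nodes. The main configuration of a master node is: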

cluster.name: eagleye_es
node.name: "eagleye_es_xx_master"
node.master: true
node.data: false
# timeout when pinging other nodes
discovery.zen.ping_timeout: 30s
# heartbeat timeout is 2 minutes; a node that misses 6 heartbeats is considered detached from the master; a heartbeat is sent every 20s
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 20s
# minimum number of master-eligible nodes required to elect a master
discovery.zen.minimum_master_nodes: 2
path.logs: /var/log/es_master
# do not use swap (this setting is named bootstrap.memory_lock from ES 5.0 on)
bootstrap.mlockall: true
transport.tcp.port: 8309
transport.tcp.compress: true
http.port: 8209
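The value `discovery.zen.minimum_master_nodes: 2` follows the usual majority rule for master-eligible nodes, floor(N / 2) + 1, which gives 2 for the three dedicated masters here. A minimal sketch of that arithmetic, where the function name is my own and not part of any ES API:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum for N master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Three dedicated master nodes, as in the configuration above.
print(minimum_master_nodes(3))
```

If master-eligible nodes are added or removed later, the setting should be recomputed with the new count.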

tribe:
  t1:
    cluster.name:   cluster_one
  t2:
    cluster.name:   cluster_two
    network.host:   10.1.2.3

tribe:
  hot:
    cluster.name: eagleye_es
  blocks:
    write: true
    metadata: true
  on_conflict: prefer_hot

threadpool:
  search:
    type: fixed
    size: 24
    # queue that holds pending requests
    queue_size: 100

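About the on_conflict setting here: when index names conflict across clusters, the tribe node by default picks one of the clusters arbitrarily, which is clearly unacceptable. So a preference is set: when index names conflict, requests are forwarded to the preferred cluster.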
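To make the on_conflict behaviours concrete, here is a toy Python model of how a conflicting index name could be resolved. It is only an illustration under my own assumptions (the function, the dict of clusters, and taking the first owner for `any` are all hypothetical), not how the tribe node is implemented. The `any`, `drop`, and `prefer_<tribeName>` values are the ones the tribe documentation lists.

```python
from typing import Dict, List, Optional


def resolve_index(index: str, clusters: Dict[str, List[str]],
                  on_conflict: str = "any") -> Optional[str]:
    """Return the tribe name that serves `index`, or None if absent or dropped."""
    owners = [name for name, indices in clusters.items() if index in indices]
    if not owners:
        return None                      # no cluster has this index
    if len(owners) == 1:
        return owners[0]                 # no conflict at all
    if on_conflict == "drop":
        return None                      # conflicting indices are dropped
    if on_conflict.startswith("prefer_"):
        preferred = on_conflict[len("prefer_"):]
        if preferred in owners:
            return preferred             # the preferred tribe wins
    return owners[0]                     # "any": pick one (here: the first)


# Hypothetical layout: both clusters hold an index called "logs".
clusters = {"hot": ["logs", "orders"], "cold": ["logs"]}
print(resolve_index("logs", clusters, "prefer_hot"))
```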

Finally, the query program can no longer specify a cluster name; it searches directly through the tribe node.

4. The cross-cluster search approach

The elasticsearch.yml configuration file of a cross-cluster search node only needs to list the remote clusters it should connect to, for example:

search:
    remote:
        cluster_one: 
            seeds: 127.0.0.1:9300
        cluster_two: 
            seeds: 127.0.0.1:9301

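cluster_one and cluster_two are arbitrary cluster aliases representing the connection to each cluster. These names are subsequently used to distinguish between local and remote indices.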

The equivalent example using the cluster settings API to add the remote clusters to all nodes in the cluster is:

PUT _cluster/settings
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_one": {
          "seeds": [
            "127.0.0.1:9300"
          ]
        },
        "cluster_two": {
          "seeds": [
            "127.0.0.1:9301"
          ]
        }
      }
    }
  }
}

A remote cluster can be removed from the cluster settings by setting its seeds to null:

PUT _cluster/settings
{
  "persistent": {
    "search": {
      "remote": {
        "cluster_one": {
          "seeds": null 
        }
      }
    }
  }
}
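The add and remove requests share one shape, so both bodies can be generated from a small helper. The sketch below only builds the JSON; the helper name is hypothetical, and sending the request is left to whatever HTTP client you use.

```python
import json
from typing import List, Optional


def remote_cluster_settings(alias: str, seeds: Optional[List[str]]) -> dict:
    """Body for PUT _cluster/settings; seeds=None removes the remote cluster."""
    return {"persistent": {"search": {"remote": {alias: {"seeds": seeds}}}}}


add_body = remote_cluster_settings("cluster_one", ["127.0.0.1:9300"])
remove_body = remote_cluster_settings("cluster_one", None)

# json.dumps turns Python's None into JSON null, as the removal request needs.
print(json.dumps(remove_body))
```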

To search the twitter index on the remote cluster cluster_one, the index name must be prefixed with the cluster alias, separated by a : character:

POST /cluster_one:twitter,twitter/tweet/_search
{
  "query": {
    "match_all": {}
  }
}
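The request above targets two indices at once: twitter on cluster_one and the local twitter index. A minimal sketch of how such an index expression splits into (alias, index) pairs, assuming a hypothetical helper and ignoring wildcards and other index-expression syntax:

```python
from typing import List, Optional, Tuple


def split_index_expression(expression: str) -> List[Tuple[Optional[str], str]]:
    """Split 'alias:index,index' into (alias, index); alias is None for local."""
    targets = []
    for part in expression.split(","):
        part = part.strip()
        if not part:
            continue
        alias, sep, index = part.partition(":")
        if sep:
            targets.append((alias, index))   # remote index on cluster `alias`
        else:
            targets.append((None, part))     # local index
    return targets


print(split_index_expression("cluster_one:twitter,twitter"))
```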

II. The approach finally adopted: Reindex

Re-ingesting data with Reindex

POST _reindex
{
  "conflicts": "proceed",          // continue on version conflicts; the default is to abort
  "size": 1000,                    // maximum number of documents to process
  "source": {
    "index": "twitter",            // may also be ["twitter", "blog"]
    "type": "tweet",               // or ["type1", "type2"]; optional, restricts the documents
    "query": { "term": { "user": "kimchy" } },   // optional query that restricts the documents
    "sort": { "date": "desc" },    // sort order
    "_source": ["user", "tweet"],  // fields to copy
    "size": 100                    // scroll batch size; the default is 1000
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create",           // only create documents that are missing from the target index
    "version_type": "external",    // unset or internal: documents with the same ID are overwritten; external: a document with the same ID is updated only when the source version is newer
    "routing": "=cat",             // set the routing to cat
    "pipeline": "some_ingest_pipeline"   // use an ingest pipeline
  },
  "script": {                      // run a script on each document
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}
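The `//` comments in the request above are annotations only. JSON itself does not allow comments, so they have to be removed before the request is sent. One way to avoid hand-editing is to build the body in code. A minimal sketch, assuming a hypothetical helper that covers only a few of the options:

```python
import json
from typing import Optional


def build_reindex_body(source_index: str, dest_index: str,
                       query: Optional[dict] = None, batch_size: int = 1000,
                       proceed_on_conflict: bool = False,
                       create_only: bool = False) -> dict:
    """Build a _reindex request body as a plain dict (valid JSON once dumped)."""
    source = {"index": source_index, "size": batch_size}  # size = scroll batch size
    if query is not None:
        source["query"] = query
    dest = {"index": dest_index}
    if create_only:
        dest["op_type"] = "create"        # only create missing documents
    body = {"source": source, "dest": dest}
    if proceed_on_conflict:
        body["conflicts"] = "proceed"     # do not abort on version conflicts
    return body


body = build_reindex_body("twitter", "new_twitter",
                          query={"term": {"user": "kimchy"}},
                          batch_size=100, proceed_on_conflict=True,
                          create_only=True)
print(json.dumps(body, indent=2))
```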

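Reindex supports reindexing from a remote Elasticsearch cluster:

Remote copy:

The remote host must be whitelisted in the destination cluster's configuration: reindex.remote.whitelist: ["10.10.10.130:8400"]

Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb. If the remote index contains very large documents, you need to use a smaller batch size. The example below sets a very small batch size of 10.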

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass",
      "socket_timeout": "1m",     // socket read timeout; the default is 30 seconds
      "connect_timeout": "10s"    // connection timeout set to 10 seconds
    },
    "index": "source",
    "size": 10,
    "query": { "match": { "test": "data" } }
  },
  "dest": { "index": "dest" }
}

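As a rough way to pick a batch size for remote reindex, the sketch below divides the 100mb buffer by the size of the largest document you expect. This is a back-of-the-envelope assumption of mine rather than an ES formula, and it ignores response overhead, so treat the result as an upper bound. The function name is hypothetical, and the cap of 1000 is the default batch size.

```python
DEFAULT_BUFFER_BYTES = 100 * 1024 * 1024   # 100mb on-heap buffer (default maximum)
DEFAULT_BATCH_SIZE = 1000                  # default scroll batch size


def safe_batch_size(largest_doc_bytes: int,
                    buffer_bytes: int = DEFAULT_BUFFER_BYTES) -> int:
    """Largest batch whose documents fit in the buffer, between 1 and the default."""
    if largest_doc_bytes <= 0:
        raise ValueError("document size must be positive")
    fits = buffer_bytes // largest_doc_bytes
    return max(1, min(DEFAULT_BATCH_SIZE, fits))


# Documents of up to 10 MiB each: at most 10 fit into the buffer.
print(safe_batch_size(10 * 1024 * 1024))
```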
URL parameters

The Reindex API also supports refresh, wait_for_completion, wait_for_active_shards, timeout, and requests_per_second.

Response body

The JSON response looks like this:

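{
  "took": 639,              // milliseconds from start to end of the whole operation
  "updated": 0,             // number of documents successfully updated
  "created": 123,           // number of documents successfully created
  "batches": 1,             // number of scroll responses pulled back by the reindex
  "version_conflicts": 2,   // number of version conflicts the reindex hit
  "retries": {
    "bulk": 0,              // number of bulk actions retried
    "search": 0             // number of search actions retried
  },
  "throttled_millis": 0,    // milliseconds the request slept to conform to requests_per_second
  "failures": []            // array of all indexing failures; if it is not empty, the request aborted because of them
}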
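A scheduled sync usually only needs a few of these fields. A minimal sketch of reducing the response to a summary, assuming a hypothetical helper and a sample response without the annotations:

```python
import json


def summarize_reindex(response_text: str) -> dict:
    """Reduce a _reindex response to the fields a sync job usually checks."""
    resp = json.loads(response_text)
    failures = resp.get("failures", [])
    return {
        "written": resp.get("created", 0) + resp.get("updated", 0),
        "version_conflicts": resp.get("version_conflicts", 0),
        "failed": len(failures),
        "ok": not failures,          # an empty failures array means no aborts
    }


sample = ('{"took": 639, "updated": 0, "created": 123, "batches": 1, '
          '"version_conflicts": 2, "retries": {"bulk": 0, "search": 0}, '
          '"throttled_millis": 0, "failures": []}')
summary = summarize_reindex(sample)
print(summary)
```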

Use the Task API to get the status of all running reindex requests:

GET _tasks?detailed=true&actions=*reindex

Look up a task directly by its ID:

GET /_tasks/taskId:1

Cancel a task:

POST _tasks/task_id:1/_cancel

Change the value of the requests_per_second parameter:

POST _reindex/task_id:1/_rethrottle?requests_per_second=-1
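As I understand the reindex documentation, throttling works by waiting between batches: the target time for a batch is the batch size divided by requests_per_second, the wait is that target minus the time the write took, and -1 disables throttling. A sketch of that arithmetic (the function name is hypothetical):

```python
def throttle_wait_seconds(batch_size: int, requests_per_second: float,
                          write_seconds: float) -> float:
    """Seconds to sleep after a batch so the average rate matches the limit."""
    if requests_per_second == -1:
        return 0.0                               # throttling disabled
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive or -1")
    target_seconds = batch_size / requests_per_second
    return max(0.0, target_seconds - write_seconds)


# A batch of 1000 documents at 500 requests/s whose write took 0.5 s.
print(throttle_wait_seconds(1000, 500, 0.5))
```

Rethrottling to -1, as in the request above, therefore removes the wait entirely.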

Example:

Set reindex.remote.whitelist in the ES configuration file of the cluster (the port is the HTTP port), then send the request:

curl 'http://10.10.10.143:8400/_reindex' -H 'Content-Type:application/json' -d '
{
  "conflicts": "proceed",
  "size": 1000,
  "source": {
    "remote": {
      "host": "http://10.10.10.139:8200",
      "username": "andong",
      "password": "andong",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "~iholstein_monit_log_2018-05-17",
    "size": 10,
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "dest11112",
    "op_type": "create",
    "version_type": "external"
  }
}'

Running reindex in parallel

1. Manual slicing

Below is a reindex manually parallelized into two slices:

POST _reindex
{
  "source": {
    "index": "my_index_name",
    "slice": {   // the first slice
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "my_index_name_new"
  }
}

POST _reindex
{
  "source": {
    "index": "my_index_name",
    "slice": {   // the second slice
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "my_index_name_new"
  }
}
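The per-slice bodies differ only in `slice.id`, so for more than two slices they are easier to generate than to write by hand. A minimal sketch, assuming a hypothetical helper; each returned body would be sent as its own `POST _reindex` request:

```python
from typing import List


def sliced_reindex_bodies(source_index: str, dest_index: str,
                          slices: int) -> List[dict]:
    """One _reindex body per slice, with ids 0..slices-1 and max=slices."""
    if slices < 2:
        raise ValueError("slicing needs at least two slices")
    return [
        {
            "source": {"index": source_index,
                       "slice": {"id": slice_id, "max": slices}},
            "dest": {"index": dest_index},
        }
        for slice_id in range(slices)
    ]


bodies = sliced_reindex_bodies("my_index_name", "my_index_name_new", 2)
print(len(bodies))
```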

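The result can be checked with the following commands, which refresh the indices and count the documents in the destination index:

GET _refresh

POST my_index_name_new/_search?size=0&filter_path=hits.total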

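2. Automatic slicing

Below, the request is split into n slices automatically. This merely automates the manual splitting: one operation is broken into several sub-operations that run in parallel, and everything else, such as how you query, stays the same.

POST _reindex?slices=5&refresh
{
  "source": {
    "index": "my_index_name"
  },
  "dest": {
    "index": "my_index_name_new"
  }
}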

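3. Guidelines for choosing the number of slices

Do not use a large number; 500, for example, can cause CPU problems.

From a query performance standpoint, a multiple of the number of shards in the source index works well, and using exactly as many slices as shards is the most efficient.

From an indexing performance standpoint, throughput should scale linearly with the available resources as slices are added.

Whether indexing or query performance dominates the process depends on many factors, such as the documents being reindexed and the cluster doing the reindexing.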

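Note: remote reindex cannot be parallelized, that is, the slices parameter cannot be used. The official documentation does not state this explicitly, but using it produces an error; removing the parameter fixes it.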

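4. Large data volume, no deletes, with an update-time field

When the data volume is large and there are no deletes, a rolling migration reduces the time that writes have to be stopped. A rolling migration needs a field such as an update time that reflects the order in which new data was written. Once the bulk of the data has been migrated, stop the write service and run one more quick update. You can then switch to the new cluster and resume reads and writes.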

{
  "source": {
    "remote": {
      "host": "'${oldClusterHost}'",
      "username": "'${oldClusterUser}'",
      "password": "'${oldClusterPass}'"
    },
    "index": "'${indexName}'",
    "query": {
      "bool": {
        "must_not": {
          "exists": {
            "field": "'${timeField}'"
          }
        }
      }
    }
  },
  "dest": {
    "index": "'${indexName}'"
  }
}
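For the incremental passes of a rolling migration, a query bounded by the time field selects only what changed since the previous pass. The sketch below builds such a body. The `range` query with `gte`/`lt`, the helper name, and its parameters are my assumptions for illustration, and the sample host, index, and field are made up.

```python
from typing import Optional


def incremental_reindex_body(host: str, index: str, time_field: str,
                             last_sync: str, now: str,
                             username: Optional[str] = None,
                             password: Optional[str] = None) -> dict:
    """Remote _reindex body that copies documents with last_sync <= time < now."""
    remote = {"host": host}
    if username is not None:
        remote["username"] = username
        remote["password"] = password
    return {
        "source": {
            "remote": remote,
            "index": index,
            # Assumed increment filter: half-open window on the time field.
            "query": {"range": {time_field: {"gte": last_sync, "lt": now}}},
        },
        "dest": {"index": index},
    }


body = incremental_reindex_body(
    "http://old-cluster:9200", "orders", "update_time",
    "2018-05-16T00:00:00.000Z", "2018-05-17T00:00:00.000Z")
print(body["source"]["query"])
```

Using a half-open window means consecutive passes neither overlap nor leave gaps, as long as each pass starts where the previous one ended.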

Using the Java reindex API to sync incremental data to the dev and other cluster environments every day

<dependency>
    <groupId>org.elasticsearch.module</groupId>
    <artifactId>reindex</artifactId>
    <version>2.4.6</version>
</dependency>

Because the project uses an ES date field, all that is needed is to fetch each day's data by its start and end time and run a remote reindex:

import java.util.Collections;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.bytes.BytesArray;
// BulkByScrollResponse, ReindexAction, ReindexRequestBuilder and RemoteInfo come from the
// reindex module; their package depends on the ES version in use.
// ESClient, ESConfig and logger are helpers from the author's project.

/**
 * Sync incremental data for one day, bounded by its start and end time.
 * Time format: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
 * @param start start time
 * @param end end time
 */
private void reindexGeleevrFromProToDevByDay(String start, String end) {
    // Range query on orderTime covering [start, end], both bounds inclusive.
    StringBuilder queryString = new StringBuilder(512);
    queryString.append("{")
            .append("\"range\" : {")
            .append("\"orderTime\" : {")
            .append("\"from\" : \"").append(start).append("\",")
            .append("\"to\" : \"").append(end).append("\",")
            .append("\"include_lower\" : true,")
            .append("\"include_upper\" : true,")
            .append("\"boost\" : 1.0")
            .append("}")
            .append("}")
            .append("}");
    RemoteInfo remoteInfo = getRemoteInfo(queryString.toString());
    TransportClient client = ESClient.me();
    ReindexRequestBuilder builder = ReindexAction.INSTANCE.newRequestBuilder(client);
    BulkByScrollResponse response = builder.source(ESConfig.COMPANY)
            .setRemoteInfo(remoteInfo)
            .destination(ESConfig.COMPANY)
            .abortOnVersionConflict(true)
            .get(/*TimeValue.timeValueHours(1)*/);
    // builder.source().setScroll("20m").setRouting("candycane");
    long updated = response.getUpdated();
    int failed = response.getBulkFailures().size();
    logger.info("reindex geleevr from {} to {} updated = {} failed = {}", start, end, updated, failed);
}

// Remote info that matches every document on the remote cluster.
private static RemoteInfo getRemoteInfo() {
    return new RemoteInfo("http", "192.168.10.20", 9200, new BytesArray("{\"match_all\":{}}"), null, null,
            Collections.emptyMap(), RemoteInfo.DEFAULT_SOCKET_TIMEOUT, RemoteInfo.DEFAULT_CONNECT_TIMEOUT);
}

// Remote info restricted by the given query; headers must be a non-null map, so pass an empty one.
private static RemoteInfo getRemoteInfo(String query) {
    return new RemoteInfo("http", "192.168.10.20", 9200, new BytesArray(query), null, null,
            Collections.emptyMap(), RemoteInfo.DEFAULT_SOCKET_TIMEOUT, RemoteInfo.DEFAULT_CONNECT_TIMEOUT);
}
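The method takes its start and end as strings in the "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" format, so the daily job has to produce those two values. A minimal sketch of computing the window for one day, assuming the timestamps are stored in UTC and that the last millisecond of the day is the right inclusive upper bound (the function name is hypothetical):

```python
from datetime import date, datetime, time
from typing import Tuple


def day_window(day: date) -> Tuple[str, str]:
    """Inclusive start and end of `day`, formatted with millisecond precision."""
    def fmt(dt: datetime) -> str:
        millis = dt.microsecond // 1000          # truncate to milliseconds
        return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"

    start = datetime.combine(day, time.min)      # 00:00:00.000000
    end = datetime.combine(day, time.max)        # 23:59:59.999999
    return fmt(start), fmt(end)


start, end = day_window(date(2018, 5, 17))
print(start, end)
```

The two strings can be passed as the `start` and `end` arguments of the Java method above.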

References:

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-cross-cluster-search.html#_using_cross_cluster_search

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-tribe.html

qq_21873747
