Fixing incorrect sharding-item display in the Saturn console after failover shards are executed immediately

Background:

With the official Saturn component, failover shards are not executed immediately: they have to wait until the currently running normal shards finish, and they then run one by one on a single executor. Our requirement was the opposite: failover shards should run immediately and in parallel. How we achieved that is described in the companion write-up 解決saturn executor失敗分片轉移立即執行之源碼分析 (source-code analysis of making Saturn executor failover shards execute immediately).

Problem:

After making that change, we found that once the failover shards finish executing, the Saturn console shows incorrect sharding items for the job (code analysis confirms there is simply no logic handling this case): it still shows the original shard assignment. So how do we update the actual node assignment according to the failover shards, so that the console displays it correctly?
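
For context, the console's per-job sharding view is rendered from the item string stored under each executor's sharding node in ZooKeeper, which is exactly the data the patch below rewrites. Here is a minimal sketch of peeking at that data with plain Curator calls; the connection string, namespace, job and executor names, and the concrete path layout are all made-up examples for illustration, not taken from this article (the real path is built by the JobNodePath/ShardingNode helpers used in the patch):

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ShardingNodePeek {
   public static void main(String[] args) throws Exception {
      // Hypothetical namespace, job and executor names; path layout is illustrative only
      CuratorFramework zk = CuratorFrameworkFactory.newClient("localhost:2181",
            new ExponentialBackoffRetry(1000, 3));
      zk.start();
      byte[] data = zk.getData().forPath("/saturn/demoNamespace/demoJob/servers/executor-1/sharding");
      // The console renders whatever item string is stored under the sharding node, e.g. "0,1,2"
      System.out.println(new String(data, StandardCharsets.UTF_8));
      zk.close();
   }
}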

Solution:

Manually rewrite the node information according to the failover shards. The code is as follows.

Add the following method to com/vip/saturn/job/internal/sharding/ShardingService.java:

/**
 * Remove the failover items from every other executor's sharding node and append them to the
 * local executor's node, so that the console shows the shard assignment actually being executed.
 *
 * @param getLocalHostFailoverItems the failover items taken over by the local executor
 * @throws Exception if the ZooKeeper update fails
 */
public synchronized void removeAndCreateShardingInfo(List<Integer> getLocalHostFailoverItems) throws Exception {
   LogUtils.info(log, jobName, "removeAndCreateShardingInfo start.");
   // Take a distributed lock so that multiple executors cannot update the same nodes concurrently
   CuratorFramework client = getJobNodeStorage().getClient();
   InterProcessMutex mutex = new InterProcessMutex(client, "/saturnSharding/lock");
   try {
      mutex.acquire();
      // All of the checks and node creations below are committed in a single transaction
      CuratorTransactionFinal curatorTransactionFinal = getJobNodeStorage().getClient().inTransaction()
            .check().forPath("/").and();
      // Walk the sharding items under every other servers/<executorName> node;
      // if a failover item is listed there, remove it from that node
      Set<Integer> getLocalHostFailoverItemSets = new HashSet<>(getLocalHostFailoverItems);
      for (String each : serverService.getAllServers()) {
         if (StringUtils.equals(each, executorName)) {
            continue;
         }
         String value = getJobNodeStorage().getJobNodeDataDirectly(ShardingNode.getShardingNode(each));
         if (StringUtils.isEmpty(value)) {
            continue;
         }
         List<Integer> getShardingItemsByexecutorName = ItemUtils.toItemList(value);
         Set<Integer> getShardingItemSetsByexecutorName = new HashSet<>(getShardingItemsByexecutorName);
         getShardingItemSetsByexecutorName.removeAll(getLocalHostFailoverItemSets);
         getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.getShardingNode(each));
         curatorTransactionFinal.create().forPath(
               JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(each)),
               ItemUtils.toItemsString(new ArrayList<>(getShardingItemSetsByexecutorName)).getBytes(StandardCharsets.UTF_8)).and();
      }
      LogUtils.info(log, jobName, "removeAndCreateShardingInfo delete LocalHostFailoverItems.");
      // Append the failover items to the local executor's sharding node
      String getLocalHostItems = getJobNodeStorage().getJobNodeDataDirectly(ShardingNode.getShardingNode(executorName));
      List<Integer> getLocalHostItemsList;
      if (StringUtils.isEmpty(getLocalHostItems)) {
         getLocalHostItemsList = getLocalHostFailoverItems;
      } else {
         getLocalHostItemsList = ItemUtils.toItemList(getLocalHostItems);
         getLocalHostItemsList.addAll(getLocalHostFailoverItems);
      }
      getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.getShardingNode(executorName));
      curatorTransactionFinal.create().forPath(
            JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(executorName)),
            ItemUtils.toItemsString(getLocalHostItemsList).getBytes(StandardCharsets.UTF_8)).and();
      curatorTransactionFinal.commit();
      LogUtils.info(log, jobName, "removeAndCreateShardingInfo append LocalHostFailoverItems.");
   } catch (Exception e) {
      // Concurrent sharding tasks can make the computed result stale (for example a server node
      // was already deleted), which makes the commit fail. This usually does not affect the final
      // assignment because a resharding event will still be handled, so log at warn level only
      // to avoid unnecessary alerts.
      LogUtils.warn(log, jobName, "Commit shards failed", e);
   } finally {
      // Release only if the lock was actually acquired, otherwise release() itself would throw
      if (mutex.isAcquiredInThisProcess()) {
         mutex.release();
      }
   }
}
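
Stripped of the ZooKeeper plumbing, the method is just set arithmetic over the item lists: remove the failover items from every other executor's list and append them to the local executor's list. Below is a minimal in-memory sketch of that recomputation, using a hypothetical redistribute helper and plain collections (no ZooKeeper), which also doubles as a quick sanity check of the expected result:

import java.util.*;

public class FailoverRedistributionSketch {

   // Hypothetical helper mirroring the logic of removeAndCreateShardingInfo,
   // but operating on an in-memory map of executorName -> sharding items.
   static Map<String, List<Integer>> redistribute(Map<String, List<Integer>> assignment,
                                                  String localExecutor,
                                                  List<Integer> failoverItems) {
      Set<Integer> failoverSet = new HashSet<>(failoverItems);
      Map<String, List<Integer>> result = new LinkedHashMap<>();
      for (Map.Entry<String, List<Integer>> entry : assignment.entrySet()) {
         List<Integer> items = new ArrayList<>(entry.getValue());
         if (entry.getKey().equals(localExecutor)) {
            items.addAll(failoverItems);      // append the taken-over items to the local executor
         } else {
            items.removeAll(failoverSet);     // strip them from every other executor
         }
         result.put(entry.getKey(), items);
      }
      return result;
   }

   public static void main(String[] args) {
      Map<String, List<Integer>> before = new LinkedHashMap<>();
      before.put("executor-1", Arrays.asList(0, 1, 2));
      before.put("executor-2", Arrays.asList(3, 4, 5));

      // executor-1 is actually executing items 3 and 4 as failover shards,
      // so the display should move them from executor-2 to executor-1
      Map<String, List<Integer>> after = redistribute(before, "executor-1", Arrays.asList(3, 4));

      // Prints {executor-1=[0, 1, 2, 3, 4], executor-2=[5]}
      System.out.println(after);
   }
}

In the real method, each per-executor list is read from and written back to its sharding node as a string via ItemUtils, and the whole rewrite is guarded by the InterProcessMutex so only one executor rewrites the nodes at a time.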

Then, in the execute method of com/vip/saturn/job/basic/AbstractElasticJob.java, add the following code immediately before the call to executeJobInternal(shardingContext):

// Handle the sharding items of the failover shards taken over by this executor
if (!failoverService.getLocalHostFailoverItems().isEmpty()) {
   shardingService.removeAndCreateShardingInfo(failoverService.getLocalHostFailoverItems());
}
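
One design note: removeAndCreateShardingInfo declares throws Exception, and correcting the display is best-effort, so it may be worth making sure a ZooKeeper hiccup there can never prevent executeJobInternal from running. A hedged sketch of such a guard, written as a small private helper inside AbstractElasticJob (the helper name and the try/catch are assumptions for illustration, not part of the original patch):

// Hypothetical helper; call it right before executeJobInternal(shardingContext).
private void updateShardingInfoForFailoverItemsQuietly() {
   List<Integer> failoverItems = failoverService.getLocalHostFailoverItems();
   if (failoverItems.isEmpty()) {
      return; // nothing was taken over via failover, leave the sharding nodes untouched
   }
   try {
      shardingService.removeAndCreateShardingInfo(failoverItems);
   } catch (Exception e) {
      // Display correction is best-effort: never let it block the actual job execution
      LogUtils.warn(log, jobName, "Failed to update sharding info for failover items", e);
   }
}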

That's it. Tested, and it works as expected.
