Background:
With the official Saturn component, after a shard fails over, its items cannot execute immediately: they have to wait until the currently healthy shards finish, and the failed-over items then run one at a time on a single executor. Our requirement was to execute them immediately, and to run the failed-over shards in parallel. The solution is described in: 解決saturn executor失敗分片轉移立即執行之源碼分析 (a source-code analysis of making Saturn executor failover shards execute immediately).
Problem:
After solving the above, we found that once the failed-over shards finish executing, the Saturn console displays the job's sharding items incorrectly (code analysis confirms there is simply no logic handling this case): it still shows the original item assignment. So how do we update the actual per-node assignment according to the failed-over shards, so that the console displays it correctly?
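To make the symptom concrete, here is a hypothetical illustration of the per-executor sharding data in ZooKeeper (the exact node paths and executor names are assumptions for illustration; they depend on the namespace and Saturn version):

```text
# Assumed layout, for illustration only
/$NAMESPACE/$JOB_NAME/servers/executor-1/sharding  ->  "0,1"
/$NAMESPACE/$JOB_NAME/servers/executor-2/sharding  ->  "2,3"   # executor-2 has crashed
# After failover, executor-1 actually runs items 0,1,2,3,
# but the console still renders the stale "0,1" / "2,3" assignment
# because nothing rewrites these nodes.
```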
Solution:
Manually rewrite the node information using the failed-over shard items. Add the following method to com/vip/saturn/job/internal/sharding/ShardingService.java:
/**
 * Rewrite the per-executor sharding nodes so that the failed-over items are
 * removed from their original executors and appended to the local executor.
 *
 * @param getLocalHostFailoverItems the failover items taken over by the local executor
 * @throws Exception if acquiring the lock or committing the ZooKeeper transaction fails
 */
public synchronized void removeAndCreateShardingInfo(List<Integer> getLocalHostFailoverItems) throws Exception {
    LogUtils.info(log, jobName, "removeAndCreateShardingInfo start.");
    // Take a distributed lock to prevent multiple executors from updating the same nodes concurrently
    CuratorFramework client = getJobNodeStorage().getClient();
    InterProcessMutex mutex = new InterProcessMutex(client, "/saturnSharding/lock");
    try {
        mutex.acquire();
        // The check + create operations for all job servers are committed as one transaction
        CuratorTransactionFinal curatorTransactionFinal = getJobNodeStorage().getClient().inTransaction()
                .check().forPath("/").and();
        // Walk the sharding items under every servers/<executorName>; if a failed item
        // is assigned to that executor, remove it there
        Set<Integer> getLocalHostFailoverItemSets = new HashSet<>(getLocalHostFailoverItems);
        for (String each : serverService.getAllServers()) {
            if (StringUtils.equals(each, executorName)) {
                continue;
            }
            String value = getJobNodeStorage().getJobNodeDataDirectly(ShardingNode.getShardingNode(each));
            if (StringUtils.isEmpty(value)) {
                continue;
            }
            List<Integer> getShardingItemsByexecutorName = ItemUtils.toItemList(value);
            Set<Integer> getShardingItemSetsByexecutorName = new HashSet<>(getShardingItemsByexecutorName);
            getShardingItemSetsByexecutorName.removeAll(getLocalHostFailoverItemSets);
            getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.getShardingNode(each));
            curatorTransactionFinal.create().forPath(
                    JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(each)),
                    ItemUtils.toItemsString(new ArrayList<>(getShardingItemSetsByexecutorName))
                            .getBytes(StandardCharsets.UTF_8)).and();
        }
        LogUtils.info(log, jobName, "removeAndCreateShardingInfo delete LocalHostFailoverItems.");
        // Append the failed items under the local executorName
        String getLocalHostItems = getJobNodeStorage().getJobNodeDataDirectly(ShardingNode.getShardingNode(executorName));
        List<Integer> getLocalHostItemsList;
        if (StringUtils.isEmpty(getLocalHostItems)) {
            getLocalHostItemsList = getLocalHostFailoverItems;
        } else {
            getLocalHostItemsList = ItemUtils.toItemList(getLocalHostItems);
            getLocalHostItemsList.addAll(getLocalHostFailoverItems);
        }
        getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.getShardingNode(executorName));
        curatorTransactionFinal.create().forPath(
                JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(executorName)),
                ItemUtils.toItemsString(getLocalHostItemsList).getBytes(StandardCharsets.UTF_8)).and();
        curatorTransactionFinal.commit();
        LogUtils.info(log, jobName, "removeAndCreateShardingInfo append LocalHostFailoverItems.");
    } catch (Exception e) {
        // Concurrent sharding tasks may make the computed result lag behind; if a server
        // node has already been deleted, the commit fails.
        // This usually does not affect the final outcome: the shards are still assigned
        // correctly, because a resharding event will still be handled later.
        // Log at warn level to avoid unnecessary alerts.
        LogUtils.warn(log, jobName, "Commit shards failed", e);
    } finally {
        mutex.release();
    }
}
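The core of the method above is plain set arithmetic on the comma-separated item strings stored on the ZK nodes: subtract the failed-over items from every other executor's list, then append them to the local executor's list. A minimal, self-contained sketch of that logic follows; parseItems and joinItems are hypothetical stand-ins for ItemUtils.toItemList and ItemUtils.toItemsString, written here only so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the item redistribution performed by removeAndCreateShardingInfo.
class FailoverRedistribution {

    // Parse a comma-separated item string such as "0,1,2" into a list of ints.
    static List<Integer> parseItems(String value) {
        if (value == null || value.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // Join items back into the comma-separated string stored on the ZK node.
    static String joinItems(List<Integer> items) {
        return items.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    // Remove the failed-over items from another executor's item string.
    static String removeFailoverItems(String otherExecutorItems, List<Integer> failoverItems) {
        Set<Integer> remaining = new LinkedHashSet<>(parseItems(otherExecutorItems));
        remaining.removeAll(failoverItems);
        return joinItems(new ArrayList<>(remaining));
    }

    // Append the failed-over items to the local executor's item string.
    static String appendFailoverItems(String localItems, List<Integer> failoverItems) {
        List<Integer> merged = parseItems(localItems);
        merged.addAll(failoverItems);
        return joinItems(merged);
    }
}
```

For example, if the crashed executor held "2,3" and the local executor held "0,1", removeFailoverItems("2,3", items 2 and 3) yields an empty string for the dead node, and appendFailoverItems("0,1", items 2 and 3) yields "0,1,2,3" for the local node, which is exactly what the transaction writes back.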
Then, in the execute method of com/vip/saturn/job/basic/AbstractElasticJob.java, add the following code before the line executeJobInternal(shardingContext);
// Update the sharding items for the failed-over shards
if (!failoverService.getLocalHostFailoverItems().isEmpty()) {
    shardingService.removeAndCreateShardingInfo(failoverService.getLocalHostFailoverItems());
}
That completes the change; it has been tested and works as expected.