Analysis of an Elastic-Job Startup "Hang" Problem
Problem Record
We recently introduced Elastic-Job (version 2.1.5) into a project to run scheduled jobs with distributed coordination. After the relevant job configuration was added and the project started, the main thread appeared to hang: it did no further processing and produced no further log output.
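For reference, the kind of configuration involved looks like the minimal sketch below, using the elastic-job-lite Spring namespace. This is not the project's actual file: the ZooKeeper address, job class, and cron expression are placeholders, while the namespace payrisk-job and job name dailyScanMercReratingJob are taken from the logs further down.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:reg="http://www.dangdang.com/schema/ddframe/reg"
       xmlns:job="http://www.dangdang.com/schema/ddframe/job"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.dangdang.com/schema/ddframe/reg
                           http://www.dangdang.com/schema/ddframe/reg/reg.xsd
                           http://www.dangdang.com/schema/ddframe/job
                           http://www.dangdang.com/schema/ddframe/job/job.xsd">

    <!-- Registry center: the ZooKeeper connection elastic-job coordinates through -->
    <reg:zookeeper id="regCenter" server-lists="localhost:2181" namespace="payrisk-job"
                   base-sleep-time-milliseconds="1000" max-sleep-time-milliseconds="3000" max-retries="3"/>

    <!-- A simple job; class and cron are illustrative placeholders -->
    <job:simple id="dailyScanMercReratingJob" class="com.example.job.DailyScanMercReratingJob"
                registry-center-ref="regCenter" cron="0 0 2 * * ?" sharding-total-count="1"/>
</beans>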
The startup log output was as follows:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.049] [] [StdSchedulerFactory] [Using default implementation for ThreadExecutor]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.130] [] [SchedulerSignalerImpl] [Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.131] [] [QuartzScheduler] [Quartz Scheduler v.2.2.1 created.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.135] [] [JobShutdownHookPlugin] [Registering Quartz shutdown hook.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.136] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler 'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
Solution
Straight to the solution: align every Curator framework dependency in the project to version 2.10.0, including:
- curator-client
- curator-recipes
- curator-framework
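In a Maven project, one way to enforce this (a sketch, assuming Maven is the build tool) is to pin all three artifacts in dependencyManagement so transitive versions cannot win:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-client</artifactId>
            <version>2.10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-framework</artifactId>
            <version>2.10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-recipes</artifactId>
            <version>2.10.0</version>
        </dependency>
    </dependencies>
</dependencyManagement>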
Tracking Down the Problem
Breakpoints were set at the following points in the Spring framework to trace the project's startup:
- the ContextLoaderListener.contextInitialized() method
- the AbstractApplicationContext.refresh() method
Execution turned out to block inside AbstractApplicationContext.refresh(): the call to finishBeanFactoryInitialization(beanFactory) entered a wait and never returned. In Spring's startup sequence, finishBeanFactoryInitialization() is the step that finishes instantiating the singleton beans, so it is where the elastic-job code that actually operates on the jobs gets executed.
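As an aside, a thread dump can locate this kind of hang faster than stepping through breakpoints: dump the stacks of the stuck JVM and look for the startup thread parked in Object.wait(); its stack trace points straight at the blocking call. A minimal sketch, using the standard JDK tools (<pid> is a placeholder for the process id):

jps -l          # list running JVMs and their process ids
jstack <pid>    # dump all thread stacks; look for threads stuck in Object.wait()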
The last line the log ever printed was:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
Searching the StdSchedulerFactory class for the "Quartz scheduler version" log statement, a breakpoint was set there to continue tracing:
// ... preceding code omitted
jrsf.initialize(scheduler);
qs.initialize();
getLog().info(
        "Quartz scheduler '" + scheduler.getSchedulerName()
                + "' initialized from " + propSrc);
// breakpoint set here
getLog().info("Quartz scheduler version: " + qs.getVersion());
// prevents the repository from being garbage collected
qs.addNoGCObject(schedRep);
// prevents the db manager from being garbage collected
if (dbMgr != null) {
    qs.addNoGCObject(dbMgr);
}
schedRep.bind(scheduler);
return scheduler;
With that breakpoint in place, elastic-job did continue past it; following the execution further eventually led to the registerStartUpInfo method of the SchedulerFacade class:
/**
 * Register the job startup information.
 *
 * @param enabled whether the job is enabled
 */
public void registerStartUpInfo(final boolean enabled) {
    listenerManager.startAllListeners();
    leaderService.electLeader();
    serverService.persistOnline(enabled);
    instanceService.persistOnline();
    shardingService.setReshardingFlag();
    monitorService.listen();
    if (!reconcileService.isRunning()) {
        reconcileService.startAsync();
    }
}
Execution blocks at leaderService.electLeader(). From all of the above we can finally conclude:
- elastic-job falls into an indefinite wait during the job's leader election; in other words, it cannot elect a leader node to execute the task.
Reading the LeaderService code shows that elastic-job implements leader election with the Curator framework's LeaderLatch class.
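For readers unfamiliar with the recipe, here is a minimal, self-contained sketch of how LeaderLatch is typically used; the connection string and latch path are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchDemo {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // LeaderLatch is Closeable, so try-with-resources releases the latch on exit
        try (LeaderLatch latch = new LeaderLatch(client, "/demo/leader/latch")) {
            latch.start();   // joins the election by creating a node under the latch path
            latch.await();   // blocks until this participant becomes leader
            System.out.println("acquired leadership: " + latch.hasLeadership());
        } finally {
            client.close();
        }
    }
}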
The thread's actual wait happens in the executeInLeader method of JobNodeStorage:
/**
 * Execute an operation on the leader node.
 *
 * @param latchNode name of the job node used for the distributed latch
 * @param callback callback carrying the operation to execute
 */
public void executeInLeader(final String latchNode, final LeaderExecutionCallback callback) {
    try (LeaderLatch latch = new LeaderLatch(getClient(), jobNodePath.getFullPath(latchNode))) {
        latch.start();
        latch.await();
        callback.execute();
        //CHECKSTYLE:OFF
    } catch (final Exception ex) {
        //CHECKSTYLE:ON
        handleException(ex);
    }
}
This method calls latch.await() to wait for leadership; because leadership is never acquired, the thread waits forever.
The mechanism of LeaderLatch, roughly: every participant creates an ephemeral sequential node under the same ZooKeeper path, and the participant whose node has the lowest sequence number holds leadership.
LeaderLatch's await method looks like this:
public void await() throws InterruptedException, EOFException
{
    synchronized(this)
    {
        while ( (state.get() == State.STARTED) && !hasLeadership.get() )
        {
            wait();
        }
    }
    if ( state.get() != State.STARTED )
    {
        throw new EOFException();
    }
}
If the LeaderLatch never obtains leadership, the current thread therefore stays in wait() indefinitely.
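One defensive alternative is LeaderLatch's overload await(timeout, unit), which returns false instead of blocking forever when leadership is not acquired in time. Below is a hypothetical variant of the executeInLeader method quoted above, not elastic-job's actual code; the 30-second timeout is an arbitrary illustrative choice, and the surrounding class members (getClient(), jobNodePath, handleException()) are reused from the quoted code:

public void executeInLeader(final String latchNode, final LeaderExecutionCallback callback) {
    try (LeaderLatch latch = new LeaderLatch(getClient(), jobNodePath.getFullPath(latchNode))) {
        latch.start();
        // await with a timeout: returns true only if leadership was acquired in time
        if (latch.await(30, java.util.concurrent.TimeUnit.SECONDS)) {
            callback.execute();
        } else {
            throw new IllegalStateException("leader election timed out for: " + latchNode);
        }
    } catch (final Exception ex) {
        handleException(ex);
    }
}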
Resolving the Problem
With the hang located, the next question is why leadership cannot be acquired.
Inspecting the nodes on ZooKeeper showed that after a normal project startup, elastic-job writes a node of the following form to ZK:
/{job-namespace}/{job-id}/leader/election/latch
The failing project had no such node, so some ZK operation must have gone wrong; exactly where was not yet clear at this point.
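The check itself can be done from the ZooKeeper command-line client, for example (the server address is a placeholder; the path matches the DEBUG log below):

zkCli.sh -server localhost:2181
ls /payrisk-job/dailyScanMercReratingJob/leader/election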
Raising the project's log level to DEBUG surfaced the following output:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.687] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler 'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
[DEBUG] [Timer-0] [2018-10-10 17:51:49.553] [] [UpdateChecker] [Checking for available updated version of Quartz...]
[DEBUG] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:49.586] [] [LeaderService] [Elect a new leader now.]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.724] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.738] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.759] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.769] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.791] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.803] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.813] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.818] [] [LeaderService] [Elect a new leader now.]
[DEBUG] [Timer-0] [2018-10-10 17:51:51.261] [] [UpdateChecker] [Quartz version update check failed: Server returned HTTP response code: 403 for URL: http://www.terracotta.org/kit/reflector?kitID=quartz&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_112&platform=x86_64&tc-version=2.2.1&tc-product=Quartz&source=Quartz&uptime-secs=1&patch=UNKNOWN]
The RegExceptionHandler lines are emitted by elastic-job's RegExceptionHandler.handleException() method:
/**
 * Handle an exception.
 *
 * <p>Swallow interruption and connection-loss exceptions; wrap and rethrow everything else as a registry exception.</p>
 *
 * @param cause the exception to handle
 */
public static void handleException(final Exception cause) {
    if (null == cause) {
        return;
    }
    if (isIgnoredException(cause) || null != cause.getCause() && isIgnoredException(cause.getCause())) {
        log.debug("Elastic job: ignored exception for: {}", cause.getMessage());
    } else if (cause instanceof InterruptedException) {
        Thread.currentThread().interrupt();
    } else {
        throw new RegException(cause);
    }
}
So elastic-job silently ignores the ZK operation exception here: leader election fails, there is no fallback handling, and the main thread is left stuck in wait().
NoNodeException
Per the investigation above, leader election fails because the Curator framework throws a NoNodeException. A quick search turns up the fix for this problem: unify the project's Curator version.
Why the exception is thrown in the first place deserves deeper study and is left for later. A rough plan: use the exception's stack trace to find where it is thrown, then compare the code paths Curator 2.7.0 and 2.10.0 take during leader election and where they differ. One guess about the NoNodeException: in 2.10.0 the underlying API creates the node itself during election, whereas in 2.7.0 the election code may be expected to create the node first. (This is only a guess.)
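If that guess were right, a workaround would be to pre-create the latch node before the election starts. The sketch below is purely illustrative (LatchPathInitializer is a hypothetical helper, and the real fix remains unifying the Curator version):

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public final class LatchPathInitializer {

    /** Creates the latch node (and any missing parents) if it does not exist yet. */
    public static void ensureLatchPath(final CuratorFramework client, final String latchPath) throws Exception {
        try {
            client.create().creatingParentsIfNeeded().forPath(latchPath);
        } catch (final KeeperException.NodeExistsException ignored) {
            // another instance created it first; that is fine
        }
    }
}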
Final Solution
Unifying all of the project's Curator jars to version 2.10.0 resolved the problem.
NOTE: to verify the jars were fully upgraded, inspect the lib directory of the packaged project and confirm every Curator jar is 2.10.0, with no jars of any other version present.
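Assuming a Maven build, the dependency tree makes stray versions easy to spot even before packaging:

mvn dependency:tree -Dincludes=org.apache.curator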
Summary
Debugging this kind of problem is time-consuming. Lessons learned:
- If you know the framework reasonably well, first try tracing the code to the point where the problem occurs;
- Search the issues on the framework's GitHub page for similar reports;
- If that still does not solve it, raise the log level to DEBUG and study the logs carefully.
Additionally, when writing a framework:
- Do not swallow exceptions. Whatever the reason, if an exception must be ignored, at least log it at INFO or WARNING level, as sketched below.
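As a sketch of that advice applied to the handleException method quoted above (only the log level changes; whether WARN suits every ignored exception is a judgment call):

public static void handleException(final Exception cause) {
    if (null == cause) {
        return;
    }
    if (isIgnoredException(cause) || null != cause.getCause() && isIgnoredException(cause.getCause())) {
        // WARN instead of DEBUG: a deliberately swallowed exception stays visible in default logs
        log.warn("Elastic job: ignored exception", cause);
    } else if (cause instanceof InterruptedException) {
        Thread.currentThread().interrupt();
    } else {
        throw new RegException(cause);
    }
}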