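Preface
The previous article, on the Quartz database tables, introduced the 11 tables that Quartz provides by default. This article looks at how Quartz actually schedules triggers, and how it uses the database to implement distributed scheduling.
The scheduling thread
The scheduler class that Quartz provides internally is QuartzScheduler, which delegates the real-time scheduling work to QuartzSchedulerThread. When a trigger is due and its job has to run, QuartzSchedulerThread does not execute the job itself;
it hands the job to a ThreadPool. Which ThreadPool implementation to use and how many threads to initialize can both be set in the configuration file: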
org.quartz.threadPool.class: org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount: 10
org.quartz.threadPool.threadPriority: 5
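The commonly used thread pool is SimpleThreadPool. With the configuration above it starts 10 threads: SimpleThreadPool creates 10 WorkerThread instances, and the WorkerThreads execute the actual jobs.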
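The hand-off between the scheduling thread and the workers can be pictured with plain JDK executors. This is only an illustrative sketch, not Quartz code: HandOffSketch and runAll are hypothetical names, and a fixed ExecutorService stands in for SimpleThreadPool and its WorkerThreads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a dispatcher submits jobs to a fixed pool of workers
// instead of running them itself (compare threadCount: 10 above).
final class HandOffSketch {

    // Runs jobCount trivial jobs on threadCount workers and returns the
    // names of the threads that actually executed them.
    static Set<String> runAll(int threadCount, int jobCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < jobCount; i++) {
                // The dispatcher only submits; a worker thread runs the job.
                Runnable job = () -> workerNames.add(Thread.currentThread().getName());
                pending.add(pool.submit(job));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for each job to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return workerNames;
    }

    public static void main(String[] args) {
        Set<String> workers = runAll(10, 25);
        System.out.println("jobs ran on " + workers.size() + " worker thread(s)");
    }
}
```

The calling thread never appears in the returned set, which is the point of the design: a slow job cannot stall the thread that decides what fires next.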
Scheduling analysis
QuartzSchedulerThread is the core scheduling class. To see how Quartz implements scheduling, look at the core of its source code:
public void run() {
boolean lastAcquireFailed = false;
while (!halted.get()) {
try {
// check if we're supposed to pause...
synchronized (sigLock) {
while (paused && !halted.get()) {
try {
// wait until togglePause(false) is called...
sigLock.wait(1000L);
} catch (InterruptedException ignore) {
}
}
if (halted.get()) {
break;
}
}
int availThreadCount = qsRsrcs.getThreadPool().blockForAvailableThreads();
if(availThreadCount > 0) { // will always be true, due to semantics of blockForAvailableThreads...
List<OperableTrigger> triggers = null;
long now = System.currentTimeMillis();
clearSignaledSchedulingChange();
try {
triggers = qsRsrcs.getJobStore().acquireNextTriggers(
now + idleWaitTime, Math.min(availThreadCount, qsRsrcs.getMaxBatchSize()), qsRsrcs.getBatchTimeWindow());
lastAcquireFailed = false;
if (log.isDebugEnabled())
log.debug("batch acquisition of " + (triggers == null ? 0 : triggers.size()) + " triggers");
} catch (JobPersistenceException jpe) {
if(!lastAcquireFailed) {
qs.notifySchedulerListenersError(
"An error occurred while scanning for the next triggers to fire.",
jpe);
}
lastAcquireFailed = true;
continue;
} catch (RuntimeException e) {
if(!lastAcquireFailed) {
getLog().error("quartzSchedulerThreadLoop: RuntimeException "
+e.getMessage(), e);
}
lastAcquireFailed = true;
continue;
}
if (triggers != null && !triggers.isEmpty()) {
now = System.currentTimeMillis();
long triggerTime = triggers.get(0).getNextFireTime().getTime();
long timeUntilTrigger = triggerTime - now;
while(timeUntilTrigger > 2) {
synchronized (sigLock) {
if (halted.get()) {
break;
}
if (!isCandidateNewTimeEarlierWithinReason(triggerTime, false)) {
try {
// we could have blocked a long while
// on 'synchronize', so we must recompute
now = System.currentTimeMillis();
timeUntilTrigger = triggerTime - now;
if(timeUntilTrigger >= 1)
sigLock.wait(timeUntilTrigger);
} catch (InterruptedException ignore) {
}
}
}
if(releaseIfScheduleChangedSignificantly(triggers, triggerTime)) {
break;
}
now = System.currentTimeMillis();
timeUntilTrigger = triggerTime - now;
}
// this happens if releaseIfScheduleChangedSignificantly decided to release triggers
if(triggers.isEmpty())
continue;
// set triggers to 'executing'
List<TriggerFiredResult> bndles = new ArrayList<TriggerFiredResult>();
boolean goAhead = true;
synchronized(sigLock) {
goAhead = !halted.get();
}
if(goAhead) {
try {
List<TriggerFiredResult> res = qsRsrcs.getJobStore().triggersFired(triggers);
if(res != null)
bndles = res;
} catch (SchedulerException se) {
qs.notifySchedulerListenersError(
"An error occurred while firing triggers '"
+ triggers + "'", se);
//QTZ-179 : a problem occurred interacting with the triggers from the db
//we release them and loop again
for (int i = 0; i < triggers.size(); i++) {
qsRsrcs.getJobStore().releaseAcquiredTrigger(triggers.get(i));
}
continue;
}
}
for (int i = 0; i < bndles.size(); i++) {
TriggerFiredResult result = bndles.get(i);
TriggerFiredBundle bndle = result.getTriggerFiredBundle();
Exception exception = result.getException();
if (exception instanceof RuntimeException) {
getLog().error("RuntimeException while firing trigger " + triggers.get(i), exception);
qsRsrcs.getJobStore().releaseAcquiredTrigger(triggers.get(i));
continue;
}
// it's possible to get 'null' if the triggers was paused,
// blocked, or other similar occurrences that prevent it being
// fired at this time... or if the scheduler was shutdown (halted)
if (bndle == null) {
qsRsrcs.getJobStore().releaseAcquiredTrigger(triggers.get(i));
continue;
}
JobRunShell shell = null;
try {
shell = qsRsrcs.getJobRunShellFactory().createJobRunShell(bndle);
shell.initialize(qs);
} catch (SchedulerException se) {
qsRsrcs.getJobStore().triggeredJobComplete(triggers.get(i), bndle.getJobDetail(), CompletedExecutionInstruction.SET_ALL_JOB_TRIGGERS_ERROR);
continue;
}
if (qsRsrcs.getThreadPool().runInThread(shell) == false) {
// this case should never happen, as it is indicative of the
// scheduler being shutdown or a bug in the thread pool or
// a thread pool being used concurrently - which the docs
// say not to do...
getLog().error("ThreadPool.runInThread() return false!");
qsRsrcs.getJobStore().triggeredJobComplete(triggers.get(i), bndle.getJobDetail(), CompletedExecutionInstruction.SET_ALL_JOB_TRIGGERS_ERROR);
}
}
continue; // while (!halted)
}
} else { // if(availThreadCount > 0)
// should never happen, if threadPool.blockForAvailableThreads() follows contract
continue; // while (!halted)
}
long now = System.currentTimeMillis();
long waitTime = now + getRandomizedIdleWaitTime();
long timeUntilContinue = waitTime - now;
synchronized(sigLock) {
try {
if(!halted.get()) {
// QTZ-336 A job might have been completed in the mean time and we might have
// missed the scheduled changed signal by not waiting for the notify() yet
// Check that before waiting for too long in case this very job needs to be
// scheduled very soon
if (!isScheduleChanged()) {
sigLock.wait(timeUntilContinue);
}
}
} catch (InterruptedException ignore) {
}
}
} catch(RuntimeException re) {
getLog().error("Runtime error occurred in main trigger firing loop.", re);
}
} // while (!halted)
// drop references to scheduler stuff to aid garbage collection...
qs = null;
qsRsrcs = null;
}
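1. halted and paused
These are two flags meaning stopped and paused, respectively. halted (an AtomicBoolean) defaults to false and only becomes true when QuartzScheduler executes shutdown(). paused defaults to true and is set to false when QuartzScheduler executes start().
Once the scheduler has started normally, QuartzSchedulerThread can proceed past this check.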
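The pause/halt gate at the top of the loop can be reduced to a few lines. The following is a minimal sketch under assumptions: PauseGateSketch and awaitRunnable are hypothetical names, and only the flag handling is modelled, not the rest of the scheduler.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the pause/halt gate at the top of the scheduling loop.
final class PauseGateSketch {
    private final Object sigLock = new Object();
    private final AtomicBoolean halted = new AtomicBoolean(false);
    private boolean paused = true; // starts paused, like the scheduling thread

    // Counterpart of start(): clears the pause flag and wakes the loop.
    void start() {
        synchronized (sigLock) {
            paused = false;
            sigLock.notifyAll();
        }
    }

    // Counterpart of shutdown(): sets the halt flag and wakes the loop.
    void shutdown() {
        halted.set(true);
        synchronized (sigLock) {
            sigLock.notifyAll();
        }
    }

    // Blocks while paused. Returns true if the loop may proceed and
    // false if it was halted (or interrupted) while waiting.
    boolean awaitRunnable() {
        synchronized (sigLock) {
            while (paused && !halted.get()) {
                try {
                    sigLock.wait(1000L);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return !halted.get();
        }
    }
}
```

The timed wait means a missed notification costs at most one second before the flags are checked again.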
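2. availThreadCount
Asks the SimpleThreadPool how many WorkerThreads are free. blockForAvailableThreads() blocks until at least one worker is available, so by the time it returns availThreadCount > 0 always holds (as the source comment notes) and the loop continues with the rest of the logic.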
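The blocking behaviour can be sketched with wait/notify. This is an assumption-laden illustration rather than the SimpleThreadPool implementation: BlockingPoolSketch, take and release are hypothetical names, and the pool only counts free workers instead of managing real threads.

```java
// Hypothetical sketch of a pool that blocks callers until a worker is free.
final class BlockingPoolSketch {
    private final Object lock = new Object();
    private int available;

    BlockingPoolSketch(int threadCount) {
        this.available = threadCount;
    }

    // Blocks until at least one worker is free, then returns the free count.
    // Returns 0 only if the waiting thread is interrupted (sketch-only choice).
    int blockForAvailableThreads() {
        synchronized (lock) {
            while (available < 1) {
                try {
                    lock.wait(500L);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return 0;
                }
            }
            return available;
        }
    }

    // Marks one worker as busy.
    void take() {
        synchronized (lock) {
            if (available < 1) {
                throw new IllegalStateException("no free worker");
            }
            available--;
        }
    }

    // Marks one worker as free again and wakes any blocked caller.
    void release() {
        synchronized (lock) {
            available++;
            lock.notifyAll();
        }
    }
}
```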
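3. acquireNextTriggers
Queries the triggers that are due to fire within an upcoming time span. Three parameters matter here: idleWaitTime, maxBatchSize, and batchTimeWindow. All three can be set in the configuration file: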
org.quartz.scheduler.idleWaitTime:30000
org.quartz.scheduler.batchTriggerAcquisitionMaxCount:1
org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow:0
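idleWaitTime: how long (in milliseconds) the scheduler waits before querying again for available triggers when it is otherwise idle; the default is 30 seconds.
batchTriggerAcquisitionMaxCount: the maximum number of triggers a scheduler node may acquire (for firing) at one time; the default is 1.
batchTriggerAcquisitionFireAheadTimeWindow: how far (in milliseconds) ahead of its scheduled fire time a trigger may be acquired and fired; the default is 0.
Next, look at the source of the acquireNextTriggers method: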
public List<OperableTrigger> acquireNextTriggers(final long noLaterThan, final int maxCount, final long timeWindow)
        throws JobPersistenceException {
    String lockName;
    if(isAcquireTriggersWithinLock() || maxCount > 1) {
        lockName = LOCK_TRIGGER_ACCESS;
    } else {
        lockName = null;
    }
    return executeInNonManagedTXLock(lockName,
            new TransactionCallback<List<OperableTrigger>>() {
                public List<OperableTrigger> execute(Connection conn) throws JobPersistenceException {
                    return acquireNextTrigger(conn, noLaterThan, maxCount, timeWindow);
                }
            },
            ......
            });
}
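As the code shows, the LOCK_TRIGGER_ACCESS lock is used only when acquireTriggersWithinLock is set or when maxCount > 1, where maxCount is the smaller of the available thread count and batchTriggerAcquisitionMaxCount. In other words, with the default configuration no lock is taken here.
So if several nodes execute acquireNextTriggers at the same time, could the same trigger end up being executed on more than one node?
Note: acquireTriggersWithinLock can be set in the configuration file: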
org.quartz.jobStore.acquireTriggersWithinLock=true
acquireTriggersWithinLock: whether a lock must be held while acquiring triggers; the default is false. If batchTriggerAcquisitionMaxCount > 1, it is best to set acquireTriggersWithinLock to true as well.
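To make the lock decision concrete, here is a small sketch of the same condition in isolation. LockChoiceSketch and lockNameFor are hypothetical names, and the string value given to LOCK_TRIGGER_ACCESS is an assumption used only for illustration.

```java
// Hypothetical sketch of the lock decision made in acquireNextTriggers.
final class LockChoiceSketch {
    // Assumed value of the LOCK_TRIGGER_ACCESS constant, for illustration only.
    static final String LOCK_TRIGGER_ACCESS = "TRIGGER_ACCESS";

    // maxCount as computed by the scheduling loop:
    // Math.min(availThreadCount, maxBatchSize).
    static int maxCount(int availThreadCount, int maxBatchSize) {
        return Math.min(availThreadCount, maxBatchSize);
    }

    // Returns the lock name to use, or null when no lock is taken.
    static String lockNameFor(boolean acquireTriggersWithinLock, int maxCount) {
        return (acquireTriggersWithinLock || maxCount > 1) ? LOCK_TRIGGER_ACCESS : null;
    }

    public static void main(String[] args) {
        // Defaults: 10 free workers, batch size 1, acquireTriggersWithinLock=false.
        System.out.println(lockNameFor(false, maxCount(10, 1))); // prints null
        // A larger batch size switches the lock on by itself.
        System.out.println(lockNameFor(false, maxCount(10, 5))); // prints TRIGGER_ACCESS
        // So does the explicit setting.
        System.out.println(lockNameFor(true, maxCount(10, 1)));  // prints TRIGGER_ACCESS
    }
}
```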
With that question in mind, look at the source of the acquireNextTrigger method that the TransactionCallback invokes:
protected List<OperableTrigger> acquireNextTrigger(Connection conn, long noLaterThan, int maxCount, long timeWindow)
throws JobPersistenceException {
if (timeWindow < 0) {
throw new IllegalArgumentException();
}
List<OperableTrigger> acquiredTriggers = new ArrayList<OperableTrigger>();
Set<JobKey> acquiredJobKeysForNoConcurrentExec = new HashSet<JobKey>();
final int MAX_DO_LOOP_RETRY = 3;
int currentLoopCount = 0;
do {
currentLoopCount ++;
try {
List<TriggerKey> keys = getDelegate().selectTriggerToAcquire(conn, noLaterThan + timeWindow, getMisfireTime(), maxCount);
// No trigger is ready to fire yet.
if (keys == null || keys.size() == 0)
return acquiredTriggers;
long batchEnd = noLaterThan;
for(TriggerKey triggerKey: keys) {
// If our trigger is no longer available, try a new one.
OperableTrigger nextTrigger = retrieveTrigger(conn, triggerKey);
if(nextTrigger == null) {
continue; // next trigger
}
// If trigger's job is set as @DisallowConcurrentExecution, and it has already been added to result, then
// put it back into the timeTriggers set and continue to search for next trigger.
JobKey jobKey = nextTrigger.getJobKey();
JobDetail job;
try {
job = retrieveJob(conn, jobKey);
} catch (JobPersistenceException jpe) {
try {
getLog().error("Error retrieving job, setting trigger state to ERROR.", jpe);
getDelegate().updateTriggerState(conn, triggerKey, STATE_ERROR);
} catch (SQLException sqle) {
getLog().error("Unable to set trigger state to ERROR.", sqle);
}
continue;
}
if (job.isConcurrentExectionDisallowed()) {
if (acquiredJobKeysForNoConcurrentExec.contains(jobKey)) {
continue; // next trigger
} else {
acquiredJobKeysForNoConcurrentExec.add(jobKey);
}
}
if (nextTrigger.getNextFireTime().getTime() > batchEnd) {
break;
}
// We now have a acquired trigger, let's add to return list.
// If our trigger was no longer in the expected state, try a new one.
int rowsUpdated = getDelegate().updateTriggerStateFromOtherState(conn, triggerKey, STATE_ACQUIRED, STATE_WAITING);
if (rowsUpdated <= 0) {
continue; // next trigger
}
nextTrigger.setFireInstanceId(getFiredTriggerRecordId());
getDelegate().insertFiredTrigger(conn, nextTrigger, STATE_ACQUIRED, null);
if(acquiredTriggers.isEmpty()) {
batchEnd = Math.max(nextTrigger.getNextFireTime().getTime(), System.currentTimeMillis()) + timeWindow;
}
acquiredTriggers.add(nextTrigger);
}
// if we didn't end up with any trigger to fire from that first
// batch, try again for another batch. We allow with a max retry count.
if(acquiredTriggers.size() == 0 && currentLoopCount < MAX_DO_LOOP_RETRY) {
continue;
}
// We are done with the while loop.
break;
} catch (Exception e) {
throw new JobPersistenceException(
"Couldn't acquire next trigger: " + e.getMessage(), e);
}
} while (true);
// Return the acquired trigger list
return acquiredTriggers;
}
First, note that the call to selectTriggerToAcquire introduces a new parameter: misfireTime = current time - misfireThreshold. misfireThreshold can be set in the configuration file:
org.quartz.jobStore.misfireThreshold: 60000
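misfireThreshold: the misfire tolerance for triggers. For example, with 10 threads but 11 jobs due, one job is delayed; the threshold can be understood as how far past its fire time the scheduling engine will tolerate a trigger before treating it as misfired. The query SQL looks like this: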
SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY
FROM qrtz_TRIGGERS
WHERE SCHED_NAME = 'myScheduler'
AND TRIGGER_STATE = 'WAITING'
AND NEXT_FIRE_TIME <= noLaterThan
AND (MISFIRE_INSTR = -1 OR
(MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= noEarlierThan))
ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
Here noLaterThan = current time + idleWaitTime + batchTriggerAcquisitionFireAheadTimeWindow,
and noEarlierThan = current time - misfireThreshold.
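The WHERE and ORDER BY clauses of that query can be mirrored in memory to see which triggers qualify. This is a sketch under assumptions: CandidateFilterSketch, Row and select are hypothetical names, the input list is assumed to contain only WAITING triggers, and the times are arbitrary millisecond values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory mirror of the trigger selection query.
final class CandidateFilterSketch {
    static final int MISFIRE_IGNORE = -1; // corresponds to MISFIRE_INSTR = -1

    static final class Row {
        final String name;
        final long nextFireTime;
        final int priority;
        final int misfireInstr;

        Row(String name, long nextFireTime, int priority, int misfireInstr) {
            this.name = name;
            this.nextFireTime = nextFireTime;
            this.priority = priority;
            this.misfireInstr = misfireInstr;
        }
    }

    // Returns the names of qualifying triggers, ordered by
    // NEXT_FIRE_TIME ASC, PRIORITY DESC.
    static List<String> select(List<Row> waiting, long noLaterThan, long noEarlierThan) {
        List<Row> hits = new ArrayList<>();
        for (Row r : waiting) {
            boolean dueSoon = r.nextFireTime <= noLaterThan;
            boolean notMisfired = r.misfireInstr == MISFIRE_IGNORE
                    || r.nextFireTime >= noEarlierThan;
            if (dueSoon && notMisfired) {
                hits.add(r);
            }
        }
        hits.sort(Comparator.comparingLong((Row r) -> r.nextFireTime)
                .thenComparing(Comparator.comparingInt((Row r) -> r.priority).reversed()));
        List<String> names = new ArrayList<>();
        for (Row r : hits) {
            names.add(r.name);
        }
        return names;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        long noLaterThan = now + 30_000L + 0L; // idleWaitTime + fire-ahead window
        long noEarlierThan = now - 60_000L;    // misfireThreshold
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a", 120_000L, 5, 0));  // due soon
        rows.add(new Row("b", 200_000L, 5, 0));  // too far in the future
        rows.add(new Row("c", 10_000L, 5, 0));   // overdue past the threshold
        rows.add(new Row("d", 10_000L, 5, -1));  // overdue, but misfires ignored
        rows.add(new Row("e", 120_000L, 9, 0));  // same time as a, higher priority
        System.out.println(select(rows, noLaterThan, noEarlierThan)); // [d, e, a]
    }
}
```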
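After the query, the loop calls updateTriggerStateFromOtherState() for each candidate to move the trigger from STATE_WAITING to STATE_ACQUIRED, and checks whether rowsUpdated is greater than 0. Because the update only matches a row that is still in STATE_WAITING, only one node's update can succeed even when several nodes have queried the same trigger. After the state is updated, a record with state STATE_ACQUIRED is inserted into the qrtz_fired_triggers table, marking the trigger as acquired by the current node.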
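The reason only one node can win is that the state change is a conditional update. A minimal sketch of that idea follows, with a ConcurrentMap standing in for the triggers table; AcquireRaceSketch and its method names are hypothetical, and no database is involved.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: a conditional state update that only one caller can win.
final class AcquireRaceSketch {
    private final ConcurrentMap<String, String> triggerStates = new ConcurrentHashMap<>();

    void addWaitingTrigger(String triggerKey) {
        triggerStates.put(triggerKey, "WAITING");
    }

    // Mimics an UPDATE that sets the new state only where the current state
    // equals oldState. Returns the number of "rows" updated (0 or 1).
    int updateTriggerStateFromOtherState(String triggerKey, String newState, String oldState) {
        return triggerStates.replace(triggerKey, oldState, newState) ? 1 : 0;
    }

    String stateOf(String triggerKey) {
        return triggerStates.get(triggerKey);
    }

    public static void main(String[] args) {
        AcquireRaceSketch table = new AcquireRaceSketch();
        table.addWaitingTrigger("t1");
        // Two nodes saw t1 in their SELECT; both now try to claim it.
        int nodeA = table.updateTriggerStateFromOtherState("t1", "ACQUIRED", "WAITING");
        int nodeB = table.updateTriggerStateFromOtherState("t1", "ACQUIRED", "WAITING");
        System.out.println("node A rows updated: " + nodeA); // 1
        System.out.println("node B rows updated: " + nodeB); // 0
    }
}
```

The loser sees zero rows updated and simply moves on to the next candidate, which is what the `continue; // next trigger` branch does in the source.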
4. executeInNonManagedTXLock
Quartz uses its distributed lock in many places. To see how the lock is implemented, here is the source of executeInNonManagedTXLock:
protected <T> T executeInNonManagedTXLock(
String lockName,
TransactionCallback<T> txCallback, final TransactionValidator<T> txValidator) throws JobPersistenceException {
boolean transOwner = false;
Connection conn = null;
try {
if (lockName != null) {
// If we aren't using db locks, then delay getting DB connection
// until after acquiring the lock since it isn't needed.
if (getLockHandler().requiresConnection()) {
conn = getNonManagedTXConnection();
}
transOwner = getLockHandler().obtainLock(conn, lockName);
}
if (conn == null) {
conn = getNonManagedTXConnection();
}
final T result = txCallback.execute(conn);
try {
commitConnection(conn);
} catch (JobPersistenceException e) {
rollbackConnection(conn);
if (txValidator == null || !retryExecuteInNonManagedTXLock(lockName, new TransactionCallback<Boolean>() {
@Override
public Boolean execute(Connection conn) throws JobPersistenceException {
return txValidator.validate(conn, result);
}
})) {
throw e;
}
}
Long sigTime = clearAndGetSignalSchedulingChangeOnTxCompletion();
if(sigTime != null && sigTime >= 0) {
signalSchedulingChangeImmediately(sigTime);
}
return result;
} catch (JobPersistenceException e) {
rollbackConnection(conn);
throw e;
} catch (RuntimeException e) {
rollbackConnection(conn);
throw new JobPersistenceException("Unexpected runtime exception: "
+ e.getMessage(), e);
} finally {
try {
releaseLock(lockName, transOwner);
} finally {
cleanupConnection(conn);
}
}
}
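It breaks down into roughly three steps: obtain the lock, run the logic, and release the lock. getLockHandler().obtainLock obtains the lock, txCallback.execute(conn) runs the logic, and commitConnection(conn) releases the lock, because committing the transaction releases the database row lock.
The distributed-lock interface in Quartz is Semaphore, and the default concrete implementation (when running clustered) is StdRowLockSemaphore. The interface is: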
public interface Semaphore {
    boolean obtainLock(Connection conn, String lockName) throws LockException;
    void releaseLock(String lockName) throws LockException;
    boolean requiresConnection();
}
Here is how obtainLock() obtains the lock:
public boolean obtainLock(Connection conn, String lockName)
        throws LockException {
    if (!isLockOwner(lockName)) {
        executeSQL(conn, lockName, expandedSQL, expandedInsertSQL);
        getThreadLocks().add(lockName);
    } else if(log.isDebugEnabled()) {
    }
    return true;
}
protected void executeSQL(Connection conn, final String lockName, final String expandedSQL, final String expandedInsertSQL) throws LockException {
    PreparedStatement ps = null;
    ResultSet rs = null;
    SQLException initCause = null;
    int count = 0;
    do {
        count++;
        try {
            ps = conn.prepareStatement(expandedSQL);
            ps.setString(1, lockName);
            rs = ps.executeQuery();
            if (!rs.next()) {
                getLog().debug(
                    "Inserting new lock row for lock: '" + lockName + "' being obtained by thread: " +
                    Thread.currentThread().getName());
                rs.close();
                rs = null;
                ps.close();
                ps = null;
                ps = conn.prepareStatement(expandedInsertSQL);
                ps.setString(1, lockName);
                int res = ps.executeUpdate();
                if(res != 1) {
                    if(count < 3) {
                        try {
                            Thread.sleep(1000L);
                        } catch (InterruptedException ignore) {
                            Thread.currentThread().interrupt();
                        }
                        continue;
                    }
                }
            }
            return; // obtained lock, go
        } catch (SQLException sqle) {
            ......
    } while(count < 4);
}
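obtainLock first checks whether the current thread already holds the lock; if not, it calls executeSQL. Two SQL statements matter here, expandedSQL and expandedInsertSQL. Taking SCHED_NAME = 'myScheduler' as an example: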
SELECT * FROM QRTZ_LOCKS WHERE SCHED_NAME = 'myScheduler' AND LOCK_NAME = ? FOR UPDATE
INSERT INTO QRTZ_LOCKS(SCHED_NAME, LOCK_NAME) VALUES ('myScheduler', ?)
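The SELECT statement ends with FOR UPDATE. If the LOCK_NAME row exists, then when several nodes execute this SQL only the first one proceeds, and the others block waiting.
If the LOCK_NAME row does not exist, several nodes execute expandedInsertSQL at the same time and only one insert succeeds; the nodes whose insert failed retry and execute expandedSQL again.
After txCallback has run, commitConnection is executed, which releases the current node's hold on LOCK_NAME so that other nodes can compete for it; finally releaseLock is called.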
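The obtain, execute, commit, release sequence can be sketched without a database. In this assumption-heavy illustration an in-JVM ReentrantLock per lock name stands in for the row locked by SELECT ... FOR UPDATE, so it only serializes threads in one process; TxLockSketch, executeInLock and the events list are hypothetical names, and rollback on failure is not modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of the lock / execute / commit / release sequence.
final class TxLockSketch {
    // One lock per lock name, standing in for a row in the locks table.
    private final ConcurrentMap<String, ReentrantLock> rows = new ConcurrentHashMap<>();
    // Records the order of steps so the sequence can be inspected.
    final List<String> events = Collections.synchronizedList(new ArrayList<String>());

    <T> T executeInLock(String lockName, Supplier<T> callback) {
        ReentrantLock row = null;
        if (lockName != null) {
            row = rows.computeIfAbsent(lockName, n -> new ReentrantLock());
            row.lock(); // stands in for SELECT ... FOR UPDATE
            events.add("lock");
        }
        try {
            T result = callback.get(); // the protected logic
            events.add("commit");      // a real commit would release the row lock
            return result;
        } finally {
            if (row != null) {
                row.unlock();          // always released, even if the callback throws
                events.add("unlock");
            }
        }
    }

    public static void main(String[] args) {
        TxLockSketch tx = new TxLockSketch();
        Integer acquired = tx.executeInLock("TRIGGER_ACCESS", () -> 42);
        System.out.println(acquired + " " + tx.events); // 42 [lock, commit, unlock]
    }
}
```

Passing null as the lock name skips the locking step entirely, which corresponds to the default acquireNextTriggers path described earlier.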
5. triggersFired
Fires the acquired triggers; triggersFired does this by calling triggerFired for each one. The code is:
protected TriggerFiredBundle triggerFired(Connection conn,
OperableTrigger trigger)
throws JobPersistenceException {
JobDetail job;
Calendar cal = null;
// Make sure trigger wasn't deleted, paused, or completed...
try { // if trigger was deleted, state will be STATE_DELETED
String state = getDelegate().selectTriggerState(conn,
trigger.getKey());
if (!state.equals(STATE_ACQUIRED)) {
return null;
}
} catch (SQLException e) {
throw new JobPersistenceException("Couldn't select trigger state: "
+ e.getMessage(), e);
}
try {
job = retrieveJob(conn, trigger.getJobKey());
if (job == null) { return null; }
} catch (JobPersistenceException jpe) {
try {
getLog().error("Error retrieving job, setting trigger state to ERROR.", jpe);
getDelegate().updateTriggerState(conn, trigger.getKey(),
STATE_ERROR);
} catch (SQLException sqle) {
getLog().error("Unable to set trigger state to ERROR.", sqle);
}
throw jpe;
}
if (trigger.getCalendarName() != null) {
cal = retrieveCalendar(conn, trigger.getCalendarName());
if (cal == null) { return null; }
}
try {
getDelegate().updateFiredTrigger(conn, trigger, STATE_EXECUTING, job);
} catch (SQLException e) {
throw new JobPersistenceException("Couldn't insert fired trigger: "
+ e.getMessage(), e);
}
Date prevFireTime = trigger.getPreviousFireTime();
// call triggered - to update the trigger's next-fire-time state...
trigger.triggered(cal);
String state = STATE_WAITING;
boolean force = true;
if (job.isConcurrentExectionDisallowed()) {
state = STATE_BLOCKED;
force = false;
try {
getDelegate().updateTriggerStatesForJobFromOtherState(conn, job.getKey(),
STATE_BLOCKED, STATE_WAITING);
getDelegate().updateTriggerStatesForJobFromOtherState(conn, job.getKey(),
STATE_BLOCKED, STATE_ACQUIRED);
getDelegate().updateTriggerStatesForJobFromOtherState(conn, job.getKey(),
STATE_PAUSED_BLOCKED, STATE_PAUSED);
} catch (SQLException e) {
throw new JobPersistenceException(
"Couldn't update states of blocked triggers: "
+ e.getMessage(), e);
}
}
if (trigger.getNextFireTime() == null) {
state = STATE_COMPLETE;
force = true;
}
storeTrigger(conn, trigger, job, true, state, force, false);
job.getJobDataMap().clearDirtyFlag();
return new TriggerFiredBundle(job, trigger, cal, trigger.getKey().getGroup()
.equals(Scheduler.DEFAULT_RECOVERY_GROUP), new Date(), trigger
.getPreviousFireTime(), prevFireTime, trigger.getNextFireTime());
}
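It first checks whether the trigger is in STATE_ACQUIRED and returns null if it is not. It then retrieves the JobDetail by jobKey and updates the corresponding fired-trigger record to the EXECUTING state. Finally it checks whether DisallowConcurrentExecution is enabled on the job: if so, the job must not run concurrently and the trigger's state becomes STATE_BLOCKED; otherwise it becomes STATE_WAITING. If the trigger has no next fire time, the state becomes STATE_COMPLETE instead.
A trigger in STATE_BLOCKED is not picked up by later scheduling passes; only after the job finishes and the state is set back to STATE_WAITING can it run again, which guarantees serial execution of the job.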
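The state chosen at the end of triggerFired depends on just two facts, which the following sketch isolates. FiredStateSketch and stateAfterFire are hypothetical names, and plain strings stand in for the STATE_* constants.

```java
// Hypothetical sketch of how the trigger's next state is chosen after firing.
final class FiredStateSketch {

    static String stateAfterFire(boolean concurrentExecutionDisallowed, boolean hasNextFireTime) {
        String state = "WAITING";
        if (concurrentExecutionDisallowed) {
            // The job must finish before any of its triggers can be acquired again.
            state = "BLOCKED";
        }
        if (!hasNextFireTime) {
            // Nothing left to fire: this overrides both WAITING and BLOCKED.
            state = "COMPLETE";
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(stateAfterFire(false, true));  // WAITING
        System.out.println(stateAfterFire(true, true));   // BLOCKED
        System.out.println(stateAfterFire(true, false));  // COMPLETE
    }
}
```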
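6. Executing the job
The JobRunShell that wraps the job is executed through the ThreadPool.
Explaining the earlier observations
The article on integrating Spring with Quartz for distributed scheduling ended with several distributed-scheduling tests. Those results can now be explained.
1. A given trigger runs on only one node at a time
As shown above, Quartz uses the distributed lock together with trigger states to ensure that only one node can execute it.
2. A job can be started again before its previous run has finished
Because the scheduling thread and the job execution threads are separate, jobs run in the ThreadPool and the two do not affect each other.
3. The DisallowConcurrentExecution annotation guarantees serial execution
In triggerFired, if DisallowConcurrentExecution is used, the STATE_BLOCKED state is introduced, which guarantees that the job runs serially.
Summary
This article has walked through the Quartz scheduling flow at the source-code level, without going into the finest details. It should be enough to give a reasonable explanation of the behavior observed in multi-node scheduling.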