Troubleshooting and Fixing Frequent Stalls in a ScheduledExecutorService Scheduler

Problem Description

First, here is the Java code for the scheduling service:

@Slf4j
@Component
public class TaskProcessSchedule {

    // Number of core threads
    private static final int THREAD_COUNT = 10;

    // Number of rows fetched per query (page step)
    private static final int ROWS_STEP = 30;

    @Resource
    private TaskDao taskDao;

    @Resource
    private TaskService taskService;

    private static ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(THREAD_COUNT);

    public TaskProcessSchedule() {
        for (int i = 0; i < THREAD_COUNT; i++) {
            scheduledExecutorService.scheduleAtFixedRate(
                    new TaskWorker(i * ROWS_STEP, ROWS_STEP),
                    10,
                    2,
                    TimeUnit.SECONDS
            );
        }
        log.info("TaskProcessSchedule scheduleAtFixedRate start success.");
    }
 
    class TaskWorker implements Runnable {
        private int offset;
        private int rows;

        TaskWorker(int offset, int rows) {
            this.offset = offset;
            this.rows = rows;
        }

        @Override
        public void run() {
            List<Task> taskList = taskDao.selectProcessingTaskByLimitRange(offset, rows);
            if (CollectionUtils.isEmpty(taskList)) {
                return;
            }
            log.info("TaskWorker: current schedule thread name is {}, taskList is {}", Thread.currentThread().getName(), JsonUtil.toJson(taskList));
            taskService.processTask(taskList);         
        }
    }
}

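As the code above shows, the service submits 10 periodic tasks to a pool of 10 scheduler threads. Each task first runs after a 10-second delay and then repeats every 2 seconds. The task class is TaskWorker: using its offset and rows parameters, it fetches the matching range of pending records from the database and passes them to TaskService's processTask method. The logic looks sound, yet in practice the schedule frequently stalled, and the symptom was simply that the scheduled tasks stopped running.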

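Locating the Problem

Now that we know the symptom, let's see how to locate the problem.

Use the jps command to find the PID of the Java process running on the server

Alternatively, run jps | grep "ServerName" to look up the PID of a specific service directly, where ServerName is the service name.

Use jstack PID | grep "schedule" to check the state of the scheduler threads

The output shows that all 10 scheduler threads we started are in the WAITING state.

Use jstack PID | grep "schedule-task-10" -A 50 to view the details of a specific thread

The stack trace shows that the scheduler thread is blocked inside the take() method of DelayedWorkQueue.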
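The same check can be approximated from inside the JVM. The sketch below is not part of the original service: the class name SchedulerThreadStates and the "schedule-task-" thread-name prefix (set through a custom ThreadFactory) are assumptions made for illustration. It starts a small scheduled pool, lets the workers go idle, and reports each worker's state via Thread.getAllStackTraces().

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SchedulerThreadStates {

    /** Starts a pool, lets its workers go idle, and returns each worker's thread state by name. */
    static Map<String, Thread.State> idleWorkerStates(int threads) throws InterruptedException {
        AtomicInteger seq = new AtomicInteger(1);
        // Hypothetical naming scheme so the workers are easy to find in a thread dump.
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, "schedule-task-" + seq.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(threads, factory);
        try {
            // Each submission starts one worker until the core size is reached.
            CountDownLatch done = new CountDownLatch(threads);
            for (int i = 0; i < threads; i++) {
                pool.schedule(() -> done.countDown(), 0, TimeUnit.MILLISECONDS);
            }
            done.await();
            Thread.sleep(300); // give the workers time to park on the now-empty queue

            Map<String, Thread.State> states = new TreeMap<>();
            for (Thread t : Thread.getAllStackTraces().keySet()) {
                if (t.getName().startsWith("schedule-task-")) {
                    states.put(t.getName(), t.getState());
                }
            }
            return states;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        idleWorkerStates(3).forEach((name, state) -> System.out.println(name + " -> " + state));
    }
}
```

With nothing left in the queue, every worker should report WAITING, which is the same picture the jstack output gives.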

In-Depth Analysis

The steps above tell us that the code is stuck here:

at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)

Next, let's analyze the code in question in detail.

        public RunnableScheduledFuture<?> take() throws InterruptedException {
            final ReentrantLock lock = this.lock;
            lock.lockInterruptibly();
            try {
                for (;;) {
                    RunnableScheduledFuture<?> first = queue[0];
                    if (first == null)
                        available.await();
                    else {
                        long delay = first.getDelay(NANOSECONDS);
                        if (delay <= 0)
                            return finishPoll(first);
                        first = null; // don't retain ref while waiting
                        if (leader != null)
                            available.await(); // line 1088
                        else {
                            Thread thisThread = Thread.currentThread();
                            leader = thisThread;
                            try {
                                available.awaitNanos(delay);
                            } finally {
                                if (leader == thisThread)
                                    leader = null;
                            }
                        }
                    }
                }
            } finally {
                if (leader == null && queue[0] != null)
                    available.signal();
                lock.unlock();
            }
        }

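As the code above shows, a worker calls await() both when the delay queue is empty and when the queue is non-empty but another thread already holds the leader role. When there is no leader, the current thread becomes the leader and calls awaitNanos(delay), waiting only until the head task is due. Next, let's look at scheduleAtFixedRate, the method used to submit the task: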
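The leader/follower behavior of take() can be observed from the outside. The following is a minimal sketch, assuming a hypothetical class LeaderFollowerDemo: it queues tasks that are not due for an hour, so one worker should become the leader (a timed wait in awaitNanos) while the others wait untimed in await().

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

final class LeaderFollowerDemo {

    /** Counts worker thread states while only far-future tasks are queued. */
    static Map<Thread.State, Integer> workerStates(int threads) throws InterruptedException {
        List<Thread> workers = new CopyOnWriteArrayList<>();
        // Keep a reference to every worker thread the pool creates.
        ThreadFactory factory = r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            workers.add(t);
            return t;
        };
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(threads, factory);
        try {
            // One submission per thread: each one starts a worker, and none is due for an hour.
            for (int i = 0; i < threads; i++) {
                pool.schedule(() -> { }, 1, TimeUnit.HOURS);
            }
            Thread.sleep(300); // let every worker reach take()

            Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
            for (Thread t : workers) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
            return counts;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(workerStates(3));
    }
}
```

With three workers, the expected split is one TIMED_WAITING thread (the leader) and two WAITING threads (the followers).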

    public ScheduledFuture<?> scheduleAtFixedRate(Runnable command,
                                                  long initialDelay,
                                                  long period,
                                                  TimeUnit unit) {
        if (command == null || unit == null)
            throw new NullPointerException();
        if (period <= 0)
            throw new IllegalArgumentException();
        ScheduledFutureTask<Void> sft =
            new ScheduledFutureTask<Void>(command,
                                          null,
                                          triggerTime(initialDelay, unit),
                                          unit.toNanos(period));
        RunnableScheduledFuture<Void> t = decorateTask(command, sft);
        sft.outerTask = t;
        delayedExecute(t);
        return t;
    }

scheduleAtFixedRate calls decorateTask to wrap the task as t, then hands that task to delayedExecute.

    private void delayedExecute(RunnableScheduledFuture<?> task) {
        if (isShutdown())
            reject(task);
        else {
            super.getQueue().add(task);
            if (isShutdown() &&
                !canRunInCurrentRunState(task.isPeriodic()) &&
                remove(task))
                task.cancel(false);
            else
                ensurePrestart();
        }
    }

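delayedExecute first checks whether the pool has been shut down and, if so, rejects the task. Otherwise it adds the task to the work queue and calls ensurePrestart (unless the pool was shut down in the meantime, in which case the task is removed and cancelled).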

    void ensurePrestart() {
        int wc = workerCountOf(ctl.get());
        if (wc < corePoolSize)
            addWorker(null, true);
        else if (wc == 0)
            addWorker(null, false);
    }

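ensurePrestart compares the current worker count with the core pool size. If there are fewer workers than corePoolSize, it starts a core worker; if there are no workers at all (which can happen when corePoolSize is 0), it starts a non-core worker. Both cases go through addWorker, just with different arguments.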
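The effect of ensurePrestart is visible through the public getPoolSize() method. This is a small sketch (PrestartDemo is a hypothetical name, not from the original service) that records the pool size after each submission:

```java
import java.util.Arrays;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PrestartDemo {

    /** Returns the pool size before any submission and after each schedule() call. */
    static int[] poolSizesAfterEachSubmit(int corePoolSize, int submissions) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(corePoolSize);
        try {
            int[] sizes = new int[submissions + 1];
            sizes[0] = pool.getPoolSize(); // no worker exists before the first submission
            for (int i = 1; i <= submissions; i++) {
                // A far-future task: the worker it starts stays alive, waiting on the queue.
                pool.schedule(() -> { }, 1, TimeUnit.HOURS);
                sizes[i] = pool.getPoolSize();
            }
            return sizes;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(poolSizesAfterEachSubmit(3, 5)));
    }
}
```

With a core size of 3 and 5 submissions, the sizes grow by one per submission and then stay at the core size: 0, 1, 2, 3, 3, 3.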

    private boolean addWorker(Runnable firstTask, boolean core) {
        retry:
        for (;;) {
            int c = ctl.get();
            int rs = runStateOf(c);

            // Check if queue empty only if necessary.
            if (rs >= SHUTDOWN &&
                ! (rs == SHUTDOWN &&
                   firstTask == null &&
                   ! workQueue.isEmpty()))
                return false;

            for (;;) {
                int wc = workerCountOf(c);
                if (wc >= CAPACITY ||
                    wc >= (core ? corePoolSize : maximumPoolSize))
                    return false;
                if (compareAndIncrementWorkerCount(c))
                    break retry;
                c = ctl.get();  // Re-read ctl
                if (runStateOf(c) != rs)
                    continue retry;
                // else CAS failed due to workerCount change; retry inner loop
            }
        }

        boolean workerStarted = false;
        boolean workerAdded = false;
        Worker w = null;
        try {
            w = new Worker(firstTask);
            final Thread t = w.thread;
            if (t != null) {
                final ReentrantLock mainLock = this.mainLock;
                mainLock.lock();
                try {
                    // Recheck while holding lock.
                    // Back out on ThreadFactory failure or if
                    // shut down before lock acquired.
                    int rs = runStateOf(ctl.get());

                    if (rs < SHUTDOWN ||
                        (rs == SHUTDOWN && firstTask == null)) {
                        if (t.isAlive()) // precheck that t is startable
                            throw new IllegalThreadStateException();
                        workers.add(w);
                        int s = workers.size();
                        if (s > largestPoolSize)
                            largestPoolSize = s;
                        workerAdded = true;
                    }
                } finally {
                    mainLock.unlock();
                }
                if (workerAdded) {
                    t.start();
                    workerStarted = true;
                }
            }
        } finally {
            if (! workerStarted)
                addWorkerFailed(w);
        }
        return workerStarted;
    }

addWorker creates a Worker, adds it to the workers set, and starts its thread. Next, let's look at what the Worker executes, namely its run method:

        public void run() {
            runWorker(this);
        }

run simply forwards the call to the enclosing executor's runWorker method:

    final void runWorker(Worker w) {
        Thread wt = Thread.currentThread();
        Runnable task = w.firstTask;
        w.firstTask = null;
        w.unlock(); // allow interrupts
        boolean completedAbruptly = true;
        try {
            while (task != null || (task = getTask()) != null) {
                w.lock();
                if ((runStateAtLeast(ctl.get(), STOP) ||
                     (Thread.interrupted() &&
                      runStateAtLeast(ctl.get(), STOP))) &&
                    !wt.isInterrupted())
                    wt.interrupt();
                try {
                    beforeExecute(wt, task);
                    Throwable thrown = null;
                    try {
                        task.run(); // run the scheduled task
                    } catch (RuntimeException x) {
                        thrown = x; throw x;
                    } catch (Error x) {
                        thrown = x; throw x;
                    } catch (Throwable x) {
                        thrown = x; throw new Error(x);
                    } finally {
                        afterExecute(task, thrown);
                    }
                } finally {
                    task = null;
                    w.completedTasks++;
                    w.unlock();
                }
            }
            completedAbruptly = false;
        } finally {
            processWorkerExit(w, completedAbruptly);
        }
    }

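The core of runWorker is the call to task.run(). Here task is declared as Runnable, its concrete class is ScheduledFutureTask, and ScheduledFutureTask extends FutureTask. When FutureTask's run method hits an exception, the exception is not thrown explicitly; it is stored in the future and surfaces only when we call FutureTask's get method. So if the scheduled task does no exception handling of its own, the exception is silently swallowed.

Notably, FutureTask makes heavy use of sun.misc.Unsafe and LockSupport, and LockSupport's park method is exactly where our troubleshooting found the scheduler threads stuck. Beyond that, a careful read of the Javadoc for ScheduledExecutorService's scheduleAtFixedRate, shown below, is revealing: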
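To see the swallowed exception for yourself, keep the ScheduledFuture returned by scheduleAtFixedRate and call get() on it. The sketch below uses a hypothetical class SwallowedExceptionDemo and a deliberately failing task; nothing is printed or thrown on the worker thread, and the failure only appears as the cause of an ExecutionException.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class SwallowedExceptionDemo {

    /** Schedules a task that always throws and returns the exception recovered via get(). */
    static Throwable causeOfFailure() throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> future = pool.scheduleAtFixedRate(
                    () -> { throw new IllegalStateException("boom"); },
                    0, 100, TimeUnit.MILLISECONDS);
            try {
                future.get(); // blocks until the periodic task terminates
                return null;  // not reached for a task that fails
            } catch (ExecutionException e) {
                return e.getCause(); // the exception the task threw
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Recovered from the future: " + causeOfFailure());
    }
}
```

In a real service, blocking on get() in a monitoring thread is one way to notice that a periodic task has died; whether that fits depends on the application.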

/**
     * Creates and executes a periodic action that becomes enabled first
     * after the given initial delay, and subsequently with the given
     * period; that is executions will commence after
     * {@code initialDelay} then {@code initialDelay+period}, then
     * {@code initialDelay + 2 * period}, and so on.
     * If any execution of the task
     * encounters an exception, subsequent executions are suppressed.
     * Otherwise, the task will only terminate via cancellation or
     * termination of the executor.  If any execution of this task
     * takes longer than its period, then subsequent executions
     * may start late, but will not concurrently execute.
     *
     * @param command the task to execute
     * @param initialDelay the time to delay first execution
     * @param period the period between successive executions
     * @param unit the time unit of the initialDelay and period parameters
     * @return a ScheduledFuture representing pending completion of
     *         the task, and whose {@code get()} method will throw an
     *         exception upon cancellation
     * @throws RejectedExecutionException if the task cannot be
     *         scheduled for execution
     * @throws NullPointerException if command is null
     * @throws IllegalArgumentException if period less than or equal to zero
     */
    public ScheduledFuture<?> scheduleAtFixedRate(Runnable command,
                                                  long initialDelay,
                                                  long period,
                                                  TimeUnit unit);

There we find this sentence:

If any execution of the task encounters an exception, subsequent executions are suppressed.

Put plainly: if any run of the scheduled task throws an (uncaught) exception, none of its subsequent runs will execute.
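This behavior is easy to reproduce in isolation. The following is a minimal sketch (SuppressedRunsDemo and the failOnRun parameter are hypothetical names for illustration): a periodic task counts its runs and throws on a chosen run, after which the counter stops advancing even though the pool is still alive.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SuppressedRunsDemo {

    /** Returns how many times the task ran when it throws on run number failOnRun. */
    static int runsBeforeSuppression(int failOnRun) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == failOnRun) {
                    throw new IllegalStateException("simulated failure on run " + failOnRun);
                }
            }, 0, 20, TimeUnit.MILLISECONDS);

            // Enough time for roughly 25 runs if nothing were suppressed.
            Thread.sleep(500);
            return runs.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Runs before the schedule went silent: " + runsBeforeSuppression(3));
    }
}
```

The counter ends at the failing run's number: the run that threw is the last one ever executed.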

Solution

We now know what caused the problem. Let's revise the sample code from the beginning of the article:

@Slf4j
@Component
public class TaskProcessSchedule {

    // Number of core threads
    private static final int THREAD_COUNT = 10;

    // Number of rows fetched per query (page step)
    private static final int ROWS_STEP = 30;

    @Resource
    private TaskDao taskDao;

    @Resource
    private TaskService taskService;

    private static ScheduledExecutorService scheduledExecutorService = Executors.newScheduledThreadPool(THREAD_COUNT);

    public TaskProcessSchedule() {
        for (int i = 0; i < THREAD_COUNT; i++) {
            scheduledExecutorService.scheduleAtFixedRate(
                    new TaskWorker(i * ROWS_STEP, ROWS_STEP),
                    10,
                    2,
                    TimeUnit.SECONDS
            );
        }
        log.info("TaskProcessSchedule scheduleAtFixedRate start success.");
    }
 
    class TaskWorker implements Runnable {
        private int offset;
        private int rows;

        TaskWorker(int offset, int rows) {
            this.offset = offset;
            this.rows = rows;
        }

        @Override
        public void run() {
            try { // added exception handling; it also covers the DAO query, which can throw too
                List<Task> taskList = taskDao.selectProcessingTaskByLimitRange(offset, rows);
                if (CollectionUtils.isEmpty(taskList)) {
                    return;
                }
                log.info("TaskWorker: current schedule thread name is {}, taskList is {}", Thread.currentThread().getName(), JsonUtil.toJson(taskList));
                taskService.processTask(taskList);
            } catch (Throwable e) {
                // Pass the exception as the last argument so the logger records the stack trace
                log.error("TaskWorker encountered an error", e);
            }
        }
    }
}

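As the code above shows, the task logic is now wrapped in a try-catch. When a run throws, only that run is abandoned (and logged); later runs are unaffected. The broader lesson is that we must pay close attention to exception handling when writing code.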
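If several periodic tasks need the same protection, the try-catch can be factored into a small wrapper instead of being repeated in every run method. This is one possible sketch, not the article's original code: SafeRunnable and its onError callback are hypothetical names. It catches Exception rather than Throwable, a deliberate choice here so that Errors such as OutOfMemoryError still propagate.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class SafeRunnable implements Runnable {
    private final Runnable delegate;
    private final Consumer<Exception> onError;

    SafeRunnable(Runnable delegate, Consumer<Exception> onError) {
        this.delegate = delegate;
        this.onError = onError;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Exception e) {
            // Report the failure and return normally so the executor reschedules the task.
            onError.accept(e);
        }
    }

    /** Returns {runs, failures} for a wrapped task that throws on every odd-numbered run. */
    static int[] runsAndFailures(long observeMillis) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        Runnable flaky = () -> {
            if (runs.incrementAndGet() % 2 == 1) {
                throw new IllegalStateException("simulated failure");
            }
        };
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.scheduleAtFixedRate(
                    new SafeRunnable(flaky, e -> failures.incrementAndGet()),
                    0, 20, TimeUnit.MILLISECONDS);
            Thread.sleep(observeMillis);
            return new int[] {runs.get(), failures.get()};
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = runsAndFailures(400);
        System.out.println("runs=" + result[0] + ", failures=" + result[1]);
    }
}
```

Unlike an unwrapped task, the wrapped one keeps running after its first failure, so both counters continue to grow for as long as the pool is observed.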
