[Android][framework] Watchdog Study Notes

Code Structure

Watchdog.java is a singleton within the system_server process. Callers obtain the instance through getInstance() and then register checks and callbacks on it:

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }

Main member variables

    /* This handler will be used to post message back onto the main thread */
    // Collection of all HandlerCheckers
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    // The HandlerChecker whose thread name is "foreground thread"; the same instance is also stored in mHandlerCheckers
    final HandlerChecker mMonitorChecker;
    // No usage found so far
    ContentResolver mResolver;
    // Used to call back into AMS's dropbox facility
    ActivityManagerService mActivity;

    // PID of the com.android.phone process
    int mPhonePid;
    // ActivityController
    IActivityController mController;
    // Whether restarting the system (system_server) is allowed
    boolean mAllowRestart = true;
    // File descriptor count monitor (active on eng and userdebug builds; ignored here for now)
    final OpenFdMonitor mOpenFdMonitor;

Watchdog extends Thread, so its run() method contains the core logic:


    @Override
    public void run() {
        // Initialize the flag
        boolean waitedHalf = false;
        // Loop forever, checking for a timeout on every pass
        while (true) {
            final List<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                // Defaults to half of DEFAULT_TIMEOUT (60s / 2 = 30s)
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // Core logic 1: post the Checker's own run() through the Handler it holds;
                    // run() then walks every Monitor registered on that Checker
                    hc.scheduleCheckLocked();
                }

                ...
                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                // Record the starting timestamp
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // Wait 30s the first time; later waits use the recomputed remainder
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
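                    // Core logic 2: if wait() returned early (for example on an interrupt), timeout is still > 0
                    // here and the loop waits out the rest of the 30s; otherwise the result is <= 0 and the
                    // while loop exits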
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                boolean fdLimitTriggered = false;
                // Active on eng and userdebug builds only; checks whether the number of files the process has open
                // exceeds the "Max open files" value in /proc/<$pid>/limits
                if (mOpenFdMonitor != null) {
                    fdLimitTriggered = mOpenFdMonitor.monitor();
                }

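                // If the process's open file descriptor count is over the limit, the else branch runs instead
                if (!fdLimitTriggered) {
                    // Core logic 3: walk all Checkers and take the worst state (the largest state value)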
                    final int waitState = evaluateCheckerCompletionLocked();
                    if (waitState == COMPLETED) {
                        // The monitors have returned; reset
                        waitedHalf = false;
                        continue;
                    } else if (waitState == WAITING) {
                        // WAITING means a check is still blocked, but for less than 30s so far
                        // still waiting but within their configured intervals; back off and recheck
                        continue;
                    } else if (waitState == WAITED_HALF) {
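                        // WAITED_HALF means the block has lasted more than 30s. The first time this is seen,
                        // waitedHalf is set to true and AMS is asked to dump stacks; monitoring then continues
                        // for the remaining 30s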
                        if (!waitedHalf) {
                            // We've waited half the deadlock-detection interval.  Pull a stack
                            // trace and wait another half.
                            ArrayList<Integer> pids = new ArrayList<Integer>();
                            pids.add(Process.myPid());
                            ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                getInterestingNativePids());
                            waitedHalf = true;
                        }
                        continue;
                    }

                    // None of the states above applies, so something has been blocked for more than 60s:
                    // something is overdue!
                    blockedCheckers = getBlockedCheckersLocked();
                    subject = describeCheckersLocked(blockedCheckers);
                } else {
                    blockedCheckers = Collections.emptyList();
                    subject = "Open FD high water mark reached";
                }
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
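
The inner while loop in run() (core logic 2) can be isolated as a small helper. The following is a minimal sketch in plain Java, not the framework code: it uses System.nanoTime() in place of SystemClock.uptimeMillis(), and the names WaitFullInterval and waitFor are hypothetical, chosen only for this illustration.

```java
final class WaitFullInterval {
    private WaitFullInterval() {}

    /**
     * Blocks on the given lock until at least intervalMillis have elapsed.
     * If wait() returns early (notify, interrupt, spurious wakeup), the
     * remaining time is recomputed and waited out. Returns the elapsed
     * time in milliseconds.
     */
    static long waitFor(Object lock, long intervalMillis) {
        final long start = System.nanoTime();
        long timeout = intervalMillis;
        synchronized (lock) {
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // The loop above logs the interrupt and keeps waiting;
                    // this sketch simply keeps waiting.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                timeout = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final Object lock = new Object();
        // Wake the waiter early to show that it goes back to waiting.
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        });
        notifier.setDaemon(true);
        notifier.start();
        System.out.println("waited " + waitFor(lock, 200) + " ms");
    }
}
```

Because the remainder is recomputed from a fixed start timestamp, an early wakeup shortens the next wait instead of restarting the full interval.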

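The core logic splits the 60s threshold into two halves. The watchdog thread waits 30s (with wait(timeout)), then inspects the state of every Checker to decide whether something is deadlocked. If a Checker is still blocked after the first 30s (WAITED_HALF), Watchdog calls AMS's dumpStackTraces() once and waits another 30s. If the state has reached OVERDUE by then, it is treated as an SWT trigger: Watchdog runs a further series of checks, saves assorted state information, and decides whether to restart the system (by killing its own process).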
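
The per-interval decision can be modelled without any Android classes. This is a rough sketch under stated assumptions: the class TwoPhaseDecision, its method names, and the action labels are hypothetical, and the numeric state values are assumed only to be ordered from best to worst so that the worst state is the maximum.

```java
final class TwoPhaseDecision {
    // Ordered from best to worst, so the worst state is simply the max.
    // (The concrete numbers are an assumption of this sketch.)
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    private boolean waitedHalf = false;

    /** Worst state across all checkers; COMPLETED when there are none. */
    static int worstState(int[] checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Decides what one pass of the loop does and returns a label for it. */
    String onInterval(int[] checkerStates) {
        int state = worstState(checkerStates);
        if (state == COMPLETED) {
            waitedHalf = false;
            return "reset";
        } else if (state == WAITING) {
            return "keep waiting";
        } else if (state == WAITED_HALF) {
            if (!waitedHalf) {
                waitedHalf = true;
                return "dump stacks";
            }
            return "keep waiting";
        }
        // OVERDUE: fall through to reporting, then start over.
        waitedHalf = false;
        return "report and maybe kill";
    }

    public static void main(String[] args) {
        TwoPhaseDecision loop = new TwoPhaseDecision();
        int[][] passes = {
            {COMPLETED, WAITING},
            {COMPLETED, WAITED_HALF},
            {COMPLETED, OVERDUE},
        };
        for (int[] pass : passes) {
            System.out.println(loop.onInterval(pass));
        }
    }
}
```

Only an OVERDUE result reaches the reporting path; a WAITED_HALF result triggers the stack dump once and otherwise just lets the loop run again.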

Execution Flow

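As noted above, Watchdog is a singleton and extends Thread, so there must be code somewhere that creates and starts it.
Tracing with OpenGrok quickly shows that it is constructed and started in the startOtherServices() method of SystemServer.java: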

        private void startOtherServices() {
            ...
            traceBeginAndSlog("InitWatchdog");
            final Watchdog watchdog = Watchdog.getInstance();
            watchdog.init(context, mActivityManagerService);
            traceEnd();
            ...
        }

With that, a thread named watchdog is up and running. When it is constructed, it also creates checkers (Checker) for several other threads:

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        // Foreground thread; also the default Checker, on which most services' Monitors are registered
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        // Main thread Checker
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        // UI thread Checker
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        // I/O thread Checker
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        // Display thread Checker
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        // Register the Binder thread Monitor on the foreground thread's Checker
        addMonitor(new BinderThreadMonitor());

        // Initialize the file descriptor count monitor
        mOpenFdMonitor = OpenFdMonitor.create();

        // See the notes on DEFAULT_TIMEOUT.
        assert DB ||
                DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
    }

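By this point things probably look tangled, with Checkers on one hand and Monitors on the other, so here is a summary:

Watchdog is a thread object whose thread name is watchdog;
Constructing the watchdog thread also creates several Checkers;
The Checker built from the FgThread Handler is the default Checker that Watchdog exposes for external code to register Monitors on;
Besides registering a Monitor, an external service can also pass the Handler of its own thread to addThread to register a new Checker;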
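
The relationship summarized above can be modelled in plain Java. This is only an illustrative sketch, not the framework code: a single-thread ExecutorService stands in for a Handler and its Looper, and MiniWatchdog, newDaemonThread and checkAll are hypothetical names that do not exist in Android.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MiniWatchdog {
    /** Same shape as Watchdog.Monitor: a single no-argument callback. */
    interface Monitor {
        void monitor();
    }

    /** One Checker per watched thread; the executor stands in for a Handler. */
    static final class Checker implements Runnable {
        final String name;
        final ExecutorService thread;
        final List<Monitor> monitors = new CopyOnWriteArrayList<>();
        volatile boolean completed = true;

        Checker(String name, ExecutorService thread) {
            this.name = name;
            this.thread = thread;
        }

        void scheduleCheck() {
            if (!completed) {
                return; // a check is already in flight
            }
            completed = false;
            thread.execute(this);
        }

        @Override
        public void run() {
            for (Monitor m : monitors) {
                m.monitor();
            }
            completed = true;
        }
    }

    final List<Checker> checkers = new CopyOnWriteArrayList<>();
    final Checker monitorChecker;

    MiniWatchdog() {
        monitorChecker = new Checker("foreground thread", newDaemonThread("fg"));
        checkers.add(monitorChecker);
    }

    /** Single daemon worker, so a stuck monitor cannot keep the JVM alive. */
    static ExecutorService newDaemonThread(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    /** Registers a Monitor on the default (foreground) Checker. */
    void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    /** Registers a new Checker for another thread. */
    void addThread(String name, ExecutorService thread) {
        checkers.add(new Checker(name, thread));
    }

    /** Schedules every Checker, then polls until all finish or time runs out. */
    boolean checkAll(long timeoutMillis) {
        for (Checker c : checkers) {
            c.scheduleCheck();
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            boolean allDone = true;
            for (Checker c : checkers) {
                allDone &= c.completed;
            }
            if (allDone) return true;
            if (System.nanoTime() >= deadline) return false;
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        MiniWatchdog wd = new MiniWatchdog();
        Object serviceLock = new Object();
        wd.addMonitor(() -> { synchronized (serviceLock) { } });
        wd.addThread("worker", newDaemonThread("worker"));
        System.out.println("all checkers completed: " + wd.checkAll(1000));
    }
}
```

In this model, addMonitor attaches work to the one default Checker, while addThread adds a Checker that only verifies its thread can still run a posted task.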

Now let's look at the definition of the Monitor interface:

    public interface Monitor {
        void monitor();
    }

The Monitor interface is simple: it has a single monitor() method. Just keep that in mind for now.
Next is the definition of the HandlerChecker class:


    /**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        public void scheduleCheckLocked() {
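            // If no Monitor is registered and the Looper bound to this Handler is polling (idle rather than
            // blocked), set mCompleted to true, meaning this thread's state is currently COMPLETED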
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            // If mCompleted is false, a check is already in flight and this call to scheduleCheckLocked()
            // is a repeat, so do nothing
            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            // Reset the state flags for a new check
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            // Ask the Handler's thread to run this Checker's run() as soon as possible by posting it to the
            // front of the queue
            mHandler.postAtFrontOfQueue(this);
        }

        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

        /**
        * Core method
        */
        public int getCompletionStateLocked() {
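            // If every Monitor's monitor() has finished, return COMPLETED
            if (mCompleted) {
                return COMPLETED;
            // Otherwise the state depends on how long the check has been running
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                // Less than half the threshold (60s by default): return WAITING
                if (latency < mWaitMax/2) {
                    return WAITING;
                // At least half the threshold but still below it: return WAITED_HALF
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            // Abnormal case: timed out; essentially the same test as isOverdueLocked()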
            return OVERDUE;
        }
        
        ...

        /**
        * Core method
        */
        @Override
        public void run() {
            // Walk every Monitor registered on this Checker
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                // Call monitor() on each Monitor implementation in turn
                mCurrentMonitor.monitor();
            }

            // Once every monitor() has returned, restore the flag (this corresponds to COMPLETED)
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }
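
The state boundaries in getCompletionStateLocked() are easier to see as a pure function of the elapsed time. The sketch below assumes the same comparison order as the method above; the class CompletionState, the method classify and the numeric constant values are hypothetical and exist only for this illustration.

```java
final class CompletionState {
    // Hypothetical numeric values; only their ordering matters here.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    private CompletionState() {}

    /**
     * Classifies a checker from its completion flag, the time elapsed since
     * the check was scheduled, and its maximum allowed wait.
     */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the default 60s threshold
        long[] samples = {0, 29_999, 30_000, 59_999, 60_000};
        for (long latency : samples) {
            System.out.println(latency + " ms -> " + classify(false, latency, waitMax));
        }
    }
}
```

One boundary detail follows from the code above: at a latency of exactly mWaitMax this classification already yields OVERDUE, whereas isOverdueLocked() uses a strict greater-than comparison.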

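So the heart of Watchdog's decision is whether the monitor() methods of the Monitor implementations block. Let's look at what a typical monitor() method actually does.

Take PowerManagerService as an example: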

    @Override // Watchdog.Monitor implementation
    public void monitor() {
        // Grab and release lock for watchdog monitor to detect deadlocks.
        synchronized (mLock) {
        }
    }

Hmm... a single empty synchronized block. Does that upend your expectations? Now look at AMS, which is notorious for its complexity:

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

All right, so PowerManagerService is not being sloppy; this is simply how it is meant to be done.

How It Works

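Before walking through the design, let's first review what a deadlock is.
Here is Oracle's official description of a Java deadlock:

    "Deadlock describes a situation where two or more threads are blocked forever, waiting for each other."

    Source: https://docs.oracle.com/javase/tutorial/essential/concurrency/deadlock.html

In short, it is a permanent block in which at least two threads each need the other to release a lock before they can continue.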

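So what do the monitor() methods of PowerManagerService and AMS do? They try to acquire the lock.

Because monitor() does nothing inside the synchronized block, it only acquires the lock tentatively, to find out whether some other thread is holding it.

If no other thread holds it, monitor() returns immediately, and so does the Checker's run() method.

If another thread does hold it, monitor() waits until that thread's synchronized code finishes before it can acquire the lock, and the time spent waiting is charged to the Checker.

Of course, holding a lock for a long time is not the same as a deadlock. That is why Watchdog sets the threshold at 60s and evaluates it in two 30s stages: only when the lock still cannot be acquired after more than 60s is the situation judged a deadlock (after 30s, dumpStackTraces() is called).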
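
The lock-probe idea can be tried out in plain Java. The following is a minimal sketch, assuming a helper thread plays the role of the checked thread and a CountDownLatch with a timeout plays the role of the Watchdog's wait; LockProbe, probe and holdLock are hypothetical names, and the millisecond timeouts are scaled down from the real 30s/60s values.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbe {
    private LockProbe() {}

    /** Same shape as the monitor() methods above: take the lock, do nothing. */
    static void monitor(Object lock) {
        synchronized (lock) {
        }
    }

    /**
     * Runs monitor(lock) on a daemon thread and reports whether it finished
     * within timeoutMillis. A false result means the lock stayed held.
     */
    static boolean probe(Object lock, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            monitor(lock);
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Starts a daemon thread that holds lock for holdMillis; returns once it is held. */
    static void holdLock(Object lock, long holdMillis) {
        CountDownLatch held = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    Thread.sleep(holdMillis); // sleep does not release the lock
                } catch (InterruptedException ignored) {
                }
            }
        }, "holder");
        t.setDaemon(true);
        t.start();
        try {
            held.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Object lock = new Object();
        System.out.println("free lock, probe finished: " + probe(lock, 200));
        holdLock(lock, 1000);
        System.out.println("held lock, probe finished: " + probe(lock, 200));
    }
}
```

A probe that times out only shows that the lock was unavailable for that long, which is why a generous threshold is needed before treating it as a deadlock.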
