Linux 2.6.23 Timer Interrupt and Scheduling Analysis

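I: Preface
The clock is the pulse of the whole operating system. It provides the basis for time-slice scheduling of processes and for timed events. Many user-space operations also depend on the clock, for example select, poll and make. The operating system manages two kinds of time. One is the current time, the time we use in daily life; it is usually kept in CMOS, and a dedicated chip on the motherboard keeps it running. The other is relative time, for example the system uptime. For the computer, relative time is clearly more important than the current time.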

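II: Clock-related hardware
1): Real-time clock (RTC)
This clock is independent of the CPU and all other chips, and it keeps running even when the PC is powered off. Timekeeping is handled by a separate chip, and the clock value is stored in CMOS. The RTC can raise periodic interrupts on IRQ8 at frequencies between 2 Hz and 8192 Hz, but Linux uses the RTC only to obtain the current time.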

2): Time stamp counter (TSC)
The CPU includes a 64-bit time stamp register whose content is incremented by 1 on every clock signal.

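3): Programmable interval timer (PIT)
This device periodically raises a timer interrupt, and the interval between interrupts is programmable. In Linux the interrupt frequency is given by HZ, so the interval is 1/HZ seconds. This interval is also called a tick.
HZ is defined in ./include/asm-i386/param.h:
#ifndef HZ
#define HZ 100
#endif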
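To make the relationship between HZ, the tick length and the PIT divisor concrete, here is a small user-space sketch. It assumes the usual i386 PIT input clock of 1193182 Hz and the round-to-nearest divisor formula that the kernel's LATCH macro uses; the function names pit_latch and tick_usec are made up for this illustration.

```c
/* Input clock of the i8253/i8254 PIT in Hz (assumed: CLOCK_TICK_RATE on i386). */
#define PIT_CLOCK_HZ 1193182UL

/* Divisor to load into the PIT so that it fires about hz times per second,
 * rounded to nearest (hypothetical helper modeled on the LATCH macro). */
static unsigned long pit_latch(unsigned long hz)
{
    return (PIT_CLOCK_HZ + hz / 2) / hz;
}

/* Length of one tick in microseconds for a given HZ (hypothetical helper). */
static unsigned long tick_usec(unsigned long hz)
{
    return 1000000UL / hz;
}
```

With HZ = 100 the divisor is 11932 and one tick lasts 10000 microseconds. Since the PIT counter is 16 bits wide, the divisor cannot exceed 65536, which is why the PIT cannot go much below about 18 interrupts per second.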

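4): CPU local timer
The processor's local APIC provides another timing device. The CPU local timer can also raise interrupts either once or periodically. Compared with the PIT described above, it differs in the following ways:
The local APIC timer counter is 32 bits wide while the PIT counter is 16 bits wide, so the local APIC timer can deliver interrupts at lower frequencies
The local APIC sends its interrupts only to its own CPU, while an interrupt from the PIT can be handled by any CPU
The APIC timer is driven by the bus clock signal, while the PIT has its own internal clock oscillator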

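5): High precision event timer (HPET)
Linux 2.6 added support for the HPET, a newer timer chip developed by Intel. The device has a set of counters; each is driven by its own clock signal and is incremented by 1 when that signal arrives.
In fact, Intel multiprocessor systems differ from uniprocessor systems:
On a uniprocessor system, all timing activity is triggered by the timer interrupt from the PIT
On a multiprocessor system, general activities are triggered by the PIT interrupt, while CPU-specific activities are triggered by the local APIC timer.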

6) Kernel 2.6 introduced hrtimer and strengthened the clock event model. In the current kernel, three interrupt sources register themselves as a clock_event_device structure:

struct clock_event_device {
    const char        *name;
    unsigned int        features;
    unsigned long        max_delta_ns;
    unsigned long        min_delta_ns;
    unsigned long        mult;
    int            shift;
    int            rating;
    int            irq;
    cpumask_t        cpumask;
    int            (*set_next_event)(unsigned long evt,
                         struct clock_event_device *);
    void            (*set_mode)(enum clock_event_mode mode,
                     struct clock_event_device *);
    void            (*event_handler)(struct clock_event_device *);
    void            (*broadcast)(cpumask_t mask);
    struct list_head    list;
    enum clock_event_mode    mode;
    ktime_t            next_event;
};

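The most important members are set_next_event() and event_handler(). The former programs the trigger condition of the next clock event, usually by reloading the timer in the clock device. The latter is the event handler, which is called from the timer interrupt ISR. If this clock serves as the tick device, the handler does basically what the timer interrupt ISR of earlier kernels did, similar to timer_tick(). The event handler can be replaced at run time, which gives the kernel a chance to change the whole way timer interrupts are processed, and thus lets the high-resolution tick and the dynamic tick be attached dynamically.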
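The idea of a handler that can be swapped at run time can be sketched in plain user-space C. Everything here is hypothetical: struct clock_dev is a cut-down stand-in for clock_event_device, and the two handlers only count calls. It is meant to show the mechanism, not the kernel's implementation.

```c
/* Hypothetical, cut-down stand-in for struct clock_event_device. */
struct clock_dev {
    const char *name;
    void (*event_handler)(struct clock_dev *dev);
    unsigned long periodic_ticks;   /* bumped by the periodic handler */
    unsigned long oneshot_events;   /* bumped by the one-shot handler */
};

static void handle_periodic(struct clock_dev *dev) { dev->periodic_ticks++; }
static void handle_oneshot(struct clock_dev *dev)  { dev->oneshot_events++; }

/* What the interrupt path does: call whichever handler is installed. */
static void timer_irq(struct clock_dev *dev)
{
    dev->event_handler(dev);
}

/* Replace the handler at run time; the interrupt path itself is unchanged. */
static void switch_handler(struct clock_dev *dev,
                           void (*handler)(struct clock_dev *))
{
    dev->event_handler = handler;
}
```

Because timer_irq() only goes through the function pointer, installing a different handler changes what a timer interrupt means without touching the interrupt entry code.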


        1) LAPIC Timer (one lapic_events per CPU)
           arch/x86/kernel/apic_32.c,         setup_APIC_timer():

                clockevents_register_device(levt);

           lapic_clockevent is the default value for each lapic_events; every CPU uses it in
           setup_APIC_timer() to initialize its own lapic_events.

        2) HPET
           arch/x86/kernel/hpet.c,         hpet_legacy_clockevent_register():

                   clockevents_register_device(&hpet_clockevent);

        3) PIT
           arch/x86/kernel/i8253.c,         setup_pit_timer():

                clockevents_register_device(&pit_clockevent);

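    In addition, the system has a single global_clock_event pointer. From the code, it is assigned either
    &pit_clockevent or &hpet_clockevent. My understanding is that the latter is used whenever
    CONFIG_HPET_TIMER=y and the HPET can be enabled. Moreover, the LAPIC timer can only be a local
    interrupt source, so it can never be assigned to global_clock_event.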

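7) There are two main data structures related to hardware timers:

struct clocksource: an abstraction of the hardware device that describes the clock source
struct clock_event_device: the event side of a clock, including what to do when the hardware timer interrupt occurs (it actually stores pointers to the functions). This article calls this structure a "clock event device".

Both structures are documented in some detail in the kernel source, in clocksource.h and clockchips.h respectively. Pay particular attention to the event_handler member of clock_event_device: it specifies what the kernel does when the hardware timer interrupt occurs, that is, it is the real timer interrupt handler.

The Linux kernel maintains two linked lists that store all clock sources and all clock event devices in the system. Their list heads in the kernel are clocksource_list and clockevent_devices. Figure 2-1 shows these two lists.

Figure 2-1 Clock source list and clock event list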

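III: Timer interrupt code analysis
time_init() is the clock initialization function. It is called from asmlinkage void __init start_kernel(). The code is as follows:
//timer interrupt initialization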

/*
* This is called directly from init code; we must delay timer setup in the
* HPET case as we can't make the decision to turn on HPET this early in the
* boot process.
*
* The chosen time_init function will usually be hpet_time_init, above, but
* in the case of virtual hardware, an alternative function may be substituted.
*/
void __init time_init(void)
{
        tsc_init();
        late_time_init = choose_time_init();
}

    The code path by which the kernel decides whether the source of interrupt 0 is the HPET or the PIT is:

    start_kernel() > time_init() > choose_time_init() == hpet_time_init:

                void __init hpet_time_init(void)
                {
                        if (!hpet_enable())
                                setup_pit_timer();
                        //register the timer interrupt handler
                        time_init_hook();
                }

     Note that the PIT is used if enabling the HPET fails. After this choice is made, interrupt 0 is set up.

     -> HPET case:
             hpet_enable() > hpet_legacy_clockevent_register() :

                clockevents_register_device(&hpet_clockevent);
                global_clock_event = &hpet_clockevent;

     -> PIT case:
             setup_pit_timer() :

                clockevents_register_device(&pit_clockevent);
                global_clock_event = &pit_clockevent;

     -> After the interrupt source has been set:
             time_init_hook():

                /**
                 * static struct irqaction irq0 = {
                 *         .handler = timer_interrupt,
                 *         .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL,
                 *         .mask = CPU_MASK_NONE,
                 *         .name = "timer"
                 * };
                 */

                irq0.mask = cpumask_of_cpu(0);
                setup_irq(0, &irq0);

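        As can be seen, whether the interrupt source is the HPET or the PIT, interrupt 0 is handled only by CPU0, and its handler is timer_interrupt.

Note: the system time has already been initialized in start_kernel() > timekeeping_init(), whereas kernels before 2.6 initialized it in time_init().

Continuing from the previous chapter, on to time_init_hook():
void __init time_init_hook(void)
{
//register the interrupt handler
setup_irq(0, &irq0);
}
irq0 is defined as follows (this is the positional-initializer form found in older kernels; it corresponds to the structure shown in the comment above):
static struct irqaction irq0 = { timer_interrupt, SA_INTERRUPT, CPU_MASK_NONE, "timer", NULL, NULL};

The corresponding interrupt handler is timer_interrupt():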

/*
* This is the same as the above, except we _also_ save the current
* Time Stamp Counter value at the time of the timer interrupt, so that
* we later on can estimate the time of day more exactly.
*/
irqreturn_t timer_interrupt(int irq, void *dev_id)
{
#ifdef CONFIG_X86_IO_APIC
        if (timer_ack) {
                /*
                 * Subtle, when I/O APICs are used we have to ack timer IRQ
                 * manually to reset the IRR bit for do_slow_gettimeoffset().
                 * This will also deassert NMI lines for the watchdog if run
                 * on an 82489DX-based system.
                 */
                spin_lock(&i8259A_lock);
                outb(0x0c, PIC_MASTER_OCW3);
                /* Ack the IRQ; AEOI will end it automatically. */
                inb(PIC_MASTER_POLL);
                spin_unlock(&i8259A_lock);
        }
#endif

        do_timer_interrupt_hook();

        if (MCA_bus) {
                /* The PS/2 uses level-triggered interrupts. You can't
                turn them off, nor would you want to (any attempt to
                enable edge-triggered interrupts usually gets intercepted by a
                special hardware circuit). Hence we have to acknowledge
                the timer interrupt. Through some incredibly stupid
                design idea, the reset for IRQ 0 is done by setting the
                high bit of the PPI port B (0x61). Note that some PS/2s,
                notably the 55SX, work fine if this is removed. */

                u8 irq_v = inb_p( 0x61 );       /* read the current state */
                outb_p( irq_v|0x80, 0x61 );     /* reset the IRQ */
        }

        return IRQ_HANDLED;
}

The core handler is do_timer_interrupt_hook():
static inline void do_timer_interrupt_hook(void)
{
        global_clock_event->event_handler(global_clock_event);
}
global_clock_event was covered above. Its handler is installed through the following call:
static inline void tick_set_periodic_handler(struct clock_event_device *dev,
                                             int broadcast)
{
        dev->event_handler = tick_handle_periodic;
}

How dev->event_handler gets initialized:

   clockevents_register_device() >
                    clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev) >
                       raw_notifier_call_chain(&clockevents_chain, reason, dev) >
                                   __raw_notifier_call_chain(nh, val, v, -1, NULL) >
                                      notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls) :

                                 ...
                              ret = nb->notifier_call(nb, val, v);
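/* Note that argument v points to the clock_event_device that was passed to
* clockevents_register_device(), for example the one global_clock_event points to;
* argument nb is a notifier_block registered on clockevents_chain, here &tick_notifier
*/


The notifier_call function pointer itself is set up in start_kernel() > tick_init() :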

      clockevents_register_notifier(&tick_notifier);

tick_notifier is defined as:

      static struct notifier_block tick_notifier = {
        .notifier_call = tick_notify,
      };

And it does not end there:
                   tick_notify>
                           tick_check_new_device()>
                                       tick_setup_device()>
                                                   tick_setup_periodic()>
                                                           tick_set_periodic_handler()

At this point dev->event_handler has finally been initialized.
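The registration path above boils down to "registering a device walks a notifier chain, and one of the callbacks installs the handler". A minimal user-space sketch of that pattern follows. All names (struct notifier, struct device, EVT_ADD, tick_cb and so on) are invented for the illustration and only loosely mirror clockevents_chain, tick_notify() and clockevents_register_device().

```c
#include <stddef.h>

/* Hypothetical clock event device: only the handler matters here. */
struct device {
    void (*event_handler)(struct device *dev);
};

/* Hypothetical notifier block, linked into a singly linked chain. */
struct notifier {
    int (*call)(struct notifier *nb, unsigned long event, void *data);
    struct notifier *next;
};

#define EVT_ADD 1UL

static struct notifier *chain;      /* plays the role of clockevents_chain */

static void periodic_handler(struct device *dev) { (void)dev; }

/* Plays the role of tick_notify(): react to a newly added device. */
static int tick_cb(struct notifier *nb, unsigned long event, void *data)
{
    struct device *dev = data;

    (void)nb;
    if (event == EVT_ADD)
        dev->event_handler = periodic_handler;
    return 0;
}

/* Plays the role of clockevents_register_notifier(). */
static void register_notifier(struct notifier *nb)
{
    nb->next = chain;
    chain = nb;
}

/* Plays the role of clockevents_register_device(): notify every listener. */
static void register_device(struct device *dev)
{
    struct notifier *nb;

    for (nb = chain; nb != NULL; nb = nb->next)
        nb->call(nb, EVT_ADD, dev);
}
```

The device code never names the tick layer directly; it only announces "a device was added", which is why the notifier has to be registered (in tick_init()) before any device is.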

On to tick_handle_periodic():

void tick_handle_periodic(struct clock_event_device *dev)
{
        int cpu = smp_processor_id();
        ktime_t next;

        tick_periodic(cpu);

        if (dev->mode != CLOCK_EVT_MODE_ONESHOT)
                return;
        /*
         * Setup the next period for devices, which do not have
         * periodic mode:
         */
        next = ktime_add(dev->next_event, tick_period);
        for (;;) {
                if (!clockevents_program_event(dev, next, ktime_get()))
                        return;
                tick_periodic(cpu);
                next = ktime_add(next, tick_period);
        }
}
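tick_periodic() performs the series of updates that the timer interrupt has always done, including updating the time accounting of the current process.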
static void tick_periodic(int cpu)
{
        if (tick_do_timer_cpu == cpu) {
                write_seqlock(&xtime_lock);

                /* Keep track of the next tick event */
                tick_next_period = ktime_add(tick_next_period, tick_period);

                do_timer(1);
                write_sequnlock(&xtime_lock);
        }
    //update the clock-related information of the currently running process
        update_process_times(user_mode(get_irq_regs()));
        profile_tick(CPU_PROFILING);
}

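Ignoring the conditionally compiled parts, on to do_timer():
void do_timer(unsigned long ticks)
{
    // update the jiffies count; at link time jiffies_64 and jiffies are made to refer to the same storage
        jiffies_64 += ticks;
        // update the current time, i.e. xtime
        update_times(ticks);
}

update_process_times() is as follows: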
void update_process_times(int user_tick)
{
        struct task_struct *p = current;
        int cpu = smp_processor_id();

    //determine whether the tick arrived in user mode or in kernel mode, then charge one tick to the matching counter
        /* Note: this timer irq context must be accounted for as well. */
        if (user_tick)
                account_user_time(p, jiffies_to_cputime(1));
        else
                account_system_time(p, HARDIRQ_OFFSET, jiffies_to_cputime(1));
        //raise the timer softirq
        run_local_timers();
        if (rcu_pending(cpu))
                rcu_check_callbacks(cpu, user_tick);
        //per-tick scheduler work (time-slice accounting)
        scheduler_tick();
        run_posix_cpu_timers(p);
}

run_local_timers():
void run_local_timers(void)
{
     raise_softirq(TIMER_SOFTIRQ);
}
The handler of this softirq, run_timer_softirq(), calls __run_timers():

static inline void __run_timers(tvec_base_t *base)
{
        struct timer_list *timer;

        spin_lock_irq(&base->lock);
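        /* Enter the timer processing loop. It compares the global jiffies with the
         * base's timer_jiffies: if jiffies is not behind, some timers may be due,
         * otherwise no timer has expired. Because a CPU may run with interrupts
         * disabled and lose timer interrupts, jiffies can be ahead of
         * base->timer_jiffies by more than one tick. */
        while (time_after_eq(jiffies, base->timer_jiffies)) {
            //declare the list head that will receive the expired timers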
                struct list_head work_list;
                struct list_head *head = &work_list;
                int index = base->timer_jiffies & TVR_MASK;

                /*
                 * Cascade timers:
                 */

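                 //index == 0 means tv1 has completed one full cycle,
                 //so refill tv1 from tv2. Each higher level (tv3, tv4, tv5) is cascaded only when the slot index of the level below it is also 0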
                if (!index &&
                        (!cascade(base, &base->tv2, INDEX(0))) &&
                                (!cascade(base, &base->tv3, INDEX(1))) &&
                                        !cascade(base, &base->tv4, INDEX(2)))
                        cascade(base, &base->tv5, INDEX(3));

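                //advance base->timer_jiffies
                ++base->timer_jiffies;

                //move the list at base->tv1.vec[index] onto work_list and leave that slot empty
                list_replace_init(base->tv1.vec + index, &work_list);

                /* If the list for this slot is not empty, every timer on it has expired:
                 * call each timer's handler and remove it from the list until the list is empty. */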
                while (!list_empty(head)) {
                        void (*fn)(unsigned long);
                        unsigned long data;
            //take each entry in turn, detach it from the list and run its function

                        timer = list_first_entry(head, struct timer_list,entry);
                        fn = timer->function;
                        data = timer->data;

                        timer_stats_account_timer(timer);

                        set_running_timer(base, timer);
                        detach_timer(timer, 1);
                        spin_unlock_irq(&base->lock);
                        {
                                int preempt_count = preempt_count();
                                fn(data);
                                if (preempt_count != preempt_count()) {
                                        printk(KERN_WARNING "huh, entered %p "
                                               "with preempt_count %08x, exited"
                                               " with %08x?\n",
                                               fn, preempt_count,
                                               preempt_count());
                                        BUG();
                                }
                        }
                        spin_lock_irq(&base->lock);
                }
        }
        set_running_timer(base, NULL);
        spin_unlock_irq(&base->lock);
}

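After the hardware timer has counted one jiffy it raises a hardware interrupt. The interrupt service routine raises a softirq, and the timer softirq handler calls __run_timers() to process the multi-level hash table of timers and to run every timer whose time is up. The __run_timers algorithm is:

1. Loop while the current jiffies is not behind base->timer_jiffies; when that condition fails, leave the loop.

2. Use base->timer_jiffies to compute the slot index to process in tv1. When that index is 0, also index into the higher-level tables and redistribute the timers of the selected slot into the lower-level tables.

3. Increment base->timer_jiffies and take the timer list found at that tv1 index.

4. If the list is not empty, process its timers in order by calling each timer's handler, timer->function.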
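The loop condition in step 1 relies on a comparison that stays correct when jiffies wraps around. The sketch below shows the idea in user space, assuming the usual two's-complement conversion to long; after_eq mirrors what time_after_eq() does, and ticks_behind is a hypothetical helper that counts how many iterations the catch-up loop would run.

```c
/* Wrap-safe "a is at or after b", in the style of the kernel's time_after_eq(). */
static int after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Number of loop iterations needed until timer_jiffies has passed jiffies_now
 * (hypothetical helper; __run_timers() does real work in each iteration). */
static unsigned long ticks_behind(unsigned long jiffies_now,
                                  unsigned long timer_jiffies)
{
    unsigned long n = 0;

    while (after_eq(jiffies_now, timer_jiffies)) {
        timer_jiffies++;    /* unsigned arithmetic wraps cleanly */
        n++;
    }
    return n;
}
```

A plain `a >= b` would give the wrong answer just after the counter wraps; the subtraction-based form only assumes the two values are less than half the counter range apart.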

Next, an analysis of scheduler_tick():

void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        struct task_struct *curr = rq->curr;
        u64 next_tick = rq->tick_timestamp + TICK_NSEC;

        spin_lock(&rq->lock);
        __update_rq_clock(rq);
        /*
         * Let rq->clock advance by at least TICK_NSEC:
         */
        if (unlikely(rq->clock < next_tick))
                rq->clock = next_tick;
        rq->tick_timestamp = rq->clock;
        update_cpu_load(rq);
        if (curr != rq->idle) /* FIXME: needed? */
                curr->sched_class->task_tick(rq, curr);
        spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
        rq->idle_at_tick = idle_cpu(cpu);
        trigger_load_balance(rq, cpu);
#endif
}

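scheduler_tick() does the scheduler's per-tick work: it updates the run queue clock and the CPU load, and calls the task_tick() method of the current task's scheduling class, which accounts the running process's time and sets the need_resched flag when required. On SMP it also triggers balancing of the per-processor run queues.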
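The "advance by at least TICK_NSEC" step in scheduler_tick() is a simple clamp. A minimal sketch, with a hypothetical function name and plain integers in place of the run queue fields:

```c
#include <stdint.h>

/* Return the run queue clock after a tick: never earlier than
 * last_tick + tick_nsec, even if the measured clock lagged behind.
 * (Hypothetical helper; the kernel updates rq->clock in place.) */
static uint64_t advance_clock(uint64_t clock, uint64_t last_tick,
                              uint64_t tick_nsec)
{
    uint64_t next_tick = last_tick + tick_nsec;

    return clock < next_tick ? next_tick : clock;
}
```

This keeps the per-run-queue clock moving forward by at least one tick length per timer interrupt, whatever the underlying clock source reported.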

One function called from do_timer() was skipped above:
static inline void update_times(unsigned long ticks)
{
       //update xtime
        update_wall_time();
        //compute the load average from the number of TASK_RUNNING and TASK_UNINTERRUPTIBLE processes
        calc_load(ticks);
}

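Software timer handling

The principle of software timers is simple: each time a hardware timer interrupt arrives, the kernel updates jiffies and compares it with the expiry time of each software timer. If jiffies is equal to or greater than a timer's expiry time, the kernel runs the function that the timer specifies.

The following sections describe in detail how Linux 2.6.23 implements software timers.

struct timer_list: a software timer; records its expiry time and what to do once it expires.
struct tvec_base: the structure used to organize and manage software timers. On SMP systems there is one per CPU.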

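Main members of struct timer_list

Field	Type	Description
entry	struct list_head	the list this timer is on
expires	unsigned long	expiry time, in ticks
function	void (*)(unsigned long)	callback run when the timer expires
data	unsigned long	argument passed to the callback
base	struct tvec_base *	the struct tvec_base this software timer belongs to

Note: one tick is the interval between two consecutive hardware timer interrupts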

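Members of struct tvec_base

Field	Type	Description
lock	spinlock_t	used for synchronization
running_timer	struct timer_list *	the software timer currently being processed
timer_jiffies	unsigned long	expiry time of the software timers currently being processed
tv1	struct tvec_root	all software timers that expire from timer_jiffies to timer_jiffies + 2^8 - 1 (bounds included)
tv2	struct tvec	all software timers that expire from timer_jiffies + 2^8 to timer_jiffies + 2^14 - 1 (bounds included)
tv3	struct tvec	all software timers that expire from timer_jiffies + 2^14 to timer_jiffies + 2^20 - 1 (bounds included)
tv4	struct tvec	all software timers that expire from timer_jiffies + 2^20 to timer_jiffies + 2^26 - 1 (bounds included)
tv5	struct tvec	all software timers that expire from timer_jiffies + 2^26 to timer_jiffies + 2^32 - 1 (bounds included)

tv1 has type struct tvec_root; tv2 to tv5 have type struct tvec.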

Definitions of struct tvec_root and struct tvec

struct tvec {
	struct list_head vec[TVN_SIZE];
};

struct tvec_root {
	struct list_head vec[TVR_SIZE];
};

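As shown, they are simply arrays of struct list_head. When the kernel is not configured with CONFIG_BASE_SMALL, TVN_SIZE and TVR_SIZE are defined as 64 and 256 respectively.

The figure shows the relationship among the data structures above:

The figure shows clearly that software timers (struct timer_list, labeled timer in the figure) are kept as doubly linked lists (struct list_head) in tv1~tv5 according to their expiry times. tv1 holds all software timers that expire within the next 256 ticks relative to timer_jiffies; tv2 holds those that expire within the next 256*64 ticks; tv3 within the next 256*64*64 ticks; tv4 within the next 256*64*64*64 ticks; tv5 within the next 256*64*64*64*64 ticks. Concretely, from a static point of view, suppose timer_jiffies is 0. Then tv1[0] holds the software timers that are due now (expiry time equal to timer_jiffies, to be handled at once), tv1[1] holds all software timers that expire when the next tick arrives, and tv1[n] (0 <= n <= 255) holds all software timers that expire n ticks from now. In tv2 the slot is chosen by (expires >> 8) & 63, so tv2[1] holds the software timers that expire 256 to 511 ticks from now, tv2[2] holds those that expire 512 to 767 ticks from now, and tv2[n] (1 <= n <= 63) holds those that expire 256*n to 256*(n+1)-1 ticks from now; tv2[0] is empty at this instant because timers closer than 256 ticks live in tv1. tv3~tv5 follow the same pattern.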
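The ranges covered by tv1~tv5 follow directly from the two sizes 256 (8 bits) and 64 (6 bits). As a quick check, the following sketch computes the first and last relative expiry, in ticks, that each level covers. The helper names level_start and level_end are made up here, and levels are numbered 1 to 5 for tv1 to tv5.

```c
#define TVR_BITS 8   /* tv1 has 2^8 = 256 slots */
#define TVN_BITS 6   /* tv2..tv5 have 2^6 = 64 slots each */

/* First relative expiry (in ticks) that falls into the given level (1..5). */
static unsigned long long level_start(int level)
{
    if (level == 1)
        return 0;
    return 1ULL << (TVR_BITS + (level - 2) * TVN_BITS);
}

/* Last relative expiry (in ticks) covered by the given level (1..5). */
static unsigned long long level_end(int level)
{
    return (1ULL << (TVR_BITS + (level - 1) * TVN_BITS)) - 1;
}
```

For example, tv2 covers relative expiries 256 to 16383, and tv5 ends at 2^32 - 1, which matches the 0xffffffff cap in internal_add_timer().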

Note: this chapter is mainly excerpted from IBM material; the next chapter analyzes the software timer interrupt code

資料：http://www.ibm.com/developerworks/cn/linux/l-cn-clocks/index.html

TIMER_INITIALIZER():
1): TIMER_INITIALIZER() is used to declare a timer. It is defined as follows:
#define TIMER_INITIALIZER(_function, _expires, _data) {         \
                .function = (_function),                        \
                .expires = (_expires),                          \
                .data = (_data),                                \
                .base = &boot_tvec_bases,                       \
        }

Note: the data structures involved were explained in the previous chapter

2): mod_timer(): change the expiry time of a timer

int mod_timer(struct timer_list *timer, unsigned long expires)
{
   //BUG if the timer has no function set
        BUG_ON(!timer->function);

        timer_stats_timer_set_start_info(timer);
        /*
         * This is a common optimization triggered by the
         * networking code - if the timer is re-modified
         * to be the same thing then just return:
         */
         //if the requested expiry equals the timer's current expiry and the timer is already pending, return at once
        if (timer->expires == expires && timer_pending(timer))
                return 1;
   //call __mod_timer(), analyzed below
        return __mod_timer(timer, expires);
}

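As the code shows, if the requested expiry time equals the timer's current one and the timer is active, nothing is changed and 1 is returned. Otherwise the expiry time is changed through __mod_timer(), which returns 1 if the timer was active and 0 if it was inactive. mod_timer() is a more efficient way to update the expiry time of an active timer; if the timer is inactive, it is activated. Functionally, mod_timer() is equivalent to:

del_timer(timer); timer->expires = expires; add_timer(timer);

3): add_timer() queues the timer for the timer softirq and thereby activates it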
static inline void add_timer(struct timer_list *timer)
{
BUG_ON(timer_pending(timer));
__mod_timer(timer, timer->expires);
}

As shown, mod_timer and add_timer both end up calling __mod_timer(), so that function is analyzed next.

Now on to the code of __mod_timer():

int __mod_timer(struct timer_list *timer, unsigned long expires)
{
        tvec_base_t *base, *new_base;
        unsigned long flags;
        int ret = 0;

        timer_stats_timer_set_start_info(timer);
        BUG_ON(!timer->function);

        base = lock_timer_base(timer, &flags);

        if (timer_pending(timer)) {
                detach_timer(timer, 0);
                ret = 1;
        }
   //get the tvec_bases of the current CPU
        new_base = __get_cpu_var(tvec_bases);

        if (base != new_base) {
                if (likely(base->running_timer != timer)) {
                        /* See the comment in lock_timer_base() */
                        timer_set_base(timer, NULL);
                        spin_unlock(&base->lock);
                        base = new_base;
                        spin_lock(&base->lock);
                        timer_set_base(timer, base);
                }
        }

        timer->expires = expires;
        internal_add_timer(base, timer);
        spin_unlock_irqrestore(&base->lock, flags);

        return ret;
}

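Code walkthrough:

1. Take the lock of the base that the software timer is on (the spinlock in the struct tvec_base variable) and return that base, stored in the variable base
2. If the software timer is pending (queued in a base, waiting to run), call detach_timer() to remove it from the list it is currently on.
3. Get this CPU's base pointer (of type struct tvec_base *) and store it in new_base
4. If base and new_base differ, the software timer is migrating from one CPU to another. In that case, if the timer's handler is not currently running on the CPU it is leaving, first set the timer's base to NULL and then set it to new_base. Otherwise go to step 5.
5. Set the expiry time of the software timer.
6. Call internal_add_timer to add the software timer to its base (the base of this CPU)

4) The code of internal_add_timer() is as follows: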

static void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
{
        unsigned long expires = timer->expires;
        unsigned long idx = expires - base->timer_jiffies;
        struct list_head *vec;

        if (idx < TVR_SIZE) {
                int i = expires & TVR_MASK;
                vec = base->tv1.vec + i;
        } else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
                int i = (expires >> TVR_BITS) & TVN_MASK;
                vec = base->tv2.vec + i;
        } else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
                int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
                vec = base->tv3.vec + i;
        } else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
                int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
                vec = base->tv4.vec + i;
        } else if ((signed long) idx < 0) {
                /*
                 * Can happen if you add a timer with expires == jiffies,
                 * or you set a timer to go off in the past
                 */
                vec = base->tv1.vec + (base->timer_jiffies & TVR_MASK);
        } else {
                int i;
                /* If the timeout is larger than 0xffffffff on 64-bit
                 * architectures then we use the maximum timeout:
                 */
                if (idx > 0xffffffffUL) {
                        idx = 0xffffffffUL;
                        expires = idx + base->timer_jiffies;
                }
                i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
                vec = base->tv5.vec + i;
        }
        /*
         * Timers are FIFO:
         */
        list_add_tail(&timer->entry, vec);
}

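* Compute the difference between the software timer's expiry time and timer_jiffies (the expiry time of the software timers currently being processed) and store it in idx as an index.
* Determine which interval idx falls in:
          o [0, 2^8 - 1], or negative when read as a signed value (the software timer has already expired): it goes into tv1
          o [2^8, 2^14 - 1]: it goes into tv2
          o [2^14, 2^20 - 1]: it goes into tv3
          o [2^20, 2^26 - 1]: it goes into tv4
          o [2^26, 2^32): it goes into tv5, but the effective maximum is 0xffffffffUL
* Compute the exact position to insert at (which list, that is, which slot of tv1~tv5)
* Finally add it to that list

This function shows that the kernel files a software timer into its base according to the relative value of its expiry time (relative to timer_jiffies).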
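The slot selection can be tried out in user space. The sketch below mirrors the branch structure of internal_add_timer() shown above but returns the chosen level and slot instead of touching any list. struct slot and pick_slot are hypothetical names, and the 64-bit clamp to 0xffffffff is left out to keep it short.

```c
#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_SIZE (1UL << TVR_BITS)
#define TVR_MASK (TVR_SIZE - 1)
#define TVN_MASK ((1UL << TVN_BITS) - 1)

/* Hypothetical result type: level 1..5 stands for tv1..tv5. */
struct slot {
    int level;
    unsigned long index;
};

/* Choose the level and slot for a timer, following internal_add_timer(). */
static struct slot pick_slot(unsigned long expires, unsigned long timer_jiffies)
{
    unsigned long idx = expires - timer_jiffies;   /* relative expiry */
    struct slot s;

    if (idx < TVR_SIZE) {
        s.level = 1;
        s.index = expires & TVR_MASK;
    } else if (idx < 1UL << (TVR_BITS + TVN_BITS)) {
        s.level = 2;
        s.index = (expires >> TVR_BITS) & TVN_MASK;
    } else if (idx < 1UL << (TVR_BITS + 2 * TVN_BITS)) {
        s.level = 3;
        s.index = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
    } else if (idx < 1UL << (TVR_BITS + 3 * TVN_BITS)) {
        s.level = 4;
        s.index = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
    } else if ((long)idx < 0) {
        /* already expired: run it on the next pass over tv1 */
        s.level = 1;
        s.index = timer_jiffies & TVR_MASK;
    } else {
        s.level = 5;
        s.index = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
    }
    return s;
}
```

Note that the level is chosen from the relative value idx, while the slot inside the level is taken from the absolute expires value. That is why, with timer_jiffies equal to 0, a timer expiring at tick 256 lands in slot 1 of tv2 rather than slot 0.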

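5): Timer update
On every tick the kernel checks whether any timer has reached its expiry time. If so, it runs the function registered with the timer and then removes the timer. To analyze this process, we start from the initialization of the timer subsystem.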
asmlinkage void __init start_kernel(void)
{
         ……
         init_timers();
         ……
}
init_timers() is defined as follows:
void __init init_timers(void)
{
        int err = timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
                                (void *)(long)smp_processor_id());

        init_timer_stats();

        BUG_ON(err == NOTIFY_BAD);
        register_cpu_notifier(&timers_nb);
        //register the TIMER_SOFTIRQ softirq
        open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
}

As analyzed earlier, every time the timer interrupt arrives it raises the timer softirq, whose handler run_timer_softirq() then runs.
The code is as follows:
static void run_timer_softirq(struct softirq_action *h)
{
    //get the tvec_base_t structure of the current CPU
        tvec_base_t *base = __get_cpu_var(tvec_bases);

        hrtimer_run_queues();
    //if jiffies >= base->timer_jiffies
        if (time_after_eq(jiffies, base->timer_jiffies))
                __run_timers(base);
}

The code of __run_timers() is as follows:

static inline void __run_timers(struct tvec_base *base)
{
    ……
    spin_lock_irq(&base->lock);
    while (time_after_eq(jiffies, base->timer_jiffies)) {
        ……
        int index = base->timer_jiffies & TVR_MASK;
        if (!index &&
              (!cascade(base, &base->tv2, INDEX(0))) &&
                 (!cascade(base, &base->tv3, INDEX(1))) &&
                    !cascade(base, &base->tv4, INDEX(2)))
                       cascade(base, &base->tv5, INDEX(3));
        ++base->timer_jiffies;
        list_replace_init(base->tv1.vec + index, &work_list);
        while (!list_empty(head)) {
            ……
            timer = list_first_entry(head, struct timer_list,entry);
            fn = timer->function;
            data = timer->data;
            ……
            set_running_timer(base, timer);
            detach_timer(timer, 1);
            spin_unlock_irq(&base->lock);
            {
                int preempt_count = preempt_count();
                fn(data);
                ……
            }
            spin_lock_irq(&base->lock);
        }
    }
    set_running_timer(base, NULL);
    spin_unlock_irq(&base->lock);
}

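Code walkthrough:
1. Acquire the lock of base
2. While jiffies is greater than or equal to timer_jiffies (the expiry time of the software timers about to be processed, meaning that some software timers may have expired), keep running steps 3~7; otherwise jump to step 8
3. Compute the tv1 index, which identifies the list in tv1 that holds the software timers expiring now. Code:

int index = base->timer_jiffies & TVR_MASK;

4. Call cascade to make any necessary adjustment to the software timers (the adjustment is described later)
5. Increment timer_jiffies by 1
6. Take the corresponding list of software timers
7. Walk that list and do the following for each element

    * Mark the current software timer as the one running in base (store it in the base->running_timer member)
    * Remove the current software timer from the list, that is, detach it
    * Release the lock and run the software timer's handler
    * Acquire the lock again

8. Record that base has no running software timer
9. Release the lock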

static int cascade(struct tvec_base *base, struct tvec *tv, int index)
{
    struct timer_list *timer, *tmp;
    struct list_head tv_list;
    list_replace_init(tv->vec + index, &tv_list);
    list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
        ……
        internal_add_timer(base, timer);
    }
    return index;
}
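Given an index, this function takes the corresponding list out of tv (one of tv2~tv5) and walks every element of it, re-adding each software timer to its base according to its expiry time. It returns the index of the list in tv that was redistributed.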
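Because cascade() returns the index it was given, the chained `!cascade(...) &&` expression in __run_timers() keeps climbing to the next level only while each level's index is 0. The sketch below computes how many levels would be cascaded at a given timer_jiffies. It assumes INDEX(N) is (timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK, which is my reading of the kernel macro; level_index and levels_cascaded are hypothetical names.

```c
#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_MASK ((1UL << TVR_BITS) - 1)
#define TVN_MASK ((1UL << TVN_BITS) - 1)

/* Slot selected in level n+2 (n = 0 for tv2), modeled on INDEX(N). */
static unsigned long level_index(unsigned long timer_jiffies, int n)
{
    return (timer_jiffies >> (TVR_BITS + n * TVN_BITS)) & TVN_MASK;
}

/* How many of tv2..tv5 are cascaded when __run_timers() handles this tick. */
static int levels_cascaded(unsigned long timer_jiffies)
{
    int n, count = 0;

    if (timer_jiffies & TVR_MASK)
        return 0;                   /* tv1 has not wrapped: nothing to do */
    for (n = 0; n < 4; n++) {
        count++;                    /* level n+2 is cascaded */
        if (level_index(timer_jiffies, n) != 0)
            break;                  /* non-zero index stops the chain */
    }
    return count;
}
```

So tv2 is touched once every 256 ticks, tv3 once every 2^14 ticks, and so on, which keeps the average cost per tick small.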

6): del_timer(): delete a timer
//delete a timer
int del_timer(struct timer_list *timer)
{
        tvec_base_t *base;
        unsigned long flags;
        int ret = 0;

        timer_stats_timer_clear_start_info(timer);
        if (timer_pending(timer)) {
                base = lock_timer_base(timer, &flags);
                if (timer_pending(timer))
                {
                    //remove the timer from its list
                        detach_timer(timer, 1);
                        ret = 1;
                }
                spin_unlock_irqrestore(&base->lock, flags);
        }

        return ret;
}
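This function can be used on both active and inactive timers. It returns 0 if the timer was inactive and 1 otherwise.
Note that it need not be called for a timer that has already expired, because such a timer is removed automatically. On SMP, deleting a timer may also require waiting until its handler has finished running on another CPU, which is what the next function is for.

7): del_timer_sync(): deleting a timer when there may be a race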
int del_timer_sync(struct timer_list *timer)
{
    for (;;) {
        int ret = try_to_del_timer_sync(timer);
        if (ret >= 0)
            return ret;
        cpu_relax();
    }
}

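del_timer_sync loops indefinitely, trying to remove the software timer until it succeeds. Its implementation shows that while the timer's handler is running, an attempt to remove it fails; only after the handler has finished does the removal succeed. This avoids the problems that would arise on SMP if one CPU were running the timer's handler while another CPU removed the timer.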
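The retry loop can be imitated with a toy model. In the sketch below, everything is hypothetical: struct fake_timer stands in for a timer, handler_spins_left pretends that the handler on another CPU needs a few more polls to finish, and try_del plays the role of try_to_del_timer_sync() by returning -1 while the handler is "running".

```c
/* Hypothetical toy timer used only for this illustration. */
struct fake_timer {
    int pending;              /* queued and waiting to fire */
    int handler_spins_left;   /* polls until the running handler finishes */
};

/* Returns -1 while the handler is running, 1 if a pending timer was
 * removed, 0 if the timer was not pending. */
static int try_del(struct fake_timer *t)
{
    if (t->handler_spins_left > 0) {
        t->handler_spins_left--;   /* pretend time passes on the other CPU */
        return -1;
    }
    if (t->pending) {
        t->pending = 0;
        return 1;
    }
    return 0;
}

/* Same shape as del_timer_sync(): retry until the attempt does not fail. */
static int del_sync(struct fake_timer *t, int *retries)
{
    *retries = 0;
    for (;;) {
        int ret = try_del(t);

        if (ret >= 0)
            return ret;
        (*retries)++;              /* the kernel calls cpu_relax() here */
    }
}
```

The caller of the real function must not hold a lock that the timer handler also takes, otherwise this loop can spin forever.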
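That concludes the timer part. To manage timers the kernel uses a very clever data structure that is well worth studying.

Note: the next chapter covers timer-based delays.

Timer-based delays

schedule_timeout() puts the task that needs a delay to sleep until the given delay has elapsed, and then lets it run again.

Usage:
//put the task into interruptible sleep
set_current_state(TASK_INTERRUPTIBLE);

//take a nap and wake up after s seconds
schedule_timeout(s*HZ);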

fastcall signed long __sched schedule_timeout(signed long timeout)
{
        struct timer_list timer;
        unsigned long expire;

        switch (timeout)
        {
        case MAX_SCHEDULE_TIMEOUT:
                /*
                 * These two special cases are useful to be comfortable
                 * in the caller. Nothing more. We could take
                 * MAX_SCHEDULE_TIMEOUT from one of the negative value
                 * but I' d like to return a valid offset (>=0) to allow
                 * the caller to do everything it want with the retval.
                 */
                schedule();
                goto out;
        default:
                /*
                 * Another bit of PARANOID. Note that the retval will be
                 * 0 since no piece of kernel is supposed to do a check
                 * for a negative retval of schedule_timeout() (since it
                 * should never happens anyway). You just have the printk()
                 * that will tell you if something is gone wrong and where.
                 */
                if (timeout < 0) {
                        printk(KERN_ERR "schedule_timeout: wrong timeout "
                                "value %lx\n", timeout);
                        dump_stack();
                        current->state = TASK_RUNNING;
                        goto out;
                }
        }

        expire = timeout + jiffies;

        setup_timer(&timer, process_timeout, (unsigned long)current);
        __mod_timer(&timer, expire);
        schedule();
        del_singleshot_timer_sync(&timer);

        timeout = expire - jiffies;

out:
        return timeout < 0 ? 0 : timeout;
}

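schedule_timeout defines a software timer variable, timer. After computing the expiry time it initializes this timer: the handler to run on expiry is process_timeout, its argument is the current process descriptor, and the expiry time is expire. It then calls schedule(). The current process goes to sleep and gives up the CPU, and the kernel runs other processes. After every timer interrupt the kernel checks whether this software timer has expired; when it has, process_timeout is called with the descriptor of the sleeping process as its argument.
MAX_SCHEDULE_TIMEOUT means the task wants to sleep indefinitely. In that case the function sets no timer and calls schedule() right away.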
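The return value of schedule_timeout() is the number of jiffies that were still left when the task woke up, clamped at 0. A small sketch of that arithmetic, with hypothetical helper names and the current jiffies value passed in as a parameter:

```c
/* Jiffies left until 'expire' when the sleeper is woken at 'now';
 * 0 if the timeout has already passed (hypothetical helper). */
static long remaining_ticks(unsigned long expire, unsigned long now)
{
    long left = (long)(expire - now);

    return left < 0 ? 0 : left;
}

/* Convert whole seconds to ticks, as in schedule_timeout(s*HZ)
 * (hypothetical helper; hz stands for the kernel's HZ). */
static unsigned long seconds_to_ticks(unsigned long secs, unsigned long hz)
{
    return secs * hz;
}
```

A non-zero return value therefore tells the caller that it was woken early, for example by a signal when sleeping in TASK_INTERRUPTIBLE.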

setup_timer() initializes the timer

static inline void setup_timer(struct timer_list * timer,
                               void (*function)(unsigned long),
                               unsigned long data)
{
        timer->function = function;
        timer->data = data;
        init_timer(timer);
}
The base field is assigned inside init_timer().

The process_timeout() function
static void process_timeout(unsigned long __data)
{
wake_up_process((struct task_struct *)__data);
}

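process_timeout simply calls wake_up_process to wake the process. When the kernel schedules that process again, it continues inside schedule_timeout: execution returns from schedule(), del_singleshot_timer_sync removes the software timer, and schedule_timeout finishes.

Note: the next chapter is a summary. After reading this much, a summary is in order. Looks like I have a good habit, heh.

I finally found some time to summarize timer interrupts, heh. I have fallen in love with linux.

Time management holds a very important place in the kernel. Compared with event-driven code, a large number of kernel functions are time-driven.
Some of them run periodically, such as the scheduler's rebalancing of the run queues.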


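The kernel maintains time by controlling the timer interrupt, and the timer interrupt is very important for managing the operating system: the life cycle of a large number of kernel functions depends on the passage of time. What the timer interrupt does:

1. Updates the system uptime
2. Updates the wall time
3. On SMP systems, balances the scheduler's run queues across processors; if the load is uneven, it tries to even it out
4. Checks whether the current process has used up its time slice and, if so, reschedules
5. Runs dynamic timers that have expired
6. Updates resource usage and processor time statistics

The concrete work done by the timer interrupt handler:

1. Acknowledge or reprogram the system timer when needed
2. Take the xtime_lock lock, which protects access to jiffies_64 and the wall time <xtime_lock prevents races between CPUs>, and call do_timer(), which:
          1> increments jiffies_64 by 1;
          2> updates the wall time, stored in the xtime variable, and computes the load average.
3. Call update_process_times(), which:
          1> updates resource usage statistics, such as the system time and user time consumed by the current process
          2> raises the softirq that runs expired dynamic timers
          3> calls scheduler_tick(), which does the time accounting for the running process and
            sets the need_resched flag when required

The scheduling path and the code were explained in detail above. That is all for the timer interrupt analysis; I will add whatever important points I think of later. Thanks.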

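I am a Xiyou (西郵) student. Software Freedom Day on 9.20 will be held at our school and I am excited. When I joined the linux interest group last year
I was still quite clueless and had only installed linux in a virtual machine on windows. Now windows is gone and only my lovely ubuntu is left.
A year has passed and I have learned a few things, but there are still many blind spots; hardware and networking still feel unfamiliar,
so I need to keep working.
Looking forward to the Software Freedom Day event.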

Dreams start here..
