Scheduling processes -- schedule()

  • When schedule() is invoked
Direct invocation

    The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available. In this case, the kernel routine that wants to block it proceeds as follows: 
1. Inserts current in the proper wait queue.
2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
3. Invokes schedule( ).
4. Checks whether the resource is available; if not, goes to step 2.
5. Once the resource is available, removes current from the wait queue.
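
The five steps above can be sketched as a small user-space model. Everything here is hypothetical (struct task, struct resource, fake_schedule( ) and wait_for_resource( ) are not kernel names); the stub scheduler simply pretends that another process ran and then woke the waiter up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; none of these are kernel definitions. */
enum task_state { STATE_RUNNING, STATE_INTERRUPTIBLE };

struct task {
    enum task_state state;
    bool on_wait_queue;
};

struct resource {
    int ready_after;     /* available after this many fake_schedule() calls */
    int schedule_calls;  /* how many times the waiter gave up the CPU */
};

static bool resource_available(const struct resource *r)
{
    return r->schedule_calls >= r->ready_after;
}

/* Stub for schedule(): pretend another process ran and then woke us up. */
static void fake_schedule(struct task *cur, struct resource *r)
{
    r->schedule_calls++;
    cur->state = STATE_RUNNING;   /* a wakeup makes the task runnable again */
}

/* Follows the five steps: enqueue, set state, schedule, re-check, dequeue. */
static void wait_for_resource(struct task *cur, struct resource *r)
{
    cur->on_wait_queue = true;                 /* step 1 */
    for (;;) {
        cur->state = STATE_INTERRUPTIBLE;      /* step 2 */
        fake_schedule(cur, r);                 /* step 3 */
        if (resource_available(r))             /* step 4 */
            break;
    }
    cur->on_wait_queue = false;                /* step 5 */
    assert(cur->state == STATE_RUNNING);
}
```

The re-check in step 4 sits inside the loop because being woken up does not guarantee that the resource is still free when the process finally runs.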

Lazy invocation
    The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current to 1. Because a check on the value of this flag is always made before resuming the execution of a User Mode process, schedule( ) will definitely be invoked at some time in the near future. 
Typical examples of lazy invocation of the scheduler are:
  • When current has used up its quantum of CPU time; this is done by the scheduler_tick( ) function.
  • When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up( ) function.
  • When a sched_setscheduler( ) system call is issued.
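
A minimal sketch of the lazy path, assuming a toy task with a tick counter: the timer tick only sets a flag, and the scheduler runs later, at the check made on the way back to User Mode. The names fake_scheduler_tick( ), return_to_user_mode( ) and the need_resched field are illustrative stand-ins, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; field and function names are illustrative only. */
struct task {
    int  time_slice;     /* remaining ticks in the quantum */
    bool need_resched;   /* plays the role of TIF_NEED_RESCHED */
};

static int schedule_calls;   /* counts invocations of the stub scheduler */

/* Stub scheduler: records the call and clears the request flag. */
static void fake_schedule(struct task *cur)
{
    schedule_calls++;
    cur->need_resched = false;
}

/* Timer tick: only marks the task, it never switches processes itself. */
static void fake_scheduler_tick(struct task *cur)
{
    if (cur->time_slice > 0 && --cur->time_slice == 0)
        cur->need_resched = true;
}

/* The check made before resuming execution in User Mode. */
static void return_to_user_mode(struct task *cur)
{
    if (cur->need_resched)
        fake_schedule(cur);
}
```

The point of the split is that the tick handler stays short: it requests a reschedule, and the actual switch happens at a safe point shortly afterwards.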




  • Overview of schedule()
    The act of picking the next task to run and switching to it is implemented via the schedule() function.
    The schedule() function is relatively simple for all it must accomplish. The following code determines the highest priority task:

struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;

prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

    First, the active priority array is searched to find the first set bit. This bit corresponds to the highest priority task that is runnable. Next, the scheduler selects the first task in the list at that priority. This is the highest priority runnable task on the system and is the task the scheduler will run.
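
The bitmap lookup can be tried outside the kernel with a toy priority array. This is a sketch under stated assumptions: 140 priority levels, a lower index meaning a higher priority, and a plain singly linked FIFO per level instead of the kernel's list_head. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NPRIO 140                                  /* assumed number of levels */
#define BITS_PER_WORD (8 * sizeof(unsigned long))
#define NWORDS ((NPRIO + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct toy_task {
    int pid;
    struct toy_task *next;   /* singly linked FIFO, simpler than list_head */
};

struct toy_prio_array {
    unsigned long bitmap[NWORDS];      /* bit i set <=> queue i is non-empty */
    struct toy_task *head[NPRIO];
    struct toy_task *tail[NPRIO];
};

/* Append a task to the queue of its priority and mark the queue non-empty. */
static void toy_enqueue(struct toy_prio_array *a, struct toy_task *t, int prio)
{
    assert(prio >= 0 && prio < NPRIO);
    t->next = NULL;
    if (a->tail[prio])
        a->tail[prio]->next = t;
    else
        a->head[prio] = t;
    a->tail[prio] = t;
    a->bitmap[prio / BITS_PER_WORD] |= 1UL << (prio % BITS_PER_WORD);
}

/* Index of the lowest set bit, or NPRIO when every queue is empty. */
static int toy_find_first_bit(const struct toy_prio_array *a)
{
    for (size_t w = 0; w < NWORDS; w++) {
        if (!a->bitmap[w])
            continue;
        for (size_t b = 0; b < BITS_PER_WORD; b++)
            if (a->bitmap[w] & (1UL << b))
                return (int)(w * BITS_PER_WORD + b);
    }
    return NPRIO;
}

/* First task of the best non-empty queue, or NULL if nothing is runnable. */
static struct toy_task *toy_pick_next(const struct toy_prio_array *a)
{
    int idx = toy_find_first_bit(a);
    return idx < NPRIO ? a->head[idx] : NULL;
}
```

Because the bitmap has a fixed size, the search cost does not depend on how many tasks are queued, only on the number of priority levels.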

    If prev does not equal next, then a new task has been selected to run. The function context_switch() is called to switch from prev to next.


  • Implementation details of schedule()

    The goal of the schedule( ) function consists of replacing the currently executing process with another one. Thus, the key outcome of the function is to set a local variable called next, so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority greater than the priority of current, at the end, next coincides with current and no process switch takes place.

asmlinkage void __sched schedule(void)
{
    long *switch_count;
    task_t *prev, *next;
    runqueue_t *rq;
    prio_array_t *array;
    struct list_head *queue;
    unsigned long long now;
    unsigned long run_time;
    int cpu, idx;


    if (likely(!current->exit_state)) {
        if (unlikely(in_atomic())) {
            printk(KERN_ERR "scheduling while atomic: "
                "%s/0x%08x/%d\n",
                current->comm, preempt_count(), current->pid);
            dump_stack();
        }
    }
    profile_hit(SCHED_PROFILING, __builtin_return_address(0));

  • Actions performed by schedule( ) before a process switch
Disable kernel preemption; initialize the local variables prev and rq
The schedule( ) function starts by disabling kernel preemption and initializing a few local variables:
|---------------------------------|
|need_resched:                    |
|    preempt_disable();           |
|    prev = current;              |
|    release_kernel_lock(prev);   |
|need_resched_nonpreemptible:     |
|    rq = this_rq();              |
|---------------------------------|


    if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
        printk(KERN_ERR "bad: scheduling from the idle thread!\n");
        dump_stack();
    }

    schedstat_inc(rq, sched_cnt);

Compute how long prev has just run (run_time)
The continuous run time (run_time) is normally capped at 1 second (expressed in nanoseconds)
    The sched_clock( ) function is invoked to read the TSC and convert its value to nanoseconds; the timestamp obtained is saved in the now local variable. Then, schedule( ) computes the duration of the CPU time slice used by prev:
    now = sched_clock();
    run_time = now - prev->timestamp;
    if (run_time > 1000000000)
        run_time = 1000000000;
|-----------------------------------------------------------------------|
|   now = sched_clock();                                                |
|   if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {|
|       run_time = now - prev->timestamp;                               |
|       if (unlikely((long long)(now - prev->timestamp) < 0))           |
|           run_time = 0;                                               |
|   } else                                                              |
|       run_time = NS_MAX_SLEEP_AVG;                                    |
|-----------------------------------------------------------------------|
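
The clamping logic in the box can be checked in isolation. The sketch below assumes that the cap is one second in nanoseconds, as in the simplified version shown first; TOY_NS_MAX_SLEEP_AVG and toy_run_time( ) are hypothetical names. The cast to a signed type relies on the usual two's-complement conversion, which the C standard leaves implementation-defined.

```c
#include <assert.h>

/* Assumed cap: one second in nanoseconds. */
#define TOY_NS_MAX_SLEEP_AVG 1000000000ULL

/*
 * Time used by prev, clamped to [0, TOY_NS_MAX_SLEEP_AVG].
 * A negative difference (timestamp ahead of now) counts as zero.
 */
static unsigned long toy_run_time(unsigned long long now,
                                  unsigned long long timestamp)
{
    long long diff = (long long)(now - timestamp);

    if (diff < 0)
        return 0;
    if (diff >= (long long)TOY_NS_MAX_SLEEP_AVG)
        return (unsigned long)TOY_NS_MAX_SLEEP_AVG;
    return (unsigned long)diff;
}
```

The negative case matters when the two timestamps were not taken from the same clock source, so the difference cannot be trusted to be positive.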

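Scale down this run time according to the previous average sleep time (CURRENT_BONUS):
Strictly speaking, the average sleep time of prev should be updated to:
    previous average sleep time - this continuous run time;
However, schedule( ) rewards processes whose previous average sleep time was long, that is, those with a large CURRENT_BONUS(prev). The division below shrinks run_time, which reduces the effect of this run on the new average sleep time.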
|--------------------------------------------|
|   run_time /= (CURRENT_BONUS(prev) ? : 1); |
|--------------------------------------------|
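
The expression `x ? : 1` is a GNU C extension meaning "x if x is nonzero, otherwise 1", so the divisor is never zero. A portable sketch of the same arithmetic follows; toy_scale_run_time( ) is a hypothetical helper, and its bonus parameter stands in for CURRENT_BONUS(prev), whose computation is not modelled here.

```c
#include <assert.h>

/*
 * Divide run_time by the bonus, treating a bonus of 0 as 1.
 * `bonus` stands in for CURRENT_BONUS(prev).
 */
static unsigned long toy_scale_run_time(unsigned long run_time,
                                        unsigned long bonus)
{
    unsigned long divisor = bonus ? bonus : 1;  /* portable `bonus ? : 1` */

    assert(divisor != 0);
    return run_time / divisor;
}
```

With a bonus of 10 a run of 1000 ns is charged as 100 ns, while a bonus of 0 or 1 charges the full 1000 ns.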

Disable local interrupts; protect the runqueue with a spin lock
Before starting to look at the runnable processes, schedule( ) must disable the local interrupts and acquire the spin lock that protects the runqueue:
|------------------------------------|
|   spin_lock_irq(&rq->lock);        |
|------------------------------------|

To detect whether the current process has terminated, schedule( ) checks the PF_DEAD flag
|----------------------------------------|
|   if (unlikely(prev->flags & PF_DEAD)) |
|       prev->state = EXIT_DEAD;         |
|----------------------------------------| 


    switch_count = &prev->nivcsw;

If prev called schedule( ) to give up the CPU because it is waiting for some event, schedule( ) uses the state of prev (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE) to decide whether it stays in the runqueue or is removed from it.
    schedule( ) examines the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, then it should be removed from the runqueue. However, if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING and leaves it in the runqueue. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution.
    In other words, a prev that is not runnable and was not preempted in Kernel Mode put itself on a wait queue before calling schedule( ) and is about to sleep. If its state is TASK_INTERRUPTIBLE and a signal has arrived (the signal is not handled at this point), prev goes back to TASK_RUNNING and handles the signal once it is scheduled again:
|---------------------------------------------------------------|
|   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {   |
|       switch_count = &prev->nvcsw;                            |
|       if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&      |
|               unlikely(signal_pending(prev))))                |
|           prev->state = TASK_RUNNING;                         |
|       else {                                                  |
|           if (prev->state == TASK_UNINTERRUPTIBLE)            |
|               rq->nr_uninterruptible++;                       |
|           deactivate_task(prev, rq);                          |
|       }                                                       |
|   }                                                           |
|---------------------------------------------------------------|
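
The decision made in the box can be modelled as a small pure function. This is a sketch, not kernel code: the state values are assumptions chosen so that the bit test behaves like the original, preempted_in_kernel stands in for the PREEMPT_ACTIVE test, and decrementing nr_running stands in for deactivate_task( ).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed state encoding for this sketch. */
#define TOY_TASK_RUNNING          0
#define TOY_TASK_INTERRUPTIBLE    1
#define TOY_TASK_UNINTERRUPTIBLE  2

struct toy_prev {
    long state;
    bool signal_pending;
    bool preempted_in_kernel;   /* models preempt_count() & PREEMPT_ACTIVE */
};

struct toy_rq {
    unsigned long nr_uninterruptible;
    unsigned long nr_running;
};

/* Returns true when prev was removed from the runqueue. */
static bool toy_settle_prev(struct toy_prev *prev, struct toy_rq *rq)
{
    /* Runnable, or preempted in Kernel Mode: prev stays where it is. */
    if (prev->state == TOY_TASK_RUNNING || prev->preempted_in_kernel)
        return false;

    /* Interruptible sleeper with a pending signal: make it runnable again. */
    if ((prev->state & TOY_TASK_INTERRUPTIBLE) && prev->signal_pending) {
        prev->state = TOY_TASK_RUNNING;
        return false;
    }

    if (prev->state == TOY_TASK_UNINTERRUPTIBLE)
        rq->nr_uninterruptible++;
    assert(rq->nr_running > 0);
    rq->nr_running--;           /* stands in for deactivate_task() */
    return true;
}
```

Note that a pending signal never rescues a task in the uninterruptible state: only the interruptible case is turned back into a runnable one.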


    cpu = smp_processor_id();
  • Actions performed by schedule( ) to make the process switch
Check the number of runnable processes in the runqueue and balance the load accordingly:
schedule( ) checks the number of runnable processes left in the runqueue.
    If no runnable process exists, the function invokes idle_balance( ) to move some runnable process from another runqueue to the local runqueue; idle_balance( ) is similar to load_balance( ).
    If idle_balance( ) fails to move any process into the local runqueue, schedule( ) invokes wake_sleeping_dependent( ) to reschedule runnable processes on idle CPUs (that is, on every CPU that runs the swapper process). As noted in the discussion of the dependent_sleeper( ) function, this case typically arises when the kernel supports hyper-threading. However, on uniprocessor systems, or when every attempt to move a runnable process into the local runqueue has failed, the function picks the swapper process as next and continues with the next phase.
    If there are some runnable processes, the function invokes the dependent_sleeper( ) function. In most cases, this function immediately returns zero.
|-------------------------------------------------|
|   if (unlikely(!rq->nr_running)) {              |
|go_idle:                                         |
|       idle_balance(cpu, rq);                    |
|       if (!rq->nr_running) {                    |
|           next = rq->idle;                      |
|           rq->expired_timestamp = 0;            |
|           wake_sleeping_dependent(cpu, rq);     |
|           if (!rq->nr_running)                  |
|               goto switch_tasks;                |
|       }                                         |
|-------------------------------------------------|
|   } else {                                      |
|       if (dependent_sleeper(cpu, rq)) {         |
|           next = rq->idle;                      |
|           goto switch_tasks;                    |
|       }                                         |
|       if (unlikely(!rq->nr_running))            |
|           goto go_idle;                         |
|   }                                             |
|-------------------------------------------------|

If the active array of the runqueue (runqueue.active) has no active processes left, swap the active and expired arrays
Let's suppose that the schedule( ) function has determined that the runqueue includes some runnable processes; now it has to check that at least one of these runnable processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure; thus, all expired processes become active, while the empty set is ready to receive the processes that will expire in the future.
|------------------------------------------|
|   array = rq->active;                    | 
|   if (unlikely(!array->nr_active)) {     |
|       schedstat_inc(rq, sched_switch);   |
|       rq->active = rq->expired;          |
|       rq->expired = array;               |
|       array = rq->active;                |
|       rq->expired_timestamp = 0;         |
|       rq->best_expired_prio = MAX_PRIO;  |
|   }                                      |
|------------------------------------------|
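
The swap is only a pointer exchange, which is why it is cheap regardless of how many processes have expired. A minimal sketch, assuming a toy runqueue that owns two arrays (toy_runqueue, toy_array and toy_active_array( ) are hypothetical names):

```c
#include <assert.h>

struct toy_array {
    unsigned int nr_active;     /* number of tasks queued in this array */
};

struct toy_runqueue {
    struct toy_array *active;
    struct toy_array *expired;
    struct toy_array arrays[2]; /* storage the two pointers refer to */
};

/*
 * Returns the array to search for the next task. If the active array
 * is empty, the two pointers are exchanged first; no task is moved.
 */
static struct toy_array *toy_active_array(struct toy_runqueue *rq)
{
    struct toy_array *array = rq->active;

    assert(rq->active != rq->expired);
    if (array->nr_active == 0) {
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
    }
    return array;
}
```

After the exchange the old active array, now empty, is the one that collects processes as they use up their time slices.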

Pick the highest-priority process, next, from the priority array:
It is time to look up a runnable process in the active prio_array_t data structure. First of all, schedule( ) searches for the first nonzero bit in the bitmask of the active set. Remember that a bit in the bitmask is set when the corresponding priority list is not empty. Thus, the index of the first nonzero bit indicates the list containing the best process to run. Then, the first process descriptor in that list is retrieved:
|-----------------------------------------------------|
|   idx = sched_find_first_bit(array->bitmap);        |
|   queue = array->queue + idx;                       |
|   next = list_entry(queue->next, task_t, run_list); |
|-----------------------------------------------------|

Compute the average sleep time of next:
If next is a conventional process and it is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to the average sleep time of the process the nanoseconds elapsed since the process was inserted into the runqueue. In other words, the sleep time of the process is increased to cover also the time spent by the process in the runqueue waiting for the CPU (so the new average is not obtained by simply adding the time slept before the wakeup):
|-------------------------------------------------------------------|
|   if (!rt_task(next) && next->activated > 0) {                    |
|       unsigned long long delta = now - next->timestamp;           |
|       if (unlikely((long long)(now - next->timestamp) < 0))       |
|           delta = 0;                                              |
|       if (next->activated == 1)                                   |
|           delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; |
|       array = next->array;                                        |
|       dequeue_task(next, array);                                  |
|       recalc_task_prio(next, next->timestamp + delta);            |
|       enqueue_task(next, array);                                  |
|   }                                                               |
|   next->activated = 0;                                            |
|-------------------------------------------------------------------|
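
The weighting applied when activated is 1 uses integer arithmetic only. The sketch below assumes ON_RUNQUEUE_WEIGHT is 30 (a percentage); that value and the toy_* names are assumptions, so check them against the kernel version at hand.

```c
#include <assert.h>

#define TOY_ON_RUNQUEUE_WEIGHT 30   /* percent; assumed value */

/*
 * Portion of the time spent waiting on the runqueue that is credited
 * as sleep time. Only the activated == 1 case is scaled down.
 */
static unsigned long long toy_credited_sleep(unsigned long long delta,
                                             int activated)
{
    assert(activated > 0);
    if (activated == 1)
        delta = delta * (TOY_ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
    return delta;
}
```

Under that assumption the constant factor evaluates to 38/128 in integer arithmetic, slightly under 30%, so 1000 ns spent on the runqueue is credited as 296 ns.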


  • Actions performed by schedule( ) to make the process switch
switch_tasks:                              
    if (next == rq->idle)
       schedstat_inc(rq, sched_goidle);

Access the thread_info field of next
Now the schedule( ) function has determined the next process to run. In a moment, the kernel will access the thread_info data structure of next, whose address is stored close to the top of next's process descriptor (task_struct.thread_info):

|-----------------------|
|   prefetch(next);     |
|-----------------------|

Clear the TIF_NEED_RESCHED flag
Before replacing prev, the scheduler should do some administrative work:
The clear_tsk_need_resched( ) function clears the TIF_NEED_RESCHED flag of prev, just in case schedule( ) has been invoked in the lazy way. Then, the function records that the CPU is going through a quiescent state:
|-----------------------------------|
|   clear_tsk_need_resched(prev);   |
|   rcu_qsctr_inc(task_cpu(prev));  |
|-----------------------------------|

    update_cpu_clock(prev, rq, now);
Update the average sleep time (sleep_avg) of prev
Before the context switch prev ran for run_time, so its sleep_avg is reduced by run_time; the timestamp at which the process stops running is also updated.
The schedule( ) function must also decrease the average sleep time of prev, charging to it the slice of CPU time used by the process:
|-------------------------------------------|
|   prev->sleep_avg -= run_time;            |
|   if ((long)prev->sleep_avg <= 0)         |
|       prev->sleep_avg = 0;                |
|   prev->timestamp = prev->last_ran = now; |
|-------------------------------------------|
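
The cast to long in the box is what catches underflow: sleep_avg is unsigned, so subtracting a larger run_time wraps around to a huge value, which looks negative when viewed as a signed number. A stand-alone sketch (toy_charge_run_time( ) is a hypothetical name; the unsigned-to-signed conversion is implementation-defined in C but wraps on common compilers):

```c
#include <assert.h>

/* Charge run_time against sleep_avg, never letting the result wrap. */
static unsigned long toy_charge_run_time(unsigned long sleep_avg,
                                         unsigned long run_time)
{
    sleep_avg -= run_time;          /* may wrap around: the type is unsigned */
    if ((long)sleep_avg <= 0)       /* a wrapped value reads as negative */
        sleep_avg = 0;
    return sleep_avg;
}
```

Charging 200 against an average of 500 leaves 300, while charging 500 against 200 leaves 0 rather than a value near ULONG_MAX.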


    
    sched_info_switch(prev, next);

Perform the process context switch:
At this point, prev and next are different processes, and the process switch is for real:

|-----------------------------------------------|
|   if (likely(prev != next)) {                 |
|       next->timestamp = now;                  |
|       rq->nr_switches++;                      |
|       rq->curr = next;                        |
|       ++*switch_count;                        |
|       prepare_arch_switch(rq, next);          |
|       prev = context_switch(rq, prev, next);  |
|-----------------------------------------------|

  • Actions performed by schedule( ) after a process switch
|---------------------------------|
|       barrier();                |
|       finish_task_switch(prev); |
|---------------------------------|

If prev and next are the same process:
It is quite possible that prev and next are the same process: this happens if no other higher or equal priority active process is present in the runqueue. In this case, the function skips the process switch:
|-----------------------------------|
|   } else                          |
|       spin_unlock_irq(&rq->lock); |
|-----------------------------------|


    prev = current;
    if (unlikely(reacquire_kernel_lock(prev) < 0))
        goto need_resched_nonpreemptible;
    preempt_enable_no_resched();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}