Process scheduling -- schedule()

  • When schedule() is invoked
Direct invocation

    The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available. In this case, the kernel routine that wants to block it proceeds as follows: 
1. Inserts current in the proper wait queue.
2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.
3. Invokes schedule( ).
4. Checks whether the resource is available; if not, goes to step 2.
5. Once the resource is available, removes current from the wait queue.
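
The five steps above can be mimicked in user space. What follows is only a toy model, not kernel code: toy_schedule(), resource_available() and the wakeups_needed counter are hypothetical stand-ins for schedule( ), the real availability test, and whatever event eventually frees the resource.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel task states (the values are arbitrary here). */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE, TOY_UNINTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
    bool on_wait_queue;
};

/* Hypothetical resource: it becomes available after this many wakeups. */
static int wakeups_needed = 3;
static int schedule_calls = 0;

/* Pretend to sleep and to be woken up again: the waker makes us runnable. */
void toy_schedule(struct toy_task *t)
{
    schedule_calls++;
    t->state = TOY_RUNNING;
}

bool resource_available(void)
{
    return schedule_calls >= wakeups_needed;
}

/* Returns how many times the task had to call the scheduler. */
int wait_for_resource(struct toy_task *current)
{
    current->on_wait_queue = true;              /* step 1 */
    do {
        current->state = TOY_UNINTERRUPTIBLE;   /* step 2 */
        toy_schedule(current);                  /* step 3 */
    } while (!resource_available());            /* step 4: retry from step 2 */
    current->on_wait_queue = false;             /* step 5 */
    return schedule_calls;
}
```

The state is set before the availability check is repeated, which is why the loop goes back to step 2 rather than to step 3: a wakeup that arrives between the check and the call to the scheduler simply makes the task runnable again instead of being lost.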

Lazy invocation
    The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current to 1. Because a check on the value of this flag is always made before resuming the execution of a User Mode process, schedule( ) will definitely be invoked at some time in the near future. 
Typical examples of lazy invocation of the scheduler are:
  • When current has used up its quantum of CPU time; this is done by the scheduler_tick( ) function.
  • When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up( ) function.
  • When a sched_setscheduler( ) system call is issued.
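
The idea of lazy invocation (request a reschedule now, perform it at the next return to User Mode) can be sketched with a toy model. All names below are hypothetical: need_resched stands in for TIF_NEED_RESCHED, and the two-tick quantum is an arbitrary value chosen for the example.

```c
#include <stdbool.h>

struct toy_task {
    int time_slice;      /* remaining ticks in the quantum */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
};

static int reschedules = 0;   /* how many times the toy scheduler ran */

/* Toy scheduler: clears the flag and hands out a fresh quantum. */
void toy_schedule(struct toy_task *t)
{
    t->need_resched = false;
    t->time_slice = 2;        /* hypothetical quantum length, in ticks */
    reschedules++;
}

/* Timer tick: only requests a reschedule, it never switches directly. */
void toy_scheduler_tick(struct toy_task *t)
{
    if (t->time_slice > 0 && --t->time_slice == 0)
        t->need_resched = true;
}

/* Performed on every return to User Mode. */
void toy_return_to_user(struct toy_task *t)
{
    if (t->need_resched)
        toy_schedule(t);
}
```

With a two-tick quantum, the first tick leaves the flag clear and the second one sets it; the scheduler itself runs only when toy_return_to_user() notices the flag.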

  • Overview of schedule()
    The act of picking the next task to run and switching to it is implemented via the schedule() function.
    The schedule() function is relatively simple for all it must accomplish. The following code determines the highest priority task:

struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;

prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

    First, the active priority array is searched to find the first set bit. This bit corresponds to the highest priority task that is runnable. Next, the scheduler selects the first task in the list at that priority. This is the highest priority runnable task on the system and is the task the scheduler will run.

    If prev does not equal next, then a new task has been selected to run. The function context_switch() is called to switch from prev to next. 
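
The bitmap lookup described above can be tried out in user space. This is a simplified sketch: toy_find_first_bit() is a portable loop rather than the optimized sched_find_first_bit(), and a per-priority counter (nr_queued, a hypothetical field) replaces the real linked lists. It assumes the 140 priority levels of the 2.6 scheduler, where a lower index means a higher priority.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_PRIO   140   /* priority levels in the 2.6 O(1) scheduler */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS   ((TOY_MAX_PRIO + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct toy_prio_array {
    unsigned long bitmap[BITMAP_WORDS];  /* bit i set: list i is not empty */
    int nr_queued[TOY_MAX_PRIO];         /* stands in for the per-priority lists */
};

/* Queue one task at the given priority and mark its list as non-empty. */
void toy_enqueue(struct toy_prio_array *a, int prio)
{
    a->nr_queued[prio]++;
    a->bitmap[prio / BITS_PER_WORD] |= 1UL << (prio % BITS_PER_WORD);
}

/* Index of the first set bit, or TOY_MAX_PRIO if the array is empty. */
int toy_find_first_bit(const struct toy_prio_array *a)
{
    for (size_t w = 0; w < BITMAP_WORDS; w++) {
        if (a->bitmap[w] == 0)
            continue;                    /* skip a whole empty word at once */
        for (size_t b = 0; b < BITS_PER_WORD; b++)
            if (a->bitmap[w] & (1UL << b))
                return (int)(w * BITS_PER_WORD + b);
    }
    return TOY_MAX_PRIO;
}
```

After enqueuing tasks at priorities 120, 100 and 139, toy_find_first_bit() returns 100. The cost of the search depends only on the fixed number of priority levels, not on how many tasks are runnable.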

  • schedule() implementation details

    The goal of the schedule( ) function consists of replacing the currently executing process with another one. Thus, the key outcome of the function is to set a local variable called next, so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority greater than the priority of current, at the end, next coincides with current and no process switch takes place.

asmlinkage void __sched schedule(void)
{
    long *switch_count;
    task_t *prev, *next;
    runqueue_t *rq;
    prio_array_t *array;
    struct list_head *queue;
    unsigned long long now;
    unsigned long run_time;
    int cpu, idx;

    /* Complain if we are asked to schedule while in atomic context;
     * a task that is already exiting is exempt from the check. */
    if (likely(!current->exit_state)) {
        if (unlikely(in_atomic())) {
            printk(KERN_ERR "scheduling while atomic: "
                "%s/0x%08x/%d\n",
                current->comm, preempt_count(), current->pid);
        }
    }
    profile_hit(SCHED_PROFILING, __builtin_return_address(0));

  • Actions performed by schedule( ) before a process switch
The schedule( ) function starts by disabling kernel preemption and initializing a few local variables:
|need_resched:                    |
|    preempt_disable();           |
|    prev = current;              |
|    release_kernel_lock(prev);   |
|need_resched_nonpreemptible:     |
|    rq = this_rq();              |

    /* The idle thread must never call schedule( ) in a non-runnable state. */
    if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
        printk(KERN_ERR "bad: scheduling from the idle thread!\n");
    }

    schedstat_inc(rq, sched_cnt);

    The sched_clock( ) function is invoked to read the TSC and convert its value to nanoseconds; the timestamp obtained is saved in the now local variable. Then, schedule( ) computes the duration of the CPU time slice used by prev, capping it at one second:
    now = sched_clock();
    run_time = now - prev->timestamp;
    if (run_time > 1000000000)
        run_time = 1000000000;
    The kernel source expresses the same upper bound as NS_MAX_SLEEP_AVG and also guards against a negative difference:
|   now = sched_clock();                                                |
|   if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {|
|       run_time = now - prev->timestamp;                               |
|       if (unlikely((long long)(now - prev->timestamp) < 0))           |
|           run_time = 0;                                               |
|   } else                                                              |
|       run_time = NS_MAX_SLEEP_AVG;                                    |

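    The resulting run_time is later subtracted from prev's average sleep time (new average sleep time = previous average sleep time - length of this run). Before that, it is divided by the current bonus of the process, so that a process with a higher average sleep time is charged less: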
|   run_time /= (CURRENT_BONUS(prev) ? : 1); |
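
The clamping and scaling of run_time can be checked in isolation. The sketch below is a user-space approximation: TOY_NS_MAX_SLEEP_AVG is assumed to be one second (matching the 1000000000 used in the simplified code), the bonus is passed in as a plain number instead of being computed by CURRENT_BONUS( ), and the GNU `?:` shorthand is spelled out in standard C.

```c
#define TOY_NS_MAX_SLEEP_AVG 1000000000ULL   /* assumed: 1 second, in ns */

/* CPU time used by prev, clamped to [0, TOY_NS_MAX_SLEEP_AVG].
 * The signed view of the difference catches a clock that went backwards;
 * this relies on two's complement conversion, as the kernel code does. */
unsigned long long toy_run_time(unsigned long long now,
                                unsigned long long timestamp)
{
    long long elapsed = (long long)(now - timestamp);

    if (elapsed < 0)
        return 0;
    if ((unsigned long long)elapsed >= TOY_NS_MAX_SLEEP_AVG)
        return TOY_NS_MAX_SLEEP_AVG;
    return (unsigned long long)elapsed;
}

/* Same effect as `run_time /= (bonus ? : 1)`: a zero bonus divides by 1. */
unsigned long long toy_scale_by_bonus(unsigned long long run_time,
                                      unsigned int bonus)
{
    return run_time / (bonus ? bonus : 1);
}
```

For instance, toy_run_time(5000, 2000) yields 3000, a timestamp in the future yields 0, and toy_scale_by_bonus(3000, 3) yields 1000, so a process with bonus 3 is charged a third of the time it actually ran.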

Before starting to look at the runnable processes, schedule( ) must disable the local interrupts and acquire the spin lock that protects the runqueue:
|   spin_lock_irq(&rq->lock);        |

|   if (unlikely(prev->flags & PF_DEAD)) |
|       prev->state = EXIT_DEAD;         |

    switch_count = &prev->nivcsw;

    schedule( ) examines the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, then it should be removed from the runqueue. However, if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING and leaves it in the runqueue. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution:
|   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {   |
|       switch_count = &prev->nvcsw;                            |
|       if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&      |
|               unlikely(signal_pending(prev))))                |
|           prev->state = TASK_RUNNING;                         |
|       else {                                                  |
|           if (prev->state == TASK_UNINTERRUPTIBLE)            |
|               rq->nr_uninterruptible++;                       |
|           deactivate_task(prev, rq);                          |
|       }                                                       |
|   }                                                           |
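
The decision taken on prev can be isolated in a small user-space function. This is a sketch under stated assumptions: the state constants use the 2.6-era values (TASK_RUNNING is 0, which is why the kernel can test `prev->state` directly), while the signal_pending and on_runqueue fields and the preempted argument are hypothetical stand-ins for signal_pending( ), deactivate_task( ) and the PREEMPT_ACTIVE test.

```c
#include <stdbool.h>

#define TOY_TASK_RUNNING          0
#define TOY_TASK_INTERRUPTIBLE    1
#define TOY_TASK_UNINTERRUPTIBLE  2

struct toy_task {
    long state;
    bool signal_pending;   /* stands in for signal_pending(prev) */
    bool on_runqueue;
};

struct toy_rq {
    unsigned long nr_uninterruptible;
};

/* Returns true if prev was removed from the runqueue. */
bool toy_handle_prev(struct toy_task *prev, struct toy_rq *rq, bool preempted)
{
    if (prev->state == TOY_TASK_RUNNING || preempted)
        return false;                     /* runnable, or preempted in Kernel Mode */

    if ((prev->state & TOY_TASK_INTERRUPTIBLE) && prev->signal_pending) {
        prev->state = TOY_TASK_RUNNING;   /* a pending signal cancels the sleep */
        return false;
    }

    if (prev->state == TOY_TASK_UNINTERRUPTIBLE)
        rq->nr_uninterruptible++;
    prev->on_runqueue = false;            /* stands in for deactivate_task() */
    return true;
}
```

A task in TOY_TASK_INTERRUPTIBLE with a pending signal stays on the runqueue and becomes runnable again, whereas a task in TOY_TASK_UNINTERRUPTIBLE is removed even if a signal is pending.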

    cpu = smp_processor_id();
schedule( ) checks the number of runnable processes left in the runqueue.
    If no runnable process exists, the function invokes idle_balance( ) to move some runnable process from another runqueue to the local runqueue; idle_balance( ) is similar to load_balance( ).
    If idle_balance( ) fails to move any process into the local runqueue, schedule( ) invokes wake_sleeping_dependent( ) to reschedule runnable processes on idle CPUs (that is, on every CPU that is running the swapper process). As explained in the discussion of the dependent_sleeper( ) function, this case usually arises when the kernel supports hyper-threading technology. On a uniprocessor system, or when every attempt to move a process into the local runqueue has failed, the function picks the swapper process as next and continues with the next step.
    If there are some runnable processes, the function invokes the dependent_sleeper( ) function. In most cases, this function immediately returns zero.
|   if (unlikely(!rq->nr_running)) {              |
|go_idle:                                         |
|       idle_balance(cpu, rq);                    |
|       if (!rq->nr_running) {                    |
|           next = rq->idle;                      |
|           rq->expired_timestamp = 0;            |
|           wake_sleeping_dependent(cpu, rq);     |
|           if (!rq->nr_running)                  |
|               goto switch_tasks;                |
|       }                                         |
|   } else {                                      |
|       if (dependent_sleeper(cpu, rq)) {         |
|           next = rq->idle;                      |
|           goto switch_tasks;                    |
|       }                                         |
|       if (unlikely(!rq->nr_running))            |
|           goto go_idle;                         |
|   }                                             |

Let's suppose that the schedule( ) function has determined that the runqueue includes some runnable processes; now it has to check that at least one of these runnable processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure; thus, all expired processes become active, while the empty set is ready to receive the processes that will expire in the future.
|   array = rq->active;                    | 
|   if (unlikely(!array->nr_active)) {     |
|       schedstat_inc(rq, sched_switch);   |
|       rq->active = rq->expired;          |
|       rq->expired = array;               |
|       array = rq->active;                |
|       rq->expired_timestamp = 0;         |
|       rq->best_expired_prio = MAX_PRIO;  |
|   }                                      |
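
The exchange of the active and expired sets is nothing more than a pointer swap, as the following user-space sketch shows. The structures are reduced to the fields needed here; the arrays member and the toy_pick_array() helper are hypothetical names, and the statistics and timestamp updates of the real code are left out.

```c
struct toy_prio_array {
    unsigned int nr_active;   /* number of tasks queued in this set */
};

struct toy_rq {
    struct toy_prio_array *active;
    struct toy_prio_array *expired;
    struct toy_prio_array arrays[2];
};

/* Returns the set to search, swapping the two pointers first
 * if the active set has run out of tasks. */
struct toy_prio_array *toy_pick_array(struct toy_rq *rq)
{
    struct toy_prio_array *array = rq->active;

    if (array->nr_active == 0) {
        rq->active = rq->expired;
        rq->expired = array;
        array = rq->active;
    }
    return array;
}
```

No task is moved during the swap: every expired process becomes active at once, whatever the number of processes involved.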

It is time to look up a runnable process in the active prio_array_t data structure. First of all, schedule( ) searches for the first nonzero bit in the bitmask of the active set. Remember that a bit in the bitmask is set when the corresponding priority list is not empty. Thus, the index of the first nonzero bit indicates the list containing the best process to run. Then, the first process descriptor in that list is retrieved:
|   idx = sched_find_first_bit(array->bitmap);        |
|   queue = array->queue + idx;                       |
|   next = list_entry(queue->next, task_t, run_list); |

If next is a conventional process and it is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to the average sleep time of the process the nanoseconds elapsed since the process was inserted into the runqueue. In other words, the sleep time of the process is increased to cover also the time spent by the process in the runqueue waiting for the CPU:
|   if (!rt_task(next) && next->activated > 0) {                    |
|       unsigned long long delta = now - next->timestamp;           |
|       if (unlikely((long long)(now - next->timestamp) < 0))       |
|           delta = 0;                                              |
|       if (next->activated == 1)                                   |
|           delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; |
|       array = next->array;                                        |
|       dequeue_task(next, array);                                  |
|       recalc_task_prio(next, next->timestamp + delta);            |
|       enqueue_task(next, array);                                  |
|   }                                                               |
|   next->activated = 0;                                            |
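
The arithmetic applied to delta can be reproduced on its own. In this sketch the value 30 for TOY_ON_RUNQUEUE_WEIGHT is an assumption (the excerpt does not show the definition of ON_RUNQUEUE_WEIGHT), and toy_runqueue_credit() is a hypothetical helper that returns the amount of waiting time credited as sleep time.

```c
#define TOY_ON_RUNQUEUE_WEIGHT 30   /* assumed value, read as a percentage */

/* Nanoseconds of runqueue waiting time credited to the sleep average. */
unsigned long long toy_runqueue_credit(unsigned long long now,
                                       unsigned long long timestamp,
                                       int activated)
{
    long long waited = (long long)(now - timestamp);
    unsigned long long delta = waited < 0 ? 0 : (unsigned long long)waited;

    /* Only part of the wait counts when activated is 1. The factor is
     * precomputed as an integer: 30 * 128 / 100 = 38, then 38 / 128. */
    if (activated == 1)
        delta = delta * (TOY_ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
    return delta;
}
```

With a wait of 1280 ns, the credit is 380 ns when activated is 1 and the full 1280 ns when activated is 2. Because of the integer rounding, the effective weight under this assumption is 38/128, slightly less than 30%.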

  • Actions performed by schedule( ) to make the process switch
    if (next == rq->idle)
       schedstat_inc(rq, sched_goidle);

Now the schedule( ) function has determined the next process to run. In a moment, the kernel will access the thread_info data structure of next, whose address is stored close to the top of next's process descriptor:

|   prefetch(next);     |

Before replacing prev, the scheduler should do some administrative work:
The clear_tsk_need_resched( ) function clears the TIF_NEED_RESCHED flag of prev, just in case schedule( ) has been invoked in the lazy way. Then, the function records that the CPU is going through a quiescent state:
|   clear_tsk_need_resched(prev);   |
|   rcu_qsctr_inc(task_cpu(prev));  |

    update_cpu_clock(prev, rq, now);
The schedule( ) function must also decrease the average sleep time of prev, charging to it the slice of CPU time used by the process:
|   prev->sleep_avg -= run_time;            |
|   if ((long)prev->sleep_avg <= 0)         |
|       prev->sleep_avg = 0;                |
|   prev->timestamp = prev->last_ran = now; |
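
The update of the average sleep time relies on a small trick: sleep_avg is unsigned, so the subtraction may wrap around, and the cast to long is what detects it. A user-space sketch, with a reduced structure and a hypothetical helper name:

```c
struct toy_task {
    unsigned long sleep_avg;        /* average sleep time, in ns */
    unsigned long long timestamp;
    unsigned long long last_ran;
};

/* Charge run_time to prev and record when it last ran. */
void toy_charge_run_time(struct toy_task *prev, unsigned long run_time,
                         unsigned long long now)
{
    prev->sleep_avg -= run_time;       /* may wrap: the field is unsigned */
    if ((long)prev->sleep_avg <= 0)    /* the signed view exposes the wrap */
        prev->sleep_avg = 0;
    prev->timestamp = prev->last_ran = now;
}
```

Starting from a sleep_avg of 500, charging 200 leaves 300; charging a further 900 would go below zero, so the value is clamped to 0.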

    sched_info_switch(prev, next);

If prev and next are different processes, the process switch is for real:

|   if (likely(prev != next)) {                 |
|       next->timestamp = now;                  |
|       rq->nr_switches++;                      |
|       rq->curr = next;                        |
|       ++*switch_count;                        |
|       prepare_arch_switch(rq, next);          |
|       prev = context_switch(rq, prev, next);  |

  • Actions performed by schedule( ) after a process switch
|       barrier();                |
|       finish_task_switch(prev); |

It is quite possible that prev and next are the same process: this happens if no other higher or equal priority active process is present in the runqueue. In this case, the function skips the process switch:
|   } else                          |
|       spin_unlock_irq(&rq->lock); |

    prev = current;
    if (unlikely(reacquire_kernel_lock(prev) < 0))
        goto need_resched_nonpreemptible;
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;