This article is also available on my personal blog: https://www.huweihuang.com/kubernetes-notes/code-analysis/kube-scheduler/scheduleOne.html
kube-scheduler Source Code Analysis (Part 3): scheduleOne
The following code analysis is based on Kubernetes v1.12.0.
This article walks through the basic scheduling flow in /pkg/scheduler/. The concrete predicate (filtering) logic, priority (scoring) logic, and node preemption logic will each be analyzed in separate follow-up articles.
The directory structure of the scheduler pkg is as follows:
scheduler
├── algorithm                    # the scheduling algorithms
│   ├── predicates               # predicate (filtering) policies
│   ├── priorities               # priority (scoring) policies
│   ├── scheduler_interface.go   # definitions of the ScheduleAlgorithm and SchedulerExtender interfaces
│   ├── types.go                 # definitions of the types used by the algorithms
├── algorithmprovider
│   ├── defaults
│   │   ├── defaults.go          # initialization of the default algorithms, including predicate and priority policies
├── cache                        # cache used by the scheduler
│   ├── cache.go                 # schedulerCache
│   ├── interface.go
│   ├── node_info.go
│   ├── node_tree.go
├── core                         # core scheduling logic
│   ├── equivalence
│   │   ├── eqivalence.go        # caches scheduling results for equivalent pods, used mainly by the predicates
│   ├── extender.go
│   ├── generic_scheduler.go     # genericScheduler, the scheduling logic of the default scheduler
│   ├── scheduling_queue.go      # the queue that holds the pods waiting to be scheduled
├── factory
│   ├── factory.go               # mainly NewConfigFactory and NewPodInformer; watches pod events to update the scheduling queue
├── metrics
│   └── metrics.go               # mainly used by prometheus
├── scheduler.go                 # Run entry point of the pkg part (core code): Run, scheduleOne, schedule, preempt, etc.
└── volumebinder
    └── volume_binder.go         # volume binding
1. Scheduler.Run
This code lives in pkg/scheduler/scheduler.go and is the entry point of the actual scheduling logic.
// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
    if !sched.config.WaitForCacheSync() {
        return
    }
    go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
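wait.Until comes from k8s.io/apimachinery/pkg/util/wait: with a period of 0 it re-invokes sched.scheduleOne as soon as the previous call returns, until the StopEverything channel is closed. A minimal, self-contained sketch of that loop behaviour (the work function and the stop channel below are made up purely for illustration):
package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    stop := make(chan struct{})

    // Close the stop channel after one second, mimicking
    // sched.config.StopEverything being closed on shutdown.
    go func() {
        time.Sleep(time.Second)
        close(stop)
    }()

    // With a period of 0, wait.Until calls the function again as soon
    // as it returns, which is exactly how scheduleOne is driven.
    wait.Until(func() {
        fmt.Println("scheduling one pod...")
        time.Sleep(200 * time.Millisecond) // pretend to do some work
    }, 0, stop)
}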
2. Scheduler.scheduleOne
This code lives in pkg/scheduler/scheduler.go.
scheduleOne picks a suitable node for a single pod and is the core function of the scheduling logic.
The basic flow for scheduling a single pod is as follows:
- Pop the next pod to be scheduled from the podQueue of pending pods.
- Run the scheduling algorithm to pick a suitable node for that pod; the algorithm consists of the predicate (filtering) and priority (scoring) phases.
- If that fails, try the preemption mechanism: evict lower-priority pods so that the higher-priority pod can be scheduled.
- Assume the pod onto the selected node and record the assumption in the scheduler cache, so that the actual binding can be done asynchronously.
- Perform the actual binding, which writes the node name into the pod's node-related field.
The full code is as follows:
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
    pod := sched.config.NextPod()
    if pod.DeletionTimestamp != nil {
        sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
        glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
        return
    }
    glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
    // Synchronously attempt to find a fit for the pod.
    start := time.Now()
    suggestedHost, err := sched.schedule(pod)
    if err != nil {
        // schedule() may have failed because the pod would not fit on any host, so we try to
        // preempt, with the expectation that the next time the pod is tried for scheduling it
        // will fit due to the preemption. It is also possible that a different pod will schedule
        // into the resources that were preempted, but this is harmless.
        if fitError, ok := err.(*core.FitError); ok {
            preemptionStartTime := time.Now()
            sched.preempt(pod, fitError)
            metrics.PreemptionAttempts.Inc()
            metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
            metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
        }
        return
    }
    metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
    // Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
    // This allows us to keep scheduling without waiting on binding to occur.
    assumedPod := pod.DeepCopy()
    // Assume volumes first before assuming the pod.
    //
    // If all volumes are completely bound, then allBound is true and binding will be skipped.
    //
    // Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
    //
    // This function modifies 'assumedPod' if volume binding is required.
    allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
    if err != nil {
        return
    }
    // assume modifies `assumedPod` by setting NodeName=suggestedHost
    err = sched.assume(assumedPod, suggestedHost)
    if err != nil {
        return
    }
    // bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
    go func() {
        // Bind volumes first before Pod
        if !allBound {
            err = sched.bindVolumes(assumedPod)
            if err != nil {
                return
            }
        }
        err := sched.bind(assumedPod, &v1.Binding{
            ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
            Target: v1.ObjectReference{
                Kind: "Node",
                Name: suggestedHost,
            },
        })
        metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
        if err != nil {
            glog.Errorf("Internal error binding pod: (%v)", err)
        }
    }()
}
The important pieces of this code are analyzed one by one below.
3. config.NextPod
Pods waiting to be scheduled are stored in a podQueue; NextPod takes the next pod to be scheduled out of that queue.
pod := sched.config.NextPod()
if pod.DeletionTimestamp != nil {
    sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
    glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
    return
}
glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)
The concrete NextPod function is defined in the CreateFromKeys function in factory.go, as follows:
func (c *configFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
    ...
    return &scheduler.Config{
        ...
        NextPod: func() *v1.Pod {
            return c.getNextPod()
        },
        ...
    }
3.1. getNextPod
A podQueue holds the pods that need to be scheduled; the next pod to schedule is obtained by popping it from the queue.
func (c *configFactory) getNextPod() *v1.Pod {
    pod, err := c.podQueue.Pop()
    if err == nil {
        glog.V(4).Infof("About to try and schedule pod %v/%v", pod.Namespace, pod.Name)
        return pod
    }
    glog.Errorf("Error while retrieving next pod from scheduling queue: %v", err)
    return nil
}
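Pop blocks while the queue is empty and only returns once a pod is available, so getNextPod (and therefore scheduleOne) never busy-waits. The following is a minimal, self-contained sketch of such a blocking FIFO built on a condition variable; it illustrates the idea and is not the scheduler's actual scheduling_queue.go implementation:
package main

import (
    "fmt"
    "sync"
)

// blockingFIFO is a toy stand-in for the scheduler's pod queue:
// Pop blocks until an item is available.
type blockingFIFO struct {
    lock  sync.Mutex
    cond  *sync.Cond
    items []string
}

func newBlockingFIFO() *blockingFIFO {
    q := &blockingFIFO{}
    q.cond = sync.NewCond(&q.lock)
    return q
}

func (q *blockingFIFO) Add(item string) {
    q.lock.Lock()
    defer q.lock.Unlock()
    q.items = append(q.items, item)
    q.cond.Signal() // wake up one waiting Pop
}

func (q *blockingFIFO) Pop() string {
    q.lock.Lock()
    defer q.lock.Unlock()
    for len(q.items) == 0 {
        q.cond.Wait() // block until Add signals
    }
    item := q.items[0]
    q.items = q.items[1:]
    return item
}

func main() {
    q := newBlockingFIFO()
    go q.Add("default/nginx-pod")
    fmt.Println("popped:", q.Pop()) // blocks until the pod is added
}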
4. Scheduler.schedule
This code lives in pkg/scheduler/scheduler.go.
This is the core of the scheduling logic: a scheduling algorithm chooses the most suitable node for the given pod.
// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
if err != nil {
    // schedule() may have failed because the pod would not fit on any host, so we try to
    // preempt, with the expectation that the next time the pod is tried for scheduling it
    // will fit due to the preemption. It is also possible that a different pod will schedule
    // into the resources that were preempted, but this is harmless.
    if fitError, ok := err.(*core.FitError); ok {
        preemptionStartTime := time.Now()
        sched.preempt(pod, fitError)
        metrics.PreemptionAttempts.Inc()
        metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
        metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
    }
    return
}
schedule runs the scheduling algorithm and returns the best node.
// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
    host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
    if err != nil {
        pod = pod.DeepCopy()
        sched.config.Error(pod, err)
        sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "%v", err)
        sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
            Type:    v1.PodScheduled,
            Status:  v1.ConditionFalse,
            Reason:  v1.PodReasonUnschedulable,
            Message: err.Error(),
        })
        return "", err
    }
    return host, err
}
4.1. ScheduleAlgorithm
ScheduleAlgorithm is the interface for scheduling algorithms; its main implementation is genericScheduler, whose Schedule method is analyzed in the next section.
The ScheduleAlgorithm interface is defined as follows:
// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
    Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
    // Preempt receives scheduling errors for a pod and tries to create room for
    // the pod by preempting lower priority pods if possible.
    // It returns the node where preemption happened, a list of preempted pods, a
    // list of pods whose nominated node name should be removed, and error if any.
    Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
    // Predicates() returns a pointer to a map of predicate functions. This is
    // exposed for testing.
    Predicates() map[string]FitPredicate
    // Prioritizers returns a slice of priority config. This is exposed for
    // testing.
    Prioritizers() []PriorityConfig
}
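Any type implementing these four methods can serve as the scheduler's algorithm. As a purely illustrative sketch (this code does not exist in Kubernetes, and the v1.12 package paths and helper types are assumptions worth double-checking), a trivial implementation that always picks the first listed node could look like this:
package example

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/algorithm"
)

// firstNodeScheduler is a toy ScheduleAlgorithm: it "schedules" every pod
// onto the first node returned by the lister and never preempts.
type firstNodeScheduler struct{}

var _ algorithm.ScheduleAlgorithm = &firstNodeScheduler{}

func (s *firstNodeScheduler) Schedule(pod *v1.Pod, lister algorithm.NodeLister) (string, error) {
    nodes, err := lister.List()
    if err != nil {
        return "", err
    }
    if len(nodes) == 0 {
        return "", fmt.Errorf("no nodes available to schedule pod %s/%s", pod.Namespace, pod.Name)
    }
    return nodes[0].Name, nil
}

func (s *firstNodeScheduler) Preempt(pod *v1.Pod, lister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
    // This toy scheduler never preempts.
    return nil, nil, nil, nil
}

func (s *firstNodeScheduler) Predicates() map[string]algorithm.FitPredicate {
    return nil
}

func (s *firstNodeScheduler) Prioritizers() []algorithm.PriorityConfig {
    return nil
}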
5. genericScheduler.Schedule
This code lives in /pkg/scheduler/core/generic_scheduler.go.
genericScheduler.Schedule implements the basic scheduling logic: given the pod to schedule and the list of nodes, it returns the name of the chosen node on success, or an error with the reasons on failure. Scheduling is done in two steps, predicates (filtering) and priorities (scoring).
The basic flow is:
- Run basic sanity checks on the pod; currently this mainly means checking its PVCs.
- Use the findNodesThatFit predicate step to select the nodes that satisfy the scheduling constraints.
- Use the PrioritizeNodes priority step to score each node in the filtered list.
- Pick the node with the highest score from the scored list as the scheduling target.
The full code is as follows:
// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
    trace := utiltrace.New(fmt.Sprintf("Scheduling %s/%s", pod.Namespace, pod.Name))
    defer trace.LogIfLong(100 * time.Millisecond)
    if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
        return "", err
    }
    nodes, err := nodeLister.List()
    if err != nil {
        return "", err
    }
    if len(nodes) == 0 {
        return "", ErrNoNodesAvailable
    }
    // Used for all fit and priority funcs.
    err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
    if err != nil {
        return "", err
    }
    trace.Step("Computing predicates")
    startPredicateEvalTime := time.Now()
    filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
    if err != nil {
        return "", err
    }
    if len(filteredNodes) == 0 {
        return "", &FitError{
            Pod:              pod,
            NumAllNodes:      len(nodes),
            FailedPredicates: failedPredicateMap,
        }
    }
    metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
    metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
    trace.Step("Prioritizing")
    startPriorityEvalTime := time.Now()
    // When only one node after predicate, just use it.
    if len(filteredNodes) == 1 {
        metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
        return filteredNodes[0].Name, nil
    }
    metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
    priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
    if err != nil {
        return "", err
    }
    metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
    metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
    trace.Step("Selecting host")
    return g.selectHost(priorityList)
}
5.1. podPassesBasicChecks
podPassesBasicChecks performs basic sanity checks; currently these are mainly checks on the pod's PVCs.
if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
    return "", err
}
podPassesBasicChecks is implemented as follows:
// podPassesBasicChecks makes sanity checks on the pod if it can be scheduled.
func podPassesBasicChecks(pod *v1.Pod, pvcLister corelisters.PersistentVolumeClaimLister) error {
    // Check PVCs used by the pod
    namespace := pod.Namespace
    manifest := &(pod.Spec)
    for i := range manifest.Volumes {
        volume := &manifest.Volumes[i]
        if volume.PersistentVolumeClaim == nil {
            // Volume is not a PVC, ignore
            continue
        }
        pvcName := volume.PersistentVolumeClaim.ClaimName
        pvc, err := pvcLister.PersistentVolumeClaims(namespace).Get(pvcName)
        if err != nil {
            // The error has already enough context ("persistentvolumeclaim "myclaim" not found")
            return err
        }
        if pvc.DeletionTimestamp != nil {
            return fmt.Errorf("persistentvolumeclaim %q is being deleted", pvc.Name)
        }
    }
    return nil
}
5.2. findNodesThatFit
Predicates: the predicate functions decide, for each node, whether the pod can be scheduled onto it.
The implementation details of findNodesThatFit will be analyzed in a separate follow-up article.
genericScheduler.Schedule calls findNodesThatFit as follows:
trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
    return "", err
}
if len(filteredNodes) == 0 {
    return "", &FitError{
        Pod:              pod,
        NumAllNodes:      len(nodes),
        FailedPredicates: failedPredicateMap,
    }
}
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
5.3. PrioritizeNodes
Priorities: select the best node from those that passed the predicates.
It works as follows (a small weighted-score sketch follows this list):
- PrioritizeNodes ranks the nodes by running each priority function in parallel.
- Each priority function scores every node, on a scale from 0 to 10.
- 0 means the lowest priority, 10 the highest.
- Each priority function also has its own weight.
- The score a function returns for a node is multiplied by that function's weight to obtain a weighted score.
- Finally, the weighted scores from all functions are added up to obtain the node's total weighted score.
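A minimal, self-contained sketch of this weighted aggregation; the priority functions, weights, and scores below are invented for illustration and are not the real ones from the priorities package:
package main

import "fmt"

// priorityFunc scores a node from 0 to 10 for a given pod (details elided).
type priorityFunc func(node string) int

// weightedPriority pairs a priority function with its weight.
type weightedPriority struct {
    name   string
    fn     priorityFunc
    weight int
}

// totalScores computes, for every node, the sum over all priority functions
// of (score returned by the function) * (weight of the function).
func totalScores(nodes []string, priorities []weightedPriority) map[string]int {
    totals := make(map[string]int, len(nodes))
    for _, node := range nodes {
        for _, p := range priorities {
            totals[node] += p.fn(node) * p.weight
        }
    }
    return totals
}

func main() {
    nodes := []string{"node-1", "node-2"}
    priorities := []weightedPriority{
        // Made-up scores: imagine node-2 has more free resources but
        // node-1 spreads the workload better.
        {name: "LeastRequested", weight: 1, fn: func(n string) int {
            if n == "node-2" {
                return 8
            }
            return 5
        }},
        {name: "SelectorSpread", weight: 2, fn: func(n string) int {
            if n == "node-1" {
                return 7
            }
            return 3
        }},
    }
    fmt.Println(totalScores(nodes, priorities))
    // node-1: 5*1 + 7*2 = 19, node-2: 8*1 + 3*2 = 14
}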
The implementation of PrioritizeNodes will be analyzed in a separate follow-up article.
genericScheduler.Schedule calls PrioritizeNodes as follows:
trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
    metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
    return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
    return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
5.4. selectHost
Finally, the scheduler picks one of the highest-scoring nodes from the priorityList.
trace.Step("Selecting host")
return g.selectHost(priorityList)
selectHost takes the prioritized node list and picks one of the nodes with the highest score in a round-robin fashion.
The code is as follows:
// selectHost takes a prioritized list of nodes and then picks one
// in a round-robin manner from the nodes that had the highest score.
func (g *genericScheduler) selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
    if len(priorityList) == 0 {
        return "", fmt.Errorf("empty priorityList")
    }
    maxScores := findMaxScores(priorityList)
    ix := int(g.lastNodeIndex % uint64(len(maxScores)))
    g.lastNodeIndex++
    return priorityList[maxScores[ix]].Host, nil
}
5.4.1. findMaxScores
findMaxScores returns the indexes of the nodes in priorityList that have the highest Score.
// findMaxScores returns the indexes of nodes in the "priorityList" that has the highest "Score".
func findMaxScores(priorityList schedulerapi.HostPriorityList) []int {
    maxScoreIndexes := make([]int, 0, len(priorityList)/2)
    maxScore := priorityList[0].Score
    for i, hp := range priorityList {
        if hp.Score > maxScore {
            maxScore = hp.Score
            maxScoreIndexes = maxScoreIndexes[:0]
            maxScoreIndexes = append(maxScoreIndexes, i)
        } else if hp.Score == maxScore {
            maxScoreIndexes = append(maxScoreIndexes, i)
        }
    }
    return maxScoreIndexes
}
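Putting selectHost and findMaxScores together: when several nodes tie for the highest score, lastNodeIndex rotates through them, spreading pods across equally good nodes. A self-contained sketch of that behaviour, using a simplified local hostPriority type instead of schedulerapi.HostPriorityList:
package main

import "fmt"

// hostPriority is a simplified stand-in for schedulerapi.HostPriority.
type hostPriority struct {
    Host  string
    Score int
}

// findMaxScores mirrors the scheduler's helper: it returns the indexes of
// all entries that share the highest score.
func findMaxScores(list []hostPriority) []int {
    maxIndexes := []int{0}
    maxScore := list[0].Score
    for i, hp := range list[1:] {
        idx := i + 1
        if hp.Score > maxScore {
            maxScore = hp.Score
            maxIndexes = []int{idx}
        } else if hp.Score == maxScore {
            maxIndexes = append(maxIndexes, idx)
        }
    }
    return maxIndexes
}

func main() {
    list := []hostPriority{
        {Host: "node-a", Score: 19},
        {Host: "node-b", Score: 19},
        {Host: "node-c", Score: 14},
    }
    maxIndexes := findMaxScores(list)

    // Simulate selectHost being called three times: lastNodeIndex advances
    // each time, so the two top-scoring nodes are chosen in turn.
    var lastNodeIndex uint64
    for i := 0; i < 3; i++ {
        ix := int(lastNodeIndex % uint64(len(maxIndexes)))
        lastNodeIndex++
        fmt.Println("selected:", list[maxIndexes[ix]].Host) // node-a, node-b, node-a
    }
}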
6. Scheduler.preempt
If scheduling the pod fails in the predicate and priority phases, preemption is attempted. Preemption frees up resources held by lower-priority pods so that the pending higher-priority pod can be scheduled.
The implementation of Scheduler.preempt will be analyzed in a separate follow-up article.
suggestedHost, err := sched.schedule(pod)
if err != nil {
    // schedule() may have failed because the pod would not fit on any host, so we try to
    // preempt, with the expectation that the next time the pod is tried for scheduling it
    // will fit due to the preemption. It is also possible that a different pod will schedule
    // into the resources that were preempted, but this is harmless.
    if fitError, ok := err.(*core.FitError); ok {
        preemptionStartTime := time.Now()
        sched.preempt(pod, fitError)
        metrics.PreemptionAttempts.Inc()
        metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
        metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
    }
    return
}
7. Scheduler.assume
The pod is "assumed" onto the selected node and the assumption is stored in the scheduler cache, so the scheduler can continue scheduling other pods without waiting for the binding to happen; the actual binding is performed asynchronously.
// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPod := pod.DeepCopy()
// Assume volumes first before assuming the pod.
//
// If all volumes are completely bound, then allBound is true and binding will be skipped.
//
// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
//
// This function modifies 'assumedPod' if volume binding is required.
allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
if err != nil {
    return
}
// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
    return
}
The scheduler optimistically assumes the binding will succeed and sends the binding request to the apiserver in the background; if the binding fails, the scheduler immediately releases the resources allocated to the assumed pod.
The assume method is implemented as follows:
// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
    // Optimistically assume that the binding will succeed and send it to apiserver
    // in the background.
    // If the binding fails, scheduler will release resources allocated to assumed pod
    // immediately.
    assumed.Spec.NodeName = host
    // NOTE: Because the scheduler uses snapshots of SchedulerCache and the live
    // version of Ecache, updates must be written to SchedulerCache before
    // invalidating Ecache.
    if err := sched.config.SchedulerCache.AssumePod(assumed); err != nil {
        glog.Errorf("scheduler cache AssumePod failed: %v", err)
        // This is most probably result of a BUG in retrying logic.
        // We report an error here so that pod scheduling can be retried.
        // This relies on the fact that Error will check if the pod has been bound
        // to a node and if so will not add it back to the unscheduled pods queue
        // (otherwise this would cause an infinite loop).
        sched.config.Error(assumed, err)
        sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "AssumePod failed: %v", err)
        sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
            Type:    v1.PodScheduled,
            Status:  v1.ConditionFalse,
            Reason:  "SchedulerError",
            Message: err.Error(),
        })
        return err
    }
    // Optimistically assume that the binding will succeed, so we need to invalidate affected
    // predicates in equivalence cache.
    // If the binding fails, these invalidated item will not break anything.
    if sched.config.Ecache != nil {
        sched.config.Ecache.InvalidateCachedPredicateItemForPodAdd(assumed, host)
    }
    return nil
}
8. Scheduler.bind
The pod is bound to the chosen node asynchronously.
// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
    // Bind volumes first before Pod
    if !allBound {
        err = sched.bindVolumes(assumedPod)
        if err != nil {
            return
        }
    }
    err := sched.bind(assumedPod, &v1.Binding{
        ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
        Target: v1.ObjectReference{
            Kind: "Node",
            Name: suggestedHost,
        },
    })
    metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
    if err != nil {
        glog.Errorf("Internal error binding pod: (%v)", err)
    }
}()
bind is implemented as follows:
// bind binds a pod to a given node defined in a binding object. We expect this to run asynchronously, so we
// handle binding metrics internally.
func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
    bindingStart := time.Now()
    // If binding succeeded then PodScheduled condition will be updated in apiserver so that
    // it's atomic with setting host.
    err := sched.config.GetBinder(assumed).Bind(b)
    if err := sched.config.SchedulerCache.FinishBinding(assumed); err != nil {
        glog.Errorf("scheduler cache FinishBinding failed: %v", err)
    }
    if err != nil {
        glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
        if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
            glog.Errorf("scheduler cache ForgetPod failed: %v", err)
        }
        sched.config.Error(assumed, err)
        sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "Binding rejected: %v", err)
        sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
            Type:   v1.PodScheduled,
            Status: v1.ConditionFalse,
            Reason: "BindingRejected",
        })
        return err
    }
    metrics.BindingLatency.Observe(metrics.SinceInMicroseconds(bindingStart))
    metrics.SchedulingLatency.WithLabelValues(metrics.Binding).Observe(metrics.SinceInSeconds(bindingStart))
    sched.config.Recorder.Eventf(assumed, v1.EventTypeNormal, "Scheduled", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, b.Target.Name)
    return nil
}
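GetBinder returns a Binder whose Bind method sends the v1.Binding object to the apiserver, and the apiserver then writes the target node name onto the pod. As a hedged sketch of what such a binder can look like when built on the client-go of that era (illustrative only; the real binder wired up in factory.go may differ in details):
package example

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
)

// clientBinder binds pods by sending a Binding object to the pod's
// "binding" subresource through the Kubernetes clientset.
type clientBinder struct {
    client kubernetes.Interface
}

// Bind posts the binding to the apiserver; on success the apiserver sets
// pod.Spec.NodeName to binding.Target.Name.
func (b *clientBinder) Bind(binding *v1.Binding) error {
    return b.client.CoreV1().Pods(binding.Namespace).Bind(binding)
}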
9. Summary
This article analyzed how a single pod is scheduled. The overall flow is:
- Pop the next pod to be scheduled from the podQueue of pending pods.
- Run the scheduling algorithm to pick a suitable node for that pod; the algorithm consists of the predicate (filtering) and priority (scoring) phases.
- If that fails, try the preemption mechanism: evict lower-priority pods so that the higher-priority pod can be scheduled.
- Assume the pod onto the selected node and record the assumption in the scheduler cache, so that the actual binding can be done asynchronously.
- Perform the actual binding, which writes the node name into the pod's node-related field.
The core part is the selection of a node by the scheduling algorithm, i.e. the implementation of genericScheduler.Schedule, which consists of the predicate and priority phases.
The basic flow of genericScheduler.Schedule is:
- Run basic sanity checks on the pod; currently this mainly means checking its PVCs.
- Use the findNodesThatFit predicate step to select the nodes that satisfy the scheduling constraints.
- Use the PrioritizeNodes priority step to score each node in the filtered list.
- Pick the node with the highest score from the scored list as the scheduling target.
References: