kube-scheduler Source Code Analysis (Part 3): scheduleOne

The original version of this article is on my personal blog: https://www.huweihuang.com/kubernetes-notes/code-analysis/kube-scheduler/scheduleOne.html

The following code analysis is based on Kubernetes v1.12.0.

This article analyzes the basic scheduling flow in /pkg/scheduler/. The concrete predicate (filtering) logic, priority (scoring) logic, and node preemption logic will each be analyzed separately in later articles.

The directory layout of the scheduler package is as follows:

scheduler
├── algorithm         # the scheduling algorithms
│   ├── predicates    # predicate (filtering) policies
│   ├── priorities    # priority (scoring) policies
│   ├── scheduler_interface.go    # defines the ScheduleAlgorithm and SchedulerExtender interfaces
│   ├── types.go      # definitions of the types used by the algorithms
├── algorithmprovider
│   ├── defaults
│   │   ├── defaults.go    # initialization of the default algorithms, i.e. the predicate and priority policies
├── cache      # the cache used by the scheduler
│   ├── cache.go    # schedulerCache
│   ├── interface.go
│   ├── node_info.go
│   ├── node_tree.go
├── core       # core scheduling logic
│   ├── equivalence
│   │   ├── eqivalence.go       # caches scheduling results for equivalent pods, mainly used by the predicates
│   ├── extender.go
│   ├── generic_scheduler.go    # genericScheduler, the default scheduler's scheduling logic
│   ├── scheduling_queue.go     # the scheduling queue, which stores the pods waiting to be scheduled
├── factory
│   ├── factory.go   # NewConfigFactory, NewPodInformer, etc.; watches pod events and updates the scheduling queue
├── metrics
│   └── metrics.go   # metrics exposed to Prometheus
├── scheduler.go # the Run entry point of this package (core code): Run, scheduleOne, schedule, preempt, etc.
└── volumebinder
    └── volume_binder.go   # volume binding

1. Scheduler.Run

This code is located in pkg/scheduler/scheduler.go.

This is the entry point of the actual scheduling logic.

// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
	if !sched.config.WaitForCacheSync() {
		return
	}

	go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
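
The call go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything) drives scheduleOne in an endless loop: with a period of 0, the next invocation starts as soon as the previous one returns, and the loop ends when StopEverything is closed. The following is a minimal, standalone sketch of this wait.Until pattern; only the wait.Until call itself is the real API, the messages and durations are made up for illustration:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stop := make(chan struct{})

	// Close the stop channel after roughly one second,
	// playing the role of StopEverything.
	go func() {
		time.Sleep(time.Second)
		close(stop)
	}()

	// With a period of 0, wait.Until re-invokes the function as soon as
	// the previous invocation returns, until the stop channel is closed.
	wait.Until(func() {
		fmt.Println("one scheduling cycle")
		time.Sleep(200 * time.Millisecond) // stand-in for scheduleOne's work
	}, 0, stop)
}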

2. Scheduler.scheduleOne

This code is located in pkg/scheduler/scheduler.go.

scheduleOne selects a suitable node for a single pod and is the core function of the scheduling logic.

The basic flow for scheduling a single pod is as follows:

  1. Pop the next pod to be scheduled from the podQueue.
  2. Run the scheduling algorithm to select a suitable node for the pod; the algorithm consists of the predicate (filtering) step and the priority (scoring) step.
  3. If scheduling fails, attempt preemption: evict lower-priority pods so that the higher-priority pod can be scheduled.
  4. "Assume" the pod onto the selected node and record it in the scheduler cache, so that the actual binding can be performed asynchronously.
  5. Perform the actual binding, which writes the node name into the pod's node-related field (NodeName).

The full code is as follows:

// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
	pod := sched.config.NextPod()
	if pod.DeletionTimestamp != nil {
		sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		return
	}

	glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)

	// Synchronously attempt to find a fit for the pod.
	start := time.Now()
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		// schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
			preemptionStartTime := time.Now()
			sched.preempt(pod, fitError)
			metrics.PreemptionAttempts.Inc()
			metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
			metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
		}
		return
	}
	metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	assumedPod := pod.DeepCopy()

	// Assume volumes first before assuming the pod.
	//
	// If all volumes are completely bound, then allBound is true and binding will be skipped.
	//
	// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
	//
	// This function modifies 'assumedPod' if volume binding is required.
	allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
	if err != nil {
		return
	}

	// assume modifies `assumedPod` by setting NodeName=suggestedHost
	err = sched.assume(assumedPod, suggestedHost)
	if err != nil {
		return
	}
	// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
	go func() {
		// Bind volumes first before Pod
		if !allBound {
			err = sched.bindVolumes(assumedPod)
			if err != nil {
				return
			}
		}

		err := sched.bind(assumedPod, &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
			Target: v1.ObjectReference{
				Kind: "Node",
				Name: suggestedHost,
			},
		})
		metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
		if err != nil {
			glog.Errorf("Internal error binding pod: (%v)", err)
		}
	}()
}

The important parts are analyzed one by one below.

3. config.NextPod

Pods waiting to be scheduled are stored in a podQueue, and NextPod pops the next pod to be scheduled from it.

pod := sched.config.NextPod()
if pod.DeletionTimestamp != nil {
	sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
	glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
	return
}

glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)

The NextPod function is defined in CreateFromKeys in factory.go, as follows:

func (c *configFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
	...
	return &scheduler.Config{
		...
		NextPod: func() *v1.Pod {
			return c.getNextPod()
		},
		...
	}
}

3.1. getNextPod

The pods waiting to be scheduled are stored in a podQueue; getNextPod pops the next one from the queue.

func (c *configFactory) getNextPod() *v1.Pod {
	pod, err := c.podQueue.Pop()
	if err == nil {
		glog.V(4).Infof("About to try and schedule pod %v/%v", pod.Namespace, pod.Name)
		return pod
	}
	glog.Errorf("Error while retrieving next pod from scheduling queue: %v", err)
	return nil
}
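
The real scheduling queue (see core/scheduling_queue.go) is a FIFO or priority queue whose Pop blocks until a pod is available. The toy, channel-backed queue below is not that implementation; it is only a sketch to illustrate the blocking Pop semantics that getNextPod relies on, and names like toyPodQueue are made up:

package main

import "fmt"

// toyPodQueue is an illustrative stand-in for the scheduler's pod queue.
// It is NOT the real SchedulingQueue; a buffered channel gives us
// blocking Pop semantics with very little code.
type toyPodQueue struct {
	ch chan string // the real queue stores *v1.Pod; a name is enough here
}

func newToyPodQueue(size int) *toyPodQueue {
	return &toyPodQueue{ch: make(chan string, size)}
}

// Add enqueues a pod that needs to be scheduled.
func (q *toyPodQueue) Add(podName string) {
	q.ch <- podName
}

// Pop blocks until a pod is available, then returns it,
// mirroring how podQueue.Pop blocks inside getNextPod.
func (q *toyPodQueue) Pop() string {
	return <-q.ch
}

func main() {
	q := newToyPodQueue(10)
	q.Add("default/nginx-1")
	q.Add("default/nginx-2")

	fmt.Println("next pod to schedule:", q.Pop())
	fmt.Println("next pod to schedule:", q.Pop())
}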

4. Scheduler.schedule

This code is located in pkg/scheduler/scheduler.go.

This is the core of the scheduling logic: the configured algorithm selects the most suitable node for the given pod.

// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
if err != nil {
	// schedule() may have failed because the pod would not fit on any host, so we try to
	// preempt, with the expectation that the next time the pod is tried for scheduling it
	// will fit due to the preemption. It is also possible that a different pod will schedule
	// into the resources that were preempted, but this is harmless.
	if fitError, ok := err.(*core.FitError); ok {
		preemptionStartTime := time.Now()
		sched.preempt(pod, fitError)
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
		metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
	}
	return
}

schedule runs the scheduling algorithm and returns the suggested host.

// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
	host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
	if err != nil {
		pod = pod.DeepCopy()
		sched.config.Error(pod, err)
		sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "%v", err)
		sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
			Type:    v1.PodScheduled,
			Status:  v1.ConditionFalse,
			Reason:  v1.PodReasonUnschedulable,
			Message: err.Error(),
		})
		return "", err
	}
	return host, err
}

4.1. ScheduleAlgorithm

ScheduleAlgorithm is the interface for scheduling algorithms; its main implementation is genericScheduler. genericScheduler.Schedule is analyzed later in this article.

The ScheduleAlgorithm interface is defined as follows:

// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
	Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
	// Preempt receives scheduling errors for a pod and tries to create room for
	// the pod by preempting lower priority pods if possible.
	// It returns the node where preemption happened, a list of preempted pods, a
	// list of pods whose nominated node name should be removed, and error if any.
	Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
	// Predicates() returns a pointer to a map of predicate functions. This is
	// exposed for testing.
	Predicates() map[string]FitPredicate
	// Prioritizers returns a slice of priority config. This is exposed for
	// testing.
	Prioritizers() []PriorityConfig
}

5. genericScheduler.Schedule

This code is located in /pkg/scheduler/core/generic_scheduler.go.

genericScheduler.Schedule implements the basic scheduling logic: given the pod to be scheduled and the list of nodes, it returns the name of the chosen node on success, or an error with the reasons on failure. The scheduling is done in two steps, predicates (filtering) and priorities (scoring).

The basic flow is:

  1. Run basic sanity checks on the pod; currently this mainly checks the PVCs it references.
  2. Use the predicate policies (findNodesThatFit) to select the nodes that satisfy the scheduling constraints.
  3. Use the priority policies (PrioritizeNodes) to score the filtered nodes.
  4. Choose the node with the highest score as the scheduling result.

The full code is as follows:

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
	trace := utiltrace.New(fmt.Sprintf("Scheduling %s/%s", pod.Namespace, pod.Name))
	defer trace.LogIfLong(100 * time.Millisecond)

	if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
		return "", err
	}

	nodes, err := nodeLister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", ErrNoNodesAvailable
	}

	// Used for all fit and priority funcs.
	err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return "", err
	}

	trace.Step("Computing predicates")
	startPredicateEvalTime := time.Now()
	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return "", err
	}

	if len(filteredNodes) == 0 {
		return "", &FitError{
			Pod:              pod,
			NumAllNodes:      len(nodes),
			FailedPredicates: failedPredicateMap,
		}
	}
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))

	trace.Step("Prioritizing")
	startPriorityEvalTime := time.Now()
	// When only one node after predicate, just use it.
	if len(filteredNodes) == 1 {
		metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
		return filteredNodes[0].Name, nil
	}

	metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
	priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
	if err != nil {
		return "", err
	}
	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))

	trace.Step("Selecting host")
	return g.selectHost(priorityList)
}

5.1. podPassesBasicChecks

podPassesBasicChecks performs basic sanity checks; currently it mainly checks the PVCs used by the pod.

if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
	return "", err
}

podPassesBasicChecks is implemented as follows:

// podPassesBasicChecks makes sanity checks on the pod if it can be scheduled.
func podPassesBasicChecks(pod *v1.Pod, pvcLister corelisters.PersistentVolumeClaimLister) error {
	// Check PVCs used by the pod
	namespace := pod.Namespace
	manifest := &(pod.Spec)
	for i := range manifest.Volumes {
		volume := &manifest.Volumes[i]
		if volume.PersistentVolumeClaim == nil {
			// Volume is not a PVC, ignore
			continue
		}
		pvcName := volume.PersistentVolumeClaim.ClaimName
		pvc, err := pvcLister.PersistentVolumeClaims(namespace).Get(pvcName)
		if err != nil {
			// The error has already enough context ("persistentvolumeclaim "myclaim" not found")
			return err
		}

		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("persistentvolumeclaim %q is being deleted", pvc.Name)
		}
	}

	return nil
}

5.2. findNodesThatFit

Predicates (filtering): the predicate functions decide, for each node, whether it is suitable for the pod.

The implementation details of findNodesThatFit will be analyzed in a later article; a rough sketch of the idea is given after the call site below.

findNodesThatFit is called from genericScheduler.Schedule as follows:

trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
	return "", err
}

if len(filteredNodes) == 0 {
	return "", &FitError{
		Pod:              pod,
		NumAllNodes:      len(nodes),
		FailedPredicates: failedPredicateMap,
	}
}
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
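
As a rough illustration of what the predicate step does, the sketch below uses simplified, made-up local types (node, pod, predicate); the real findNodesThatFit works on *v1.Pod and *v1.Node, runs the registered FitPredicate functions in parallel across nodes, and also honors extenders. The essence is to keep the nodes that pass every predicate and to record the failure reasons of those that do not, which is what feeds the FitError above:

package main

import "fmt"

// Simplified, illustrative types; the real scheduler uses *v1.Pod,
// *v1.Node, FitPredicate, and a FailedPredicateMap.
type node struct {
	name    string
	freeCPU int // millicores
}

type pod struct {
	name       string
	requestCPU int
}

// predicate returns whether the node fits and, if not, a reason.
type predicate func(p pod, n node) (bool, string)

// findNodesThatFitSketch keeps the nodes that pass every predicate
// and records the failure reasons of the nodes that do not.
func findNodesThatFitSketch(p pod, nodes []node, preds []predicate) ([]node, map[string][]string) {
	fits := []node{}
	failed := map[string][]string{}
	for _, n := range nodes {
		ok := true
		for _, pred := range preds {
			if pass, reason := pred(p, n); !pass {
				ok = false
				failed[n.name] = append(failed[n.name], reason)
			}
		}
		if ok {
			fits = append(fits, n)
		}
	}
	return fits, failed
}

func main() {
	enoughCPU := func(p pod, n node) (bool, string) {
		if n.freeCPU >= p.requestCPU {
			return true, ""
		}
		return false, "Insufficient cpu"
	}

	nodes := []node{{"node-1", 500}, {"node-2", 2000}}
	fits, failed := findNodesThatFitSketch(pod{"web", 1000}, nodes, []predicate{enoughCPU})
	fmt.Println("fit:", fits)
	fmt.Println("failed:", failed)
}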

5.3. PrioritizeNodes

Priorities (scoring): select the best node from the ones that passed the predicates.

It works as follows:

  • PrioritizeNodes ranks the nodes by running each priority function in parallel.
  • Each priority function gives every node a score in the range 0-10.
  • 0 means the lowest priority and 10 the highest.
  • Each priority function also has its own weight.
  • The score returned by a priority function is multiplied by its weight to get a weighted score.
  • Finally, the weighted scores from all priority functions are summed to get each node's total score.

The implementation of PrioritizeNodes will be analyzed in a later article; a rough sketch of the weighted-scoring idea follows the call site below.

PrioritizeNodes is called from genericScheduler.Schedule as follows:

trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
	return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
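
As a rough sketch of the weighted-scoring idea described above (simplified, made-up local types; the real PrioritizeNodes runs the priority functions in parallel and aggregates schedulerapi.HostPriorityList results from priority functions and extenders):

package main

import "fmt"

// scoreFunc scores one node for one pod on a 0-10 scale,
// mirroring the contract of a priority function.
type scoreFunc func(nodeName string) int

// weightedPriority pairs a priority function with its weight.
type weightedPriority struct {
	score  scoreFunc
	weight int
}

// prioritizeSketch computes, for every node, the sum over all priority
// functions of (score * weight), which is how the final ranking is built.
func prioritizeSketch(nodes []string, prios []weightedPriority) map[string]int {
	total := map[string]int{}
	for _, n := range nodes {
		for _, p := range prios {
			total[n] += p.score(n) * p.weight
		}
	}
	return total
}

func main() {
	// Two illustrative priority functions with different weights.
	leastRequested := weightedPriority{score: func(n string) int {
		if n == "node-2" {
			return 8
		}
		return 5
	}, weight: 1}
	balanced := weightedPriority{score: func(n string) int { return 6 }, weight: 2}

	fmt.Println(prioritizeSketch([]string{"node-1", "node-2"},
		[]weightedPriority{leastRequested, balanced}))
	// node-1: 5*1 + 6*2 = 17, node-2: 8*1 + 6*2 = 20
}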

5.4. selectHost

Finally, the scheduler selects the node with the highest score from the priorityList.

trace.Step("Selecting host")
return g.selectHost(priorityList)

selectHost takes the prioritized list of nodes and picks one of the highest-scoring nodes in a round-robin manner.

The code is as follows:

// selectHost takes a prioritized list of nodes and then picks one
// in a round-robin manner from the nodes that had the highest score.
func (g *genericScheduler) selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
	if len(priorityList) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}

	maxScores := findMaxScores(priorityList)
	ix := int(g.lastNodeIndex % uint64(len(maxScores)))
	g.lastNodeIndex++

	return priorityList[maxScores[ix]].Host, nil
}

5.4.1. findMaxScores

findMaxScores returns the indexes of the nodes in the priorityList that have the highest Score.

// findMaxScores returns the indexes of nodes in the "priorityList" that has the highest "Score".
func findMaxScores(priorityList schedulerapi.HostPriorityList) []int {
	maxScoreIndexes := make([]int, 0, len(priorityList)/2)
	maxScore := priorityList[0].Score
	for i, hp := range priorityList {
		if hp.Score > maxScore {
			maxScore = hp.Score
			maxScoreIndexes = maxScoreIndexes[:0]
			maxScoreIndexes = append(maxScoreIndexes, i)
		} else if hp.Score == maxScore {
			maxScoreIndexes = append(maxScoreIndexes, i)
		}
	}
	return maxScoreIndexes
}
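
A small worked example of the selection logic, using a simplified hostPriority type in place of schedulerapi.HostPriority: findMaxScores collects the indexes of all top-scoring nodes, and the round-robin counter in selectHost spreads repeated picks across the tied nodes:

package main

import "fmt"

// hostPriority is a simplified stand-in for schedulerapi.HostPriority.
type hostPriority struct {
	Host  string
	Score int
}

// findMaxScores mirrors the logic shown above: it returns the indexes
// of all entries that share the highest score.
func findMaxScores(list []hostPriority) []int {
	idx := []int{}
	maxScore := list[0].Score
	for i, hp := range list {
		if hp.Score > maxScore {
			maxScore = hp.Score
			idx = idx[:0]
		}
		if hp.Score == maxScore {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	list := []hostPriority{
		{"node-1", 17}, {"node-2", 20}, {"node-3", 20},
	}
	max := findMaxScores(list) // indexes 1 and 2 are tied at 20

	// selectHost keeps a lastNodeIndex counter; taking it modulo the
	// number of tied nodes alternates between node-2 and node-3.
	var lastNodeIndex uint64
	for i := 0; i < 4; i++ {
		ix := int(lastNodeIndex % uint64(len(max)))
		lastNodeIndex++
		fmt.Println("selected:", list[max[ix]].Host)
	}
}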

6. Scheduler.preempt

If both the predicate and priority steps fail to place the pod, preemption is attempted: resources occupied by lower-priority pods are freed for the pending higher-priority pod.

The implementation of Scheduler.preempt will be analyzed in a later article; a conceptual sketch follows the call site below.

suggestedHost, err := sched.schedule(pod)
if err != nil {
	// schedule() may have failed because the pod would not fit on any host, so we try to
	// preempt, with the expectation that the next time the pod is tried for scheduling it
	// will fit due to the preemption. It is also possible that a different pod will schedule
	// into the resources that were preempted, but this is harmless.
	if fitError, ok := err.(*core.FitError); ok {
		preemptionStartTime := time.Now()
		sched.preempt(pod, fitError)
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
		metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
	}
	return
}
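
Purely as a conceptual sketch of the preemption idea (simplified, made-up types; the real Preempt re-runs the predicates after removing candidate victims, respects PodDisruptionBudgets, nominates the chosen node for the pending pod, and clears lower-priority nominated pods, as its interface comment above states), the core of it is finding the lowest-priority victims whose eviction would make enough room for the pending pod:

package main

import (
	"fmt"
	"sort"
)

// Simplified, illustrative types; the real preemption logic works on
// *v1.Pod / *v1.Node and re-checks the predicates after removing victims.
type simplePod struct {
	name     string
	priority int
	cpu      int // millicores requested
}

type simpleNode struct {
	name     string
	capacity int
	pods     []simplePod
}

// pickVictims returns the lowest-priority pods on the node whose removal
// would free enough CPU for the pending pod, or nil if even evicting
// every lower-priority pod would not be enough.
func pickVictims(pending simplePod, n simpleNode) []simplePod {
	used := 0
	lower := []simplePod{}
	for _, p := range n.pods {
		used += p.cpu
		if p.priority < pending.priority {
			lower = append(lower, p)
		}
	}
	free := n.capacity - used
	// Evict lowest priority first, until the pending pod fits.
	sort.Slice(lower, func(i, j int) bool { return lower[i].priority < lower[j].priority })
	victims := []simplePod{}
	for _, p := range lower {
		if free >= pending.cpu {
			break
		}
		victims = append(victims, p)
		free += p.cpu
	}
	if free < pending.cpu {
		return nil
	}
	return victims
}

func main() {
	node := simpleNode{name: "node-1", capacity: 2000, pods: []simplePod{
		{"batch-low", 0, 1500}, {"web-mid", 100, 400},
	}}
	pending := simplePod{"critical", 1000, 1000}
	fmt.Println("victims:", pickVictims(pending, node))
}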

7. Scheduler.assume

The pod is "assumed" onto the selected node and stored in the scheduler cache, so that the scheduler can keep scheduling without waiting for the binding to happen; the actual binding is performed asynchronously.

// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPod := pod.DeepCopy()

// Assume volumes first before assuming the pod.
//
// If all volumes are completely bound, then allBound is true and binding will be skipped.
//
// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
//
// This function modifies 'assumedPod' if volume binding is required.
allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
if err != nil {
	return
}

// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
	return
}

The scheduler optimistically assumes that the binding will succeed and sends it to the apiserver in the background; if the binding fails, the scheduler immediately releases the resources allocated to the assumed pod.

The assume method is implemented as follows:

// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
	// Optimistically assume that the binding will succeed and send it to apiserver
	// in the background.
	// If the binding fails, scheduler will release resources allocated to assumed pod
	// immediately.
	assumed.Spec.NodeName = host
	// NOTE: Because the scheduler uses snapshots of SchedulerCache and the live
	// version of Ecache, updates must be written to SchedulerCache before
	// invalidating Ecache.
	if err := sched.config.SchedulerCache.AssumePod(assumed); err != nil {
		glog.Errorf("scheduler cache AssumePod failed: %v", err)

		// This is most probably result of a BUG in retrying logic.
		// We report an error here so that pod scheduling can be retried.
		// This relies on the fact that Error will check if the pod has been bound
		// to a node and if so will not add it back to the unscheduled pods queue
		// (otherwise this would cause an infinite loop).
		sched.config.Error(assumed, err)
		sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "AssumePod failed: %v", err)
		sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
			Type:    v1.PodScheduled,
			Status:  v1.ConditionFalse,
			Reason:  "SchedulerError",
			Message: err.Error(),
		})
		return err
	}

	// Optimistically assume that the binding will succeed, so we need to invalidate affected
	// predicates in equivalence cache.
	// If the binding fails, these invalidated item will not break anything.
	if sched.config.Ecache != nil {
		sched.config.Ecache.InvalidateCachedPredicateItemForPodAdd(assumed, host)
	}
	return nil
}

8. Scheduler.bind

The pod is bound to the selected node asynchronously.

// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
	// Bind volumes first before Pod
	if !allBound {
		err = sched.bindVolumes(assumedPod)
		if err != nil {
			return
		}
	}
	err := sched.bind(assumedPod, &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
		Target: v1.ObjectReference{
			Kind: "Node",
			Name: suggestedHost,
		},
	})
	metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
	if err != nil {
		glog.Errorf("Internal error binding pod: (%v)", err)
	}
}()

bind is implemented as follows:

// bind binds a pod to a given node defined in a binding object.  We expect this to run asynchronously, so we
// handle binding metrics internally.
func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
	bindingStart := time.Now()
	// If binding succeeded then PodScheduled condition will be updated in apiserver so that
	// it's atomic with setting host.
	err := sched.config.GetBinder(assumed).Bind(b)
	if err := sched.config.SchedulerCache.FinishBinding(assumed); err != nil {
		glog.Errorf("scheduler cache FinishBinding failed: %v", err)
	}
	if err != nil {
		glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
		if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
			glog.Errorf("scheduler cache ForgetPod failed: %v", err)
		}
		sched.config.Error(assumed, err)
		sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "Binding rejected: %v", err)
		sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
			Type:   v1.PodScheduled,
			Status: v1.ConditionFalse,
			Reason: "BindingRejected",
		})
		return err
	}

	metrics.BindingLatency.Observe(metrics.SinceInMicroseconds(bindingStart))
	metrics.SchedulingLatency.WithLabelValues(metrics.Binding).Observe(metrics.SinceInSeconds(bindingStart))
	sched.config.Recorder.Eventf(assumed, v1.EventTypeNormal, "Scheduled", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, b.Target.Name)
	return nil
}

9. Summary

This article analyzed how a single pod is scheduled. The flow is as follows:

  1. Pop the next pod to be scheduled from the podQueue.
  2. Run the scheduling algorithm to select a suitable node for the pod; the algorithm consists of the predicate (filtering) step and the priority (scoring) step.
  3. If scheduling fails, attempt preemption: evict lower-priority pods so that the higher-priority pod can be scheduled.
  4. "Assume" the pod onto the selected node and record it in the scheduler cache, so that the actual binding can be performed asynchronously.
  5. Perform the actual binding, which writes the node name into the pod's node-related field (NodeName).

The core part is selecting a node through the scheduling algorithm, i.e. the implementation of genericScheduler.Schedule, which consists of the predicate and the priority steps.

The basic flow of genericScheduler.Schedule is:

  1. Run basic sanity checks on the pod; currently this mainly checks the PVCs it references.
  2. Use the predicate policies (findNodesThatFit) to select the nodes that satisfy the scheduling constraints.
  3. Use the priority policies (PrioritizeNodes) to score the filtered nodes.
  4. Choose the node with the highest score as the scheduling result.
