kube-scheduler Source Code Analysis (Part 4): findNodesThatFit

Personal blog version of this article: https://www.huweihuang.com/kubernetes-notes/code-analysis/kube-scheduler/findNodesThatFit.html

The following code analysis is based on kubernetes v1.12.0.

This article analyzes the predicate (filtering) step of the scheduling logic, i.e. the first phase that selects the nodes on which the pod can be scheduled.

1. Call Entry Point

In the predicate phase, each node is run through the predicate functions to determine whether it is suitable for the pod.

The call to findNodesThatFit from genericScheduler.Schedule is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
	...
	// list all nodes
	nodes, err := nodeLister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", ErrNoNodesAvailable
	}

	// Used for all fit and priority funcs.
	err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return "", err
	}

	trace.Step("Computing predicates")
	startPredicateEvalTime := time.Now()
	// call findNodesThatFit to filter out the feasible nodes
	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return "", err
	}

	if len(filteredNodes) == 0 {
		return "", &FitError{
			Pod:              pod,
			NumAllNodes:      len(nodes),
			FailedPredicates: failedPredicateMap,
		}
	}
	// metrics
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
	...
}  

The core call:

// call findNodesThatFit to filter out the feasible nodes
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)

2. findNodesThatFit

findNodesThatFit filters the nodes with the given predicate functions: each node is passed through the predicates to determine whether it fits the pod.

findNodesThatFit takes the pod being scheduled and the current node list as input, and returns the list of feasible nodes, a map of failed predicates per node, and an error.

The basic flow of findNodesThatFit is as follows:

  1. Compute the number of feasible nodes to find, which is used as the capacity of the feasible-node slice, so that a very large cluster does not force the scheduler to examine every node.
  2. Repeatedly fetch the next node from the NodeTree and check whether it satisfies the pod's scheduling requirements.
  3. Run the previously registered predicate functions to decide whether the current node fits the pod.
  4. Finally return the list of nodes that satisfy the scheduling requirements, which is fed into the subsequent priority (scoring) step.

The complete code of findNodesThatFit is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
   var filtered []*v1.Node
   failedPredicateMap := FailedPredicateMap{}

   if len(g.predicates) == 0 {
      filtered = nodes
   } else {
      allNodes := int32(g.cache.NodeTree().NumNodes)
      numNodesToFind := g.numFeasibleNodesToFind(allNodes)

      // Create filtered list with enough space to avoid growing it
      // and allow assigning.
      filtered = make([]*v1.Node, numNodesToFind)
      errs := errors.MessageCountMap{}
      var (
         predicateResultLock sync.Mutex
         filteredLen         int32
         equivClass          *equivalence.Class
      )

      ctx, cancel := context.WithCancel(context.Background())

      // We can use the same metadata producer for all nodes.
      meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

      if g.equivalenceCache != nil {
         // getEquivalenceClassInfo will return immediately if no equivalence pod found
         equivClass = equivalence.NewClass(pod)
      }

      checkNode := func(i int) {
         var nodeCache *equivalence.NodeCache
         nodeName := g.cache.NodeTree().Next()
         if g.equivalenceCache != nil {
            nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
         }
         fits, failedPredicates, err := podFitsOnNode(
            pod,
            meta,
            g.cachedNodeInfoMap[nodeName],
            g.predicates,
            g.cache,
            nodeCache,
            g.schedulingQueue,
            g.alwaysCheckAllPredicates,
            equivClass,
         )
         if err != nil {
            predicateResultLock.Lock()
            errs[err.Error()]++
            predicateResultLock.Unlock()
            return
         }
         if fits {
            length := atomic.AddInt32(&filteredLen, 1)
            if length > numNodesToFind {
               cancel()
               atomic.AddInt32(&filteredLen, -1)
            } else {
               filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
            }
         } else {
            predicateResultLock.Lock()
            failedPredicateMap[nodeName] = failedPredicates
            predicateResultLock.Unlock()
         }
      }

      // Stops searching for more nodes once the configured number of feasible nodes
      // are found.
      workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

      filtered = filtered[:filteredLen]
      if len(errs) > 0 {
         return []*v1.Node{}, FailedPredicateMap{}, errors.CreateAggregateFromMessageCountMap(errs)
      }
   }

   if len(filtered) > 0 && len(g.extenders) != 0 {
      for _, extender := range g.extenders {
         if !extender.IsInterested(pod) {
            continue
         }
         filteredList, failedMap, err := extender.Filter(pod, filtered, g.cachedNodeInfoMap)
         if err != nil {
            if extender.IsIgnorable() {
               glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
                  extender, err)
               continue
            } else {
               return []*v1.Node{}, FailedPredicateMap{}, err
            }
         }

         for failedNodeName, failedMsg := range failedMap {
            if _, found := failedPredicateMap[failedNodeName]; !found {
               failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
            }
            failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
         }
         filtered = filteredList
         if len(filtered) == 0 {
            break
         }
      }
   }
   return filtered, failedPredicateMap, nil
}

The following analyzes findNodesThatFit section by section.

3. numFeasibleNodesToFind

findNodesThatFit first determines, from the total node count, how many feasible nodes it needs to find. numFeasibleNodesToFind exists mainly to keep scheduling efficient when the cluster has a large number of nodes (more than 100): the search can stop early instead of checking every node.

allNodes := int32(g.cache.NodeTree().NumNodes)
numNodesToFind := g.numFeasibleNodesToFind(allNodes)

// Create filtered list with enough space to avoid growing it
// and allow assigning.
filtered = make([]*v1.Node, numNodesToFind)

The basic flow of numFeasibleNodesToFind is as follows (a small worked example follows the code below):

  • If the total number of nodes is smaller than minFeasibleNodesToFind (currently 100 by default), or percentageOfNodesToScore is not in the range (0, 100), return the total node count.
  • Otherwise take the configured percentage of the total node count; if that number is still smaller than minFeasibleNodesToFind, return minFeasibleNodesToFind.
  • If the percentage-based number is larger than minFeasibleNodesToFind, return that number of nodes.

// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore <= 0 ||
		g.percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * g.percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}
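
To make the cutoff concrete, here is a small self-contained sketch (this is not scheduler source code; the node counts and percentages are made-up sample values, and minFeasibleNodesToFind is hard-coded to the v1.12 default of 100) that reproduces the same arithmetic:

package main

import "fmt"

// minFeasibleNodesToFind mirrors the scheduler's default lower bound of 100.
const minFeasibleNodesToFind = 100

// numFeasibleNodesToFind reproduces the logic shown above for illustration.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore <= 0 ||
		percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(80, 50))   // 80: fewer than 100 nodes, so check them all
	fmt.Println(numFeasibleNodesToFind(5000, 50)) // 2500: stop after 50% of 5000 feasible nodes
	fmt.Println(numFeasibleNodesToFind(150, 20))  // 100: 150*20/100=30 is below the floor of 100
}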

4. checkNode

checkNode is the function that checks whether a node meets the requirements; the core function it calls is podFitsOnNode. checkNode is then executed concurrently through workqueue.

The main flow of checkNode is as follows:

  1. Fetch the next node from the nodeTree in the cache.
  2. Pass the current node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, append it to the feasible-node slice filtered.
  4. If the node does not fit, add it to the failure map and record the reasons.
  5. Execute checkNode concurrently via workqueue.ParallelizeUntil; once the configured number of feasible nodes has been found, the search for more nodes stops.

checkNode := func(i int) {
	var nodeCache *equivalence.NodeCache
	nodeName := g.cache.NodeTree().Next()
	if g.equivalenceCache != nil {
		nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
	}
	fits, failedPredicates, err := podFitsOnNode(
		pod,
		meta,
		g.cachedNodeInfoMap[nodeName],
		g.predicates,
		g.cache,
		nodeCache,
		g.schedulingQueue,
		g.alwaysCheckAllPredicates,
		equivClass,
	)
	if err != nil {
		predicateResultLock.Lock()
		errs[err.Error()]++
		predicateResultLock.Unlock()
		return
	}
	if fits {
		length := atomic.AddInt32(&filteredLen, 1)
		if length > numNodesToFind {
			cancel()
			atomic.AddInt32(&filteredLen, -1)
		} else {
			filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
		}
	} else {
		predicateResultLock.Lock()
		failedPredicateMap[nodeName] = failedPredicates
		predicateResultLock.Unlock()
	}
}

The concurrent execution via workqueue:

// Stops searching for more nodes once the configured number of feasible nodes
// are found.
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

The implementation of ParallelizeUntil is as follows:

// ParallelizeUntil is a framework that allows for parallelizing N
// independent pieces of work until done or the context is canceled.
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
	var stop <-chan struct{}
	if ctx != nil {
		stop = ctx.Done()
	}

	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer utilruntime.HandleCrash()
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-stop:
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}
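
For reference, the following is a minimal, self-contained usage sketch (not taken from the scheduler; the piece count, worker count, and threshold are arbitrary, and the import path k8s.io/client-go/util/workqueue is the one used by the scheduler in v1.12) showing the same pattern findNodesThatFit relies on: cancel the context once enough results have been collected so the remaining pieces are skipped.

package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	const pieces = 1000 // e.g. the number of nodes to examine
	const wanted = 50   // e.g. numNodesToFind
	var found int32

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	checkPiece := func(i int) {
		// Pretend every piece "fits": count it and cancel the context once we
		// have enough, just as checkNode does with filteredLen and numNodesToFind.
		if atomic.AddInt32(&found, 1) >= wanted {
			cancel()
		}
	}

	// 16 workers process the pieces until the context is canceled.
	workqueue.ParallelizeUntil(ctx, 16, pieces, checkPiece)
	fmt.Printf("found %d pieces before stopping (target was %d)\n", atomic.LoadInt32(&found), wanted)
}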

5. podFitsOnNode

The main points of podFitsOnNode are as follows:

  • podFitsOnNode checks whether the given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and tries to reuse the cached predicate results where possible.

podFitsOnNode is called from two places: Schedule (scheduling) and Preempt (preemption).

When called from Schedule, it tests whether the pod is schedulable on the node, given all the pods already on the node plus the nominated pods of equal or higher priority expected to run there.

When called from Preempt, i.e. when preemption happens, SelectVictimsOnNode selects the pods to be evicted; after they are removed, the pending pod is scheduled onto the node.

The basic flow of podFitsOnNode is as follows:

  1. Iterate over the registered predicate ordering predicates.Ordering and fetch the corresponding predicate function for each key.
  2. Execute each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to the failure list.
  4. Finally, return whether the number of failed predicates is zero, together with the failure reasons.

Input parameters:

  • pod
  • PredicateMetadata
  • NodeInfo
  • predicateFuncs
  • schedulercache.Cache
  • nodeCache
  • SchedulingQueue
  • alwaysCheckAllPredicates
  • equivClass

Return values:

  • fit
  • PredicateFailureReason

The complete code is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func podFitsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	info *schedulercache.NodeInfo,
	predicateFuncs map[string]algorithm.FitPredicate,
	cache schedulercache.Cache,
	nodeCache *equivalence.NodeCache,
	queue SchedulingQueue,
	alwaysCheckAllPredicates bool,
	equivClass *equivalence.Class,
) (bool, []algorithm.PredicateFailureReason, error) {
	var (
		eCacheAvailable  bool
		failedPredicates []algorithm.PredicateFailureReason
	)

	podsAdded := false
	// We run predicates twice in some cases. If the node has greater or equal priority
	// nominated pods, we run them when those pods are added to meta and nodeInfo.
	// If all predicates succeed in this pass, we run them again when these
	// nominated pods are not added. This second pass is necessary because some
	// predicates such as inter-pod affinity may not pass without the nominated pods.
	// If there are no nominated pods for the node or if the first run of the
	// predicates fail, we don't run the second pass.
	// We consider only equal or higher priority pods in the first pass, because
	// those are the current "pod" must yield to them and not take a space opened
	// for running them. It is ok if the current "pod" take resources freed for
	// lower priority pods.
	// Requiring that the new pod is schedulable in both circumstances ensures that
	// we are making a conservative decision: predicates like resources and inter-pod
	// anti-affinity are more likely to fail when the nominated pods are treated
	// as running, while predicates like pod affinity are more likely to fail when
	// the nominated pods are treated as not running. We can't just assume the
	// nominated pods are running because they are not running right now and in fact,
	// they may end up getting scheduled to a different node.
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		// Bypass eCache if node has any nominated pods.
		// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
		// when pods are nominated or their nominations change.
		eCacheAvailable = equivClass != nil && nodeCache != nil && !podsAdded
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []algorithm.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				if eCacheAvailable {
					fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
				} else {
					fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				}
				if err != nil {
					return false, []algorithm.PredicateFailureReason{}, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						glog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}
	}

	return len(failedPredicates) == 0, failedPredicates, nil
}

5.1. predicateFuncs

The predicate functions registered earlier are executed to decide whether the node is schedulable (a minimal example predicate is sketched after the ordering list below).

for _, predicateKey := range predicates.Ordering() {
	if predicate, exist := predicateFuncs[predicateKey]; exist {
		if eCacheAvailable {
			fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
		} else {
			fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
		}

The predicate ordering is as follows:

var (
	predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
)
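
To make the dispatch loop concrete, here is a minimal sketch of what a predicate with the algorithm.FitPredicate signature looks like. AlwaysFit and the predicate name "AlwaysFitPred" are hypothetical, and the import paths are the v1.12 ones assumed from the code above; this is an illustration, not part of the scheduler source:

package example

import (
	"fmt"

	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/algorithm"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// AlwaysFit is a trivial predicate with the same signature that podFitsOnNode
// dispatches to. It only fails when the node object is missing.
func AlwaysFit(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	if nodeInfo.Node() == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	// Returning true with no failure reasons means the node fits.
	return true, nil, nil
}

// Registration would mirror the built-in predicates (see the RegisterFitPredicate
// call in section 6), e.g.:
//
//   factory.RegisterFitPredicate("AlwaysFitPred", AlwaysFit)
//
// Note that podFitsOnNode only evaluates the keys returned by predicates.Ordering(),
// so a predicate that is not part of that ordering is never run.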

6. PodFitsResources

The following takes the predicate PodFitsResources as an example; other important predicates will be analyzed separately later.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, GPU, and so on.

The basic flow of PodFitsResources is as follows:

  1. Check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count; if so, the pod cannot be scheduled there.
  2. Check whether all of the pod's resource requests are zero; if so, the pod can be scheduled.
  3. Check whether the pod's requests plus the sum of the requests of all pods already on the node exceed the node's allocatable resources; if so, the pod cannot be scheduled there.
  4. Check whether the pod's extended-resource requests plus the corresponding requests of all pods already on the node exceed the node's allocatable amount of that resource; if so, the pod cannot be scheduled there.

PodFitsResources is registered as follows:

factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)

Input parameters of PodFitsResources:

  • pod

  • nodeInfo

  • PredicateMetadata

Return values of PodFitsResources:

  • fit
  • PredicateFailureReason

The complete code of PodFitsResources:

This code is located in pkg/scheduler/algorithm/predicates/predicates.go

// PodFitsResources checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// First return value indicates whether a node has sufficient resources to run a pod while the second return value indicates the
// predicate failure reasons if the node has insufficient resources to run the pod.
func PodFitsResources(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}

	var predicateFails []algorithm.PredicateFailureReason
	allowedPodNumber := nodeInfo.AllowedPodNumber()
	if len(nodeInfo.Pods())+1 > allowedPodNumber {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
	}

	// No extended resources should be ignored by default.
	ignoredExtendedResources := sets.NewString()

	var podRequest *schedulercache.Resource
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		podRequest = predicateMeta.podRequest
		if predicateMeta.ignoredExtendedResources != nil {
			ignoredExtendedResources = predicateMeta.ignoredExtendedResources
		}
	} else {
		// We couldn't parse metadata - fallback to computing it.
		podRequest = GetResourceRequest(pod)
	}
	if podRequest.MilliCPU == 0 &&
		podRequest.Memory == 0 &&
		podRequest.EphemeralStorage == 0 &&
		len(podRequest.ScalarResources) == 0 {
		return len(predicateFails) == 0, predicateFails, nil
	}

	allocatable := nodeInfo.AllocatableResource()
	if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
	}
	if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
	}
	if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
	}

	for rName, rQuant := range podRequest.ScalarResources {
		if v1helper.IsExtendedResourceName(rName) {
			// If this resource is one of the extended resources that should be
			// ignored, we will skip checking it.
			if ignoredExtendedResources.Has(string(rName)) {
				continue
			}
		}
		if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
			predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
		}
	}

	if glog.V(10) {
		if len(predicateFails) == 0 {
			// We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
			// not logged. There is visible performance gain from it.
			glog.Infof("Schedule Pod %+v on Node %+v is allowed, Node is running only %v out of %v Pods.",
				podName(pod), node.Name, len(nodeInfo.Pods()), allowedPodNumber)
		}
	}
	return len(predicateFails) == 0, predicateFails, nil
}

6.1. NodeInfo

NodeInfo is the aggregated information of a node, mainly including:

  • node: the k8s Node object
  • pods: the pods currently running on the node
  • requestedResource: the sum of the resource requests of all pods on the node
  • allocatableResource: the node's actual allocatable resources (corresponding to Node.Status.Allocatable.*), which can be understood as the node's total schedulable capacity.

This code is located in pkg/scheduler/cache/node_info.go

// NodeInfo is node level aggregated information.
type NodeInfo struct {
	// Overall node information.
	node *v1.Node

	pods             []*v1.Pod
	podsWithAffinity []*v1.Pod
	usedPorts        util.HostPortInfo

	// Total requested resource of all pods on this node.
	// It includes assumed pods which scheduler sends binding to apiserver but
	// didn't get it as scheduled yet.
	requestedResource *Resource
	nonzeroRequest    *Resource
	// We store allocatedResources (which is Node.Status.Allocatable.*) explicitly
	// as int64, to avoid conversions and accessing map.
	allocatableResource *Resource

	// Cached taints of the node for faster lookup.
	taints    []v1.Taint
	taintsErr error

	// imageStates holds the entry of an image if and only if this image is on the node. The entry can be used for
	// checking an image's existence and advanced usage (e.g., image locality scheduling policy) based on the image
	// state information.
	imageStates map[string]*ImageStateSummary

	// TransientInfo holds the information pertaining to a scheduling cycle. This will be destructed at the end of
	// scheduling cycle.
	// TODO: @ravig. Remove this once we have a clear approach for message passing across predicates and priorities.
	TransientInfo *transientSchedulerInfo

	// Cached conditions of node for faster lookup.
	memoryPressureCondition v1.ConditionStatus
	diskPressureCondition   v1.ConditionStatus
	pidPressureCondition    v1.ConditionStatus

	// Whenever NodeInfo changes, generation is bumped.
	// This is used to avoid cloning it if the object didn't change.
	generation int64
}

6.2. Resource

Resource is a collection of compute resources, mainly including:

  • MilliCPU
  • Memory
  • EphemeralStorage
  • AllowedPodNumber: the allowed number of pods (corresponding to Node.Status.Allocatable.Pods().Value()), 110 by default.
  • ScalarResources

// Resource is a collection of compute resource.
type Resource struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}

The following analyzes the detailed flow of PodFitsResources.

6.3. allowedPodNumber

First obtain the node information and check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count, usually 110. If it does, an entry is appended to the predicateFails slice, meaning the node is not suitable for the pod.

node := nodeInfo.Node()
if node == nil {
	return false, nil, fmt.Errorf("node not found")
}

var predicateFails []algorithm.PredicateFailureReason
allowedPodNumber := nodeInfo.AllowedPodNumber()
if len(nodeInfo.Pods())+1 > allowedPodNumber {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
}

6.4. podRequest

If all of the pod's requests are zero, the pod is allowed onto the node and the result is returned immediately.

if podRequest.MilliCPU == 0 &&
	podRequest.Memory == 0 &&
	podRequest.EphemeralStorage == 0 &&
	len(podRequest.ScalarResources) == 0 {
	return len(predicateFails) == 0, predicateFails, nil
}

6.5. AllocatableResource

If the pending pod's requests plus the sum of the requests of all pods already on the node exceed the node's total allocatable resources, the pod is not allowed onto the node and the result is returned. The requests checked here are CPU, memory, and ephemeral storage. A toy numeric example follows the code below.

allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
}
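
As a toy illustration of this check (the numbers below are made up and are not from the source), suppose the node's allocatable CPU is 4000m, the pods already on the node request 3500m in total, and the pending pod requests 600m; since 3500 + 600 > 4000, an InsufficientResourceError for CPU would be recorded:

package main

import "fmt"

func main() {
	var (
		allocatableMilliCPU int64 = 4000 // node allocatable CPU, in millicores
		requestedMilliCPU   int64 = 3500 // sum of requests of pods already on the node
		podRequestMilliCPU  int64 = 600  // request of the pod being scheduled
	)
	// Same comparison as the CPU branch above.
	if allocatableMilliCPU < podRequestMilliCPU+requestedMilliCPU {
		fmt.Printf("Insufficient cpu: requested %d, already used %d, capacity %d\n",
			podRequestMilliCPU, requestedMilliCPU, allocatableMilliCPU)
	}
}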

6.6. ScalarResources

For the other extended scalar resources, check whether the pod's request plus the sum of the corresponding requests of all pods already on the node exceeds the node's allocatable amount of that resource; if so, the pod is not allowed onto the node.

for rName, rQuant := range podRequest.ScalarResources {
	if v1helper.IsExtendedResourceName(rName) {
		// If this resource is one of the extended resources that should be
		// ignored, we will skip checking it.
		if ignoredExtendedResources.Has(string(rName)) {
			continue
		}
	}
	if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
		predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
	}
}

7. Summary

findNodesThatFit filters the nodes with the given predicate functions: each node is passed through the predicates to determine whether it fits the pod.

findNodesThatFit takes the pod being scheduled and the current node list as input, and returns the list of feasible nodes, a map of failed predicates per node, and an error.

The basic flow of findNodesThatFit is as follows:

  1. Compute the number of feasible nodes to find, which is used as the capacity of the feasible-node slice, so that a very large cluster does not force the scheduler to inefficiently examine every node.
  2. Repeatedly fetch the next node from the NodeTree and check whether it satisfies the pod's scheduling requirements.
  3. Run the previously registered predicate functions to decide whether the current node fits the pod.
  4. Finally return the list of nodes that satisfy the scheduling requirements, which is fed into the subsequent priority (scoring) step.

7.1. checkNode

checkNode is the function that checks whether a node meets the requirements; the core function it calls is podFitsOnNode. checkNode is executed concurrently through workqueue.

The main flow of checkNode is as follows:

  1. Fetch the next node from the nodeTree in the cache.
  2. Pass the current node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, append it to the feasible-node slice filtered.
  4. If the node does not fit, add it to the failure map and record the reasons.
  5. Execute checkNode concurrently via workqueue.ParallelizeUntil; once the configured number of feasible nodes has been found, the search for more nodes stops.

7.2. podFitsOnNode

checkNode in turn calls the core function podFitsOnNode.

The main points of podFitsOnNode are as follows:

  • podFitsOnNode checks whether the given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and tries to reuse the cached predicate results where possible.

podFitsOnNode is called from two places: Schedule (scheduling) and Preempt (preemption).

When called from Schedule, it tests whether the pod is schedulable on the node, given all the pods already on the node plus the nominated pods of equal or higher priority expected to run there.

When called from Preempt, i.e. when preemption happens, SelectVictimsOnNode selects the pods to be evicted; after they are removed, the pending pod is scheduled onto the node.

The basic flow of podFitsOnNode is as follows:

  1. Iterate over the registered predicate ordering predicates.Ordering and fetch the corresponding predicate function for each key.
  2. Execute each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to the failure list.
  4. Finally, return whether the number of failed predicates is zero, together with the failure reasons.

7.3. PodFitsResources

This article analyzed only one of the important predicates as an example: PodFitsResources.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, GPU, and so on.

The basic flow of PodFitsResources is as follows:

  1. Check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count; if so, the pod cannot be scheduled there.
  2. Check whether all of the pod's resource requests are zero; if so, the pod can be scheduled.
  3. Check whether the pod's requests plus the sum of the requests of all pods already on the node exceed the node's allocatable resources; if so, the pod cannot be scheduled there.
  4. Check whether the pod's extended-resource requests plus the corresponding requests of all pods already on the node exceed the node's allocatable amount of that resource; if so, the pod cannot be scheduled there.
