kube-scheduler Source Code Analysis (Part 4): findNodesThatFit

Personal blog version of this article: https://www.huweihuang.com/kubernetes-notes/code-analysis/kube-scheduler/findNodesThatFit.html

The following code analysis is based on kubernetes v1.12.0.

This article analyzes the predicate (filtering) step of the scheduling logic, i.e. the first phase that selects the nodes on which the pod can be scheduled.

1. Call Entry Point

In the predicate phase, each node is run through the predicate functions to determine whether it is suitable for the pod.

The call to findNodesThatFit from genericScheduler.Schedule is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
	...
	// list all nodes
	nodes, err := nodeLister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", ErrNoNodesAvailable
	}

	// Used for all fit and priority funcs.
	err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return "", err
	}

	trace.Step("Computing predicates")
	startPredicateEvalTime := time.Now()
	// call findNodesThatFit to filter out the feasible nodes
	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return "", err
	}

	if len(filteredNodes) == 0 {
		return "", &FitError{
			Pod:              pod,
			NumAllNodes:      len(nodes),
			FailedPredicates: failedPredicateMap,
		}
	}
	// metrics
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
	...
}  

The core call:

// call findNodesThatFit to filter out the feasible nodes
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)

2. findNodesThatFit

findNodesThatFit filters the nodes with the given predicate functions: each node is passed through the predicates to determine whether it fits the pod.

findNodesThatFit takes the pod being scheduled and the current node list as input, and returns the list of feasible nodes, a map of failed predicates per node, and an error.

The basic flow of findNodesThatFit is as follows:

  1. Compute the number of feasible nodes to find, which is used as the capacity of the feasible-node slice, so that a very large cluster does not force the scheduler to examine every node.
  2. Repeatedly fetch the next node from the NodeTree and check whether it satisfies the pod's scheduling requirements.
  3. Run the previously registered predicate functions to decide whether the current node fits the pod.
  4. Finally return the list of nodes that satisfy the scheduling requirements, which is fed into the subsequent priority (scoring) step.

The complete code of findNodesThatFit is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
   var filtered []*v1.Node
   failedPredicateMap := FailedPredicateMap{}

   if len(g.predicates) == 0 {
      filtered = nodes
   } else {
      allNodes := int32(g.cache.NodeTree().NumNodes)
      numNodesToFind := g.numFeasibleNodesToFind(allNodes)

      // Create filtered list with enough space to avoid growing it
      // and allow assigning.
      filtered = make([]*v1.Node, numNodesToFind)
      errs := errors.MessageCountMap{}
      var (
         predicateResultLock sync.Mutex
         filteredLen         int32
         equivClass          *equivalence.Class
      )

      ctx, cancel := context.WithCancel(context.Background())

      // We can use the same metadata producer for all nodes.
      meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

      if g.equivalenceCache != nil {
         // getEquivalenceClassInfo will return immediately if no equivalence pod found
         equivClass = equivalence.NewClass(pod)
      }

      checkNode := func(i int) {
         var nodeCache *equivalence.NodeCache
         nodeName := g.cache.NodeTree().Next()
         if g.equivalenceCache != nil {
            nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
         }
         fits, failedPredicates, err := podFitsOnNode(
            pod,
            meta,
            g.cachedNodeInfoMap[nodeName],
            g.predicates,
            g.cache,
            nodeCache,
            g.schedulingQueue,
            g.alwaysCheckAllPredicates,
            equivClass,
         )
         if err != nil {
            predicateResultLock.Lock()
            errs[err.Error()]++
            predicateResultLock.Unlock()
            return
         }
         if fits {
            length := atomic.AddInt32(&filteredLen, 1)
            if length > numNodesToFind {
               cancel()
               atomic.AddInt32(&filteredLen, -1)
            } else {
               filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
            }
         } else {
            predicateResultLock.Lock()
            failedPredicateMap[nodeName] = failedPredicates
            predicateResultLock.Unlock()
         }
      }

      // Stops searching for more nodes once the configured number of feasible nodes
      // are found.
      workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

      filtered = filtered[:filteredLen]
      if len(errs) > 0 {
         return []*v1.Node{}, FailedPredicateMap{}, errors.CreateAggregateFromMessageCountMap(errs)
      }
   }

   if len(filtered) > 0 && len(g.extenders) != 0 {
      for _, extender := range g.extenders {
         if !extender.IsInterested(pod) {
            continue
         }
         filteredList, failedMap, err := extender.Filter(pod, filtered, g.cachedNodeInfoMap)
         if err != nil {
            if extender.IsIgnorable() {
               glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
                  extender, err)
               continue
            } else {
               return []*v1.Node{}, FailedPredicateMap{}, err
            }
         }

         for failedNodeName, failedMsg := range failedMap {
            if _, found := failedPredicateMap[failedNodeName]; !found {
               failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
            }
            failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
         }
         filtered = filteredList
         if len(filtered) == 0 {
            break
         }
      }
   }
   return filtered, failedPredicateMap, nil
}

The following analyzes findNodesThatFit section by section.

3. numFeasibleNodesToFind

findNodesThatFit first determines, from the total node count, how many feasible nodes it needs to find. numFeasibleNodesToFind exists mainly to keep scheduling efficient when the cluster has a large number of nodes (more than 100): the search can stop early instead of checking every node.

allNodes := int32(g.cache.NodeTree().NumNodes)
numNodesToFind := g.numFeasibleNodesToFind(allNodes)

// Create filtered list with enough space to avoid growing it
// and allow assigning.
filtered = make([]*v1.Node, numNodesToFind)

The basic flow of numFeasibleNodesToFind is as follows (a small worked example follows the code below):

  • If the total number of nodes is smaller than minFeasibleNodesToFind (currently 100 by default), or percentageOfNodesToScore is not in the range (0, 100), return the total node count.
  • Otherwise take the configured percentage of the total node count; if that number is still smaller than minFeasibleNodesToFind, return minFeasibleNodesToFind.
  • If the percentage-based number is larger than minFeasibleNodesToFind, return that number of nodes.

// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore <= 0 ||
		g.percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * g.percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}
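
To make the cutoff concrete, here is a small self-contained sketch (this is not scheduler source code; the node counts and percentages are made-up sample values, and minFeasibleNodesToFind is hard-coded to the v1.12 default of 100) that reproduces the same arithmetic:

package main

import "fmt"

// minFeasibleNodesToFind mirrors the scheduler's default lower bound of 100.
const minFeasibleNodesToFind = 100

// numFeasibleNodesToFind reproduces the logic shown above for illustration.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore <= 0 ||
		percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(80, 50))   // 80: fewer than 100 nodes, so check them all
	fmt.Println(numFeasibleNodesToFind(5000, 50)) // 2500: stop after 50% of 5000 feasible nodes
	fmt.Println(numFeasibleNodesToFind(150, 20))  // 100: 150*20/100=30 is below the floor of 100
}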

4. checkNode

checkNode is the function that checks whether a node meets the requirements; the core function it calls is podFitsOnNode. checkNode is then executed concurrently through workqueue.

The main flow of checkNode is as follows:

  1. Fetch the next node from the nodeTree in the cache.
  2. Pass the current node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, append it to the feasible-node slice filtered.
  4. If the node does not fit, add it to the failure map and record the reasons.
  5. Execute checkNode concurrently via workqueue.ParallelizeUntil; once the configured number of feasible nodes has been found, the search for more nodes stops.

checkNode := func(i int) {
	var nodeCache *equivalence.NodeCache
	nodeName := g.cache.NodeTree().Next()
	if g.equivalenceCache != nil {
		nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
	}
	fits, failedPredicates, err := podFitsOnNode(
		pod,
		meta,
		g.cachedNodeInfoMap[nodeName],
		g.predicates,
		g.cache,
		nodeCache,
		g.schedulingQueue,
		g.alwaysCheckAllPredicates,
		equivClass,
	)
	if err != nil {
		predicateResultLock.Lock()
		errs[err.Error()]++
		predicateResultLock.Unlock()
		return
	}
	if fits {
		length := atomic.AddInt32(&filteredLen, 1)
		if length > numNodesToFind {
			cancel()
			atomic.AddInt32(&filteredLen, -1)
		} else {
			filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
		}
	} else {
		predicateResultLock.Lock()
		failedPredicateMap[nodeName] = failedPredicates
		predicateResultLock.Unlock()
	}
}

The concurrent execution via workqueue:

// Stops searching for more nodes once the configured number of feasible nodes
// are found.
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

The implementation of ParallelizeUntil is as follows:

// ParallelizeUntil is a framework that allows for parallelizing N
// independent pieces of work until done or the context is canceled.
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
	var stop <-chan struct{}
	if ctx != nil {
		stop = ctx.Done()
	}

	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer utilruntime.HandleCrash()
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-stop:
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}
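
For reference, the following is a minimal, self-contained usage sketch (not taken from the scheduler; the piece count, worker count, and threshold are arbitrary, and the import path k8s.io/client-go/util/workqueue is the one used by the scheduler in v1.12) showing the same pattern findNodesThatFit relies on: cancel the context once enough results have been collected so the remaining pieces are skipped.

package main

import (
	"context"
	"fmt"
	"sync/atomic"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	const pieces = 1000 // e.g. the number of nodes to examine
	const wanted = 50   // e.g. numNodesToFind
	var found int32

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	checkPiece := func(i int) {
		// Pretend every piece "fits": count it and cancel the context once we
		// have enough, just as checkNode does with filteredLen and numNodesToFind.
		if atomic.AddInt32(&found, 1) >= wanted {
			cancel()
		}
	}

	// 16 workers process the pieces until the context is canceled.
	workqueue.ParallelizeUntil(ctx, 16, pieces, checkPiece)
	fmt.Printf("found %d pieces before stopping (target was %d)\n", atomic.LoadInt32(&found), wanted)
}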

5. podFitsOnNode

The main points of podFitsOnNode are as follows:

  • podFitsOnNode checks whether the given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and tries to reuse the cached predicate results where possible.

podFitsOnNode is called from two places: Schedule (scheduling) and Preempt (preemption).

When called from Schedule, it tests whether the pod is schedulable on the node, given all the pods already on the node plus the nominated pods of equal or higher priority expected to run there.

When called from Preempt, i.e. when preemption happens, SelectVictimsOnNode selects the pods to be evicted; after they are removed, the pending pod is scheduled onto the node.

The basic flow of podFitsOnNode is as follows:

  1. Iterate over the registered predicate ordering predicates.Ordering and fetch the corresponding predicate function for each key.
  2. Execute each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to the failure list.
  4. Finally, return whether the number of failed predicates is zero, together with the failure reasons.

Input parameters:

  • pod
  • PredicateMetadata
  • NodeInfo
  • predicateFuncs
  • schedulercache.Cache
  • nodeCache
  • SchedulingQueue
  • alwaysCheckAllPredicates
  • equivClass

Return values:

  • fit
  • PredicateFailureReason

The complete code is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func podFitsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	info *schedulercache.NodeInfo,
	predicateFuncs map[string]algorithm.FitPredicate,
	cache schedulercache.Cache,
	nodeCache *equivalence.NodeCache,
	queue SchedulingQueue,
	alwaysCheckAllPredicates bool,
	equivClass *equivalence.Class,
) (bool, []algorithm.PredicateFailureReason, error) {
	var (
		eCacheAvailable  bool
		failedPredicates []algorithm.PredicateFailureReason
	)

	podsAdded := false
	// We run predicates twice in some cases. If the node has greater or equal priority
	// nominated pods, we run them when those pods are added to meta and nodeInfo.
	// If all predicates succeed in this pass, we run them again when these
	// nominated pods are not added. This second pass is necessary because some
	// predicates such as inter-pod affinity may not pass without the nominated pods.
	// If there are no nominated pods for the node or if the first run of the
	// predicates fail, we don't run the second pass.
	// We consider only equal or higher priority pods in the first pass, because
	// those are the current "pod" must yield to them and not take a space opened
	// for running them. It is ok if the current "pod" take resources freed for
	// lower priority pods.
	// Requiring that the new pod is schedulable in both circumstances ensures that
	// we are making a conservative decision: predicates like resources and inter-pod
	// anti-affinity are more likely to fail when the nominated pods are treated
	// as running, while predicates like pod affinity are more likely to fail when
	// the nominated pods are treated as not running. We can't just assume the
	// nominated pods are running because they are not running right now and in fact,
	// they may end up getting scheduled to a different node.
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		// Bypass eCache if node has any nominated pods.
		// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
		// when pods are nominated or their nominations change.
		eCacheAvailable = equivClass != nil && nodeCache != nil && !podsAdded
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []algorithm.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				if eCacheAvailable {
					fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
				} else {
					fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				}
				if err != nil {
					return false, []algorithm.PredicateFailureReason{}, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						glog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}
	}

	return len(failedPredicates) == 0, failedPredicates, nil
}

5.1. predicateFuncs

The predicate functions registered earlier are executed to decide whether the node is schedulable (a minimal example predicate is sketched after the ordering list below).

for _, predicateKey := range predicates.Ordering() {
	if predicate, exist := predicateFuncs[predicateKey]; exist {
		if eCacheAvailable {
			fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
		} else {
			fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
		}

The predicate ordering is as follows:

var (
	predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
)
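
To make the dispatch loop concrete, here is a minimal sketch of what a predicate with the algorithm.FitPredicate signature looks like. AlwaysFit and the predicate name "AlwaysFitPred" are hypothetical, and the import paths are the v1.12 ones assumed from the code above; this is an illustration, not part of the scheduler source:

package example

import (
	"fmt"

	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/algorithm"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/cache"
)

// AlwaysFit is a trivial predicate with the same signature that podFitsOnNode
// dispatches to. It only fails when the node object is missing.
func AlwaysFit(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	if nodeInfo.Node() == nil {
		return false, nil, fmt.Errorf("node not found")
	}
	// Returning true with no failure reasons means the node fits.
	return true, nil, nil
}

// Registration would mirror the built-in predicates (see the RegisterFitPredicate
// call in section 6), e.g.:
//
//   factory.RegisterFitPredicate("AlwaysFitPred", AlwaysFit)
//
// Note that podFitsOnNode only evaluates the keys returned by predicates.Ordering(),
// so a predicate that is not part of that ordering is never run.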

6. PodFitsResources

The following takes the predicate PodFitsResources as an example; other important predicates will be analyzed separately later.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, GPU, and so on.

The basic flow of PodFitsResources is as follows:

  1. Check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count; if so, the pod cannot be scheduled there.
  2. Check whether all of the pod's resource requests are zero; if so, the pod can be scheduled.
  3. Check whether the pod's requests plus the sum of the requests of all pods already on the node exceed the node's allocatable resources; if so, the pod cannot be scheduled there.
  4. Check whether the pod's extended-resource requests plus the corresponding requests of all pods already on the node exceed the node's allocatable amount of that resource; if so, the pod cannot be scheduled there.

PodFitsResources is registered as follows:

factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)

Input parameters of PodFitsResources:

  • pod

  • nodeInfo

  • PredicateMetadata

Return values of PodFitsResources:

  • fit
  • PredicateFailureReason

The complete code of PodFitsResources:

This code is located in pkg/scheduler/algorithm/predicates/predicates.go

// PodFitsResources checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// First return value indicates whether a node has sufficient resources to run a pod while the second return value indicates the
// predicate failure reasons if the node has insufficient resources to run the pod.
func PodFitsResources(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}

	var predicateFails []algorithm.PredicateFailureReason
	allowedPodNumber := nodeInfo.AllowedPodNumber()
	if len(nodeInfo.Pods())+1 > allowedPodNumber {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
	}

	// No extended resources should be ignored by default.
	ignoredExtendedResources := sets.NewString()

	var podRequest *schedulercache.Resource
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		podRequest = predicateMeta.podRequest
		if predicateMeta.ignoredExtendedResources != nil {
			ignoredExtendedResources = predicateMeta.ignoredExtendedResources
		}
	} else {
		// We couldn't parse metadata - fallback to computing it.
		podRequest = GetResourceRequest(pod)
	}
	if podRequest.MilliCPU == 0 &&
		podRequest.Memory == 0 &&
		podRequest.EphemeralStorage == 0 &&
		len(podRequest.ScalarResources) == 0 {
		return len(predicateFails) == 0, predicateFails, nil
	}

	allocatable := nodeInfo.AllocatableResource()
	if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
	}
	if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
	}
	if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
	}

	for rName, rQuant := range podRequest.ScalarResources {
		if v1helper.IsExtendedResourceName(rName) {
			// If this resource is one of the extended resources that should be
			// ignored, we will skip checking it.
			if ignoredExtendedResources.Has(string(rName)) {
				continue
			}
		}
		if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
			predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
		}
	}

	if glog.V(10) {
		if len(predicateFails) == 0 {
			// We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
			// not logged. There is visible performance gain from it.
			glog.Infof("Schedule Pod %+v on Node %+v is allowed, Node is running only %v out of %v Pods.",
				podName(pod), node.Name, len(nodeInfo.Pods()), allowedPodNumber)
		}
	}
	return len(predicateFails) == 0, predicateFails, nil
}

6.1. NodeInfo

NodeInfo is the aggregated information of a node, mainly including:

  • node: the k8s Node object
  • pods: the pods currently running on the node
  • requestedResource: the sum of the resource requests of all pods on the node
  • allocatableResource: the node's actual allocatable resources (corresponding to Node.Status.Allocatable.*), which can be understood as the node's total schedulable capacity.

This code is located in pkg/scheduler/cache/node_info.go

// NodeInfo is node level aggregated information.
type NodeInfo struct {
	// Overall node information.
	node *v1.Node

	pods             []*v1.Pod
	podsWithAffinity []*v1.Pod
	usedPorts        util.HostPortInfo

	// Total requested resource of all pods on this node.
	// It includes assumed pods which scheduler sends binding to apiserver but
	// didn't get it as scheduled yet.
	requestedResource *Resource
	nonzeroRequest    *Resource
	// We store allocatedResources (which is Node.Status.Allocatable.*) explicitly
	// as int64, to avoid conversions and accessing map.
	allocatableResource *Resource

	// Cached taints of the node for faster lookup.
	taints    []v1.Taint
	taintsErr error

	// imageStates holds the entry of an image if and only if this image is on the node. The entry can be used for
	// checking an image's existence and advanced usage (e.g., image locality scheduling policy) based on the image
	// state information.
	imageStates map[string]*ImageStateSummary

	// TransientInfo holds the information pertaining to a scheduling cycle. This will be destructed at the end of
	// scheduling cycle.
	// TODO: @ravig. Remove this once we have a clear approach for message passing across predicates and priorities.
	TransientInfo *transientSchedulerInfo

	// Cached conditions of node for faster lookup.
	memoryPressureCondition v1.ConditionStatus
	diskPressureCondition   v1.ConditionStatus
	pidPressureCondition    v1.ConditionStatus

	// Whenever NodeInfo changes, generation is bumped.
	// This is used to avoid cloning it if the object didn't change.
	generation int64
}

6.2. Resource

Resource is a collection of compute resources, mainly including:

  • MilliCPU
  • Memory
  • EphemeralStorage
  • AllowedPodNumber: the allowed number of pods (corresponding to Node.Status.Allocatable.Pods().Value()), 110 by default.
  • ScalarResources

// Resource is a collection of compute resource.
type Resource struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}

The following analyzes the detailed flow of PodFitsResources.

6.3. allowedPodNumber

First obtain the node information and check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count, usually 110. If it does, an entry is appended to the predicateFails slice, meaning the node is not suitable for the pod.

node := nodeInfo.Node()
if node == nil {
	return false, nil, fmt.Errorf("node not found")
}

var predicateFails []algorithm.PredicateFailureReason
allowedPodNumber := nodeInfo.AllowedPodNumber()
if len(nodeInfo.Pods())+1 > allowedPodNumber {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
}

6.4. podRequest

If all of the pod's requests are zero, the pod is allowed onto the node and the result is returned immediately.

if podRequest.MilliCPU == 0 &&
	podRequest.Memory == 0 &&
	podRequest.EphemeralStorage == 0 &&
	len(podRequest.ScalarResources) == 0 {
	return len(predicateFails) == 0, predicateFails, nil
}

6.5. AllocatableResource

If the pending pod's requests plus the sum of the requests of all pods already on the node exceed the node's total allocatable resources, the pod is not allowed onto the node and the result is returned. The requests checked here are CPU, memory, and ephemeral storage. A toy numeric example follows the code below.

allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
}
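
As a toy illustration of this check (the numbers below are made up and are not from the source), suppose the node's allocatable CPU is 4000m, the pods already on the node request 3500m in total, and the pending pod requests 600m; since 3500 + 600 > 4000, an InsufficientResourceError for CPU would be recorded:

package main

import "fmt"

func main() {
	var (
		allocatableMilliCPU int64 = 4000 // node allocatable CPU, in millicores
		requestedMilliCPU   int64 = 3500 // sum of requests of pods already on the node
		podRequestMilliCPU  int64 = 600  // request of the pod being scheduled
	)
	// Same comparison as the CPU branch above.
	if allocatableMilliCPU < podRequestMilliCPU+requestedMilliCPU {
		fmt.Printf("Insufficient cpu: requested %d, already used %d, capacity %d\n",
			podRequestMilliCPU, requestedMilliCPU, allocatableMilliCPU)
	}
}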

6.6. ScalarResources

For the other extended scalar resources, check whether the pod's request plus the sum of the corresponding requests of all pods already on the node exceeds the node's allocatable amount of that resource; if so, the pod is not allowed onto the node.

for rName, rQuant := range podRequest.ScalarResources {
	if v1helper.IsExtendedResourceName(rName) {
		// If this resource is one of the extended resources that should be
		// ignored, we will skip checking it.
		if ignoredExtendedResources.Has(string(rName)) {
			continue
		}
	}
	if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
		predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
	}
}

7. Summary

findNodesThatFit filters the nodes with the given predicate functions: each node is passed through the predicates to determine whether it fits the pod.

findNodesThatFit takes the pod being scheduled and the current node list as input, and returns the list of feasible nodes, a map of failed predicates per node, and an error.

The basic flow of findNodesThatFit is as follows:

  1. Compute the number of feasible nodes to find, which is used as the capacity of the feasible-node slice, so that a very large cluster does not force the scheduler to inefficiently examine every node.
  2. Repeatedly fetch the next node from the NodeTree and check whether it satisfies the pod's scheduling requirements.
  3. Run the previously registered predicate functions to decide whether the current node fits the pod.
  4. Finally return the list of nodes that satisfy the scheduling requirements, which is fed into the subsequent priority (scoring) step.

7.1. checkNode

checkNode is the function that checks whether a node meets the requirements; the core function it calls is podFitsOnNode. checkNode is executed concurrently through workqueue.

The main flow of checkNode is as follows:

  1. Fetch the next node from the nodeTree in the cache.
  2. Pass the current node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, append it to the feasible-node slice filtered.
  4. If the node does not fit, add it to the failure map and record the reasons.
  5. Execute checkNode concurrently via workqueue.ParallelizeUntil; once the configured number of feasible nodes has been found, the search for more nodes stops.

7.2. podFitsOnNode

checkNode in turn calls the core function podFitsOnNode.

The main points of podFitsOnNode are as follows:

  • podFitsOnNode checks whether the given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and tries to reuse the cached predicate results where possible.

podFitsOnNode is called from two places: Schedule (scheduling) and Preempt (preemption).

When called from Schedule, it tests whether the pod is schedulable on the node, given all the pods already on the node plus the nominated pods of equal or higher priority expected to run there.

When called from Preempt, i.e. when preemption happens, SelectVictimsOnNode selects the pods to be evicted; after they are removed, the pending pod is scheduled onto the node.

The basic flow of podFitsOnNode is as follows:

  1. Iterate over the registered predicate ordering predicates.Ordering and fetch the corresponding predicate function for each key.
  2. Execute each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to the failure list.
  4. Finally, return whether the number of failed predicates is zero, together with the failure reasons.

7.3. PodFitsResources

This article analyzed only one of the important predicates as an example: PodFitsResources.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, GPU, and so on.

The basic flow of PodFitsResources is as follows:

  1. Check whether the number of pods already on the node plus the pod being scheduled exceeds the node's allowed pod count; if so, the pod cannot be scheduled there.
  2. Check whether all of the pod's resource requests are zero; if so, the pod can be scheduled.
  3. Check whether the pod's requests plus the sum of the requests of all pods already on the node exceed the node's allocatable resources; if so, the pod cannot be scheduled there.
  4. Check whether the pod's extended-resource requests plus the corresponding requests of all pods already on the node exceed the node's allocatable amount of that resource; if so, the pod cannot be scheduled there.
