1. Scheduler extender

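There are three ways to add new scheduling rules (predicate and priority functions) to Kubernetes. This article covers the third.

First, add the scheduling rules directly to the Kubernetes source and recompile.

Second, implement your own scheduler and replace the k8s scheduler.

Third, implement a scheduler extender, which extends k8s scheduling without replacing it.

1.1 HTTP request body: ExtenderArgs

The configured scheduler sends an HTTP request to the extender service with a serialized ExtenderArgs as the body. It mainly carries the pod being scheduled and the list of candidate nodes.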

// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod *v1.Pod
	// List of candidate nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// List of candidate node names where the pod can be scheduled; to be
	// populated only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
}

1.1.1 Reading the HTTP body

With the github.com/emicklei/go-restful package this is simple: call ReadEntity to decode the body into an ExtenderArgs variable.

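func predicates(r *restful.Request, w *restful.Response) {
	var extenderArgs schedulerapi.ExtenderArgs

	// Decode the JSON request body sent by the scheduler into ExtenderArgs.
	if err := r.ReadEntity(&extenderArgs); err != nil {
		logrus.Errorf("predicate read entity error: %v", err)
		w.WriteErrorString(http.StatusInternalServerError, err.Error())
		return
	}

	// Run the custom filter (handleFilter, defined in 1.1.2) and write the result as the response body.
	w.WriteEntity(handleFilter(extenderArgs))
}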
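
For readers who prefer the standard library over go-restful, the same read-then-respond flow can be sketched with net/http. This is only an illustration under stated assumptions: filterArgs and filterResult are hypothetical, trimmed stand-ins for ExtenderArgs and ExtenderFilterResult, and the handler simply echoes every node back as schedulable.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// filterArgs is a hypothetical, trimmed stand-in for ExtenderArgs.
// The real struct also carries the full Pod and, optionally, a NodeList.
type filterArgs struct {
	NodeNames *[]string
}

// filterResult is a hypothetical, trimmed stand-in for ExtenderFilterResult.
type filterResult struct {
	NodeNames   *[]string
	FailedNodes map[string]string
	Error       string
}

// predicatesHandler decodes the request body and returns every node as schedulable.
func predicatesHandler(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	res := filterResult{NodeNames: args.NodeNames, FailedNodes: map[string]string{}}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(res)
}

// callFilter drives the handler in-process and returns the node names it sent back.
func callFilter(names []string) ([]string, error) {
	body, err := json.Marshal(filterArgs{NodeNames: &names})
	if err != nil {
		return nil, err
	}
	req := httptest.NewRequest(http.MethodPost, "/predicates", bytes.NewReader(body))
	rec := httptest.NewRecorder()
	predicatesHandler(rec, req)
	if rec.Code != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", rec.Code)
	}
	var res filterResult
	if err := json.NewDecoder(rec.Body).Decode(&res); err != nil {
		return nil, err
	}
	if res.NodeNames == nil {
		return nil, nil
	}
	return *res.NodeNames, nil
}

func main() {
	got, err := callFilter([]string{"node-a", "node-b"})
	fmt.Println(got, err)
}
```

In a real service the handler would be registered with http.HandleFunc("/predicates", predicatesHandler) and served with http.ListenAndServe on the port named in urlPrefix.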

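1.1.2 Defining your own filter rules

Iterate over all nodes and apply the custom filter rules, for example CPU, memory, or storage metrics. Nodes that pass are added to canSchedule. Nodes whose check returns an error are recorded in canNotSchedule with the error message as the reason. The result is returned as an ExtenderFilterResult.

predicateHandler can be defined as needed, for example based on CPU, memory, or storage metrics.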

func handleFilter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
	pod := args.Pod
	canSchedule := make([]v1.Node, 0, len(args.Nodes.Items))
	canNotSchedule := make(map[string]string)

	for _, node := range args.Nodes.Items {
		result, err := predicateHandler(*pod, node)
		if err != nil {
			canNotSchedule[node.Name] = err.Error()
		} else if result {
			canSchedule = append(canSchedule, node)
		}
	}
	return &schedulerapi.ExtenderFilterResult{
		Nodes: &v1.NodeList{
			Items: canSchedule,
		},
		FailedNodes: canNotSchedule,
		Error:       "",
	}
}
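
The article leaves predicateHandler undefined. As a minimal sketch of what such a check might look like, the example below compares a pod's resource request against a node's free capacity. podSpec and nodeSpec, and all of their fields, are hypothetical stand-ins for v1.Pod and v1.Node; a real handler would read requests and allocatable resources from the Kubernetes objects.

```go
package main

import "fmt"

// podSpec is a hypothetical, trimmed stand-in for v1.Pod.
type podSpec struct {
	Name        string
	CPUMilli    int64 // requested CPU in millicores
	MemoryBytes int64 // requested memory in bytes
}

// nodeSpec is a hypothetical, trimmed stand-in for v1.Node.
type nodeSpec struct {
	Name            string
	FreeCPUMilli    int64 // unallocated CPU in millicores
	FreeMemoryBytes int64 // unallocated memory in bytes
}

// predicateHandler reports whether the pod fits on the node.
// A non-nil error carries the reason the node was rejected.
func predicateHandler(pod podSpec, node nodeSpec) (bool, error) {
	if pod.CPUMilli > node.FreeCPUMilli {
		return false, fmt.Errorf("node %s: insufficient cpu (want %dm, free %dm)",
			node.Name, pod.CPUMilli, node.FreeCPUMilli)
	}
	if pod.MemoryBytes > node.FreeMemoryBytes {
		return false, fmt.Errorf("node %s: insufficient memory (want %d, free %d)",
			node.Name, pod.MemoryBytes, node.FreeMemoryBytes)
	}
	return true, nil
}

func main() {
	pod := podSpec{Name: "web", CPUMilli: 500, MemoryBytes: 256 << 20}
	nodes := []nodeSpec{
		{Name: "node-a", FreeCPUMilli: 2000, FreeMemoryBytes: 4 << 30},
		{Name: "node-b", FreeCPUMilli: 100, FreeMemoryBytes: 4 << 30},
	}
	for _, n := range nodes {
		ok, err := predicateHandler(pod, n)
		fmt.Println(n.Name, ok, err)
	}
}
```

Returning an error for every rejection means handleFilter always has a reason to store in canNotSchedule for that node.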

1.2 HTTP response body: ExtenderFilterResult

ExtenderFilterResult tells the scheduler which nodes can be scheduled onto and which cannot.

// ExtenderFilterResult represents the results of a filter call to an extender
type ExtenderFilterResult struct {
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
	// Filtered out nodes where the pod can't be scheduled and the failure messages
	FailedNodes FailedNodesMap
	// Error message indicating failure
	Error string
}

2. How the scheduler extender works

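2.1 Sample extender scheduler policy file: policy.yaml

urlPrefix: the address prefix of the callback service registered with the scheduler
enableHttps: whether the extender service is served over HTTPS
nodeCacheCapable: if set, the scheduler passes only the list of node names. If not set, the scheduler passes the complete node objects.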

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
  }],
  "hardPodAffinitySymmetricWeight" : 10
}

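2.2 Defining the KubeSchedulerConfiguration

Configure the scheduler and point it at the policy file that defines the scheduling algorithm. The file below, schedulerConfig.yaml, uses the policy.yaml from section 2.1.

When kube-scheduler starts, the --config=schedulerConfig.yaml flag specifies the scheduler configuration file, which in turn references the policy file. Users can assemble predicate and priority functions as needed. Choosing different filter and priority functions, changing the weights of priority functions, and reordering the filter functions all affect the scheduling process.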

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: policy.yaml
leaderElection:
  leaderElect: true
  lockObjectName: my-scheduler
  lockObjectNamespace: kube-system

3. Source code analysis of the k8s scheduler extender

3.1 Reading the configuration when file or configMap is set

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: extender-policy.yaml

source := schedulerAlgorithmSource
switch {
case source.Provider != nil:
	// Create the config from a named algorithm provider.
	sc, err := configurator.CreateFromProvider(*source.Provider)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
	}
	config = sc
case source.Policy != nil:
	// Create the config from a user specified policy source.
	policy := &schedulerapi.Policy{}
	switch {
	case source.Policy.File != nil:
		if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
			return nil, err
		}
	case source.Policy.ConfigMap != nil:
		if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
			return nil, err
		}
	}
	sc, err := configurator.CreateFromConfig(*policy)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
	}
	config = sc

3.1.1 CreateFromConfig consumes the policy loaded from policy.yaml or a ConfigMap

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
   }],
  "hardPodAffinitySymmetricWeight" : 10
}

3.1.2 The SchedulerExtender interface

The methods are straightforward. Path: pkg/scheduler/algorithm/scheduler_interface.go

type SchedulerExtender interface {

	// Filter based on extender-implemented predicate functions. The filtered list is
	// expected to be a subset of the supplied list. failedNodesMap optionally contains
	// the list of failed nodes and failure reasons.
	Filter(pod *v1.Pod,
		nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	) (filteredNodes []*v1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)

	// Prioritize based on extender-implemented priority functions. The returned scores & weight
	// are used to compute the weighted score for an extender. The weighted scores are added to
	// the scores computed  by Kubernetes scheduler. The total scores are used to do the host selection.
	Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *schedulerapi.HostPriorityList, weight int, err error)

	// Bind delegates the action of binding a pod to a node to the extender.
	Bind(binding *v1.Binding) error
   ..................................
}

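3.1.3 The ExtenderConfig struct

When a pod is scheduled, the extender lets an external process filter and prioritize nodes. An extender can also bind the pod to a node itself.

To use an extender, create a scheduler policy configuration file that tells the scheduler how to reach the extender.

"extenders" : [{
  "urlPrefix": "http://localhost:8880",
  "filterVerb": "predicates",
  "prioritizeVerb": "priorities",
  "preemptVerb": "preemption",
  "bindVerb": "",
  "weight": 1,
  "enableHttps": false,
  "nodeCacheCapable": false
}],

URLPrefix: the address prefix at which the extender service is available
FilterVerb: the path of the filter call. With the configuration above it is http://localhost:8880/predicates. If empty, filtering is not supported.
enableHttps: whether the extender service is served over HTTPS
nodeCacheCapable: if set, the scheduler passes only the list of node names. If not set, the scheduler passes the complete node objects.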

type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
	FilterVerb string
	// Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
	PreemptVerb string
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
	PrioritizeVerb string
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer
	Weight int
	// Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
	// If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
	// can implement this function.
	BindVerb string
	// EnableHTTPS specifies whether https should be used to communicate with the extender
	EnableHTTPS bool
	// TLSConfig specifies the transport layer security config
	TLSConfig *ExtenderTLSConfig
	// HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
	// timeout is ignored, k8s/other extenders priorities are used to select the node.
	HTTPTimeout time.Duration
	// NodeCacheCapable specifies that the extender is capable of caching node information,
	// so the scheduler should only send minimal information about the eligible nodes
	// assuming that the extender already cached full details of all nodes in the cluster
	NodeCacheCapable bool
	// ManagedResources is a list of extended resources that are managed by
	// this extender.
	// - A pod will be sent to the extender on the Filter, Prioritize and Bind
	//   (if the extender is the binder) phases iff the pod requests at least
	//   one of the extended resources in this list. If empty or unspecified,
	//   all pods will be sent to this extender.
	// - If IgnoredByScheduler is set to true for a resource, kube-scheduler
	//   will skip checking the resource in predicates.
	// +optional
	ManagedResources []ExtenderManagedResource
	// Ignorable specifies if the extender is ignorable, i.e. scheduling should not
	// fail when the extender returns an error or is not reachable.
	Ignorable bool
}

3.2 HTTPExtender implements the SchedulerExtender interface

NewHTTPExtender instantiates an HTTPExtender object, which implements the SchedulerExtender interface.

var extenders []algorithm.SchedulerExtender
if len(policy.ExtenderConfigs) != 0 {
	ignoredExtendedResources := sets.NewString()
	var ignorableExtenders []algorithm.SchedulerExtender
	for ii := range policy.ExtenderConfigs {
		klog.V(2).Infof("Creating extender with config %+v", policy.ExtenderConfigs[ii])
		extender, err := core.NewHTTPExtender(&policy.ExtenderConfigs[ii])

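Where is the extender called? See below.

3.3 The findNodesThatFit function

In the filtering (predicate) phase, the scheduler calls each extender to filter the nodes once more according to its custom rules. This is the Filter method of HTTPExtender.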

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}

	if len(filtered) > 0 && len(g.extenders) != 0 {
		for _, extender := range g.extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			filteredList, failedMap, err := extender.Filter(pod, filtered, g.nodeInfoSnapshot.NodeInfoMap)

3.3.1 The Filter method of HTTPExtender

Here ExtenderArgs is the request body and ExtenderFilterResult is the response structure.

// Filter based on extender implemented predicate functions. The filtered list is
// expected to be a subset of the supplied list. failedNodesMap optionally contains
// the list of failed nodes and failure reasons.
func (h *HTTPExtender) Filter(
	pod *v1.Pod,
	nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
) ([]*v1.Node, schedulerapi.FailedNodesMap, error) {
	var (
		result     schedulerapi.ExtenderFilterResult
		nodeList   *v1.NodeList
		nodeNames  *[]string
		nodeResult []*v1.Node
		args       *schedulerapi.ExtenderArgs
	)

3.3.1.1 If filterVerb is empty, no nodes are filtered out

if h.filterVerb == "" {
	return nodes, schedulerapi.FailedNodesMap{}, nil
}

3.3.1.2 Whether nodeCacheCapable is set

If nodeCacheCapable is set, the scheduler passes only the list of node names. If it is not set, the scheduler passes the complete node objects.

if h.nodeCacheCapable {
	nodeNameSlice := make([]string, 0, len(nodes))
	for _, node := range nodes {
		nodeNameSlice = append(nodeNameSlice, node.Name)
	}
	nodeNames = &nodeNameSlice
} else {
	nodeList = &v1.NodeList{}
	for _, node := range nodes {
		nodeList.Items = append(nodeList.Items, *node)
	}
}
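
On the extender side, this means a request may carry either Nodes or NodeNames. A minimal sketch of handling both cases is shown below; node, nodeList, and extenderArgs are hypothetical, trimmed stand-ins for the Kubernetes types, and candidateNames is an illustrative helper, not part of any library.

```go
package main

import "fmt"

// node and nodeList are hypothetical, trimmed stand-ins for v1.Node and v1.NodeList.
type node struct{ Name string }
type nodeList struct{ Items []node }

// extenderArgs is a hypothetical, trimmed stand-in for ExtenderArgs.
type extenderArgs struct {
	Nodes     *nodeList // set when nodeCacheCapable is false
	NodeNames *[]string // set when nodeCacheCapable is true
}

// candidateNames returns the candidate node names regardless of which field was populated.
func candidateNames(args extenderArgs) []string {
	if args.NodeNames != nil {
		// Copy so the caller cannot mutate the request data.
		return append([]string(nil), (*args.NodeNames)...)
	}
	if args.Nodes == nil {
		return nil
	}
	names := make([]string, 0, len(args.Nodes.Items))
	for _, n := range args.Nodes.Items {
		names = append(names, n.Name)
	}
	return names
}

func main() {
	cached := []string{"node-a", "node-b"}
	fmt.Println(candidateNames(extenderArgs{NodeNames: &cached}))
	fmt.Println(candidateNames(extenderArgs{Nodes: &nodeList{Items: []node{{Name: "node-c"}}}}))
}
```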

3.3.1.3 Build the ExtenderArgs structure, containing the pod to schedule and the node list

args = &schedulerapi.ExtenderArgs{
	Pod:       pod,
	Nodes:     nodeList,
	NodeNames: nodeNames,
}

if err := h.send(h.filterVerb, args, &result); err != nil {
	return nil, nil, err
}

3.3.1.4 Sending the HTTP request

ExtenderArgs is serialized to JSON and the request URL is urlPrefix + "/" + filterVerb, so the main logic lives in the custom extender service.

// Helper function to send messages to the extender
func (h *HTTPExtender) send(action string, args interface{}, result interface{}) error {
	out, err := json.Marshal(args)
	if err != nil {
		return err
	}

	url := strings.TrimRight(h.extenderURL, "/") + "/" + action

	req, err := http.NewRequest("POST", url, bytes.NewReader(out))
	if err != nil {
		return err
	}

	req.Header.Set("Content-Type", "application/json")

	resp, err := h.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
	}

	return json.NewDecoder(resp.Body).Decode(result)
}
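
The URL construction in send can be tried in isolation. The sketch below reuses the same strings.TrimRight expression; extenderEndpoint is a hypothetical helper name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// extenderEndpoint joins a urlPrefix and a verb the same way send does:
// trailing slashes on the prefix are trimmed, then "/" and the verb are appended.
func extenderEndpoint(urlPrefix, verb string) string {
	return strings.TrimRight(urlPrefix, "/") + "/" + verb
}

func main() {
	fmt.Println(extenderEndpoint("http://localhost:8880", "predicates"))
	fmt.Println(extenderEndpoint("http://localhost:8880/", "priorities"))
}
```

Because trailing slashes are trimmed, urlPrefix values with and without a final "/" resolve to the same endpoint.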

How is the prioritize phase handled? See section 4.

4. The PrioritizeNodes function

// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
	pod *v1.Pod,
	nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	meta interface{},
	priorityConfigs []priorities.PriorityConfig,
	nodes []*v1.Node,
	extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {

4.1 All extenders are processed concurrently

The scores from all results are accumulated. Next we look at the Prioritize method of HTTPExtender.

if len(extenders) != 0 && nodes != nil {
	combinedScores := make(map[string]int, len(nodeNameToInfo))
	for i := range extenders {
		if !extenders[i].IsInterested(pod) {
			continue
		}
		wg.Add(1)
		go func(extIndex int) {
			defer wg.Done()
			prioritizedList, weight, err := extenders[extIndex].Prioritize(pod, nodes)
			if err != nil {
				// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
				return
			}
			mu.Lock()
			for i := range *prioritizedList {
				host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
				if klog.V(10) {
					klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, extenders[extIndex].Name(), score)
				}
				combinedScores[host] += score * weight
			}
			mu.Unlock()
		}(i)
	}
	// wait for all go routines to finish
	wg.Wait()
	for i := range result {
		result[i].Score += combinedScores[result[i].Host]
	}
}
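
The accumulation pattern above (one goroutine per extender, a mutex around the shared map, then adding the combined score to the base result) can be sketched as a standalone program. hostPriority and scorer are hypothetical stand-ins, and each scorer returns fixed scores instead of making an HTTP call.

```go
package main

import (
	"fmt"
	"sync"
)

// hostPriority is a hypothetical stand-in for schedulerapi.HostPriority.
type hostPriority struct {
	Host  string
	Score int
}

// scorer is a hypothetical stand-in for one extender: a weight plus the scores it returns.
type scorer struct {
	weight int
	scores []hostPriority
}

// combine adds score*weight from every scorer to the base scores.
func combine(base []hostPriority, scorers []scorer) []hostPriority {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	combined := make(map[string]int, len(base))
	for i := range scorers {
		wg.Add(1)
		go func(s scorer) {
			defer wg.Done()
			mu.Lock() // the map is shared across goroutines
			defer mu.Unlock()
			for _, hp := range s.scores {
				combined[hp.Host] += hp.Score * s.weight
			}
		}(scorers[i])
	}
	wg.Wait() // wait for all scorers before reading the map
	out := make([]hostPriority, len(base))
	copy(out, base)
	for i := range out {
		out[i].Score += combined[out[i].Host]
	}
	return out
}

func main() {
	base := []hostPriority{{"node-a", 10}, {"node-b", 12}}
	scorers := []scorer{
		{weight: 2, scores: []hostPriority{{"node-a", 3}, {"node-b", 1}}},
		{weight: 1, scores: []hostPriority{{"node-a", 0}, {"node-b", 5}}},
	}
	fmt.Println(combine(base, scorers))
}
```

With these inputs node-a ends at 10 + 3*2 + 0*1 = 16 and node-b at 12 + 1*2 + 5*1 = 19.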

4.2 The Prioritize method of HTTPExtender

// Prioritize based on extender implemented priority functions. Weight*priority is added
// up for each such priority function. The returned score is added to the score computed
// by Kubernetes scheduler. The total score is used to do the host selection.
func (h *HTTPExtender) Prioritize(pod *v1.Pod, nodes []*v1.Node) (*schedulerapi.HostPriorityList, int, error) {
	var (
		result    schedulerapi.HostPriorityList
		nodeList  *v1.NodeList
		nodeNames *[]string
		args      *schedulerapi.ExtenderArgs
	)

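4.2.1 If prioritizeVerb is not set

By analogy, this is like handing in a blank exam paper and getting 0. No custom prioritize scoring is implemented, and a score of 0 leaves the original scores unchanged.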

if h.prioritizeVerb == "" {
	result := schedulerapi.HostPriorityList{}
	for _, node := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: 0})
	}
	return &result, 0, nil
}

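HTTPExtender handles filtering and prioritizing the same way. Only the returned result differs: one returns the filtered nodes, the other returns a score for each node.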
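
For the extender service's side of the prioritize call, a scoring function might look like the sketch below. It assumes a simple "more free CPU is better" rule and keeps scores in the 0-10 range mentioned in the PrioritizeNodes comment. nodeUsage, its fields, and the scoring rule are all hypothetical.

```go
package main

import "fmt"

// hostPriority is a hypothetical stand-in for schedulerapi.HostPriority.
type hostPriority struct {
	Host  string
	Score int
}

// nodeUsage is a hypothetical view of a node's CPU capacity in millicores.
type nodeUsage struct {
	Name          string
	FreeCPUMilli  int64
	TotalCPUMilli int64
}

// cpuScore maps the free CPU fraction to an integer in [0, 10].
func cpuScore(free, total int64) int {
	if total <= 0 {
		return 0
	}
	score := int(free * 10 / total)
	if score < 0 {
		return 0
	}
	if score > 10 {
		return 10
	}
	return score
}

// prioritize scores every node; the scheduler multiplies these by the extender weight.
func prioritize(nodes []nodeUsage) []hostPriority {
	out := make([]hostPriority, 0, len(nodes))
	for _, n := range nodes {
		out = append(out, hostPriority{Host: n.Name, Score: cpuScore(n.FreeCPUMilli, n.TotalCPUMilli)})
	}
	return out
}

func main() {
	nodes := []nodeUsage{
		{Name: "node-a", FreeCPUMilli: 2000, TotalCPUMilli: 4000},
		{Name: "node-b", FreeCPUMilli: 4000, TotalCPUMilli: 4000},
		{Name: "node-c", FreeCPUMilli: 0, TotalCPUMilli: 0},
	}
	fmt.Println(prioritize(nodes))
}
```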

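Summary:

Deployment requires pointing the scheduler at a policy configuration that contains the extender settings.

The extender service implements an HTTP server that provides the filter and prioritize methods.

References: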

https://developer.ibm.com/technologies/containers/articles/creating-a-custom-kube-scheduler/

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md
