[Kubernetes source code analysis] Analysis of the k8s scheduler extender

1. Scheduler extender

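    There are three ways to add new scheduling rules (both predicates and priorities) to Kubernetes. This article covers the third.

  • First: add the scheduling rules directly to the Kubernetes source and recompile
  • Second: implement your own scheduler and run it in place of the k8s scheduler
  • Third: implement a scheduler extender, a service the k8s scheduler calls out to in order to extend its scheduling decisions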

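   1.1 HTTP request body: ExtenderArgs

     Once an extender is configured, the scheduler sends HTTP requests to the extender service with a serialized ExtenderArgs as the body. It mainly carries the pod being scheduled and the list of candidate nodes.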

// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod *v1.Pod
	// List of candidate nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// List of candidate node names where the pod can be scheduled; to be
	// populated only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
}
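
To make the request body concrete, here is a minimal sketch of how such a structure serializes to JSON. Pod, Node and NodeList are simplified stand-ins for the v1 types (an assumption made so that the example needs only the standard library), and encodeArgs is a hypothetical helper. The key names in a real request depend on the JSON tags of the API version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for v1.Pod, v1.Node and v1.NodeList (assumption:
// only the name is kept, the real types carry far more fields).
type Pod struct{ Name string }
type Node struct{ Name string }
type NodeList struct{ Items []Node }

// ExtenderArgs mirrors the three fields of the struct shown above.
type ExtenderArgs struct {
	Pod       *Pod
	Nodes     *NodeList
	NodeNames *[]string
}

// encodeArgs serializes the arguments into a JSON request body.
func encodeArgs(args ExtenderArgs) (string, error) {
	out, err := json.Marshal(args)
	return string(out), err
}

func main() {
	args := ExtenderArgs{
		Pod:   &Pod{Name: "nginx"},
		Nodes: &NodeList{Items: []Node{{Name: "node-1"}, {Name: "node-2"}}},
	}
	body, err := encodeArgs(args)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Only one of Nodes and NodeNames is populated in a given request, so the other one serializes as null here.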

     1.1.1 Reading the HTTP body

      With the github.com/emicklei/go-restful package this is simple: call ReadEntity to decode the body directly into an ExtenderArgs variable

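func predicates(r *restful.Request, w *restful.Response) {
	var extenderArgs schedulerapi.ExtenderArgs

	// Decode the request body sent by the scheduler.
	if err := r.ReadEntity(&extenderArgs); err != nil {
		logrus.Errorf("predicate read entity error: %v", err)
		w.WriteErrorString(http.StatusInternalServerError, err.Error())
		return
	}

	// Apply the custom filter (handleFilter, section 1.1.2) and return the result.
	w.WriteEntity(handleFilter(extenderArgs))
}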

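    1.1.2 Defining your own filter rules

     Iterate over all candidate nodes and apply the custom filter rules (CPU, memory, storage and similar metrics). Nodes that pass are added to canSchedule; nodes for which predicateHandler returns an error are recorded in canNotSchedule together with the error message. The result is returned as an ExtenderFilterResult.

     predicateHandler is whatever check fits your situation, for example on CPU, memory or storage.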

func handleFilter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
	pod := args.Pod
	canSchedule := make([]v1.Node, 0, len(args.Nodes.Items))
	canNotSchedule := make(map[string]string)

	for _, node := range args.Nodes.Items {
		result, err := predicateHandler(*pod, node)
		if err != nil {
			canNotSchedule[node.Name] = err.Error()
		} else if result {
			canSchedule = append(canSchedule, node)
		}
	}
	return &schedulerapi.ExtenderFilterResult{
		Nodes: &v1.NodeList{
			Items: canSchedule,
		},
		FailedNodes: canNotSchedule,
		Error:       "",
	}
}
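
The same flow can be sketched without go-restful, using only net/http. This is an illustration under assumptions, not the article's implementation: the types are simplified stand-ins for the scheduler API types, the label example.com/ssd is a hypothetical filter rule standing in for predicateHandler, and the handler is exercised in-process with httptest instead of listening on a port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Simplified stand-ins for the Kubernetes and scheduler API types (assumption).
type Pod struct{ Name string }
type Node struct {
	Name   string
	Labels map[string]string
}
type NodeList struct{ Items []Node }

type ExtenderArgs struct {
	Pod   *Pod
	Nodes *NodeList
}

type ExtenderFilterResult struct {
	Nodes       *NodeList
	FailedNodes map[string]string
	Error       string
}

// filterNodes keeps nodes labelled example.com/ssd=true (hypothetical rule)
// and records a reason for every node it rejects.
func filterNodes(args ExtenderArgs) ExtenderFilterResult {
	result := ExtenderFilterResult{Nodes: &NodeList{}, FailedNodes: map[string]string{}}
	if args.Nodes == nil {
		return result // nothing to filter, e.g. only node names were sent
	}
	for _, node := range args.Nodes.Items {
		if node.Labels["example.com/ssd"] == "true" {
			result.Nodes.Items = append(result.Nodes.Items, node)
		} else {
			result.FailedNodes[node.Name] = "node has no SSD label"
		}
	}
	return result
}

// predicatesHandler decodes the request body, filters, and writes JSON back.
func predicatesHandler(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(filterNodes(args)); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	// A real extender would register the handler and call http.ListenAndServe;
	// here a single request is replayed in-process.
	body := `{"Pod":{"Name":"nginx"},"Nodes":{"Items":[` +
		`{"Name":"node-1","Labels":{"example.com/ssd":"true"}},{"Name":"node-2"}]}}`
	req := httptest.NewRequest(http.MethodPost, "/predicates", strings.NewReader(body))
	rec := httptest.NewRecorder()
	predicatesHandler(rec, req)
	fmt.Print(rec.Body.String())
}
```

Unlike handleFilter above, this sketch records every rejected node in FailedNodes and tolerates a missing Nodes list; both are design choices of the sketch rather than requirements stated by the article.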

   1.2 HTTP response body: ExtenderFilterResult

     ExtenderFilterResult tells the scheduler which nodes the pod can be scheduled on and which it cannot

// ExtenderFilterResult represents the results of a filter call to an extender
type ExtenderFilterResult struct {
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
	// Filtered out nodes where the pod can't be scheduled and the failure messages
	FailedNodes FailedNodesMap
	// Error message indicating failure
	Error string
}

 

2. How the scheduler extender works

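    2.1 Sample extender scheduler policy file: policy.yaml

  •      urlPrefix: the address prefix of the extender service that the scheduler calls back
  •      enableHttps: whether the extender is served over HTTPS
  •      nodeCacheCapable: if true, the scheduler sends only the list of node names; if false, it sends the complete node objects.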
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
  {"name" : "PodFitsHostPorts"},
  {"name" : "PodFitsResources"},
  {"name" : "NoDiskConflict"},
  {"name" : "MatchNodeSelector"},
  {"name" : "HostName"}
  ],
  "priorities" : [
  {"name" : "LeastRequestedPriority", "weight" : 1},
  {"name" : "BalancedResourceAllocation", "weight" : 1},
  {"name" : "ServiceSpreadingPriority", "weight" : 1},
  {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
                   "urlPrefix": "http://localhost:8880",
                   "filterVerb": "predicates",
                   "prioritizeVerb": "priorities",
                   "preemptVerb": "preemption",
                   "bindVerb": "",
                   "weight": 1,
                   "enableHttps": false,
                   "nodeCacheCapable": false
                 }],
  "hardPodAffinitySymmetricWeight" : 10
}

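  2.2 Defining the KubeSchedulerConfiguration

    This file configures the scheduler and points it at the policy file for the scheduling algorithm. It uses the policy.yaml from section 2.1; the file below is schedulerConfig.yaml.

    At startup, kube-scheduler can be given this file through the --config=schedulerConfig.yaml flag, which in turn names the policy file. Users can assemble predicate and priority functions as needed: choosing different filter and priority functions, changing the weights of the priority functions, and reordering the filter functions all affect the scheduling process.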

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: policy.yaml
leaderElection:
  leaderElect: true
  lockObjectName: my-scheduler
  lockObjectNamespace: kube-system

 

3. Source code analysis of the k8s scheduler extender

    3.1 Reading the configuration when the policy source is a file or a ConfigMap

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: extender-policy.yaml

source := schedulerAlgorithmSource
switch {
case source.Provider != nil:
	// Create the config from a named algorithm provider.
	sc, err := configurator.CreateFromProvider(*source.Provider)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
	}
	config = sc
case source.Policy != nil:
	// Create the config from a user specified policy source.
	policy := &schedulerapi.Policy{}
	switch {
	case source.Policy.File != nil:
		if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
			return nil, err
		}
	case source.Policy.ConfigMap != nil:
		if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
			return nil, err
		}
	}
	sc, err := configurator.CreateFromConfig(*policy)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
	}
	config = sc

    3.1.1 CreateFromConfig builds the scheduler from the policy loaded from the policy.yaml file or the ConfigMap

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
   }],
  "hardPodAffinitySymmetricWeight" : 10
}

    3.1.2 The SchedulerExtender interface

    The methods are simple and self-explanatory. Path: pkg/scheduler/algorithm/scheduler_interface.go

type SchedulerExtender interface {

	// Filter based on extender-implemented predicate functions. The filtered list is
	// expected to be a subset of the supplied list. failedNodesMap optionally contains
	// the list of failed nodes and failure reasons.
	Filter(pod *v1.Pod,
		nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	) (filteredNodes []*v1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)

	// Prioritize based on extender-implemented priority functions. The returned scores & weight
	// are used to compute the weighted score for an extender. The weighted scores are added to
	// the scores computed  by Kubernetes scheduler. The total scores are used to do the host selection.
	Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *schedulerapi.HostPriorityList, weight int, err error)

	// Bind delegates the action of binding a pod to a node to the extender.
	Bind(binding *v1.Binding) error
	// ...
}

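    3.1.3 The ExtenderConfig struct

    When a pod is scheduled, the extender lets an external process filter and prioritize nodes; an extender can also take over binding the pod to a node itself

    To use an extender, create a scheduler policy file that tells the scheduler how to reach it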

  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
   }],

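  •     URLPrefix: the address prefix at which the extender service is reachable
  •     FilterVerb: the verb appended to URLPrefix for the filter call; with the configuration above the call goes to http://localhost:8880/predicates. Empty means filtering is not supported
  •     enableHttps: whether the extender is served over HTTPS
  •     nodeCacheCapable: if true, the scheduler sends only the list of node names; if false, it sends the complete node objects.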
type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
	FilterVerb string
	// Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
	PreemptVerb string
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
	PrioritizeVerb string
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer
	Weight int
	// Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
	// If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
	// can implement this function.
	BindVerb string
	// EnableHTTPS specifies whether https should be used to communicate with the extender
	EnableHTTPS bool
	// TLSConfig specifies the transport layer security config
	TLSConfig *ExtenderTLSConfig
	// HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
	// timeout is ignored, k8s/other extenders priorities are used to select the node.
	HTTPTimeout time.Duration
	// NodeCacheCapable specifies that the extender is capable of caching node information,
	// so the scheduler should only send minimal information about the eligible nodes
	// assuming that the extender already cached full details of all nodes in the cluster
	NodeCacheCapable bool
	// ManagedResources is a list of extended resources that are managed by
	// this extender.
	// - A pod will be sent to the extender on the Filter, Prioritize and Bind
	//   (if the extender is the binder) phases iff the pod requests at least
	//   one of the extended resources in this list. If empty or unspecified,
	//   all pods will be sent to this extender.
	// - If IgnoredByScheduler is set to true for a resource, kube-scheduler
	//   will skip checking the resource in predicates.
	// +optional
	ManagedResources []ExtenderManagedResource
	// Ignorable specifies if the extender is ignorable, i.e. scheduling should not
	// fail when the extender returns an error or is not reachable.
	Ignorable bool
}
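
As a rough illustration of how the verbs combine with URLPrefix, the sketch below derives the endpoint for each configured phase. extenderConfig and endpoints are hypothetical names that keep only the URL-related fields; the join itself follows the rule described in the comments above (the verb is appended to the prefix, and an empty verb means the call is not supported).

```go
package main

import (
	"fmt"
	"strings"
)

// extenderConfig keeps only the URL-related fields of ExtenderConfig
// (a trimmed-down, hypothetical copy used for illustration).
type extenderConfig struct {
	URLPrefix      string
	FilterVerb     string
	PrioritizeVerb string
	PreemptVerb    string
	BindVerb       string
}

// endpoints returns the URL called for each phase. Phases whose verb is
// empty are left out because the extender does not support them.
func endpoints(c extenderConfig) map[string]string {
	verbs := map[string]string{
		"filter":     c.FilterVerb,
		"prioritize": c.PrioritizeVerb,
		"preempt":    c.PreemptVerb,
		"bind":       c.BindVerb,
	}
	prefix := strings.TrimRight(c.URLPrefix, "/")
	out := make(map[string]string, len(verbs))
	for phase, verb := range verbs {
		if verb == "" {
			continue
		}
		out[phase] = prefix + "/" + verb
	}
	return out
}

func main() {
	// Values taken from the sample "extenders" entry above.
	cfg := extenderConfig{
		URLPrefix:      "http://localhost:8880",
		FilterVerb:     "predicates",
		PrioritizeVerb: "priorities",
		PreemptVerb:    "preemption",
		BindVerb:       "",
	}
	for phase, url := range endpoints(cfg) {
		fmt.Printf("%-10s %s\n", phase, url)
	}
}
```

With the sample configuration, the bind phase is absent from the result because bindVerb is empty.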

  3.2 HTTPExtender implements the SchedulerExtender interface

    NewHTTPExtender instantiates an HTTPExtender, which implements the SchedulerExtender interface

var extenders []algorithm.SchedulerExtender
if len(policy.ExtenderConfigs) != 0 {
	ignoredExtendedResources := sets.NewString()
	var ignorableExtenders []algorithm.SchedulerExtender
	for ii := range policy.ExtenderConfigs {
		klog.V(2).Infof("Creating extender with config %+v", policy.ExtenderConfigs[ii])
		extender, err := core.NewHTTPExtender(&policy.ExtenderConfigs[ii])

    Where is the extender actually called? Read on.

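   3.3 The findNodesThatFit function

    In the filtering (predicate) phase, the nodes that have already passed are handed to the extenders, which filter them again by their custom rules. The call made here is the Filter method of HTTPExtender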

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}

	if len(filtered) > 0 && len(g.extenders) != 0 {
		for _, extender := range g.extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			filteredList, failedMap, err := extender.Filter(pod, filtered, g.nodeInfoSnapshot.NodeInfoMap)

  3.3.1 The Filter method of HTTPExtender

   ExtenderArgs is the request body and ExtenderFilterResult is the structure the response is decoded into

// Filter based on extender implemented predicate functions. The filtered list is
// expected to be a subset of the supplied list. failedNodesMap optionally contains
// the list of failed nodes and failure reasons.
func (h *HTTPExtender) Filter(
	pod *v1.Pod,
	nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
) ([]*v1.Node, schedulerapi.FailedNodesMap, error) {
	var (
		result     schedulerapi.ExtenderFilterResult
		nodeList   *v1.NodeList
		nodeNames  *[]string
		nodeResult []*v1.Node
		args       *schedulerapi.ExtenderArgs
	)

     3.3.1.1 If filterVerb is empty, no nodes are filtered out

if h.filterVerb == "" {
	return nodes, schedulerapi.FailedNodesMap{}, nil
}

     3.3.1.2 Whether nodeCacheCapable is set

     If nodeCacheCapable is set, the scheduler sends only the list of node names. Otherwise it sends the complete node objects.

if h.nodeCacheCapable {
	nodeNameSlice := make([]string, 0, len(nodes))
	for _, node := range nodes {
		nodeNameSlice = append(nodeNameSlice, node.Name)
	}
	nodeNames = &nodeNameSlice
} else {
	nodeList = &v1.NodeList{}
	for _, node := range nodes {
		nodeList.Items = append(nodeList.Items, *node)
	}
}
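
On the extender side, nodeCacheCapable means the request carries only names, so the extender has to look the nodes up in its own cache and answer with names as well. A minimal sketch of that case, assuming a hypothetical cache that maps node names to a free GPU count (the cache, the GPU rule and filterByName are all illustrative, not part of the scheduler):

```go
package main

import "fmt"

// Only the name-based fields of the request and response are modelled here.
type ExtenderArgs struct {
	NodeNames *[]string
}

type ExtenderFilterResult struct {
	NodeNames   *[]string
	FailedNodes map[string]string
}

// filterByName filters the candidate names using the extender's own cache.
// cache is hypothetical: node name -> number of free GPUs.
func filterByName(args ExtenderArgs, cache map[string]int) ExtenderFilterResult {
	passed := []string{}
	failed := map[string]string{}
	if args.NodeNames != nil {
		for _, name := range *args.NodeNames {
			free, known := cache[name]
			switch {
			case !known:
				failed[name] = "node not found in extender cache"
			case free == 0:
				failed[name] = "no free GPU"
			default:
				passed = append(passed, name)
			}
		}
	}
	return ExtenderFilterResult{NodeNames: &passed, FailedNodes: failed}
}

func main() {
	cache := map[string]int{"node-1": 2, "node-2": 0}
	names := []string{"node-1", "node-2", "node-3"}
	res := filterByName(ExtenderArgs{NodeNames: &names}, cache)
	fmt.Println(*res.NodeNames, res.FailedNodes)
}
```

A handler written only against args.Nodes.Items would dereference a nil pointer in this mode, which is why the sketch reads NodeNames and checks it for nil.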

     3.3.1.3 Build the ExtenderArgs, containing the pod to schedule and the node list

args = &schedulerapi.ExtenderArgs{
	Pod:       pod,
	Nodes:     nodeList,
	NodeNames: nodeNames,
}

if err := h.send(h.filterVerb, args, &result); err != nil {
	return nil, nil, err
}

     3.3.1.4 Sending the HTTP request

     ExtenderArgs is serialized to JSON and POSTed to urlPrefix + "/" + filterVerb, so the actual filtering logic lives in the custom extender service

// Helper function to send messages to the extender
func (h *HTTPExtender) send(action string, args interface{}, result interface{}) error {
	out, err := json.Marshal(args)
	if err != nil {
		return err
	}

	url := strings.TrimRight(h.extenderURL, "/") + "/" + action

	req, err := http.NewRequest("POST", url, bytes.NewReader(out))
	if err != nil {
		return err
	}

	req.Header.Set("Content-Type", "application/json")

	resp, err := h.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
	}

	return json.NewDecoder(resp.Body).Decode(result)
}
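
The request/response cycle of send can be tried out locally with a throwaway server. The sketch below follows the same steps (marshal, POST to prefix + "/" + action, check the status code, decode) using only the standard library. post and roundTrip are hypothetical helpers, and the fake extender simply echoes back the node names it receives.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// post follows the same steps as send above.
func post(client *http.Client, urlPrefix, action string, args, result interface{}) error {
	out, err := json.Marshal(args)
	if err != nil {
		return err
	}
	url := strings.TrimRight(urlPrefix, "/") + "/" + action
	resp, err := client.Post(url, "application/json", bytes.NewReader(out))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(result)
}

// roundTrip starts a fake extender that echoes the request body on
// /predicates, sends it the given node names, and returns what came back.
func roundTrip(names []string) ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/predicates" {
			http.NotFound(w, r)
			return
		}
		var in map[string][]string
		if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(in)
	}))
	defer srv.Close()

	var result map[string][]string
	args := map[string][]string{"NodeNames": names}
	// The trailing slash is trimmed by post, as in the original send.
	err := post(srv.Client(), srv.URL+"/", "predicates", args, &result)
	return result["NodeNames"], err
}

func main() {
	names, err := roundTrip([]string{"node-1", "node-2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```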

   How is the prioritize phase handled? See section 4

4. The PrioritizeNodes function

// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
	pod *v1.Pod,
	nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	meta interface{},
	priorityConfigs []priorities.PriorityConfig,
	nodes []*v1.Node,
	extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {

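   4.1 All extenders are called concurrently

    Each extender's scores are multiplied by its weight and added to the node totals. The Prioritize method of HTTPExtender is analyzed next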

if len(extenders) != 0 && nodes != nil {
	combinedScores := make(map[string]int, len(nodeNameToInfo))
	for i := range extenders {
		if !extenders[i].IsInterested(pod) {
			continue
		}
		wg.Add(1)
		go func(extIndex int) {
			defer wg.Done()
			prioritizedList, weight, err := extenders[extIndex].Prioritize(pod, nodes)
			if err != nil {
				// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
				return
			}
			mu.Lock()
			for i := range *prioritizedList {
				host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
				if klog.V(10) {
					klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, extenders[extIndex].Name(), score)
				}
				combinedScores[host] += score * weight
			}
			mu.Unlock()
		}(i)
	}
	// wait for all go routines to finish
	wg.Wait()
	for i := range result {
		result[i].Score += combinedScores[result[i].Host]
	}
}
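
The accumulation pattern above (one goroutine per extender, a mutex around the shared map, then a merge into the base scores) can be reproduced in isolation. In this sketch scorer is a hypothetical stand-in for an extender that has already returned its scores, and combine is an illustrative helper, not a scheduler function.

```go
package main

import (
	"fmt"
	"sync"
)

type HostPriority struct {
	Host  string
	Score int
}

// scorer stands in for one extender: a weight plus the scores it returned.
type scorer struct {
	weight int
	scores []HostPriority
}

// combine adds score*weight from every scorer to the base scores,
// collecting the contributions concurrently under a mutex.
func combine(base []HostPriority, scorers []scorer) []HostPriority {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	combined := make(map[string]int, len(base))
	for i := range scorers {
		wg.Add(1)
		go func(s scorer) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			for _, hp := range s.scores {
				combined[hp.Host] += hp.Score * s.weight
			}
		}(scorers[i])
	}
	wg.Wait() // all contributions are in before the merge

	result := make([]HostPriority, len(base))
	copy(result, base)
	for i := range result {
		result[i].Score += combined[result[i].Host]
	}
	return result
}

func main() {
	base := []HostPriority{{"node-1", 5}, {"node-2", 7}}
	scorers := []scorer{
		{weight: 2, scores: []HostPriority{{"node-1", 10}, {"node-2", 0}}},
		{weight: 1, scores: []HostPriority{{"node-1", 1}, {"node-2", 3}}},
	}
	fmt.Println(combine(base, scorers))
}
```

With these inputs node-1 ends at 5 + 2*10 + 1*1 = 26 and node-2 at 7 + 2*0 + 1*3 = 10. Because addition is commutative, the order in which the goroutines run does not change the result.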

   4.2 The Prioritize method of HTTPExtender

// Prioritize based on extender implemented priority functions. Weight*priority is added
// up for each such priority function. The returned score is added to the score computed
// by Kubernetes scheduler. The total score is used to do the host selection.
func (h *HTTPExtender) Prioritize(pod *v1.Pod, nodes []*v1.Node) (*schedulerapi.HostPriorityList, int, error) {
	var (
		result    schedulerapi.HostPriorityList
		nodeList  *v1.NodeList
		nodeNames *[]string
		args      *schedulerapi.ExtenderArgs
	)

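     4.2.1 If prioritizeVerb is not set

     This is like handing in a blank exam paper and getting 0: the extender does not implement custom scoring, every node receives a score of 0, and the existing scores are unaffected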

if h.prioritizeVerb == "" {
	result := schedulerapi.HostPriorityList{}
	for _, node := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: 0})
	}
	return &result, 0, nil
}

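    The HTTPExtender handles filtering and prioritizing the same way; only the result differs: one returns the filtered set of nodes, the other returns a score for each node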
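
For the extender side of the prioritize call, a scoring function might look like the sketch below. It assumes a hypothetical per-node metric (free capacity, non-negative) and scales it linearly into the 0-10 range mentioned in the PrioritizeNodes comment; the prioritize helper and the metric are illustrative only.

```go
package main

import "fmt"

// HostPriority mirrors the Host/Score pair used in HostPriorityList.
type HostPriority struct {
	Host  string
	Score int
}

const maxPriority = 10 // upper bound of the 0-10 score range

// prioritize gives the node with the most free capacity the top score and
// scales the others proportionally. free is a hypothetical metric and is
// assumed to be non-negative; unknown nodes count as 0.
func prioritize(free map[string]int, nodeNames []string) []HostPriority {
	highest := 0
	for _, name := range nodeNames {
		if free[name] > highest {
			highest = free[name]
		}
	}
	out := make([]HostPriority, 0, len(nodeNames))
	for _, name := range nodeNames {
		score := 0
		if highest > 0 {
			score = free[name] * maxPriority / highest
		}
		out = append(out, HostPriority{Host: name, Score: score})
	}
	return out
}

func main() {
	free := map[string]int{"node-1": 4, "node-2": 2, "node-3": 0}
	fmt.Println(prioritize(free, []string{"node-1", "node-2", "node-3"}))
}
```

The scheduler multiplies each returned score by the extender's configured weight before adding it to its own totals, so the extender itself only needs to rank nodes within the fixed range.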

 

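Summary:

   For deployment, the scheduler must be pointed at a policy that contains the extender configuration

   The extender service runs an HTTP server that implements the filter and prioritize endpoints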

 

References:

    https://developer.ibm.com/technologies/containers/articles/creating-a-custom-kube-scheduler/

    https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md
