1. Scheduler extender
There are three ways to add new scheduling rules (predicates and priorities) to Kubernetes; this article covers the third:
- First: add the rules directly to the Kubernetes scheduler source and recompile.
- Second: implement your own scheduler and replace the default kube-scheduler.
- Third: implement a scheduler extender, which provides a way to extend the default scheduler.
1.1 HTTP request body structure: ExtenderArgs
The configured scheduler sends an HTTP request to the extender service with a serialized ExtenderArgs as the body. It mainly carries the pod being scheduled and the list of candidate nodes.
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
    // Pod being scheduled
    Pod *v1.Pod
    // List of candidate nodes where the pod can be scheduled; to be populated
    // only if ExtenderConfig.NodeCacheCapable == false
    Nodes *v1.NodeList
    // List of candidate node names where the pod can be scheduled; to be
    // populated only if ExtenderConfig.NodeCacheCapable == true
    NodeNames *[]string
}
1.1.1 Reading the HTTP body
With the github.com/emicklei/go-restful package this is simple: ReadEntity decodes the body straight into an ExtenderArgs variable.
func predicates(r *restful.Request, w *restful.Response) {
    var extenderArgs schedulerapi.ExtenderArgs
    if err := r.ReadEntity(&extenderArgs); err != nil {
        logrus.Errorf("predicate read entity error: %v", err)
        w.WriteErrorString(http.StatusInternalServerError, err.Error())
        return
    }
    // Apply the custom filter rules and write the result back as JSON.
    w.WriteEntity(handleFilter(extenderArgs))
}
1.1.2 Defining your own filter rules
Iterate over all candidate nodes and apply the custom filter rule; nodes that pass go into canSchedule, nodes that fail go into canNotSchedule, and the result is returned as an ExtenderFilterResult.
predicateHandler is where you define the rule itself, for example based on CPU, memory, or storage metrics.
func handleFilter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
    pod := args.Pod
    canSchedule := make([]v1.Node, 0, len(args.Nodes.Items))
    canNotSchedule := make(map[string]string)
    for _, node := range args.Nodes.Items {
        result, err := predicateHandler(*pod, node)
        if err != nil {
            canNotSchedule[node.Name] = err.Error()
        } else if result {
            canSchedule = append(canSchedule, node)
        }
    }
    return &schedulerapi.ExtenderFilterResult{
        Nodes: &v1.NodeList{
            Items: canSchedule,
        },
        FailedNodes: canNotSchedule,
        Error:       "",
    }
}
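predicateHandler is left to you to implement. As a minimal sketch, a rule that rejects nodes whose free memory is below the pod's request might look like the following; the simplified pod/node types are hypothetical stand-ins for the real v1 types so the example is self-contained:

```go
package main

import "fmt"

// Hypothetical simplified types standing in for v1.Pod / v1.Node.
type simplePod struct {
	Name       string
	MemRequest int64 // bytes requested by the pod
}

type simpleNode struct {
	Name    string
	MemFree int64 // allocatable bytes still free on the node
}

// predicateHandler returns true when the node can host the pod.
// A non-nil error marks the node unschedulable; the message ends up in FailedNodes.
func predicateHandler(p simplePod, n simpleNode) (bool, error) {
	if n.MemFree < p.MemRequest {
		return false, fmt.Errorf("node %s: free memory %d < requested %d", n.Name, n.MemFree, p.MemRequest)
	}
	return true, nil
}

func main() {
	p := simplePod{Name: "demo", MemRequest: 512}
	for _, n := range []simpleNode{{"node-a", 1024}, {"node-b", 256}} {
		ok, err := predicateHandler(p, n)
		fmt.Println(n.Name, ok, err)
	}
}
```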
1.2 HTTP response body structure: ExtenderFilterResult
ExtenderFilterResult is the structure used to tell the scheduler which nodes are schedulable and which are not.
// ExtenderFilterResult represents the results of a filter call to an extender
type ExtenderFilterResult struct {
    // Filtered set of nodes where the pod can be scheduled; to be populated
    // only if ExtenderConfig.NodeCacheCapable == false
    Nodes *v1.NodeList
    // Filtered set of nodes where the pod can be scheduled; to be populated
    // only if ExtenderConfig.NodeCacheCapable == true
    NodeNames *[]string
    // Filtered out nodes where the pod can't be scheduled and the failure messages
    FailedNodes FailedNodesMap
    // Error message indicating failure
    Error string
}
2. How the scheduler extender works
2.1 Sample scheduler policy file, policy.yaml
- urlPrefix is the address prefix of the extender service the scheduler should call back to
- enableHttps specifies whether the service uses HTTPS
- nodeCacheCapable: if set, the scheduler only sends the list of node names; if not, it sends the complete NodeList with full node objects
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1},
    {"name": "EqualPriority", "weight": 1}
  ],
  "extenders": [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
  }],
  "hardPodAffinitySymmetricWeight": 10
}
2.2 Defining the KubeSchedulerConfiguration
Configure the scheduler and point it at the policy file above (the 2.1 policy.yaml) via the schedulerConfig.yaml below.
kube-scheduler accepts --config=schedulerConfig.yaml at startup to select the scheduling policy file. Users can compose predicates and priority functions as needed: choosing different filter and priority functions, adjusting priority weights, and reordering filters all affect the scheduling process.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: policy.yaml
leaderElection:
  leaderElect: true
  lockObjectName: my-scheduler
  lockObjectNamespace: kube-system
3. Source-code walkthrough of the k8s scheduler extender
3.1 Reading the configuration when a file or ConfigMap source is set
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: extender-policy.yaml
source := schedulerAlgorithmSource
switch {
case source.Provider != nil:
    // Create the config from a named algorithm provider.
    sc, err := configurator.CreateFromProvider(*source.Provider)
    if err != nil {
        return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
    }
    config = sc
case source.Policy != nil:
    // Create the config from a user specified policy source.
    policy := &schedulerapi.Policy{}
    switch {
    case source.Policy.File != nil:
        if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
            return nil, err
        }
    case source.Policy.ConfigMap != nil:
        if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
            return nil, err
        }
    }
    sc, err := configurator.CreateFromConfig(*policy)
    if err != nil {
        return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
    }
    config = sc
3.1.1 CreateFromConfig reads the configured policy.yaml file or ConfigMap
The policy content is the same policy.yaml shown in section 2.1.
3.1.2 The SchedulerExtender interface
The interface is short and clear; see pkg/scheduler/algorithm/scheduler_interface.go:
type SchedulerExtender interface {
    // Filter based on extender-implemented predicate functions. The filtered list is
    // expected to be a subset of the supplied list. failedNodesMap optionally contains
    // the list of failed nodes and failure reasons.
    Filter(pod *v1.Pod,
        nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
    ) (filteredNodes []*v1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)
    // Prioritize based on extender-implemented priority functions. The returned scores & weight
    // are used to compute the weighted score for an extender. The weighted scores are added to
    // the scores computed by Kubernetes scheduler. The total scores are used to do the host selection.
    Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *schedulerapi.HostPriorityList, weight int, err error)
    // Bind delegates the action of binding a pod to a node to the extender.
    Bind(binding *v1.Binding) error
    // ...
}
3.1.3 The ExtenderConfig struct
When scheduling a pod, the extender lets an external process filter and prioritize nodes; an extender can also bind the pod to a node directly.
To use an extender, you create a scheduler policy configuration file that tells the scheduler how to reach it.
"extenders" : [{
"urlPrefix": "http://localhost:8880",
"filterVerb": "predicates",
"prioritizeVerb": "priorities",
"preemptVerb": "preemption",
"bindVerb": "",
"weight": 1,
"enableHttps": false,
"nodeCacheCapable": false
}],
- URLPrefix is the address prefix at which the extender service is available
- FilterVerb is the filter endpoint; with the configuration above the filter call goes to http://localhost:8880/predicates. An empty verb means filtering is not supported.
- enableHttps specifies whether the service uses HTTPS
- nodeCacheCapable: if set, the scheduler only sends the list of node names; if not, it sends the complete NodeList with full node objects
type ExtenderConfig struct {
    // URLPrefix at which the extender is available
    URLPrefix string
    // Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
    FilterVerb string
    // Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
    PreemptVerb string
    // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
    PrioritizeVerb string
    // The numeric multiplier for the node scores that the prioritize call generates.
    // The weight should be a positive integer
    Weight int
    // Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
    // If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
    // can implement this function.
    BindVerb string
    // EnableHTTPS specifies whether https should be used to communicate with the extender
    EnableHTTPS bool
    // TLSConfig specifies the transport layer security config
    TLSConfig *ExtenderTLSConfig
    // HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
    // timeout is ignored, k8s/other extenders priorities are used to select the node.
    HTTPTimeout time.Duration
    // NodeCacheCapable specifies that the extender is capable of caching node information,
    // so the scheduler should only send minimal information about the eligible nodes
    // assuming that the extender already cached full details of all nodes in the cluster
    NodeCacheCapable bool
    // ManagedResources is a list of extended resources that are managed by
    // this extender.
    // - A pod will be sent to the extender on the Filter, Prioritize and Bind
    //   (if the extender is the binder) phases iff the pod requests at least
    //   one of the extended resources in this list. If empty or unspecified,
    //   all pods will be sent to this extender.
    // - If IgnoredByScheduler is set to true for a resource, kube-scheduler
    //   will skip checking the resource in predicates.
    // +optional
    ManagedResources []ExtenderManagedResource
    // Ignorable specifies if the extender is ignorable, i.e. scheduling should not
    // fail when the extender returns an error or is not reachable.
    Ignorable bool
}
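For illustration, a hypothetical extender entry exercising some of the optional fields above. The resource name and values are made up, and note the assumption that httpTimeout, as a Go time.Duration, is encoded in nanoseconds (30000000000 = 30s) in the v1 policy JSON:

```json
{
  "urlPrefix": "http://localhost:8880",
  "filterVerb": "predicates",
  "weight": 1,
  "httpTimeout": 30000000000,
  "ignorable": true,
  "managedResources": [
    {"name": "example.com/foo", "ignoredByScheduler": true}
  ]
}
```

With managedResources set like this, only pods requesting example.com/foo are sent to the extender, and kube-scheduler skips checking that resource in its own predicates; ignorable keeps scheduling alive if the extender is unreachable.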
3.2 HTTPExtender implements the SchedulerExtender interface
NewHTTPExtender instantiates an HTTPExtender object, which implements SchedulerExtender:
var extenders []algorithm.SchedulerExtender
if len(policy.ExtenderConfigs) != 0 {
    ignoredExtendedResources := sets.NewString()
    var ignorableExtenders []algorithm.SchedulerExtender
    for ii := range policy.ExtenderConfigs {
        klog.V(2).Infof("Creating extender with config %+v", policy.ExtenderConfigs[ii])
        extender, err := core.NewHTTPExtender(&policy.ExtenderConfigs[ii])
So where are the extenders actually invoked? Read on.
3.3 The findNodesThatFit function
In the predicate (filtering) phase, the extenders are called to filter the nodes again according to the custom rules; this is where HTTPExtender's Filter method runs.
// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
    var filtered []*v1.Node
    failedPredicateMap := FailedPredicateMap{}
    // ...
    if len(filtered) > 0 && len(g.extenders) != 0 {
        for _, extender := range g.extenders {
            if !extender.IsInterested(pod) {
                continue
            }
            filteredList, failedMap, err := extender.Filter(pod, filtered, g.nodeInfoSnapshot.NodeInfoMap)
3.3.1 HTTPExtender's Filter method
Here ExtenderArgs is the main structure of the request body, and ExtenderFilterResult is the response structure.
// Filter based on extender implemented predicate functions. The filtered list is
// expected to be a subset of the supplied list. failedNodesMap optionally contains
// the list of failed nodes and failure reasons.
func (h *HTTPExtender) Filter(
    pod *v1.Pod,
    nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
) ([]*v1.Node, schedulerapi.FailedNodesMap, error) {
    var (
        result     schedulerapi.ExtenderFilterResult
        nodeList   *v1.NodeList
        nodeNames  *[]string
        nodeResult []*v1.Node
        args       *schedulerapi.ExtenderArgs
    )
3.3.1.1 If filterVerb is empty, no nodes are filtered out
    if h.filterVerb == "" {
        return nodes, schedulerapi.FailedNodesMap{}, nil
    }
3.3.1.2 Checking nodeCacheCapable
If nodeCacheCapable is set, the scheduler sends only the node-name list; otherwise it sends the complete node objects.
    if h.nodeCacheCapable {
        nodeNameSlice := make([]string, 0, len(nodes))
        for _, node := range nodes {
            nodeNameSlice = append(nodeNameSlice, node.Name)
        }
        nodeNames = &nodeNameSlice
    } else {
        nodeList = &v1.NodeList{}
        for _, node := range nodes {
            nodeList.Items = append(nodeList.Items, *node)
        }
    }
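On the extender side this means the handler must branch on which field arrived populated. A sketch with simplified stand-in types (assuming the extender keeps its own node cache when only names are sent):

```go
package main

import "fmt"

// Simplified stand-ins: nodeList carries just names instead of full v1.Node objects.
type nodeList struct{ Items []string }
type extArgs struct {
	Nodes     *nodeList
	NodeNames *[]string
}

// candidateNames returns the candidate node names regardless of which
// form the scheduler sent, mirroring the nodeCacheCapable contract.
func candidateNames(a extArgs) []string {
	if a.NodeNames != nil { // nodeCacheCapable == true: names only
		return *a.NodeNames
	}
	if a.Nodes != nil { // nodeCacheCapable == false: full node objects
		return a.Nodes.Items
	}
	return nil
}

func main() {
	names := []string{"n1", "n2"}
	fmt.Println(candidateNames(extArgs{NodeNames: &names}))
	fmt.Println(candidateNames(extArgs{Nodes: &nodeList{Items: []string{"n3"}}}))
}
```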
3.3.1.3 Building the ExtenderArgs, with the pod to schedule and the node list
    args = &schedulerapi.ExtenderArgs{
        Pod:       pod,
        Nodes:     nodeList,
        NodeNames: nodeNames,
    }
    if err := h.send(h.filterVerb, args, &result); err != nil {
        return nil, nil, err
    }
3.3.1.4 Sending the HTTP request
ExtenderArgs is serialized and POSTed to urlPrefix + "/" + filterVerb, so the main filtering logic lives in the custom extender service.
// Helper function to send messages to the extender
func (h *HTTPExtender) send(action string, args interface{}, result interface{}) error {
    out, err := json.Marshal(args)
    if err != nil {
        return err
    }
    url := strings.TrimRight(h.extenderURL, "/") + "/" + action
    req, err := http.NewRequest("POST", url, bytes.NewReader(out))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := h.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("Failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
    }
    return json.NewDecoder(resp.Body).Decode(result)
}
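Note how the URL is joined: any trailing slash on urlPrefix is trimmed before the verb is appended, so both forms of the prefix yield the same endpoint. A quick sketch of the same logic:

```go
package main

import (
	"fmt"
	"strings"
)

// buildURL mirrors the join in HTTPExtender.send: trailing slashes on the
// prefix are trimmed so the verb is appended with exactly one "/".
func buildURL(prefix, verb string) string {
	return strings.TrimRight(prefix, "/") + "/" + verb
}

func main() {
	fmt.Println(buildURL("http://localhost:8880", "predicates"))
	fmt.Println(buildURL("http://localhost:8880/", "predicates"))
}
```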
How is the prioritization (priority) phase handled? See chapter 4.
4. The PrioritizeNodes function
// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
    meta interface{},
    priorityConfigs []priorities.PriorityConfig,
    nodes []*v1.Node,
    extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {
4.1 All extenders are called concurrently
The weighted scores from all extenders are accumulated; next we analyze HTTPExtender's Prioritize method.
    if len(extenders) != 0 && nodes != nil {
        combinedScores := make(map[string]int, len(nodeNameToInfo))
        for i := range extenders {
            if !extenders[i].IsInterested(pod) {
                continue
            }
            wg.Add(1)
            go func(extIndex int) {
                defer wg.Done()
                prioritizedList, weight, err := extenders[extIndex].Prioritize(pod, nodes)
                if err != nil {
                    // Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
                    return
                }
                mu.Lock()
                for i := range *prioritizedList {
                    host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
                    if klog.V(10) {
                        klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, extenders[extIndex].Name(), score)
                    }
                    combinedScores[host] += score * weight
                }
                mu.Unlock()
            }(i)
        }
        // wait for all go routines to finish
        wg.Wait()
        for i := range result {
            result[i].Score += combinedScores[result[i].Host]
        }
    }
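The accumulation above can be sketched as a worked example, with hypothetical scores from two extenders of weights 1 and 2:

```go
package main

import "fmt"

type hostPriority struct {
	Host  string
	Score int
}

// combine adds weight*score from each extender's result list into a per-host
// total, mirroring how PrioritizeNodes merges extender scores.
func combine(lists [][]hostPriority, weights []int) map[string]int {
	combined := map[string]int{}
	for i, list := range lists {
		for _, hp := range list {
			combined[hp.Host] += hp.Score * weights[i]
		}
	}
	return combined
}

func main() {
	extA := []hostPriority{{"node-1", 5}, {"node-2", 8}}  // weight 1
	extB := []hostPriority{{"node-1", 10}, {"node-2", 2}} // weight 2
	// node-1: 5*1 + 10*2 = 25, node-2: 8*1 + 2*2 = 12
	fmt.Println(combine([][]hostPriority{extA, extB}, []int{1, 2}))
}
```

These totals are then added onto the scores the default priority functions computed, so an extender with a larger weight pulls the host selection harder.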
4.2 HTTPExtender's Prioritize method
// Prioritize based on extender implemented priority functions. Weight*priority is added
// up for each such priority function. The returned score is added to the score computed
// by Kubernetes scheduler. The total score is used to do the host selection.
func (h *HTTPExtender) Prioritize(pod *v1.Pod, nodes []*v1.Node) (*schedulerapi.HostPriorityList, int, error) {
    var (
        result    schedulerapi.HostPriorityList
        nodeList  *v1.NodeList
        nodeNames *[]string
        args      *schedulerapi.ExtenderArgs
    )
4.2.1 If prioritizeVerb is not set
By analogy: handing in a blank sheet scores 0. No custom prioritize scoring is implemented, and a score of 0 leaves the original scores unaffected.
    if h.prioritizeVerb == "" {
        result := schedulerapi.HostPriorityList{}
        for _, node := range nodes {
            result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: 0})
        }
        return &result, 0, nil
    }
HTTPExtender handles the predicate and priority phases the same way; only the results differ: one returns the filtered nodes, the other returns per-node scores.
Summary:
- Deploy the scheduler with a policy configuration that includes the Extender entries
- Implement the extender service as an HTTP server providing the filter and prioritize endpoints
References:
https://developer.ibm.com/technologies/containers/articles/creating-a-custom-kube-scheduler/