
    kube-scheduler Source Code Analysis (3): Preemption Scheduling

    Introduction to kube-scheduler

    kube-scheduler is one of the core components of Kubernetes. It is responsible for scheduling pod objects: using its scheduling algorithms (the predicates phase followed by the priorities phase), it assigns each unscheduled pod to the most suitable node.

    kube-scheduler architecture diagram

    The overall composition and flow of kube-scheduler are shown in the diagram below. kube-scheduler list/watches pods, nodes and other objects: through informers, unscheduled pods are put into the pending-pod queue and the scheduler cache is built (for fast access to nodes and other objects). The sched.scheduleOne method then carries the core scheduling logic: it takes one pod from the pending queue and runs the predicates and priorities algorithms to select the best node. If every step succeeds, the cache is updated and the bind operation (setting the pod's nodeName field) runs asynchronously; if scheduling fails, the preemption logic is entered. This completes the scheduling of one pod.

    Overview of kube-scheduler preemption scheduling

    The priority and preemption mechanism answers the question of what to do when a pod fails to schedule.

    Normally, when a pod fails to schedule, it is temporarily set aside in the Pending state; only when the pod is updated or the cluster state changes does the scheduler try to schedule it again.

    Sometimes, however, we want to rank pods, that is, give them priorities. When a high-priority Pod fails to schedule, it is not set aside; instead it evicts some lower-priority Pods from a node, guaranteeing that the high-priority Pod is scheduled first.

    For details on pod priority, see: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/pod-priority-preemption/
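    As an illustration, a pod opts into a priority through a PriorityClass; the names and image below are invented for this example:

    ```yaml
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority        # assumed name for this example
    value: 1000000               # larger value = higher priority
    globalDefault: false
    description: "Priority class for latency-critical pods."
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: critical-app         # assumed name for this example
    spec:
      priorityClassName: high-priority
      containers:
      - name: app
        image: nginx             # placeholder image
    ```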

    Preemption is always triggered by a high-priority pod failing to schedule. We call this pod the preemptor, and the evicted pods the victims.

    PDB overview

    PDB is short for PodDisruptionBudget. It can be understood as the Kubernetes object used to guarantee that a minimum number of replicas of a Deployment, StatefulSet or other controller remain available in the cluster.

    For details, see:
    https://kubernetes.io/zh/docs/concepts/workloads/pods/disruptions/
    https://kubernetes.io/zh/docs/tasks/run-application/configure-pdb/
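    For example, a PDB that keeps at least two replicas of a workload available might look like this (the object name and label are invented for the example):

    ```yaml
    apiVersion: policy/v1beta1   # the PDB API version available in v1.17
    kind: PodDisruptionBudget
    metadata:
      name: myapp-pdb            # assumed name
    spec:
      minAvailable: 2
      selector:
        matchLabels:
          app: myapp             # assumed label; must match the workload's pods
    ```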

    Enabling and disabling preemption scheduling

    Preemption scheduling is enabled by default in kube-scheduler.

    Since Kubernetes 1.15, if NonPreemptingPriority is enabled (kube-scheduler startup flag --feature-gates=NonPreemptingPriority=true), a PriorityClass can set preemptionPolicy: Never; pods of that PriorityClass will then not run the preemption logic after a scheduling failure.
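    A minimal sketch of such a non-preempting PriorityClass (the name is assumed for the example):

    ```yaml
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority-nonpreempting   # assumed name
    value: 1000000
    preemptionPolicy: Never   # pods of this class never trigger preemption
    description: "High priority without preempting other pods."
    ```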

    In addition, since Kubernetes 1.11, preemption can also be disabled through the kube-scheduler configuration file (note: it cannot be set via a command-line startup flag).

    apiVersion: kubescheduler.config.k8s.io/v1alpha1
    kind: KubeSchedulerConfiguration
    ...
    disablePreemption: true
    

    The configuration file is specified with the kube-scheduler startup flag --config.

    kube-scheduler startup flags reference: https://kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-scheduler/
    kube-scheduler configuration file reference: https://kubernetes.io/zh/docs/reference/scheduling/config/

    The analysis of the kube-scheduler component is split into three parts:
    (1) kube-scheduler initialization and startup;
    (2) kube-scheduler core scheduling logic;
    (3) kube-scheduler preemption scheduling logic;

    This post analyzes the preemption scheduling logic.

    3. Analysis of kube-scheduler's preemption scheduling logic

    Based on tag v1.17.4

    https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

    Entry point: scheduleOne

    We take the scheduleOne method as the entry point of the preemption analysis, looking only at its preemption-related logic:
    (1) call sched.Algorithm.Schedule to schedule the pod;
    (2) if scheduling fails, check sched.DisablePreemption to see whether preemption has been disabled in kube-scheduler;
    (3) if preemption is enabled, call sched.preempt to run the preemption logic;

    // pkg/scheduler/scheduler.go
    func (sched *Scheduler) scheduleOne(ctx context.Context) {
        ...
        // schedule the pod
        scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)
    	if err != nil {
    		...
    		if fitError, ok := err.(*core.FitError); ok {
    		    // check whether preemption is disabled
    			if sched.DisablePreemption {
    				klog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
    					" No preemption is performed.")
    			} else {
    			// preemption logic
    				preemptionStartTime := time.Now()
    				sched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)
    				...
    			}
    			...
    	}
    	...
    

    sched.preempt

    sched.preempt holds kube-scheduler's preemption handling logic. Main steps:
    (1) call sched.Algorithm.Preempt to simulate the preemption, which returns the node the pod can preempt on, the list of victim pods, and the list of pods whose NominatedNodeName field should be cleared;
    (2) call sched.podPreemptor.setNominatedNodeName to ask the apiserver to record the chosen node name in the pod's NominatedNodeName field; the pod then re-enters the pending-pod queue to wait for another scheduling attempt;
    (3) for each victim pod, ask the apiserver to delete it;
    (4) for each pod whose nomination should be cleared, ask the apiserver to update it, removing its NominatedNodeName field;

    Note: the preemption logic does not immediately schedule the failed pod onto a node. Instead, based on the simulated preemption result, it deletes the victim pods to free the corresponding resources and hands the failed pod to a later scheduling cycle.

    // pkg/scheduler/scheduler.go
    func (sched *Scheduler) preempt(ctx context.Context, state *framework.CycleState, fwk framework.Framework, preemptor *v1.Pod, scheduleErr error) (string, error) {
    	...
        // (1) simulate the preemption: returns the node to preempt on, the victim pods, and the pods whose NominatedNodeName should be cleared
    	node, victims, nominatedPodsToClear, err := sched.Algorithm.Preempt(ctx, state, preemptor, scheduleErr)
    	if err != nil {
    		klog.Errorf("Error preempting victims to make room for %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
    		return "", err
    	}
    	var nodeName = ""
    	if node != nil {
    		nodeName = node.Name
    		
    		sched.SchedulingQueue.UpdateNominatedPodForNode(preemptor, nodeName)
    
    		// (2) ask the apiserver to set the chosen node name in the pod's NominatedNodeName field; the pod then re-enters the pending-pod queue for another scheduling attempt
    		err = sched.podPreemptor.setNominatedNodeName(preemptor, nodeName)
    		if err != nil {
    			klog.Errorf("Error in preemption process. Cannot set 'NominatedPod' on pod %v/%v: %v", preemptor.Namespace, preemptor.Name, err)
    			sched.SchedulingQueue.DeleteNominatedPodIfExists(preemptor)
    			return "", err
    		}
            
            // (3) for each victim pod, ask the apiserver to delete it
    		for _, victim := range victims {
    			if err := sched.podPreemptor.deletePod(victim); err != nil {
    				klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
    				return "", err
    			}
    			// If the victim is a WaitingPod, send a reject message to the PermitPlugin
    			if waitingPod := fwk.GetWaitingPod(victim.UID); waitingPod != nil {
    				waitingPod.Reject("preempted")
    			}
    			sched.Recorder.Eventf(victim, preemptor, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
    
    		}
    		metrics.PreemptionVictims.Observe(float64(len(victims)))
    	}
    	// (4) for each pod whose nomination should be cleared, ask the apiserver to update it, removing its NominatedNodeName field
    	for _, p := range nominatedPodsToClear {
    		rErr := sched.podPreemptor.removeNominatedNodeName(p)
    		if rErr != nil {
    			klog.Errorf("Cannot remove 'NominatedPod' field of pod: %v", rErr)
    			// We do not return as this error is not critical.
    		}
    	}
    	return nodeName, err
    }
    

    sched.Algorithm.Preempt

    The sched.Algorithm.Preempt method simulates the preemption and returns the node the pod can preempt on, the list of victim pods, and the list of pods whose NominatedNodeName field should be cleared. Main logic:
    (1) call nodesWherePreemptionMightHelp to find the nodes that failed the predicates but might become schedulable after some pods are removed;
    (2) list the PodDisruptionBudget objects, used later when selecting the preemptable nodes (see the PDB overview above);
    (3) call g.selectNodesForPreemption to find the preemptable nodes and the minimal set of victims on each;
    (4) run the scheduler extenders (kube-scheduler's webhook extension mechanism) that support preemption, filtering the preemptable node list according to their results;
    (5) call pickOneNodeForPreemption to pick a single node from the preemptable candidates;
    (6) call g.getLowerPriorityNominatedPods to find the pods nominated to the chosen node (non-empty NominatedNodeName) whose priority is lower than the preemptor's;

    // pkg/scheduler/core/generic_scheduler.go
    func (g *genericScheduler) Preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
    	// Scheduler may return various types of errors. Consider preemption only if
    	// the error is of type FitError.
    	fitError, ok := scheduleErr.(*FitError)
    	if !ok || fitError == nil {
    		return nil, nil, nil, nil
    	}
    	if !podEligibleToPreemptOthers(pod, g.nodeInfoSnapshot.NodeInfoMap, g.enableNonPreempting) {
    		klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
    		return nil, nil, nil, nil
    	}
    	if len(g.nodeInfoSnapshot.NodeInfoMap) == 0 {
    		return nil, nil, nil, ErrNoNodesAvailable
    	}
    	// (1) find nodes that failed the predicates but might become schedulable after removing some pods
    	potentialNodes := nodesWherePreemptionMightHelp(g.nodeInfoSnapshot.NodeInfoMap, fitError)
    	if len(potentialNodes) == 0 {
    		klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
    		// In this case, we should clean-up any existing nominated node name of the pod.
    		return nil, nil, []*v1.Pod{pod}, nil
    	}
    	var (
    		pdbs []*policy.PodDisruptionBudget
    		err  error
    	)
    	// (2) list the PodDisruptionBudget objects, used later when selecting preemptable nodes
    	if g.pdbLister != nil {
    		pdbs, err = g.pdbLister.List(labels.Everything())
    		if err != nil {
    			return nil, nil, nil, err
    		}
    	}
    	// (3) find the preemptable nodes and the minimal victim set on each
    	nodeToVictims, err := g.selectNodesForPreemption(ctx, state, pod, potentialNodes, pdbs)
    	if err != nil {
    		return nil, nil, nil, err
    	}
    	
        // (4) run the scheduler extenders (kube-scheduler's webhook extension mechanism) that support preemption, filtering the preemptable node list
    	// We will only check nodeToVictims with extenders that support preemption.
    	// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
    	// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
    	nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
    	if err != nil {
    		return nil, nil, nil, err
    	}
        
        // (5) pick a single node from the preemptable candidates
    	candidateNode := pickOneNodeForPreemption(nodeToVictims)
    	if candidateNode == nil {
    		return nil, nil, nil, nil
    	}
        
        // (6) find lower-priority pods nominated to this node whose nomination should be cleared
    	// Lower priority pods nominated to run on this node, may no longer fit on
    	// this node. So, we should remove their nomination. Removing their
    	// nomination updates these pods and moves them to the active queue. It
    	// lets scheduler find another place for them.
    	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
    	if nodeInfo, ok := g.nodeInfoSnapshot.NodeInfoMap[candidateNode.Name]; ok {
    		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, nil
    	}
    
    	return nil, nil, nil, fmt.Errorf(
    		"preemption failed: the target node %s has been deleted from scheduler cache",
    		candidateNode.Name)
    }
    

    3.1 nodesWherePreemptionMightHelp

    The nodesWherePreemptionMightHelp function returns the nodes that failed the predicates but might satisfy the scheduling conditions after some pods are removed.

    How do we decide that a node which failed the predicates might become schedulable after some pods are removed? The key logic is in the predicates.UnresolvablePredicateExists function.

    // pkg/scheduler/core/generic_scheduler.go
    func nodesWherePreemptionMightHelp(nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, fitErr *FitError) []*v1.Node {
    	potentialNodes := []*v1.Node{}
    	for name, node := range nodeNameToInfo {
    		if fitErr.FilteredNodesStatuses[name].Code() == framework.UnschedulableAndUnresolvable {
    			continue
    		}
    		failedPredicates := fitErr.FailedPredicates[name]
    
    		// If we assume that scheduler looks at all nodes and populates the failedPredicateMap
    		// (which is the case today), the !found case should never happen, but we'd prefer
    		// to rely less on such assumptions in the code when checking does not impose
    		// significant overhead.
    		// Also, we currently assume all failures returned by extender as resolvable.
    		if predicates.UnresolvablePredicateExists(failedPredicates) == nil {
    			klog.V(3).Infof("Node %v is a potential node for preemption.", name)
    			potentialNodes = append(potentialNodes, node.Node())
    		}
    	}
    	return potentialNodes
    }
    

    3.1.1 predicates.UnresolvablePredicateExists

    A node that failed the predicates is a potential preemption candidate only when none of its failure reasons belongs to unresolvablePredicateFailureErrors; only then might removing some pods make the node schedulable.

    unresolvablePredicateFailureErrors covers reasons such as a NodeSelector mismatch, unsatisfied pod affinity rules, untolerated taints, the node being NotReady, the node being under memory pressure, and so on.

    // pkg/scheduler/algorithm/predicates/error.go
    var unresolvablePredicateFailureErrors = map[PredicateFailureReason]struct{}{
    	ErrNodeSelectorNotMatch:      {},
    	ErrPodAffinityRulesNotMatch:  {},
    	ErrPodNotMatchHostName:       {},
    	ErrTaintsTolerationsNotMatch: {},
    	ErrNodeLabelPresenceViolated: {},
    	// Node conditions won't change when scheduler simulates removal of preemption victims.
    	// So, it is pointless to try nodes that have not been able to host the pod due to node
    	// conditions. These include ErrNodeNotReady, ErrNodeUnderPIDPressure, ErrNodeUnderMemoryPressure, ....
    	ErrNodeNotReady:            {},
    	ErrNodeNetworkUnavailable:  {},
    	ErrNodeUnderDiskPressure:   {},
    	ErrNodeUnderPIDPressure:    {},
    	ErrNodeUnderMemoryPressure: {},
    	ErrNodeUnschedulable:       {},
    	ErrNodeUnknownCondition:    {},
    	ErrVolumeZoneConflict:      {},
    	ErrVolumeNodeConflict:      {},
    	ErrVolumeBindConflict:      {},
    }
    
    // UnresolvablePredicateExists checks if there is at least one unresolvable predicate failure reason, if true
    // returns the first one in the list.
    func UnresolvablePredicateExists(reasons []PredicateFailureReason) PredicateFailureReason {
    	for _, r := range reasons {
    		if _, ok := unresolvablePredicateFailureErrors[r]; ok {
    			return r
    		}
    	}
    	return nil
    }
    

    3.2 g.selectNodesForPreemption

    The g.selectNodesForPreemption method finds the nodes that can be preempted and the minimal set of victims on each. Main logic:
    (1) define the checkNode function, which calls g.selectVictimsOnNode to decide whether a node is suitable for preemption, and to compute its minimal victim set and the number of victims whose removal violates a PDB;
    (2) start up to 16 goroutines that run checkNode concurrently, checking every candidate node that failed the predicates;

    // pkg/scheduler/core/generic_scheduler.go
    // selectNodesForPreemption finds all the nodes with possible victims for
    // preemption in parallel.
    func (g *genericScheduler) selectNodesForPreemption(
    	ctx context.Context,
    	state *framework.CycleState,
    	pod *v1.Pod,
    	potentialNodes []*v1.Node,
    	pdbs []*policy.PodDisruptionBudget,
    ) (map[*v1.Node]*extenderv1.Victims, error) {
    	nodeToVictims := map[*v1.Node]*extenderv1.Victims{}
    	var resultLock sync.Mutex
        
        // (1) define the checkNode function
    	// We can use the same metadata producer for all nodes.
    	meta := g.predicateMetaProducer(pod, g.nodeInfoSnapshot)
    	checkNode := func(i int) {
    		nodeName := potentialNodes[i].Name
    		if g.nodeInfoSnapshot.NodeInfoMap[nodeName] == nil {
    			return
    		}
    		nodeInfoCopy := g.nodeInfoSnapshot.NodeInfoMap[nodeName].Clone()
    		var metaCopy predicates.Metadata
    		if meta != nil {
    			metaCopy = meta.ShallowCopy()
    		}
    		stateCopy := state.Clone()
    		stateCopy.Write(migration.PredicatesStateKey, &migration.PredicatesStateData{Reference: metaCopy})
    		// selectVictimsOnNode reports whether this node is suitable for preemption, its minimal victim set, and how many victims violate a PDB
    		pods, numPDBViolations, fits := g.selectVictimsOnNode(ctx, stateCopy, pod, metaCopy, nodeInfoCopy, pdbs)
    		if fits {
    			resultLock.Lock()
    			victims := extenderv1.Victims{
    				Pods:             pods,
    				NumPDBViolations: int64(numPDBViolations),
    			}
    			nodeToVictims[potentialNodes[i]] = &victims
    			resultLock.Unlock()
    		}
    	}
    	// (2) run checkNode concurrently with up to 16 goroutines over the candidate nodes
    	workqueue.ParallelizeUntil(context.TODO(), 16, len(potentialNodes), checkNode)
    	return nodeToVictims, nil
    }
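    The fan-out pattern above, a bounded worker pool writing into a mutex-protected result map, can be sketched with the standard library alone. parallelizeUntil below is a simplified stand-in for workqueue.ParallelizeUntil, not the real implementation, and the node names are invented:

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // parallelizeUntil runs doWork(i) for i in [0, pieces) using at most
    // `workers` goroutines, mimicking workqueue.ParallelizeUntil.
    func parallelizeUntil(workers, pieces int, doWork func(i int)) {
    	toProcess := make(chan int, pieces)
    	for i := 0; i < pieces; i++ {
    		toProcess <- i
    	}
    	close(toProcess)

    	var wg sync.WaitGroup
    	for w := 0; w < workers; w++ {
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			for i := range toProcess {
    				doWork(i)
    			}
    		}()
    	}
    	wg.Wait()
    }

    func main() {
    	nodes := []string{"node-a", "node-b", "node-c", "node-d"} // toy candidate list
    	results := map[string]bool{}                              // stands in for nodeToVictims
    	var resultLock sync.Mutex                                 // same role as resultLock in checkNode

    	parallelizeUntil(16, len(nodes), func(i int) {
    		// pretend every node passes the victim-selection check
    		resultLock.Lock()
    		results[nodes[i]] = true
    		resultLock.Unlock()
    	})
    	fmt.Println(len(results)) // prints: 4
    }
    ```

    As in the real code, the shared result map must be guarded by a mutex because many checkNode invocations run at once.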
    

    3.2.1 g.selectVictimsOnNode

    The g.selectVictimsOnNode method decides whether a node is suitable for preemption, and returns the minimal set of victim pods on that node together with the number of victims whose removal violates a PDB.

    Main logic:
    (1) first, simulate removing every pod on the node whose priority is lower than the preemptor's, then run the predicates; if the preemptor still does not fit, the node is not suitable for preemption and the method returns immediately;
    (2) sort all those lower-priority pods by importance, highest priority first;
    (3) split the sorted list into two lists: pods whose removal would violate a PDB, and the rest;
    (4) walk the PDB-violating list first and try to reprieve (add back) each pod: if the preemptor still fits with the pod restored, the pod is spared; otherwise it is removed again and recorded as a victim;
    (5) walk the non-violating list in the same way, reprieving every pod that can be added back without making the preemptor unschedulable;
    (6) return the victim list, the number of PDB-violating victims, and true: the node is suitable for preemption.

    Note: the removals above are only simulated, to test whether the preemptor would fit; the victim pods are actually deleted later, once the preemption target node has been chosen.

    // pkg/scheduler/core/generic_scheduler.go
    func (g *genericScheduler) selectVictimsOnNode(
    	ctx context.Context,
    	state *framework.CycleState,
    	pod *v1.Pod,
    	meta predicates.Metadata,
    	nodeInfo *schedulernodeinfo.NodeInfo,
    	pdbs []*policy.PodDisruptionBudget,
    ) ([]*v1.Pod, int, bool) {
    	var potentialVictims []*v1.Pod
    
    	removePod := func(rp *v1.Pod) error {
    		if err := nodeInfo.RemovePod(rp); err != nil {
    			return err
    		}
    		if meta != nil {
    			if err := meta.RemovePod(rp, nodeInfo.Node()); err != nil {
    				return err
    			}
    		}
    		status := g.framework.RunPreFilterExtensionRemovePod(ctx, state, pod, rp, nodeInfo)
    		if !status.IsSuccess() {
    			return status.AsError()
    		}
    		return nil
    	}
    	addPod := func(ap *v1.Pod) error {
    		nodeInfo.AddPod(ap)
    		if meta != nil {
    			if err := meta.AddPod(ap, nodeInfo.Node()); err != nil {
    				return err
    			}
    		}
    		status := g.framework.RunPreFilterExtensionAddPod(ctx, state, pod, ap, nodeInfo)
    		if !status.IsSuccess() {
    			return status.AsError()
    		}
    		return nil
    	}
    	// (1) first simulate removing every lower-priority pod and run the predicates; if the preemptor still does not fit, this node is not a candidate
    	// As the first step, remove all the lower priority pods from the node and
    	// check if the given pod can be scheduled.
    	podPriority := podutil.GetPodPriority(pod)
    	for _, p := range nodeInfo.Pods() {
    		if podutil.GetPodPriority(p) < podPriority {
    			potentialVictims = append(potentialVictims, p)
    			if err := removePod(p); err != nil {
    				return nil, 0, false
    			}
    		}
    	}
    	// If the new pod does not fit after removing all the lower priority pods,
    	// we are almost done and this node is not suitable for preemption. The only
    	// condition that we could check is if the "pod" is failing to schedule due to
    	// inter-pod affinity to one or more victims, but we have decided not to
    	// support this case for performance reasons. Having affinity to lower
    	// priority pods is not a recommended configuration anyway.
    	if fits, _, _, err := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false); !fits {
    		if err != nil {
    			klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
    		}
    
    		return nil, 0, false
    	}
    	var victims []*v1.Pod
    	numViolatingVictim := 0
    	// (2) sort the potential victims by importance, highest priority first
    	sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })
    	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
    	// violating victims and then other non-violating ones. In both cases, we start
    	// from the highest priority victims.
    	// (3) split the sorted pods into those whose removal would violate a PDB and the rest
    	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
    	reprievePod := func(p *v1.Pod) (bool, error) {
    		if err := addPod(p); err != nil {
    			return false, err
    		}
    		fits, _, _, _ := g.podFitsOnNode(ctx, state, pod, meta, nodeInfo, false)
    		if !fits {
    			if err := removePod(p); err != nil {
    				return false, err
    			}
    			victims = append(victims, p)
    			klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
    		}
    		return fits, nil
    	}
    	// (4) try to reprieve the PDB-violating pods first: each pod that can be added back while the preemptor still fits is spared; the rest become victims
    	for _, p := range violatingVictims {
    		if fits, err := reprievePod(p); err != nil {
    			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
    			return nil, 0, false
    		} else if !fits {
    			numViolatingVictim++
    		}
    	}
    	// (5) then try to reprieve the non-violating pods in the same way
    	// Now we try to reprieve non-violating victims.
    	for _, p := range nonViolatingVictims {
    		if _, err := reprievePod(p); err != nil {
    			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
    			return nil, 0, false
    		}
    	}
    	// (6) the node is suitable for preemption: return the victims and the number of PDB-violating victims
    	return victims, numViolatingVictim, true
    }
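    To make the remove-then-reprieve idea concrete, here is a toy, self-contained sketch using a single CPU-like resource. The types, names, and numbers are invented for this illustration and are not the scheduler's own:

    ```go
    package main

    import (
    	"fmt"
    	"sort"
    )

    // pod is an invented toy type for this sketch, not the scheduler's v1.Pod.
    type pod struct {
    	name     string
    	priority int32
    	cpu      int64 // toy resource request
    }

    // selectVictims mimics the remove-then-reprieve idea of selectVictimsOnNode:
    // first assume every lower-priority pod is removed; if the preemptor fits,
    // add pods back (highest priority first) while it keeps fitting. Pods that
    // cannot be added back are the victims.
    func selectVictims(preemptorCPU, nodeFreeCPU int64, lowerPriorityPods []pod) ([]string, bool) {
    	// phase 1: simulate removing all lower-priority pods
    	free := nodeFreeCPU
    	for _, p := range lowerPriorityPods {
    		free += p.cpu
    	}
    	if free < preemptorCPU {
    		return nil, false // even removing everything would not help: not a candidate
    	}

    	// phase 2: reprieve as many pods as possible, highest priority first
    	sorted := append([]pod(nil), lowerPriorityPods...)
    	sort.Slice(sorted, func(i, j int) bool { return sorted[i].priority > sorted[j].priority })

    	var victims []string
    	for _, p := range sorted {
    		if free-p.cpu >= preemptorCPU {
    			free -= p.cpu // reprieved: the pod stays on the node
    		} else {
    			victims = append(victims, p.name) // must be evicted
    		}
    	}
    	return victims, true
    }

    func main() {
    	running := []pod{{"a", 10, 2}, {"b", 5, 3}, {"c", 1, 2}}
    	victims, fits := selectVictims(4, 1, running)
    	fmt.Println(fits, victims) // prints: true [b]
    }
    ```

    On the sample data only pod b must be evicted: after b is gone, both a and c fit back alongside the preemptor, which is exactly the "minimal victim set" behavior described above.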
    

    3.3 pickOneNodeForPreemption

    The pickOneNodeForPreemption function picks one node from the preemptable candidates. It applies the following rules in order, stopping as soon as a rule yields a single node:
    (1) a node with no victims wins immediately;
    (2) the node with the fewest PDB-violating victims;
    (3) the node whose highest-priority victim has the lowest priority;
    (4) the node with the smallest sum of victim priorities;
    (5) the node with the fewest victims;
    (6) the node whose earliest-started victim started most recently (its victims have been running for the shortest time);
    (7) if still tied, the first remaining node is returned;

    // pkg/scheduler/core/generic_scheduler.go
    func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*extenderv1.Victims) *v1.Node {
    	if len(nodesToVictims) == 0 {
    		return nil
    	}
    	minNumPDBViolatingPods := int64(math.MaxInt32)
    	var minNodes1 []*v1.Node
    	lenNodes1 := 0
    	for node, victims := range nodesToVictims {
    	    // (1) a node that needs no evictions wins immediately
    		if len(victims.Pods) == 0 {
    			// We found a node that doesn't need any preemption. Return it!
    			// This should happen rarely when one or more pods are terminated between
    			// the time that scheduler tries to schedule the pod and the time that
    			// preemption logic tries to find nodes for preemption.
    			return node
    		}
    		// (2) the node with the fewest PDB-violating victims
    		numPDBViolatingPods := victims.NumPDBViolations
    		if numPDBViolatingPods < minNumPDBViolatingPods {
    			minNumPDBViolatingPods = numPDBViolatingPods
    			minNodes1 = nil
    			lenNodes1 = 0
    		}
    		if numPDBViolatingPods == minNumPDBViolatingPods {
    			minNodes1 = append(minNodes1, node)
    			lenNodes1++
    		}
    	}
    	if lenNodes1 == 1 {
    		return minNodes1[0]
    	}
        
        // (3) the node with the minimum highest-priority victim
    	// There are more than one node with minimum number PDB violating pods. Find
    	// the one with minimum highest priority victim.
    	minHighestPriority := int32(math.MaxInt32)
    	var minNodes2 = make([]*v1.Node, lenNodes1)
    	lenNodes2 := 0
    	for i := 0; i < lenNodes1; i++ {
    		node := minNodes1[i]
    		victims := nodesToVictims[node]
    		// highestPodPriority is the highest priority among the victims on this node.
    		highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
    		if highestPodPriority < minHighestPriority {
    			minHighestPriority = highestPodPriority
    			lenNodes2 = 0
    		}
    		if highestPodPriority == minHighestPriority {
    			minNodes2[lenNodes2] = node
    			lenNodes2++
    		}
    	}
    	if lenNodes2 == 1 {
    		return minNodes2[0]
    	}
        
        // (4) the node with the smallest sum of victim priorities
    	// There are a few nodes with minimum highest priority victim. Find the
    	// smallest sum of priorities.
    	minSumPriorities := int64(math.MaxInt64)
    	lenNodes1 = 0
    	for i := 0; i < lenNodes2; i++ {
    		var sumPriorities int64
    		node := minNodes2[i]
    		for _, pod := range nodesToVictims[node].Pods {
    			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
    			// needed so that a node with a few pods with negative priority is not
    			// picked over a node with a smaller number of pods with the same negative
    			// priority (and similar scenarios).
    			sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
    		}
    		if sumPriorities < minSumPriorities {
    			minSumPriorities = sumPriorities
    			lenNodes1 = 0
    		}
    		if sumPriorities == minSumPriorities {
    			minNodes1[lenNodes1] = node
    			lenNodes1++
    		}
    	}
    	if lenNodes1 == 1 {
    		return minNodes1[0]
    	}
        
        // (5) the node with the fewest victims
    	// There are a few nodes with minimum highest priority victim and sum of priorities.
    	// Find one with the minimum number of pods.
    	minNumPods := math.MaxInt32
    	lenNodes2 = 0
    	for i := 0; i < lenNodes1; i++ {
    		node := minNodes1[i]
    		numPods := len(nodesToVictims[node].Pods)
    		if numPods < minNumPods {
    			minNumPods = numPods
    			lenNodes2 = 0
    		}
    		if numPods == minNumPods {
    			minNodes2[lenNodes2] = node
    			lenNodes2++
    		}
    	}
    	if lenNodes2 == 1 {
    		return minNodes2[0]
    	}
        
        // (6) the node whose victims have been running for the shortest time (latest earliest start time)
    	// There are a few nodes with same number of pods.
    	// Find the node that satisfies latest(earliestStartTime(all highest-priority pods on node))
    	latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
    	if latestStartTime == nil {
    		// If the earliest start time of all pods on the 1st node is nil, just return it,
    		// which is not expected to happen.
    		klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
    		return minNodes2[0]
    	}
    	nodeToReturn := minNodes2[0]
    	for i := 1; i < lenNodes2; i++ {
    		node := minNodes2[i]
    		// Get earliest start time of all pods on the current node.
    		earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
    		if earliestStartTimeOnNode == nil {
    			klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
    			continue
    		}
    		if earliestStartTimeOnNode.After(latestStartTime.Time) {
    			latestStartTime = earliestStartTimeOnNode
    			nodeToReturn = node
    		}
    	}
        
        // (7) return the node selected by the rules above
    	return nodeToReturn
    }
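    As a toy illustration of the first three tie-breaking rules, the cascade can be exercised like this. The nodeVictims type and the node names are invented for the example and are not the scheduler's own map[*v1.Node]*extenderv1.Victims:

    ```go
    package main

    import "fmt"

    // nodeVictims is an invented, simplified stand-in used only for this sketch.
    type nodeVictims struct {
    	name             string
    	numPDBViolations int
    	victimPriorities []int32 // sorted descending: index 0 is the highest-priority victim
    }

    // pickOne applies rules (1)-(3): no victims wins outright, then fewest
    // PDB violations, then the lowest "highest-priority victim".
    func pickOne(candidates []nodeVictims) string {
    	var best *nodeVictims
    	for i := range candidates {
    		c := &candidates[i]
    		if len(c.victimPriorities) == 0 {
    			return c.name // rule (1): no eviction needed
    		}
    		if best == nil ||
    			c.numPDBViolations < best.numPDBViolations || // rule (2)
    			(c.numPDBViolations == best.numPDBViolations &&
    				c.victimPriorities[0] < best.victimPriorities[0]) { // rule (3)
    			best = c
    		}
    	}
    	return best.name
    }

    func main() {
    	fmt.Println(pickOne([]nodeVictims{
    		{"n1", 1, []int32{10}},     // violates one PDB
    		{"n2", 0, []int32{50, 20}}, // no PDB violation, but a priority-50 victim
    		{"n3", 0, []int32{20}},     // no PDB violation, highest victim only 20
    	}))
    	// prints: n3
    }
    ```

    n1 is eliminated by rule (2) despite its low-priority victim, and n3 beats n2 by rule (3); the real function continues through rules (4)-(7) when ties remain.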
    

    Summary


    Flow chart of kube-scheduler's preemption logic

    The flow chart below shows the core processing steps of kube-scheduler's preemption logic. Before the preemption handling starts, the scheduler first checks whether the preemption feature is enabled.

    Posted @ 2022-03-13 15:55 by 良凱爾