Don't be misled by the official documentation! This article explains YARN fair scheduling in detail

This article was published by @edwinhzhang in the Tencent Cloud+ Community column.

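FairScheduler is a commonly used YARN scheduler, but the official documentation alone leaves many parameters and concepts without a detailed explanation, even though these parameters clearly affect whether a cluster runs properly. The main goal of this article is to walk through the code and pin down what the key parameters do. The commonly used parameters from the official documentation are listed below: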

yarn.scheduler.fair.preemption.cluster-utilization-threshold The utilization threshold after which preemption kicks in. The utilization is computed as the maximum ratio of usage to capacity among all resources. Defaults to 0.8f.
yarn.scheduler.fair.update-interval-ms The interval at which to lock the scheduler and recalculate fair shares, recalculate demand, and check whether anything is due for preemption. Defaults to 500 ms.
maxAMShare limit the fraction of the queue’s fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
minSharePreemptionTimeout number of seconds the queue is under its minimum share before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
fairSharePreemptionTimeout number of seconds the queue is under its fair share threshold before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
fairSharePreemptionThreshold If the queue waits fairSharePreemptionTimeout without receiving fairSharePreemptionThreshold*fairShare resources, it is allowed to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.

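In the descriptions above, parameters such as the timeouts are given no default value, and nothing says what happens if they are left unset. Concepts such as minShare and fairShare are not clearly explained either, which easily leaves readers confused. The analysis below explains these parameters and concepts one by one.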

Overall structure of FairScheduler

Figure 1: FairScheduler execution flow

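In the fair scheduler's execution flow, the RM starts two services, FairScheduler and SchedulerEventDispatcher, which are responsible for the update thread and the handle thread respectively.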

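The update thread has two tasks: (1) update each queue's resources (Instantaneous Fair Share), and (2) decide whether each leaf queue needs to preempt resources (if preemption is enabled).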

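The handle thread mainly processes events, such as a node being added to the cluster, an app being added to a queue, an app being removed from a queue, or an app updating its containers.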

FairScheduler class diagram

Figure 2: FairScheduler and related classes

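Queue inheritance module: YARN manages queues in a tree structure. From a resource-management point of view, the root queue at the root of the tree (FSParentQueue), non-root parent nodes (FSParentQueue), leaf nodes (FSLeafQueue), and app tasks (FSAppAttempt, an app as seen by the fair scheduler) are all abstract resources. They all implement the Schedulable interface, so each one is a schedulable resource object. Each has a fair share accessor (the queue's resource amount; the fair share concept shows up here again), a weight, a minShare (minimum resources), a maxShare (maximum resources), a priority, a resourceUsage (resources in use), and a demand (resources requested). Each also implements the preemptContainer method for preempting resources and the assignContainer method (which allocates the AM container for an ACCEPTED app).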

public interface Schedulable {
  /**
   * Name of job/queue, used for debugging as well as for breaking ties in
   * scheduling order deterministically.
   */
  public String getName();

  /**
   * Maximum number of resources required by this Schedulable. This is defined as
   * number of currently utilized resources + number of unlaunched resources (that
   * are either not yet launched or need to be speculated).
   */
  public Resource getDemand();

  /** Get the aggregate amount of resources consumed by the schedulable. */
  public Resource getResourceUsage();

  /** Minimum Resource share assigned to the schedulable. */
  public Resource getMinShare();

  /** Maximum Resource share assigned to the schedulable. */
  public Resource getMaxShare();

  /** Job/queue weight in fair sharing. */
  public ResourceWeights getWeights();

  /** Start time for jobs in FIFO queues; meaningless for QueueSchedulables.*/
  public long getStartTime();

 /** Job priority for jobs in FIFO queues; meaningless for QueueSchedulables. */
  public Priority getPriority();

  /** Refresh the Schedulable's demand and those of its children if any. */
  public void updateDemand();

  /**
   * Assign a container on this node if possible, and return the amount of
   * resources assigned.
   */
  public Resource assignContainer(FSSchedulerNode node);

  /**
   * Preempt a container from this Schedulable if possible.
   */
  public RMContainer preemptContainer();

  /** Get the fair share assigned to this Schedulable. */
  public Resource getFairShare();

  /** Assign a fair share to this Schedulable. */
  public void setFairShare(Resource fairShare);
}

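Queue runtime module: this describes how fair scheduling works from the class-diagram point of view. The SchedulerEventDispatcher class manages the handle thread. The FairScheduler class manages the update thread and obtains information about all queues through QueueManager.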

We start the code analysis with two basic YARN concepts: Instantaneous Fair Share and Steady Fair Share.

Instantaneous Fair Share & Steady Fair Share

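Fair Share, in both cases, is the maximum amount of resources that YARN can allocate to a queue, computed from each queue's weight and its maximum and minimum resources. This article covers fair scheduling, whose default policy, FairSharePolicy, is single-resource: it looks only at memory.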

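Steady Fair Share: the fixed, theoretical amount of memory for each queue. Once the RM has finished its initial work, the Steady Fair Share rarely changes; it is recomputed only when a node is added later (addNode). The RM's initial work is itself the handle thread adding each cluster node to the scheduler (addNode).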

Instantaneous Fair Share: the actual amount of memory for each queue, which changes dynamically. In YARN, "fair share" without further qualification always means the Instantaneous Fair Share.

1 How the Steady Fair Share is computed

Figure 3: Steady Fair Share computation flow

When the handle thread receives a NODE_ADDED event, it calls the addNode method.

  private synchronized void addNode(RMNode node) {
    FSSchedulerNode schedulerNode = new FSSchedulerNode(node, usePortForNodeName);
    nodes.put(node.getNodeID(), schedulerNode);
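    // Add this node's memory to the cluster's total resources
    Resources.addTo(clusterResource, schedulerNode.getTotalResource());
    // Update the available resources
    updateRootQueueMetrics();
    // Update the maximum allocation for a single container, i.e. the MAX shown in the UI (if I remember correctly)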
    updateMaximumAllocation(schedulerNode, true);

    // Set the root queue's steady fair share to the cluster's total resources (clusterResource)
    queueMgr.getRootQueue().setSteadyFairShare(clusterResource);
    // Recompute the steady shares
    queueMgr.getRootQueue().recomputeSteadyShares();
    LOG.info("Added node " + node.getNodeAddress() +
        " cluster capacity: " + clusterResource);
  }

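recomputeSteadyShares walks the queue tree from the root down to compute each queue's memory: it computes the shares of all children of a queue, then recurses into every child that is itself a parent queue, until it reaches the leaf queues.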

 public void recomputeSteadyShares() {
    // Walk the whole queue tree, one level of children at a time
    // For the root queue, getSteadyFairShare() is clusterResource at this point
    policy.computeSteadyShares(childQueues, getSteadyFairShare());
    for (FSQueue childQueue : childQueues) {
      childQueue.getMetrics().setSteadyFairShare(childQueue.getSteadyFairShare());
      if (childQueue instanceof FSParentQueue) {
        ((FSParentQueue) childQueue).recomputeSteadyShares();
      }
    }
  }
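
The shape of this top-down recursion can be illustrated with a minimal, self-contained sketch. It is not the Hadoop code: the Queue class, the queue names, and the weights are hypothetical, and it assumes a purely weight-proportional split with no minResources or maxResources set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FSQueue: only a name, a weight, and children.
final class SteadyShareSketch {
    static final class Queue {
        final String name;
        final double weight;
        final List<Queue> children = new ArrayList<>();
        double steadyShare; // in GB

        Queue(String name, double weight) {
            this.name = name;
            this.weight = weight;
        }

        Queue add(Queue child) {
            children.add(child);
            return this;
        }

        // Split this queue's share among its children by weight, then recurse
        // into each child, mirroring the shape of recomputeSteadyShares().
        void recomputeSteadyShares() {
            double totalWeight = 0;
            for (Queue c : children) totalWeight += c.weight;
            for (Queue c : children) {
                c.steadyShare = steadyShare * c.weight / totalWeight;
                c.recomputeSteadyShares();
            }
        }
    }

    // Hypothetical tree: root -> {a (weight 1), b (weight 1)}, a -> {a1 (3), a2 (1)}.
    static Queue buildExample() {
        Queue a = new Queue("a", 1).add(new Queue("a1", 3)).add(new Queue("a2", 1));
        Queue root = new Queue("root", 1).add(a).add(new Queue("b", 1));
        root.steadyShare = 600; // the root gets the whole cluster, as in addNode
        root.recomputeSteadyShares();
        return root;
    }

    static void print(Queue q, String indent) {
        System.out.println(indent + q.name + " = " + q.steadyShare + " GB");
        for (Queue c : q.children) print(c, indent + "  ");
    }

    public static void main(String[] args) {
        print(buildExample(), "");
    }
}
```

With these assumed weights, a and b each receive 300 GB, and a's 300 GB is split 225 GB to 75 GB between a1 and a2.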

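computeSteadyShares computes the memory each queue should be allocated. Broadly, it allocates by queue weight: queues with larger weights get more resources and queues with smaller weights get fewer. The details are affected by other factors, though, because each queue has two parameters, minResources and maxResources, that bound its resources from below and above. computeSteadyShares ultimately calls computeSharesInternal. Take the following figure as an example: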

The numbers in the figure are weights. Suppose the total is 600G; then parent=300G, leaf1=300G, leaf2=210G, leaf3=70G.

Figure 4: YARN queue weights

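In short, computeSharesInternal uses binary search to find a resource ratio R (weight-to-slots) and uses this R to allocate resources to each queue (in this method the queues have type Schedulable, which again shows that a queue is a resource object). The formula is steadyFairShare = R * QueueWeights.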

computeSharesInternal is shared by the Steady Fair Share and Instantaneous Fair Share computations; the isSteadyShare parameter selects which one is computed.

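The reason for this complexity is that queues are not allocated resources purely by ratio (a purely weight-proportional split requires that neither maxR nor minR is set; maxR defaults to 0x7fffffff and minR defaults to 0). If maxR and minR are set, a queue whose proportional share is below minR must be given at least minR, and a queue whose proportional share is above maxR is capped at maxR. The goal is therefore to find an R (weight-to-slots) that satisfies the following as closely as possible: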

  • R*(Queue1Weights + Queue2Weights+...+QueueNWeights) <=totalResource
  • R*QueueWeights >= minShare
  • R*QueueWeights <= maxShare

Note: QueueNWeights is each queue's own weight; minShare and maxShare are each queue's minResources and maxResources.
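
For a single queue, the two per-queue bounds amount to clamping R * weight between minShare and maxShare. A minimal sketch of that rule follows; the class name, the value of R, and the queue limits are made up for illustration.

```java
final class ClampedShareSketch {
    // share = R * weight, raised to minShare and capped at maxShare.
    static int share(double r, double weight, int minShare, int maxShare) {
        double s = r * weight;
        s = Math.max(s, minShare);
        s = Math.min(s, maxShare);
        return (int) s;
    }

    public static void main(String[] args) {
        double r = 100.0; // hypothetical ratio
        System.out.println(share(r, 1.0, 150, Integer.MAX_VALUE)); // 150: minShare applies
        System.out.println(share(r, 3.0, 0, 200));                 // 200: maxShare applies
        System.out.println(share(r, 2.0, 0, Integer.MAX_VALUE));   // 200: pure ratio
    }
}
```

Because a clamped queue no longer follows the ratio, the sum over all queues is not simply R times the total weight, which is why R has to be searched for rather than computed directly.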

In detail, computeSharesInternal has four steps:

  1. Determine the available resources: totalResources = min(totalResources - takenResources(fixedShare), totalMaxShare)
  2. Determine the upper and lower bounds of R
  3. Approximate R by binary search
  4. Use R to set the fair share

  private static void computeSharesInternal(
      Collection<? extends Schedulable> allSchedulables,
      Resource totalResources, ResourceType type, boolean isSteadyShare) {

    Collection<Schedulable> schedulables = new ArrayList<Schedulable>();
    // Step 1
    // Exclude queues whose share is fixed and cannot change, and get the total fixed memory
    int takenResources = handleFixedFairShares(
        allSchedulables, schedulables, isSteadyShare, type);

    if (schedulables.isEmpty()) {
      return;
    }
    // Find an upper bound on R that we can use in our binary search. We start
    // at R = 1 and double it until we have either used all the resources or we
    // have met all Schedulables' max shares.
    int totalMaxShare = 0;
    // Iterate over schedulables (the non-fixed queues) and sum their maxShare values into totalMaxShare
    for (Schedulable sched : schedulables) {
      int maxShare = getResourceValue(sched.getMaxShare(), type);
      totalMaxShare = (int) Math.min((long)maxShare + (long)totalMaxShare,
          Integer.MAX_VALUE);
      if (totalMaxShare == Integer.MAX_VALUE) {
        break;
      }
    }
    // Subtract the fixed shares from the total resources
    int totalResource = Math.max((getResourceValue(totalResources, type) -
        takenResources), 0);
    // What the queues can receive is limited both by the cluster's total resources and by the sum of the queues' maxResources
    totalResource = Math.min(totalMaxShare, totalResource);
    // Step 2: set the upper and lower bounds of R
    double rMax = 1.0;
    while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
        < totalResource) {
      rMax *= 2.0;
    }

    // Step 3: binary search toward a suitable R
    // Perform the binary search for up to COMPUTE_FAIR_SHARES_ITERATIONS steps
    double left = 0;
    double right = rMax;
    for (int i = 0; i < COMPUTE_FAIR_SHARES_ITERATIONS; i++) {
      double mid = (left + right) / 2.0;
      int plannedResourceUsed = resourceUsedWithWeightToResourceRatio(
          mid, schedulables, type);
      if (plannedResourceUsed == totalResource) {
        right = mid;
        break;
      } else if (plannedResourceUsed < totalResource) {
        left = mid;
      } else {
        right = mid;
      }
    }
    // Step 4: use R to set the fair share of each non-fixed queue, which means only active queues receive resources
    // Set the fair shares based on the value of R we've converged to
    for (Schedulable sched : schedulables) {
      if (isSteadyShare) {
        setResourceValue(computeShare(sched, right, type),
            ((FSQueue) sched).getSteadyFairShare(), type);
      } else {
        setResourceValue(
            computeShare(sched, right, type), sched.getFairShare(), type);
      }
    }
  }

(1) Determine the available resources

handleFixedFairShares adds up the fixed memory (fixedShare) of all fixed queues and excludes those queues from dividing up the remaining resources. YARN decides whether a queue is fixed as follows:

  private static int getFairShareIfFixed(Schedulable sched,
      boolean isSteadyShare, ResourceType type) {

    // If the queue's maxShare <= 0, it is a fixed queue with fixedShare = 0
    if (getResourceValue(sched.getMaxShare(), type) <= 0) {
      return 0;
    }

    // When computing the Instantaneous Fair Share, a queue with no running apps
    // is a fixed queue with fixedShare = 0
    if (!isSteadyShare &&
        (sched instanceof FSQueue) && !((FSQueue)sched).isActive()) {
      return 0;
    }

    // If the queue's weight <= 0, it is a fixed queue:
    // fixedShare = 0 if minShare <= 0, otherwise fixedShare = minShare
    if (sched.getWeights().getWeight(type) <= 0) {
      int minShare = getResourceValue(sched.getMinShare(), type);
      return (minShare <= 0) ? 0 : minShare;
    }

    return -1;
  }
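
To make the three rules easier to try out, here is a minimal sketch that restates them over plain values instead of Schedulable objects. The class name, the method signature, and the sample values are hypothetical; the activity check is shown as a simple boolean, whereas the real code applies it only to FSQueue instances.

```java
final class FixedShareSketch {
    // Returns the fixed share, or -1 if the queue takes part in the
    // weight-based split. Same order of checks as getFairShareIfFixed.
    static int fairShareIfFixed(int maxShare, int minShare, double weight,
                                boolean active, boolean isSteadyShare) {
        if (maxShare <= 0) {
            return 0; // nothing may ever be assigned to this queue
        }
        if (!isSteadyShare && !active) {
            return 0; // idle queue: fixed only for the instantaneous share
        }
        if (weight <= 0) {
            return Math.max(minShare, 0); // zero weight: it gets minShare at most
        }
        return -1; // not fixed
    }

    public static void main(String[] args) {
        // An idle queue is fixed (0) for the instantaneous share...
        System.out.println(fairShareIfFixed(4096, 0, 1.0, false, false));
        // ...but still participates (-1) in the steady share.
        System.out.println(fairShareIfFixed(4096, 0, 1.0, false, true));
        // A zero-weight queue keeps only its minShare.
        System.out.println(fairShareIfFixed(4096, 1024, 0.0, true, true));
    }
}
```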

(2) Determine the upper and lower bounds of R

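The binary search starts with a lower bound of 0 for R. The upper bound is found with resourceUsedWithWeightToResourceRatio: starting from R = 1.0, R is doubled until the resource total W that the method returns is no smaller than the available resources T from step 1 (W >= T).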

  // Compute the total resources the queues would be allocated for a given R
  private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
      Collection<? extends Schedulable> schedulables, ResourceType type) {
    int resourcesTaken = 0;
    for (Schedulable sched : schedulables) {
      int share = computeShare(sched, w2rRatio, type);
      resourcesTaken += share;
    }
    return resourcesTaken;
  }
 private static int computeShare(Schedulable sched, double w2rRatio,
      ResourceType type) {
    // share = R * weight; type is memory
    double share = sched.getWeights().getWeight(type) * w2rRatio;
    share = Math.max(share, getResourceValue(sched.getMinShare(), type));
    share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
    return (int) share;
  }

(3) Approximate R by binary search

The binary search stops as soon as either of the following holds:

  • W == T (W and T from step 2)
  • 25 iterations (COMPUTE_FAIR_SHARES_ITERATIONS) have been performed
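
Steps 2 and 3 can be tried out with a small self-contained sketch. It follows the same doubling and binary-search structure as computeSharesInternal, but it is a simplification: the Q class, the three queues, and their limits are hypothetical, plain int values replace Resource objects, and the fixed-queue handling of step 1 is left out.

```java
import java.util.Arrays;
import java.util.List;

final class FairShareSearchSketch {
    static final int ITERATIONS = 25; // same bound as COMPUTE_FAIR_SHARES_ITERATIONS

    // Hypothetical queue description: weight plus min/max limits.
    static final class Q {
        final double weight;
        final int min;
        final int max;
        Q(double weight, int min, int max) {
            this.weight = weight;
            this.min = min;
            this.max = max;
        }
    }

    // R * weight, clamped to [min, max].
    static int share(Q q, double r) {
        return (int) Math.min(Math.max(q.weight * r, q.min), q.max);
    }

    // W: total that all queues would take for a given R.
    static int used(List<Q> qs, double r) {
        int sum = 0;
        for (Q q : qs) sum += share(q, r);
        return sum;
    }

    static int[] computeShares(List<Q> qs, int total) {
        // T can never exceed the combined maxShare of the queues.
        long totalMax = 0;
        for (Q q : qs) totalMax += q.max;
        total = (int) Math.min(total, totalMax);

        // Step 2: double R until W >= T.
        double rMax = 1.0;
        while (used(qs, rMax) < total) rMax *= 2.0;

        // Step 3: binary search between 0 and rMax.
        double left = 0;
        double right = rMax;
        for (int i = 0; i < ITERATIONS; i++) {
            double mid = (left + right) / 2.0;
            int planned = used(qs, mid);
            if (planned == total) {
                right = mid;
                break;
            } else if (planned < total) {
                left = mid;
            } else {
                right = mid;
            }
        }
        int[] out = new int[qs.size()];
        for (int i = 0; i < out.length; i++) out[i] = share(qs.get(i), right);
        return out;
    }

    // Weights 1:1:2 over 600 units; the second queue has min 300, the third max 200.
    static int[] example() {
        return computeShares(Arrays.asList(
            new Q(1, 0, 1000), new Q(1, 300, 1000), new Q(2, 0, 200)), 600);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(example())); // prints [100, 300, 200]
    }
}
```

A pure 1:1:2 split of 600 would be 150, 150, 300. With the limits in place the search settles on R = 100, so the queues get 100, 300 (its minimum), and 200 (its maximum).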

(4) Use R to set the fair share

When the fair share is set, the code distinguishes between Steady Fair Share and Instantaneous Fair Share.

  for (Schedulable sched : schedulables) {
      if (isSteadyShare) {
        setResourceValue(computeShare(sched, right, type),
            ((FSQueue) sched).getSteadyFairShare(), type);
      } else {
        setResourceValue(
            computeShare(sched, right, type), sched.getFairShare(), type);
      }
    }

2 How the Instantaneous Fair Share is computed

Figure 5: Instantaneous Fair Share computation flow

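This computation follows the same call stack as the steady fair share and also ends up in computeSharesInternal; the only difference is when it runs. The steady fair share is recomputed only on addNode, whereas the Instantaneous Fair Share is refreshed periodically by the update thread.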

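One point worth stressing: as analyzed above, when the Instantaneous Fair Share is being computed and a queue is empty, that queue is a fixed queue, i.e. an inactive queue, and it does not take part in dividing up the cluster's memory when fair shares are computed.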
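
The effect of excluding inactive queues can be seen in a minimal sketch. It assumes three hypothetical queues with equal weights and no minResources or maxResources, so the split is purely proportional; the class and method names are illustrative only.

```java
import java.util.Arrays;

final class ActiveQueueSketch {
    // Weight-proportional split of total. For the instantaneous share
    // (isSteadyShare == false), inactive queues are left out and get 0.
    static double[] shares(double total, double[] weights,
                           boolean[] active, boolean isSteadyShare) {
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            if (isSteadyShare || active[i]) sum += weights[i];
        }
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            if (isSteadyShare || active[i]) out[i] = total * weights[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] weights = {1, 1, 1};
        boolean[] active = {true, true, false}; // the third queue has no running apps
        // Steady share: every queue counts.
        System.out.println(Arrays.toString(shares(600, weights, active, true)));
        // Instantaneous share: only the two active queues split the cluster.
        System.out.println(Arrays.toString(shares(600, weights, active, false)));
    }
}
```

Under these assumptions each queue's steady share is 200, while the instantaneous shares are 300, 300, and 0.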

The update thread's refresh frequency is determined by yarn.scheduler.fair.update-interval-ms.

private class UpdateThread extends Thread {

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          //yarn.scheduler.fair.update-interval-ms
          Thread.sleep(updateInterval);
          long start = getClock().getTime();
          // Update the Instantaneous Fair Share
          update();
          // Preempt resources
          preemptTasksIfNecessary();
          long duration = getClock().getTime() - start;
          fsOpDurations.addUpdateThreadRunDuration(duration);
        } catch (InterruptedException ie) {
          LOG.warn("Update thread interrupted. Exiting.");
          return;
        } catch (Exception e) {
          LOG.error("Exception in fair scheduler UpdateThread", e);
        }
      }
    }
  }

3 What maxAMShare means

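When the handle thread receives a NODE_UPDATE event, if (1) the node has enough memory and (2) there is an application in the ACCEPTED state, a container is allocated for that pending app's AM so that the app starts running in its queue. Before the allocation there is one more check, canRunAppAM, and whether it passes is governed by the maxAMShare parameter.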

  public boolean canRunAppAM(Resource amResource) {
    // Defaults to 0.5f
    float maxAMShare =
        scheduler.getAllocationConfiguration().getQueueMaxAMShare(getName());
    if (Math.abs(maxAMShare - -1.0f) < 0.0001) {
      return true;
    }
    // The queue's maxAMResource = maxAMShare * fair share (Instantaneous Fair Share)
    Resource maxAMResource = Resources.multiply(getFairShare(), maxAMShare);
    // amResourceUsage is the sum of AM resources of the apps already running in this queue
    Resource ifRunAMResource = Resources.add(amResourceUsage, amResource);
    // Check whether ifRunAMResource exceeds maxAMResource
    return !policy
        .checkIfAMResourceUsageOverLimit(ifRunAMResource, maxAMResource);
  }

The code above can be expressed as formulas:

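  • The apps running in the queue are A1 ... An, and each app's AM uses resources R
  • The AM of the ACCEPTED (pending) app has size R1
  • The queue's fair share is QueFS
  • The queue's maxAMResource = maxAMShare * QueFS
  • ifRunAMResource = A1.R + A2.R + ... + An.R + R1
  • If ifRunAMResource > maxAMResource, the queue cannot admit the pending app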

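This parameter deserves attention because many EMR customers using fair queues report that the cluster's total resources are not fully used, yet apps are still queued and not running, as shown below: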

Figure 6: Example of blocked apps

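The default fair-scheduling policy ignores cores and considers only memory. In the figure, 292G of memory is in use and 53.6G is still free, yet apps are blocked. The reason is that the total AM resources of the running apps in the default queue have hit the limit of 345.6 * 0.5, so the pending apps are blocked.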
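
The check can be replayed with the figure's numbers in a minimal memory-only sketch. It assumes, as the arithmetic above implies, that the default queue's fair share is the full 345.6G; the running AM total (172G) and the 2G size of the pending AM are invented for illustration, and the class is not part of Hadoop.

```java
final class AmShareSketch {
    // Memory-only version of the canRunAppAM check; all amounts in GB.
    static boolean canRunAppAm(double fairShareGb, float maxAmShare,
                               double amUsageGb, double newAmGb) {
        if (Math.abs(maxAmShare - -1.0f) < 0.0001) {
            return true; // -1.0f disables the check
        }
        double maxAmResource = fairShareGb * maxAmShare;
        // The app is admitted only if the AM total stays within the limit.
        return amUsageGb + newAmGb <= maxAmResource;
    }

    public static void main(String[] args) {
        // Limit is 345.6 * 0.5 = 172.8 GB; 172 + 2 = 174 GB exceeds it.
        System.out.println(canRunAppAm(345.6, 0.5f, 172.0, 2.0));  // false
        // With the check disabled, the same app is admitted.
        System.out.println(canRunAppAm(345.6, -1.0f, 172.0, 2.0)); // true
    }
}
```

Under these assumptions the pending app stays blocked even though 53.6G of cluster memory is idle, because the limit applies to AM memory rather than to total usage. Raising maxAMShare for the queue, or setting it to -1.0f, lifts the block.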

Summary

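By analyzing how the fair share is computed, we have clarified YARN's basic concepts and some of its parameters. The comparison below also shows that the official documentation's descriptions of these concepts and parameters are fairly hard to follow. The remaining parameters are analyzed in the second article, on preemption in fair scheduling.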

Official description vs. summary

Steady Fair Share
  Official description: The queue’s steady fair share of resources. These shares consider all the queues irrespective of whether they are active (have running applications) or not. These are computed less frequently and change only when the configuration or capacity changes. They are meant to provide visibility into resources the user can expect, and hence displayed in the Web UI.
  Summary: The fixed, theoretical amount of memory for each non-fixed queue. Once the RM has finished its initial work, the Steady Fair Share rarely changes; it is recomputed only when nodes are added or the configuration changes (addNode). The RM's initial work is itself the handle thread adding each cluster node to the scheduler (addNode).

Instantaneous Fair Share
  Official description: The queue’s instantaneous fair share of resources. These shares consider only actives queues (those with running applications), and are used for scheduling decisions. Queues may be allocated resources beyond their shares when other queues aren’t using them. A queue whose resource consumption lies at or below its instantaneous fair share will never have its containers preempted.
  Summary: The actual amount of memory for each non-fixed (active) queue. It changes dynamically, and the update thread refreshes each queue's fair share periodically. In YARN, "fair share" without further qualification always means the Instantaneous Fair Share.

yarn.scheduler.fair.update-interval-ms
  Official description: The interval at which to lock the scheduler and recalculate fair shares, recalculate demand, and check whether anything is due for preemption. Defaults to 500 ms.
  Summary: The interval of the update thread, whose work is (1) updating the fair share and (2) checking whether resources need to be preempted.

maxAMShare
  Official description: limit the fraction of the queue’s fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
  Summary: The total AM resources of all running apps in a queue must not exceed maxAMShare * fair share.
This article is published on the Tencent Cloud+ Community with the author's permission.