[Arm Linux] cpuidle: the menu governor

  • Concepts and ideas behind the menu governor
  • For the menu governor, there are 3 decision factors for picking a C state:
    1. Energy break even point
    2. Performance impact
    3. Latency tolerance (from pmqos infrastructure)
  • These three factors are treated independently.
  • Energy break even point

  • C state entry and exit have an energy cost, and a certain amount of time in the C state is required to actually break even on this cost. CPUIDLE provides us this duration in the “target_residency” field. So all that we need is a good prediction of how long we’ll be idle. Like the traditional menu governor, we start with the actual known “next timer event” time.
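
The break-even rule can be illustrated with a small model. This is only a sketch, not the kernel code: the `CState` class, the `deepest_break_even_state` helper and the residency numbers are hypothetical, and the state table is assumed to be ordered from shallowest to deepest.

```python
from dataclasses import dataclass


@dataclass
class CState:
    name: str
    target_residency_us: int  # idle time needed to break even on entry/exit energy


def deepest_break_even_state(states, predicted_idle_us):
    """Pick the deepest state whose target residency fits the predicted idle time.

    Falls back to the shallowest state when nothing fits.
    """
    chosen = states[0]
    for state in states:
        if state.target_residency_us <= predicted_idle_us:
            chosen = state
    return chosen


# Hypothetical table; the numbers are illustrative only.
STATES = [CState("C1", 1), CState("C2", 100), CState("C3", 1000)]
```

With this table, a predicted idle time of 500 microseconds selects C2, because C3 would not stay idle long enough to pay back its entry and exit cost.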

  • Since there are other sources of wakeups (interrupts, for example) besides the next timer event, this estimate is rather optimistic. To get a more realistic estimate, a correction factor based on historic behavior is applied to it. For example, if in the past the actual idle duration was always 50% of the time until the next timer event, the correction factor will be 0.5.
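
The arithmetic behind the correction factor is simple enough to write down. The function names below are hypothetical, and floating point is used for readability; treat this as a sketch of the idea rather than the governor's implementation.

```python
def observed_ratio(measured_idle_us, next_timer_us):
    """Ratio of the idle time actually observed to the time the timer promised."""
    if next_timer_us <= 0:
        return 1.0  # assumption: no usable timer estimate, so apply no correction
    return measured_idle_us / next_timer_us


def corrected_prediction_us(next_timer_us, correction_factor):
    """Scale the optimistic timer-based estimate by the historic ratio."""
    return next_timer_us * correction_factor
```

For example, if past sleeps lasted 1000 microseconds when the timer was 2000 microseconds away, the ratio is 0.5, and a new 4000 microsecond timer distance yields a 2000 microsecond prediction.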

  • menu uses a running average for this correction factor; however, it keeps a set of factors rather than a single one. This stems from the realization that the ratio depends on the order of magnitude of the expected duration: if we expect 500 milliseconds of idle time, the likelihood of getting an interrupt very early is much higher than if we expect 50 microseconds of idle time. A second, independent influence with a big impact on the actual factor is whether (disk) IO is outstanding or not.

  • (as a special twist, we consider every sleep longer than 50 milliseconds as perfect; there are no power gains for sleeping longer than this)

  • For these two reasons we keep an array of 12 independent factors, indexed by the magnitude of the expected duration as well as by the “is IO outstanding” property.
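
A sketch of how such an array could be indexed and updated follows. The text only fixes the array size (12), the two indexing properties and the 50 millisecond cutoff; the power-of-ten bucket boundaries, the decay constant of 8 and all names are assumptions made for illustration.

```python
BUCKETS = 12                 # from the text: 12 independent factors
DECAY = 8                    # assumption: weight of the running average
MAX_INTERESTING_US = 50_000  # from the text: sleeps of 50 ms or more count as perfect


def which_bucket(expected_us, io_outstanding):
    """Map (magnitude of expected idle time, IO outstanding) to an array index."""
    base = BUCKETS // 2 if io_outstanding else 0
    magnitude = 0
    limit = 10
    # Assumed boundaries: <10 us, <100 us, ..., with the last bucket open-ended.
    while magnitude < BUCKETS // 2 - 1 and expected_us >= limit:
        magnitude += 1
        limit *= 10
    return base + magnitude


def update_factor(factors, bucket, measured_us, next_timer_us):
    """Fold the latest observed ratio into the running average for one bucket."""
    if next_timer_us > 0 and measured_us < MAX_INTERESTING_US:
        ratio = measured_us / next_timer_us
    else:
        ratio = 1.0  # long sleeps are treated as a perfect prediction
    factors[bucket] += (ratio - factors[bucket]) / DECAY
    return factors[bucket]
```

Keeping the IO and non-IO halves separate means that a burst of disk completions does not distort the factor used when the CPU is idle without pending IO.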

  • Repeatable-interval-detector


  • There are some cases where “next timer” is a completely unusable predictor:

  • Those cases where the interval is fixed, for example due to hardware interrupt mitigation, but also due to fixed transfer rate devices such as mice.

  • For this, we use a different predictor: we track the duration of the last 8 intervals, and if the standard deviation of these 8 intervals is below a threshold value, we use the average of these intervals as the prediction.
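
The detector described above can be sketched as follows. The window of 8 intervals comes from the text; the threshold value and the function name are hypothetical, since the text does not say what threshold is used.

```python
import statistics

INTERVALS = 8                # from the text: the last 8 intervals are tracked
STDDEV_THRESHOLD_US = 20.0   # hypothetical threshold, for illustration only


def repeatable_interval_us(last_intervals_us):
    """Return the mean of the last 8 intervals if they are regular, else None."""
    if len(last_intervals_us) < INTERVALS:
        return None  # not enough history yet
    recent = last_intervals_us[-INTERVALS:]
    if statistics.pstdev(recent) < STDDEV_THRESHOLD_US:
        return statistics.fmean(recent)
    return None
```

A device that interrupts every 1000 microseconds produces a history with near-zero deviation, so the function returns roughly 1000; an irregular history returns `None` and the caller would fall back to the timer-based estimate.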

  • Limiting Performance Impact


  • C states, especially those with large exit latencies, can have a noticeable impact on workloads, which is not acceptable for most sysadmins; in addition, lost performance has a power price of its own.

  • As a general rule of thumb, menu assumes that the following heuristic holds:

  • The busier the system, the less impact of C states is acceptable

  • This rule-of-thumb is implemented using a performance-multiplier: If the exit latency times the performance multiplier is longer than the predicted duration, the C state is not considered a candidate for selection due to a too high performance impact. So the higher this multiplier is, the longer we need to be idle to pick a deep C state, and thus the less likely a busy CPU will hit such a deep C state.
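
The filtering rule reduces to a single comparison. A minimal sketch, with a hypothetical function name and the multiplier passed in as a plain number:

```python
def passes_performance_check(exit_latency_us, multiplier, predicted_idle_us):
    """Reject a C state when exit latency times the multiplier exceeds the prediction."""
    return exit_latency_us * multiplier <= predicted_idle_us
```

For a state with a 100 microsecond exit latency and a predicted idle time of 1000 microseconds, a multiplier of 1 keeps the state as a candidate, while a multiplier of 11 rules it out.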

  • Two factors are used in determining this multiplier:

  • a value of 10 is added for each point of “per cpu load average” we have.

  • a value of 5 points is added for each process that is waiting for IO on this CPU.

  • (these values are experimentally determined)

  • The load average factor gives a longer-term (few seconds) input to the decision, while the iowait value gives a CPU-local, instantaneous input.

  • The iowait factor may look low, but realize that this is also already represented in the system load average.
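
Putting the two weights together gives a one-line formula. The weights 10 and 5 are the ones quoted above; the base value of 1 (so that an idle system leaves the exit latency unscaled) and the function name are assumptions of this sketch.

```python
def performance_multiplier(load_average, io_waiters):
    """Combine the per-CPU load average and the number of IO waiters on this CPU."""
    base = 1  # assumption: an idle system applies no extra penalty
    return base + 10 * load_average + 5 * io_waiters
```

With a per-CPU load average of 2 and one task waiting for IO, the multiplier is 26, so a state with a 100 microsecond exit latency would need a predicted idle time of at least 2600 microseconds to stay a candidate.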

