[big/little system scheduler] 5. big.LITTLE Technology

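Overview

This article covers the following:

Why big.LITTLE technology is needed
How a big.LITTLE system is configured
How a big.LITTLE system schedules tasks

Why big.LITTLE technology is needed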

Modern software stacks place conflicting requirements on mobile systems. On the one hand is a demand for very high performance for tasks such as games, while on the other is a continuing requirement to be frugal with energy reserves for low intensity applications like audio playback.

Traditionally, it has not been possible to have a single processor design that can be capable of both high peak performance as well as high energy efficiency. This meant that a lot of energy was wasted because the high performance core would be used for low intensity tasks leading to reduced battery life. Performance would itself be affected by the thermal limits at which the cores could run for sustained periods.

big.LITTLE technology from ARM solves this problem by coupling together energy-efficient LITTLE cores with high-performance big cores. big.LITTLE is an example of a heterogeneous processing system. Such systems typically include several different processor types with different microarchitectures, like general-purpose processors and specialized ASICs.

big.LITTLE takes the heterogeneity one step further in that it includes general-purpose processors that are different in their micro-architecture but compatible in their instruction set architecture. A term that is often used with such systems is Heterogeneous Multiprocessing (HMP) (See Heterogeneous multi-processing on page 14-8). What makes HMP different from Asymmetric Multiprocessing (AMP) (Asymmetric multi-processing on page 14-7) is that all the processors in an HMP system are fully coherent and run the same operating system image.

Software can run on big or the LITTLE processors (or both) depending on performance requirements. When peak performance is required software can be moved to run only on big processors. For normal tasks, software can be run perfectly well on LITTLE processors. Through this combination, big.LITTLE provides a solution capable of delivering the peak performance required by the latest mobile devices, within the thermal bounds of the system, and with maximum energy efficiency.
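In practice the main goal is to save power: music playback, for example, can run on the LITTLE cluster, while gaming, which needs higher performance, needs the big cluster. However, if some threads inside an application alternate between heavy and light load, they may switch frequently between the big and LITTLE clusters. To avoid that switching and preserve cache efficiency, the scheduler has to intervene in their behavior.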

How the success of big.LITTLE technology is ensured

Both types of core in a big.LITTLE system are fully cache coherent and share the same instruction set architecture (ISA). The same application binary runs unmodified on either. Differences in the internal microarchitecture of the processors enable them to provide the different power and performance characteristics that are fundamental to the big.LITTLE concept. These are typically managed by the operating system.

big.LITTLE software models require transparent and efficient transfer of data between big and LITTLE clusters. Hardware coherency enables this, transparently to the software. Coherency between clusters is provided by a cache-coherent interconnect such as the ARM CoreLink CCI-400 described in Chapter 14. Without hardware coherency, the transfer of data between big and LITTLE cores would always occur through main memory but this would be slow and not power efficient. In addition, it would require complex cache management software, to enable data coherency between big and LITTLE clusters.

In addition, such a system also requires a shared interrupt controller, such as the GIC-400, enabling interrupts to be migrated between any cores in the clusters. All cores can signal each other using distributed interrupt controllers such as the CoreLink GIC-400. Task switching is typically handled entirely within the OS scheduler, and invisible to the application software. An example system is shown in Figure 16-1.

A number of big.LITTLE configurations are possible. Figure 16-1 uses Cortex-A57 cores as the big cluster and Cortex-A53 cores as the LITTLE cluster.

The LITTLE cluster is capable of handling most low intensity tasks such as audio playback, web-page scrolling, operating system events, and other always on, always connected tasks. As such, it is likely that the LITTLE cluster is where the software stack remains until intensive tasks such as gaming or video processing are run.

The big cluster can be utilized for heavy workloads such as certain high performance game graphics. Web page rendering is another common example. A coupling of these two cluster types provides opportunities to save energy and satisfy the increasing performance demands of application stacks in mobile devices.

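A key step in making big.LITTLE succeed is guaranteeing cache coherency. Within an ARM cluster such as the ones described above, each core has its own L1 cache and the cores share an L2 cache. Between clusters, data is shared through a cache-coherent interconnect (the CCI-400 in the system above; in newer DynamIQ designs, big and LITTLE cores instead share an L3 cache in the DSU). This coherent connection is what allows a task to migrate between big and LITTLE cores.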

How processes execute on big.LITTLE

There are two primary execution models for big.LITTLE:

Migration

Migration models are a natural extension to power-performance management techniques such as DVFS (see Dynamic voltage and frequency scaling on page 15-6).
The migration model has two types:

Cluster migration.
CPU migration.

A migration action is similar to a DVFS operating point transition. Operating points on the DVFS curve of a core are traversed in response to load variations. When the current core (or cluster) has attained the highest operating point, if the software stack requires more performance, a core (or cluster) migration action is effected. Execution then continues on the other core (or cluster) with the operating points on this core (or cluster) being traversed. When performance is not required, execution can switch back.
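
The idea that migration simply extends the DVFS curve can be sketched as follows. This is a minimal illustration, not ARM's implementation: the operating-point tables and the function names `step_up` and `step_down` are hypothetical, and real frequencies are platform specific.

```python
# Hypothetical operating points (MHz), for illustration only.
LITTLE_OPPS = [400, 800, 1200]
BIG_OPPS = [1000, 1500, 2000]


def step_up(cluster, index):
    """Move one step up the combined curve and return (cluster, index).

    When the LITTLE cluster is already at its highest operating point,
    the next step up is a migration to the big cluster.
    """
    if cluster == "LITTLE":
        if index + 1 < len(LITTLE_OPPS):
            return "LITTLE", index + 1
        return "big", 0  # migration action
    # Already on big: climb until the top operating point is reached.
    return "big", min(index + 1, len(BIG_OPPS) - 1)


def step_down(cluster, index):
    """Move one step down; leave the big cluster once it is at its lowest point."""
    if cluster == "big":
        if index > 0:
            return "big", index - 1
        return "LITTLE", len(LITTLE_OPPS) - 1  # switch back
    return "LITTLE", max(index - 1, 0)
```

Starting from `("LITTLE", 0)` and calling `step_up` repeatedly walks through every LITTLE operating point before the switch to the big cluster happens.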

Global Task Scheduling

In Global Task Scheduling (see Global Task Scheduling on page 16-5), the operating system task scheduler is aware of the differences in compute capacity between big and LITTLE cores. The scheduler tracks the performance requirement for each individual software thread, and uses that information to decide which type of core to use for each. Unused cores can be powered off. This approach has a number of advantages over the migration models.

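In short, there are two models:

Migration: migration between cores and migration between clusters.
Global scheduling: a task can be placed on any core in the big.LITTLE system according to certain criteria.

Each is described in detail below.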

cluster migration

Only one cluster, either big or LITTLE, is active at any one time, except very briefly during a cluster context switch to the other cluster. To achieve the best power and performance efficiency, the software stack runs mostly on the energy-efficient LITTLE cluster and only runs for short time periods on the big cluster. This model requires the same number of cores in both clusters.

This model does not cope well with unbalanced software workloads, that is, workloads that place significantly different loads on cores within a cluster. In such situations, cluster migration results in a complete switch to the big cluster even though not all the cores need that level of performance. For this reason cluster migration is less popular than other methods.
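In short, only one cluster runs at a time. If tasks that were running on the LITTLE cluster start to demand high performance and need the big cluster, all tasks on the LITTLE cluster are migrated to the big cluster. This model is essentially no longer used.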
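
A small sketch makes the weakness with unbalanced workloads concrete. It assumes, purely for illustration, that the switch is triggered when any core's load exceeds a threshold; the names `cluster_for` and `overprovisioned_cores` and the value 0.8 are hypothetical.

```python
def cluster_for(per_core_load, up_threshold=0.8):
    """Cluster migration: one decision for the whole cluster.

    Assumption: the busiest core alone decides whether the big cluster
    is needed.
    """
    return "big" if max(per_core_load) > up_threshold else "LITTLE"


def overprovisioned_cores(per_core_load, up_threshold=0.8):
    """Count cores moved to the big cluster although their own load is low."""
    if cluster_for(per_core_load, up_threshold) != "big":
        return 0
    return sum(1 for load in per_core_load if load <= up_threshold)
```

With loads of `[0.9, 0.1, 0.1, 0.1]`, one busy core forces all four onto the big cluster, so three cores run on big hardware they do not need.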

CPU migration

In this model, each big core is paired with a LITTLE core. Only one core in each pair is active at any one time, with the inactive core being powered down. The active core in the pair is chosen according to current load conditions. Using the example in Figure 16-2 on page 16-5, the operating system sees four logical cores. Each logical core can physically be a big or LITTLE core. This choice is driven by Dynamic Voltage and Frequency Scaling (DVFS). This model requires the same number of cores in both the clusters.

The system actively monitors the load on each core. High load causes the execution context to be moved to the big core, and conversely, when the load is low, the execution is moved to the LITTLE core. Only one core in the pairing can be active at any time. When the load is moved from an outbound core (the core the load leaves) to an inbound core (the core it arrives at), the former is switched off. This model allows a mix of big and LITTLE cores to be active at any one time.
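
The per-pair decision can be sketched like this. It is a simplification: the text says the choice is driven by DVFS, whereas this sketch uses a single hypothetical load threshold, and `select_active_cores` is an illustrative name.

```python
def select_active_cores(pair_loads, up_threshold=0.8):
    """For each big/LITTLE pair, pick the one physical core that is active.

    pair_loads holds one load value per logical core (that is, per pair).
    The core not chosen in each pair is assumed to be powered down.
    """
    return ["big" if load > up_threshold else "LITTLE" for load in pair_loads]
```

For four logical cores with loads `[0.9, 0.1, 0.2, 0.95]`, the result is a mix of big and LITTLE cores, which is exactly what cluster migration cannot provide.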

Global Task Scheduling

Through the development of big.LITTLE technology, ARM has evolved the software models starting with various migration models through to Global Task Scheduling (GTS) that forms the basis for all future development in big.LITTLE technology. The ARM implementation of GTS is called big.LITTLE Multiprocessing (MP).

In this model the operating system task scheduler is aware of the differences in compute capacity between big and LITTLE cores. Using statistical data, the scheduler tracks the performance requirement for each individual software thread, and uses that information to decide which type of core to use for each. This model can work on a big.LITTLE system with any number of cores in any cluster. This is shown in Figure 16-3 on page 16-5. This approach has a number of advantages over the migration models, such as:

The system can have different numbers of big and LITTLE cores.
Unlike the migration model, any number of cores can be active at any one time. This can increase the maximum compute capacity available if peak performance is required.
It is possible to isolate the big cluster for the exclusive use of intensive threads, while light threads run on the LITTLE cluster. This enables heavy compute tasks to complete faster, as there are no additional background threads.
It is possible to target interrupts individually to big or LITTLE cores.

big.LITTLE MP

The following explains in detail how the global scheduler works.
For big.LITTLE MP on the Linux kernel the fundamental requirement is for the scheduler to decide when a software thread can run on a LITTLE core or a big core. The scheduler does this by comparing the tracked load of software threads against tunable load thresholds, an up migration threshold and a down migration threshold as shown in Figure 16-4.

When the tracked load average of a thread currently allocated to a LITTLE core exceeds the up migration threshold, the thread is considered eligible for migration to a big core. Conversely, when the load average of a thread that is currently allocated to a big core drops below the down migration threshold, it is considered eligible for migration to a LITTLE core. In big.LITTLE MP, these basic rules govern task migration between big and LITTLE cores. Within the clusters, standard Linux scheduler load balancing applies. This tries to keep the load balanced across all the cores in one cluster.
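
The two-threshold rule can be sketched as below. The threshold values and the name `next_core_type` are hypothetical; the real thresholds are tunable, and the load is assumed here to be normalized to the range 0 to 1.

```python
UP_THRESHOLD = 0.7    # hypothetical up migration threshold
DOWN_THRESHOLD = 0.3  # hypothetical down migration threshold


def next_core_type(current, tracked_load,
                   up=UP_THRESHOLD, down=DOWN_THRESHOLD):
    """Return the core type a thread is eligible for.

    A thread on a LITTLE core moves up only above the up threshold, and a
    thread on a big core moves down only below the down threshold. Loads
    between the two thresholds leave the thread where it is.
    """
    if current == "LITTLE" and tracked_load > up:
        return "big"
    if current == "big" and tracked_load < down:
        return "LITTLE"
    return current
```

The gap between the two thresholds acts as hysteresis: a thread whose load hovers around 0.5 stays on whichever core type it is already using instead of bouncing between clusters.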

The model is refined by adjusting the tracked load metric based on the current frequency of a core. A task that is running when the core is running at half speed, accrues tracked load at half the rate that it would if the core was running at full speed. This enables big.LITTLE MP and DVFS management to work together in harmony.
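
The frequency adjustment amounts to scaling run time by the ratio of current to maximum frequency. A minimal sketch, with a hypothetical function name and units:

```python
def accrued_load(run_time_ms, cur_freq_khz, max_freq_khz):
    """Tracked load accrued over run_time_ms, scaled by core frequency.

    At full speed the whole run time counts; at half speed only half of
    it does, matching the rule described above.
    """
    if not 0 < cur_freq_khz <= max_freq_khz:
        raise ValueError("current frequency must be in (0, max_freq_khz]")
    return run_time_ms * cur_freq_khz / max_freq_khz
```

For example, `accrued_load(10, 500_000, 1_000_000)` gives `5.0`, while the same 10 ms at full speed gives `10.0`. A thread that looks busy only because its core is clocked low is therefore not mistaken for a thread that needs a big core.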

big.LITTLE MP uses several mechanisms to determine when to migrate a task between big and LITTLE cores:

fork migration: how a newly created thread selects a suitable CPU to run on.
This operates when the fork system call is used to create a new software thread. At this point, clearly no historical load information is available. The system defaults to a big core for new threads on the assumption that a light thread migrates quickly down to a LITTLE core as a result of Wake migration.
Fork migration benefits demanding tasks without being expensive. Threads that are low intensity and persistent, such as Android system services, are only moved to big cores at creation time, quickly moving to more suitable LITTLE cores thereafter. Threads that are clearly demanding throughout, are not penalized by being made to launch on LITTLE cores first. Threads that run occasionally, but tend to require performance, benefit from being launched on the big cluster and continuing to run there as required.
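wake migration: after running for a while, a thread goes to sleep and waits. When it is woken, it can again be placed on a suitable CPU, rather than simply on the CPU it ran on before.
forced migration: when a thread currently running on a LITTLE core has become a big task, it is forcibly migrated to the big cluster.
idle pull migration: when a CPU goes idle, it actively pulls a task to run on itself in order to keep the load balanced.
offload migration: this migration operates with normal load balancing disabled.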
Offload migration requires that normal scheduler load balancing be disabled. The downside of this is that long-running threads can concentrate on the big cores, leaving the LITTLE cores idle and under-utilized. Overall system performance, in this situation, can clearly be improved by utilizing all the cores.
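Offload migration works to periodically migrate threads downwards to LITTLE cores to make use of unused compute capacity. Threads that are migrated downwards in this way remain candidates for up migration if they exceed the threshold at the next scheduling opportunity. In other words, disabling load balancing keeps long-running threads on the big cores, so performance is sustained, but it leaves the LITTLE cluster without work. The remedy is a periodic check that migrates some lightly loaded tasks to the LITTLE cluster; when they exceed the up migration threshold, they are migrated back to the big cluster.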
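
Two of the mechanisms in the list above, fork migration and offload migration, can be sketched together. This is an illustration under stated assumptions, not the kernel's code: the function names are hypothetical, and choosing the lightest threads first for offload is an assumption, since the text only says that threads are periodically migrated downwards.

```python
def place_on_fork():
    """Fork migration: a new thread has no load history, so start it on big."""
    return "big"


def offload_candidates(big_threads, idle_little_cores):
    """Pick threads to move from big cores down to idle LITTLE cores.

    big_threads is a list of (name, tracked_load) tuples. At most one
    thread is chosen per idle LITTLE core, lightest first (an assumption).
    """
    lightest_first = sorted(big_threads, key=lambda thread: thread[1])
    return [name for name, _load in lightest_first[:idle_little_cores]]
```

Given `[("render", 0.9), ("logger", 0.2), ("sync", 0.4)]` and two idle LITTLE cores, the sketch selects `logger` and `sync`, leaving the heavy `render` thread on the big cluster. Any thread moved down this way stays eligible for up migration once its load crosses the threshold again.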

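This material gave me an understanding of how scheduling on big.LITTLE has evolved over time. For the details of the scheduler itself, see the scheduler series: linux kernel cfs scheduler

Source: official ARM documentation.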
