Why Memory Barriers, Translation and Commentary (Part 2)

Reposted from: http://www.wowotech.net/kernel_synchronization/why-memory-barrier-2.html

 

In the previous "Why memory barriers" document, several sections were left untranslated for various reasons. To be honest, all those reasons boil down to one: I had not yet understood what those sections were trying to say. Of course, for a Linuxer who truly loves to dig into things, the untranslated sections remained a thorn in the side. Finally, one dark and windy night, I sent an email to Paul, the author of perfbook, asking for a few pointers. As is so often the case, the more accomplished someone is, the more approachable they are: he replied quickly with his advice, the gist of which was not to get bogged down in the specific details but to stay focused on a few basic rules. Encouraged by this, I pushed on and translated the remaining sections, which became this article.

Note: I had no appetite for translating some of the CPU-architecture-specific sections; they remain TODO for now.

 

VII. Example Memory-Barrier Sequences

This section presents some seductive but subtly broken uses of memory barriers. Although many of them will work most of the time, and some will work all the time on some specific CPUs, these uses must be avoided if the goal is to produce code that works reliably on all CPUs. To help us better see the subtle breakage, we first need to focus on an ordering-hostile architecture.

This section analyzes a few seductive but subtly broken uses of memory barriers. Although this code works most of the time, and some of it works all of the time on certain CPUs, if our goal is the stars and the sea (that is, platform-independent code that runs correctly on every CPU), it is worth seeing how these subtle failures arise. We first look at the notion of an ordering-hostile architecture.

1. Ordering-Hostile Architecture

A number of ordering-hostile computer systems have been produced over the decades, but the nature of the hostility has always been extremely subtle, and understanding it has required detailed knowledge of the specific hardware. Rather than picking on a specific hardware vendor, and as a presumably attractive alternative to dragging the reader through detailed technical specifications, let us instead design a mythical but maximally memory-ordering-hostile computer architecture.

Over the past decades a good number of ordering-hostile computer systems have been produced. For such systems the nature of the hostility is extremely subtle, and understanding it requires detailed knowledge of the hardware. Rather than picking on a specific hardware vendor and dragging the reader through its technical details, let us do something more interesting: design our own CPU architecture with the weakest possible memory-ordering constraints.

This hardware must obey the following ordering constraints [McK05a, McK05b]:

  1. Each CPU will always perceive its own memory accesses as occurring in program order.
  2. CPUs will reorder a given operation with a store only if the two operations are referencing different locations.
  3. All of a given CPU’s loads preceding a read memory barrier (smp_rmb()) will be perceived by all CPUs to precede any loads following that read memory barrier.
  4. All of a given CPU’s stores preceding a write memory barrier (smp_wmb()) will be perceived by all CPUs to precede any stores following that write memory barrier.
  5. All of a given CPU’s accesses (loads and stores) preceding a full memory barrier (smp_mb()) will be perceived by all CPUs to precede any accesses following that memory barrier.

Such hardware must obey the following rules:

(1) Each CPU, from its own point of view, always perceives its memory accesses as occurring in program order.

(2) A CPU may reorder a given operation (load or store) with a store only under one condition: the two operations must reference different memory locations, with no overlap between their addresses.

(3) Suppose the program order on CPU 0 is: load A, then load B. From CPU 0's own point of view, the memory system of course changes in that order (see rule 1). But how do the other CPUs in the system see CPU 0's accesses (imagine every CPU as an observer sitting on the bus, continuously watching memory change)? From the previous article we know that the order of load A and load B, as seen by those observers, cannot be guaranteed. Adding a read memory barrier changes that: if CPU 0 executes load A, then a read memory barrier, then load B, every CPU on the bus, including CPU 0 itself, perceives the memory operations in the order load A, then load B.

(4) Again imagine all the CPUs in the system as observers lurking on the bus, continuously watching memory change. Any store executed by a given CPU can be perceived by every CPU in the system, including itself. If the code executed by that CPU is split in two by a write memory barrier, stores before the wmb and stores after it, then every CPU in the system observes the same thing: the stores before the wmb complete first, and only then are the stores after the wmb performed.

(5) All memory accesses (loads and stores) preceding a full memory barrier executed by a given CPU are perceived by every CPU in the system before any of the accesses (loads or stores) following that barrier. (A small code sketch of how rules (3) and (4) pair up follows right after this list.)
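
To make rules (3) and (4) concrete, here is a minimal kernel-style sketch of the message-passing pattern they enable. The names data, flag, producer() and consumer() are invented for illustration; a real implementation would also use READ_ONCE()/WRITE_ONCE() to keep the compiler honest.

int data;   /* payload, initially 0          */
int flag;   /* publication flag, initially 0 */

void producer(void)                 /* runs on CPU 0 */
{
        data = 42;
        smp_wmb();                  /* rule (4): every CPU sees the store to
                                       data before the store to flag below  */
        flag = 1;
}

void consumer(void)                 /* runs on CPU 1 */
{
        while (flag == 0)
                ;                   /* wait for the producer to publish     */
        smp_rmb();                  /* rule (3): the load of data below is
                                       ordered after the load of flag above */
        BUG_ON(data != 42);         /* cannot fire, given the barrier pair  */
}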

Imagine a large non-uniform cache architecture (NUCA) system that, in order to provide fair allocation of interconnect bandwidth to CPUs in a given node, provided per-CPU queues in each node’s interconnect interface, as shown in Figure C.8. Although a given CPU’s accesses are ordered as specified by memory barriers executed by that CPU, the relative order of a given pair of CPUs’ accesses could be severely reordered, as we will see.

An example of such an ordering-hostile architecture is shown in the figure below:

[Figure C.8: example ordering-hostile NUCA architecture, with per-CPU message queues in each node's interconnect interface]

This system is called a non-uniform cache architecture (NUCA). To allocate interconnect bandwidth fairly among the CPUs in a node, per-CPU message queues are placed at the interface between each node and the interconnect. Although a given CPU's own accesses can be ordered by the memory barriers it executes, the relative order of a given pair of CPUs' accesses can be severely scrambled, as described in detail below.

2. Example 1

Table C.2 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Each of “a”, “b”, and “c” are initially zero.

The table below shows the code fragments executed by CPU 0, CPU 1, and CPU 2; the variables a, b, and c are all initially zero.

CPU 0                        CPU 1                        CPU 2
a = 1;                       while (b == 0);              z = c;
smp_wmb();                   c = 1;                       smp_rmb();
b = 1;                                                    x = a;
                                                          assert(z == 0 || x == 1);

Suppose CPU 0 recently experienced many cache misses, so that its message queue is full, but that CPU 1 has been running exclusively within the cache, so that its message queue is empty. Then CPU 0’s assignment to “a” and “b” will appear in Node 0’s cache immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior traffic. In contrast, CPU 1’s assignment to “c” will sail through CPU 1’s previously empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “c” before it sees CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.

Suppose CPU 0 has recently suffered so many cache misses that its message queue is full. CPU 1 has been luckier: it has been running entirely within its own cache, needs to exchange no messages with node 1, and so its message queue is empty. CPU 0's assignments to a and b then show up in node 0's cache immediately (and are thus visible to CPU 1), but node 1 knows nothing about them, because the corresponding messages are stuck behind CPU 0's earlier traffic in its message queue. CPU 1's assignment to c, in contrast, sails through its empty queue and reaches node 1. The net effect is that, from CPU 2's point of view, it observes CPU 1's c = 1 before it observes CPU 0's assignment to a, and the assertion can fail despite the memory barriers: CPU 2 can find z equal to 1 and yet x equal to 0.

 Therefore, portable code cannot rely on this assertion not firing, as both the compiler and the CPU can reorder the code so as to trip the assertion.

Therefore, portable code cannot rely on this assertion never firing: both the compiler and the CPU can reorder the code so as to trip it.

3. Example 2

Table C.3 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. Both “a” and “b” are initially zero.

The table below shows the code fragments executed by CPU 0, CPU 1, and CPU 2; the variables a and b are both initially zero.

CPU 0                        CPU 1                        CPU 2
a = 1;                       while (a == 0);              y = b;
                             smp_mb();                    smp_rmb();
                             b = 1;                       x = a;
                                                          assert(y == 0 || x == 1);

 

Again, suppose CPU 0 recently experienced many cache misses, so that its message queue is full, but that CPU 1 has been running exclusively within the cache, so that its message queue is empty. Then CPU 0’s assignment to “a” will appear in Node 0’s cache immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior traffic. In contrast, CPU 1’s assignment to “b” will sail through CPU 1’s previously empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “b” before it sees CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.

Again, suppose CPU 0 has recently suffered so many cache misses that its message queue is full, while CPU 1 has been running entirely out of its cache and its message queue is empty. As before, CPU 0's assignment to a is immediately visible to CPU 1 on the same node, but CPUs outside node 0 cannot see it yet, because CPU 0's earlier protocol traffic has clogged its message queue. CPU 1's assignment to b, by contrast, passes through its empty queue and reaches the other node without delay; the boat has already sailed past ten thousand mountains. When CPU 2 reaches the assertion, it may therefore see y equal to 1 and x equal to 0, and the assertion fails.

In theory, portable code should not rely on this example code fragment, however, as before, in practice it actually does work on most mainstream computer systems.

In theory, then, the code above is not portable, even though in practice it works correctly on most mainstream computer systems.

4. Example 3

Table C.4 shows three code fragments, executed concurrently by CPUs 0, 1, and 2. All variables are initially zero.

The table below shows the code fragments executed by CPU 0, CPU 1, and CPU 2; all variables are initially zero.

[Table C.4: code fragments for Example 3 (reproduced as an image in the original post)]

Note that neither CPU 1 nor CPU 2 can proceed to line 5 until they see CPU 0’s assignment to “b” on line 3. Once CPU 1 and 2 have executed their memory barriers on line 4, they are both guaranteed to see all assignments by CPU 0 preceding its memory barrier on line 2. Similarly, CPU 0’s memory barrier on line 8 pairs with those of CPUs 1 and 2 on line 4, so that CPU 0 will not execute the assignment to “e” on line 9 until after its assignment to “a” is visible to both of the other CPUs. Therefore, CPU 2’s assertion on line 9 is guaranteed not to fire.

Neither CPU 1 nor CPU 2 can get past its smp_mb() until it observes CPU 0's assignment to b on line 3. Once CPU 1 and CPU 2 have executed their memory barriers on line 4, both are guaranteed to see everything CPU 0 did before its smp_wmb() on line 2. Similarly, CPU 0's smp_mb() on line 8 pairs with the smp_mb() on line 4 of CPUs 1 and 2, so CPU 0 will not perform the assignment to e on line 9 until its assignment to a (line 1) is visible to both of the other CPUs. CPU 2's assertion on line 9 is therefore guaranteed not to fire.
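
Since the table itself appears only as an image in the original post, here is an approximate reconstruction pieced together from the description above; the exact layout of Table C.4 may differ slightly, so treat this as a sketch rather than a faithful copy.

   CPU 0                      CPU 1                      CPU 2
1  a = 1;
2  smp_wmb();
3  b = 1;                     while (b == 0);            while (b == 0);
4                             smp_mb();                  smp_mb();
5                             c = 1;                     d = 1;
6  while (c == 0);
7  while (d == 0);
8  smp_mb();
9  e = 1;                                                assert(e == 0 || a == 1);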

 

VIII. Memory-Barrier Instructions For Specific CPUs

Each CPU has its own peculiar memory-barrier instructions, which can make portability a challenge, as indicated by Table C.5. In fact, many software environments, including pthreads and Java, simply prohibit direct use of memory barriers, restricting the programmer to mutual-exclusion primitives that incorporate them to the extent that they are required. In the table, the first four columns indicate whether a given CPU allows the four possible combinations of loads and stores to be reordered. The next two columns indicate whether a given CPU allows loads and stores to be reordered with atomic instructions.

The seventh column, data-dependent reads reordered, requires some explanation, which is undertaken in the following section covering Alpha CPUs. The short version is that Alpha requires memory barriers for readers as well as updaters of linked data structures. Yes, this does mean that Alpha can in effect fetch the data pointed to before it fetches the pointer itself, strange but true. Please see: http://www.openvms.compaq.com/wizard/wiz_2637.html if you think that I am just making this up. The benefit of this extremely weak memory model is that Alpha can use simpler cache hardware, which in turn permitted higher clock frequency in Alpha’s heyday.

The last column indicates whether a given CPU has an incoherent instruction cache and pipeline. Such CPUs require special instructions be executed for self-modifying code.

Parenthesized CPU names indicate modes that are architecturally allowed, but rarely used in practice.

Each CPU has its own peculiar memory-barrier instructions, which makes writing portable code a challenge. The details are summarized in the table below:

[Table C.5: summary of memory ordering on various CPUs (reproduced as an image in the original post)]

In fact, many software environments, pthreads and Java among them, simply and bluntly prohibit the direct use of memory barriers: where ordering is needed, the programmer must rely on mutual-exclusion primitives, which incorporate whatever memory barriers are required. The table above describes how each CPU behaves with respect to memory reordering; the first column identifies the CPU, and the remaining columns cover the various reordering cases, namely:

(1) The four combinations of loads and stores. A RISC machine has only two kinds of memory-access instruction, load and store, so simple combinatorics gives four reordering cases: loads reordered after loads, loads reordered after stores, stores reordered after stores, and stores reordered after loads.

(2) Whether memory accesses can be reordered with atomic instructions. There are two such cases: whether stores may be reordered with atomic instructions, and whether loads may be.

(3) Whether data-dependent reads can be reordered. The vast majority of CPUs do not do this, with the exception of Alpha. First, what is a "data-dependent read"? Consider the table below:

CPU 0                                               CPU 1
assign values to the members of a structure S
set the pointer variable ps to the address of S     load ps
                                                    load the members of S

From CPU 1's point of view, loading the pointer variable ps and loading the members of S are two dependent loads: without first loading ps there is simply no way to reach the members of S. This is the legendary "data-dependent read". The two writes on CPU 0 target different memory locations and carry no such dependency.

To constrain the memory order properly, we need to add some memory-barrier code:

CPU 0                                               CPU 1
assign values to the members of a structure S
write memory barrier
set the pointer variable ps to the address of S     load ps
                                                    read memory barrier
                                                    load the members of S

CPU 1's load of ps sees either the old pointer or the new one. If it sees the new pointer, then CPU 0's stores preceding its write memory barrier have already completed, and under that condition the loads after CPU 1's rmb are guaranteed to see the new values that CPU 0 wrote to the members of S before its wmb. Because of the data dependency, on most CPUs the rmb on the reading side (CPU 1) can actually be dropped (a one-sided barrier): the hardware itself keeps dependent loads in order. Unfortunately, on Alpha the barriers must be used in pairs, and the reading side needs a barrier that orders dependent reads. If you think I am making this up, go to http://www.openvms.compaq.com/wizard/wiz_2637.html and read the Alpha material for yourself. The payoff for such an extremely weak memory model is that Alpha could use simpler cache hardware, which in turn allowed higher clock frequencies in Alpha's heyday.

(4) The last column indicates whether the CPU's instruction cache and pipeline are incoherent with respect to its data accesses. Such CPUs require special instructions to be executed for self-modifying code.

CPU names enclosed in parentheses denote modes that are architecturally allowed but rarely used in practice.

The common “just say no” approach to memory barriers can be eminently reasonable where it applies, but there are environments, such as the Linux kernel, where direct use of memory barriers is required. Therefore, Linux provides a carefully chosen least-common-denominator set of memory-barrier primitives, which are as follows:

In some situations, just saying "no" to memory barriers is entirely reasonable; they are, after all, fairly hard to understand. But in places such as the Linux kernel there is a genuine need to use memory barriers directly, and "just say no" is not an option. Since the kernel must support the memory-barrier operations of many CPU architectures, each with its own design, Linux provides a carefully chosen least-common-denominator set of memory-barrier primitives, every one of which can be implemented on every architecture. They are:

(1) smp_mb(): “memory barrier” that orders both loads and stores. This means that loads and stores preceding the memory barrier will be committed to memory before any loads and stores following the memory barrier.

(2) smp_rmb(): “read memory barrier” that orders only loads.

(3) smp_wmb(): “write memory barrier” that orders only stores.

(4) smp_read_barrier_depends(): forces subsequent operations that depend on prior operations to be ordered, for example when the earlier load fetches the very address that the later access will reference. This primitive is a no-op on all platforms except Alpha.

(5) mmiowb(): forces ordering on MMIO writes that are guarded by global spinlocks. This primitive is a no-op on all platforms on which the memory barriers in spinlocks already enforce MMIO ordering; the platforms with a non-no-op mmiowb() definition include some (but not all) IA64, FRV, MIPS, and SH systems. This primitive is relatively new, so relatively few drivers take advantage of it. A typical usage pattern is sketched just below.
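
As a hedged illustration of item (5), the pattern below follows the usage described in the kernel's memory-barriers documentation of that era: MMIO writes issued by several CPUs under the same spinlock, with mmiowb() keeping them ordered as seen by the device. The device registers, offsets and values are invented for the example (and note that much later kernels folded this barrier into spin_unlock() and removed the explicit call).

static DEFINE_SPINLOCK(dev_lock);

void dev_kick(void __iomem *regs, u32 cmd)
{
        spin_lock(&dev_lock);
        writel(cmd, regs + 0x00);       /* hypothetical command register  */
        writel(1, regs + 0x04);         /* hypothetical doorbell register */
        mmiowb();                       /* make sure this CPU's MMIO writes
                                           reach the device before another
                                           CPU can take the lock and write */
        spin_unlock(&dev_lock);
}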

The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to eschew any optimizations that would have the effect of reordering memory optimizations across the barriers. The smp_read_barrier_depends() primitive has a similar effect, but only on Alpha CPUs. See Section 14.2 for more information on use of these primitives.These primitives generate code only in SMP kernels, however, each also has a UP version (mb(), rmb(), wmb(), and read_barrier_depends(), respectively) that generate a memory barrier even in UP kernels. The smp_ versions should be used in most cases. However, these latter primitives are useful when writing drivers, because MMIO accesses must remain ordered even in UP kernels. In absence of memory-barrier instructions, both CPUs and compilers would happily rearrange these accesses, which at best would make the device act strangely, and could crash your kernel or, in some cases, even damage your hardware.

Besides directing the CPU (constraining the order in which its accesses are committed to the memory system), smp_mb(), smp_rmb(), and smp_wmb() also act on the compiler, much like Linux's optimization barrier barrier(), preventing it from reordering memory accesses across the barrier as an optimization. smp_read_barrier_depends() has the same effect, but only on Alpha. See Section 14.2 for more on the use of these primitives. The smp_ primitives emit memory-barrier instructions only in SMP kernels; in UP kernels they generate no barrier instruction and act merely as compiler barriers. Each has a UP counterpart (mb(), rmb(), wmb(), and read_barrier_depends(), respectively) that emits a real memory-barrier instruction in both SMP and UP kernels. Most code should use the smp_ versions, but driver writers frequently need the non-smp_ versions, because MMIO accesses must stay ordered even in UP kernels. Without these primitives, both the CPU and the compiler would happily rearrange the accesses, which at best makes the device behave strangely, may crash the kernel, and in some cases can even damage the hardware.

So most kernel programmers need not worry about the memory-barrier peculiarities of each and every CPU, as long as they stick to these interfaces. If you are working deep in a given CPU’s architecture-specific code, of course, all bets are off.
Furthermore, all of Linux’s locking primitives (spinlocks, reader-writer locks, semaphores, RCU, ...) include any needed barrier primitives. So if you are working with code that uses these primitives, you don’t even need to worry about Linux’s memory-ordering primitives.

So, as long as they stick to these architecture-independent memory-barrier interfaces (primitives) provided by the Linux kernel, most kernel developers need not worry about the memory-barrier peculiarities of any particular CPU. Of course, if you are writing architecture-specific code, all bets are off.

Life is good: you can develop kernel code without diving into the details of a specific CPU. And, wait, there is something to make you even happier. All of the kernel's locking primitives (spinlocks, reader-writer locks, semaphores, RCU, ...) already include whatever memory-barrier primitives they need, so if your code simply uses these locking primitives, you do not even need to give a hoot about Linux's memory-barrier primitives.

That said, deep knowledge of each CPU’s memory-consistency model can be very helpful when debugging, to say nothing of when writing architecture-specific code or synchronization primitives.
Besides, they say that a little knowledge is a very dangerous thing. Just imagine the damage you could do with a lot of knowledge! For those who wish to understand more about individual CPUs’ memory consistency models, the next sections describes those of the most popular and prominent CPUs. Although nothing can replace actually reading a given CPU’s documentation, these sections give a good overview.

At this point you may well ask: why, then, devote an entire section to the memory barriers of specific CPUs? Life may be good, but deep knowledge of each CPU's memory-consistency model is a great help when debugging kernel code, to say nothing of when you have to write architecture-specific code or synchronization primitives, where a solid grasp of that CPU's memory ordering and memory barriers is simply mandatory.

Besides, as the saying goes, a little knowledge is a very dangerous thing; the author's tongue-in-cheek corollary is to imagine the damage you could do with a lot of knowledge. Some engineers are perfectly happy using the primitives Linux provides, but there are always those who want to understand more about memory ordering and memory barriers on the various CPUs; the following subsections are for them, describing the memory-barrier instructions of the most popular and prominent CPUs. Note that these are only overviews: nothing replaces actually reading a given CPU's documentation.

1. Alpha

It may seem strange to say much of anything about a CPU whose end of life has been announced, but Alpha is interesting because, with the weakest memory ordering model, it reorders memory operations the most aggressively. It therefore has defined the Linux kernel memory-ordering primitives, which must work on all CPUs, including Alpha. Understanding Alpha is therefore surprisingly important to the Linux kernel hacker.

It may look odd to spend time now on a CPU whose death has already been announced, but Alpha is a very interesting beast: it has the weakest memory-ordering model of them all, and in the name of performance it reorders memory operations with wild abandon. Since the Linux kernel must work on every processor, Alpha included, the kernel's memory-barrier primitives were in effect defined by this CPU. Understanding Alpha is therefore surprisingly important to the kernel enthusiast.

The difference between Alpha and the other CPUs is illustrated by the code shown in Figure C.9. This smp_wmb() on line 9 of this figure guarantees that the element initialization in lines 6-8 is executed before the element is added to the list on line 10, so that the lock-free search will work correctly. That is, it makes this guarantee on all CPUs except Alpha.

Alpha has extremely weak memory ordering such that the code on line 20 of Figure C.9 could see the old garbage values that were present before the initialization on lines 6-8.

The difference between Alpha and other processors is illustrated by the code below:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GFP_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data;
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }

In the code above, the smp_wmb() on line 9 guarantees that the initialization of the structure's members on lines 6-8 is performed before line 10 links the structure into the list; only then can the lock-free list search work correctly. The code above runs correctly on every processor, except Alpha.

Alpha has such an extraordinarily weak memory-ordering model that line 20, which loads the key member through the pointer, can see the old garbage values that were present before the initialization on lines 6-8, rather than the freshly written values.

Figure C.10 shows how this can happen on an aggressively parallel machine with partitioned caches, so that alternating caches lines are processed by the different partitions of the caches. Assume that the list header head will be processed by cache bank 0, and that the new element will be processed by cache bank 1. On Alpha, the smp_wmb() will guarantee that the cache invalidates performed by lines 6-8 of Figure C.9 will reach the interconnect before that of line 10 does, but makes absolutely no guarantee about the order in which the new values will reach the reading CPU’s core. For example, it is possible that the reading CPU’s cache bank 1 is very busy, but cache bank 0 is idle. This could result in the cache invalidates for the new element being delayed, so that the reading CPU gets the new value for the pointer, but sees the old cached values for the new element. See the Web site called out earlier for more information, or, again, if you think that I am just making all this up.

How can that possibly happen? The figure below shows how:

[Figure C.10: partitioned-cache system showing why the reader needs a barrier on Alpha]

The figure shows an aggressively parallel machine with a partitioned cache: the cache is split into two banks, and alternating cache lines are handled by different banks (if cache line A is handled by bank 0, cache line A+1 is handled by bank 1). Suppose the list header head is handled by cache bank 0 of the reading CPU, while the new element is handled by bank 1. On Alpha, smp_wmb() guarantees that the invalidate messages produced by the initialization on lines 6-8 reach the interconnect before the one produced by line 10, but it makes absolutely no guarantee about the order in which the new values reach the reading CPU's core. For example, the reading CPU's cache bank 1 might be very busy while bank 0 is idle, so the invalidations for the new element are delayed; the reading CPU then gets the new value of the pointer yet still sees the old cached values of the element's members. Once again, if you think I am making this up, download the Alpha documentation referenced earlier and read it carefully.

One could place an smp_rmb() primitive between the pointer fetch and dereference. However, this imposes unneeded overhead on systems (such as i386, IA64, PPC, and SPARC) that respect data dependencies on the read side. A smp_read_barrier_depends() primitive has been added to the Linux 2.6 kernel to eliminate overhead on these systems. This primitive may be used as shown on line 19 of Figure C.11.

Of course, you could place an smp_rmb() between the code that fetches the structure pointer and the code that dereferences it to access the members. But on systems such as i386, IA64, PPC, and SPARC, which respect data dependencies on the read side, this would impose needless overhead: on those systems two dependent loads are performed in order even without smp_rmb(), and the CPU will not scramble their memory accesses. To eliminate that overhead, the smp_read_barrier_depends() primitive was added in the 2.6 kernel; the code looks like this:

1 struct el *insert(long key, long data)
2 {
3     struct el *p;
4     p = kmalloc(sizeof(*p), GFP_ATOMIC);
5     spin_lock(&mutex);
6     p->next = head.next;
7     p->key = key;
8     p->data = data;
9     smp_wmb();
10    head.next = p;
11    spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         smp_read_barrier_depends();
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }

On most CPUs smp_read_barrier_depends() is a no-op and so costs nothing; on Alpha it orders the dependent loads and thereby keeps the program logically correct.
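
For reference, later kernels usually express this publish/subscribe pattern through rcu_assign_pointer() and rcu_dereference(), which hide the smp_wmb()/smp_read_barrier_depends() pair. A hedged sketch of the same insert/search pair using those wrappers (same hypothetical struct el and head as in the listings above; real readers would additionally sit inside rcu_read_lock()/rcu_read_unlock()):

struct el *insert(long key, long data)
{
        struct el *p;

        p = kmalloc(sizeof(*p), GFP_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        rcu_assign_pointer(head.next, p);   /* implies the smp_wmb()          */
        spin_unlock(&mutex);
        return p;
}

struct el *search(long key)
{
        struct el *p;

        p = rcu_dereference(head.next);     /* implies the dependency barrier */
        while (p != &head) {
                if (p->key == key)
                        return p;
                p = rcu_dereference(p->next);
        }
        return NULL;
}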

It is also possible to implement a software barrier that could be used in place of smp_wmb(), which would force all reading CPUs to see the writing CPU’s writes in order. However, this approach was deemed by the Linux community to impose excessive overhead on extremely weakly ordered CPUs such as Alpha. This software barrier could be implemented by sending inter-processor interrupts (IPIs) to all other CPUs. Upon receipt of such an IPI, a CPU would execute a memory-barrier instruction, implementing a memory-barrier shootdown. Additional logic is required to avoid deadlocks. Of course, CPUs that respect data dependencies would define such a barrier to simply be smp_wmb(). Perhaps this decision should be revisited in the future as Alpha fades off into the sunset.

The Linux memory-barrier primitives took their names from the Alpha instructions, so smp_mb() is mb, smp_rmb() is rmb, and smp_wmb() is wmb. Alpha is the only CPU where smp_read_barrier_depends() is an smp_mb() rather than
a no-op.

Adding smp_read_barrier_depends() did bring some extra complexity: it is one more architecture-independent memory-barrier interface, and the usual principle is not to use two interfaces where one will do. The community was aware of this, and a "software barrier" scheme was once proposed that could be used in place of smp_wmb(), leaving the reading side untouched while still forcing all reading CPUs to see the writing CPU's writes in order. The scheme was roughly this:

(1) On CPUs that respect data dependencies, the software barrier would simply be smp_wmb().

(2) On Alpha, the software barrier would send inter-processor interrupts (IPIs) to all the other CPUs; each CPU receiving such an IPI would execute a memory-barrier instruction, implementing a memory-barrier shootdown (with some additional logic required to avoid deadlocks).

The Linux community rejected this scheme because, on CPUs with an extremely weak memory-ordering model such as Alpha, the overhead was judged excessive. Perhaps, as Alpha fades into the sunset, the decision deserves to be revisited.

The names of the Linux kernel's memory-barrier primitives come from the Alpha memory-barrier instructions: smp_mb() is mb, smp_rmb() is rmb, and smp_wmb() is wmb. Alpha is the only processor on which smp_read_barrier_depends() is an smp_mb() rather than a no-op.

2. AMD64

TODO

3. ARMv7-A/R

The ARM family of CPUs is extremely popular in embedded applications, particularly for power-constrained applications such as cellphones. There have nevertheless been multiprocessor implementations of ARM for more than five years. Its memory model is similar to that of Power (see Section C.7.6), but ARM uses a different set of memory-barrier instructions [ARM10]:

(1) DMB (data memory barrier) causes the specified type of operations to appear to have completed before any subsequent operations of the same type. The “type” of operations can be all operations or can be restricted to only writes (similar to the Alpha wmb and the POWER eieio instructions). In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors (“inner”) and global (“outer”).

(2)DSB (data synchronization barrier) causes the specified type of operations to actually complete before any subsequent operations (of any type) are executed. The “type” of operations is the same as that of DMB. The DSB instruction was called DWB (drain write buffer or data write barrier, your choice) in early versions of the ARM architecture.

(3)ISB (instruction synchronization barrier) flushes the CPU pipeline, so that all instructions following the ISB are fetched only after the ISB completes. For example, if you are writing a self-modifying program (such as a JIT), you should execute an ISB between generating the code and executing it.

ARM processors are widely used in embedded devices, particularly in power-sensitive applications such as mobile phones. Nevertheless, multiprocessor implementations of ARM have existed for more than five years. ARM's memory model is similar to that of the Power processors (see Section 6 of this chapter), but ARM uses a different set of memory-barrier instructions, as follows:

(1) DMB (data memory barrier). DMB constrains the order of the specified type of memory operations within a specified scope (the shareability domain). More precisely: as seen by all observers within the given shareability domain, the CPU appears to complete the specified type of memory operations preceding the barrier before completing any memory operations that follow it. DMB thus takes two parameters, a shareability domain and an access type. The access type can cover all accesses, or be restricted to stores only; a stores-only DMB behaves much like Alpha's wmb or the POWER eieio instruction. There are three shareability scopes: a single processor, a subset of the processors ("inner"), and global ("outer").

(2) DSB (data synchronization barrier). DSB is "harder" than DMB and correspondingly more damaging to performance: where DMB is about how things appear, DSB actually constrains execution. The specified type of memory operations before the DSB must really complete before any subsequent operation (of any type) begins to execute. Its parameters are the same as DMB's, so they are not repeated here. Incidentally, in early versions of the ARM architecture DSB was called DWB (drain write buffer, or data write barrier, take your pick). DWB never fully reflected what the instruction does, since it also has to synchronize caches, TLBs, branch prediction, and explicit load/store operations; DSB is clearly the better name.

(3) ISB (instruction synchronization barrier). If, in terms of performance impact, DSB is the atom bomb, then ISB is the hydrogen bomb. ISB flushes the CPU pipeline (every partially executed instruction behind it is discarded), so the instructions following the ISB are fetched, decoded, and so on only after the ISB itself completes; the entire pipeline starts over. A concrete use case: if you are writing a self-modifying program (one that modifies its own execution logic, a JIT for example), you must execute an ISB between generating the code and executing it. (A small kernel-side sketch follows below.)
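
Inside the Linux kernel the "make freshly written code executable" step is wrapped by flush_icache_range(); on ARM this ends up performing the required cache maintenance followed by the barrier sequence just described. The toy JIT below is purely illustrative (the buffer and the function type are made up):

typedef long (*jit_fn)(long);

jit_fn publish_code(void *buf, size_t len)
{
        /* ...machine code has just been written into buf... */

        /* Synchronize the data and instruction streams before the new
           code is allowed to run.                                      */
        flush_icache_range((unsigned long)buf, (unsigned long)buf + len);

        return (jit_fn)buf;
}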

None of these instructions exactly match the semantics of Linux’s rmb() primitive, which must therefore be implemented as a full DMB. The DMB and DSB instructions have a recursive definition of accesses ordered before and after the barrier, which has an effect similar to that of POWER’s cumulativity.

None of these instructions exactly matches the semantics of the Linux rmb() primitive, so on ARM rmb() is in fact implemented as a full DMB. The DMB and DSB instructions have a recursive definition of which accesses are ordered before and after the barrier, with an effect similar to POWER's cumulativity.
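
As a rough sketch (not the actual kernel header, and ignoring pre-ARMv7 variants), this is approximately how the Linux primitives end up being implemented on ARMv7; note how the read barrier has to fall back to a full DMB, as just discussed, while the write barrier can use the store-only variant:

/* illustrative only; the real definitions live in the arch headers */
#define my_smp_mb()   __asm__ __volatile__("dmb ish"   : : : "memory")
#define my_smp_rmb()  __asm__ __volatile__("dmb ish"   : : : "memory") /* no load-only DMB */
#define my_smp_wmb()  __asm__ __volatile__("dmb ishst" : : : "memory") /* orders stores only */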

ARM also implements control dependencies, so that if a conditional branch depends on a load, then any store executed after that conditional branch will be ordered after the load. However, loads following the conditional branch will not be guaranteed to be ordered unless there is an ISB instruction between the branch and the load. Consider the following example:

Program flow inevitably contains control dependencies, where instruction B executes only if the result of instruction A satisfies some condition. ARM handles control dependencies as follows: if a conditional branch depends on a load, then any store executed after that conditional branch is ordered after the load. Note that loads after the conditional branch get no such guarantee unless an ISB instruction is placed between the branch and the load. Consider the following example:

1 r1 = x;
2 if (r1 == 0)
3 nop();
4 y = 1;
5 r2 = z;
6 ISB();
7 r3 = z;

In this example, load-store control dependency ordering causes the load from x on line 1 to be ordered before the store to y on line 4. However, ARM does not respect load-load control dependencies, so that the load on line 1 might well happen after the load on line 5. On the other hand, the combination of the conditional branch on line 2 and the ISB instruction on line 6 ensures that the load on line 7 happens after the load on line 1. Note that inserting an additional ISB instruction somewhere between lines 3 and 4 would enforce ordering between lines 1 and 5.

In this example, the load-store control dependency orders the load of x on line 1 before the store to y on line 4. ARM does not, however, respect load-load control dependencies, so the load of z on line 5 may well happen before the load of x on line 1. On the other hand, the combination of the conditional branch on line 2 and the ISB on line 6 ensures that the load of z on line 7 happens after the load of x on line 1. Note that inserting an additional ISB somewhere between lines 3 and 4 would also enforce ordering between lines 1 and 5.

4. IA64

IA64 offers a weak consistency model, so that in absence of explicit memory-barrier instructions, IA64 is within its rights to arbitrarily reorder memory references [Int02b]. IA64 has a memory-fence instruction named mf, but also has “half-memory fence” modifiers to loads, stores, and to some of its atomic instructions [Int02a]. The acq modifier prevents subsequent memory-reference instructions from being reordered before the acq, but permits prior memory-reference instructions to be reordered after the acq, as fancifully illustrated by Figure C.12. Similarly, the rel modifier prevents prior memory-reference instructions from being reordered after the rel, but allows subsequent memory-reference instructions to be reordered before the rel.

IA64 offers a weak consistency model: in the absence of explicit memory-barrier instructions it is within its rights to reorder memory references arbitrarily. It provides a memory-fence instruction named mf, and in addition offers "half memory fence" modifiers on loads, stores, and some of its atomic instructions. The acq modifier prevents subsequent memory-reference instructions from being reordered before the acq, but permits prior memory-reference instructions to be reordered after it, as shown in the figure below:

[Figure C.12: half memory barrier semantics of the acq modifier]

ld.acq is a load carrying this half memory fence: memory operations preceding it may be delayed until after it, but operations after it may not be moved before it. The rel modifier similarly provides a half memory fence, only in the opposite direction: it prevents memory accesses preceding the rel-marked instruction from being reordered after it, while allowing later accesses to move before it.

These half-memory fences are useful for critical sections, since it is safe to push operations into a critical section, but can be fatal to allow them to bleed out. However, as one of the only CPUs with this property, IA64 defines Linux’s semantics of memory ordering associated with lock acquisition and release.

Half memory fences are a perfect fit for critical sections: it is safe for operations to leak into a critical section, but letting them bleed out can be fatal. Bracket the critical-section code between an acq and a rel and its memory accesses cannot escape. As (almost) the only CPU with this property (ARMv8 later gained it as well), IA64 ended up defining the Linux semantics of memory ordering for lock acquisition and release.
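
The architecture-independent counterparts of these half fences in today's kernels are smp_load_acquire() and smp_store_release(), added long after this appendix was written. A small hedged sketch of the message-passing idiom expressed with them, using invented variable names:

int msg;     /* payload, initially 0          */
int ready;   /* publication flag, initially 0 */

void send(void)
{
        msg = 42;
        smp_store_release(&ready, 1);   /* "release": the store to msg may
                                           not be reordered after this store */
}

void recv(void)
{
        if (smp_load_acquire(&ready))   /* "acquire": later loads may not be
                                           reordered before this load        */
                BUG_ON(msg != 42);
}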

The IA64 mf instruction is used for the smp_rmb(), smp_mb(), and smp_wmb() primitives in the Linux kernel. Oh, and despite rumors to the contrary, the “mf” mnemonic really does stand for “memory fence”.

On the IA64 architecture, the Linux primitives smp_rmb(), smp_mb(), and smp_wmb() all map to the mf instruction. And despite rumors to the contrary, the mnemonic "mf" really does stand for "memory fence".

Finally, IA64 offers a global total order for “release” operations, including the “mf” instruction. This provides the notion of transitivity, where if a given code fragment sees a given access as having happened, any later code fragment will also see that earlier access as having happened. Assuming, that is, that all the code fragments involved correctly use memory barriers.

Finally, IA64 gives "release" operations, including the mf instruction, a single global total order. This provides transitivity: if one code fragment sees a given access as having happened, then any later code fragment will also see that earlier access as having happened, assuming, of course, that all the fragments involved use memory barriers correctly.

5. PA-RISC

Although the PA-RISC architecture permits full reordering of loads and stores, actual CPUs run fully ordered [Kan96]. This means that the Linux kernel’s memory-ordering primitives generate no code, however, they do use the gcc memory attribute to disable compiler optimizations that would reorder code across the memory barrier.

Although the PA-RISC architecture permits loads and stores to be fully reordered, actual PA-RISC CPUs run fully ordered: memory accesses are performed exactly in program order. The Linux kernel's memory-ordering primitives therefore generate no barrier instructions on PA-RISC, although they still act as compiler barriers (via the gcc memory attribute) so that the compiler cannot reorder accesses across them.

6. POWER / PowerPC

TODO

7. SPARC RMO, PSO, and TSO

TODO

8. x86

TODO

9. zSeries

TODO

 

IX. Are Memory Barriers Forever?

There have been a number of recent systems that are significantly less aggressive about out-of-order execution in general and re-ordering memory references in particular. Will this trend continue to the point where memory barriers are a thing of the past?
The argument in favor would cite proposed massively multi-threaded hardware architectures, so that each thread would wait until memory was ready, with tens, hundreds, or even thousands of other threads making progress in the meantime. In such an architecture, there would be no need for memory barriers, because a given thread would simply wait for all outstanding operations to complete before proceeding to the next instruction. Because there would be potentially thousands of other threads, the CPU would be completely utilized, so no CPU time would be wasted.

A number of recent systems have become noticeably less aggressive about out-of-order execution in general, and about reordering memory references in particular. Will this trend continue to the point where memory barriers become a term found only in the history of computing?

Those who believe memory barriers should exit the stage point to proposed massively multi-threaded hardware (hardware threads, not software threads), with ever larger thread counts: even if one thread stalls waiting for memory, tens, hundreds, or even thousands of other threads keep making progress in the meantime. In such an architecture there would be no need for memory barriers at all, because each thread would simply wait for all of its outstanding operations to complete before proceeding to the next instruction, and with thousands of hardware threads available the CPU would stay fully utilized, so no CPU time would be wasted.

The argument against would cite the extremely limited number of applications capable of scaling up to a thousand threads, as well as increasingly severe realtime requirements, which are in the tens of microseconds for some applications. The realtime-response requirements are difficult enough to meet as is, and would be even more difficult to meet given the extremely low single-threaded throughput implied by the massive multi-threaded scenarios.

The opposing side points to the extremely limited number of applications capable of scaling up to a thousand threads, and to increasingly severe realtime requirements, which for some applications are measured in tens of microseconds. Realtime-response requirements are hard enough to meet as it is, and would be even harder to meet given the extremely low single-threaded throughput implied by massive hardware multi-threading.

Another argument in favor would cite increasingly sophisticated latency-hiding hardware implementation techniques that might well allow the CPU to provide the illusion of fully sequentially consistent execution while still providing almost all of the performance advantages of out-of-order execution. A counter-argument would cite the increasingly severe power-efficiency requirements presented both by battery-operated devices and by environmental responsibility.
Who is right? We have no clue, so are preparing to live with either scenario.

There is yet another argument in favor of retiring memory barriers: despite the speed gap between processor and memory, increasingly sophisticated latency-hiding hardware techniques might eventually let the CPU present the illusion of fully sequentially consistent execution while still delivering almost all of the performance advantages of out-of-order execution. The counter-argument points to the increasingly severe power-efficiency requirements imposed both by battery-operated devices and by environmental responsibility.

Who is right? We have no idea; perhaps we will end up living with both scenarios.

 

X. Advice to Hardware Designers

There are any number of things that hardware designers can do to make the lives of software people difficult. Here is a list of a few such things that we have encountered in the past, presented here in the hope that it might help prevent future such problems:

Hardware designers who do not think things through really can make software engineers' lives miserable. As a software engineer I have run into a few such cases in the past; I list them here in the hope of keeping other hardware designers from making the same mistakes:

1. I/O devices that ignore cache coherence.


This charming misfeature can result in DMAs from memory missing recent changes to the output buffer, or, just as bad, cause input buffers to be overwritten by the contents of CPU caches just after the DMA completes. To make your system work in face of such misbehavior, you must carefully flush the CPU caches of any location in any DMA buffer before presenting that buffer to the I/O device. Similarly, you need to flush the CPU caches of any location in any DMA buffer after DMA to that buffer completes. And even then, you need to be very careful to avoid pointer bugs, as even a misplaced read to an input buffer can result in corrupting the data input!

This charming misfeature can cause a DMA from memory to miss recent changes to the output buffer (the newest data is still sitting in the CPU cache), or, just as bad, cause the input buffer to be overwritten by the contents of the CPU cache right after the DMA completes. To make a system work in the face of such behaviour, you must carefully flush the CPU caches for every location in a DMA buffer before handing that buffer to the I/O device, and likewise flush the CPU caches for every location in the buffer after the DMA into it completes. Even then you must be very careful to avoid pointer bugs, since even a misplaced read from an input buffer can corrupt the input data!
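
In the Linux kernel this flushing is normally delegated to the DMA-mapping API rather than done by hand; a hedged sketch of the usual pattern for a device-to-memory transfer (the device pointer, buffer and length are placeholders):

void start_rx(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        /* Hand the buffer to the device; on non-coherent hardware this
           performs the necessary cache flush/invalidate.               */
        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return;

        /* ...program the device with 'handle' and start the DMA... */
}

void rx_complete(struct device *dev, dma_addr_t handle, size_t len)
{
        /* Give the buffer back to the CPU; only after this may the CPU
           safely look at the freshly DMA'd data.                       */
        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}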

2. External busses that fail to transmit cache-coherence data.


This is an even more painful variant of the above problem, but causes groups of devices—and even memory itself—to fail to respect cache coherence. It is my painful duty to inform you that as embedded systems move to multicore architectures, we will no doubt see a fair number of such problems arise. Hopefully these problems will clear up by the year 2015.

This is an even more painful variant of the previous problem: it causes groups of devices, and even memory itself, to fail to respect cache coherence. Having suffered through it, it is my painful duty to tell you that as embedded systems move to multicore architectures, we will no doubt see a fair number of such problems arise; hopefully they will be cleared up by the year 2015.

3. Device interrupts that ignore cache coherence.


This might sound innocent enough — after all, interrupts aren’t memory references, are they? But imagine a CPU with a split cache, one bank of which is extremely busy, therefore holding onto the last cacheline of the input buffer. If the corresponding I/O-complete interrupt reaches this CPU, then that CPU’s memory reference to the last cache line of the buffer could return old data, again resulting in data corruption, but in a form that will be invisible in a later crash dump. By the time the system gets around to dumping the offending input buffer, the DMA will most likely have completed.

This one sounds innocent enough; after all, interrupts are not memory references, are they? But imagine a CPU with a split cache, one bank of which is extremely busy and is therefore still holding on to the last cache line of the input buffer. If the corresponding I/O-completion interrupt reaches this CPU, its read of that last cache line of the buffer can return stale data, corrupting the data in a form that will be invisible in a later crash dump: by the time the system gets around to dumping the offending input buffer, the DMA will most likely have completed.

4. Inter-processor interrupts (IPIs) that ignore cache coherence.


This can be problematic if the IPI reaches its destination before all of the cache lines in the corresponding message buffer have been committed to memory.

This is a problem if the IPI reaches its destination before all of the cache lines in the corresponding message buffer have been committed to memory; the correct order is for the message buffer to be committed first, and only then for the target CPU to receive the IPI.

5. Context switches that get ahead of cache coherence.


If memory accesses can complete too wildly out of order, then context switches can be quite harrowing. If the task flits from one CPU to another before all the memory accesses visible to the source CPU make it to the destination CPU, then the task could easily see the corresponding variables revert to prior values, which can fatally confuse most algorithms.

If memory accesses can complete wildly out of order, context switches become harrowing. Suppose a task migrates from one CPU to another before all the memory accesses visible on the source CPU have become visible on the destination CPU: the task can then watch variables revert to earlier values, reading a new value and later, from the same location, an old one, which fatally confuses most algorithms.

6. Overly kind simulators and emulators.


It is difficult to write simulators or emulators that force memory re-ordering, so software that runs just fine in these environments can get a nasty surprise when it first runs on the real hardware. Unfortunately, it is still the rule that the hardware is more devious than are the simulators and emulators, but we hope that this situation changes.

It is hard to write simulators and emulators that force memory reordering, so software that runs just fine in those environments can get a nasty surprise the first time it runs on real hardware. Unfortunately, it is still the rule that real hardware is more devious than the simulators and emulators; we can only hope the situation improves over time.

Again, we encourage hardware designers to avoid these practices!

Once again, we encourage hardware designers to avoid these mistakes!
