What every programmer should know about memory (Part 2-1) [translated]

What Every Programmer Should Know About Memory
Ulrich Drepper
Red Hat, Inc.
[email protected]
November 21, 2007


2.1 RAM Types

There have been many types of RAM over the years and each type varies, sometimes significantly, from the others. The older types are today really only interesting to historians. We will not explore the details of those. Instead we will concentrate on modern RAM types; we will only scrape the surface, exploring some details which are visible to the kernel or application developer through their performance characteristics.

The first interesting details are centered around the question why there are different types of RAM in the same machine. More specifically, why there are both static RAM (SRAM {In other contexts SRAM might mean “synchronous RAM”.}) and dynamic RAM (DRAM). The former is much faster and provides the same functionality. Why is not all RAM in a machine SRAM? The answer is, as one might expect, cost. SRAM is much more expensive to produce and to use than DRAM. Both these cost factors are important, the second one increasing in importance more and more. To understand these differences we look at the implementation of a bit of storage for both SRAM and DRAM.

In the remainder of this section we will discuss some low-level details of the implementation of RAM. We will keep the level of detail as low as possible. To that end, we will discuss the signals at a “logic level” and not at a level a hardware designer would have to use. That level of detail is unnecessary for our purpose here.

2.1.1 Static RAM

Figure 2.4 shows the structure of a 6 transistor SRAM cell. The core of this cell is formed by the four transistors M1 to M4 which form two cross-coupled inverters. They have two stable states, representing 0 and 1 respectively. The state is stable as long as power on Vdd is available.

If access to the state of the cell is needed the word access line WL is raised. This makes the state of the cell immediately available for reading on BL and BLB (the complement of BL). If the cell state must be overwritten the BL and BLB lines are first set to the desired values and then WL is raised. Since the outside drivers are stronger than the four transistors (M1 through M4) this allows the old state to be overwritten.

See [sramwiki] for a more detailed description of the way the cell works. For the following discussion it is important to note that

  • one cell requires six transistors. There are variants with four transistors but they have disadvantages.
  • maintaining the state of the cell requires constant power.
  • the cell state is available for reading almost immediately once the word access line WL is raised. The signal is as rectangular (changing quickly between the two binary states) as other transistor-controlled signals.
  • the cell state is stable, no refresh cycles are needed.
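Purely as an illustration of the behaviour listed above, the cell can be mimicked with a toy model. The class and method names here are invented for this sketch; a real cell is analog circuitry, not code:

```python
class SramCell:
    """Toy model of the 6T cell: the two cross-coupled inverters are
    reduced to a single stored bit; BLB is always the complement of BL."""

    def __init__(self, state=0):
        self.state = state          # stable as long as "power" is applied

    def read(self, wl):
        # With WL raised the state appears on BL/BLB immediately.
        if not wl:
            return None, None       # WL low: the cell is isolated
        return self.state, 1 - self.state   # (BL, BLB)

    def write(self, wl, bl):
        # The external drivers overpower M1..M4, so the old state
        # is simply replaced while WL is raised.
        if wl:
            self.state = bl

cell = SramCell(0)
cell.write(wl=1, bl=1)              # drive BL=1 (so BLB=0), raise WL
```

Note what the model deliberately leaves out: there is no refresh and no delay, matching the last two bullet points.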

There are other, slower and less power-hungry, SRAM forms available, but those are not of interest here since we are looking at fast RAM. These slow variants are mainly interesting because they can be more easily used in a system than dynamic RAM because of their simpler interface.

2.1.2 Dynamic RAM

Dynamic RAM is, in its structure, much simpler than static RAM. Figure 2.5 shows the structure of a usual DRAM cell design. All it consists of is one transistor and one capacitor. This huge difference in complexity of course means that it functions very differently than static RAM.

A dynamic RAM cell keeps its state in the capacitor C. The transistor M is used to guard the access to the state. To read the state of the cell the access line AL is raised; this either causes a current to flow on the data line DL or not, depending on the charge in the capacitor. To write to the cell the data line DL is appropriately set and then AL is raised for a time long enough to charge or drain the capacitor.

There are a number of complications with the design of dynamic RAM. The use of a capacitor means that reading the cell discharges the capacitor. The procedure cannot be repeated indefinitely, the capacitor must be recharged at some point. Even worse, to accommodate the huge number of cells (chips with 10^9 or more cells are now common) the capacity of the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few tens of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the charge to dissipate. This problem is called “leakage”.

This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible. For some workloads this overhead might stall up to 50% of the memory accesses (see [highperfdram]).
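To get a feel for the cost of refreshing, a back-of-the-envelope sketch helps. The row count and the per-row refresh time below are illustrative values, not figures from the text:

```python
REFRESH_WINDOW = 64e-3   # every cell must be refreshed within 64 ms

def refresh_overhead(rows, t_refresh_per_row):
    """Fraction of time the chip is busy refreshing, assuming the rows
    are refreshed one after the other within the 64 ms window."""
    return rows * t_refresh_per_row / REFRESH_WINDOW

# e.g. 8192 rows at a hypothetical 75 ns per row refresh:
overhead = refresh_overhead(8192, 75e-9)   # about 0.96% on average
```

The average overhead is small; the "up to 50%" figure cited above is about unlucky access patterns that keep colliding with refresh cycles, not the average.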

A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1.

A third problem is that charging and draining a capacitor is not instantaneous. The signals received by the sense amplifier are not rectangular, so a conservative estimate as to when the output of the cell is usable has to be used. The formulas for charging and discharging a capacitor are

    Q_Charge(t) = Q_0 (1 − e^(−t/RC))
    Q_Discharge(t) = Q_0 e^(−t/RC)

This means it takes some time (determined by the capacity C and resistance R) for the capacitor to be charged and discharged. It also means that the current which can be detected by the sense amplifiers is not immediately available. Figure 2.6 shows the charge and discharge curves. The X axis is measured in units of RC (resistance multiplied by capacitance), which is a unit of time.
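The curves of Figure 2.6 follow directly from those formulas. A small numeric sketch (the RC value is arbitrary, only the shape matters):

```python
import math

def charge_fraction(t, rc):
    """Fraction of the full charge reached after time t (charging curve)."""
    return 1 - math.exp(-t / rc)

def time_to_fraction(f, rc):
    """Time needed to charge the capacitor to fraction f of Q_0."""
    return -rc * math.log(1 - f)

# Reaching 90% of the full charge takes about 2.3 RC, which is why a
# conservative wait is needed before the sense amplifier output is trusted.
t90 = time_to_fraction(0.9, rc=1.0)
```

The exponential shape is the point: the first part of the charge arrives quickly, but the last few percent take disproportionately long.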

Unlike the static RAM case where the output is immediately available when the word access line is raised, it will always take a bit of time until the capacitor discharges sufficiently. This delay severely limits how fast DRAM can be.

The simple approach has its advantages, too. The main advantage is size. The chip real estate needed for one DRAM cell is many times smaller than that of an SRAM cell. The SRAM cells also need individual power for the transistors maintaining the state. The structure of the DRAM cell is also simpler and more regular which means packing many of them close together on a die is simpler.

Overall, the (quite dramatic) difference in cost wins. Except in specialized hardware — network routers, for example — we have to live with main memory which is based on DRAM. This has huge implications on the programmer which we will discuss in the remainder of this paper. But first we need to look into a few more details of the actual use of DRAM cells.

2.1.3 DRAM Access

A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.

It would be completely impractical to address memory locations individually from the memory controller: 4GB of RAM would require 2^32 address lines. Instead the address is passed encoded as a binary number using a smaller set of address lines. The address passed to the DRAM chip this way must be demultiplexed first. A demultiplexer with N address lines will have 2^N output lines. These output lines can be used to select the memory cell. Using this direct approach is no big problem for chips with small capacities.
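Functionally, such a demultiplexer turns an N-bit encoded address into 2^N one-hot select lines; a minimal sketch:

```python
def demux(encoded_address, n_lines):
    """N encoded address lines select exactly one of 2**N output lines."""
    outputs = [0] * (2 ** n_lines)
    outputs[encoded_address] = 1    # raise the single selected line
    return outputs

# 4 address lines are enough to select one of 16 cells:
lines = demux(0b0101, 4)
```

The `outputs` list also makes the scaling problem visible: for 30 address lines it would have 2^30 entries, which is exactly the hardware cost discussed next.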

But if the number of cells grows this approach is not suitable anymore. A chip with 1Gbit {I hate those SI prefixes. For me a giga-bit will always be 2^30 and not 10^9 bits.} capacity would need 30 address lines and 2^30 select lines. The size of a demultiplexer increases exponentially with the number of input lines when speed is not to be sacrificed. A demultiplexer for 30 address lines needs a whole lot of chip real estate in addition to the complexity (size and time) of the demultiplexer. Even more importantly, transmitting 30 impulses on the address lines synchronously is much harder than transmitting “only” 15 impulses. Fewer lines have to be laid out at exactly the same length or timed appropriately. {Modern DRAM types like DDR3 can automatically adjust the timing but there is a limit as to what can be tolerated.}

Figure 2.7 shows a DRAM chip at a very high level. The DRAM cells are organized in rows and columns. They could all be aligned in one row but then the DRAM chip would need a huge demultiplexer. With the array approach the design can get by with one demultiplexer and one multiplexer of half the size. {Multiplexers and demultiplexers are equivalent and the multiplexer here needs to work as a demultiplexer when writing. So we will drop the differentiation from now on.} This is a huge saving on all fronts. In the example the address lines a0 and a1 through the row address selection (RAS) demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thus made available to the column address selection (CAS) {In the figure the name carries a line over it, indicating that the signal is negated.} multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.
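For the four-by-four array of Figure 2.7 this amounts to splitting the four address bits into a row part (a0, a1, fed to the RAS demultiplexer) and a column part (a2, a3, fed to the CAS multiplexer). A sketch of that split, assuming a0 is the least-significant bit:

```python
def split_address(address, row_bits=2, col_bits=2):
    """Split an address as in Figure 2.7: the low bits (a0, a1) select
    the row via RAS, the high bits (a2, a3) the column via CAS."""
    row = address & ((1 << row_bits) - 1)
    col = (address >> row_bits) & ((1 << col_bits) - 1)
    return row, col

# address 0b1101 -> row 0b01, column 0b11
row, col = split_address(0b1101)
```

With 2 + 2 bits the chip needs two demultiplexer/multiplexer structures of 4 outputs each instead of one with 16, which is the saving the text describes.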

For writing, the new cell value is put on the data bus and, when the cell is selected using the RAS and CAS, it is stored in the cell. A pretty straightforward design. In reality there are, obviously, many more complications. There need to be specifications for how much delay there is after the signal before the data will be available on the data bus for reading. The capacitors do not unload instantaneously, as described in the previous section. The signal from the cells is so weak that it needs to be amplified. For writing it must be specified how long the data must be available on the bus after the RAS and CAS is done to successfully store the new value in the cell (again, capacitors do not fill or drain instantaneously). These timing constants are crucial for the performance of the DRAM chip. We will talk about this in the next section.

A secondary scalability problem is that having 30 address lines connected to every RAM chip is not feasible either. Pins of a chip are a precious resource. It is “bad” enough that the data must be transferred as much as possible in parallel (e.g., in 64 bit batches). The memory controller must be able to address each RAM module (collection of RAM chips). If parallel access to multiple RAM modules is required for performance reasons and each RAM module requires its own set of 30 or more address lines, then the memory controller needs to have, for 8 RAM modules, a whopping 240+ pins only for the address handling.

To counter these secondary scalability problems DRAM chips have, for a long time, multiplexed the address itself. That means the address is transferred in two parts. The first part, consisting of address bits a0 and a1 in the example in Figure 2.7, selects the row. This selection remains active until revoked. Then the second part, address bits a2 and a3, selects the column. The crucial difference is that only two external address lines are needed. A few more lines are needed to indicate when the RAS and CAS signals are available but this is a small price to pay for cutting the number of address lines in half. This address multiplexing brings its own set of problems, though. We will discuss them in Section 2.2.
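The two-phase transfer can be pictured as a tiny chip model that latches the row during the RAS phase and then uses the column during the CAS phase, over only two shared address lines. The class and method names are invented for this sketch:

```python
class ToyDramChip:
    """4x4 cell array addressed over two shared address lines:
    the RAS phase latches the row, the CAS phase selects the column."""

    def __init__(self):
        self.cells = [[0] * 4 for _ in range(4)]
        self.latched_row = None

    def ras(self, row):
        self.latched_row = row          # stays active until revoked

    def cas_read(self, col):
        return self.cells[self.latched_row][col]

    def cas_write(self, col, value):
        self.cells[self.latched_row][col] = value

chip = ToyDramChip()
chip.ras(2)              # phase 1: row address on the shared lines
chip.cas_write(3, 1)     # phase 2: column address, plus the data
```

Because the row latch "remains active until revoked", several CAS accesses can follow one RAS, which is what makes the halved pin count cheap in practice.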

2.1.4 Conclusions

Do not worry if the details in this section are a bit overwhelming. The important things to take away from this section are:

  • there are reasons why not all memory is SRAM
  • memory cells need to be individually selected to be used
  • the number of address lines is directly responsible for the cost of the memory controller, motherboards, DRAM module, and DRAM chip
  • it takes a while before the results of the read or write operation are available

The following section will go into more details about the actual process of accessing DRAM memory. We are not going into more details of accessing SRAM, which is usually directly addressed. This happens for speed and because the SRAM memory is limited in size. SRAM is currently used in CPU caches and on-die where the connections are small and fully under control of the CPU designer. CPU caches are a topic which we discuss later but all we need to know is that SRAM cells have a certain maximum speed which depends on the effort spent on the SRAM. The speed can vary from only slightly slower than the CPU core to one or two orders of magnitude slower.


(Translator's note on DRAM array addressing: the array is divided into rows and columns. When a command reaches the memory, tRAS (Active to Precharge Delay) is triggered first and precharging takes place after the data is requested. Once tRAS is active, RAS addresses one half of the physical address; after the row is selected, tRCD starts, and finally the exact location is reached via CAS. The whole process is row addressing first, then column addressing. The interval from the start to the end of CAS is the CAS latency; because CAS is the last step of addressing, it is the most important of the memory timing parameters.)

