What every programmer should know about memory (Part 2-2) [translation]

What Every Programmer Should Know About Memory
Ulrich Drepper
Red Hat, Inc.
[email protected]
November 21, 2007


2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [highperfdram] and [arstechtwo]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) — the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today’s buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an “effective” 800MHz bus.

For SDRAM today each data transfer consists of 64 bits — 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.
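
To make the arithmetic concrete, here is the burst-rate computation as a minimal C sketch (the numbers are the ones from the text; the program is purely illustrative):

    #include <stdio.h>

    int main(void)
    {
        double bus_mhz = 200.0;  /* real clock of the quad-pumped FSB  */
        int pump = 4;            /* transfers per cycle (quad-pumped)  */
        int bytes = 8;           /* 64 bits per transfer               */

        /* 200e6 cycles/s * 4 transfers/cycle * 8 bytes = 6.4 GB/s burst */
        printf("%.1f GB/s\n", bus_mhz * 1e6 * pump * bytes / 1e9);
        return 0;
    }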

2.2.1 Read Access Protocol

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling edge in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row “open” is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [highperfdram]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and more energy efficient (see [ddrtwo] for more information).

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS line simultaneously. This combination has no useful meaning by itself (see [micronddr] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this by the FSB speed and the theoretical 6.4GB/s for an 800MHz bus becomes 1.8GB/s. That is bad and must be avoided. The techniques described in Section 6 help to raise this number. But the programmer usually has to do her share.
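
The utilization arithmetic is simple enough to write down; a minimal sketch in C, using the cycle counts from the text:

    #include <stdio.h>

    int main(void)
    {
        double burst_gbps = 6.4; /* theoretical rate of the quad-pumped 200MHz FSB */
        int busy = 2;            /* cycles the data bus actually carries data      */
        int total = 7;           /* busy cycles plus the gap to the next transfer  */

        /* 6.4 GB/s * 2/7 is roughly 1.8 GB/s */
        printf("effective rate: %.1f GB/s\n", burst_gbps * busy / total);
        return 0;
    }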

There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, in the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (tRP is used since it is larger than the data transfer time) is only 7 cycles.
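
A sketch of that constraint in C; the cycle counts are the ones assumed in the example above:

    #include <stdio.h>

    int main(void)
    {
        int tRCD = 2, CL = 2, tRP = 3; /* example timings from Figure 2.9 */
        int transfer = 2;              /* cycles needed to move the data  */
        int tRAS = 8;                  /* active-to-precharge time        */

        /* the sum from the text; tRP is used because it exceeds the
           data transfer time */
        int ready = tRCD + CL + (tRP > transfer ? tRP : transfer);
        int stall = tRAS > ready ? tRAS - ready : 0;
        printf("extra stall: %d cycle(s)\n", stall); /* 1 in this example */
        return 0;
    }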

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

w  2   CAS Latency (CL)
x  3   RAS-to-CAS Delay (tRCD)
y  2   RAS Precharge time (tRP)
z  8   Active-to-Precharge delay (tRAS)
T  T1  Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.
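
As a sketch, the five values could be captured in a small C struct and filled from the vendor notation (the struct and parser are hypothetical and assume integral values; half-cycle latencies such as CL=2.5 would need more care):

    #include <stdio.h>

    struct ddr_timings {
        unsigned cl;   /* CAS Latency               */
        unsigned trcd; /* RAS-to-CAS Delay          */
        unsigned trp;  /* RAS Precharge time        */
        unsigned tras; /* Active-to-Precharge delay */
        unsigned cmd;  /* Command Rate (Tx)         */
    };

    int main(void)
    {
        struct ddr_timings t;
        if (sscanf("2-3-2-8-T1", "%u-%u-%u-%u-T%u",
                   &t.cl, &t.trcd, &t.trp, &t.tras, &t.cmd) == 5)
            printf("CL=%u tRCD=%u tRP=%u tRAS=%u T%u\n",
                   t.cl, t.trcd, t.trp, t.tras, t.cmd);
        return 0;
    }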

It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer’s speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample documentation for doing this. Do it at your own risk, though, and do not say you have not been warned.

2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in Section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system. At times when a row {Rows are the granularity this happens with despite what [highperfdram] and other literature says (see [micronddr]).} is recharged no access is possible. The study in [highperfdram] found that “[s]urprisingly, DRAM refresh organization can affect performance dramatically”.

Each DRAM cell must be refreshed every 64ms according to the JEDEC specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller’s responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.
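
The refresh interval follows directly from those two numbers; a minimal check in C:

    #include <stdio.h>

    int main(void)
    {
        double retention_ms = 64.0; /* JEDEC: every cell within 64ms */
        int rows = 8192;            /* rows in the example array     */

        /* 64ms / 8192 rows = 7.8125us between refresh commands */
        printf("one refresh every %.4f us on average\n",
               retention_ms * 1000.0 / rows);
        return 0;
    }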

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which is currently being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple: the memory cell array and the data transfer rate operated at the same frequency.

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive. {Power = Dynamic Capacitance × Voltage² × Frequency.} In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a “double-pumped” bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

    100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two. { I will take the factor of two but I do not have to like the inflated numbers.}

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and will not require measurably more energy; it is just one tiny component and not the whole module. The names the marketers came up with for DDR2 are similar to the DDR1 names; only, in the computation of the value, the factor of two is replaced by four (we now have a quad-pumped bus). Table 2.1 shows the names of the modules in use today.
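
Following the naming rule just described, with the DDR1 factor of two replaced by four for DDR2, the computation can be sketched in C (illustrative only; the actual module names are standardized by JEDEC):

    #include <stdio.h>

    int main(void)
    {
        int mhz = 100;   /* cell array clock  */
        int bytes = 8;   /* 64-bit module bus */

        printf("DDR1: PC%d\n", mhz * bytes * 2);   /* PC1600           */
        printf("DDR2: PC2-%d\n", mhz * bytes * 4); /* quad-pumped: PC2-3200 */
        return 0;
    }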

There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both edges of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB “frequency” of 533MHz.

The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or with double the capacity the same heat emission can be achieved.
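
The 30% figure can be checked against the power formula quoted earlier (power proportional to capacitance × voltage² × frequency); a back-of-the-envelope sketch in C:

    #include <stdio.h>

    int main(void)
    {
        double v_ddr2 = 1.8, v_ddr3 = 1.5;

        /* at equal capacitance and frequency, power scales with V^2 */
        double ratio = (v_ddr3 * v_ddr3) / (v_ddr2 * v_ddr2);
        printf("saving from voltage alone: %.0f%%\n",
               (1.0 - ratio) * 100.0);   /* about 31%, roughly the quoted 30% */
        return 0;
    }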

The cell array of DDR3 modules will run at a quarter of the speed of the external bus which requires an 8 bit I/O buffer, up from 4 bits for DDR2. See Figure 2.13 for the schematics.

Initially DDR3 modules will likely have slightly higher CAS latencies just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Table 2.2 shows the names of the expected DDR3 modules. JEDEC agreed so far on the first four types. Given that Intel’s 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted for each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2) but this is expensive.

What this means is that commodity motherboards are restricted to hold at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM and memory demand even for home use is growing, so something has to be done.

One answer is to add memory controllers into each processor as explained in Section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel’s answer to this problem for big server machines, at least for the next years, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same components as today’s DDR2 modules which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP). The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth. The main effects of using a serial bus are

  1. more modules per channel can be used.
  2. more channels per Northbridge/memory controller can be used.
  3. the serial bus is designed to be fully-duplex (two lines).

An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

Compared with the connectivity requirements of a dual-channel Northbridge it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2×240 pins versus 6×69 pins. The routing for each channel is much simpler which could also help reduce the cost of the motherboards.

Fully duplex parallel busses are prohibitively expensive for the traditional DRAM modules; duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means, in some situations, that the bandwidth is theoretically doubled by this alone. But it is not the only place where parallelism is used for bandwidth increase. Since an FB-DRAM controller can run up to six channels at the same time the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

We can summarize the advantages like this:

                DDR2      FB-DRAM
Pins            240       69
Channels        2         6
DIMMs/Channel   2         8
Max Memory      16GB      192GB
Throughput      ~10GB/s   ~40GB/s

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed—albeit minimally—at each DIMM in the chain, which means the latency increases. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since only one DIMM per channel is needed; for large memory systems DDR simply has no answer using commodity components.

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thus increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.
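
The 11:1 ratio is easy to verify once the quad-pumping is factored out; a minimal sketch:

    #include <stdio.h>

    int main(void)
    {
        double cpu_ghz = 2.933;
        double fsb_effective_ghz = 1.066;

        /* the "1.066GHz" FSB is quad-pumped: the real bus clock is 1.066/4 */
        double bus_ghz = fsb_effective_ghz / 4.0;
        printf("ratio: %.0f:1\n", cpu_ghz / bus_ghz);   /* 11:1 */
        return 0;
    }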

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.
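
The 12.8GB/s figure is again simple arithmetic; a short sketch under the stated assumptions (DDR2-800, 64-bit modules, two channels):

    #include <stdio.h>

    int main(void)
    {
        double transfers = 800e6; /* DDR2-800: 800M transfers per second */
        int bytes = 8;            /* 64 bits per transfer                */
        int channels = 2;

        printf("%.1f GB/s\n", transfers * bytes * channels / 1e9); /* 12.8 */
        return 0;
    }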

But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help. The sooner the precharging can happen and the RAS signal sent the smaller the penalty when the row is actually used.

Hardware and software prefetching (see Section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.
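
Software prefetching of the kind discussed in Section 6.3 can be expressed directly in code; a minimal sketch using GCC's __builtin_prefetch (the distance of 16 elements is a made-up tuning value for illustration):

    /* read src ahead of its use so the loads overlap the computation */
    #define PREFETCH_DISTANCE 16

    void scale(double *dst, const double *src, long n, double f)
    {
        for (long i = 0; i < n; ++i) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&src[i + PREFETCH_DISTANCE], 0, 0);
            dst[i] = src[i] * f;
        }
    }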

