Introduction
Because of its comparatively high latency, the performance balance of Rambus DRAM does not seem to favor current-generation uniprocessor desktop PC platforms running traditional applications. But we should not assume that all CPUs, platforms, and applications are the same, or that they will remain the same in the future. Beyond the mainstream PC platform, Rambus could provide a performance advantage for multiprocessor servers, IA64 systems, graphics memory applications, game consoles, and so on. And depending on new CPU architectures and the evolution of application software, Rambus could become more interesting in the mainstream as well.
Under what circumstances could bandwidth be more important than latency?
There are many factors that will allow future systems to make more effective use of high-bandwidth DRAM and help compensate for latency problems. These include higher CPU bus speeds, deeper pipelining, speculative execution, explicit parallelism, etc. As next-generation platform and CPU architectures penetrate the mainstream, increased bandwidth will be utilized much more effectively.
Will Rambus be popular for PC graphics accelerators?
There is no evidence yet. Earlier versions of Rambus DRAM have been used with the medium-performance Cirrus Logic Laguna chip and with a commercially unsuccessful product from Chromatic. New Rambus designs will probably surface, but so far the overwhelming trend is toward larger, wider, faster SDRAM configurations combined with DRAM integration strategies.
Is Rambus suitable for server applications?
It could be. In the future, Intel will offer Rambus server platforms, and Compaq has announced plans for an Alpha server that uses Rambus (inherited from DEC). Rambus pipeline performance is a strong advantage for servers, but Rambus also introduces barriers pertaining to cost and maximum memory capacity. Current server platforms use very large configurations of inexpensive SDRAM. It will be challenging to support very large DRAM configurations with Rambus in the near term, and many expect Rambus DRAM to be costly. Until these and other issues are resolved, there will be a near-term preference for SDRAM on most server platforms.
Rambus and Future Plans
Rambus increases peak burst bandwidth to a level that is hard to take advantage of in 1998 and 1999 desktop PCs. At 100MHz and 133MHz, SDRAM can satisfy the CPU's bandwidth requirement. But in the year 2000, Intel may drive the CPU bus speed to 200MHz, skipping 150 and 166MHz. DDR SDRAM can match the CPU's bandwidth at this speed, but ordinary SDRAM probably could not. Obviously, such a CPU bus speed strategy would favor Rambus.
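As a sanity check on these numbers, peak burst bandwidth is just the bus clock times the bus width (8 bytes for a 64-bit bus), doubled for DDR's two transfers per clock. A minimal sketch of the arithmetic, assuming a 64-bit data path:

```c
#include <stdio.h>

/* Peak burst bandwidth in MB/s for a 64-bit (8-byte) memory bus.
 * transfers_per_clock is 1 for SDR, 2 for DDR. */
static double peak_mb_s(double bus_mhz, int transfers_per_clock)
{
    return bus_mhz * 8.0 * transfers_per_clock;
}

int main(void)
{
    printf("100MHz SDRAM: %6.0f MB/s\n", peak_mb_s(100, 1)); /*  800 */
    printf("133MHz SDRAM: %6.0f MB/s\n", peak_mb_s(133, 1)); /* 1064 */
    printf("200MHz bus  : %6.0f MB/s\n", peak_mb_s(200, 1)); /* 1600 */
    printf("100MHz DDR  : %6.0f MB/s\n", peak_mb_s(100, 2)); /* 1600: matches a 200MHz CPU bus */
    return 0;
}
```

The 1.6GB/s figure for a 200MHz, 64-bit CPU bus is the same number usually quoted as the Direct Rambus peak, which is why this bus speed strategy lines up so neatly with Rambus.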
Beyond just CPU bus speed, there are other factors that will favor high bandwidth on other platforms. One example is multiprocessor servers. The Pentium II bus supports up to four CPUs, so in order to implement an eight-CPU multiprocessor server, the chip set must support two CPU buses. Even at 100MHz, these two buses can combine to demand the equivalent of a single 200MHz bus (2 x 800MB/s = 1.6GB/s). Rambus seems well suited to this requirement, but there are barriers relating to power dissipation and maximum system DRAM capacity (described in more detail below).
Also, in the evolution of Intel's product line, microprocessor architectures are being optimized to better tolerate increasing latencies. In the P6 generation, speculative execution helps to offset latency problems, and explicit data prefetch instructions introduced with Katmai will pre-load the cache, effectively increasing its hit rate. Jumping ahead to Merced, explicit parallelism can reduce or eliminate mispredicted branch instructions in the CPU core. This also has the effect of isolating the CPU from DRAM latency, but it will require a shift to new IA64 VLIW-style application code. This transition will not happen rapidly in the mainstream desktop market.
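To make the prefetch idea concrete, here is a hedged sketch in C. The __builtin_prefetch intrinsic is a GCC/Clang convenience rather than Katmai's own instruction syntax, but it compiles down to the same kind of cache pre-load hint described above; the 16-element prefetch distance is an illustrative guess, not a tuned value.

```c
#include <stddef.h>

/* Sum an array while hinting the cache to pre-load data several
 * iterations ahead, hiding part of the DRAM latency behind useful
 * work.  __builtin_prefetch maps to prefetch hint instructions of
 * the kind Katmai introduces (e.g. PREFETCHNTA). */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 0 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}
```

The CPU issues the hint long before the data is needed, so the memory access overlaps with the arithmetic instead of stalling it.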
On the other hand, some types of mainstream X86 applications are not very sensitive to DRAM latency. One good example is highly repetitive floating point code using data sets that are small enough to fit in the L1 or L2 cache. Some games and 3D benchmarks fit into this category. However, as games evolve further they will use more sophisticated game logic, environmental modeling, larger floating point data sets that exceed the cache size, and more demanding multimedia functions that run concurrently. These factors will combine to demand lower latency as well as higher bandwidth.
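The working-set effect described above can be demonstrated with a toy loop: the same amount of floating point work runs far faster when its data fits in cache. The buffer sizes below are stand-ins chosen to straddle a period-typical L2 cache, not measurements from any particular system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a buffer of n doubles, 'passes' times, doing the same kind of
 * floating point work per element.  When the buffer fits in cache the
 * loop runs near core speed; when it exceeds the cache, every pass
 * pays DRAM latency and bandwidth costs. */
static double walk(double *buf, size_t n, int passes)
{
    double sum = 0.0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i] * 1.0001;
    return sum;
}

int main(void)
{
    /* 128KB fits in a typical L2; 16MB does not.  Pass counts are
     * chosen so both runs do the same total number of operations. */
    size_t small_n = (128 * 1024) / sizeof(double);
    size_t big_n   = (16 * 1024 * 1024) / sizeof(double);
    double *big = calloc(big_n, sizeof(double));
    if (!big) return 1;

    clock_t t0 = clock();
    double s1 = walk(big, small_n, 1024);   /* cache-resident */
    clock_t t1 = clock();
    double s2 = walk(big, big_n, 8);        /* DRAM-bound, same total work */
    clock_t t2 = clock();

    printf("in-cache: %.2fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("in-DRAM:  %.2fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("checksums: %g %g\n", s1, s2);   /* keep the work from being optimized away */
    free(big);
    return 0;
}
```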
Is Pipelining Important?
Rambus looks very good under deeply pipelined conditions. Deep pipelining can allow Rambus to achieve and sustain an extremely high rate of bus utilization. For instance, if the transactions are deeply pipelined, Rambus can sustain transfer rates of up to 1.5GB/s (roughly 94% of its theoretical maximum peak burst rate of 1.6GB/s). This is truly an amazing feat for a DRAM and for the system as a whole. SDRAM cannot achieve this level of saturation, nor can a single well-cached CPU under most circumstances.
The diagram below shows how pipelining works on the P6 bus; Direct Rambus operates in a similar manner. By allowing different groups of bus signals to operate independently, new transactions can begin while previous transactions are still in progress. With the P6 bus in multiprocessor mode, transactions from processors A, B, and C can be pipelined on the CPU bus in a very tight sequence.
When transactions are pipelined this tightly, the data bus will burst out data continuously on every clock edge. If you look only at the bottom line (Data), after the latency period for transaction A has passed, all other latencies are hidden. But this does not mean "zero latency". In fact, every request still experiences a rather long latency of at least 9 CPU bus clocks (90ns at 100MHz) from the time it appears on the bus until it is resolved.
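A toy timing model makes the distinction concrete: pipelining keeps the data bus almost continuously busy, yet every individual request still waits out the full access latency. The 9-clock latency and 4-clock burst below are illustrative values patterned on the P6-style timing discussed here, not exact part specifications.

```c
#include <stdio.h>

int main(void)
{
    const int LATENCY = 9;  /* clocks from request on the bus to first data    */
    const int BURST   = 4;  /* clocks of data per transaction (e.g. 4 x 8 bytes) */
    const int N       = 8;  /* transactions A, B, C, ...                        */

    /* Pipelined: a new request can start every BURST clocks, so data
     * bursts land back to back once the first latency has elapsed. */
    int pipelined_end = LATENCY + N * BURST;

    /* Serialized: each request waits for the previous one to finish. */
    int serial_end = N * (LATENCY + BURST);

    printf("pipelined: %3d clocks, data bus busy %2.0f%%\n",
           pipelined_end, 100.0 * N * BURST / pipelined_end);
    printf("serial:    %3d clocks, data bus busy %2.0f%%\n",
           serial_end, 100.0 * N * BURST / serial_end);
    printf("but every request still waits >= %d clocks\n", LATENCY);
    return 0;
}
```

With these illustrative numbers, pipelining lifts data bus utilization from about 31% to about 78%, while the per-request latency of 9 clocks never changes.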
In order to achieve this elusive state of 100% bus utilization, there will usually be a deep queue of transactions buffered inside the CPU (or CPUs). These internally buffered transactions experience an even higher effective latency due to the additional waiting period inside the CPU. Thus, high bus saturation (via deep pipelining) is an indication of wasted MIPS. This may be perfectly acceptable in servers that have plenty of MIPS to waste, but uniprocessor systems are different.
Uniprocessor vs. Multiprocessor Requirements
Uniprocessor standard-architecture systems with well-cached CPUs do not usually saturate the bus and do not use pipeline mode as frequently. For this reason, the pipelining capabilities of the P6 bus and of Rambus do not have a very significant performance impact there.
In a P2/450 uniprocessor PC, DRAM bus saturation typically hovers between 1% and 10%. By adding more processors, bus saturation can easily be driven beyond the bandwidth capacity of 100MHz SDRAM. To solve this problem, multiprocessor platforms may use multiple 64-bit buses, or increase the bus width to 128 or 256 bits. These configurations also allow servers to satisfy another important requirement: MEMORY CAPACITY.
Large servers must usually ship with half a gigabyte of DRAM and be configurable up to several gigabytes. This can be accomplished today using wide SDRAM configurations with registered DIMMs. But Rambus presents another problem in this area. Each Rambus interface channel allows only 32 memory ICs. Using 64Mbit DRAMs, the maximum configuration for a single channel is 256MB. Even in late 1999 or early 2000, when 256Mbit components become available, capacity will still be constrained to 1GB per channel. In order to meet platform requirements, multiple Rambus channels will be required, or repeater chips must be used to extend the capacity (and increase the latency) of each channel.
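The capacity ceiling follows directly from the 32-device-per-channel limit; a quick sketch of the arithmetic:

```c
#include <stdio.h>

/* Maximum DRAM per Direct Rambus channel: 32 devices per channel,
 * each device of the given density in megabits. */
static int channel_mb(int device_mbit)
{
    return 32 * device_mbit / 8;   /* megabits -> megabytes */
}

int main(void)
{
    printf("64Mbit devices:  %4d MB per channel\n", channel_mb(64));   /*  256 MB */
    printf("256Mbit devices: %4d MB per channel\n", channel_mb(256));  /* 1024 MB */
    printf("a 2GB server needs %d channels of 256Mbit parts\n",
           2048 / channel_mb(256));                                    /* 2 */
    return 0;
}
```

So a 2GB server would need at least two fully populated channels even with 256Mbit parts, before accounting for the repeater or multi-channel overhead mentioned above.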
There is also a cost barrier that must be addressed in server platforms. With such large configurations, DRAM makes up a very large percentage of the non-CPU cost of a system, which significantly increases the pain associated with DRAM cost premiums. Based on current die size estimates, it is reasonable to expect a Rambus price premium of 25-35% over SDRAM. This figure will probably shrink over time, particularly after the transition to 256Mbit technology.
Where Does Rambus Fit in the Near Term?
The easiest way to evaluate this question is to first examine where Rambus does not seem to be preferred. It does not fit in the low end due to cost. It is not appropriate for notebooks because of power requirements. It does not fit into the midrange because of compromised price/performance. It may be a good solution for server platforms after the year 2000. And the near-term trends in graphics memory do not lean toward Rambus.
The last PC market segment opportunity for Rambus is the high end. These uniprocessor systems, using the newest CPUs and the largest caches, will have comparatively low external bandwidth requirements. Even the cached Celeron 333 could demand more external bandwidth than a 500MHz Katmai. Regardless, this market segment is Intel's key leverage point for the deployment of Rambus, and Intel's most important lever will be the Camino chip set.
In the high-end market, Direct Rambus will have to survive mostly on hype until the CPU bus clock exceeds 133MHz. Outside of the mainstream, there will be some demand from specialty markets such as 8- or 16-way multiprocessor servers and possibly some high-end graphics applications. After the year 2000, Merced workstations will probably fall into the Rambus camp as well, but volumes will not reach mainstream proportions until perhaps 2004 or so. Regardless of these barriers, it appears that Intel will do all it can to ensure that Rambus penetrates the market as widely and as quickly as possible. This will be interesting to observe.