Introduction
There is no denying it; the last twelve months have been anything but enjoyable for Intel, the mighty chipmaker based in Santa Clara, California. While the chip giant is still making billions and thus has no serious reason to complain about small profits or even losses, its reputation has suffered rather badly. What started with a buggy and therefore delayed ‘Camino’ (i820) chipset for the brand new ‘Coppermine’ processor in Fall 1999 continued with the ‘MTH-recall’, the Timna-drop, the bad press about Intel’s peculiar relationship to Rambus Inc. and finally the recall of the Pentium III 1.13 GHz processor due to instability issues.
At the same time Intel’s archenemy AMD became stronger and more successful than ever. In terms of processor performance the Sunnyvale-based chipmaker and its Athlon processor did not only manage to catch up with the once so untouchable Intel; it slowly overtook Intel and forced it into the sad role of a pursuer that just isn’t able to keep up with its opponent. Lately, AMD managed to get so far ahead of Intel in terms of processor performance that even the most fanatic Intel-followers ran out of reasonable arguments against AMD’s products.
Finally, Intel is more determined than ever to take back what once was thought to be in its possession until eternity. The brand new Pentium 4 processor shall make Intel the provider of the fastest, most advanced and thus simply best microprocessor in the world. Intel wants to get back to the top and leave AMD far behind. The mysterious Pentium 4, codename ‘Willamette’, is supposed to be powerful enough to ensure that Intel reaches this juicy goal. Let’s see if this new processor is indeed good enough.
The Architecture of Pentium 4
Intel’s declining image and the success of its opponent AMD forced the chip giant to act differently than ever before. While it used to be very difficult over the past 10 years to find out any details of upcoming Intel processors ahead of their release, Intel was giving major amounts of architectural information about Pentium 4 to whoever was asking for it. Thus I am sure that most of you have already heard loads about Pentium 4’s funky ‘NetBurst-Architecture’, the ‘Rapid Execution Engine’, the ‘Hyper Pipeline’, ‘SSE2’ and even the glorious ‘Execution Trace Cache’. However, following a long tradition, I will still dedicate a major part of this article to a detailed explanation of what is really behind Pentium 4 and all its fancy new features. It’s the best way to understand the benchmark results that follow and it should help you decide if Pentium 4 is indeed a product for you.
An Overview
I would like to start with the block diagrams of Pentium 4, Pentium III and AMD’s latest Athlon processor. I spent considerable time with PowerPoint to create those diagrams, so please don’t just disregard them. Even if they might look scary at this stage, I promise to explain them to you in the following text.
This is my personal P4-diagram, which became necessary because Intel wasn’t able to supply one that was good enough. It follows the traditional top-to-bottom flowchart idea and should include all the important units that influence Pentium 4’s performance. Here’s a little glossary:
- BTB = ‘Branch Target Buffer’. This table holds the target addresses of branches, i.e. the addresses to which a branch will or could jump. Athlon is also using a ‘BHT’ = ‘Branch History Table’, which records whether previous branches were taken or not. A software program uses branches to make decisions: the program asks a question and, depending on the answer, the branch is taken or not.
- µOP = ‘Micro-Operation/Operand’. This is the name that Intel gives to instructions that can be directly understood by the execution units of the microprocessor. AMD calls them ‘MacroOPs’, because they are a bit more advanced and can contain more information than Intel’s µOPs. Both ‘OPs’ have one important thing in common: they represent very simple instructions that can be quickly carried out by the processor. Unlike x86-instructions, those ‘OPs’ are of a defined size and can thus easily be fed into the execution pipeline. The decoder translates an x86-instruction into one or more ‘OPs’, unless the x86-instruction is so complex (and rare) that the ‘Micro Instruction Sequencer’ has to produce a sometimes rather longish sequence of ‘OPs’, using the ‘Micro Code ROM’ found in any modern superscalar microprocessor. On average, most x86-instructions get decoded to about two ‘OPs’. Some extremely simple instructions such as ‘AND’, ‘OR’, ‘XOR’ or ‘ADD’ often produce only one ‘OP’, while a ‘DIV’ or ‘MUL’, or an indirectly addressed operand, will produce more. Complex instructions such as trigonometric commands can easily produce up to hundreds of ‘OPs’, coming out of the ‘Micro Instruction Sequencer’.
- ALU – Arithmetic Logic Unit. This is the name of what we call the ‘Integer’-unit. Arithmetic operations like adding, multiplying and dividing as well as logic operations such as ‘OR’, ‘AND’, ‘ASL’, ‘ROL’, … are carried out by the ‘ALUs’. Those operations represent the vast majority of program code in most software programs.
- AGU – Address Generation Unit. This unit is just as important as the ‘ALU’, because it is responsible for computing the correct address from which data is loaded or to which it is stored. Absolute addressing is used only rarely in programs. As soon as you’ve got arrays of data, the program code uses indirect addressing, keeping the ‘AGUs’ busy.
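To make the AGU’s job a bit more tangible, here is a minimal sketch of the standard x86 address calculation (base + index × scale + displacement) that an AGU has to perform for every indirect memory access. The function name and the sample values are my own illustration and say nothing about how Pentium 4 implements this in silicon.

```python
def effective_address(base, index=0, scale=1, displacement=0):
    """Compute a 32-bit x86 effective address: base + index*scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only allows scale factors of 1, 2, 4 or 8")
    # Addresses wrap around at 32 bits on IA-32.
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Hypothetical example: element 3 of an array of 4-byte values starting at 0x1000.
array_base = 0x1000
print(hex(effective_address(array_base, index=3, scale=4)))
```

Every pass through a loop over such an array needs one of these calculations per load or store, which is why array-heavy code keeps the AGUs busy.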
An Overview, Continued
This is the diagram of the good old Pentium III. You can see that it is a lot less complex than the Pentium 4 diagram. Still, you might spot the few advantages that P3 has over P4. Don’t worry if you don’t. I will point them out further down in the text.
Finally here is Athlon. I omitted the micro code ROM and the micro instruction sequencer because I didn’t find the time to paint them in there. Please be aware of the fact that Athlon is depending on this unit as well though.
What’s Behind NetBurst?
Intel calls the new architecture of Pentium 4 ‘NetBurst’. The idea behind this name is for me just as unfathomable as the ‘Internet SIMD Streaming Extension’, as Intel liked to call Pentium 3’s ‘SSE’ or ‘ISSE’. Believe me, your web pages won’t pop up any faster, downloads will take just as long and the Internet won’t ‘burst’ either. However, Intel is trying its hardest to be trendy and since the Internet is still hip, it is a perfect vehicle to market Pentium 4. The name ‘NetBurst’ could also be a hint towards Pentium 4’s performance characteristics. Professional PC-users might not care quite as much for an ‘Internet-accelerating’ processor, but more for a product that makes them get their work done as fast as possible. Looking at the benchmark results further down in this article shows that Pentium 4 shines a lot more at recreational software than in professional applications.
Another big issue with Pentium 4’s ‘NetBurst-Architecture’ is its obvious focus on delivering the highest clock rates. Again ‘NetBurst’ shows its roots in Intel’s marketing department. The avid Tom’s Hardware Guide reader will be aware of it, but the average computer user still hasn’t grasped the fact that clock rate does NOT automatically translate into performance when looking at different processor designs. This is another issue targeted by Pentium 4 and ‘NetBurst’. Intel wants clock rate at almost any cost and the Pentium 4 design is perfect to deliver exactly that. Average Joe is supposed to read those high Giga-Hertz numbers and conclude that this alone is already good enough to make Pentium 4 the fastest processor in the universe. Marketing – that’s what the name ‘NetBurst’ seems to be all about.
NetBurst includes the following goodies that have been implemented in the new Pentium 4 design:
- Faster System Bus
- Advanced Transfer Cache (already known from Pentium III)
- Advanced Dynamic Execution (Execution Trace Cache, Enhanced Branch Prediction)
- Hyper Pipelined Technology
- Rapid Execution Engine
- Enhanced Floating Point and Multi-Media (SSE2)
Let’s look at each of those features in more detail, acting like code/data that is being processed by Pentium 4.
The New Processor Bus
The first new feature seen by code or data as it enters Pentium 4 is the new system bus. The well-known ‘FSB’ of Pentium 3 is clocked at 133 MHz and able to transfer 64 bits of data per clock, offering a data bandwidth of 8 byte * 133 million/s = 1,066 MB/s. Pentium 4’s system bus is only clocked at 100 MHz and also 64 bits wide, but it is ‘quad-pumped’, using the same principle as AGP4x. Thus it can transfer 8 byte * 100 million/s * 4 = 3,200 MB/s. This is obviously a tremendous improvement that even leaves AMD’s recently ‘upgraded’ EV6-bus quite far behind. The bus of the most recent Athlons is clocked at 133 MHz, 64 bits wide and ‘double-pumped’, offering 8 byte * 133 million/s * 2 = 2,133 MB/s.
The new bus of Pentium 4 enables it to exchange data with the rest of the system faster than any other x86-processor, thus removing one important bottleneck that Pentium 3 was suffering from. However, the fastest processor bus doesn’t help much unless the system’s main memory can deliver data at a matching pace. Intel’s new 850 chipset for Pentium 4, which is currently the only chipset for this new CPU, is using two Rambus channels and therefore the expensive and unpopular RDRAM. However, these two RDRAM channels are able to deliver the same data bandwidth as Pentium 4’s new bus (3,200 MB/s), making them a perfect match at least on paper. This combination enables Pentium 4-systems to have the highest data transfer rates between processor, system and main memory, which is a clear benefit. At the same time system cost is impacted by the high price of RDRAM plus the fact that a Pentium 4-system always requires two or even four RDRAM-RIMMs of the same size and spec. One, three or mixed RIMMs are not an option.
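The bus arithmetic above is easy to reproduce. The following is a small sketch that recomputes the three peak bandwidth figures; the function name is made up for this example, and it assumes that the nominal ‘133 MHz’ is really 133.33 MHz (400/3), which is why the article’s figures come out as 1,066 and 2,133 MB/s when truncated.

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock):
    """Peak bus bandwidth in MB/s, with 1 MB = 10**6 bytes."""
    bytes_per_transfer = width_bits / 8
    return clock_mhz * bytes_per_transfer * transfers_per_clock


# (clock in MHz, bus width in bits, transfers per clock)
# Assumption: "133 MHz" is the nominal name for 133.33 MHz.
BUSES = {
    "Pentium III FSB": (400 / 3, 64, 1),
    "Pentium 4 quad-pumped bus": (100.0, 64, 4),
    "Athlon EV6 bus, double-pumped": (400 / 3, 64, 2),
}

for name, (clock, width, pumping) in BUSES.items():
    print(f"{name}: {bus_bandwidth_mb_s(clock, width, pumping):.1f} MB/s")
```

These are theoretical peak numbers only; what a system sustains in practice depends on the memory behind the bus.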
Advanced Transfer Cache
The next thing that (most of) the data has to pass is Pentium 4’s on-die L2-cache. Intel has called it ‘Advanced Transfer Cache’ since the days of Pentium III ‘Coppermine’. With 256 KB its size is identical to the L2-cache of Pentium III and both are 8-way associative as well. This is however where the similarities end. Pentium 4’s L2-cache is using 128-byte cache lines, which are divided into two 64-byte pieces. When it fetches data from the system (main memory, AGP, PCI, …) it reads at least 64 bytes in one go, which ensures great performance for burst transfers, especially when talking to RDRAM, but is rather bad if only one byte out of those 64 is actually required. The same obviously applies to write operations in case the cache line has become ‘dirty’, meaning that the cache data has been altered and therefore needs to be written back to the system (memory, AGP, PCI, …). The read latency of Pentium 4’s L2-cache is 7 clocks, its connection to the core is 256 bits wide and obviously clocked at core clock. After doing the math we get to an impressive data bandwidth between L2-cache and core of 44.8 GB/s for Pentium 4 at 1.4 GHz and 48 GB/s for Pentium 4 at 1.5 GHz.
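For those who want to check ‘the math’: a 256-bit path moves 32 bytes per core clock. Here is a short sketch of that calculation, plus the 7-clock read latency converted to nanoseconds. The function names are mine, and 1 GB is taken as 10^9 bytes.

```python
def l2_bandwidth_gb_s(core_clock_ghz, path_width_bits=256):
    """Peak L2-to-core bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return core_clock_ghz * path_width_bits / 8


def latency_ns(clocks, core_clock_ghz):
    """Convert a latency given in core clocks into nanoseconds."""
    return clocks / core_clock_ghz


for clock_ghz in (1.4, 1.5):
    bandwidth = l2_bandwidth_gb_s(clock_ghz)
    latency = latency_ns(7, clock_ghz)
    print(f"{clock_ghz} GHz: {bandwidth:.1f} GB/s, L2 read latency {latency:.2f} ns")
```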
Pentium 4’s L1 Cache
After the discussion of the L2 cache it would only be logical to move over to the L1 cache. This is what we will do, but not without a special remark. While Pentium III is equipped with a 16 KB L1 cache for instructions and a 16 KB L1 cache for data, there is only a small 8 KB L1 data cache in Pentium 4, while a pretty nifty feature called ‘Execution Trace Cache’, which I’ll discuss in the next paragraph, replaces the L1 instruction cache of Pentium III.
Intel was probably forced to reduce the size of the L1 data cache down to only 8 KB, which is half the size of Pentium III’s L1 data cache and only an eighth (!!!) of Athlon’s, to enable its extremely low latency of only 2 clock cycles. It results in an overall read latency of less than half of Pentium III’s L1 data cache already in the Pentium 4 at 1.4 GHz, but the small size of Pentium 4’s L1 data cache may be one reason for the performance flaws we will see when we get to the benchmark results.
The L1 data cache of Pentium 4 is 4-way set associative and uses 64-byte cache-lines. The dual-port architecture allows one load and one store operation per clock.
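From those three numbers (8 KB, 4-way, 64-byte lines) you can work out the cache geometry: 8192 / (4 × 64) = 32 sets. The sketch below shows how an address would be split into tag, set index and byte offset, assuming the conventional scheme of indexing with the low-order address bits. That indexing scheme is an assumption on my part; Intel has not spelled out the details.

```python
CACHE_BYTES = 8 * 1024   # 8 KB L1 data cache
WAYS = 4                 # 4-way set associative
LINE_BYTES = 64          # 64-byte cache lines

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # bits selecting a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # bits selecting the set


def split_address(addr):
    """Split an address into (tag, set_index, byte_offset)."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
print(split_address(0x12345))
```

Under this assumption, addresses that are a multiple of 2 KB apart land in the same set, and only four of them fit at the same time. That illustrates why such a small cache can run out of room quickly with large data sets.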
Hardware Prefetch
Intel has added another nifty feature that I want to bring to your attention in the L1/L2 cache context. If you think of the Pentium III launch in February 1999, you might remember Intel’s introduction of the ‘streaming’ SIMD Extensions. The ‘streaming’ bit of ‘SSE’ is actually represented by the prefetch-instructions of Pentium III, which enable software to load data into the caches before it is requested by the processor core.
Those instructions still exist in Pentium 4’s instruction set, but with the new hardware prefetch feature of Pentium 4 a lot of this is done automatically. This new unit is able to recognize data access patterns of the software executed by Pentium 4, so that it ‘guesses’ which data will be needed next and ‘pre-fetches’ it into the cache.
The procedure might sound familiar to you from the complex hard drive cache algorithms and you might also be aware how much this can speed up hard disk accesses under certain circumstances. Pentium 4’s hardware prefetch is probably able to significantly accelerate the execution of software that is using a lot of large data arrays.
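Intel has not disclosed how the hardware prefetcher recognizes access patterns, so the following is only a toy model of one common idea, a stride detector: if two consecutive accesses are the same distance apart, guess that the next one will be too. The class and function names are hypothetical.

```python
class StridePrefetcher:
    """Toy stride detector, not Intel's actual algorithm."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the predicted next address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only predict once the same non-zero stride was seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def count_covered(addresses):
    """Count accesses that the prefetcher had predicted one step earlier."""
    prefetcher = StridePrefetcher()
    predicted = None
    covered = 0
    for addr in addresses:
        if addr == predicted:
            covered += 1
        predicted = prefetcher.access(addr)
    return covered


array_walk = [i * 64 for i in range(10)]   # regular walk through an array
scattered = [0, 4096, 64, 9000, 128]       # no recognizable pattern
print(count_covered(array_walk), count_covered(scattered))
```

A regular walk through an array is predicted almost completely after a short warm-up, while scattered accesses give the detector nothing to work with. That matches the expectation that software with large data arrays benefits most.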
Entering The Execution Pipeline – Pentium 4’s Trace Cache
Our code has now passed the system bus, L1 and L2-cache, so that it’s finally time to enter the execution path of Pentium 4. You remember that Pentium 4 is not using an L1 instruction cache, but a much niftier thing instead. Let me first explain what is bad about an L1 instruction cache.
With Pentium III or Athlon, which both have an L1 instruction cache, code is fetched by this cache and stored until it’s time to enter the execution path. This is done by code entering the decoder unit, which e.g. in case of Athlon consists of 3 ‘direct path’ and 3 ‘vector path’ decoders, which alternately produce the ‘OPs’ (as explained above) that can get executed by the execution units of the processor. This situation has a few glitches. First of all, some x86-instructions are rather complex, taking a lot of time to be decoded by the slow or ‘vector path’ decoders. In the worst case all decoder units are busy decoding complex instructions, thus stalling the execution pipeline of the processor. Another problem is the fact that x86-instructions that are supposed to be executed repeatedly (e.g. in small loops) need to be decoded each time they enter the execution path, thus wasting a lot of time. Software branches are another wasteful situation for a processor with an L1 instruction cache that starts its pipeline at the decoder level.
Pentium 4’s fancy Execution Trace Cache does not suffer from the above-described problems. Once you have understood it, the idea of the trace cache is actually rather simple, but it takes quite a bit more silicon resources and design skill to replace the good old L1 instruction cache with something like Pentium 4’s trace cache. Basically, the ‘Execution Trace Cache’ is nothing but an L1 instruction cache that lies BEHIND the decoders. Obviously it’s quite a bit more complex, but once you have understood this basic fact you start to realize the benefits of the trace cache.
Entering The Execution Pipeline – Pentium 4’s Trace Cache, Continued
As already mentioned in my description of the term ‘µOP’, those simple instructions are the language understood by the execution units. They are of a defined size and thus easier to sequence than x86-instructions, which are of variable length. Once the µOPs are in the trace cache, Pentium 4 saves the time to re-decode repeating instructions. It can also check more easily for the dependencies required for the branch prediction process. The trace cache ensures that the processor pipeline is continuously fed with instructions, decoupling the execution path from a possible stall-threat of the decoder units. This is particularly important in case of the high clock rate design of Pentium 4. The execution trace cache supplies the next pipeline stage with 6 µOPs every 2 clocks and thus 3 µOPs per clock, which is about as fast as what AMD’s Athlon is able to do under ideal conditions.
Now there’s quite a bit more to know about ‘µOPs’, decoders and the trace cache. First of all, those ‘µOPs’ are not exactly small. In fact they are considerably larger than an x86-instruction although they contain less information (most x86-instructions are represented by more than one µOP). The µOPs of Pentium III are known to be as large as 118 bits. Intel never reported the physical size of the execution trace cache, nor the size of Pentium 4’s µOPs. We only know that the trace cache is supposed to contain about 12,000 µOPs. Looking at a die picture that I took myself of the Pentium 4 die chips supplied by Intel to the press at Comdex and comparing the trace cache area with the L2-cache area, it looks as if the trace cache is about 92-96 KB in size. It therefore seems to be a good guess to estimate the size of a Pentium 4 µOP in the neighborhood of 64 bits.
96 KB is quite a considerable size and 6 times as large as Pentium III’s 16 KB L1 instruction cache. However, Intel hasn’t been wasteful with the space offered by Pentium 4’s execution trace cache. Due to the fact that the trace cache stores decoded x86-instructions, Pentium 4 knows what each stored entry actually does and represents. The decoder units that feed the trace cache ensure that only those µOPs are stored in the trace cache that will actually be executed.
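The ‘decode once, execute many times’ benefit can be shown with a deliberately crude model. The sketch below counts how often the decoder is needed for a small loop with and without a cache of already-decoded µOPs. The µOP counts per instruction are invented for illustration, and a real trace cache stores whole traces along the predicted path rather than single instructions.

```python
# Hypothetical µOP counts per instruction, for illustration only.
UOPS_PER_INSTRUCTION = {"mov": 1, "add": 1, "cmp": 1, "jnz": 1, "mul": 2}


def run_loop(body, iterations, use_trace_cache):
    """Return (decoder_invocations, uops_issued) for a loop body run repeatedly."""
    trace_cache = {}
    decodes = 0
    uops_issued = 0
    for _ in range(iterations):
        for addr, mnemonic in enumerate(body):
            if use_trace_cache and addr in trace_cache:
                uops = trace_cache[addr]   # already decoded, decoder stays idle
            else:
                decodes += 1               # the decoder has to do the work
                count = UOPS_PER_INSTRUCTION[mnemonic]
                uops = [f"{mnemonic}.{i}" for i in range(count)]
                if use_trace_cache:
                    trace_cache[addr] = uops
            uops_issued += len(uops)
    return decodes, uops_issued


loop_body = ["mov", "add", "mul", "cmp", "jnz"]
print(run_loop(loop_body, 100, use_trace_cache=False))
print(run_loop(loop_body, 100, use_trace_cache=True))
```

Both runs issue the same number of µOPs, but with the cache the decoder is only needed on the first pass through the loop.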
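The estimate is simple division, and you can turn it around as well. The sketch below computes the µOP size implied by the 92-96 KB range, and the cache size that exactly 64-bit µOPs would need. It only restates the guess above and assumes that the 12,000 µOPs fill the whole array with no tag or other overhead, which is certainly not quite true.

```python
UOP_COUNT = 12_000   # capacity quoted by Intel


def bits_per_uop(cache_kb, uop_count=UOP_COUNT):
    """µOP size in bits if uop_count µOPs fill cache_kb kilobytes exactly."""
    return cache_kb * 1024 * 8 / uop_count


def cache_kb_needed(uop_bits, uop_count=UOP_COUNT):
    """Cache size in KB needed to hold uop_count µOPs of uop_bits each."""
    return uop_count * uop_bits / 8 / 1024


for size_kb in (92, 96):
    print(f"{size_kb} KB -> {bits_per_uop(size_kb):.1f} bits per µOP")
print(f"64-bit µOPs -> {cache_kb_needed(64):.2f} KB")
```

64-bit µOPs would need 93.75 KB, which falls inside the range estimated from the die picture.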
Entering The Execution Pipeline – Pentium 4’s Trace Cache, Continued
The below example shows the actual code in the upper box and the actual content of the trace cache in the lower box. Unused code is not stored inside the trace cache.
From my description of ‘µOPs’ above you may remember the case when an x86-instruction is rather complex. Then the decoder requires the micro code ROM of the processor to produce a sometimes very long chain of µOPs. In this case the trace cache doesn’t get filled up with all of those µOPs. As a placeholder it only contains some kind of flag, which signals that the micro instruction sequencer is supposed to supply the µOPs to the next pipeline stage. It is not known how many µOPs per clock the micro instruction sequencer is able to deliver, but it would not be surprising if it is less than the 3 µOPs per clock that the trace cache can send to the next pipeline stage. This can obviously have an important performance impact on the Pentium 4 CPU, which has been tuned for simple instructions, but which seems to suffer from complex ones, as you will see further down as well.
As mentioned in short above, the trace cache can also be of significant benefit in case of a mispredicted branch. In this case the alternative code could already be found in the trace cache. To check if certain code already resides in the trace cache, it has a rather complex structure of tags, indices and cache lines.
The Trace Cache Branch Prediction Unit
Intel is very proud of the branch prediction unit that aids the execution trace cache. Its branch target buffer is 8 times as large as the one found in Pentium III and its new algorithm is supposed to be way better than AMD’s latest G-share algorithm used in Thunderbird and Spitfire. Intel claims that this unit can eliminate 33% of the mispredictions of Pentium III.
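For readers who wonder what a ‘G-share’ predictor is: the textbook version XORs the branch address with a register holding the outcomes of the most recent branches and uses the result to pick a 2-bit saturating counter. The sketch below implements that generic scheme. It is neither AMD’s nor Intel’s actual design, and the table size and history length are arbitrary choices of mine.

```python
class GsharePredictor:
    """Textbook gshare: (PC xor global history) indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # recent outcomes, 1 = taken
        self.counters = [1] * (1 << index_bits)   # start as 'weakly not taken'

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def misprediction_rate(outcomes, pc=0x400, index_bits=10):
    """Fraction of mispredicted outcomes for a single branch at address pc."""
    predictor = GsharePredictor(index_bits)
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses / len(outcomes)


# A loop branch that is taken 7 times and then falls through, repeated 200 times.
loop_pattern = ([True] * 7 + [False]) * 200
print(f"{misprediction_rate(loop_pattern):.3f}")
```

After a short warm-up the predictor learns the repeating loop pattern, including the fall-through at the end of each loop, which a simple per-branch counter would get wrong every time.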
Hyper Pipeline
One of the most well known features of the new Pentium 4 is its extremely long pipeline. While the pipeline of Pentium III has 10 stages and the one of Athlon 11, Pentium 4 has no less than 20 stages.
The reason for the longer pipeline is Intel’s wish for Pentium 4 to deliver the highest clock rates. The shorter each pipeline stage, the fewer transistors or ‘gates’ it needs and the faster it is able to run. However, there is also one big disadvantage to long pipelines. As soon as it turns out at the end of the pipeline that the software will branch to an address that was not predicted, the whole pipeline needs to be flushed and refilled. The longer the pipeline, the more ‘in-flight’ instructions will be lost and the longer it takes until the pipeline is filled again.
Intel is proud to announce that the Pentium 4 pipeline can keep up to 126 instructions ‘in-flight’, amongst them up to 48 load and 24 store operations. The improved trace cache branch prediction unit described above is supposed to ensure that flushes of this long pipeline are only rare occasions.
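How much does the longer pipeline cost, and how much does the better branch prediction win back? The following back-of-the-envelope sketch uses the numbers from this article (10 versus 20 stages, 33% fewer mispredictions) together with two purely illustrative assumptions of mine: that one in five instructions is a branch and that Pentium III mispredicts 10% of them. It also simplifies by charging one lost clock per pipeline stage for every flush.

```python
def stall_cycles_per_instruction(pipeline_stages, branch_fraction, mispredict_rate):
    """Average clocks lost per instruction to pipeline flushes (crude model)."""
    return branch_fraction * mispredict_rate * pipeline_stages


BRANCH_FRACTION = 0.20       # assumption: every fifth instruction is a branch
P3_MISPREDICT_RATE = 0.10    # assumption: illustrative value only
P4_MISPREDICT_RATE = P3_MISPREDICT_RATE * (1 - 0.33)   # Intel's '33% fewer' claim

p3_loss = stall_cycles_per_instruction(10, BRANCH_FRACTION, P3_MISPREDICT_RATE)
p4_loss = stall_cycles_per_instruction(20, BRANCH_FRACTION, P4_MISPREDICT_RATE)
print(f"Pentium III: {p3_loss:.3f} clocks lost per instruction")
print(f"Pentium 4:   {p4_loss:.3f} clocks lost per instruction")
```

In this simple model a predictor that removes a third of the mispredictions does not fully make up for a pipeline that is twice as long when you count in clocks. Whether the higher clock rate makes up for the rest is exactly what the benchmarks have to show.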
The stuff that happens in the trace cache, as mentioned above, only represents the first five stages of the pipeline of Pentium 4. What follows is:
- Allocate resources
- Register renaming
- Write into the µOP queue
- Write into the schedulers and compute dependencies
- Dispatch µOPs to their execution units
- Read register file (to ensure that the correct ones of the 128 general-purpose registers in the register file are used for the actual instruction)
After that comes the actual execution of the µOP, which I will discuss in more detail in the next paragraph. Of the above-mentioned previous stages the schedulers as well as the register file read are the most interesting. I have still decided against discussing them in detail to keep this article from becoming my next book.
The Rapid Execution Engine
The above picture is actually showing all execution units of Pentium 4, including the ‘Rapid Execution Engine’ as well as the ‘not-so-rapid’ execution units. While Intel is only talking about the four fast execution units, the other four are the actual units that are responsible for Pentium 4’s peculiar behavior in the benchmarks.
The basic parts of the ‘Rapid Execution Engine’ are the two ‘double-pumped’ ALUs and AGUs. Each of the four is said to be clocked at double the processor’s clock, because they can receive a µOP every half clock. Intel never disclosed if those units are indeed clocked at twice the processor clock or if each of those units in reality consists of two identical sub-units running at normal clock that merely receive the µOPs alternately every half clock. It doesn’t really matter which of the two is actually true, because the result is the same. Simple µOPs that can be processed by the Rapid Execution Engine are executed in half a clock, which is obviously a very good thing.
The story looks a lot different for the instructions that cannot be processed by the rapid execution units. Those instructions or µOPs need to use the one and only ‘Slow ALU’, which is not ‘double pumped’. The majority of the different instructions in the instruction set need to use this path, which obviously sounds scary. However, the majority of actual program code consists of the simplest ‘AND’, ‘OR’, ‘XOR’, ‘ADD’, … instructions, making Intel’s ‘Rapid Execution Engine’ design sensible though not particularly amazing.
Things look worse if you have a look at the red boxes, which represent the FPU-part of Pentium 4. Please take the time and compare this part to the Pentium III block diagram. You will see that Intel has actually castrated quite a bit of the SSE/MMX part of Pentium 4. Pentium III used to have two MMX and two SSE units, but Pentium 4 has only got one of each. Intel claims that additional units would not have improved the SSE/SSE2, MMX or FPU performance. However, our benchmark results speak a different language.
SSE2 – The New Double Precision Streaming SIMD Extensions
To conclude this epic piece about Pentium 4’s internal architecture I must not forget to mention SSE2. Its 144 new instructions finally enable everything that SSE was expected to be in the first place. The 128 bits of packed data, which under SSE could only take the form of four single-precision floating-point values, can now be operated on in all of the following formats:
- 4 single precision FP values (SSE)
- 2 double precision FP values (SSE2)
- 16 byte values (SSE2)
- 8 word values (SSE2)
- 4 double word values (SSE2)
- 2 quad word values (SSE2)
- 1 128-bit integer value (SSE2)
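To picture what ‘packed data’ means, the following sketch takes one and the same 16 bytes and interprets them in each of the formats listed above, using Python’s `struct` module. It also mimics a lane-wise 16-bit addition with wrap-around, which is what the PADDW instruction does on eight words at once. The helper names are mine, and this only models the data layout, not the hardware.

```python
import struct

# One 128-bit register's worth of data: 16 bytes with the values 0..15.
register = bytes(range(16))

LAYOUTS = {
    "4 x single-precision FP": "<4f",
    "2 x double-precision FP": "<2d",
    "16 x byte": "<16B",
    "8 x word": "<8H",
    "4 x double word": "<4I",
    "2 x quad word": "<2Q",
}


def view_as(raw, fmt):
    """Interpret 16 raw bytes as a tuple of little-endian lanes."""
    if struct.calcsize(fmt) != 16 or len(raw) != 16:
        raise ValueError("a packed 128-bit value is exactly 16 bytes")
    return struct.unpack(fmt, raw)


def packed_add_words(a, b):
    """Add eight 16-bit lanes independently, wrapping around on overflow."""
    lanes = [(x + y) & 0xFFFF for x, y in zip(view_as(a, "<8H"), view_as(b, "<8H"))]
    return struct.pack("<8H", *lanes)


for name, fmt in LAYOUTS.items():
    print(f"{name}: {len(view_as(register, fmt))} lanes")
print("1 x 128-bit integer:", hex(int.from_bytes(register, "little")))
```

The point is that one instruction operates on all lanes at the same time, so the narrower the data type, the more values get processed per instruction.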
The options are vast and the usefulness undoubted. Intel hopes that software developers will soon replace the old x87-FPU-instructions with the double-precision FP instructions of SSE2, so that Intel’s currently false claim that Pentium 4 has the most powerful FPU finally becomes reality. AMD is very impressed with SSE2 as well, which is why it announced to us only a few days ago that the upcoming Hammer-line of x86-64 processors will include SSE2 as well.
I personally have my doubts if SSE2 will be able to replace x87-instructions in scientific software. We should not forget that the original FPU is using 80-bit FP-values, not the less exact 64-bit FP-values offered by SSE2.
Die Size / Package / Socket
Here you see a die size comparison of Pentium 4, Pentium III, Athlon and Duron. The 217 mm² of Pentium 4 are more than double the size of Pentium III and almost double the size of Athlon. This seems rather surprising when you realize that Pentium 4 has got 42 million transistors and Athlon 37 million. Each of the four is manufactured in a 0.18 µm process. Bottom line: Pentium 4 is big!
Intel has reacted to the complaints about damaged flip chips of Pentium III and Celeron. The recent flip chip package of the two processors exposes the die to direct forces from outside. A badly mounted heat sink can easily damage the sensitive silicon and destroy the processor. Intel equipped Pentium 4 with a protective metal cover that saves the die from any damage.
Pentium 4’s package and socket are a bit larger than Pentium III’s, but the difference isn’t much. When you see a Pentium 4 on its own you could easily mistake it for being the same size as a Pentium III.
This is the new Socket423. As you never would have guessed, it comes with 423 pins, thus 53 pins more than Pentium 3. The majority of those new pins are required to supply Pentium 4 with the vast amount of power it requires.
The i850 Chipset
There’s not really that much to say about the one and only chipset available for Pentium 4 right now. Basically, i850 has a lot of similarities with i840E. Both chipsets are using a dual-Rambus channel memory architecture, both share Intel’s ‘hub architecture’, both are using the 82801BA aka ICH2 chip for I/O and PCI and both are pretty expensive.
The only difference is the bus to the processor, which is a normal 133 MHz bus for Pentium III in case of i840 and the quad-pumped 100 MHz bus for Pentium 4 in i850. To ensure this paragraph isn’t quite as boring to you, I included this funky picture out of one of Intel’s fancy presentations:
Power Requirements / New Power Supplies / Heat Sinks / Cases
Pentium 4 has got a rather large die, it runs at very high clock frequencies, it’s got a rather long ‘hyper-pipeline’ and a supply voltage of 1.7 V. What does that all come down to? Yes, Pentium 4 needs a lot of power and is able to produce a lot of heat. This requires a good power supply and heat sink solution and because Intel is Intel, these things were properly taken care of.
Although Pentium 4 doesn’t really need much more power than AMD’s latest Athlon, Intel decided to avoid the mess that happened to Athlon-owners who used underpowered power supplies in their systems, resulting in frequent system failures. Intel is not following the example of AMD’s basically ignored compatibility list, which is hardly worth the paper it’s not written on, as proven by e.g. Asus’ A7V motherboard, which wasn’t officially supported by AMD for a long time while AMD shipped review systems with exactly this board to the press. Intel is well-known for going ahead and establishing new industry standards, and as much as this may bug a lot of us who don’t own the hardware required by the new standard, it assures that systems which conform to these guidelines will actually work without any glitches.
Pentium 4 requires a new kind of power supply that ensures the delivery of 10-12 A from the 12V line of the power supply. This results in the need for additional connectors that can carry this current. Although we managed to run all of our Pentium 4 test motherboards with a normal power supply as well, we encourage every Pentium 4 owner to ensure he’s got one of the new power supplies that come with two additional connectors that need to be plugged into the motherboard. The two new power supplies available to us were from AOpen and Delta Electronics.
Power Requirements / New Power Supplies / Heat Sinks / Cases, Continued
Pentium 4 burns some 55 W and therefore it needs some good cooling devices. Athlon owners know about the difficulty of mounting reasonably sized heat sinks on their 1-1.2 GHz Athlon processors without destroying the die of the processor or the notches of the SocketA. Top-notch heat sinks are often made of copper and can easily weigh up to 500g (more than a pound). In a worst-case scenario the heat sink could rip off the socket or possibly even tear the motherboard apart. Once again Intel decided not to make the same mistake as its competitor with the green logo.
Pentium 4 heat sinks are supposed to be mounted through four holes around the Socket423 right to the computer housing, thus avoiding any possible damage to the socket or motherboard. This will require new housings in many cases.
Power Requirements / New Power Supplies / Heat Sinks / Cases, Continued
Intel even defined the place where the processor is supposed to be placed on the motherboard (right at the edge), but many motherboard makers preferred to go their own ways. Asus for example considered it to be technically better to have the processor more in the middle of the board. In return Asus ships those boards with a metal back plate, which makes the motherboard boxes surprisingly heavy. The heat sink is mounted through the board to this back plate, thus ensuring stability. Asus assures us that its P4T motherboard can still be mounted into any kind of computer case, but I’ve got my doubts if all of the el-cheapo cases will indeed be able to host it.
Gigabyte’s beautifully blue GA-8TX is placing the Socket423 where Intel wants it, but it ships the board with heat sink mounts that are fixed to the motherboard with some plastic nipples.
Future Pentium 4 processors that run at 1.6 GHz and more require their heat sinks to be grounded (connected to Vss) to avoid EMI. To ensure this, board makers have to place ground pads around the mounting holes of the heat sink, as you can see in this picture:
Motherboards
We were lucky to receive three different motherboards for this test. The one from Intel was the ‘D850GB’ aka ‘Garibaldi’:
Then Asus supplied us with its P4T motherboard that shipped with this funky metal plate, making the motherboard box three times as heavy as usual.
Finally Gigabyte also sent us their Pentium 4 platform called ‘GA-8TX’.
Each of the three platforms performed reliably. Only Asus supplied us with a brand new ‘performance’ BIOS which made the P4T our fastest P4-board. It also offered excellent overclocking features, which is why I decided to make the P4T our Pentium 4 test platform for this article.
Overclocking
Many of the overclockers of this world were afraid that Pentium 4’s quad-pumped 100 MHz bus would make bus overclocking of this processor as difficult and restrictive as with Athlon and its dual-pumped 133/100 MHz-bus. I can bring you the surprisingly positive news that Pentium 4 is as overclockable as Intel processors always have been. You can imagine that the multiplier of official Pentium 4 processors will be locked, but with a good P4-motherboard you won’t have any problems overclocking the bus.
I took advantage of the jumperless-mode of the Asus P4T-motherboard and managed to let two different Pentium 4 processors run at up to 125 MHz bus clock. I even included a 1.4 GHz Pentium 4 overclocked to 14 x 115 MHz = 1610 MHz as well as the evaluation 1.5 GHz Pentium 4 overclocked to 16 x 108 MHz = 1728 MHz in the benchmark results. I only had to raise the voltage from 1.7 to 1.8 V. There was no thermal issue, as Pentium 4 heat sinks are already designed for much higher heat dissipation than what current Pentium 4 processors are actually able to produce.
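The overclocking arithmetic is simply multiplier times bus clock, and since the bus is 64 bits wide and quad-pumped, its peak bandwidth rises along with the bus clock. The sketch below recomputes the two overclocked configurations mentioned above; the function names are made up for this example.

```python
def core_clock_mhz(multiplier, bus_mhz):
    """Core clock = locked multiplier x bus clock."""
    return multiplier * bus_mhz


def quad_pumped_bandwidth_mb_s(bus_mhz):
    """Peak bandwidth of a 64-bit (8-byte) quad-pumped bus in MB/s."""
    return bus_mhz * 8 * 4


# (label, multiplier, overclocked bus clock in MHz, rated core clock in MHz)
CONFIGS = [
    ("Pentium 4 1.4 GHz", 14, 115, 1400),
    ("Pentium 4 1.5 GHz evaluation sample", 16, 108, 1500),
]

for label, multiplier, bus, rated in CONFIGS:
    clock = core_clock_mhz(multiplier, bus)
    gain = (clock / rated - 1) * 100
    bandwidth = quad_pumped_bandwidth_mb_s(bus)
    print(f"{label}: {clock} MHz (+{gain:.1f}%), bus peak {bandwidth} MB/s")
```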
Benchmark Considerations
Due to time constraints we were only able to run the full benchmark suite under Windows 98, but I also added the Linux kernel compilation. We have already done a major part of the Windows 2000 benchmarks and will supply them shortly. Intel supplied a lot of special benchmarking software for Pentium 4, which we will evaluate, run and publish in the next few days.
After what we have learned in the architectural part of this article, we should expect Pentium 4 to show excellent performance in all benchmarks that are heavily integer based and the ones that take great advantage of the new high-speed bus between processor, system and memory. SSE2-optimized software should obviously run very fast on Pentium 4 as well. Although Intel claims that Pentium 4 has the world’s best floating point performance, we know that in reality the normal FPU of Pentium 4 is hardly even able to live up to Pentium III standards. Only floating point applications that use SSE2 could possibly support Intel’s bold claim. Today’s standard software is obviously not yet SSE2-optimized, so standard FPU-intensive software will probably run rather slowly on Pentium 4 systems.
Benchmark Setup
To enable Pentium 4’s SSE2 we installed DirectX 8 on all the test platforms. We were using our standard NVIDIA GeForce 2 GTS graphics card, but had to find out that the latest available and DX8-enabled driver rev. 7.17 is performing very poorly in 3D as well as 2D applications on all of the test systems. Therefore we decided to use the reliable 6.31 driver.
Hardware Setup | |
I850 Socket423 Pentium 4 Platform | ASUS P4T, BIOS |
Rambus Memory | 2 x 128 MB Samsung PC800 RDRAM RIMMs |
SDRAM Socket A platform for AMD Athlon and Duron Processors | ASUS A7V, BIOS 1004D final |
SDRAM Socket 370 platform for Intel Pentium III and Celeron processors | ASUS CUSL2, BIOS 1004.003 |
SDRAM Memory | 128 MB Wichmann Workx PC133 SDRAM CL2, setting 2-2-2-5/7 |
DDR Socket A platform for AMD Athlon processors at 133 MHz Front Side Bus | Gigabyte GA-7DX Rev.1.3, BIOS Rev. |
DDR Memory | 256 MB Micron CL2 |
Hard Drive for Windows 98 Tests | IBM DTLA-307030 ATA100 IDE, 30 GB, FAT32 |
Hard Drive for Linux Test | Seagate ST320430A ATA66 IDE, 19 GB, ext2 |
Graphics Card for Sysmark2000, Quake 3 Arena, Unreal Tournament and 3D Studio Max 2 | NVIDIA GeForce 2 GTS Reference Card Core Clock 200 MHz Memory Clock 333 MHz Driver 6.34 |
Graphics Card for SPECviewperf | NVIDIA Quadro2 Reference Card Core Clock 230 MHz Memory Clock 400 MHz Driver 6.31 |
Software Setup | |
Windows Version | Windows98SE, 4.10.2222A |
Windows Resolution for Sysmark2000 | 1024x768x16x85 |
Windows Resolution for SPECviewperf | 1280x1024x32x85 |
Linux Version | SuSE Linux 6.4, Kernel 2.2.14, THG benchmarking kernel, gcc 2.95.2 |
Quake 3 Arena | Retail Version Setting Normal, 640x480x16 bit color, no sound |
DirectX Version | 8.0 |
Unreal Tournament | Version 4.28 (patched) Setting 640x480x16, no sound |
SPECviewperf | Rev. 6.1.2 |
Memtime | Intel Memory Transfer Timing Utility |
MPEG4 Encoding Software | FlasK MPEG, ver. 0.594 DivX ;-) 3.11alpha |
Sysmark 2000 under Windows 98
This benchmark discussion starts with a big bang. Only a heavily overclocked Pentium 4 at 1728 MHz is able to surpass AMD’s ‘older’ Athlon with SDRAM and 100 MHz FSB. The ‘newer’ Athlon with DDR-SDRAM and 133 MHz FSB is simply untouchable in this benchmark right now. Even Pentium III is able to leave both official Pentium 4 models at 1.4 and 1.5 GHz behind!
Whatever it may be, Pentium 4 can’t handle current office software very well. One of the reasons may be its small 8 KB L1 data cache, but that’s certainly not all. In some way I’ve got to say that Intel has guts releasing a processor they call ‘the world’s best performing microprocessor’ although Intel is well aware of the fact that one of the major standard benchmarks shows this processor in a pretty bad light.
What does this result mean to us? Not much really. Office applications haven’t been a challenge to any of the processors that were released within the last 12 months. When you are using your office suite the computer is waiting for you rather than vice versa. From this point of view it shouldn’t really bother us that Pentium 4 doesn’t perform well in this kind of application. However, you may wonder if a product that cannot beat its older, cheaper and slower clocked competitor in this benchmark is worth your hard earned money.
Quake 3 Arena Demo001
The story looks very different when we look at the results here. Quake 3 Arena is able to take great advantage of Pentium 4’s high memory throughput. This is why the overclocked Pentium 4 at 1.61 GHz and 460 MHz Rambus clock performs almost as well as the Pentium 4 at 1.73 GHz. Anyway, if Quake 3 Arena is really important to you, Pentium 4 might be worth its money.
Quake 3 Arena NV15Demo
The lead of Pentium 4 is reduced but not lost when it’s running the complex NVIDIA NV15-level. The processor needs to compute a whole lot of FP-stuff and so far Quake 3 doesn’t take any advantage of SSE2. Still, Pentium 4 is looking good.
Unreal Tournament UTBench
If you thought that Pentium 4 would kick butt in any 3D-game you are in for a surprise. Unreal Tournament is currently ruled by Athlon. Only the highly overclocked versions of Pentium 4 are able to surpass the AMD processor.
This benchmark shows how sensitive Pentium 4 seems to be. While most processors that I know perform similarly under Quake 3 as well as Unreal Tournament, Pentium 4 clearly prefers Quake 3.
MDK2
In the last tested 3D-game, MDK2, Pentium 4 needs to be at (currently unavailable) 1.6 GHz to outperform Athlon 1200/133. Again, Pentium 4 doesn’t look that good.
MPEG4 Encoding With Flask / DivX
You might remember the recent article about AMD’s 760 chipset. I wrote that I consider the MPEG4-test the overall most important processor benchmark, because the scores in this benchmark make a really noticeable difference to the computer user. We are talking of hours of waiting that can be saved with a well performing processor.
I bet that Intel loves this benchmark, because Pentium 4 performs really well here. Even the slowest Pentium 4 is able to beat the competitor from Sunnyvale.
This result shows what Pentium 4 was designed for. It seems that Intel is taking more care of 3D-gamers and DVD-Rippers than of office workers. Pentium 4 doesn’t want to be working class!
Linux Kernel Compilation
Now here we go with yet another working-class kind of benchmark.
Once more Pentium 4 looks rather embarrassing. Even Pentium III is just about able to beat the P4 @ 1.4 GHz in this good old classic Linux kernel compilation. Athlon smokes P4 big time.
Floating Point Performance Under 3D Studio Max 2
Intel followers hate me using this benchmark, because it makes Athlon look so good. However, if those guys had a bit of a memory, they’d remember that I have been using 3D Studio Max for 4 years now and in the first three of those years Intel always beat everything in this benchmark. 3D Studio Max is still one of the best ways to test the classic 80-bit precision x87 FPU. If Intel has the nerve to claim Pentium 4 has the most powerful floating-point performance in the world, it had better look good in this test.
Well, well, isn’t this just about slightly shocking my dear? Pentium 4 gets badly kicked even by its supposedly weaker brother Pentium III. Athlon leaves both Intel processors in the dust. Now we know HOW badly the x87 FPU of Pentium 4 was designed. Intel is completely relying on an urgent implementation of SSE2. I don’t consider that as bad. SSE2 will most likely be a success. However, couldn’t Intel start cutting the crap by now? Does it really have to be that every Intel marketing person claims Pentium 4 has the best FPU-performance in the world, while in actual fact it really blows at it?
Clock For Clock Comparison Of Pentium 4 And Pentium III
Yes, I thought I should really do this, although Intel certainly despises me for it. I underclocked one of their great Pentium 4 processors to 1 GHz and compared it with Pentium III 1 GHz.
The P4-sample provided by Intel was not multiplier-locked, so that I could run it at 10 x 100 MHz = 1 GHz without any need to alter the system bus clock.
Ouch! Clock for clock Pentium 4 gets beaten by Pentium III in all but the Quake 3 and the MPEG4 benchmarks! If the Pentium III at 1.13 GHz were available and Intel had launched Pentium 4 at 1.3 GHz, the difference between the two would have been hardly noticeable. Nevertheless, please don’t forget that Pentium III is pretty much at the end of its days. It can’t get much faster. Pentium 4 however has a long way to go. Clock speed will soon equalize this nasty defeat.
Clock For Clock Comparison Of Pentium 4 And Athlon
What was done with Pentium III can obviously be done with Athlon as well. In this run an underclocked Pentium 4 at 1.2 GHz (12 x 100 MHz) had to show how it performs against AMD’s current flagship.
Compared to Athlon clock for clock, Pentium 4 is looking even worse. It is barely able to beat Athlon in MPEG4 and the lead in Quake 3 Arena Demo001 is very thin as well. Clock speed and compiler optimization are the only ways Pentium 4 will keep the Athlon at arm’s length.
The Impact Of the Memory Speed
Finally I wanted to find out if it isn’t just the excellent data throughput of Pentium 4’s system bus and memory interface that makes it perform so well in Quake 3 and the MPEG4-compression. I ran Pentium 4 twice at 1.6 GHz and 100 MHz system bus, once with the RDRAM clocked at 400 MHz and once clocked at only 300 MHz, reducing its bandwidth from 3,200 MB/s down to 2,400 MB/s. Then I ‘overclocked’ a 1.4 GHz Pentium 4 to 1.610 GHz by increasing the system bus clock to 115 MHz and thus the Rambus clock to 460 MHz, increasing the memory bandwidth to 3,680 MB/s.
You can see that the memory speed does indeed have a major impact on all the benchmark results except for the 3D Studio Max scores. In some cases the difference between the slowest and the fastest score is more than 10%! This proves clearly that Pentium 4 lives from the high memory bandwidth that RDRAM is finally able to deliver. Keep that in mind in case someone wants to sell you PC600 RDRAM!
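For those who want to check the bandwidth figures, here is a small sketch of the arithmetic. It assumes the usual RDRAM layout: each channel is 16 bits (2 bytes) wide, transfers data on both clock edges, and the i850 drives two channels in parallel. The function name and its parameters are purely illustrative.

```python
def rdram_bandwidth_mb_s(rdram_clock_mhz, channels=2, bytes_per_transfer=2,
                         transfers_per_clock=2):
    """Peak bandwidth in MB/s for a set of RDRAM channels.

    Assumptions: 16-bit (2-byte) channels, data on both clock edges,
    two channels as on the i850 platform used in this test.
    """
    return rdram_clock_mhz * transfers_per_clock * bytes_per_transfer * channels


# The three memory configurations from this test.
for clock_mhz in (300, 400, 460):
    print(f"RDRAM at {clock_mhz} MHz: {rdram_bandwidth_mb_s(clock_mhz)} MB/s")
```

The three RDRAM clocks used here give 2,400, 3,200 and 3,680 MB/s respectively, which matches the numbers quoted above.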
Conclusion
Well, well, I wonder how Intel will be able to motivate its OEM customers after they read the numbers that we’ve just seen. This is what Intel wants them to do:
It’s not going to be easy I guess!
Let’s get serious now, however. We have learned that Pentium 4 has got a rather exciting and interesting brand new design that comes with a whole lot of potential. However, the benchmark results might seem a bit sobering to the majority of you. Whatever Pentium 4 is right now, it is certainly not the greatest and best performing processor in the world. It’s not a bad performer either, though.
Intel seems very determined to make Pentium 4 a success and I have the feeling that it will succeed. The implementation of SSE2-instructions into future software as well as the usage of code-optimizing compilers for Pentium 4 will make sure that Pentium 4 will be standing in a much better light very soon. However, I believe that Pentium 4’s strongest side is its clock speed potential. Just realize that I overclocked this brand new 1.5 GHz Pentium 4 to beyond 1.7 GHz without any problems. I don’t care whatever the latest roadmap of Intel may be saying. I am certain that Intel will deliver very fast Pentium 4 processors very soon. Intel has finally won back the ability to make AMD’s life a lot harder.
What do I think of the components around Pentium 4? I have got to admit that with Pentium 4, Rambus is finally able to deliver for the first time. If you look at Pentium 4’s design closely enough, you can see that it’s engineered to live with RDRAM in perfect harmony. The memory benchmarks from above show that Pentium 4 really requires the 3,200 MB/s of data bandwidth supplied by the two Rambus channels. I doubt that it will perform as well with DDR-SDRAM, unless two channels are used. One DDR-SDRAM channel offers ‘only’ 2,122 MB/s of data bandwidth, which might make quite a difference with Pentium 4.
The new power supply and housing requirements for Pentium 4 might be a nuisance to some, but they make perfect sense. I hope that AMD will follow Intel’s example and come up with some solid new specifications for Athlon-platforms as well.
I personally really like Pentium 4. It’s a bit like getting designer furniture. You don’t really need it, but it’s damn cool to have it. Don’t buy Pentium 4 unless you feel like this. If you can spend the extra bucks and like the strengths of Pentium 4 without minding the little weaknesses (e.g. x87 floating point applications), you may want to consider it. If you are on a budget or your system is a hard working platform that’s required to make you money, I’d rather go for the really working class kind of processor by the name of Athlon.
Pentium 4 at 1.4 GHz goes for $644, Pentium 4 at 1.5 GHz costs $819 right now. It’s not exactly a bargain, but, hey, who really cares about price if it really is all about style?
Please don’t forget to read the Intel roadmap update from October 13, 2000 to understand Intel’s strategy with Pentium 4.
Please be sure to read the follow-up article, Important Pentium 4 Evaluation Update