The New Athlon Processor – AMD Is Finally Overtaking Intel
Back in October 1998 at the Microprocessor Forum in San Jose, California, the PC-world watched and listened in amazement to Dirk Meyer’s first presentation of K7’s or now Athlon’s architecture. It was quite obvious to experts as well as to most other listeners, including Intel employees, that this new AMD processor would mark a new era in the processor world, if AMD could make its promises come true. Now finally, the waiting is over and we can look at a new processor that is indeed living up to all the positive expectations that arose at the end of last year.
Later on in this article you will find that the AMD Athlon beats the Intel Pentium III in virtually every benchmark we’ve run, but before we get into those benchmark numbers, I’d like to take the time to explain why Athlon is indeed not ‘just another new CPU’, but a milestone in the whole processor scene.
Athlon’s Micro-Architecture
The AMD Athlon is manufactured in 0.25µ technology and consists of no less than 22 million transistors.
I will try and make a comparison between the architecture of Athlon and Pentium III as far as it’s possible, so that we can see why Athlon beats Pentium III in pretty much any benchmark. I’ll discuss the internal units first, followed by the caches and the bus protocol, finally closing with Athlon’s chipset ‘Irongate’.
Block Diagrams
For illustration purposes I’m offering you the block diagram of Athlon as well as Pentium III. Please refer to those diagrams to make following my explanations a bit easier.
Glossary:
AGU: Address Generation Unit
IEU: Integer Execution Unit
SSE: Intel’s Streaming SIMD Extensions Execution Unit
BTB: Branch Target Buffer
BHB: Branch History Buffer
This is a picture of the Pentium III die, which is not to the same scale as the Athlon die-picture; the Athlon die is larger. The Pentium III (Katmai) is also manufactured in 0.25µ technology and it consists of 9.5 million transistors. Here you’ll find some more facts.
3-way Instruction Decoder – It’s not quite the same, baby!
For executing software, the very job of a processor, each CPU begins with the decoding of the program’s machine instructions, ‘translating’ them into operations or OPs that the microprocessor can handle internally. AMD calls those OPs ‘MOps’, for Macro Operations; Intel calls them micro-OPs or µ-OPs for short. The two are only loosely comparable: an AMD MOp can contain two operations, whereas an Intel µ-OP contains one. Modern processors can decode common and frequently used instructions into these OPs extremely fast, and execute them very quickly as well (typically in one clock). Less common or very complex instructions need to be decoded in a slower process, which involves looking up the OPs in a ROM within the CPU, and the number of resulting OPs is often more than two. The part of the Athlon decoder that deals with the directly decodable instructions is called ‘direct path’, the part for the complex instructions is called ‘vector path’. The P6 architecture (PentiumPro, Pentium-II and Pentium III) is similar but less flexible, using only one path for both types of decodes. Why is all this done? For Athlon it is speed! For the P6, it’s design simplicity.
Let’s compare the Athlon’s decoders to the well known P6 decoders. Intel’s Pentium III has three parallel decoding units, known as complex, simple and simple. Without going into the boring super technical detail, Intel has strict rules for using these decoders. This means that ideally three instructions can be decoded at the same time, if and only if one of them is a complex instruction and the other two are simple instructions. Intel defines complex as an instruction that can be represented by no more than 4 µ-OPs. Simple is defined as an instruction that can be translated into a single µ-OP. Athlon can also only decode three instructions at the same time, but it comes with three fully capable decoders. This means that Athlon will decode virtually any combination of instructions with any of its decoders. It has no special rules like the P6 architecture. Let’s say that this is performance advantage No.1.
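The effect of those decoder rules can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions that are not taken from Intel or AMD documentation: instructions are decoded strictly in program order, the first P6 decoder accepts any instruction of up to 4 µ-OPs while the other two accept only single-µ-OP instructions, and every instruction in the stream is a ‘direct path’ instruction for Athlon. The function names are made up for this example.

```python
def _check(uops_per_instr):
    # The toy model only covers instructions of 1 to 4 uOPs (no microcode ROM).
    for uops in uops_per_instr:
        if not 1 <= uops <= 4:
            raise ValueError("toy model supports 1..4 uOPs per instruction")


def p6_decode_clocks(uops_per_instr):
    """Clocks needed by a complex/simple/simple decoder arrangement."""
    _check(uops_per_instr)
    clocks = 0
    i = 0
    n = len(uops_per_instr)
    while i < n:
        # Decoder 0 (assumed): takes the next instruction, whatever it is.
        i += 1
        # Decoders 1 and 2: take following instructions only if they are simple.
        for _ in range(2):
            if i < n and uops_per_instr[i] == 1:
                i += 1
            else:
                break
        clocks += 1
    return clocks


def athlon_decode_clocks(uops_per_instr):
    """Clocks needed by three fully capable decoders (any mix, three per clock)."""
    _check(uops_per_instr)
    return (len(uops_per_instr) + 2) // 3


if __name__ == "__main__":
    stream = [2, 2, 2, 1, 1, 1]  # hypothetical mix: three complex, three simple
    print("P6 clocks:    ", p6_decode_clocks(stream))
    print("Athlon clocks:", athlon_decode_clocks(stream))
```

With a stream made of simple instructions only, both models need the same number of clocks; the difference only shows up when complex instructions follow each other.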
Link to AMD Slide from Microprocessor Forum that explains this.
The Instruction Control Unit
As you can see from the processor block diagrams, the next stage, once an instruction is decoded, is in Athlon’s case the Instruction Control Unit. This unit can hold up to 72 MOps (because a MOp can equal an x86 instruction, this means Athlon can have up to 72 in-flight instructions) before they’re dispatched to the schedulers. This is a lot more than the 20 µ-OPs that can be held in Intel’s Reservation Station (if you take an average of, say, 1.5 µ-OPs per instruction, then the P6 architecture has approximately 13 in-flight instructions). That is already the next advantage of Athlon over PIII, but let’s not even count that. The next step is where it gets really interesting.
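The in-flight comparison is simple arithmetic, shown here as a small Python sketch. The figure of 1.5 µ-OPs per instruction is the rough average assumed in the text, and one MOp per instruction is the best case for Athlon; the function name is made up for this example.

```python
def in_flight_instructions(buffer_entries, ops_per_instruction):
    """Approximate number of x86 instructions a buffer of OPs can hold."""
    return buffer_entries / ops_per_instruction


if __name__ == "__main__":
    athlon = in_flight_instructions(72, 1.0)  # best case: one MOp per instruction
    p6 = in_flight_instructions(20, 1.5)      # assumed average of 1.5 uOPs
    print(f"Athlon: about {athlon:.0f}, P6: about {p6:.0f} instructions in flight")
```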
The Execution Ports
You certainly agree that the most important thing a microprocessor has to do is to actually execute the instructions of the software it’s running. Thus it’s about time that we get to this stage. You cannot really see it in the block diagrams, but Pentium III has 11 (+1) parallel execution units, Athlon has even more. Those units are executing the OPs, and since there are so many of them in parallel, you can imagine why we are talking of ‘out-of-order’ execution here. Executing one OP after another would obviously not make any use of parallel execution units. To make sure that the out-of-order execution is actually working, Intel is using the ‘Renamer & Allocator’ as well as the ‘Reorder-Unit’. The ‘Integer/FP Renamer/Allocator’ is found before the Reservation Station, and as the name already says, this unit is responsible for integer as well as FP and multimedia OPs. Athlon handles this work in a somewhat more sophisticated way. The units that take care of the out-of-order execution are the Integer Scheduler and the FP Scheduler, both able to hold a quite impressive number of OPs (18/36).
Athlon’s Integer Execution Path
Athlon’s Floating Point and Multimedia Execution Path
The Execution Ports Continued
Now to get the actual ‘work done’, the OPs have to be dispatched into the execution ports. Here’s where Athlon really shines. Pentium III has only got 5 execution ports (two of which are dedicated to memory stores), Athlon comes with no less than 9. This means that Pentium III can only dispatch 5 OPs per clock, Athlon can dispatch 9 at the same time. Let’s get back to the execution units a bit. Pentium III has 11 of them, and three of these units occupy three of the five ports: the two address generation units (load/store address) and the store data unit. Then there is execution port 0, including the IEU (integer execution unit) 0 and the Integer Shifter, MMX execution unit 0, SSE Multiplier, FADD (floating-point add), FMUL (floating-point multiply) and FDIV (floating-point division), the latter of which is not pipelined. Execution port 1 hosts IEU 1, MMX 1 and the main SSE execution unit. Those execution units can all more or less work in parallel, and most of them are pipelined. It still doesn’t change the fact that Athlon can dispatch almost twice as many OPs at the same time, because it’s got those 9 instead of only 5 ports. Athlon’s execution units are the following. There are three IEUs, each of them has its own port, so that three integer OPs can be executed at the same time. Athlon comes with three parallel AGUs (address generation units) as well, which also have their own ports. Then there are the three FP/multimedia-ports, one used for FSTORE (storing floating-point data), one used for FADD, MMX 0 and 3DNow! 0 and another one used for FMUL, MMX 1 and 3DNow! 1.
Summarizing we can say that this is definitely performance advantage No. 2, the parallelism inside Athlon is definitely ahead of Pentium III.
Most of that stuff and a bit more you’ll find summarized on this AMD-slide.
The Pipeline
Before I start discussing the execution pipeline of Athlon, I’d like to take the time to explain it in some simple words. A good comparison with a processor pipeline might be a car manufacturing plant, of which you find so many in Detroit or around the corner from my home town, in Sindelfingen, where the main Mercedes fab is found. A processor without a pipeline is like a car fab where only one person or team builds a car at a time. The person or team starts building the car, and won’t do anything with any other car before this one is finished. It takes different steps for building this car, but it’s always the same guy or team that does it. The effect is that it takes a long while for each car to be built, but if a mistake was made somewhere in the building process, it’s only one car that was built wrong. Now pipelining is what you find in any modern car manufacturing plant. From when the first piece of metal is being pressed and welded until the car drives away from the final check, there are a lot of different stages involved, each done by a different person or team. The big advantage of this is that as soon as the first stage of a car is finished, it moves over to the next stage, freeing up stage one for starting to build another new car. This way, the frequency of cars produced is a lot higher, and the frequency at which new cars can be started is just as high too. There’s only a nasty problem if it turns out that at some stage inside the manufacturing process something was done wrong: a delay is incurred, the production line is stalled and no more cars are produced.
This analogy shall give you an idea about processor pipelines and the importance of the length of a pipeline. A short pipeline, which equals a pipeline with only a few stages, means that each stage can take quite a long while, so that it cannot be fed with very high speed. It also won’t buy you much to put several short pipelines in parallel, since, due to the short pipelines, the execution times of the pipelines can be very different, creating a mess when the executions have to be put in order again. Thus you want a longer pipeline for high clock rates and for good parallelism. However, if a branch prediction turned out to be wrong in a long pipeline, all following OPs depending on this wrong prediction have to be flushed out of the pipeline and it has to be reloaded again, which wastes a lot of time. To make it as simple as possible: You want a long pipeline for high speeds, but don’t let it get too long or you get a horrible penalty for wrongly predicted branches.
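To put the car-plant analogy into numbers, here is a minimal Python sketch. It assumes an idealized pipeline in which every stage takes exactly one clock and nothing ever stalls or gets flushed; the function names are made up for this illustration.

```python
def unpipelined_clocks(stages, items):
    """One team does all steps of an item before the next item is started."""
    return stages * items


def pipelined_clocks(stages, items):
    """Assembly line: the first item needs `stages` clocks to come out,
    every further item leaves the line one clock after its predecessor."""
    if items == 0:
        return 0
    return stages + (items - 1)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, unpipelined_clocks(10, n), pipelined_clocks(10, n))
```

A single item takes just as long either way; the gain only appears once many items follow each other, which is also why a flush that empties the line hurts so much.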
The integer pipeline of Athlon is 10 stages long, which is considered an almost ideal length for clock speeds of 500 – 1000 MHz. Pentium III’s integer pipeline is 12 to 17 stages long and thus more sensitive to wrongly predicted branches. As you can see from the picture above, the floating point pipeline of Athlon is 15 stages long, standing against an estimate of over 25 stages in Pentium III.
The Pipeline Continued
The next thing we want to find out about is the branch prediction of the two contestants. In the block diagrams of the two CPUs you’ll find that Athlon has a BTB (branch target buffer) with no less than 2048 entries, which means that Athlon can store 2048 different branching addresses. The BHT (branch history table) can store 4096 entries. This stands against Pentium III’s Dynamic Branch Predictor with only 512 entries. AMD claims that Athlon makes a correct branch prediction with a probability of 95%, which is very high. Intel’s Pentium III is estimated to have a probability of 90-92% for correct branch predictions.
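How prediction accuracy and pipeline length interact can be estimated with a back-of-the-envelope model. The Python sketch below is only that: it assumes that a wrongly predicted branch costs as many clocks as the integer pipeline has stages, and that one instruction in five is a branch. Both are assumptions made for this illustration, not figures from AMD or Intel. The accuracies are the 95% claimed by AMD and the middle of the 90-92% estimate for Pentium III.

```python
def mispredict_clocks_per_instruction(accuracy, flush_penalty, branch_fraction=0.2):
    """Average clocks lost to wrong branch predictions, per executed instruction.

    accuracy        -- probability that a branch is predicted correctly
    flush_penalty   -- clocks lost per wrong prediction (assumed: pipeline depth)
    branch_fraction -- share of instructions that are branches (assumed: 20%)
    """
    return branch_fraction * (1.0 - accuracy) * flush_penalty


if __name__ == "__main__":
    athlon = mispredict_clocks_per_instruction(0.95, flush_penalty=10)
    pentium3 = mispredict_clocks_per_instruction(0.91, flush_penalty=12)
    print(f"Athlon:      {athlon:.3f} clocks lost per instruction")
    print(f"Pentium III: {pentium3:.3f} clocks lost per instruction")
```

Under these assumptions the Pentium III loses roughly twice as many clocks per instruction to wrong predictions; with different branch mixes or penalties the gap changes, but the direction stays the same.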
Whilst talking about buffers and jumps and predictions we shouldn’t forget another nice way for saving execution time, the Return Stack once introduced by Cyrix several years ago. Whoever knows a little bit about machine code or Assembler programming, will certainly remember that each time a function, procedure or other subroutine is called, the program address counter gets pushed onto the stack. Once the procedure or function is finished, the processor pulls the program address from the stack and returns to where it came from. These stack operations may be a nice thing, since they don’t require special CPU registers, but they are always very slow and should thus be avoided. The ‘Return Stack’ is a special storage area inside the processor, which is accessed very quickly (the normal stack is found in main memory). By using the special return stack, the start and finish of a subroutine can be sped up quite nicely. Athlon is equipped with a whopping 12 entry return stack, Pentium-III’s return stack is not documented, but it’s probably less than half of Athlon’s.
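The idea behind a return stack can be sketched in a few lines of Python. This is a toy model under assumptions chosen for this illustration: the return stack is treated as a small on-chip buffer of return addresses that sits next to the normal stack in memory, the oldest entry is discarded when more than 12 calls are nested, and an empty stack simply yields no fast return address. How Athlon’s hardware really handles these corner cases is not covered here, and the class and method names are hypothetical.

```python
from collections import deque


class ReturnStack:
    """Toy model of a small on-chip return stack."""

    def __init__(self, entries=12):
        # A bounded deque drops the oldest entry when a new one is appended
        # to a full stack (assumed overflow behaviour).
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Remember where the subroutine has to return to."""
        self._stack.append(return_address)

    def on_return(self):
        """Return the most recent return address, or None if none is stored."""
        return self._stack.pop() if self._stack else None


if __name__ == "__main__":
    rs = ReturnStack()
    for address in (0x100, 0x200, 0x300):  # three nested calls
        rs.on_call(address)
    print([rs.on_return() for _ in range(4)])
```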
Summarizing the length of Athlon’s integer and floating point pipeline with its excellent branch prediction unit and the 12-entry return stack earns it performance advantage point No.3.
FPU
You may remember my article from last year’s Microprocessor Forum ‘AMD moves onto the overtaking lane‘. In this article I already postulated that K7 or Athlon would have a significantly faster FPU than any Intel CPU. Of course I was criticized for that, since it seems one of the rather popular things to do nowadays. Nevertheless Athlon has shown that its FPU is one of its strongest parts. Let’s check out why.
The number one reason why Athlon can play in the same ballpark as the Intel CPUs is the fact that Athlon’s FPU is now fully pipelined vs. the unpipelined FPU of K6, K6-2 and K6-3. That’s not all however. Athlon has got three parallel FP execution units and, as we know from above, the three execution units can be fed at the same time, since each of them has its own port. Pentium III has also got 3 FP execution units, but unfortunately they’re all behind one port. What is so great about the Athlon FPU is that it can execute two 80-bit extended operations a clock to Intel’s one.
I can still remember the old discussion when K6 came out. People claimed that K6’s FPU wouldn’t be bad at all, since it had a lower latency than PII. This was right and wrong at the same time. The latency of many K6-FPU instructions is indeed lower than of Pentium Pro and PII, but this is not good enough without the pipelining, especially in software written for Intel CPUs. Athlon’s FPU has got an average latency that’s also less than PIII’s. The result is that with the lower latency, the FPU pipeline and the 3 ports, Athlon can score significantly higher with its FPU than PIII can, which you will see particularly nicely in the results of the 3D Studio Max rendertime. This benchmark used to be Intel’s domain, but that time’s over now. Anyway, the FPU is good enough for Athlon’s performance advantage point No. 4 for the speed and No. 5 for beating Intel on its favorite battleground.
MMX
There’s not quite that much to say about Athlon’s MMX-units, except that AMD added the new MMX-instructions that were introduced by Intel with the Pentium-III. Those instructions include min, max, average and shuffle commands and will mainly be interesting for image processing and video compression. Athlon’s latency for many MMX-instructions is double that of Pentium III, so that PIII could almost score a point here. However, in Intel’s very own Media Bench, the Athlon was still scoring better, and only in FutureMark’s MultimediaMark99 can Pentium III beat Athlon. This is not too surprising however, because MMMark99 has so far only been optimized for Intel’s SSE (streaming SIMD extensions) and not for 3DNow!. Thus we cannot really count this benchmark, especially since it tests more than only the MMX-part of a CPU; it includes ISSE as well.
3DNow!
AMD added some 5 new 3DNow!-instructions to Athlon’s instruction set, which can make quite a difference when used, as seen in 3Dmark99 with AMD’s special Athlon-DLL. Using this DLL with a K6-3 or K6-2 led to crashes, which proved that this DLL did not properly recognize the CPUs. With the usage of the Athlon-DLL, 3Dmark99 processor-results were increased by a whopping 20%. Running Quake2 with 3DNow!-support turned on and off doesn’t show as much of a difference as it does with K6-2 or K6-3. The reason for this is most likely the fact that Athlon has a latency penalty on 3DNow!-instructions of 100% over K6-3, due to its high speed design in comparison to the low to mid-speed design of the K6-series. This means that Athlon doesn’t benefit as much from 3DNow! as the K6-2/3-series did. No special point for that then …
The Second Level Cache
Let’s move from outside to inside. The second level cache of the Athlon models we’ve got in our lab is 512 kB, running at 1/2 the core clock speed, just the same as Pentium III. However, Athlon was designed with a very flexible second level cache in mind. The L2-cache size can range from 0.5 to 8 MB, the L2-speed can range from a ratio of 1/3, 1/2, 1/1.5 to 1/1 L2-cache clock / processor core clock. This opens up a very wide field of possibilities for segmentation of Athlon. The low cost versions could get small and slow caches, workstation and especially server version could be equipped with up to 8 MB of L2-cache running at core clock.
In my eyes the L2-cache is also one of Athlon’s weaknesses. The fact that Athlon’s L2-cache is ‘external’, meaning ‘not on the die’, limits it to the SlotA-SEC-package. Intel is on the way to leaving Slot1 and the single edge cartridge, simply because it’s more expensive and less practical than a socket. Intel is only able to move away from Slot1 because the upcoming and delayed ‘Coppermine’ will have its 256 kB L2-cache integrated on the processor-die. AMD will have to go in this direction (and thus to the .18 µ-process) very soon too, if it wants to stay competitive.
Anyway, the L2-cache of Athlon is a very flexible thing and as long as Pentium III is also using external L2-cache, there’s no reason to complain.
The First Level Cache
Well, what can I say, Athlon really kicks butt with its L1-caches. The instruction as well as the data caches are, at 64 kB each, no less than 4 times as big as Pentium III’s. This will make sure that Athlon will scale beautifully with clock speed, so that its pipeline and all the execution ports can be fed at 1 GHz just as nicely as at 600 MHz. I can’t help it, but the L1-cache earns Athlon at least one more performance point, which is No.6 if I’m not mistaken.
Memory Streaming and Write Combining
You may remember Intel’s introduction of Pentium III. One of the new features of PIII was the ‘streaming instructions’. Those instructions are implemented in Athlon as well. Athlon’s 64-byte deep write buffer plus the five streaming instructions make sure that Athlon is also able to pre-fetch data in a defined way from, into or around the two (L2 and L1) data caches, and that it can write directly to memory and thus around the caches as well.
Another nice addition is the inclusion of the Memory Type Range Registers in Athlon, which are now compatible with Intel’s P6-architecture as well. This enables write-combining, which is particularly useful for writes to the graphics card’s frame buffer, as those of you who used ‘fastvid’ in the earlier days of Pentium Pro may remember. K6-2 CXT and K6-3 were also able to do write combining, but it took special programming of the video-card driver to make use of it. These days are over with Athlon. Still, NT 4 needs a special DLL-file for enabling write combining, and this DLL is supposed to be included in service pack 6 very soon.
Athlon’s EV6 Bus
One of the most important things to consider when talking about Athlon is its bus. Although Athlon’s connector may look like Intel’s Slot1-connector at first glance, those two CPUs are NOT compatible! AMD’s ‘SlotA’ uses a completely different bus protocol, based on the DEC Alpha protocol ‘EV6’. Many people thought that this different protocol would be a problem for Athlon and its acceptance on the market, but it seems as if the opposite is the case.
First of all I’d like to note that the EV6 or as AMD calls it Athlon front side bus is running at 100 MHz. Well, that sounds pathetic, doesn’t it? However, the bus takes advantage of the rising as well as the falling edge of the bus clock for transfers, so that the technical ‘speed’ is 200 MHz. Thus EV6 and so Athlon’s bus is currently the fastest system bus in x86-systems, transferring data at up to 1.6 GB/s. This is quite a bit more than the 1.06 GB/s that Pentium III will reach once ‘Camino’ or ‘i820’ is out and it’s exactly double the bandwidth of current BX-systems. AMD says that EV6 is scalable up to 200 MHz = 400 MHz effective speed = 3.2 GB/s peak bandwidth and before this speed is reached there’ll be 133 MHz = 266 MHz effective speed = 2.1 GB/s peak bandwidth. From this point of view it’s obvious, EV6 is more than future proof and certainly superior to the P6-bus. But that’s not all. Another beauty of this bus is the fact that in multi-processor environments each processor has its dedicated path to the chipset, because EV6 is a point-to-point connection of CPU and system. The P6-bus is a shared bus. This means that all processors have to share its bandwidth, leaving less for each processor, the more processors are used. EV6 offers the full bandwidth to each processor and it supports up to 14 processors in SMP-environments. The address space of EV6 is also higher than that of the P6-bus, it’s 43 bit deep vs. 36-bit address depth of P6.
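The bandwidth figures above follow from simple arithmetic, which the following Python sketch reproduces. It assumes a 64-bit (8-byte) data path for both the EV6 and the P6 bus, a width that is not stated in the text, and it treats a shared bus as being split evenly between processors when all of them are busy, which is a simplification. The function names are made up for this example.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit data path on both buses


def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock=1, width_bytes=BUS_WIDTH_BYTES):
    """Peak bandwidth in MB/s: clock x transfers per clock x bytes per transfer."""
    return clock_mhz * transfers_per_clock * width_bytes


def shared_bus_share_mb_s(total_mb_s, processors):
    """Even split of a shared bus between processors (simplified worst case)."""
    return total_mb_s / processors


if __name__ == "__main__":
    print("EV6 100 MHz, both edges:", peak_bandwidth_mb_s(100, 2), "MB/s")
    print("BX  100 MHz:            ", peak_bandwidth_mb_s(100), "MB/s")
    print("i820 133 MHz:           ", peak_bandwidth_mb_s(133), "MB/s")
    print("EV6 133 MHz, both edges:", peak_bandwidth_mb_s(133, 2), "MB/s")
    print("EV6 200 MHz, both edges:", peak_bandwidth_mb_s(200, 2), "MB/s")
    print("BX share with 4 CPUs:   ", shared_bus_share_mb_s(peak_bandwidth_mb_s(100), 4), "MB/s")
```

With a point-to-point bus like EV6 the per-processor figure stays at the full peak value, no matter how many processors are installed.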
System Bus
One thing shouldn’t be forgotten however. Getting EV6 up to 200 MHz won’t be an easy thing to do. We know that many motherboard makers are currently having a hard time with Intel’s Camino chipset, which is also running at very high speeds (up to 400 MHz) on the motherboard. Implementing 200 MHz CPU-to-chipset lines on a motherboard makes the board design pretty complicated as well. Things get worse for multi processor systems. We’ve learned that EV6 is a point-to-point protocol and this means that each processor is connected to the chipset with its own 140 lines. This condition is already quite tough for dual or quad processor motherboards and it will get even worse if those x times 140 lines are running at a 200 MHz clock.
I’d still say that the bus technology obviously rocks, which makes sure that Athlon gains performance advantage point No. 7.
Athlon’s First Chipset AMD 750 ‘Irongate’
We’ve just heard that Athlon is using the proprietary SlotA and the in x86-circles so far unknown EV6 bus protocol, thus we conclude that Athlon needs a dedicated chipset as well. The first one comes from AMD and is called ‘Irongate’ or the AMD-750 chipset. A closer look at it doesn’t reveal anything exciting, it’s pretty similar to Intel’s BX-chipset, equipped with all the usual features we know and like. It does without any fancy ‘Intel Hub Architecture’, but it offers all we need. You may expect a new memory type or a faster memory bus, but so far Irongate is conservative. Why shouldn’t it be? The performance with PC100 SDRAM is already better than PIII 600 on BX, and I doubt that Camino will make much of a difference. AMD is planning to implement PC133 support as well as DDR-SDRAM support, which would go hand-in-hand with the DDR-speed of the EV6 bus just fine. Even RDRAM seems to be a future option. AGP4x will of course also be included and ATA-66 is already supported by the current Irongate-version.
Currently it looks as if Irongate’s memory and AGP-performance are not quite up to speed with Intel’s BX-chipset. AMD is working on this issue, which should make us confident that Athlon will beat Pentium III by even more once those issues are resolved.
Those things will most likely not be designed by AMD anymore though. AMD wants to focus back on the CPU-business, and the Taiwanese chipset manufacturers led by VIA will take over for the Athlon chipsets. The only thing that is not quite clear to me yet is the issue with SMP-chipsets. I can’t recall any of the Taiwanese chipset makers having any experience with SMP. Thus I would imagine that DEC might change its ‘Tsunami’ chipset to work with Athlon. After all, the performance of Athlon is so high that it would be a perfect CPU for workstations and servers. This will require an SMP-platform.
Future Athlon Models
Although AMD is still keeping everything pretty secret, we were able to get some information.
According to the above two slides there’ll be indeed a low-cost version of K7, probably with smaller and slower L2-cache than the normal Athlon, then there’ll be ‘Athlon Professional’ and ‘Athlon Ultra’, the Professional could possibly come with a faster L2-cache, the ‘Ultra’ might have larger as well as faster L2-cache. Athlon’s server-version is supposed to come with a serial number, known from Pentium III. The normal Athlons won’t have that number though, because AMD can do without the trouble that Intel had with the PIII serial number just fine.
As long as AMD hasn’t disclosed more, I’m left to do wild speculations, and that’s something I’m really not good at. I prefer leaving this to others.
Architectural Summary
If I have seen it correctly, Pentium III was beaten 7:nil in the architecture comparison, which seems to be a pretty clear result to me. Athlon’s architecture, which is focussed on high clock speed, opens a bright future for this new AMD processor. At 500 MHz this CPU is hardly making use of its abilities and it’s not surprising that Kryotech has already been able to run some Athlon-samples at 1 GHz. The architecture is a clear winner and a perfect base for AMD to steer into the year 2000. For the first time, Intel has every reason to become paranoid to survive. This time it’s not a close catch-up of an AMD that’s completely out of breath, this time Intel won’t be able to just wait a couple of months and then release a new product that will beat the fastest AMD processor. Athlon is simply the more modern and technically advanced product and Coppermine, which has already been delayed due to speed-path issues, won’t change that. Coppermine’s architecture is still based on the architecture of Pentium Pro. This architecture won’t be good enough to catch up with Athlon. It will be very hard for Intel to get Coppermine to clock frequencies of 700 MHz and above and the P6-architecture may not benefit too much from even higher core clocks anymore. Athlon however is already faster than a Pentium III at the same clock speed, which will hardly change with Coppermine, and Athlon is designed to go way higher than 600 MHz. This design screams for higher clock speeds! AMD is probably for the first time in the very situation that Intel used to enjoy for such a long time. AMD might already be able to supply Athlons at even higher clock rates right now (650 MHz is currently the fastest Athlon), but there is no reason to do so. So AMD will wait until Intel releases a faster Pentium III and then AMD releases an even faster Athlon. It could go on like this until Intel has finished Willamette, which doesn’t look as if it will happen anytime soon.
Until then the world will be upside down, Intel will be the No.2 and AMD will supply the high end CPUs. It’s a once in a lifetime chance for AMD, let’s hope they’ll make the right use of it!
Let’s not forget the problems that Athlon is facing though. First of all AMD has to be able to supply enough Athlons to satisfy the market. Then AMD will have to rely on Taiwanese motherboard and chipset makers to provide reliable Athlon-platforms. The Dresden-fab will soon have to start producing Athlon in .18µ-technology, to reach higher clock speeds and possibly an on-die L2-cache. Those are the major issues that I can see right now, let’s hope that AMD can solve them without Atiq Raza.
Benchmarking AMD’s Athlon
In this part Athlon will have to show what it can really do when racing through several benchmarks. We decided that for this article a comparison with Intel’s Pentium III would suffice, but we’ll make an overall comparison of all available CPUs very soon too.
Those are the two contestants when they’re all dressed up.
Is Intel Losing the x86-Floating Point Crown?
It’s been a long time since a non-Intel x86-processor was able to beat its competitor from Intel in all possible benchmark areas. We have to go back to the days of 486 and 386 to find AMD and Cyrix processors that were able to beat the top Intel-product in integer as well as floating-point benchmarks. Especially the latter became Intel’s strongest force once Pentium turned up on the scene, and until today there was no x86-CPU that came even close to Intel’s reign of the floating-point arena. One of the beauties of Athlon is that it kicks the butt of Intel’s flagship right there where it hurts, in the floating point area. This is very remarkable, because the good old x87-architecture is not exactly designed to reach high performance easily. Each floating point operation still needs to use the stack and only two operands can be used. The design of the PowerPC processors is a lot better here, it has 4 times the number of floating point registers and it can use three operands in its instructions. Still, Athlon’s floating point performance is even able to put a PowerPC’s number crunching ability to shame and that’s really not shabby.
Intel’s Defensive Early Release of Pentium III 600
Intel saw it coming, at least a little bit. That’s why Pentium III 600 was rushed out at the end of July, so that there was a Pentium III that runs at least at the same clock speed as the expected fastest version of Athlon. Unfortunately AMD did something that Intel used to do in the past, they just decided to release an Athlon one speed grade higher, and even though Athlon 600 is already a very bitter pill for Pentium III 600, AMD can now even claim to have not only the fastest, but also the highest clocked x86-processor on the market. That must hurt!
Has Pentium III 600 Got Serious Temperature Problems?
There’s even more to say about PIII 600. It’s strange enough that Intel is using the old overclocker-trick of raising the core-voltage a little bit. PIII 600 is using 2.05 instead of the 2.00 V used by PIII 450 – 550, which makes you wonder since when 0.05 V are supposed to make the big difference. Well, a lot of people are currently reporting problems with heat-induced crashes of PIII 600 and you start wondering if the release of PIII 600 was some kind of desperate move of Intel to save it from major embarrassment. The opposite could be the case now, if it indeed turns out that Pentium III 600 is not running really stable. The Pentium III 600 processors in our lab have failed several times too at 600 MHz, and this although our systems are always open. We’ll have a close look into this issue right after the Athlon-review.
The Benchmark Mix
I decided that we should give Athlon a really hard time. Thus we tried to run a lot of benchmarks that used to be loved by Intel, to show if Athlon can beat Pentium III even in those. For the Integer benchmarks we used BAPCo’s Sysmark98 under Windows98 as well as Windows NT 4. BAPCo is well known for being pretty close to Intel, which is why AMD didn’t use to like this benchmark in the days of the K6-series. For the FPU I used my beloved 3D Studio Max render time benchmark. 3D Studio Max 1 is neither enhanced for SSE nor for 3DNow!, so that it’s perfect for testing the pure FPU-performance. Of course it’s optimized for Intel’s P6-architecture, but Athlon shouldn’t have a problem with that. For the multimedia performance we used Intel’s good old own ‘Intel Media Bench’, a benchmark where Intel hasn’t been beaten by any CPU so far. I wanted to take it to the top and even ran FutureMark’s questionable MultiMediaMark99, which is not available officially, but was only given out to press people at the Pentium III launch. MMMark99 is enhanced for Intel’s SSE and not for 3DNow!, which makes this benchmark kinda cheesy. I was wondering if Athlon could still beat Pentium III, even under highly unfair conditions.
AMD was supplying us with their own benchmarking software and we thought that we might as well use it, although we hardly expected Athlon to look bad in it.
Intel’s Super Benchmarks
I also planned on using Intel’s Business, Consumer Application and Game Launcher benchmarks, but those benchmarks are really too focused on SSE and Pentium III. I really couldn’t help it and eventually decided against using them. What’s funny, though, is that Dragon’s Naturally Speaking and Adobe’s PhotoDeluxe appear in the benchmark suites from both Intel and AMD. The results are exactly opposite and show how much can be done with benchmarking. I don’t want to accuse anybody, but when you take a look at Intel’s special benchmark suites and compare them to AMD’s application suite, it seems pretty obvious that the Intel benchmarking software favors the PIII in an almost disgusting manner, whilst AMD’s suite is rather modest. What catches the eye in particular within the Intel benchmarks is the double occurrence of ‘Dispatched’, a technology demo from Rage that never made it into a 3D game. In this ‘game’, Pentium III scores double the results of Athlon. Well, that’s what I call realistic and reasonable, dear Intel! You wouldn’t see the same result in Quake3; it’s rather the other way around. Is that supposed to mean that id Software cannot optimize their code properly? I doubt it! AMD, you’ve still got a lot to learn! Playing fair is something that doesn’t really get you far in this business! I am just thinking of a well-known 3D-chip maker that could give you some lessons for the future in case Intel should turn you down.
The System Setup
Athlon in AMD’s own ‘Fester’-motherboard.
This was the hardware configuration of the test platforms:
Platform | AMD | Intel |
Motherboard | AMD Fester B3 BIOS AFTB00-06 8/2/99 | Abit BX6 2.0 BIOS date 7/13/99 |
Memory | 128MB Viking PC100 CAS2 | 128MB Viking PC100 CAS2 |
Graphic | Diamond Viper V770Ultra | Diamond Viper V770Ultra |
Hard Disk | Western Digital WDAC4180000 EIDE DMA mode enabled | Western Digital WDAC4180000 EIDE DMA mode enabled |
Network | Netgear FA310TX | Netgear FA310TX |
And here’s the software configuration:
Software/Driver | AMD | Intel |
AGP driver | v4.45 | AGP miniport dated 5/11/1998 |
BM driver | v1.11 | IDE driver dated 5/11/1998 |
Graphic driver | NVIDIA reference drivers 2.08 (Win 98/NT) VSYNC disabled | NVIDIA reference drivers 2.08 (Win 98/NT) VSYNC disabled |
Windows98 | Windows 98SE 4.10.2222 | Windows 98SE 4.10.2222 |
WindowsNT | Windows NT SP4 | Windows NT SP4 |
Desktop Resolution | 1024×768, 16 bit color | 1024×768, 16 bit color |
Refresh Rate | 100 Hz | 100 Hz |
Quake2 Version | 3.20 | 3.20 |
Quake3-Test Version | 1.08 | 1.08 |
Shogo Version | 2.1.4 | 2.1.4 |
Halflife Version | 1.0.0.9 | 1.0.0.9 |
Here’s another close up view of AMD’s Fester-motherboard for Athlon in the Californian sun:
The Results A – Office Application Benchmarks
As already said, I used Sysmark98 because it used to be the office benchmark that favored Intel CPUs. Winstone 99 doesn’t include as many applications, and in my humble opinion it doesn’t give quite as reliable results as Sysmark98.
The lead of Athlon in Sysmark98 under Windows 98 is pretty obvious, so I don’t think I’ve got to say much about it. Athlon 650 gains some 7.1% over Athlon 600, which shows how well Athlon scales. That comes pretty close to the 8.3% clock speed gain that 650 MHz has over 600 MHz. Pentium III scales pretty well in this benchmark too: the PIII 600 scores 7.2% better than the PIII 550, although the clock speed increase is 9.1%.
The picture is almost identical under Windows NT. Athlon doesn’t have the slightest problem leaving Pentium III pretty far behind here either. The Athlon 600 is some 10.1% faster than the Pentium III 600, which means that even a Pentium III 666 won’t be able to catch the Athlon 600 in Sysmark98.
If you’re not familiar with BAPCo’s Sysmark98, here is a list of the applications that are run in it:
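The scaling figures above can be checked with a few lines of arithmetic. The following is only a rough sketch: the function names are made up for illustration, and the projection for a Pentium III 666 assumes that the scaling efficiency observed between 550 and 600 MHz stays the same at higher clocks, which is an assumption rather than a measurement.

```python
def percent_gain(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def scaling_efficiency(score_gain_pct: float, clock_gain_pct: float) -> float:
    """Fraction of a clock speed increase that shows up in the benchmark score."""
    return score_gain_pct / clock_gain_pct


# Clock speed gains taken from the MHz ratings in the text.
athlon_clock_gain = percent_gain(650, 600)  # about 8.3%
piii_clock_gain = percent_gain(600, 550)    # about 9.1%

# Sysmark98 (Windows 98) score gains quoted in the text.
athlon_eff = scaling_efficiency(7.1, athlon_clock_gain)
piii_eff = scaling_efficiency(7.2, piii_clock_gain)

# Clock a Pentium III would need to close a 10.1% gap with perfect scaling.
piii_break_even_mhz = 600 * (1 + 10.1 / 100)

# Projected gain of a hypothetical PIII 666 over the PIII 600, ASSUMING the
# efficiency measured between 550 and 600 MHz carries over unchanged.
piii_666_gain = percent_gain(666, 600) * piii_eff

print(f"Athlon scaling efficiency: {athlon_eff:.0%}")
print(f"PIII scaling efficiency:   {piii_eff:.0%}")
print(f"PIII break-even clock (perfect scaling): {piii_break_even_mhz:.1f} MHz")
print(f"Projected PIII 666 gain over PIII 600:   {piii_666_gain:.1f}%")
```

With perfect scaling a Pentium III would need roughly 661 MHz to draw level, but since the scores grow more slowly than the clock, the projected gain of a 666 MHz part stays below the 10.1% lead of the Athlon 600.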
- Bryce 2
- Corel Draw 8.0
- Elastic Reality 3.1
- Excel 97
- Extreme 3D 2
- Naturally Speaking 2.02
- Netscape 4.05
- OmniPage Pro 8.0
- Paradox 8.0
- Photoshop 4.0.1
- PowerPoint 97
- Premiere 4.2
- Word 97
- Xing MPEG Encoder
The Results A – Office Application Benchmarks Continued
AMD put two benchmarks together, one for Windows98 and one for Windows NT. Those benchmarks are also application launchers and they consist of several different applications.
The lead is similar to what we’ve seen above with Sysmark98; if anything, Pentium III is a little bit further behind. Still, this is no benchmark that tries to put Pentium III at an unfair disadvantage. I normalized the numbers so that Athlon scores 100 points, which makes it easiest to compare the CPUs.
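Normalizing to a reference score of 100 works the same way for any mix of results. The sketch below shows one way to do it; the function name and the input values are placeholders for illustration only and are not measurements from this review. For timed tests, where a lower number is better, the ratio is inverted so that a higher normalized score always means a faster CPU.

```python
def normalize_to_reference(results, reference, lower_is_better=False):
    """Rescale results so that the reference entry scores exactly 100 points."""
    ref = results[reference]
    if lower_is_better:
        # Timed tests: a shorter time must yield a higher score.
        return {name: 100.0 * ref / value for name, value in results.items()}
    # Rate-based tests (for example frames per second): higher is better.
    return {name: 100.0 * value / ref for name, value in results.items()}


# Placeholder run times in seconds, NOT results from this article.
placeholder_times = {"reference_cpu": 50.0, "other_cpu": 62.5}

scores = normalize_to_reference(
    placeholder_times, "reference_cpu", lower_is_better=True
)
print(scores)
```

In this made-up example the slower CPU needs 25% more time and therefore ends up with 80 points against the reference’s 100.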
Those are the apps used in this benchmark:
- Geometrix 3Scan
The AMD performance tester for Geometrix takes a series of images captured with a standard CCD video camera and proprietary Geometrix hardware and converts these images into a polygonal textured model of the object. The time to process the 40 images and convert them into a 3D model is measured.
- Lizardtech MrSid
The AMD performance tester loads the washdc.TIF image file, which is one of the samples included in the application, and measures the time it takes to compress the file. The file is a panoramic, high-level view of Washington DC taken from an aerial perspective and is 400 Mbytes in size. The compressed format is 20 times smaller, or 20 Mbytes.
- Ligos LSX-MPEG Encoder
The AMD performance tester utility measures the time to convert a 30-second AVI file to the MPEG-2 format using the Ligos GO-Motion LSX-encoder.
- Windows Media technology (Netshow encoder)
The AMD performance tester utility measures the time to convert a 30-second AVI file to the MPEG-4 format using the Windows Media Technology MPEG-4 encoder.
- Adobe PhotoDeluxe 3.0
The AMD performance tester for Adobe PhotoDeluxe measures the total time the system takes to manipulate an image using the sizing functions, image rotation commands, and a variety of filters. Specific functions employed are image enlargements, rotations, colored pencil, blur, accented edges, funnel, ripple, despeckle, dust and scratches, page curl, crystallize, facet, pointillize, cloud texture, sharpen, sharpen edges, unsharp mask, diffuse, fine edges, glowing edges, wind and patchwork. After each function is performed, the action is undone so that all the filters are applied to an unfiltered image.
- Quake2 – crusher.dm2
The Quake II Crusher demo is a performance test, in frames per second, of id Software’s Quake II engine. The Crusher demo is the most graphically intensive situation that a user would be in, and it reflects computer performance in the worst possible scenario.
- Half-Life – Smokin’
The AMD performance tester runs the demanding Smokin’ demo in HW acceleration mode in OpenGL.
The Results A – Office Application Benchmarks Continued
The result under NT is almost identical, although the AMD-application launcher runs completely different software:
- Adobe Photoshop 5.02
The AMD performance tester measures the time it takes to apply twenty Photoshop imaging functions and filters on two different typical Photoshop images. The tests include Adobe functions such as adjusting image sizing, making color changes, performing special effects, and high-end specialized filters. Performance-demanding plug-ins are also included from the following third parties: Xenofex, Eye Candy plug-ins from Alien Skin and PhotoTools from Extensis. This test uses commercially available versions of both Photoshop and the third-party plug-ins.
- Kinetix 3D StudioMax 2.5 – R2.5.0.0
The AMD performance tester features a benchmark utility created by Kinetix to measure the application performance of 3D Studio Max 2.5. The utility measures the time to load a 1-Mbyte 3D Studio Max file with 2 Mbytes of texture files, perform various operations, and render a picture. In the scene supplied with this utility, three different types of materials are used: raytraced material, a procedural material, and some bitmap materials. There are four lights in the scene; rendering lights is a good metric of CPU power. The various operations used in this benchmark utility were specifically chosen to tax the system in most ways: loading an application, loading a file, manipulating the file, and rendering the file. The rendering operation is processor and RAM intensive, and therefore this is a good benchmark for any system.
- Dragon Naturally Speaking
The AMD performance tester features a benchmark utility created by Dragon Systems to test the speech engine from the upcoming family of Naturally Speaking products scheduled for release July 99. This utility measures the time to convert a .WAV file into text.
It is interesting to note that in this benchmark Athlon scores slightly better than Pentium III in Naturally Speaking and in Photoshop 5. Athlon is even quite a bit ahead of its competitor from Intel. The same two applications can also be found in Intel’s Business Application Launcher, and there Pentium III beats Athlon really badly. I’d suggest you draw your own conclusions from that.
The Results B – Floating Point Benchmarks
I’m still convinced that there’s hardly a better floating point benchmark than 3D-rendering software. 3D Studio Max has served us well for quite a while, and although it is optimized for Intel’s P6 architecture, it’s still a highly realistic test. 3D Studio Max was of course run under Windows NT.
Well, you may see why this is my favorite benchmark out of them all. Athlon 600 is no less than 45.3% faster than Pentium III 600! If rendering a complicated scene takes 36 hours with a Pentium III 600, it will take only 24 hours and 47 minutes with an Athlon 600. I guess that’s a serious difference! Athlon’s FPU is roughly 1.5 times as fast as Pentium III’s. Intel has not just lost its FPU crown, it’s way behind AMD now. You will see that Athlon’s performance in 3D games is also very remarkable as a result of its powerful FPU.
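The render time example follows directly from the speed advantage. Here is a minimal sketch of that conversion, assuming that ‘45.3% faster’ means Athlon completes 1.453 times as much work per unit of time; the helper names are invented for illustration.

```python
def render_time_on_faster_cpu(baseline_hours: float, speedup_pct: float) -> float:
    """Hours needed by a CPU that is `speedup_pct` percent faster than the baseline."""
    return baseline_hours / (1.0 + speedup_pct / 100.0)


def to_hours_minutes(hours: float) -> tuple[int, int]:
    """Convert fractional hours to whole hours and minutes, rounded to the minute."""
    total_minutes = round(hours * 60)
    return divmod(total_minutes, 60)


# 36 hours on the slower CPU, 45.3% speed advantage for the faster one.
athlon_hours = render_time_on_faster_cpu(36, 45.3)
h, m = to_hours_minutes(athlon_hours)
print(f"{h} hours and {m} minutes")  # prints "24 hours and 47 minutes"
```

Note that a 45.3% speed advantage shortens the job by about 31%, not by 45%, because the time saved is 1 − 1/1.453.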
The Results C – Multimedia Benchmarks
The first benchmark that you have to use if you want to know about MMX-performance is the good old Intel Media Bench or IMB. Let’s see how Athlon does in this one.
It is somewhat surprising how well Athlon does in this benchmark too. Athlon beats Pentium III in all four tests, although we wouldn’t have expected that after what I said in the architecture article. This result can probably be explained by Athlon’s superior parallelism and pipeline, since the latency of its MMX instructions is worse than Pentium III’s.
I was being rough on Athlon and let it go through Futuremark’s MultiMediaMark99, a benchmark that’s only handed out by Intel, because it’s full of SSE enhancements and lacks 3DNow! support completely. Thus it’s a highly unfair benchmark towards Athlon.
I have to admit, I was hoping that Athlon would still beat Pentium III, but the usage of SSE and the lack of 3DNow! didn’t really leave much of a chance for Athlon. It still doesn’t give a shabby result though.
The Results D – 3D Games
The anticipation is big. Gamers want to have the best hardware, and they don’t care if their CPU comes from Intel or from AMD. Let’s see if Athlon can satisfy our expectations here.
We’ve used low resolutions for those benchmarks because we don’t want interference from the graphics chip’s performance. At low resolutions the bottleneck is typically the CPU, whereas at high resolutions fast CPUs can easily push 3D chips to their limits.
In the latest Q3test version Athlon looks damn good. The unit is actually [fps]; sorry that I forgot that. Now Quake III Arena is supposed to have ISSE as well as 3DNow! support, but all in all I guess that this game simply likes a powerful FPU. That’s what Athlon has got, and so Pentium III is left quite far behind.
In Quake2 there is the option to run it on an AMD processor with or without 3DNow! support. You can see that Athlon easily smokes Pentium III, but you can also see that Athlon doesn’t benefit much from the 3DNow! enhancements. This is what I already explained in the architecture article.
The Results D – 3D Games Continued
Half-Life is no exception: it runs a lot faster on the Athlon than on the PIII. We are seeing frame rate differences that come close to the results in the 3D Studio Max comparison. Athlon is almost 50% faster than Pentium III.
In Shogo the difference isn’t quite that big, but Athlon is still ahead of Pentium III. Shogo gets you close to the 3D chip’s limits, which is one reason why the difference is so small.
To summarize, Athlon’s powerful FPU plus its 3DNow! unit make it the best gaming processor currently available. I’m sure that Intel will love to hear that.
Benchmark Summary
The results speak for themselves. Athlon is able to beat Pentium III in almost every benchmark, and it leaves the PIII particularly far behind in Intel’s former strongholds. If Athlon were only faster in office applications, most people wouldn’t have said much about it, but its strength in the floating point department makes it a great gaming CPU. It also offers a wonderful platform for workstation environments, especially for CAD and 3D rendering, and in an SMP environment with a large L2 cache it would be a beautiful server CPU as well.
AMD will have a hard time satisfying the demand for Athlon, because whoever can afford it will try to get one as soon as possible, including myself. Let’s hope that the platforms are reliable and that SMP systems will be ready soon enough too. My system has hosted Intel CPUs since I started this website; hopefully it will not take too long until an Athlon takes their place.
Intel can now only hope for AMD’s old delivery weakness. Athlon has shown Intel that it takes a bit more than some little changes and additions to an old design here and there. I can only repeat that I doubt that the delayed Coppermine and Camino will be Intel’s way out of second place. It will take Willamette to regain the lead. Until then AMD may have some nice ideas too, though.