Original URL: https://www.theregister.com/2001/01/10/p4_mihocka_has_attacks1/

P4 Mihocka has attacks on him shocka

Everything not as it seems

By Mike Magee

Posted in Bootnotes, 10th January 2001 13:48 GMT

Readers Write Mr Mihocka's lengthy piece slamming the Intel Pentium 4 platform as a crock of doodoo at the end of a sepia rainbow yesterday drew a big response from our readership.

Even the Citizen himself pitched in, in response to a detailed rebuttal prepared by a reader. And if you're not interested in programming, you can stop reading here, OK?

Also, if you're a little confused as to whether to buy a Pentium 4 or not, and are just an ordinary punter on the Clapham Omnibus, then much of the following won't help, we're very much afraid. No sir. You should read our rather excellent Buyer's Guide.

Pentium 4 Xeons but who needs the P4?
There has to be a P4 Xeon sooner rather than later, and it has to kick massive butt over the earlier Xeons. The only way I see to do it is to put back into the P4 all that was stripped, and maybe more.

But I'm not holding my breath. My current machine, with dual Celeron 366s overclocked to 550, will last until I get my next machine, which will have dual DDR Athlons (when they finally arrive). And that, in turn, should tide me over nicely until the Sledgehammer debuts.

Since I run only Linux these days (with Wine/Win4Lin/Plex86/VmWare when needed for my few Windows apps, mainly Office 95 and MathCad 8), and since all the next-generation processors (Alpha, Itanium, Sledgehammer) will run Linux before they run any native MS OS, who needs the P4?

Citizen Mihocka replies to a senior programmer at a big games firm

Programmer Just a couple of points about the P4 article. Mistake #2 - The memory bandwidth of the P4 is far and above that of any x86 CPU to date (2-3 times more than an Athlon or P3) - it makes little sense to have an L3 cache at the present time for the relative performance gain it would give for most applications - the P4 is not intended to be a replacement for a Xeon.

Mihocka Why? The original spec for the Pentium 4 sure as heck made it a Xeon class processor. Now it's basically a very expensive consumer-level non-SMP game machine.

Programmer And your opinion is biased because a large L3 cache is good for an emulator such as yours. I'm sorry but memory bandwidth is where it's at for multimedia - just take a look at the design of the PS2 or XBOX and understand!

Mihocka Not at all, my emulators have an extremely small working set because I've tuned them in assembly. There is virtually no performance degradation in my emulators between a Celeron with 128K cache and 66 MHz FSB and a Pentium II with 512K cache and 100 MHz FSB. I verified this between a 366 MHz Celeron and a 400 MHz Pentium II. So the L2 or L3 cache size is pretty much irrelevant as far as my emulators go since they are not heavily memory intensive. The code is under 400K in size and hand optimized to have a lot of locality. The emulator core code and the data working set easily fit in 128K of memory. Remember, I come from the days when real programmers coded in assembly and you could fit an entire video game in an 8K ROM.

Programmer Mistake #3 - "Decoder is crippled" - the trace cache provides a 95% instruction hit rate - so given a single decode per clock the average slowdown for this 'feature' in the real world is about 1.66% (5/3)... Decoders take a lot of silicon space, so the trace cache is a very good design compromise, not a bad one - (with the die shrink to 0.13 micron I would imagine the trace cache will probably be increased in size or the compression of the micro-ops improved to reduce this 'massive' slowdown).

Mihocka I have to totally disagree with that math. Using the 4-1-1 rule and optimization technique, the Pentium III can decode 3 instructions and 3 micro-ops per cycle. That same code takes 3 cycles to decode on the Pentium 4. Now, when you have a cache miss, you don't just miss one instruction, you typically miss an entire function or at least an entire basic block. Say you miss a basic block of 10 instructions. That will decode in 4 to 10 cycles on the Pentium III (closer to 4 if the code uses mostly 1 micro-op instructions) vs. a fixed 10 on the Pentium 4. That's an average of, say, 3 cycles per 5% of the instructions. Even at an execution rate of 1 instruction per cycle, that's a 3% penalty. More optimized code that executes multiple instructions per cycle translates into a bigger penalty. So the better you optimize your code, the more the decoder limitation hurts you.

Yes, if all x86 code consisted of complex instructions that force a Pentium III to decode one instruction per cycle, then there would be no penalty. But using your 95% cache hit number, the penalty is on the order of 3% and higher depending on how well optimized the code is. A few percent here, a few percent there. A clock cycle here, a clock cycle there. The various bottlenecks I describe easily add up and account for the kind of 30% speed degradation that could drop the Pentium 4 to below the speed of a Pentium III or Athlon.

Programmer Mistake #6 - your example is incorrect. A multiply by 10 is achieved by
; ebx -> value to multiply by 10
lea ecx,[ebx+ebx]
lea ebx,[ecx + ecx * 4]
which involves the address generation unit and is exactly equivalent in speed to the multiply, by your definition.

Mihocka (getting annoyed) Martin, you're an IDIOT, because not only is my example correct, yours is a perfect example of how code will need to be rewritten to take advantage of the Pentium 4. I even gave you the answer and you botched it up. Why do you say my example is incorrect? I state that to multiply by 10 you use shifts to quickly multiply by 2 and 8 and then add the results. Your piece of code works but is actually suboptimal for the Pentium 4. Stop and think why, or just read the answer below.

OK, my words taken literally would translate into this code:

; ebx -> value to multiply by 10
lea ecx,[ebx*2]
lea ebx,[ecx + ebx * 8]

Now, as good assembly language programmers, we know that on both the Athlon and the Pentium III this code executes 1 instruction per cycle (due to the data dependency in ECX), and thus will execute in two cycles on these chips.

We also know as good little programmers that [ebx+ebx] is a shorter form encoding of [ebx*2]. Same time to execute, 1 cycle, since the AGU and barrel shifter take care of that. So my optimal code for multiplying by 10 is:
; ebx -> value to multiply by 10
lea ecx,[ebx+ebx]
lea ebx,[ecx + ebx * 8]

which again takes 2 cycles on both the Pentium III (and whole P6 family) and the Athlon.

Now, why is your example different from my example on the Pentium 4? Well, if you read the interview with the Intel engineer, one of the last-minute cuts was the two AGUs. An effective address calculation is now broken up into micro-ops and fed through the integer units. Your code breaks down as
ADD EBX+EBX into ECX
SHIFT ECX by 2 into temp
ADD ECX+temp into EBX

there are two data dependencies there, so the instructions have to execute serially over 3 cycles, and this takes a total of about 4 cycles on the Pentium 4. How do I know this? I tried sequences like this already and found this problem. This is why I bitch in my article, because had we tried to use a more complex constant that required two shifts we'd have taken 6 or 7 cycles.

Now, my example, which multiplies by 2 and by 8 and adds the results, has only a single data dependency, meaning two of the micro-ops execute at the same time, one in the double speed ALU, one in the shifter ALU:
ADD EBX+EBX into ECX
SHIFT EBX by 3 into temp
ADD ECX+temp into EBX

and sure enough, executes in 3 cycles instead of 4 on the same Pentium 4. The lack of the barrel shifter and AGUs now means that new data dependencies are created where none existed before, since the address generation has to be broken up into several smaller micro-ops. What looked like perfectly valid "multiply by 10" code, and is on older processors, is slower on the Pentium 4 than an apparently identical code sequence.

I did a whole slew of these address mode and shift tests before I added Mistake #6 to my list. Scaled index registers are NO LONGER FREE OPERATIONS like they have been on every single x86 processor released since 1986. My code sequence runs 50% slower, yours runs 100% slower than it should. That's the penalty you now pay on every shift and every table lookup that uses a scaled index register. You wanna tell me that scaled addressing modes aren't used in regular x86 code? You still wanna tell me shifts don't show up in typical code (now that you have to take address generation into account, plus high registers, plus bitfield operations)?

Programmer Mistake #7 - the solution is certainly not worse for 99% of software in existence - the shifting slowdown you complain about happens on access to high parts of x86 registers (such as AH, CH etc) - this is *extremely* rare in code at this time.

Citizen Mihocka Not quite as rare as you think. What about every time a bit in the status register is checked after a LAHF or FNSTSW AX instruction? For example, every time you do a floating point comparison the Visual C++ 6.0 compiler uses the FNSTSW AX code sequence followed by a test on AH:
if (d != 0.0)
dd 45 f4          fld    QWORD PTR _d$[ebp]
dc 15 00 00 00    fcom   QWORD PTR __real@8@00000000000000000000
df e0             fnstsw ax
f6 c4 40          test   ah, 64    ; 00000040H

The VC++ compiler also likes to pack two byte values into the upper and lower halves of a register, such as when dealing with variables of type char.

And worst of all, when performing bitfield operations on small fields, the compiler will determine that if the bitfield is in bits 8 through 15 it can quickly use the high byte register to perform masking operations on the bitfield instead of using a 32-bit mask on the entire register. This code is common, and also bad because it generates a partial register stall on the Pentium III and a shift operation stall on the Pentium 4. No penalty on the 486 or classic Pentium, and once again the Athlon is the only gigahertz processor that does it penalty free.

Just demonstrates my point that in solving one problem Intel merely introduced another, and Microsoft's compilers are sadly out of date.

[Cut a heap of this stuff out will you Mike, it's getting tedious, Ed]

Mihocka... snipped ...rewrite or recompile legacy code. As developers we have better things to think about than rewriting our code every year to make up for shortcuts in Intel's chips.

Programmer From my experience, the only place partial register access occurs is in legacy 486/Pentium assembly code (such as an 8-bit software renderer). I could go on, but while I would quite agree that there have been some compromises made with the P4, you seem to lack the insight to realise what a great design the P4 is for the future. (In fact I suspect the article is just sour grapes because your emulators perform poorly on the P4, because a lot of the tricks you used for previous generation CPUs no longer work on this CPU.)

Mihocka If it's a great design "for the future" then you're agreeing with the fact that the current implementation is inadequate and no one in their right mind should be fool enough to spend $4000 on a machine. That's what I'm saying. You agree or disagree? Sour grapes? [Snipped... ]
Whew.

Pentium 4 is really OK

I won't go through the entire article and point out everything I disagree with. Here are the main points, IMO, relating to the Pentium 4:

1) Citizen thinks Intel could not possibly be stupider for giving the P4 a small (8k) L1 cache. The fact is that the small size of the cache allows it to have a 2 cycle latency, instead of the 3 cycle latency found in the P3 and Athlon. That means that programs with a working set of 8k or less run much faster. (And this is probably a s***load of programs.)

2) Citizen rips on the single decoder, too. The fact is that the vast vast vast majority of code runs out of the P4's L1 (trace) cache, so it never goes through the decoder. If the desired code isn't in the cache, it must be read from memory (probably main memory). The P4's single decoder can decode instructions much faster than they can be read in from the dual RDRAM channels, so this "bottleneck" is a complete non-issue.

3) It's true that many instructions take longer to execute on the P4, but this is almost certainly due to Intel's goal of clock speed headroom (which I didn't see mentioned). Oddly enough, Citizen doesn't mention the double pumped ALUs in the P4, which decrease the latency of many common instructions. This stuff is all a big trade off, and Citizen only complains about the parts that were, well, traded off. Anyway, I'm not a big fan of the P4, but it isn't nearly the P. O. S. that Citizen says it is.

Stop this Marmosetzilla Garbage

I'm sure I can't be alone in thinking that your articles would be rather more informative if you dropped, or at least explained, the cutesy nicknames for companies and their products. You write well enough to be entertaining without dropping "Chipzilla" (Intel?), "Hipzilla" (AMD?), "Marmosetzilla" (VIA?) and the like into your prose. I realise that this is The Register, and not the Economist, but please, write for comprehension.

(Strange, we're sure we saw the Economist using the term Chipzilla itself not so long ago, ah well. And like we already said before, whew...) ®