P4: total dog or really cooking?
Reg reader puts Willamette on trial
Is Pentium 4 any good?
Some say no because its FPU doesn't have enough grunt. Others say yes because that FPU is optimised for 144 new SSE2 instructions and performs extremely well - when code has been optimised to use them.
What's been missing up to now has been a before and after example of a real-world application showing what difference SSE2 optimised code makes.
Reader John Welter of North West Group, a Canadian Geomatics firm specialising in orthophotography - stretching accurate photographs of the Earth's surface over elevation models of the same area - volunteered us some interesting information on his company's experiences with an early P4 system.
When using the original code, a P4 system took a glacial 19 hours compared with just under 13 hours for a 933MHz PIII. But with code recompiled to use SSE2, the P4 galloped through the test in a shade over seven and a half hours.
"It all comes down to the fact that running today's code the P4 is a dog," Welter told The Reg. "But once the code is optimised for it then it really can wake up and perform quite nicely.
"A P4 at 1.5Ghz is now faster when running optimised code then our Alpha production boxes by a sizable margin, where those same Alpha boxes outperformed all our P3 based systems.
"Intel did not take the x87 FPU performance as a prime design goal in the P4. They focused on the SSE/SSE2 unit much more and made sacrifices to the X87 FPU side of things to gain more SSE2 performance. Some may argue this was a bad trade-off but the improvements they have managed on the SSE2 are very impressive.
"Geomatics is extremely CPU intensive and pretty much 100 per cent bound by CPU performance. For this reason we obtained an early 1.5GHz P4 despite the inflated costs in an attempt to determine how much added performance it would give us in reducing our production times.
"The results are a bit staggering and maybe of interest to you: Baseline: Intel OR840, PIII-933, 1GB RDRAM (4 x 256MB, 800MHz), 144Gb of RAID0 storage (4 x 36GB 10,000rpm U160 SCSI drives off an Adaptec 29160 controller)
"Process the "Calgary" test data set on this machine using original binary: 12.8 Hrs.
"Intel 850 motherboard, P4-1.5GHz, rest of system exactly the same as above. Process the "Calgary" test data set on this machine using original binary: 19.4 Hrs.
"Process the "Calgary" test data set on this machine using a recompiled P4 optimised binary (Intel's V5 compiler plug in for Visual Studio): 7.6 Hrs. (All testing was done under Windows 2000 with SP1.)
"As you can see once SSE2 optimisation is enabled on the P4 it can really cook performance-wise. But, when using the old X87 FPU instructions it is a total dog that even a Celeron could possibly outperform.
"It's too bad Intel did not keep X87 FPU performance as a prime goal and improve it as well as SSE2 as it would have really helped out with legacy code that can't easily be optimised. By not doing this the P4 is a processor for 'new' applications and not a good solution for legacy applications."
Screaming Sindy's second set of extensions
SSE2 extends the SIMD capabilities that MMX technology and SSE provided by adding 144 new instructions including 128-bit SIMD integer arithmetic and 128-bit SIMD double-precision floating-point operations.
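To give a flavour of what 128-bit double-precision SIMD looks like to a programmer, here is a minimal sketch in C using the SSE2 intrinsics from `<emmintrin.h>`. The function `scale_and_add` and the operation it performs are our own invention for illustration and have nothing to do with North West Group's code. We assume a compiler that defines `__SSE2__` when SSE2 is enabled, as GCC and Clang do; anywhere else the sketch quietly falls back to ordinary scalar code.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* Illustrative only: out[i] = a[i] * scale + b[i] for n doubles.
 * The SSE2 path handles two doubles per 128-bit register; the scalar
 * loop mops up any odd element, or does all the work without SSE2. */
void scale_and_add(const double *a, const double *b, double scale,
                   double *out, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (a != NULL && b != NULL && out != NULL));
#if defined(__SSE2__)
    const __m128d vscale = _mm_set1_pd(scale); /* scale in both lanes */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);      /* unaligned 2 x double load */
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vr = _mm_add_pd(_mm_mul_pd(va, vscale), vb);
        _mm_storeu_pd(out + i, vr);            /* unaligned 2 x double store */
    }
#endif
    for (; i < n; ++i)
        out[i] = a[i] * scale + b[i];
}
```

Each pass through the SSE2 loop issues one multiply and one add to cover two array elements, where the scalar loop needs one of each per element. Whether that shows up as a real speed-up depends on the compiler, the data and the memory system, so treat this as a sketch of the idea and not a benchmark.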
The aim of the new instructions is to reduce the overall number of instructions required to execute a particular program task, which in turn can contribute to an overall performance increase. They can accelerate a broad range of applications, including video, speech, image and photo processing, encryption, and financial, engineering and scientific applications.
The Single Instruction Multiple Data (SIMD) integer support introduced with MMX has been extended from 64-bit to 128-bit registers, which doubles the effective execution rate of SIMD integer operations.
In addition to the new SSE2 instructions, the original (Katmai) SSE instructions have been enhanced to support arithmetic operations on multiple data types including double and quad words. SSE2 instructions are principally aimed at providing better performance when running software such as MPEG-2, MP3 and 3D graphics.
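As a rough illustration of the wider integer registers, the sketch below brightens 8-bit pixels sixteen at a time with a saturating add, where a 64-bit MMX register would hold only eight. The function `brighten` is a hypothetical example of ours, and as before we assume the compiler defines `__SSE2__` when SSE2 is available (GCC and Clang do), with a scalar loop covering everything else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics: __m128i, _mm_adds_epu8, ... */
#endif

/* Illustrative only: add delta to every 8-bit pixel, clamping at 255. */
void brighten(uint8_t *pixels, size_t n, uint8_t delta)
{
    size_t i = 0;
    assert(n == 0 || pixels != NULL);
#if defined(__SSE2__)
    const __m128i vdelta = _mm_set1_epi8((char)delta); /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, vdelta); /* unsigned saturating add, 16 bytes at once */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
#endif
    /* Scalar tail, and the whole job when SSE2 is not available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The saturating add is the sort of operation image code wants anyway, since a pixel that overflows should clamp to white and not wrap round to black. Doing it on sixteen bytes per instruction is where the doubling over MMX comes from.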
Intel released new compilers a few weeks ago (see New compilers for P4, Itanic), adding support for P4 and SSE2.
AMD likes the cut of Sindy's jib
AMD has its own set of SIMD instructions, 3DNow!, which competes with Intel's original Katmai SSE instructions. But AMD has now decided that it will make the most of Intel's work on SSE/SSE2, and instead of extending 3DNow!, will use SSE2 in its forthcoming Hammer processor family.
There had been some speculation that AMD would only use a subset of SSE/SSE2 in Hammer, but speaking to hardware site Tom's Hardware Guide at Comdex last week, an AMD spokesman said:
"We anticipate our implementation of SSE and SSE2 in the Hammer family [will be] complete and fully compatible. The Hammer family will continue to support 3DNow! and the x87 instruction set used in the current Athlon family."
Chimpzilla's adoption of Screaming Sindy is not only good for AMD: it will also speed the efforts of software vendors to produce SSE2 optimised code and provide a more level playing field on which to compare the relative performance of AMD and Intel processors. ®