Intel tweaks SSE 4 to speed text processing
Intel's 45nm 'Nehalem' processor design will incorporate the second generation of the chip maker's SSE 4 technology. For now, the company is calling the post-'Penryn' Streaming SIMD Extensions instruction set SSE 4.2.
Nehalem's implementation of SSE 4 essentially matches that of Penryn. The key additions centre on the Application-Targeted Accelerators (ATAs) Intel introduced as part of SSE 4. Penryn got two of these; Nehalem will get seven more.
Nehalem's ATAs centre on text and string processing, Intel said yesterday. The need to accelerate text handling may sound rather unnecessary in this era of pervasive multimedia and intensive 3D graphics apps, but Intel claims the ATAs will benefit a range of important tasks, from virus signature scanning to parsing XML files.
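To give a feel for the kind of work involved, here is a minimal sketch in plain Python of a "find the first byte that matches any of these characters" scan, the sort of inner loop an XML parser or signature scanner runs constantly. It is a scalar model for illustration only, not Intel's implementation: the function names are hypothetical, and the 16-byte block width is an assumption based on SSE registers being 128 bits wide.

```python
BLOCK = 16  # assumed block width in bytes (one 128-bit SSE register)


def first_match_in_block(block: bytes, needles: bytes) -> int:
    """Return the index of the first byte in `block` that equals any byte
    in `needles`, or len(block) if there is no match."""
    for i, b in enumerate(block):
        if b in needles:
            return i
    return len(block)


def find_any(text: bytes, needles: bytes) -> int:
    """Scan `text` one block at a time and return the offset of the first
    byte found in `needles`, or -1 if none is present."""
    for start in range(0, len(text), BLOCK):
        block = text[start:start + BLOCK]
        idx = first_match_in_block(block, needles)
        if idx < len(block):
            return start + idx
    return -1
```

In software the inner loop costs one comparison per byte per candidate character. The idea behind a hardware accelerator of this kind is to do the whole per-block comparison in a single instruction.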
Its pitch is that these are everyday routines, and the faster Nehalem can run them, the sooner the tasks will be completed and the sooner the chip can shut down on-die components to conserve energy.
Intel's 'Nehalem': better at text processing than Penryn
Not that Nehalem's design ignores more advanced data types. Chips based on the design will also speed access to data that doesn't sit comfortably in alignment with Nehalem's cache structure, such as multimedia code and data. That should allow the CPU to process such information more quickly, sending out the frame to be rendered then powering down for a longer period - or working on other tasks - before it needs to pick up and process the next frame, for example.
Again, the emphasis is not on raw processing - we know modern CPUs can do video smoothly - but on getting the job finished more quickly, the better to improve power efficiency.
That's the logic behind the re-introduction of HyperThreading with Nehalem, which will also be able to handle 33 per cent more micro-ops - the Core-specific instructions the x86 instructions are decoded into when they're loaded into the CPU - at any given time than the Penryn architecture can.
Forget Nehalem ...
Intel still haven't managed to get the previous generation into production, despite promises. Where is the Q9450, for instance? Has ANYone got one, apart from engineering samples?
Re: Re: Logic behind HT
HT works as you suggest; however, there is more to it than that.
Each core up till now had 3 separate instruction pipelines. On a single thread the instructions for the thread would be analysed and then split so that independent ones (that don't rely on the previous result) would be run on different execution cores. I believe this also happened with branches in some cases where both sides of the branch would be evaluated whilst it was waiting for the result of the branch test to come back.
On the NetBurst architecture the execution pipeline was very long, so it took a long time to get branch results back. Splitting up the instructions into independent ones was also harder because it took so much longer to get results back, making the dependency chain longer. As a result, quite a lot of the time one or more of the execution cores were sat idle. HT could take advantage of this because the second thread is always independent from the first, so the instructions could be interleaved.
However the newer Core architectures have a much shorter execution pipeline so there was less idle space in the execution cores to be taken advantage of by hyper threading. Now though they're adding a fourth execution core which they must feel means there's enough spare slots on the cores to support another thread.
This also should mean for power management that if the CPU/OS detects that two threads run happily on the same core then additional cores can be shut down, saving on power usage.
Also as we write more and more core/thread hungry apps then HT will be just as useful for multi-core as it was for single core. It's also probably pretty handy in a server environment too where the cores are often IO bound and so having more hardware threads is a bigger win.
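The slot-filling argument in the comment above can be sketched as a toy model. This is an illustration under loud assumptions, not a description of how Intel's scheduler works: each thread is reduced to a hypothetical list giving how many independent instructions it has ready in each cycle, the core has a fixed number of issue slots (3 here, following the comment's "3 separate pipelines"), and the second thread only gets whatever the first leaves unused.

```python
WIDTH = 3  # assumed issue slots per cycle, per the comment's "3 separate pipelines"


def slots_used(ready_per_cycle, width=WIDTH):
    """Slots filled when a single thread runs alone on the core."""
    return sum(min(ready, width) for ready in ready_per_cycle)


def slots_used_smt(thread_a, thread_b, width=WIDTH):
    """Slots filled when thread_b is allowed to use the slots that
    thread_a leaves idle in each cycle."""
    used = 0
    for a, b in zip(thread_a, thread_b):
        from_a = min(a, width)           # first thread issues what it can
        from_b = min(b, width - from_a)  # second thread fills the remainder
        used += from_a + from_b
    return used


# Hypothetical workloads: a stall-heavy thread with long dependency chains,
# and a second, independent thread.
stalled = [1, 0, 0, 1, 0, 0]
other = [1, 1, 0, 1, 0, 1]
alone = slots_used(stalled)
shared = slots_used_smt(stalled, other)
```

With a stall-heavy first thread, the second thread picks up slots that would otherwise sit idle. If the first thread already fills every slot (for example `[3, 3]`), the second thread gains nothing, which is the commenter's point about shorter pipelines leaving less idle space to exploit.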
Re: Logic behind HT
HT allows two threads to utilize different areas of the execution core. Hypothetically, one could have a thread running an integer operation, while another runs a floating point operation simultaneously. You are right, that there is less to gain from HT in a multi-core CPU, but I suppose it's like Intel says "on getting the job finished more quickly, the better to improve power efficiency." To me HT isn't that important, and it doesn't bother me one bit that my Core 2 Duo doesn't have it, but I guess it doesn't hurt to have it either.
Logic behind HT
I'm not sure I get that one. Hyperthreading was very useful prior to the appearance of multi-core CPUs, yet it scaled pretty badly. It added performance to 1 or 2 CPU machines, however you wouldn't see any gain in performance with 4 CPUs.
I just wonder how that's going to work - and actually improve performance with Quad Core CPUs.