
Nvidia GeForce 7800 GTX

More evolution than revolution?

G70 vs NV40

With 24 of those new ALUs inside a full G70, compared to 16 in a full NV40, there's three times the MADD horsepower per cycle. That ALU throughput is key to G70's new performance in fragment programs. Again, the units are dual-issue, so the sub-units can combine to issue two independent instructions on a four-component input vector, with either a 3:1 or 2:2 instruction split. For example, the hardware can issue a MADD across three components of the vector and square the fourth, in a single cycle.
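As a rough illustration of where the three-times figure and the 3:1 split come from, here is a back-of-envelope sketch. It assumes two MADD-capable sub-units per G70 fragment unit against one per NV40 fragment unit; that ratio, the function names and the constants in the example MADD are assumptions made for illustration, not anything Nvidia has supplied.

```python
# Back-of-envelope model of peak MADD issue rate and a dual-issue split.
# Assumption: 2 MADD-capable sub-units per G70 fragment unit, 1 per NV40 unit.

def madds_per_clock(fragment_units, madd_subunits_per_unit):
    """Peak vector MADD issues per clock across the whole fragment array."""
    return fragment_units * madd_subunits_per_unit


def co_issue(vec, split, op_a, op_b):
    """Apply op_a to the first split[0] components and op_b to the rest,
    modelling one dual-issue slot on a four-component vector."""
    if len(vec) != 4:
        raise ValueError("expected a four-component vector")
    if split not in ((3, 1), (2, 2)):
        raise ValueError("only 3:1 and 2:2 splits are described")
    n = split[0]
    return [op_a(x) for x in vec[:n]] + [op_b(x) for x in vec[n:]]


def madd(x):
    """a*b + c with hypothetical constants b=2, c=1."""
    return x * 2.0 + 1.0


def square(x):
    return x * x


nv40 = madds_per_clock(fragment_units=16, madd_subunits_per_unit=1)
g70 = madds_per_clock(fragment_units=24, madd_subunits_per_unit=2)
print(g70 / nv40)  # 3.0

# MADD on three components, square on the fourth, in one modelled issue slot.
print(co_issue([1.0, 2.0, 3.0, 4.0], (3, 1), madd, square))
# [3.0, 5.0, 7.0, 16.0]
```

The model counts issue slots only; it says nothing about register pressure or texturing latency, both of which can keep real shaders well short of the peak.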

Finally, the G70's register file isn't bigger than the NV40's, so FP32 operations might still experience register space pressure and not run at full theoretical speed.

The texture samplers attached to the fragment units are no different in G70, compared to NV40. However, Nvidia tells me that when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory. So while fragment programs always seek to hide texturing latency since it's always much slower than a single cycle, there's less latency to hide with G70. The samplers still perform single-cycle bilinear filtering, two-cycle trilinear and up to 16x anisotropic (128-tap) filtering, just like NV40 does.
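The 128-tap figure for 16x anisotropic filtering can be reproduced with simple arithmetic, if you assume each anisotropic probe is a full trilinear sample. That assumption, and the function names, are mine for the sake of the sketch rather than a description of the sampler's internals.

```python
# Tap-count arithmetic for the filtering modes mentioned above.
# Assumption: each anisotropic probe is one trilinear sample.

BILINEAR_TAPS = 4  # 2x2 texel footprint from a single mip level


def trilinear_taps():
    """Two bilinear lookups, one from each of two adjacent mip levels."""
    return 2 * BILINEAR_TAPS


def anisotropic_taps(max_aniso):
    """Up to max_aniso trilinear probes along the axis of anisotropy."""
    return max_aniso * trilinear_taps()


print(trilinear_taps())      # 8
print(anisotropic_taps(16))  # 128
```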

I reckon the L2 texture cache has increased in G70 (to around 12KB) to go with the increase in fragment units. It's also been tweaked for better cache re-use with larger textures, decompressing those larger textures into L1 faster and possibly offering more granularity in cache access by the GPU, which reduces texture bandwidth and speeds up rendering.

The texture hardware in G70 supports ATI's 3Dc normal-map compression scheme and it still supports accelerated DST rendering, as made famous by 3DMark05.

A full G70 (as you get in GeForce 7800 GTX) has two more vertex units than NV40, each with FP32 scalar and vector ALUs and a texture sampler (no FP filtering, and needed to comply with the VS 3.0 spec). Since they're VS 3.0-capable vertex processors, they can do dynamic branching for the vertex programs. Each vertex unit can issue a vec4 MADD per clock, and scalar performance from the FP32 scalar hardware is apparently up by 20 per cent.

So that's 33 per cent more vertex units, each with more performance per clock, with a presumably tweaked vertex fetch unit (although Nvidia won't confirm it either way) to go with them. The triangle setup and rasteriser - hardware that feeds the fragment units - is apparently optimised via the use of a new raster pattern, but again Nvidia doesn't go into details.
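For the sums behind the vertex figures, a minimal sketch follows. The unit counts of six and eight are inferred from "two more" equalling 33 per cent, and the combined speed-up is a naive upper bound that assumes perfect scaling; both are assumptions for illustration.

```python
# Vertex-unit arithmetic. Unit counts are inferred, not quoted by Nvidia.

NV40_VERTEX_UNITS = 6   # assumption: "two more" being 33 per cent implies 6
G70_VERTEX_UNITS = 8    # assumption: 6 + 2
SCALAR_UPLIFT = 1.20    # "up by 20 per cent" per unit


def percent_increase(old, new):
    """Relative increase from old to new, in per cent."""
    return (new - old) / old * 100.0


def naive_scalar_speedup(old_units, new_units, per_unit_uplift):
    """Upper bound on per-clock scalar vertex throughput, assuming the
    extra units and the per-unit uplift both scale perfectly."""
    return (new_units / old_units) * per_unit_uplift


print(round(percent_increase(NV40_VERTEX_UNITS, G70_VERTEX_UNITS)))  # 33
print(round(naive_scalar_speedup(NV40_VERTEX_UNITS, G70_VERTEX_UNITS,
                                 SCALAR_UPLIFT), 2))                 # 1.6
```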

The ROP hardware gets a tweak for G70, too. It can now do gamma-correct multi-sampling entirely in hardware, whereas I think NV40 did some pre-processing for gamma correction in the fragment hardware for GeForce 6 and FX 4400. It can also anti-alias textures with an alpha component that's used for transparency effects. Nvidia's favourite example is the chain-link fence: the fence isn't made up of geometry, rather it's a texture with some see-through parts (using the alpha component of the texture surface). G70 can sample inside a pixel for alpha, anti-aliasing internally to the pixel if it detects alpha. Doing so lessens performance slightly, per clock, as memory bandwidth is consumed.
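To see why sampling alpha inside a pixel helps with something like the chain-link fence, consider this toy comparison. It is a sketch of the general idea only: the 0.5 threshold, the four-sample layout and the function names are hypothetical, and nothing here describes how G70 actually implements the feature.

```python
# Toy model: one alpha test per pixel versus several inside the pixel.
# Threshold and sample counts are hypothetical.

def single_alpha_test(centre_alpha, threshold=0.5):
    """Classic alpha test: the whole pixel is either kept or discarded."""
    return 1.0 if centre_alpha >= threshold else 0.0


def sub_pixel_alpha_coverage(alpha_samples, threshold=0.5):
    """Fraction of sub-pixel samples that pass the alpha test, giving a
    partial coverage value along the edges of see-through regions."""
    if not alpha_samples:
        raise ValueError("need at least one sample")
    passed = sum(1 for a in alpha_samples if a >= threshold)
    return passed / len(alpha_samples)


# A pixel straddling the edge of a fence wire: half opaque, half see-through.
edge_pixel = [1.0, 1.0, 0.0, 0.0]
print(single_alpha_test(edge_pixel[0]))      # 1.0
print(sub_pixel_alpha_coverage(edge_pixel))  # 0.5
```

The single test yields a hard on/off edge, while the sub-pixel version produces an intermediate coverage value at the cost of extra samples to fetch and resolve.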

The ROPs still have a two-loop limit for Z sampling, with up to two Z samples per cycle. So the hardware limit of 4x multi-sample anti-aliasing remains (two samples per cycle, two passes through the Z ROP). Nvidia isn't keen to say if its Z and colour compression schemes have had any tweaks, although if they have I assume they're fairly mild. The ROPs, which run at core clock in G70, as they do in NV40, can still blend floating point render targets and off-screen buffers, but those floating point targets can't be multi-sampled. Alpha-to-mask MSAA is a decent trade-off compared to full MSAA of float targets, though. Any increase in IQ when anti-aliasing on Nvidia hardware is welcome, given its stiff competition in that respect.
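The 4x multi-sampling ceiling is just the product of the two limits quoted above. A trivial sketch, with a hypothetical function name:

```python
# MSAA ceiling from the Z ROP limits described in the text.

def max_msaa_samples(z_samples_per_cycle, max_loops):
    """Most samples per pixel the Z ROP can cover within its loop limit."""
    return z_samples_per_cycle * max_loops


print(max_msaa_samples(z_samples_per_cycle=2, max_loops=2))  # 4
```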

Like NV43 and similar, G70's ROP count is less than the fragment unit count. While G70 can theoretically output 24 pixels from its fragment hardware, per cycle, there's only 16 input buckets (the ROPs) to dump those pixels into, for further processing. Nvidia's reasoning is that when you're shader bound, pixels output per cycle rarely hits that peak. A ROP count of 16 lets Nvidia save transistors. Should more than 16 pixels be ready in a cycle, they're buffered until the raster output hardware is ready to work on them. Peak pixel and texel fill-rate is therefore nearly identical to the 5MHz slower 6800 Ultra, which shares the ROP count of G70.
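The buffering behaviour and the fill-rate comparison can be sketched with a toy model. The per-cycle pixel counts are invented for illustration, and the clocks (430MHz for the 7800 GTX, and therefore 425MHz for a 6800 Ultra that is 5MHz slower) are assumptions; the simulation ignores everything else that happens in the raster back end.

```python
# Toy model of 24 fragment units feeding 16 ROPs, plus peak fill-rate sums.
# Workload and clock figures are assumptions for illustration.

def simulate_rop_backlog(pixels_ready_per_cycle, rop_count=16):
    """Return (pixels_written, pixels_still_buffered) for each cycle."""
    backlog = 0
    history = []
    for ready in pixels_ready_per_cycle:
        backlog += ready                   # new pixels join the buffer
        written = min(backlog, rop_count)  # ROPs drain at most rop_count
        backlog -= written
        history.append((written, backlog))
    return history


def peak_pixel_fill_mpix(rop_count, core_mhz):
    """Peak pixel fill-rate in Mpixels/s: one pixel per ROP per clock."""
    return rop_count * core_mhz


# Two peak cycles of 24 pixels, then shader-bound cycles let the ROPs catch up.
print(simulate_rop_backlog([24, 24, 8, 8, 0]))
# [(16, 8), (16, 16), (16, 8), (16, 0), (0, 0)]

print(peak_pixel_fill_mpix(16, 430))  # 6880
print(peak_pixel_fill_mpix(16, 425))  # 6800
```

With those assumed clocks the two boards sit within about one per cent of each other on peak pixel fill-rate, and the backlog clears as soon as the fragment hardware drops below 16 pixels a cycle.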

In plain English, the basic differences are as follows. Compared to a full NV40, a full G70 can do more MADD-based FP vector math (up to three times) per clock. It can fetch larger textures faster (and I bet the L2 texture cache is bigger) and filter FP textures quicker. It has 33 per cent more vertex units, each with more performance per clock. And it can now anti-alias alpha textures, all gamma correct.

TSMC's 110nm process is used to cram in 80 million or so new transistors, mostly spent on the additional, improved fragment hardware, another pair of vertex units and the tweaked ROPs. With the not-the-top-end GeForce 7800 GTX clocked faster than the 6800 Ultra in both core and memory, all that new FP vector ALU power runs faster still. And that's where the majority of G70's new performance lies.

And if you're a user who looks at high-end board performance as a means to predicting what you'll get in a low-end board, the G7x architecture supports TurboCache just as NV4x does.

