The Register®

Original URL: http://www.theregister.co.uk/2005/06/23/review_nvidia_geforce_7800_gtx/

Nvidia GeForce 7800 GTX

More evolution than revolution?

By Hexus.net

Posted in Reviews, 23rd June 2005 11:30 GMT

Review A year goes by pretty quickly for me these days. I can vividly remember the last few days before Nvidia's NV40 launch 14 months ago, and the last few days have brought some serious déjà vu. I've run much the same tests and much the same analysis. Pretty much the same pixels are being painted. The last part of that statement is the telling one - today's hardware is more evolution than revolution compared to what Nvidia delivered last time.

The G70 builds upon NV40's Shader Model 3.0 plus SLi foundation with some (admittedly significant) nips and tucks, a wider architecture (without actually going that wide in silicon) and more speed for the fps freaks to dine on.

The SKU Nvidia has chosen to debut G70 is the GeForce 7800 GTX. The GTX isn't the flagship hardware - Nvidia is saving that to force ATI's hand with R520 and Radeon X900 (if indeed it's called that when it shows up). Why release the big daddy product if you don't need to, giving you some time to play with clocks and (hopefully) stay on top of the performance league table? Those few percentage points when you're massively CPU-limited really matter. The trouble is, ATI might well do the same.

Regardless, there's a shiny new GPU to talk about. Nvidia's G70 represents its take on 3D hardware for at least the next nine months or so. It'll power, either directly or indirectly, a range of top-to-bottom products with the GeForce 7-series moniker and its core performance and feature set will define those products until the next, massively-faster-than-the-rest-of-your-system GPU comes along. There are no new base features compared to NV40, with G70 still a Shader Model 3.0 part, so where exactly are the differences? Let's take a look.

The fragment processor - usually called a pixel processor - handles fragments output by the GPU's rasteriser, which in turn creates rasterised fragments from the geometry spat out by the vertex hardware. So vertex hardware is first in the render chain, but since G70's main differences compared to NV40 are in the fragment units, I'll cover those first.

NV40 and G70's fragment units are made up of a pair of sub-units. Sub-unit one in NV40 can texture (use 'texture' data as input to a fragment program being run by the fragment units, but it doesn't have to be a coloured image texture), and issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction like RSQ (reciprocal square root). Sub-unit two can issue a MADD vector instruction (single-cycle MUL and ADD combined) or use its own mini-ALU with the same capability as the mini-ALU attached to sub-unit one.

G70 differs in sub-unit one, which can now issue a MADD as well. Everything else is the same in terms of ALU ability (all mini-ALU instructions are still single-cycle). So G70 widens internally with the power to run two MADDs on a pair of vec4 vectors, in SIMD. That's twice the SIMD MADD power of NV40, per cycle. Nvidia's reasoning - which flies in the face of the reasoning it gave for not allowing sub-unit one to issue a MADD for NV40 - is that the majority of complex fragment shader programs being run today in released and upcoming games will make heavy use of the MADD instruction, which can be used for calculating vector dot products (indeed, the single-cycle vec4 MADD is equivalent to a single-cycle DP4 instruction).

Calculation of vector dot product is an integral part of many fragment shader effects that it's desirable to run on a 3D GPU. NV35 could issue two MADDs per cycle, per fragment ALU and G70 regains that processing ability.
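To make the link between MADD and dot products concrete, here's a minimal Python sketch. The names `madd` and `dp4` are illustrative only (they mirror the instruction names rather than any real API), and plain lists of floats stand in for the hardware's vec4 registers.

```python
def madd(a, b, c):
    """Component-wise multiply-add on 4-component vectors: a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def dp4(a, b):
    """4-component dot product, accumulated with one multiply-add per lane."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = x * y + acc  # scalar multiply-add
    return acc


# Illustrative inputs only; they are not taken from any real shader.
scaled = madd([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])
dot = dp4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```

The point is only that a dot product is a chain of multiply-adds, which is why extra MADD issue slots help the lighting-style maths that fragment shaders lean on.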


G70 vs NV40

With 24 of those new ALUs inside a full G70, compared to 16 in NV40, there's three times the MADD horsepower per cycle in a full G70, compared to a full NV40. That ALU throughput is key to G70's new performance in fragment programs. Again, the units are dual-issue. So the sub-units can combine to issue two independent instructions on a four-component input vector, with either a 3:1 or 2:2 instruction split. So, for example, the hardware can issue MADD across three components of the vector and square the fourth, in a single cycle.
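The three-times figure is simple arithmetic, sketched below in Python. The function name is made up for illustration; the unit counts and MADDs per unit are the ones quoted above.

```python
def madds_per_clock(fragment_units, madds_per_unit):
    """Peak vec4 MADDs issued per clock across all fragment units."""
    return fragment_units * madds_per_unit


# NV40: 16 fragment units, only sub-unit two can issue a MADD.
nv40_madds = madds_per_clock(fragment_units=16, madds_per_unit=1)
# G70: 24 fragment units, both sub-units can issue a MADD.
g70_madds = madds_per_clock(fragment_units=24, madds_per_unit=2)

ratio = g70_madds / nv40_madds
```

That's a peak, per-clock number; real shaders only approach it when both sub-units can be kept busy with MADDs.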

Finally, the G70's register file isn't bigger than the NV40's, so FP32 operations might still experience register space pressure and not run at full theoretical speed.

The texture samplers attached to the fragment units are no different in G70, compared to NV40. However, Nvidia tells me that when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory. So while fragment programs always seek to hide texturing latency since it's always much slower than a single cycle, there's less latency to hide with G70. The samplers still perform single-cycle bilinear filtering, two-cycle trilinear and up to 16x anisotropic (128-tap) filtering, just like NV40 does.
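The 128-tap figure falls out of a quick sum, assuming the usual definitions: a bilinear fetch reads a 2 x 2 block of texels, trilinear blends bilinear results from two mip levels, and 16x anisotropic takes up to 16 trilinear probes. The names in this Python sketch are illustrative, not anything from Nvidia's documentation.

```python
BILINEAR_TAPS = 2 * 2                 # 2 x 2 texel footprint
TRILINEAR_TAPS = 2 * BILINEAR_TAPS    # bilinear on two adjacent mip levels


def anisotropic_taps(max_probes):
    """Worst-case texel reads for anisotropic filtering with trilinear probes."""
    return max_probes * TRILINEAR_TAPS


worst_case_16x = anisotropic_taps(16)
```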

I reckon the L2 texture cache has increased in G70 (to around 12KB) to go with the increase in fragment units, and it's been tweaked for better cache re-use with larger textures, decompressing those larger textures into L1 faster as well as possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering.

The texture hardware in G70 supports ATI's 3DC normal-map compression scheme and it still supports accelerated DST rendering, as made famous by 3DMark05.

There are two more vertex units in a full G70 (as you get in GeForce 7800 GTX), each with FP32 scalar and vector ALUs, and a texture sampler (with no FP filtering; it's needed to comply with the VS 3.0 spec). Since they're VS 3.0-capable vertex processors, they can do dynamic branching for the vertex programs. The vertex unit can issue a single-cycle vec4 MADD per clock and scalar performance from the FP32 scalar hardware is apparently up by 20 per cent.

So that's 33 per cent more vertex units, each with more performance per clock, with a presumably tweaked vertex fetch unit (although Nvidia won't confirm it either way) to go with them. The triangle setup and rasteriser - hardware that feeds the fragment units - is apparently optimised via the use of a new raster pattern, but again Nvidia doesn't go into details.

The ROP hardware gets a tweak for G70, too. It can now do gamma-correct multi-sampling entirely in hardware, whereas I think NV40 did some pre-processing for gamma correction in the fragment hardware for GeForce 6 and FX 4400. It can also anti-alias textures with an alpha component that's used for transparency effects. Nvidia's favourite example is the chain-link fence: the fence isn't made up of geometry, rather it's a texture with some see-through parts (using the alpha component of the texture surface). G70 can sample inside a pixel for alpha, anti-aliasing internally to the pixel if it detects alpha. Doing so lessens performance slightly, per clock, as memory bandwidth is consumed.
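As a rough illustration of why sampling alpha inside the pixel helps, here's a toy Python sketch. Everything in it is an assumption for demonstration: the `alpha_at` texture is a made-up hard edge, and the four sample positions are arbitrary rather than Nvidia's real sample pattern.

```python
def alpha_at(u, v):
    """Hypothetical alpha texture: opaque for u < 0.5, see-through otherwise."""
    return 1.0 if u < 0.5 else 0.0


def coverage(sample_offsets, alpha_threshold=0.5):
    """Fraction of sub-pixel samples that pass the alpha test."""
    passed = sum(1 for (u, v) in sample_offsets if alpha_at(u, v) >= alpha_threshold)
    return passed / len(sample_offsets)


# One sample at the pixel centre: the pixel is either fully kept or fully dropped.
CENTRE_ONLY = [(0.5, 0.5)]
# Four illustrative sub-pixel positions (not the hardware's actual pattern).
FOUR_SAMPLES = [(0.125, 0.375), (0.375, 0.875), (0.625, 0.125), (0.875, 0.625)]

without_taa = coverage(CENTRE_ONLY)
with_taa = coverage(FOUR_SAMPLES)
```

With a single centre sample the fence edge snaps fully on or off; with four samples the pixel lands on half coverage, which is the softened edge you see on screen.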

The ROPs still have a two-loop limit for Z sampling, with up to two Z samples per cycle. So the hardware limit of 4x multi-sample anti-aliasing remains (two samples per cycle, two passes through the Z ROP). Nvidia isn't keen to say if its Z and colour compression schemes have had any tweaks, although if they have I assume they're fairly mild. The ROPs, which run at core clock in G70, as they do in NV40, can still blend floating point render targets and off-screen buffers, but they can't be multi-sampled. Alpha-to-mask MSAA is a decent trade-off compared to full MSAA of float targets, though. Any increase in IQ when anti-aliasing on Nvidia hardware is welcome, given its stiff competition in that respect.

Like NV43 and similar, G70's ROP count is less than the fragment unit count. While G70 can theoretically output 24 pixels from its fragment hardware, per cycle, there's only 16 input buckets (the ROPs) to dump those pixels into, for further processing. Nvidia's reasoning is that when you're shader bound, pixels output per cycle rarely hits that peak. A ROP count of 16 lets Nvidia save transistors. Should more than 16 pixels be ready in a cycle, they're buffered until the raster output hardware is ready to work on them. Peak pixel and texel fill-rate is therefore nearly identical to the 5MHz slower 6800 Ultra, which shares the ROP count of G70.
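The fill-rate point can be checked with a quick sum, sketched in Python below. The function names are illustrative. The 430MHz clock is the 7800 GTX's core clock quoted later in this review, and the 425MHz figure is simply derived from the "5MHz slower" remark above, so treat both as assumptions of the sketch.

```python
def peak_pixels_per_clock(fragment_units, rops):
    """Pixels written per clock are capped by whichever stage is narrower."""
    return min(fragment_units, rops)


def peak_pixel_fill_mpixels(fragment_units, rops, core_clock_mhz):
    """Peak pixel fill-rate in megapixels per second."""
    return peak_pixels_per_clock(fragment_units, rops) * core_clock_mhz


g70_fill = peak_pixel_fill_mpixels(fragment_units=24, rops=16, core_clock_mhz=430)
nv40_fill = peak_pixel_fill_mpixels(fragment_units=16, rops=16, core_clock_mhz=425)
```

The extra eight fragment units add shader throughput rather than raw pixel output, which is why the two peak figures sit so close together.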

In plain English, the basic differences are as follows: Compared to a full NV40, a full G70 can do more MADD-based FP vector math (up to 3 times) per clock. It can fetch larger textures faster (and I bet the L2 texture cache is bigger) and filter FP textures quicker. It has 33 per cent more vertex units, each with more performance per clock. And it can now anti-alias alpha textures, all gamma correct.

TSMC's 110nm process is used to cram in 80 million or so new transistors, mostly made up of the more and improved fragment hardware, another pair of vertex units and the tweaked ROPs. With the not-the-top-end GeForce 7800 GTX clocked faster than the 6800 Ultra in both core and memory, all that new FP vector ALU power is clocked even faster than it would have been before. And that's where the majority of G70's new performance lies.

And if you're a user who looks at high-end board performance as a means to predicting what you'll get in a low-end board, the G7x architecture supports TurboCache just as NV4x does.


The boards

With SLi all the rage these days, reference boards from Nvidia now come as identical twins. The 7800 GTX has a few obvious talking points, so if you want to pull up a chair and have a good look at one on its own, I'll give you the grand tour.


Longer than a 256MB 6800 Ultra or 6800 GT, the PCB for the 7800 GTX isn't small. I'd say it's just shorter than the 512MB 6800 Ultra that various vendors have been peddling in recent times. The single-slot aluminium reference cooler causes the board to weigh in at around the same 320g as a reference 6800 GT, and a fair bit less than a reference 6800 Ultra.

Single-slot is one of the key indicators to a faster flagship product, probably labelled Ultra, which will likely sport a dual-slot cooler. The cooler hides the dual-link TMDS that drives the first of two DVI ports. The dual-link TMDS can drive larger digital displays than the single-link (1600 x 1200) DVI links on almost all other Nvidia boards made to date. Indeed, the panel that the linked Ultra DDL was designed to drive is one such beast that the 7800 GTX can make light work of.

Nvidia uses digital VRMs for power management this time - there are barely three capacitors on the entire product. A black aluminium heatsink cools a FET bank whose high switching speeds mean heat to manage. Being PCI Express, there's approximately 75W available from the slot (see the first few pins on the connector? They're for power) with a six-pin power connector giving the board the rest.

The rear of the board shows you how the GPU is positioned centrally with its heatsink retained by a fairly sizeable bracket. Another bracket, L-shaped, covers the rear memory modules and doubles up as a retention mechanism for the rest of the cooler. Exposed, you can see four of the eight GC16, 600MHz (1200MHz DDR) Samsung GDDR3 DRAMs that have been ubiquitous on high-end graphics hardware for the last 14 months. They're the same chips that powered 6800 Ultra.

With four DRAMs on the front of the PCB and four on the back, and with obvious room for eight more, 512MB boards - or 1GB should Nvidia prise double-density DRAMs from Samsung's mitts - should show up in due course.

It seems to have a 1.25V Vcore voltage for the 430MHz core clock, and going by the lack of noise the cooler makes under all but the heaviest of loads, it doesn't get too hot. That's not to say that at full chat they aren't loud - they really are, and with two on the go for SLi you definitely hear the noise they make - it's just that the temperature-controlled fans didn't seem to hit full chat all that often. As always, your chassis cooling helps immensely in that respect.

Power-wise, Nvidia will tell you that a single board draws 100W under load and two boards in tandem need 220W. That's pretty close to the mark, with my calculations showing an extra 80W or so of output power consumed when you add in a second board.

Nvidia claims that a 500W power supply is needed for SLi. That's an overstatement, designed to protect it from craptastic supplies that can't supply the output power an SLi system needs. If there's any more than a 300W output power from a high-end SLi rig with GTXs (without loading it up with disks), I'll be very surprised. 100W is less than 9A from a 12V source. Any decent ATX or ATX 2.0 can pull 18A with room to spare for the CPU, and with ATX 2.0 you've got that from an independent voltage rail. Just make sure your power supply is from a decent, trusted vendor, and you'll be fine. I used a Tagan TG480-U01 during all my testing, including SLi and with the most powerful x86 processor on the planet, without any issues.
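The amps figure is just power divided by voltage. Here's a minimal Python sketch, assuming (as the sum above does) that all of a board's draw comes from 12V; the function name is illustrative.

```python
def amps(watts, volts=12.0):
    """Current drawn for a given power at a given voltage (I = P / V)."""
    return watts / volts


# Nvidia's quoted single-board load figure, taken entirely from 12V.
single_board_amps = amps(100.0)
```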


Test Hardware

Before I discuss the hardware used, a little on the latest ForceWare driver. The 64-bit version of the driver is now on a par with the 32-bit build, performance of TurboCache boards is up and SLi's profile system has had some time spent on it. More applications and games are now supported by the driver, for both GeForce and Quadro parts, and the user now has more control over what SLi rendering mode is used (when a choice is actually available). For video users there's better support for HDTVs and HDTV output resolutions (1080i in particular, apparently) and Windows Media Center gets new driver-supported extensions for controlling your hardware.

DFI's simply superb LanParty UT nF4 SLI-D was a willing host for everything, Corsair's flashy stuff assmastered the memory bus and kept everything fed with data and I kept the CPU cool with an Akasa Evo33.

Comparing the 7800 GTX to the 6800 Ultra means that, yes, I missed out a comparison to ATI's latest and greatest. X850 XT PE is a wee bit faster than a single 6800 Ultra overall, so use your imagination to guess where performance would be, relative to the boards on test.

So, new GPU powered by fast CPU. There wasn't a single lockup, performance glitch or otherwise serious issue during testing.

Transparency

Nvidia's transparency anti-aliasing uses sub-pixel samples to anti-alias alpha-textures. Nvidia supports it in G70 using both multi-sampling (depth samples) and super-sampling (texture samples) using alpha information attached to the texture. Here's how it looks. Firstly, without transparency AA on an alpha-textured chain-link fence (everyone will use chain-link fences, so be warned!). All images are clickable for lossless PNG versions.

Nvidia GeForce 7800 GTX - Transparency (http://www.theregister.co.uk/2005/06/23/g70_1.png)

Notice the significant texture aliasing visible on the links. Turn on super-sampling transparency AA and most of the aliasing artefacts disappear.

The difference output of the two images shows you what parts of the image the super-sampling is working on. You'll need to click the image to see it properly; resizing the image loses the difference detail.

To measure the performance hit, I recorded a short demo in Half-Life 2 inside the prison section of Nova Prospekt, where there's more chain-link fencing (and hence alpha textures) than you know what to do with. Benchmarking the demo showed the performance hit.

The performance difference is about 12 per cent over the demo, which sits within Nvidia's claimed average performance hit of 10-15 per cent.

Video Processing

Following ATI's announcement at Computex that unannounced hardware was going to accelerate the decode of H.264 video, a format more commonly known as MPEG-4 AVC, Nvidia has been keen to say it will have support for H.264 sometime in 2005, on all of its hardware that has working PureVideo silicon. Playing back 1080p content using the FX and 7800 GTX test platform shows around 45-55 per cent CPU usage. The GPU's doing something, but not something a 6600 GT can't do, for example.

It appears that Nvidia hasn't spent much, if any, of its transistor budget for G70 on silicon used just to process video. H.264 support seems to be something that'll be accelerated by fragment programs on G70 and other Nvidia hardware, rather than by dedicated decode hardware as ATI appears to possess. With 3D speed ever increasing, and massively powerful CPUs like the 2800MHz FX becoming a limitation to new single boards, never mind SLi, image quality and video processing increases are what's needed next.

Video quality appears unchanged compared to NV43 and the other Nvidia GPUs with a fixed video processor, which is slightly disappointing.

With H.264 about to become the dominant video format in common use, especially since it's the native format for Sony's PSP hand-held gaming console and a video format for both HD-DVD and Blu-ray Disc, spending some time on decode in the GPU outside of the fragment hardware seems like a prudent thing to do. We'll see.


Benchmarks

ShaderMark 2.1 tests shader-bound performance using nearly two dozen small shaders. The shaders perform a number of common shader tasks that you might find in games. The geometry the shaders shade is very simple, to test pure shader throughput.

Nvidia GeForce 7800 GTX - ShaderMark

The results show that, on average, the 7800 GTX is around 1.6 times faster than the 6800 Ultra. The 7800 GTX was never slower, and on a couple of occasions, like those shaders utilising FP filtering, it's over twice as fast. Nvidia's focus on pixel shader throughput for G70 is clearly shown to be working in this synthetic test.

Nvidia GeForce 7800 GTX - Half-life 2

Outside of the theoretical tests, we start to see CPU limitations creep in. A single 7800 GTX is faster in our Half-Life 2 benchmark test than SLi 6800 Ultras, with 7800 GTX in SLi pegged against a CPU limitation. That's a CPU limitation with a 2800MHz Athlon FX, in case you needed reminding. At 1600 x 1200 with 4AA and 16AF, 6800 Ultra SLI, 7800 GTX and 7800 GTX SLi have monstrous performance. Everything is faster than 60fps on average, at all settings.

Nvidia GeForce 7800 GTX - Riddick

Riddick's SLI support thankfully seems correct in the 77.62 driver, for both 6800 Ultra SLI and 7800 GTX SLI. 6800 Ultras in SLI best a single 7800 GTX by a fair margin with 7800 GTX SLI standing far and away above the rest. With the lowest framerate for the 1600x1200 4AA 16AF test for 7800 GTX SLI being just under 60fps, Riddick works well with both graphics card and CPU.

A single 7800 GTX is nearly 1.6 times (60 per cent) faster than a single 6800 Ultra at the highest settings with the average performance increase being just over 1.4 times (40 per cent faster). With Riddick's engine being similar to Doom 3's (hardware stencil shadows, Z-only pre-pass, shadow volume extrusion done on the CPU), you can imagine Doom 3 performance, too.


Far Cry is tested in three settings groups, across a set of standard resolutions. The first settings group has no IQ enhancements applied.

Nvidia GeForce 7800 GTX - Far Cry no IQ

With no IQ enhancements applied, there's no benefit to running the 7800 GTX either on its own against a single 6800 Ultra, or as a pair against SLI 6800 Ultras. You need to make the cards work harder to see any gains with G70.

Nvidia GeForce 7800 GTX - Far Cry IQ

SLi beats a single board, obviously, but otherwise there's no advantage to using the 7800 GTX. Far Cry has a HDR rendering mode tailored to NV40 and G70, though.

Nvidia GeForce 7800 GTX - Far Cry HDR

The r_HDRrendering variable is set to 7 for this test. Where 6800 Ultra shows no gains with SLI, a single 7800 GTX is not only faster than 6800 Ultra SLI, but SLI then shows large gains to be had with HDR rendering enabled. On average, SLI 7800 GTX is 30 per cent faster than a single board and it's the first time that Far Cry's HDR rendering has been accelerated to any kind of degree by either a single board or two in SLI.

Other settings for r_HDRrendering show even larger gains. G70's ROP hardware is responsible for the OpenEXR FP buffer compositing that Far Cry uses to apply HDR lighting to a frame. Combine that with the new texture filtering performance for FP textures and you can see where G70 is getting its gains.

Verdict

Nvidia's fragment ALU improvements - allowing sub-unit one to do a vec4 MADD in a single cycle and increasing the number of full fragment ALUs from 16 to 24 - are where the vast majority of the speed improvements in G70, compared to NV40, are to be found. Tweaks to the fragment rasteriser, the ROP hardware and the addition of two more vertex units are where you'll find the rest. It's a modest improvement which TSMC's 110nm process afforded Nvidia without too much work, and the company looks like it's saving large-scale feature and performance enhancements for a product to come in nine months' to a year's time.

The improvements in performance where FP filtering and blending were concerned (see the last few shader tests in ShaderMark 2.1, along with the HDR tests in Far Cry) are impressive. Nvidia's changes in the texture filter hardware and ROPs combine to explain the new found FP performance, making the G70 attractive if that's the kind of rendering you're looking forward to in upcoming games. If only the hardware could multi-sample float buffers for new speed and image quality.

It's also nice to see an improvement to anti-aliasing quality with G70. Alpha-to-mask AA, for multi or supersampling alpha textures that are used to paint pixels, gives a boost in image quality that's been missing for some years, and gamma correct AA, where the final colour of the pixel is modulated to match your monitor's usual gamma, is welcomed on consumer Nvidia hardware. That the performance hit is acceptable is the icing on that cake.

SLi performance seems to be higher at first glance, although another article is likely needed to investigate further. Finally, the question of CPU limitation rears its head yet again. With hardware so powerful, you waste it to some degree without a CPU and memory subsystem to keep it fed. Finding cash for an FX-57-alike isn't easy. In that respect, it's nice to see this generation of new 3D hardware releases slow down somewhat. ATI is still a little while away from releasing R520 and Nvidia hasn't rolled out the 7-series range-topper yet.

Only lacklustre video support (especially with my recent acquisition of a Sony PSP and a big desire to play back MPEG-4 AVC video on my PC) spoils it for me somewhat. Single-slot cooling and dual-link DVI support (even from just one of the outputs) make me a happy hardware reviewer.

All-in, a solid improvement to an excellent architecture. Nvidia balances ALU increases with ROP considerations to create a formidable pixel pusher that currently has no equal. 7800 GTX boards can be bought from today for a UK retail price of £369. That's only some £70 more than a 6800 Ultra - i.e. spend around 23 per cent more for a 40 per cent increase in performance.

Nvidia remains in the Shader Model 3.0 driving seat with an all-powerful new chip that looks forward to WGF 1.0, 90nm and even better pixel output silicon. The fastest graphics card money can buy, unsurprisingly. And I would be even less surprised if the entire initial UK allocation of boards is sold by the end of the day, such is its power and ability.

Review by
Hexus.net (http://www.hexus.net/)