2 + 2 = 4, er, 4.1, no, 4.3... Nvidia's Titan V GPUs spit out 'wrong answers' in scientific simulations

Fine for gaming, not so much for modeling, it is claimed

Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists.

The Titan V is the Silicon Valley giant's most powerful GPU board available to date, and is built on Nv's Volta technology. Gamers and casual users will not notice any errors or issues; however, folks running intensive scientific software may encounter occasional glitches.

One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and an enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values every time. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we're told.
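For the curious, a repeat-run check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the engineer's actual test: `toy_simulation` is a hypothetical stand-in for a real GPU workload, and the helper names are our own. It fingerprints the exact bit patterns of each run's output and counts how many runs disagree with the majority.

```python
import hashlib
import struct
from collections import Counter


def fingerprint(values):
    """Hash the exact 64-bit patterns of a sequence of floats."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def count_mismatches(run_simulation, repeats):
    """Run the same job `repeats` times; count runs that differ from the majority."""
    prints = [fingerprint(run_simulation()) for _ in range(repeats)]
    _, majority_count = Counter(prints).most_common(1)[0]
    return repeats - majority_count


def toy_simulation():
    # Hypothetical stand-in for a real workload: a fixed chain of float operations.
    x = 0.1
    out = []
    for _ in range(1000):
        x = 3.9 * x * (1.0 - x)
        out.append(x)
    return out


mismatches = count_mismatches(toy_simulation, repeats=20)
print(f"{mismatches} of 20 runs disagreed with the majority")
```

On sound hardware, a deterministic job like this should report zero disagreements. Comparing hashes of the raw bits, rather than rounded values, means even a difference in the last decimal place is caught.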

We have repeatedly asked Nvidia for an explanation, and spokespeople have declined to comment. With Nvidia kicking off its GPU Technology Conference in San Jose, California, next week, perhaps then we'll get some answers.

All in all, it is bad news for boffins as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions within the virtual world, not rare glitches in the underlying hardware.

Collisions

Take for instance software that models molecular interactions. This sort of code uses Newtonian equations to predict the state of a system at any given time, such as calculating the position of particles after collisions. If a simulation has the same environment and starts with the same conditions, the output should be the same, again and again. But that isn’t always the case when using Nvidia’s Titan V GPUs to crunch the numbers.
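To make the point concrete, here is a toy sketch, assuming a single particle bouncing in a one-dimensional box under constant acceleration. The function name, parameters and values are ours, chosen purely for illustration, and bear no relation to the software the engineers ran. Given the same starting position and velocity, the update rule has no source of randomness, so two runs should match bit for bit.

```python
def simulate(x0, v0, steps, dt=0.001, box=1.0, g=-9.81):
    """Step one particle through a 1D box with elastic walls.

    Uses a constant-acceleration position and velocity update per time step.
    All parameters are illustrative assumptions.
    """
    x, v = x0, v0
    for _ in range(steps):
        x += v * dt + 0.5 * g * dt * dt  # Newtonian position update
        v += g * dt                      # Newtonian velocity update
        if x < 0.0:                      # collision with the floor
            x, v = -x, -v
        elif x > box:                    # collision with the ceiling
            x, v = 2.0 * box - x, -v
    return x, v


first = simulate(0.5, 1.0, 10_000)
second = simulate(0.5, 1.0, 10_000)
print(first == second)
```

If the final comparison ever came back false on the same machine with the same inputs, the fault would lie outside the physics.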

An industry veteran, who alerted us to the problem, reckoned it is down to a memory issue. Chip companies normally push their high-end silicon to the limit to maximize performance. Nvidia may be overclocking or red-lining its Titan V in some way, causing read errors from memory. These mistakes are carried forward in calculations, resulting in numerical errors. Another cause could be a design blunder.
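How one bad read can taint a result is easy to show in miniature. The sketch below assumes, purely for illustration, that the fault is a single flipped bit in one value fetched from memory; nothing in the reports confirms that is what happens on the Titan V, and `flip_bit` and `running_sum` are hypothetical helpers of our own.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def running_sum(values, corrupt_at=None, bit=30):
    """Sum the values, optionally corrupting the read at index `corrupt_at`."""
    total = 0.0
    for i, v in enumerate(values):
        if i == corrupt_at:
            v = flip_bit(v, bit)  # assumed fault: one bad memory read
        total += v                # the error is carried into every later step
    return total


data = [0.001 * i for i in range(1, 1001)]
clean = running_sum(data)
faulty = running_sum(data, corrupt_at=10)
print(clean != faulty, abs(clean - faulty))
```

The corrupted value is off by a tiny amount, yet the final total no longer matches the clean run, which is exactly the kind of discrepancy a bitwise reproducibility test would flag.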

It is not down to random defects in the chipsets nor a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told. The moneybags biz released patches to address similar errors in some of its older GeForce and Titan models. There was no such issue with its Pascal-based Titan X card, we're told.

Unlike previous GeForce and Titan GPUs, the Titan V is geared not so much for gamers but for handling intensive parallel computing workloads for data science, modeling, and machine learning.

And at $2,999 (£2,200) a pop, it’s not cheap to waste resources and research time on faulty hardware. Engineers speaking to The Register on condition of anonymity, to avoid repercussions from Nvidia, said the best solution is to avoid using the Titan V altogether until a software patch has been released to address the mathematical oddities.

We understand Nvidia has been made aware of the Titan V reproducibility issue. ®

Updated to add

A spokesperson for Nvidia has been in touch to say people should drop the chip designer a note if they have any problems. The biz acknowledged it is aware of at least one scientific application – a molecular dynamics package called Amber – that reportedly is affected by the Titan V weirdness.

"All of our GPUs add correctly," the rep told us. "Our Tesla line, which has ECC [error-correcting code memory], is designed for these types of large scale, high performance simulations. Anyone who does experience issues should contact support@nvidia.com."

Biting the hand that feeds IT © 1998–2018