DragonFly BSD developer stung by Opteron bug
New 'Bulldozer' cores not affected
Matthew Dillon, the lead developer behind the DragonFly BSD fork of the open source FreeBSD Unix variant, spent more than a year chasing crashes on Opteron-based systems running his operating system - and now Advanced Micro Devices says the cause is a bug in earlier generations of Opteron processors.
The erratum does not affect current Opteron 4200 and 6200 processors or the impending Opteron 3200 chips, all of which have a different microarchitecture based on the "Bulldozer" cores.
Dillon described the crashes he was seeing with DragonFly BSD in a posting on his kernel list on Christmas Day, and said he had been bug hunting for more than a year.
The problem occurs with the cc1 C compiler in the open-source GNU gcc 4.4.7 compiler set. Dillon ran tests on Opteron and Phenom II machines as well as on Sandy Bridge Xeon E3 processors from Intel. He said he didn't see the issue on Intel iron and had ruled out the OS as the source of the crashes. On Monday, in another post, Dillon said that after reviewing his issues, AMD confirmed the bug and said that, when performing certain kinds of loop operations, the affected Opterons can incorrectly update the stack pointer.
AMD provided El Reg with the following statement about the bug:
A program exception has been identified in previous generations of the AMD Opteron processor that occurs in certain environments that leverage a very specific GCC compiler build. A workaround has been identified for the small segment of customers this could potentially impact. It's also important to note the following:
- Only one external software developer has reported this observation.
- This could only impact previous generation products (codenames: "Barcelona," "Shanghai," "Istanbul," "Lisbon," and "Magny-Cours").
- While there are millions of these processors in the field, no other cases have been reported.
- To see this observation multiple events needed to happen concurrently and required a certain BSD-derivative environment (BSD is based on the Unix operating system) that uses a unique GCC compiler build.
- This erratum CANNOT occur on AMD Opteron 3200, 4200 and 6200 ("Valencia" and "Interlagos") Series processors since it utilizes a different microarchitecture.
And finally and even more important for readers to understand, AMD and Intel post errata updates on a regular basis, the difference here is a developer with a blog and unique GCC compiler build uncovered it.
In the wake of that statement, AMD told Dillon that it would be updating its revision guides for its 10h and 12h processors to document this erratum, which has been given issue #721. At that time, AMD will provide a model-specific workaround to the issue, it said. ®
COMMENTS
Spot on. AMD are trying to make out like Matt is a nutty weirdo doing crazy things with their processor. He's not, their processor has an errata. End of story.
"While there are millions of these processors in the field, no other cases have been reported."
'We have millions of people dying, but no one has found the causative agent yet, so it must just be bad lifestyle choices.' Medicine, before viruses were 'discovered'.
With something so esoteric how many people *could* generate a well-defined, well-described 'case'? That only _then_ are you forced to admit is true? Until then a company just falls back on 'have you tried powering it off, then on?'
"To see this observation multiple events needed to happen concurrently and required a certain BSD-derivative environment (BSD is based on the Unix operating system) that uses a unique GCC compiler build."
You fervently hope...
Or maybe...
... they meant to say that nobody needs to rush out to their supplier to have their CPUs replaced now.
They're not saying they're ignoring the report, obviously they are taking it seriously since they are publishing an errata and announcing.
They're not forced to admit it's true: they didn't know about it themselves, they are *confirming* it's true.
Your metaphor is very inaccurate: death concerns everybody. This particular small bug impacts only a very small number of users (because if all those millions of processors were crashing, we'd have heard of it).
No, you cannot blame all of Windows' BSOD on that single bug.
Sheesh, some people have a knack for turning a company publishing a fix into some kind of conspiracy.
