FYI: Processor bugs are everywhere – just ask Intel and AMD

More chip flaws await

By Thomas Claburn in San Francisco

Posted in Data Centre, 26th January 2018 20:38 GMT

In 2015, Microsoft senior engineer Dan Luu forecast a bountiful harvest of chip bugs in the years ahead.

"We’ve seen at least two serious bugs in Intel CPUs in the last quarter, and it’s almost certain there are more bugs lurking," he wrote. "There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we’ve moved past that."

Thanks to growing chip complexity, compounded by hardware virtualization, and reduced design validation efforts, Luu argued, the incidence of hardware problems could be expected to increase.

This month's Meltdown and Spectre security flaws, which affect chip designs from AMD, Arm, and Intel to varying degrees, support that claim. But there are many other examples.

Last March, there was a bug affecting AMD's Ryzen chips that got patched with a workaround. And in June, AMD replaced some Ryzen 7 chips that weren't tuned to perform well under load.

That same summer, problems with hyperthreading surfaced in Intel's Skylake and Kaby Lake processors.

In February last year, clock problems with Intel's Atom C2000 chips surfaced, requiring widespread replacement.

Webpage slinger Cloudflare this month recounted a problem with Intel's Broadwell chips that it encountered last year.

In February 2017, while fixing a security issue the company dubbed Cloudbleed, Cloudflare engineers spotted a number of unexplained NGINX process crashes.

These segmentation faults (SIGSEGV) killed server processes intermittently but often enough to attract attention because the company runs so many servers.

The crashes produced core dumps, and sifting through them required some effort because they can be several gigabytes in size.

After ruling out memory errors, explains Cloudflare systems engineer David Wragg in a blog post, those working on the issue noticed a common factor: the crashes were all occurring on Intel Xeon E5-2650 v4 servers.

Suspicions of a hardware problem were validated when engineers noticed an entry in Intel's errata for that processor model.

"The Specification Update described 85 issues, most of which are obscure issues of interest mainly to the developers of the BIOS and operating systems," said Wragg. "But one caught our eye: 'BDF76 An Intel Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior.'"

Intel fixed issue BDF76 with a microcode patch, which Cloudflare obtained through a BIOS update from its server vendor. After the patch was applied, the number of unexplained core dumps dropped significantly.
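For readers wondering which processor model and microcode revision their own Linux boxes are running, here is a minimal sketch. It is not Cloudflare's tooling. It assumes an x86 Linux host, where /proc/cpuinfo lists a "model name" and a "microcode" field for each logical CPU, and the function names are our own invention for illustration.

```python
from collections import Counter
from pathlib import Path


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical CPU."""
    cpus = []
    # Each logical CPU is a block of "key : value" lines ending in a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def microcode_summary(text):
    """Count logical CPUs per (model name, microcode revision) pair."""
    return Counter(
        (cpu.get("model name", "unknown"), cpu.get("microcode", "unknown"))
        for cpu in parse_cpuinfo(text)
    )


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux on x86
    if path.exists():
        for (model, rev), count in microcode_summary(path.read_text()).items():
            print(f"{count} x {model} (microcode {rev})")
```

Comparing the reported revision before and after a BIOS update is one way to confirm that a microcode patch actually landed. Which revision fixes a given erratum is something only the vendor's documentation can tell you.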

Expect more hardware flaws to come. ®
