FYI: Processor bugs are everywhere – just ask Intel and AMD
More chip flaws await
Posted in Data Centre, 26th January 2018 20:38 GMT
In 2015, Microsoft senior engineer Dan Luu forecast a bountiful harvest of chip bugs in the years ahead.
"We’ve seen at least two serious bugs in Intel CPUs in the last quarter, and it’s almost certain there are more bugs lurking," he wrote. "There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we’ve moved past that."
Thanks to growing chip complexity, compounded by hardware virtualization, and reduced design validation efforts, Luu argued, the incidence of hardware problems could be expected to increase.
This month's Meltdown and Spectre security flaws that affect chip designs from AMD, Arm, and Intel to varying degrees support that claim. But there are many other examples.
Last March, there was a bug affecting AMD's Ryzen chips that got patched with a workaround. And in June, AMD replaced some Ryzen 7 chips that weren't tuned to perform well under load.
That same summer, problems with hyperthreading surfaced in Intel's Skylake and Kaby Lake processors.
In February last year, clock problems with Intel's Atom C2000 chips surfaced, requiring widespread replacement.
Webpage slinger Cloudflare this month recounted a problem with Intel's Broadwell chips that it encountered last year.
In February 2017, while fixing a security issue the company dubbed Cloudbleed, Cloudflare engineers spotted a number of unexplained NGINX process crashes.
These segmentation faults (SIGSEGV) killed server processes intermittently but often enough to attract attention because the company runs so many servers.
The crashes produced core dumps and sifting through them requires some effort because they can be several gigabytes in size.
After ruling out memory errors, explains Cloudflare systems engineer David Wragg a blog post, those working on the issue noticed a common factor: the crashes were all occurring on Intel Xeon E5-2650 v4 servers.
Suspicions of a hardware problem were validated when engineers noticed an entry in Intel's errata for that processor model.
"The Specification Update described 85 issues, most of which are obscure issues of interest mainly to the developers of the BIOS and operating systems," said Wragg. "But one caught our eye: 'BDF76 An Intel Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior.'"
Intel fixed issue BDF76 through a microcode patch that Cloudflare delivered through a BIOS update from its server vendor. After the patch was applied, the number of unexplained core dumps dropped significantly.
Expect more hardware flaws to come. ®