Meltdown, Spectre bug patch slowdown gets real – and what you can do about it
Chip flaw fixes not so insignificant after all
Posted in Security, 9th January 2018 00:45 GMT
Analysis Having shot itself in the foot by prioritizing processor speed over security, the chip industry's fix involves doing the same to customers.
The patches being put in place to address the Meltdown and Spectre bugs that affect most modern CPUs were supposed be airy little things of no consequence. Instead, for some unlucky people, they're anchors.
Having helped find the flaws, Google insisted the software fixes that have begun to appear "introduce minimal performance impact," and insisted the performance hit will diminish over time.
Intel said as much in its statement, claiming "any performance impacts are workload-dependent, and, for the average computer user, should not be significant and will be mitigated over time."
That may be true eventually, thanks in part to a processor feature called Processor-Context ID, or PCID. But more on that later.
Woo-yay, Meltdown CPU fixes are here. Now, Spectre flaws will haunt tech industry for yearsREAD MORE
At the moment, the speed consequences of patching these bugs is significant enough to elicit attention and complaints. To be clear: we here at El Reg highly recommend you install the CPU security bug patches as soon as possible. We just want folks – particularly cloud subscribers and IT admins – to be aware of the effects.
While most casual desktop users and gamers won't notice any prolonged slowdown, or any performance hit at all, people running IO or system-call intensive software, such as databases on backend servers, may notice the difference.
Red Hat has clocked the patch performance impact as ranging from one to 20 per cent.
Epic Games on Friday explained the cause of recent login and stability issues experienced by its players, noting: "All of our cloud services are affected by updates required to mitigate the Meltdown vulnerability."
The company, which relies on AWS servers, posted a screenshot of a graph depicting a spike in CPU utilization after a host was patched. The Register asked Epic to elaborate on its findings, but a spokesperson said the developer had nothing further to add at the moment.
Discussions on the mailing list for Lustre, a parallel distributed filesystem, described slowdowns ranging from 10 per cent to as high as 45 per cent for certain IO intensive applications.
"We found terrible performance on the test system with zfs+compression+lustre," wrote Arman Khalatyan of the Leibniz Institute for Astrophysics Potsdam in a memo on Monday.
On Reddit, a Monero coin miner reported a slowdown of about 45 per cent after applying the Meltdown patch. On that thread, another person cited a hash rate decrease of 10 to 15 per cent.
Quora, which relies on AWS, on Saturday said it is "facing a slowdown due to the patch applied by AWS for Intel's Meltdown and Spectre issues."
Via Twitter, Francis Wolinski, a data scientist with Paris-based Blueprint Strategy, noted that Python slowed significantly (about 37 per cent) after applying the Meltdown patch for Windows 7.
Also via Twitter, Ian Chan, director of engineering for analytics firm Branch Metrics, described CPU utilization increases of five to 20 per cent after the Meltdown patch was applied to the AWS EC2 hypervisor handling its Kafka instances.
Amazon customers have sent The Register several screenshots of CPU utilization showing spikes similar to those that have been publicly discussed. Before the weekend, Amazon confirmed the updates will ding AWS virtual-machine performance to some degree, albeit with no "meaningful performance impact for most customer workloads" expected, apparently.
Soar ... An example AWS CPU utilization spike after installing CPU flaw security patches (Click to enlarge)
These figures are in keeping with the estimates first reported by The Register, a performance hit of roughly five to 30 per cent, with the caveat that any such results are highly variable and depend on a number of factors such as the workload in question and the technology involved.
El Reg's sister site The Next Platform estimated that the amount of computing value lost to the slowdown amounts to $6 billion annually.
These delays are largely the consequence of Meltdown patches, which on Linux enforce separation between the kernel and user virtual memory address spaces through Kernel Page Table Isolation, or KPTI.
Beyond Linux, Microsoft has patched Windows Server 2008 R2, 2012 R2, and 2016, among other flavors of its operating system. Apple has also mitigated Meltdown and Spectre in iOS and macOS.
Spectre mitigations – which involve recompiling software with countermeasures such as Google's retpoline as well as microcode updates depending on the processor model – have just begun to appear. Though considered only partial fixes for a problem that will take some time to sort out, they're nonetheless expected to affect performance, too (beyond knackering some AMD PCs if you're using Windows).
If there's a bright side to all this, it's that the PCID feature in Intel's x86-64 chips since 2010 can reduce the performance hit from patching Meltdown. (If you have a 32-bit system, you're on your own.)
Remediating Meltdown – which is present in modern Intel processors – involves enforcing complete separation between user processes' virtual memory spaces and the kernel's virtual memory areas. Rather than map the kernel into the top portion of every process's virtual memory space where it remains invisible unless required to handle an interrupt or system call, the kernel is moved to a separate virtual address space and context. This fix prevents malware from exploiting the Meltdown CPU bug to read kernel memory from user mode, and is referred to as Kernel Page Table Isolation.
Switching back and forth between these contexts – from the user process context to the kernel context and back to the user process – involves reloading page tables, one set describing the user process and another describing the kernel. These tables map the process or kernel's virtual memory to physical blocks of RAM or swap space.
These context switches from user process to kernel to process not only takes time, it also flushes any cached virtual-to-physical memory translations, all in all causing a performance hit, particularly on workloads that involve a lot of IO or system calls. But with PCID, there's no need to flush the entire translation lookaside buffer (TLB) cache on every context switch as selected TLB entries can be retained in the processor.
PCID first saw Linux support in the 4.14 kernel released in November 2017, and thus it's not necessarily available by default with every Linux instance, particularly on virtual machines.
In a Google Groups post on Sunday, Gil Tene, CTO and cofounder of enterprise Java biz Azul Systems, said PCID has become critical both for security and performance on Intel's x86 platform. But he observed that it isn't present on many of the virtualized Linux instances he's looked at.
Most KVM guests – kernel-based virtual machines – don't include PCID, according to Tene, while most VMware guests do. And about half of the AWS instances he looked at don't have it.
"You REALLY want PCID in your processor," wrote Tene. "Without it, you may be running insecurely (Meltdown fixes turned off by default), or you may run so slow you'll be wishing for a security intrusion to put you out of your misery."
In other words, if you're seeing crap performance after applying these fixes, look at your kernel configuration and get PCID enabled – if the hardware feature is present in your chipset. Windows should, for what it's worth, use PCID if it's provided by the processor. ®