Spectre and Meltdown fixes: How will they affect storage?
More worries for storage admins scouring environments for holes
Analysis: How will the bugfixes for Intel's Spectre and Meltdown vulnerabilities affect storage?
IBM – which plans to release firmware updates tomorrow addressing Spectre-like bugs in its POWER CPU designs – has said that its storage won't be affected at all. An IBM statement confirmed: "IBM Storage appliances are not impacted by this vulnerability."
HPE doesn't say much. In a statement, it explained: "In terms of impact on performance after patching systems: in most cases, we anticipate very minimal impact to system performance. However, any slower performance resulting from the OS and microprocessor patch updates will vary based on the OS and workloads."
That's all fair enough, but it doesn't tell us whether storage workloads are among those affected.
Meanwhile, Jeff Ready, founder and CEO at hyper-converged infrastructure supplier Scale Computing, suggests storage performance effects could be severe in some cases, and spoke to us about the issue in some detail.
He said: "Obviously my focus (and the focus of the whole infrastructure industry, I suppose) has been working through all the information in real time, identifying what needs to be patched and how, and getting those patches out the door. But performance impacts are going to be seen across the industry ... As it relates to our systems at Scale: Scale HC3 systems will likely take a small performance hit, but we will be far less impacted than most."
El Reg: "How will storage system benchmarks be affected?"
Jeff Ready: "I would expect SPC benchmarks (which drive pure storage performance) to be affected the most on systems which utilise software-defined storage via a mid-layer filesystem. CPU utilisation around operations on that filesystem will increase measurably. VSA-based architectures, which most companies in HCI use, fall into this category.
"TPC benchmarks should be affected to a lesser extent, due to the inclusion of other database-related CPU load. In either case (SPC/TPC), the performance hit (due to increased CPU load and IO latency) will depend entirely on the underlying storage implementation."
El Reg: "Can you explain why?"
Jeff Ready: "As has been shown in the published materials around the exploits already, performance effects will vary based on the number of syscalls a workload is making. Fortunately for Scale Computing, SCRIBE (our SDS layer) accesses storage directly via raw block devices, which minimizes the number of syscalls required to perform any particular IO operation."
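The difference Ready describes can be illustrated with a small sketch (this is not SCRIBE's actual code, just the general pattern): a naive filesystem-mediated path that opens and closes a file for every IO pays three or more syscalls per operation, while reads against a descriptor opened once – the pattern used when addressing a raw block device directly – pay one.

```python
import os
import tempfile

# Illustrative sketch only -- not SCRIBE's implementation. A temp file
# stands in for the raw block device here.

BLOCK = 4096  # bytes per IO in this sketch

def io_via_filesystem(path: str, offset: int) -> bytes:
    """Three syscalls per IO: open, pread, close."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, BLOCK, offset)
    finally:
        os.close(fd)

def io_via_raw_device(fd: int, offset: int) -> bytes:
    """One syscall per IO: pread on a descriptor opened once at startup."""
    return os.pread(fd, BLOCK, offset)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * BLOCK * 4)
        f.flush()
        fd = os.open(f.name, os.O_RDONLY)  # opened once, reused for every IO
        try:
            assert io_via_filesystem(f.name, BLOCK) == io_via_raw_device(fd, BLOCK)
        finally:
            os.close(fd)
```

Both paths return the same data; the point is the syscall count per IO, which is what the Meltdown/Spectre mitigations tax.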
El Reg: "Why would your software be less affected?"
Jeff Ready: "Since no filesystem is involved with SCRIBE, files do not need to be opened or closed (each a syscall) in the IO path. Other software-defined storage solutions which use a mid-layer filesystem will likely have a much larger performance impact as a result of these fixes.
"As the fixes roll out, we will start to see how everyone is hit, but you can speculate on the overall impact when you have a virtual storage controller, running on a filesystem, running as a virtual machine, running on a file system. Traversing that stack results in many syscalls per IO, and that will likely have a significant negative performance impact."
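Ready's point about stacked layers can be put in back-of-envelope terms. All the numbers below are made-up assumptions for illustration, not measurements: if per-IO cost is some base amount of work plus a per-syscall cost, and the mitigations add a fixed overhead to every syscall, then the relative slowdown grows with the number of syscalls per IO.

```python
# Back-of-envelope model (all numbers are illustrative assumptions):
# per-IO cost = base work + syscalls * per-syscall cost. The mitigations
# raise the per-syscall cost, so syscall-heavy stacks see a larger
# relative slowdown.

def relative_slowdown(syscalls_per_io: int,
                      base_us: float = 20.0,      # assumed non-syscall work per IO
                      syscall_us: float = 0.1,    # assumed pre-patch syscall cost
                      mitigation_us: float = 0.3  # assumed added post-patch cost
                      ) -> float:
    before = base_us + syscalls_per_io * syscall_us
    after = base_us + syscalls_per_io * (syscall_us + mitigation_us)
    return after / before

# A direct raw-block datapath (few syscalls per IO) vs a VSA-on-filesystem
# stack (many syscalls once every layer is traversed per IO):
print(f"direct (2 syscalls/IO):   {relative_slowdown(2):.3f}x")
print(f"layered (40 syscalls/IO): {relative_slowdown(40):.3f}x")
```

Under these assumed numbers the direct path barely moves while the layered stack slows by half – the exact figures are invented, but the shape of the result is the mechanism Ready is describing.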
El Reg: "Will NVMe be more or less impacted?"
Jeff Ready: "Lower latency storage technologies (like NVMe) tend to expose the latency in the software stack even more, and the Meltdown/Spectre mitigations will make that even worse in a syscall-heavy datapath. Our Hypercore-Direct version of SCRIBE was designed to not make *any* syscalls in the IO path, and as a result should see a negligible performance impact as a result of these mitigations."
He added: "Hypercore Direct ... is our stack optimized for NVMe ... every system call, context switch, filesystem, etc, added to the latency, and ... one of the keys to our extremely low latency was that we had none of these in the storage system."
El Reg: "So syscall-heavy offerings will suffer more?"
Jeff Ready: "So in our standard HC3 offering, we make very few system calls. In our Hypercore Direct offering, we make none. The platforms which require/make use of many syscalls will see a significant performance hit because of those syscalls, but for our Hypercore Direct offering, we should see negligible if any impact.
"I can't say that we designed the system that way because we expected something as crazy as this to happen, but I can say that we did design it this way specifically because syscalls were already too slow. Now, those syscalls are much, much slower, so our relative performance gets that much better."
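The "syscalls were already too slow" point can be sanity-checked with a crude microbenchmark: time a loop of kernel round-trips against a loop that stays in userspace. Results will vary widely by kernel, CPU, and whether page-table isolation is enabled – this is a sketch, not a rigorous measurement.

```python
import os
import time

# Crude microbenchmark: os.getpid is a thin wrapper over the getpid
# syscall on modern glibc, so each call crosses into the kernel. With
# page-table isolation enabled, every kernel entry/exit also pays for
# the extra page-table switch, widening the gap versus userspace work.

N = 200_000

def noop() -> int:
    return 0

def time_loop(fn) -> float:
    start = time.perf_counter()
    for _ in range(N):
        fn()
    return time.perf_counter() - start

syscall_secs = time_loop(os.getpid)  # kernel round-trip each call
user_secs = time_loop(noop)          # stays in userspace

print(f"syscall loop:   {syscall_secs:.4f}s")
print(f"userspace loop: {user_secs:.4f}s")
```

Python's own call overhead muddies the absolute numbers, but the direction of the effect is what matters: any fixed per-syscall tax multiplies across a syscall-heavy datapath.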
El Reg: "What will be the effect of implementing the fixes?"
Jeff Ready: "We are in the midst of testing the mitigations in our lab right now, and will have some hard data shortly.
"As far as fixing any performance problems that arise: In the short term, any IT deployment with enough CPU headroom should be fine. The vast majority should fall into this category.
"I think IT administrators who are already 'pushing the envelope' and running on overloaded systems may very well end up with human-observable slowdowns as a result. For the minority already running in a redline situation before a patch, we'll be working with them to mitigate the hit. Fortunately with a system like ours, that's likely solved by adding a node, which they already needed to do."
El Reg: "Should customers run systems with performance headroom?"
Jeff Ready: "As a matter of good practice, we always encourage our customers to run their environments such that, if they were to sustain a full node failure, the resources of the remaining nodes in the cluster can handle the load. If systems are set up this way, it's unlikely that they'll notice any impact other than looking under the hood at gauges for overall CPU utilisation, which will show some increase. For these customers, the increase could easily be absorbed into future planning, and admins with easily scalable hyper-converged solutions (like ours) would have the extra option of adding a single node to a cluster to help with the added CPU load – and adding such a node is trivial."
El Reg: "What can customers with highly loaded systems do?"
Jeff Ready: "For administrators running highly loaded systems that aren't scale-out, or where the scale-out is difficult (for example, disruptive to workloads or scalable only in large, expensive chunks) and where many syscalls exist in their stack, there may not be an easy solution.
"Minimal syscalls and easy scalability in both price and node size help us, relatively speaking, quite a lot here. But the whole episode still sucks for admins, who now need to look at their environments from top to bottom for security issues. It's not often (thankfully) that something like this universally and simultaneously hits laptops, cell phones, virtual machines, server kernels, and just about everything else.
"We'll be working with all our customers to help them understand and work through the issues. I am certain we will work with customers to help them even as their questions and needs extend outside the limitations of the Scale HC3 stack itself... I have no doubt there is still a lot to learn about the overall impact of these exploits for some time to come." ®