AMD to fix slippery hypervisor-busting bug in its CPU microcode
Patch for Piledriver chips emitted this week to kill off potentially exploitable glitches
Analysis
AMD will release new processor microcode on Monday to crush an esoteric bug that could potentially be exploited by virtual machine guests to hijack host servers.
Machines using AMD Piledriver CPUs, such as the Opteron 6300 family of server chips, are vulnerable to the flaw if they are running CPU microcode version 0x6000832 or 0x6000836, the latest available.
When triggered, the bug can cause a processor core to execute data as if it were code, which crashes the currently running process. It is possible for a non-root user in a virtual machine to exploit this defect to destabilize the host system, or to trick the host kernel into executing malicious code controlled by that user.
In other words, it is possible on some AMD-powered servers for a normal user in a guest virtual machine to escape to the underlying host and take over the whole shared server. Although it is rather tricky to exploit – for one thing, it requires precise timing – AMD has a fix ready for operating system makers to distribute to affected users from this week.
"AMD is aware of the potential issue of unprivileged code running in virtual machine guests on systems that make use of AMD Opteron 6200/6300," a spokesman told The Register.
"Following a thorough investigation we have determined that only AMD driver patch 6000832 and patch 6000836 is affected by this issue. AMD has developed a patch to fully resolve the issue and will be made available to our partners on Monday, 7 March, 2016."
The bug is related to the delivery of non-maskable interrupts (NMIs), and is specific to the aforementioned microcode versions. On Linux, /proc/cpuinfo lists the revision of the microcode running on each processor core, if you want to check whether your machine is vulnerable. Microcode, which is basically your processor's firmware, can be installed by your motherboard's BIOS or by your kernel during boot-up; Debian GNU/Linux, for example, distributes the latest patches in its amd64-microcode and intel-microcode packages.
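For anyone who would rather script that /proc/cpuinfo check than eyeball it, here is a minimal Python sketch. It assumes the usual x86 Linux layout, where each core's block contains a line such as "microcode : 0x6000836", and it only compares against the two revisions AMD named. The helper names are hypothetical, and a match says nothing about whether the machine is actually hosting untrusted guests.

```python
# Sketch: flag the two microcode revisions AMD named as affected.
# Assumes the x86 Linux /proc/cpuinfo format "microcode\t: 0x6000836".
AFFECTED_REVISIONS = {0x6000832, 0x6000836}


def microcode_revisions(cpuinfo_text: str) -> set[int]:
    """Collect the distinct microcode revisions listed in cpuinfo text."""
    revisions: set[int] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # int(..., 16) accepts the "0x" prefix the kernel prints
            revisions.add(int(value.strip(), 16))
    return revisions


def is_affected(cpuinfo_text: str) -> bool:
    """True if any core reports one of the affected revisions."""
    return bool(microcode_revisions(cpuinfo_text) & AFFECTED_REVISIONS)
```

On a live Linux machine you would pass it the file's contents, e.g. is_affected(open("/proc/cpuinfo").read()). Collecting revisions into a set means a box with mismatched cores still reports every revision it sees.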
For most affected people, a package update and reboot will ensure the fixed microcode is in place. The new microcode is also expected to appear on the AMD operating system team's website if you want to install it by hand.
The microcode flaw has so far reared its head on systems using QEMU-KVM for virtualization, but it may affect other hypervisors.
Due to Intel's dominance in the data center and virtualization world, this AMD-specific bug is not going to cause widespread chaos. However, it may give some people grief. For one thing, the code gremlin managed to nip the OpenSUSE Linux project, which is cosponsored by AMD.
An OpenSUSE build server that sports an Opteron 6348 processor with microcode version 0x06000836 hit a Linux kernel "oops" while running post-compilation tests on a fresh copy of GDB. The debugger's bytes barely had time to settle on the hard drive before the tests were killed by the underlying kernel.
Jiri Slaby, a SUSE Linux programmer, reported the weird crash to the Linux kernel mailing list at the end of February, and uploaded a bunch of diagnostic information for fellow developers to pore over.
The crash was bizarre and, we're told, couldn't be repeated: while running tests on the newly built GDB debugger, the processor entered kernel mode and suddenly careered off course. Like a car hitting some black ice, it slid off the road and smashed into a tree. It stopped executing the code it was supposed to be running, and instead slammed into a page of memory that had been wisely marked non-executable (NX) because it contained a critical kernel data structure rather than actual code. That collision triggered a fault, which was flagged up as a potential kernel bug, and the running process was killed.
At the time of the crash, the kernel was leaving an internal function called ttwu_stat(), which updates some of the scheduler's accounting statistics. It is harmless. Its instructions aren't that complicated: just some compares, additions, and stack popping and pushing. It's called from the scheduler function try_to_wake_up().
Then a clue was spotted. A scrap of torn red silk left at the GDB process's murder scene. Before ttwu_stat() is called, the kernel function try_to_wake_up() does a bunch of stuff that includes this instruction:
mov $0x16e80,%r15
This moves the hexadecimal value 0x16e80 into the CPU core's R15 register. Soon after, ttwu_stat() is called, which pushes R15 and other registers onto the stack.
What's a stack?
Think of a stack as a pile of cafeteria trays: you push a tray, or a value, onto the stack, and you pop a tray, or value, off the stack. If you push 1, then 4, then 5, and finally 2 onto the stack in that order, you'll pop them off in the order 2, 5, 4, and 1. If you push the contents of R15 and then, say, R14 onto the stack, the next value you pop off will be R14's.
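The tray analogy in the "What's a stack?" box translates directly into code. This short Python sketch uses a list as a stack, with append() as push and pop() as pop; the R14 value is a made-up placeholder.

```python
def push_then_pop_all(values: list[int]) -> list[int]:
    """Push every value in order, then pop until empty; return the pop order."""
    stack: list[int] = []
    for value in values:
        stack.append(value)          # push onto the top of the pile
    popped = []
    while stack:
        popped.append(stack.pop())   # pop from the top: last in, first out
    return popped


print(push_then_pop_all([1, 4, 5, 2]))  # [2, 5, 4, 1]

# Saving two registers: push R15's value, then R14's (R14's is a placeholder).
stack = [0x16E80]         # push R15
stack.append(0x2222)      # push R14
print(hex(stack.pop()))   # 0x2222: R14's value comes off first
```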
At the end of ttwu_stat(), the saved registers, including R15, are popped off the stack. This means R15 should have the same value on leaving ttwu_stat() as it did on entering the function: specifically, 0x16e80. Whatever the function did to R15 in the meantime, the register's original value should be restored on the way out.
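Here is a minimal Python model of that save-and-restore pattern. It assumes the function saves RBP, R15 and R14 on entry, in that order; the RBP and R14 values, and whatever the body scribbles into the registers, are hypothetical.

```python
def call_ttwu_stat_model(regs: dict[str, int]) -> dict[str, int]:
    """Toy model of a function that saves and restores RBP, R15 and R14."""
    stack: list[int] = []
    # Prologue (assumed order): save each register the function will reuse.
    for name in ("rbp", "r15", "r14"):
        stack.append(regs[name])
    # Body: the function is free to overwrite these registers.
    regs["r15"] = 0xDEADBEEF
    regs["r14"] = 0
    # Epilogue: pop in reverse order, so every register gets its own value back.
    for name in ("r14", "r15", "rbp"):
        regs[name] = stack.pop()
    return regs


before = {"rbp": 0x1000, "r15": 0x16E80, "r14": 0x2000}  # RBP/R14 are placeholders
after = call_ttwu_stat_model(dict(before))
print(hex(after["r15"]))  # 0x16e80
```

As long as the pushes and pops pair up, the body can do anything it likes to R15 and the caller still sees 0x16e80 afterwards.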
Let's look at the "oops" report generated by the kernel, which reveals the contents of all the registers at the time of the exception:
RIP: 0010:[<ffff88023fd40000>] [<ffff88023fd40000>] 0xffff88023fd40000
RSP: 0018:ffff8800bb2a7c50 EFLAGS: 00056686
RAX: 00000000bb37e180 RBX: 0000000000000001 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: ffff88023fdd6e80 RDI: ffff88023fdd6e80
RBP: ffffffff810a535a R08: 0000000000000000 R09: 0000000000000020
R10: 0000000001b52cb0 R11: 0000000000000293 R12: 0000000000000046
R13: ffff8800bb37e180 R14: 0000000000016e80 R15: ffff8800bb2a7c80
R15 should be 0x16e80, but it's actually 0xffff8800bb2a7c80, and R14 is 0x16e80. That's not right at all. In ttwu_stat(), R15 is pushed onto the stack, then R14. At the end of the function, R14 pulls its contents off the stack, and then R15 does the same. But in this case, R14 has popped R15's value instead of its own. Something's amiss: the stack is in an unexpected state.
ttwu_stat()'s final instructions are:
pop %r14
pop %r15
pop %rbp
retq
That's supposed to restore the contents of the R14, R15, and RBP registers from the stack in that order, and then pull another value off the stack: the location in try_to_wake_up() that ttwu_stat() is supposed to return to. The final retq instruction pops this return address and jumps to it.
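To make the retq step concrete, here is a hedged Python sketch of those four instructions at the level of the stack pointer. Memory is modelled as a dict of 8-byte slots keyed by address. On x86-64 the stack grows downward, so each pop reads the slot at RSP and then adds 8 to RSP, and retq behaves like a pop into the instruction pointer. The slot addresses and the saved R14 and RBP values are invented placeholders; only 0x16e80 and the return address 0xffffffff810a535a come from the crash analysis.

```python
SLOT = 8  # bytes per stack slot on x86-64


def run_epilogue(memory: dict[int, int], rsp: int) -> dict[str, int]:
    """Model pop %r14; pop %r15; pop %rbp; retq. "rip" receives retq's pop."""
    regs = {}
    for reg in ("r14", "r15", "rbp", "rip"):
        regs[reg] = memory[rsp]  # a pop reads the slot RSP points at...
        rsp += SLOT              # ...then moves RSP up by eight bytes
    regs["rsp"] = rsp
    return regs


base = 0x1000  # hypothetical address of the top of ttwu_stat()'s saved slots
frame = [
    0x2222,              # saved R14 (placeholder)
    0x16E80,             # saved R15, the value set by try_to_wake_up()
    0x3333,              # saved RBP (placeholder)
    0xFFFFFFFF810A535A,  # return address into try_to_wake_up()
]
memory = {base + i * SLOT: value for i, value in enumerate(frame)}

regs = run_epilogue(memory, base)
print(hex(regs["r15"]), hex(regs["rip"]))  # 0x16e80 0xffffffff810a535a
```

With an intact stack, every register gets its own saved value and retq lands back in try_to_wake_up().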
But, whoops, RBP contains 0xffffffff810a535a, which is the return address we want. The retq instruction was expecting that value, but instead it got whatever was next on the stack.
This confirms the stack is off by one 64-bit register, or eight bytes: the value for R15 was popped into R14, the real return address was popped into RBP, and a previously stacked value was popped by retq as a return address and jumped to. That explains why the kernel took off in a seemingly random direction – it tried using a pointer to data from the stack as a legit address to execute code.
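A toy model of the stack can replay the crash. This Python sketch assumes the slot contents that the oops report implies under the off-by-one theory: the saved RBP is taken to be the value that ended up in R15, and the slot above the return address is taken to be the value that ended up in RIP. The saved R14 and all the addresses are hypothetical. Running the epilogue with the stack pointer one slot (eight bytes) too high reproduces all four bad register values from the report; that shows the shift is consistent with the evidence, not what caused it.

```python
SLOT = 8  # bytes per stack slot on x86-64


def run_epilogue(memory: dict[int, int], rsp: int) -> dict[str, int]:
    """pop %r14; pop %r15; pop %rbp; retq, with "rip" standing in for retq."""
    regs = {}
    for reg in ("r14", "r15", "rbp", "rip"):
        regs[reg] = memory[rsp]  # read the slot at the stack pointer...
        rsp += SLOT              # ...then move the stack pointer past it
    regs["rsp"] = rsp
    return regs


base = 0x1000  # hypothetical address of the saved-R14 slot
frame = [
    0x2222,              # saved R14 (placeholder; not visible in the report)
    0x16E80,             # saved R15
    0xFFFF8800BB2A7C80,  # saved RBP, inferred from R15 in the oops report
    0xFFFFFFFF810A535A,  # return address into try_to_wake_up()
    0xFFFF88023FD40000,  # older stacked value, inferred from RIP in the report
]
memory = {base + i * SLOT: value for i, value in enumerate(frame)}

good = run_epilogue(memory, base)        # intact stack: R15 gets 0x16e80 back
bad = run_epilogue(memory, base + SLOT)  # stack pointer 8 bytes too high

for name in ("rip", "rbp", "r14", "r15"):
    print(f"{name.upper()}: {bad[name]:016x}")
# RIP: ffff88023fd40000
# RBP: ffffffff810a535a
# R14: 0000000000016e80
# R15: ffff8800bb2a7c80
```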
While ttwu_stat() was running, something else tampered with the stack pointer: the special register that keeps track of where in memory values are pushed to and popped from the stack. Something invisible added an extra eight bytes to the stack pointer, as if a value had been silently popped off and thrown away. A poltergeist spilling cafeteria trays of register values all over the floor in the middle of the night.
You get the idea.