Disk firmware can kill a whole cluster how exactly? Cisco explains
UCS and HyperFlex owners at risk of outages thanks to faulty firmware
Cisco’s issued a Field Notice warning that its UCS servers and hyperconverged HyperFlex kit could be brought low by disk drive firmware.
The Notice describes “A drive firmware issue on select Self-Encrypting Drives”. The Register counts 16 different disk SKUs that could cause problems on UCS servers and one for HyperFlex.
The faulty firmware “might be triggered when the drive is subjected to a low-write, long-idle-time workload, which could result in uncorrectable drive-level errors.” On UCS servers the machines will show read errors.
On HyperFlex rigs, management software will blacklist the drive, a step Cisco explains “could be a transitional state while the cluster attempts to repair the disk.”
But while the cluster tries to fix the drive, it could be “permanently blacklisted, which could trigger frequent drive replacements.”
And now for the worst-case scenario:
“While the HyperFlex HX Data Platform software protects against drive failures, there is a potential for the cluster to fail after multiple, simultaneous drive failures.”
There’s a fix – phew! – in the form of updated HyperFlex HX Data Platform and UCS Manager software, with downloads, diagnostics and more at the Field Notice.
Cisco doesn’t identify who makes the disks it ships with its UCS and HyperFlex kit, but as the company’s schtick with servers is Just Making Stuff That Works, it’s likely the company bears at least some responsibility for the situation. ®