Disk firmware can kill a whole cluster how exactly? Cisco explains

UCS and HyperFlex owners at risk of outages thanks to faulty firmware

Cisco has issued a Field Notice warning that its UCS servers and hyperconverged HyperFlex kit could be brought low by disk drive firmware.

The Notice warns of “A drive firmware issue on select Self-Encrypting Drives”. The Register counts 16 different disk SKUs that could cause problems on UCS servers and one for HyperFlex.


The faulty firmware “might be triggered when the drive is subjected to a low-write, long-idle-time workload, which could result in uncorrectable drive-level errors.” On UCS servers, affected machines will report read errors.

On HyperFlex rigs, management software will blacklist the drive, a step Cisco explains “could be a transitional state while the cluster attempts to repair the disk.”

But while the cluster tries to fix it, the drive could end up “permanently blacklisted, which could trigger frequent drive replacements.”

And now for the worst-case scenario:

“While the HyperFlex HX Data Platform software protects against drive failures, there is a potential for the cluster to fail after multiple, simultaneous drive failures.”

Yikes!

There’s a fix – phew! – in the form of updated HyperFlex HX Data Platform and UCS Manager software, with downloads, diagnostics and more at the Field Notice.

Cisco doesn’t identify who makes the disks it ships with its UCS and HyperFlex kit, but as the company’s schtick with servers is Just Making Stuff That Works, it’s likely the company bears at least some responsibility for the situation. ®




Biting the hand that feeds IT © 1998–2018