How a power blip briefly broke GitHub's boxes and tripped it offline

git-blame -l -t

Exactly how a momentary power failure managed to trigger a two-hour GitHub outage has been revealed in full today.

The popular source-code-hosting service fell off the internet last Wednesday, and soon after blamed the downtime on a "brief power disruption at our primary data center" that "caused a cascading failure." For those who keep the lights on in server warehouses and want to know what went wrong, here's a summary of how it went down, literally:

  1. At 0023 UTC on January 28, the power supply equipment in GitHub's main data center suffered a brief disruption that caused 25 per cent of the website's machines and networking gear to reboot. This triggered a bunch of alerts to on-call engineers.
  2. Load-balancing devices and front-end application servers largely managed to stay up, but couldn't reach the backend systems that remained unavailable after the reboot, leaving public-facing web servers to throw up the unicorn-of-fail page for users.
  3. The internal chat system used by GitHub staff was also knackered by the power blip, hampering attempts to organize a recovery for a short while. For that reason, engineers were late to raise the alarm on status.github.com.
  4. The team gradually worked out that a chunk of systems had rebooted: several servers' uptimes were measured in minutes (the sort of check sketched below this list). Meanwhile, some backend database machines had disappeared entirely, and the app servers relying on them were failing to start.
  5. All of the offline Redis database machines used a particular hardware spec, and were spread out across rows of server racks. Connecting to their serial consoles revealed they had died during boot-up because their physical drives were no longer recognized by the firmware. Gulp. Technicians had to manually disconnect the boxes from their power supplies, plug them back in, and power them on in an attempt to restore them.
  6. Meanwhile, another team was trying to rebuild the missing Redis clusters on a second set of machines, an effort hampered by the fact that vital information was stuck on the dead hardware. Eventually, the standby Redis servers were up and running without any data loss, which allowed the app servers to start up properly.
  7. Two hours and six minutes after the start of the outage, the website recovered.
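
Spotting which machines had silently bounced (step four) mostly comes down to comparing uptimes against the incident window. Below is a minimal, hypothetical Python sketch of that check; the host names and uptime figures are invented, and real tooling would pull them from a monitoring or inventory system rather than a hard-coded table.

    # Hypothetical sketch: flag boxes whose uptime says they rebooted during
    # the incident window. Hosts and uptimes are invented; real tooling would
    # pull them from a monitoring or inventory system.
    from datetime import timedelta

    uptime_seconds = {          # hostname -> uptime reported by the box
        "fe-app-01": 1_870_000, # months of uptime, untouched by the blip
        "redis-07": 240,        # four minutes: rebooted during the incident
        "redis-12": 310,
    }

    INCIDENT_WINDOW = timedelta(minutes=30)

    rebooted = sorted(
        host for host, secs in uptime_seconds.items()
        if timedelta(seconds=secs) < INCIDENT_WINDOW
    )

    print("Recently rebooted:", rebooted)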

"We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, but we can take steps to ensure recovery occurs in a fast and reliable manner," GitHub engineer Scott Sanders wrote in a blog post on Wednesday explaining the cascade of failures. "We can also take steps to mitigate the negative impact of these events on our users.

"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet. Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment."

Sanders also said work will be carried out to make GitHub's app servers more resilient the next time its backend systems fall over.
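
Sanders didn't spell out what that resilience work looks like. A common pattern, sketched below in Python with the redis-py client, is to treat an unreachable backend as a degraded mode rather than a fatal error: put a short timeout on the connection, catch the failure, and fall back to a slower or partial response. Host and function names here are hypothetical, not GitHub's code.

    # Minimal sketch of degrading gracefully when a backend cache is down,
    # rather than refusing to serve at all. Host and function names are
    # hypothetical; this is not GitHub's implementation.
    import redis

    def load_from_primary_store(user_id: str) -> str:
        # Placeholder for the slower, authoritative lookup path.
        return f"user:{user_id}"

    def fetch_profile(user_id: str) -> dict:
        try:
            r = redis.Redis(host="redis-standby.internal", port=6379,
                            socket_connect_timeout=0.5, socket_timeout=0.5)
            cached = r.get(f"profile:{user_id}")
            if cached is not None:
                return {"source": "cache", "data": cached.decode()}
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            # Backend unreachable or slow: fall through and serve a degraded
            # response instead of dying at startup or request time.
            pass
        return {"source": "fallback", "data": load_from_primary_store(user_id)}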

"All of us at GitHub would like to apologize for the impact of this outage," he added. ®
