GitLab crawling back online after breaking its brain in two
Database replication SNAFU took down three out of four PostgreSQL servers
In a classic example of the genre, GitLab yesterday dented its performance by accidentally triggering a database failover.
The resulting “split-brain problem” has left the code-collection trying to serve its users out of a single database server, postgres-02, while it sorts out the remaining three.
The problem first arose at around 1:30am UTC on Thursday, and the resulting rebuilds are continuing.
We are currently investigating decreased performance and errors on https://t.co/r11UmmDLDE due to database load.— GitLab.com Status (@gitlabstatus) April 26, 2018
When the accidental failover was triggered, Alex Hanselka wrote that while the fleet “continued to follow the true primary”, the event was apparently painful:
“We shut down postgres-01 since it was the rogue primary. In our investigation, both postgres-03 and postgres-04 were trying to follow postgres-01. As such, we are rebuilding replication on postgres-03 as I write this issue and then postgres-04 when it is finished.”
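For readers unfamiliar with the failure mode, the state described in that quote can be sketched in a few lines of Python. This is an illustration only, not GitLab's tooling: the `Node` class, the `find_split_brain` helper and the snapshot values are hypothetical, and treating postgres-02 as the true primary is an assumption drawn from the fact that it was the server left serving traffic. On a real cluster, whether a node believes it is a primary could be read from PostgreSQL's `pg_is_in_recovery()` function.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    is_primary: bool        # e.g. the negation of PostgreSQL's pg_is_in_recovery()
    follows: Optional[str]  # upstream the node replicates from; None for a primary


def find_split_brain(nodes: List[Node], true_primary: str) -> Tuple[List[str], List[str]]:
    """Return (rogue primaries, replicas following anything but the true primary)."""
    rogue = sorted(
        n.name for n in nodes if n.is_primary and n.name != true_primary
    )
    misdirected = sorted(
        n.name for n in nodes if not n.is_primary and n.follows != true_primary
    )
    return rogue, misdirected


# Hypothetical snapshot modelled on the quote; postgres-02 as the
# true primary is an assumption, not something the ticket states here.
cluster = [
    Node("postgres-01", is_primary=True, follows=None),
    Node("postgres-02", is_primary=True, follows=None),
    Node("postgres-03", is_primary=False, follows="postgres-01"),
    Node("postgres-04", is_primary=False, follows="postgres-01"),
]

rogue, to_rebuild = find_split_brain(cluster, true_primary="postgres-02")
print("rogue primaries:", rogue)
print("replicas to rebuild:", to_rebuild)
```

Under those assumptions the sketch flags postgres-01 as the rogue primary and postgres-03 and postgres-04 as the replicas whose replication needs rebuilding, which matches the order of work described in the ticket.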
We are continuing to investigate the performance degradation on GitLab. For more details see https://t.co/9ebGTqgY9b— GitLab.com Status (@gitlabstatus) April 26, 2018
Also impacting performance are a backup (needed because there hadn't been a full pg_basebackup since before the failover) and the shutdown of GitLab's Sidekiq cluster, which was stopped because it causes large queries.
That was the situation when things first broke: nearly 20 hours later, the ticket hasn't been closed.
For a start, the backup of postgres-03 ran at 75GB per hour and took until after 23:00 (11pm) to complete. There are still other database tasks to complete, but performance is starting to return to normal, according to posts from Andrew Newdigate.
CI/CD queues are back to normal state since 21:30 UTC. Pipelines are being handled at the standard speed now.— GitLab.com Status (@gitlabstatus) April 26, 2018
There's also a timeline here.
At least the backups are working. In February 2017, a data replication error was compounded by backup failures: “So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place”.
The missing data was found on a staging server, and after much soul-searching, marketing veep Tim Anglade told The Register that GitLab understood its role as “a critical place for peoples' projects and businesses”.
Working backups, it has to be said, indicate at least some of the lessons were learned. ®