MAID to order: Inside Facebook's cold-storage data ziggurats

Silent iron halls where ancient cat pics slumber undisturbed

Analysis Facebook is storing old photos in special cold storage halls, Zucky ziggurats housing racks filled with MAID (massive arrays of idle drives) using erasure coding and anti-bitrot scanning to increase storage density and lower power costs while providing faster-than-tape access.

Some two billion photos are shared on Facebook every day, and the number is growing, though Facebook won't say how fast. All of them need to be instantly available, so it has built itself two specialised facilities from the ground up to store them.

Zuck's ziggurats are located in data centres at Prineville, Oregon, and Forest City, North Carolina. Each rack holds 480 consumer-grade 4TB disk drives, 1.92PB per rack, and each hall can store an exabyte of data across roughly 520 racks. Very heavy racks too, at 1,100kg each.
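The rack and hall arithmetic checks out, as a quick back-of-envelope sketch shows (decimal units, as drive vendors use):

```python
# Back-of-envelope capacity maths for the cold storage racks,
# using decimal (SI) units, as disk vendors do.
DRIVES_PER_RACK = 480
DRIVE_TB = 4

rack_pb = DRIVES_PER_RACK * DRIVE_TB / 1000        # terabytes -> petabytes
print(f"Per rack: {rack_pb} PB")                   # 1.92 PB

racks_per_exabyte = 1000 / rack_pb                 # 1 EB = 1000 PB
print(f"Racks per exabyte: {racks_per_exabyte:.0f}")  # roughly 520
```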

The design is based on the Open Compute Project's OpenVault specification, extended to reduce both the amount of power needed and the amount of disk space required to protect against data loss.

Power reduction

The cold storage halls have no battery backup, no uninterruptible power supplies and no generators, the thinking being that such kit is needed to protect critical live production data, not old photos.

All in all, the cold storage ziggurats are equipped with less than one-sixth of the power available to a traditional Facebook data centre.

In the storage racks only one drive per tray is powered on at a time with drive controller firmware modified to enforce this.
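Facebook hasn't published that firmware, but the invariant it enforces can be sketched. Everything below is illustrative, not Facebook's code: at most one drive per tray is ever spun up, and accessing another drive means spinning the current one down first.

```python
# Hypothetical sketch of the "one powered drive per tray" rule that the
# modified drive-controller firmware enforces. All names are ours; the
# real logic lives in firmware, not Python.
class Tray:
    def __init__(self, num_drives):
        self.num_drives = num_drives
        self.powered = None  # index of the single spun-up drive, or None

    def access(self, drive_index):
        """Spin up the requested drive, spinning down any other first."""
        if self.powered is not None and self.powered != drive_index:
            self.spin_down(self.powered)
        self.powered = drive_index
        return f"reading drive {drive_index}"

    def spin_down(self, drive_index):
        pass  # firmware would cut power to the drive here

tray = Tray(num_drives=16)
tray.access(3)
tray.access(7)
assert tray.powered == 7  # never more than one drive drawing power
```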

Facebook cold storage disk tray

This enables the number of fans per storage node to go from six to four, three power shelves to be reduced to one, and seven power supplies in the shelf to go down to five. These changes mean that a rack only needs one Open Rack busbar instead of three.

Servers power on as needed, being managed by intelligent software that constantly verifies and rebalances data to optimize durability. This means more rack space for storage shelves with the racks operating at one-quarter the power usage of conventional storage servers.

Disk capacity saving

The calculations here were concerned with getting data durability at much less capacity cost than RAID, in fact with fewer than two copies of the data. Facebook wanted to get rid of single points of failure and be able to recover its entire system as economically as possible.

It decided that it would use its metadata store for data recovery when there was a catastrophic failure; the data being self-describing in this sense.

Reed-Solomon error correction codes are used in an erasure coding scheme. A 1GB file is divided into 10 × 100MB chunks, with four extra 100MB parity chunks computed from the Reed-Solomon codes, making 1.4GB of raw capacity. The original 1GB file can be reconstructed from any 10 of the 14 chunks – that is, up to four can be lost – so long as the 14 chunks are stored in different failure domains.
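Facebook hasn't published its encoder, and real Reed-Solomon coding over 10+4 chunks is more involved, but the shape of the idea can be sketched. This simplified illustration uses a single XOR parity chunk, which tolerates the loss of any one chunk; Reed-Solomon generalises the same principle to four parity chunks tolerating any four losses:

```python
# Simplified erasure-coding sketch: split data into k chunks, add
# parity, and rebuild a lost chunk from the survivors. One XOR parity
# chunk stands in here for the four Reed-Solomon parity chunks the
# article describes.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    chunk_len = -(-len(data) // k)  # ceiling division
    chunks = [data[i*chunk_len:(i+1)*chunk_len].ljust(chunk_len, b"\0")
              for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]        # k data chunks + 1 parity chunk

def recover(chunks, lost_index):
    """Rebuild the chunk at lost_index by XORing all the survivors."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    return reduce(xor_bytes, survivors)

data = b"old cat pics" * 100
stored = encode(data, k=10)
rebuilt = recover(stored, lost_index=3)
assert rebuilt == stored[3]  # the lost chunk comes back exactly
```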

There is an extra layer of validation; the software creates, maintains, and rechecks checksums constantly to validate integrity. A checksum copy is stored next to the data itself for fast verification and replication somewhere else if an error is detected.
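The checksum-next-to-data idea can be sketched in a few lines; the function names here are illustrative, not Facebook's:

```python
# Sketch of storing a digest alongside each chunk: verify on read, and
# a mismatch flags the chunk for re-replication from another copy.
import hashlib

def store(chunk: bytes):
    return {"data": chunk, "checksum": hashlib.sha256(chunk).hexdigest()}

def verify(record) -> bool:
    return hashlib.sha256(record["data"]).hexdigest() == record["checksum"]

rec = store(b"ancient cat pic")
assert verify(rec)                   # intact data passes

rec["data"] = b"bit-rotted cat pic"  # simulate silent corruption
assert not verify(rec)               # mismatch triggers re-replication
```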

Facebook also carries out anti-bitrot scanning looking at all the data in all the drives and reporting corruptions. A full drive scan takes 30 days and the engineers reckon they can re-create lost data successfully by using a separate process. This reads enough data to reconstruct the missing part and write it to new drives elsewhere in the facility.
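A 30-day full scan of a 4TB drive implies only a trickle of scrub bandwidth, which is why verification barely disturbs the mostly idle drives; a quick calculation:

```python
# Implied anti-bitrot scrub rate: a full 4TB drive scanned over 30 days
# needs only about 1.5 MB/s of sustained read bandwidth.
DRIVE_BYTES = 4 * 10**12        # 4TB in decimal units
SCAN_SECONDS = 30 * 24 * 3600   # 30 days

rate_mb_s = DRIVE_BYTES / SCAN_SECONDS / 10**6
print(f"{rate_mb_s:.2f} MB/s")  # about 1.5 MB/s per drive
```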

Overall, Facebook says it's been able to reduce reconstruction from hours to minutes as well as reducing the data protection capacity tax.

The system constantly looks at balancing the data load across the available capacity, assuming new capacity is being added all the time. That means no special data migrations for new capacity are needed.
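One plausible way to get that migration-free behaviour is capacity-aware placement: always write new chunks to the emptiest rack, so freshly added racks fill up naturally. This is our hypothetical sketch of the idea, not Facebook's algorithm:

```python
# Hypothetical capacity-aware placement: new chunks go to the rack with
# the lowest fill ratio, so new capacity absorbs new data automatically
# and no special migration step is needed.
def place_chunk(racks, chunk_size):
    target = min(racks, key=lambda r: r["used"] / r["capacity"])
    target["used"] += chunk_size
    return target["name"]

racks = [
    {"name": "rack-1", "capacity": 1920, "used": 1800},  # nearly full
    {"name": "rack-2", "capacity": 1920, "used": 40},    # freshly added
]
assert place_chunk(racks, chunk_size=10) == "rack-2"
```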

The mighty ad platform is also evaluating alternative media, like Blu-ray optical discs and low-endurance flash - meaning triple-level cell, we assume - and is looking at spreading files across multiple data centres for extra resilience.

Is this dedicated fast-access cold storage applicable outside Facebook's special needs? We would say yes, potentially, if you needed fast-access cold storage which is destined to grow to 100PB and beyond.

Perhaps the OCP OpenVault specification will be amended to reflect what Facebook has done in its own facilities, so that cloud service providers and suppliers with similar needs can make use of it. ®
