Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage

Still doing 'forensics' on the caching layer – Beeb digi wonk

BBC techies have no idea why the load on its database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup.

During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply post occasional tweets apologising for the disruption and promising to restore access soon.

The silence infuriated some iPlayer fans who demanded more information about when the system would be fixed.

On Tuesday, the BBC's digital distribution controller, Richard Cooper, tried to explain why the Corporation's popular catch-up TV and radio player and its main website had frozen Brits out of accessing the services.

He said in a blog post that the cause of the outage remained a bit of a mystery.

The BBC has a system made up of 58 application servers and 10 database servers providing programme and clip metadata, Cooper said.

"This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online," he added. "This system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres."

He said that the "load on the database went through the roof" on Saturday morning (19 July), at which point requests for metadata to the application servers started to drop off.

Cooper explained:

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
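The behaviour Cooper describes - products carrying on with cached metadata while they try to revalidate it, and the homepage dropping to a simplified emergency version when even that fails - is a classic serve-stale-then-degrade pattern. A minimal sketch, with every name hypothetical since the Beeb hasn't published its caching code:

```python
import time
from typing import Callable

EMERGENCY_PAGE = ("<html><body>Due to technical problems, we are displaying "
                  "a simplified version of the BBC Homepage</body></html>")

class MetadataCache:
    """Tiny product-level cache: serve fresh data when possible, fall back to
    stale data while revalidation fails, and only then degrade to an
    emergency response."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 60.0):
        self._fetch = fetch   # call through to the metadata/origin service
        self._ttl = ttl
        self._store = {}      # key -> (stored_at, value)

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry and time.time() - entry[0] < self._ttl:
            return entry[1]            # still fresh: serve straight from cache
        try:
            value = self._fetch(key)   # attempt to revalidate against the origin
        except Exception:
            if entry:
                return entry[1]        # origin is struggling: keep serving stale data
            return EMERGENCY_PAGE      # no cached copy either: simplified fallback page
        self._store[key] = (time.time(), value)
        return value
```

The older applications Cooper mentions, which read the metadata directly rather than through a cache like this, had no stale copy to fall back on - which is why they fell over first.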

The Beeb's techies struggled to restore the metadata service and Cooper added that isolating the source of the additional load on its database had proved to be "far from straightforward". He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."

Cooper said that the system remained wobbly throughout the weekend, with the BBC deciding to hold off further disruptive restoration work until Monday, when fewer people would be using the iPlayer.

It finally returned the iPlayer and BBC Online to normal service more than 48 hours after cracks in the system appeared.

Cooper admitted that viewers and listeners may have missed out on certain programmes as a result of the tech blunder.

"I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced," he said.

Meanwhile, the Beeb is yet to determine exactly what went wrong.

The outage came just days after the BBC's Internet Blog ran a post from its senior product manager Kiran Patel.

He celebrated the fact that it had been nearly a year since the Corporation announced that its internal Video Factory product - which "moved live processing into the cloud" - was taking over the production of all vid content for the iPlayer.

Ominously, he said: "I cannot promise we will be this fast all the time. There are times when things go wrong and delivery can be delayed. We have built Video Factory with resilience as its primary goal. So problems may delay delivery, but we ensure we never miss any content."

But it would seem that last weekend's caching and database failures scuppered any chance of viewers and listeners catching up on some of their favourite programmes. ®
