Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage

Still doing 'forensics' on the caching layer – Beeb digi wonk

BBC techies have no idea why the load on its database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup.

During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply post occasional tweets apologising for the disruption and promising to restore access soon.

The silence infuriated some iPlayer fans who demanded more information about when the system would be fixed.

On Tuesday, the BBC's digital distribution controller, Richard Cooper, tried to explain why the Corporation's popular catch-up TV and radio player and its main website had frozen Brits out of accessing the services.

He said in a blog post that the cause of the outage remained a bit of a mystery.

The BBC has a system made up of 58 application servers and 10 database servers providing programme and clip metadata, Cooper said.

"This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online," he added. "This system is split across two data centres in a "hot-hot" configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres."

He said that the "load on the database went through the roof" on Saturday morning (19 July), at which point requests for metadata to the application servers started to drop off.

Cooper explained:

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
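
In plainer terms: products that keep their own copy of the metadata can carry on serving a stale version while they try (and fail) to refresh it, whereas older products that query the service directly simply fall over; and once the caching pool in front of iPlayer and the homepage died, the remaining option was a static emergency page. The Python sketch below illustrates both patterns in miniature; the names (MetadataCache, pool_is_healthy, render_full_homepage) are invented for illustration and are not the BBC's actual code.

import time

class MetadataCache:
    # Product-level cache: serve stale metadata if revalidation fails
    def __init__(self, fetch, ttl=60):
        self.fetch = fetch      # callable that hits the metadata service
        self.ttl = ttl          # seconds before a cached entry needs revalidating
        self.store = {}         # key -> (value, fetched_at)

    def get(self, key):
        cached = self.store.get(key)
        if cached and time.time() - cached[1] < self.ttl:
            return cached[0]                        # still fresh, no call to the service
        try:
            value = self.fetch(key)                 # attempt to revalidate
            self.store[key] = (value, time.time())
            return value
        except Exception:
            if cached:
                return cached[0]                    # service struggling: serve stale data
            raise                                   # uncached (older) products fail outright

EMERGENCY_HTML = ("<p>Due to technical problems, we are displaying "
                  "a simplified version of the BBC Homepage</p>")

def homepage(pool_is_healthy, render_full_homepage):
    # Fall back to a static, dependency-free page when the caching pool is unhealthy
    if not pool_is_healthy():
        return EMERGENCY_HTML
    try:
        return render_full_homepage()   # normal path through the caching layer
    except Exception:
        return EMERGENCY_HTML           # pool fell over mid-request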

The Beeb's techies struggled to restore the metadata service and Cooper added that isolating the source of the additional load on its database had proved to be "far from straightforward". He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."

Cooper said that the system remained wobbly throughout the weekend, with the BBC deciding to hold off further disruptive recovery work until Monday, when fewer people would be accessing the iPlayer.

It finally returned the iPlayer and BBC Online to normal service more than 48 hours after cracks in the system appeared.

Cooper admitted that viewers and listeners may have missed out on certain programmes as a result of the tech blunder.

"I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced," he said.

Meanwhile, the Beeb is yet to determine exactly what went wrong.

The outage came just days after the BBC's Internet Blog ran a post from its senior product manager, Kiran Patel.

He celebrated the fact that it had been nearly a year since the Corporation announced that its internal Video Factory product - which "moved live processing into the cloud" - was taking over the production of all vid content for the iPlayer.

Ominously, he said: "I cannot promise we will be this fast all the time. There are times when things go wrong and delivery can be delayed. We have built Video Factory with resilience as its primary goal. So problems may delay delivery, but we ensure we never miss any content."

But it would seem that last weekend's caching and database failures scuppered any chance of viewers and listeners catching up on some of their favourite programmes. ®
