Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage

Still doing 'forensics' on the caching layer – Beeb digi wonk

BBC techies have no idea why the load on its database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup.

During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply post occasional tweets apologising for the disruption and promising to restore access soon.

The silence infuriated some iPlayer fans who demanded more information about when the system would be fixed.

On Tuesday, the BBC's digital distribution controller, Richard Cooper, tried to explain why the Corporation's popular catch-up TV and radio player and its main website had frozen Brits out of accessing the services.

He said in a blog post that the cause of the outage remained a bit of a mystery.

The BBC has a system made up of 58 application servers and 10 database servers providing programme and clip metadata, Cooper said.

"This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online," he added. "This system is split across two data centres in a 'hot-hot' configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres."
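The hot-hot arrangement Cooper describes can be sketched as follows. This is a minimal illustration under assumed names (the hostnames and the `transport` callable are made up, not the BBC's real API): both data centres take live traffic, so a client can retry a failed request against the other site.

```python
import random

# Two illustrative data centre hostnames: in a "hot-hot" setup, both
# serve live traffic and either can absorb the full load alone.
DATACENTRES = ["dc-a.example.internal", "dc-b.example.internal"]

def fetch_metadata(programme_id, transport):
    """Try the data centres in random order, spreading load across both.

    `transport` is a hypothetical callable(host, programme_id) standing in
    for the real HTTP call; it raises ConnectionError when a site is down.
    """
    last_error = None
    for host in random.sample(DATACENTRES, len(DATACENTRES)):
        try:
            return transport(host, programme_id)
        except ConnectionError as err:
            last_error = err  # this site is unhealthy; fail over to the other
    raise RuntimeError("all data centres failed") from last_error
```

The point of the design is that losing one site is routine; the weekend's trouble was that the database behind both sites was struggling at once.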

He said that the "load on the database went through the roof" on Saturday morning (19 July), at which point requests for metadata to the application servers started to drop off.

Cooper explained:

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
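The metadata behaviour Cooper describes in the first excerpt, where cached products "continue to serve content while attempting to revalidate" but older, uncached products fail, is a classic serve-stale-on-error pattern. A minimal sketch, with all names assumed rather than taken from the BBC's stack:

```python
import time

class MetadataCache:
    """Illustrative product-level cache: while revalidation against a
    struggling origin fails, a stale copy keeps being served; a product
    with nothing cached fails outright, like the older applications did.
    """

    def __init__(self, origin, ttl=60.0):
        self.origin = origin  # callable: key -> fresh value; raises when the origin is down
        self.ttl = ttl        # seconds before a cached entry needs revalidating
        self._store = {}      # key -> (value, fetched_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]                       # still fresh: no origin hit
        try:
            value = self.origin(key)              # attempt to revalidate
            self._store[key] = (value, time.time())
            return value
        except Exception:
            if entry is not None:
                return entry[0]                   # origin down: serve stale
            raise                                 # nothing cached: hard failure
```

This is why the outage hit products unevenly: anything with a warm cache degraded gracefully, while direct consumers of the metadata service fell over as soon as the database load spiked.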
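The emergency mode Cooper mentions is a second, cruder fallback: when the caching pool in front of the homepage fails, serve a pre-rendered static page rather than an error. A hypothetical sketch (names invented for illustration):

```python
# When the cache pool fronting the homepage is down, fall back to a
# static "simplified" page that needs no cache or database behind it.
EMERGENCY_PAGE = ("Due to technical problems, we are displaying a "
                  "simplified version of the BBC Homepage")

def render_homepage(cache_pool):
    """`cache_pool` is an illustrative zero-argument callable returning
    the full cached page, or raising when the pool has failed."""
    try:
        return cache_pool()
    except Exception:
        return EMERGENCY_PAGE  # static fallback, left up until the cache is stable
```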

The Beeb's techies struggled to restore the metadata service and Cooper added that isolating the source of the additional load on its database had proved to be "far from straightforward". He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."

Cooper said that the system remained wobbly throughout the weekend, so the BBC held off further disruptive remedial work until Monday, when fewer people would be using the iPlayer.

It finally returned the iPlayer and BBC Online to normal service more than 48 hours after cracks in the system appeared.

Cooper admitted that viewers and listeners may have missed out on certain programmes as a result of the tech blunder.

"I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced," he said.

Meanwhile, the Beeb is yet to determine exactly what went wrong.

The outage came just days after the BBC's Internet Blog ran a post from its senior product manager Kiran Patel.

He celebrated the fact that it had been nearly a year since the Corporation announced that its internal Video Factory product - which "moved live processing into the cloud" - was taking over the production of all vid content for the iPlayer.

Ominously, he said: "I cannot promise we will be this fast all the time. There are times when things go wrong and delivery can be delayed. We have built Video Factory with resilience as its primary goal. So problems may delay delivery, but we ensure we never miss any content."

But it would seem that last weekend's caching and database failures scuppered any chance of viewers and listeners catching up on some of their favourite programmes. ®
