Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage

Still doing 'forensics' on the caching layer – Beeb digi wonk

BBC techies have no idea why the load on their database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup.

During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply post occasional tweets apologising for the disruption and promising to restore access soon.

The silence infuriated some iPlayer fans who demanded more information about when the system would be fixed.

On Tuesday, the BBC's digital distribution controller, Richard Cooper, tried to explain why the Corporation's popular catch-up TV and radio player and its main website had frozen Brits out of accessing the services.

He said in a blog post that the cause of the outage remained a bit of a mystery.

The BBC has a system made up of 58 application servers and 10 database servers providing programme and clip metadata, Cooper said.

"This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online," he added. "This system is split across two data centres in a 'hot-hot' configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres."
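For readers unfamiliar with the term, the "run from either data centre" expectation can be sketched roughly as below. This is an illustrative Python sketch only, not the BBC's code: the function name `fetch_from_any_site` and the idea of modelling each data centre as a callable are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def fetch_from_any_site(sites: Sequence[Callable[[str], str]], key: str) -> str:
    """Ask each data centre in turn and return the first successful answer.

    Each entry in `sites` is a hypothetical callable standing in for one
    data centre's metadata endpoint.
    """
    last_error: Optional[Exception] = None
    for site in sites:
        try:
            return site(key)          # first site that answers wins
        except Exception as exc:      # this site is down: try the next one
            last_error = exc
    # Every site failed, so there is nothing left to fall back on.
    raise RuntimeError("all sites failed") from last_error
```

A scheme like this only helps when at least one site is healthy; it says nothing about a fault that hits something both sites depend on.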

He said that the "load on the database went through the roof" on Saturday morning (19 July), at which point requests for metadata to the application servers started to drop off.

Cooper explained:

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail.

At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front.

Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (“Due to technical problems, we are displaying a simplified version of the BBC Homepage”). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache.
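The product-level behaviour Cooper describes, where cached metadata keeps being served while the product tries to revalidate it, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Beeb's implementation: the class name `StaleOnErrorCache`, the `fetch` callable and the TTL parameter are all invented for the example.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve cached metadata, falling back to a stale copy if the backend fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical call to the metadata service
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable so the sketch is easy to test
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # fresh hit: no backend call needed
        try:
            value = self._fetch(key)   # missing or expired: try to revalidate
        except Exception:
            if cached is not None:
                return cached[1]   # backend in trouble: keep serving the stale copy
            raise                  # nothing cached at all: the request fails
        self._store[key] = (now, value)
        return value
```

A product that calls the metadata service directly, with no such layer, is in the position of the final `raise`: it has no stale copy to fall back on, which matches the "older applications" that Cooper says started to fail.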

The Beeb's techies struggled to restore the metadata service and Cooper added that isolating the source of the additional load on its database had proved to be "far from straightforward". He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."

Cooper said that the system remained wobbly throughout the weekend, with the BBC deciding not to further disrupt its service until Monday when fewer people would be accessing the iPlayer.

It finally returned the iPlayer and BBC Online to normal service more than 48 hours after cracks in the system appeared.

Cooper admitted that viewers and listeners may have missed out on certain programmes as a result of the tech blunder.

"I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced," he said.

Meanwhile, the Beeb is yet to determine exactly what went wrong.

The outage came just days after the BBC's Internet Blog ran a post from its senior product manager Kiran Patel.

He celebrated the fact that it had been nearly a year since the Corporation announced that its internal Video Factory product - which "moved live processing into the cloud" - was taking over the production of all vid content for the iPlayer.

Ominously, he said: "I cannot promise we will be this fast all the time. There are times when things go wrong and delivery can be delayed. We have built Video Factory with resilience, as its primary goal. So problems may delay delivery, but we ensure we never miss any content."

But it would seem that last weekend's caching and database failures scuppered any chance of viewers and listeners catching up on some of their favourite programmes. ®
