T-Mobile leaves 300,000 disconnected
Database corruption blamed
300,000 UK T-Mobile customers had a quiet morning as they were unable to make or receive calls thanks to a database snafu that forced the operator to restore from backups - a process which is still in progress.
The problems started at around 10am this morning, and meant that 300,000 customers couldn't be verified by the network and so were unable to make or receive calls. Restoration from a backup is already underway, and 200,000 customers are already back on the network (in London and the south east, naturally) - the remaining 100,000 should be working by 16:00 this afternoon.
Customers calling up T-Mobile had been told that the problem could take even longer to fix, though it seems that the rapid availability of backups has helped speed things along.
Network operators work hard to reduce critical points in the network, replicating key servers and infrastructure, but such procedures are much harder to implement where security and authentication are involved, leaving customers at the mercy of software bugs or hardware failure. ®
@anyone wondering why they didn't just switch to another server
... they may well have had another DB server to switch to.
Usually these things are set up with replication. If the replication target server accepts the same corruptions as prod (e.g. some muppet dropping a table), then it's feck all use: it's in the same condition as the original production one. The only solution is to have something which really does keep copies of the data somehow, or a way to roll back.
I'm pretty surprised they didn't have a few-hours-old snapshot somewhere for a DB of this criticality... or the ability to just pull the real DB off tape or disk fast enough for it not to make much difference either way. That's how I'd have done it (with some sort of snapshot, flashcopy or whatever).
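The commenter's point above can be sketched in a toy example (purely illustrative — nothing here reflects T-Mobile's actual systems, and all names are made up): a replica that faithfully replays the primary's operation log also replays the destructive operation, whereas a point-in-time snapshot lets you roll back.

```python
import copy

class ToyDB:
    """A toy key-value 'database' holding named tables."""
    def __init__(self):
        self.tables = {}

    def apply(self, op, table, rows=None):
        # Replicas apply the same operation log as the primary,
        # so a bad op (someone dropping a table) hits both.
        if op == "create":
            self.tables[table] = list(rows or [])
        elif op == "drop":
            del self.tables[table]

primary, replica = ToyDB(), ToyDB()

oplog = [("create", "subscribers", ["alice", "bob"])]
for op in oplog:
    primary.apply(*op)
    replica.apply(*op)          # replication: replay the same ops

snapshot = copy.deepcopy(primary.tables)   # periodic point-in-time copy

bad_op = ("drop", "subscribers", None)     # the corruption event
primary.apply(*bad_op)
replica.apply(*bad_op)                     # replica faithfully replays it

assert "subscribers" not in replica.tables # replication was no help here
primary.tables = copy.deepcopy(snapshot)   # restore from the snapshot
assert primary.tables["subscribers"] == ["alice", "bob"]
```

The same logic applies whatever the storage layer: synchronous replication protects against hardware failure, not against logically valid but destructive writes; only an independent copy (snapshot, backup, or delayed replica) gives you a roll-back point.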
Paris 'cause maybe they'll think before acting in future.
@TMS9900, totally agree mate, but hey, the vast majority of the UK population are so emotionally weak that they go to pieces if someone isn't texting or phoning them every two minutes - and these are the same bunch that are on Facebook day in, day out.
I carry a mobile yeah but wouldn't be devastated if O2 went offline for a bit.
Oh and btw, O2 have better coverage than T-Mobile because of the lower frequency and more base stations.
Oh and to the NO2ID crowd: a mobile can be used to track your almost precise location, so why do you carry one?
It's the HLR
I happen to know that T-Mobile are upgrading their HLRs around about now, and it sounds like this is what caused the problem. Having a backup close to hand sounds a lot like -- we migrated, it was fecked, we rolled back.
There's not really much you can do in these instances. HLRs are a bit old skool (at least the ones T-Mobile were replacing) and are designed to be highly available, highly resilient in their own right. But, if you're swapping from one to another -- there's always a chance things can go wrong.
To be honest, it seems like they did a pretty good job of containing the issue.
@Danny: SQL Server? Are you having a laugh? That's _certainly_ not carrier grade. If the HLRs were SQL Server based, you'd never connect a call!
@Yorkshirepudding: that's really just physics, and not much T-Mobile can do about it ("ya cannae change tha laws of physics, cap'n", etc. etc.). T-Mobile runs at 1800MHz, while O2 (and Voda) are on 900MHz. The lower frequency has greater penetration. Hence O2 and Voda customers can use their phones where Orange and T-Mobile customers can't. As far as I'm aware, all networks are using 2100MHz for 3G, so are all as fecked as one another in that area.
Paris, 'cos we all know when she goes down.
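The frequency point in the comment above can be quantified with the standard free-space path-loss formula (a deliberate simplification — real-world coverage also depends on building penetration and base-station density, which the formula ignores):

```python
import math

def fspl_db(distance_km, freq_mhz):
    """Free-space path loss in dB: 20*log10(d_km) + 20*log10(f_MHz) + 32.44."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44

loss_900 = fspl_db(5, 900)     # O2 / Vodafone GSM band
loss_1800 = fspl_db(5, 1800)   # T-Mobile / Orange GSM band

# Doubling the frequency costs a fixed 20*log10(2) ~= 6 dB of extra
# path loss at the same distance, regardless of the distance chosen.
assert abs((loss_1800 - loss_900) - 20 * math.log10(2)) < 1e-9
```

So, all else being equal, the 1800MHz networks start roughly 6dB down on the 900MHz ones before building losses (which also worsen with frequency) are even considered.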