3 pre-pay database server has a Sunday snooze
Engineers at 3 are busy rebuilding the firm's pay-as-you-go top-up database after the server carrying it fell over on Sunday.
Customers were left unable to add new credit for several hours, whether over the phone, online or in 3 stores. A 3 spokesman said today that the firm took action to ensure pay-as-you-go users could still make calls if they had run out of credit.
Billing systems are now back up, and the work being carried out to restore the database means no credit will be lost. The actual 3G network was unaffected by the downtime, the spokesman said.
3 sent us this apology and some more details:
We experienced problems with our ThreePay billing systems on Sunday which meant customers were not able to add credit to their accounts, send some messages and call some premium-rated and international numbers. Customers were able to make and receive normal calls throughout.
Our core top-up systems have been restored and we are now updating the information on our billing systems. There may be some impact on customers calling in to top up for a short time, while the systems are updated. We apologise for any inconvenience to those prepay customers that were affected.
Angry Reg reader Mike Benso got in touch to say he hasn't been able to contact anyone, including 3 customer service, by phone or text since the system went down. "I'm sitting on £50 credit. It's inexcusable," he said. ®
And T-Mobile payments are down as I read this
Is T-Mobile out in sympathy? I can't use the payment system for either PaYG or contract.
It's STILL down
So it's now two days later, and I still can't top up. How long can it take to sort out these problems? I was thinking about a move away from 3; this has made up my mind.
Paris, because she also goes down and stays down
They do have resilient systems
I think some spokesman has oversimplified the reason behind the problems. As an ex-3 employee, I can say all their databases have resiliency, backups and so on. Even the lowliest database I was aware of was pretty well protected.
There are other problems though. For example:
With big 24x7 databases, there is a significant delay in recovering terabytes of data from tape, particularly when everyone else still needs to be backed up at the same time. Maybe a tape has already gone to offsite storage, which imposes a nasty delay of its own: even though the tape management company says it can get it back in half a shake, it always seems to take three days to get the wrong tape out.
There is probably a bigger delay in managing the "paperwork" behind such a massive restore: raising change requests, filling in fault reports, keeping the problem management people off your back, responding to questions from popular websites and spokespeople, and so on.
The biggest problem is this: once you've got an HA database that's in live use, how do you make sure any related changes don't break the HA bit without risking the live service? Managers are always much more prepared to take the risk that the system might fail "one day" than to say today, "go ahead, see if it fails on purpose".
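For illustration only, here's a toy sketch of what that "fail it on purpose" drill amounts to, run against a made-up primary/standby pair rather than anything 3 actually operates. The node names, account number and balances are all hypothetical; the point is simply that the failover path only counts as tested once you've killed the primary deliberately and checked nothing was lost.

```python
# Toy failover drill against a hypothetical primary/standby pair.
# Not 3's setup; real systems replicate via a streamed log, not a dict copy.
import random

class Node:
    """A hypothetical database node holding pre-pay top-up balances."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.balances = {}  # account -> credit in pence

    def apply(self, account, pence):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.balances[account] = self.balances.get(account, 0) + pence

def replicate(primary, standby):
    """Naive synchronous copy standing in for real replication."""
    standby.balances = dict(primary.balances)

def failover_drill(primary, standby):
    """Kill the primary on purpose and confirm the standby can take over."""
    replicate(primary, standby)
    before = dict(primary.balances)
    primary.alive = False            # the "see if it fails on purpose" step
    promoted = standby               # promote the standby
    promoted.apply("07700900123", 500)  # prove it still accepts new top-ups
    assert before.items() <= promoted.balances.items(), "data lost in failover"
    return promoted

if __name__ == "__main__":
    p, s = Node("primary"), Node("standby")
    for _ in range(5):
        p.apply("07700900123", random.choice([500, 1000, 2000]))
    new_primary = failover_drill(p, s)
    print(f"{new_primary.name} promoted with {len(new_primary.balances)} account(s) intact")
```

That assert is the bit nobody wants to run against production on a Tuesday afternoon, which is rather the commenter's point.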