System upgrade to blame for BlackBerry outage
Haven't we heard that before?
A three-hour outage that left many BlackBerry users unable to send and receive email was the result of an upgrade gone wrong.
The upgrade to Research in Motion's internal data-routing system was designed to expand capacity. Instead, it disrupted email service for many subscribers in North America. The episode, which began around 3:30 pm Eastern Standard Time, caused severe withdrawal symptoms among the more hardened CrackBerry addicts, who have grown used to checking their email just about anywhere, frequently during meetings and meals.
It was the second major outage in less than a year. In April a separate upgrade disrupted service in Canada and the US after engineers tried to make improvements to network caching systems. With a growing base of subscribers that recently reached 12 million, RIM's frequent upgrades are understandable, but hey, haven't these guys heard of dry runs? The company says it made similar changes before without suffering the same types of outages.
The outages are at least in part the result of RIM's centralized network, which creates a single point of failure by funneling all email from North America through a network operations center. According to the Associated Press, that's where Monday's problem appears to have occurred.
Meanwhile, RIM is conducting a full analysis of the most recent outage and says it could take "several days or longer" for it to be completed.
"No messages were lost and the system continues to operate normally today," RIM says. ®
Now I know why they're called 'crackberries'
It's not because of the users' crack addiction, it's because that's what the network admins are smoking.
I'm sorry, but NO organization in its right mind that targets business users would EVER think that 3:30pm is the right time to roll out a code upgrade. That's primo time between when all the execs are back from lunch, and just as all the lazy ones are checking to see whether they can skip off home early.
Midnight or later is the only time this kind of thing should happen. That's when the fewest users will be affected by any problems that do occur.
And if your network admins cry about it, get new ones. As much as you might like them, they are not the ones that pay the bills.
RIM is a bunch of gormless gits
Blizzard has somewhere around 10 million subscribers for the World of Warcraft online game, and they can manage to do their regular maintenance starting at 3 AM Pacific Coast time.
I'm reasonably sure RIM has more than 10 million Crackberry subscribers. Either they have severely underpriced their services, or they can afford to start (and complete) system downtime during the wee hours (and perform a test run on a hot spare server, in case there's still some nasty surprise lurking in the patches come "upgrade" day).
Yes, it's "just" email, and as a sysadmin I think (actually, I *know*) that it's not a life-threatening issue. However, RIM appears to have knowingly and deliberately ignored even the most basic principles of good network management.
Upgrades during customer peak time - bull-by-product I say
Upgrade during peak NA time - I think not.
QA - how are they spelling that?
To retrospectively come out with the same bull-spin as the last `upgrade`, and to say it a few days later, when telcos would have been notified of said upgrade a week or two in advance (the same telcos who didn't know shit about it): well, I call that utter contempt for customers' intellect.
If you were hacked then say it, don't spin the bottle.
But given that all the US government emails pass through their systems with about as much isolation as a pea in a pod has from the other peas, when the government thinks it's a separate system, well, you can see why they're saying this.
Is RIM capable of an upgrade that has been QA'd or even properly tested without downgrading everybody? You really have to wonder, especially when expansion upgrades have the complete opposite effect. Again, you wonder how robust their internal infrastructure actually is.
This also again highlights that their idea of DR is completely different from the rest of the IT industry's.
I do wonder if they know what QA actually is beyond lip service, as from what I've seen they haven't got a ruddy clue at all. Seriously, is QA something you just tell your staff that you have, stored next to the golden monkey in the MD's office where nobody can touch it as it's that special? I would say get a clue, but first they need to find a large enough stick to induce said clue that won't break on the first whack.
Seriously, RIM is a stock ticker, not what you do to the intelligence of your customer base: a "system upgrade" after the last utter SNAFU, and one that hits at peak time.
But at least they never lost any emails for their customers, only the whole reason for having their service in the first place: instant email. But that's why RIM handsets allow you to make phone calls, so in essence the whole outage can be blamed upon customers :). But don't let me give RIM ideas for their next outage, which statistically will probably happen before the end of their financial year.
My recommendations to RIM are:
1) WIKI Quality Assurance and actually read it with intent.
2) Roll back any staff changes you have made, as clearly things are going bad.
3) Organise a 3rd-party QA team consisting of members of your high-profile customers, like T-Mobile say.
4) Stop cluster fudging your infrastructure and do distributed upgrade cells that don't kludge up all the other eggs in the same box/location.
5) Have a real DR setup, as clearly you don't have dick in place currently, unless 3 hours is your idea of a fail-over.
6) Sell lots of your shares, as it's had its day as far as industry analysts go with the current direction.
7) Cash in and market IP to 3rd parties more aggressively and packaged.
8) Pay your old-school handset designers more and listen to them more instead of lip servicing them.
9) Buy your root cause analysts a lot of drinks, as they sure as hell are gonna need 'em after the fallout from this.
10) Don't sell out to Microsoft.
11) Focus on the product and not the share price; it worked for you in the past and you did good stuff back then. In past years your products and service have been utterly handbag-marketed. Your product is business-biased, not end-consumer-biased, so stop allowing one to impact the other's infrastructure; it's a security flaw that your security team will never stand a hope in hell of addressing until that happens.
12) If you can't tell your staff the truth, then you have no chance of customers believing a word you say.
Bottom line RIM - get back to your roots or that's all you will have to eat. Another outage like this, or even using that excuse ever again, will be a death blow with regard to industry analysts, and your share price will fail badly then.