RIM Releases Reason for Blackberry Outage
An anonymous reader writes "According to BBC News, RIM has announced that the cause of this week's network failure for the Blackberry wireless e-mail device was an insufficiently tested software upgrade. Blackberry said in a statement that the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space. The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options."
perhaps (Score:5, Interesting)
Re: (Score:1)
I'd hate to be their QA manager right now! (Score:2)
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.
If that's the case, you can be pretty sure upper management WON'T take the blame.
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
Blasphemer!
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
Re:I'd hate to be their QA manager right now! (Score:4, Insightful)
Re: (Score:2)
Re: (Score:3, Funny)
~kicks guy into a bottomless pit~
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
Whatever it is, the production problems are due to bad process, which is what management is supposed to control. They are not even responsible for coming up with the technicalities of the process; they are responsible for making sure that there is a sufficient process (sufficient in the sense that all parties, devs, QA, BAs, and the client, agree it is good enough), and for making sure that the process is followed.
Over a year ago now in Toronto, ON, Canada, the Royal Bank of Canada had a similar problem, but of course with a bank it is much more dangerous: it is a lot of money belonging to a lot of people. Heads rolled at the management level only.
Re:I'd hate to be their QA manager right now! (Score:5, Insightful)
Because that's not how change should happen in large/business critical applications.
What should happen is that the update is thoroughly tested, a change control request is raised and at the next change control meeting the change request is discussed.
The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong). None of these should necessarily be vastly complicated - but if the risk analysis is "our entire network falls apart horribly" and the rollback plan is "er... we haven't got one. Suppose we'll have to go back to backups. We have tested those, haven't we?" then the change request should be denied.
As much as anything else, this process forces the person who's going to be making the change to think about what they're going to be doing in a clear way and make sure they've got a plan B. It also serves as a means to notify the management that a change is going to be taking place, and that a risk is attached to it.
And if a change is made but hasn't been approved through that process, then it's a disciplinary issue.
Of course, it's entirely possible that such a process was in place and someone did put a change through without approval. In which case, I don't envy their next job interview.... "Why did you leave your last job?"
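To make the shape of that gate concrete, here is a minimal sketch of a change-control check in Python. Everything in it (the field names, the deny rules, the example request) is hypothetical and only illustrates the idea that a request with no tested rollback plan never gets approved:

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    description: str
    benefit: str           # what's the benefit in making this change
    risk: str              # what could happen if it goes wrong
    rollback_plan: str     # what we do if it goes wrong ("" = none)
    rollback_tested: bool  # has the rollback actually been rehearsed?

def review(cr: ChangeRequest) -> bool:
    """Change-control meeting in miniature: deny anything without a
    credible, tested way back out."""
    if not cr.rollback_plan:
        print(f"DENIED: {cr.description!r} has no rollback plan")
        return False
    if not cr.rollback_tested:
        print(f"DENIED: {cr.description!r} rollback has never been tested")
        return False
    print(f"APPROVED: {cr.description!r}")
    return True

review(ChangeRequest(
    description="new, non-critical routine to increase e-mail holding space",
    benefit="more message cache headroom",
    risk="our entire network falls apart horribly",
    rollback_plan="",       # "er... we haven't got one"
    rollback_tested=False,
))
```

The point is less the code than the forcing function: writing the rollback plan down, and having someone else refuse the change when that field is empty.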
Re: (Score:2)
And what if there was? What if, gasp, this software upgrade had an "unexpected" impact? Risk analysis almost certainly would not have listed "worldwide operations will grind to a halt, cats and dogs start sleeping together, all the molecules in your person fly apart in exciting ways", and the
Re: (Score:3, Insightful)
Re:I'd hate to be their QA manager right now! (Score:4, Insightful)
How many people here have checked in buggy code that neither management nor QA knew was buggy? (crickets)
How many people here have been on projects where management shoved the code out the door despite major bugs that they knew about? (thunderous applause)
How many people here have tried to get time on The Schedule to do something The Right Way, only to be told by management to do it half-assed, because that's all there's time/resources for? (applause, hooting)
There you go.
Re: (Score:1, Funny)
Re: (Score:1)
Re: (Score:2)
More importantly, they apparently had no or a very bad backout plan.
It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.
If that's the case, you can be pretty sure upper management WON'T take the blame.
I don't know what shops you've worked in, but the devs in most places I've worked never have a backout plan unless management forces them to -- the prevailing attitude is that the software is tested, so what could possibly go wrong?
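For what a "quick backout" can amount to in practice, here is a rough sketch; the paths and the smoke-test command are made up, not anything RIM actually runs. The idea is simply to snapshot the current release before rolling the new one out, and restore it automatically if the post-deploy check fails:

```python
import shutil
import subprocess
import sys

# All paths and commands below are invented stand-ins for illustration.
CURRENT = "/opt/app/current"
PREVIOUS = "/opt/app/previous"
CANDIDATE = "/opt/app/release-candidate"

def healthy() -> bool:
    """Post-deploy smoke test; in real life an end-to-end message round trip."""
    return subprocess.run(["/opt/app/bin/smoke-test"]).returncode == 0

def deploy_with_backout() -> int:
    shutil.rmtree(PREVIOUS, ignore_errors=True)
    shutil.copytree(CURRENT, PREVIOUS)     # step 1: keep a way back
    shutil.rmtree(CURRENT)
    shutil.copytree(CANDIDATE, CURRENT)    # step 2: roll the change out
    if healthy():
        return 0
    # step 3: the part nobody wants to pay for -- put it back, fast
    shutil.rmtree(CURRENT)
    shutil.copytree(PREVIOUS, CURRENT)
    print("deploy failed health check; previous version restored", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(deploy_with_backout())
```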
A QA manager has any say on how much testing? (Score:1, Funny)
Re: (Score:3, Insightful)
We're lucky we can get through a single pass of functionality testing; forget about load/stress/performance/long-term stability. We're lucky we have a test environment composed of hardware retired from production, because it was deemed insufficient to meet the needs of the production environment.
True story: I was supposed to be testing
Re: (Score:3, Funny)
Did you try setting CardboardEthernet0/0 to "100/full" instead of "auto/auto"?
What really happened... (Score:5, Funny)
Re: (Score:1)
I think it's more along the lines of "Unplugged the coffee maker; please feel free to restart the server now."
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
Non-critical? (Score:5, Funny)
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Buying time (Score:5, Funny)
Short answer (Score:2)
Re: (Score:2)
That's probably because dispatch couldn't reach him on his Blackberry.
testing departments (Score:2)
Re:testing departments (Score:5, Informative)
Increasing storage capacity (when current capacity not close to exhaustion)? Non-critical.
Fixing the shut-down system that resulted from the upgrade? Critical.
Watching the sales reps in my office apoplectically try to figure out how to get in touch with their clients? Priceless.
Relevance? (Score:2)
Re: (Score:2)
Financial Relevance (Score:2)
Re: (Score:2)
Conspiracy theories. I think I'm anti-business and cynical enough to see it:
RIM sending a message to the SEC: "Enough of the government and business is dependent on us that, if you take us down, you both make a big hit to the economy, and piss off your own bosses, who probably use our product."
Ah ha! (Score:5, Funny)
Re: (Score:2)
Damn! We thought these were emails for whitehouse.com, eh?
We have blackberries and Bes (Score:2)
on the plus side... (Score:2)
all publicity is good publicity, right?
As the other poster said: boy, I would hate to be their QA at this time.
Is this really so bad? (Score:4, Insightful)
Yeah, they've got areas where they need to tighten up their QA and patch processes, but on the whole they got it all back up and running faster than most enterprises get their email functioning after a worm.
Re: (Score:3, Funny)
Yes it is. They've put themselves in a critical... (Score:5, Insightful)
Several hours of email downtime is "OKish" if you are talking about a medium sized company that only has a handful of servers and a few IT guys. This is not the same at all.
Prior to this, I never realized that the RIM system was THIS centralized. It's kind of concerning really. And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).
Re:Yes it is. They've put themselves in a critical (Score:2)
Re:Yes it is. They've put themselves in a critical (Score:2)
Your information isn't quite right here. RIM has more than two data centers in more than two locations in more than two continents.
Governments tend to be (justifiably) paranoid customers. I'm sure it's safe to assume that
Re: (Score:2)
Re: (Score:2)
E.g. http://news.zdnet.com/2100-1035_22-6177829.html [zdnet.com]
Re:Yes it is. They've put themselves in a critical (Score:2)
Re: (Score:2)
I can understand smaller countries having to accept that as a part of life, but let's face it, America has a few bucks to toss around. I'm quite surprised that the government hasn't forced RIM to put a NOC on American ground.
Re: (Score:1)
Re: (Score:1)
Name me one piece of software that is as complex as this which has no bugs in it.
T-Mobile recently did an upgrade which took many months. There were bugs in this system too, but they worked in quite the opposite direction. Did you hear about those?
RIM's biggest failure (Score:5, Interesting)
Re: (Score:3, Funny)
Pop quiz! (Score:3, Insightful)
A) The fact one piece of software took down their environment.
B) Their failover plan didn't work.
C) All of the above.
D) None of the above.
Personally, I vote for "B". Face it, s**t happens. But when you plan for s**t happening and the plan doesn't work, that's a VERY bad thing.
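Option B is arguably the real lesson: a failover plan only counts if it is exercised. A toy sketch of a scheduled drill, with purely invented node names, might look like this:

```python
import datetime

# Toy model of a primary/standby pair; node names are invented.
class Node:
    def __init__(self, name: str, works: bool = True):
        self.name = name
        self.works = works

    def serve(self) -> bool:
        return self.works

def failover_drill(primary: Node, standby: Node) -> None:
    """A failover plan you never exercise is a failover plan you don't have."""
    print(f"{datetime.date.today()}: draining {primary.name}, promoting {standby.name}")
    if standby.serve():
        print("drill passed: standby carried the load")
    else:
        print("drill FAILED: fix the standby now, while the primary is still fine")

failover_drill(Node("relay-primary"), Node("relay-standby", works=False))
```

Run something like this on a schedule and the broken standby is found during a drill, not during the outage.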
Re: (Score:1)
what failover plan .. (Score:2)
What failover plan? And that's assuming what they said really happened.
Screws fall out (Score:2)
It's an imperfect world. Now, show Dick some respect!
Testing of Complex Systems (Score:4, Insightful)
And a bunch of suits will want the heads of the technicians responsible.
I feel for them, I really do.
A few years ago I put in a minor maintenance change that made headlines for my employer.
This is a natural result of the budgetary constraints we have to live with in the real world. Testing and certification is expensive, and the more complex the environment, the more expensive it gets. It is difficult to justify a full blown certification test for minor, routine maintenance, unless you are talking about health and safety systems. So a worst-case event occurred, RIM suffers some corporate embarrassment, some low-level techs will get yelled at, and possibly fired, and a bunch of people had to suffer crackberry withdrawal.
Nobody died. No planes crashed. No reactors melted down.
RIM will work up some new and improved testing standards, and tighten the screws on system maintenance so much that productivity will suffer; they may even spend a bunch of money on the equipment needed to do full-production-parallel certification testing. And then in a year or so they will cut the budget to upgrade the certification environment as 'needless expense', and come up with work-arounds to reduce the time it takes to get trivial changes and bugfixes rolled out.
I wish them luck. Especially to the poor sods who did the implementation.
At least when I did my 'headline-making-minor-maintenance' it only made the local papers for a couple of days.
Re: (Score:3, Insightful)
Nobody died. No planes crashed. No reactors melted down.
You are safe on the planes crashing and on the meltdowns. I didn't hear of any such incidents.
However, I will argue that the outage may have contributed to deaths. There are many hospitals which use Blackberries instead of pagers (2-way comms), so paging a surgeon or doctor or other staff to an emergency may not work well. I am sure there are other examples of critical applications (which should or should not use Blackberries) that may have been affected. The obvious thing is that I cannot provide stats,
Re: (Score:1)
You can't be serious. I mean, come on. They (hospitals, surgeons, doctors, other staff) don't have phones, or public address systems? That sounds like malpractice suits waiting to happen.
"yes, your honor, we called and called and called our doctor to schedule an appointment to have
the REAL reason.... (Score:4, Funny)
>The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
Hmmm... so when they wiped the incriminating e-mails from the system (which would certainly create more space), they took the rest of the system down (which prevented anyone else from grabbing copies).
I'm reading WAY too many conspiracy novels these days
(Not that I think this actually happened - but it makes for a great plotline).
More details (Score:4, Informative)
Of course he would not elaborate more on what it is.
This Computer World article [computerworld.com] has more detail.
I don't believe it .. (Score:2)
Given the nature of the technology I find the explanation of a 'fail-over' system failing to kick in a tad disingenuous. It's not like a generator kicking in when the mains electricity stops. And what kind of design decisions led to an upgrade triggering outages for all of North America?
I would have thought they had multiple nodes at multiple locations with no single point of failure. Or at least three redund
Re: (Score:2)
Paid for how? Increased service rates? We see things like this all the time. When it's cheque signing time, they talk up and down about how they understand that they're leaving redundancy or uptime on the t
PR to IT translation results (Score:3, Funny)
Routine? (Score:1)
"In other news, the wikipedia.org web site screeched to a halt as /. readers rushed to lookup the meaning of the term 'routine' applied in the context of software systems. The RIM public relations department could not be reached for a clarification as to why such an anachronism was used in their announcement."
Chandler: "Quick, we must telegraph President Coolidge!"
Non-critical (Score:2)
It is quite obvious they were not referring to the criticality of the system which was affected.
It's not as simple as a defective patch (Score:1)
Re: (Score:1)
Foolish, foolish VMware (Score:1)
Ship Dates (Score:1)
what ever happened to no single point of failure . (Score:2)
Reminds me of when a mobile phone company did an upgrade over the weekend and everyone discovered you could make long-distance phone calls for free.
Re:what ever happened to no single point of failur (Score:1)
Re: (Score:2)
This isn't even hard (unless, of course, you really have learned nothing in the past 10 years and don't have a fully redundant production system hot at all times).
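A sketch of what "no single point of failure" can mean from the sending side (the relay hostnames and the send_via stub are invented for illustration): try each relay in turn, and only queue locally when every one of them is down.

```python
# Illustrative only: the relay hostnames and send_via stub are invented.
NOCS = ["noc-1.example.net", "noc-2.example.net", "noc-3.example.net"]

def send_via(noc: str, message: str) -> bool:
    """Stand-in for the real relay protocol; pretend the first NOC is down."""
    return noc != "noc-1.example.net"

def send(message: str) -> bool:
    for noc in NOCS:
        if send_via(noc, message):
            print(f"delivered via {noc}")
            return True
        print(f"{noc} unreachable, failing over to the next relay")
    print("all relays down: queue locally and retry later")
    return False

send("ping")
```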
Sounds more like... (Score:2)
Adding storage space to a single system shouldn't be a problem, since you take your system down for that anyway (or put it in spare mode or so) even if it's a hotplug-always-on-superfast-resizing-raid-with-au
RIM won't be a fun place to work in anymore (Score:1)
The public is ignorant as to what causes IT problems - even if RIM upgrade their QA process to "better than normal" no one will forgive them if lightning strikes twice. Thus RIM are likely to bring in extraordinarily restrictive processe
Been there - it was my first maintenance callout (Score:1)
Turned out one of our contract software guys had made a simple change to the file retention period - so trivial, he said, there was no need to test it. He was rath
living proof that QA matters... (Score:3, Insightful)
You can't expect programmers to do perfect work, even with unit testing and all the other basic amenities of software development. It requires QA, and that is something sorely lacking in contemporary software products. From the smallest OS X widget to MS Vista, testing matters.
RS
The irony is killing me (Score:2)
http://www.stpcon.com/ [stpcon.com]
They probably missed the early bird discount, though.
My favorite quote: "The cost of software failures is high -- and in today's increasingly litigious and regulated business environment, they're higher than ever. Security flaws, usability problems, functional defects, performance issues, all carry a tremendous price tag."
This is a match made in heaven.
P.S.
Non-
Function: prefix
2 : of little or no consequence : unimportant : worthless <nonissues>
Hooray for outages! (Score:1)
I highly doubt they will ever say who is officially to blame, but most likely it was a combination of pressure from 'above' for the developers to complete the upgrade by xxxxx, pressure for the roll-out team to implement & verify the upgrade globally with absolutely no downtime, the lack of time to test the application for every possible bug or 'feature' that may arise (including going through the code step-by-step to make sure no weird situations or invalid data input/output could occur), and the sheer complexity of the system.