RIM Releases Reason for Blackberry Outage

Posted by Zonk on Fri Apr 20, 2007 10:24 AM
from the isn't-testing-a-requirement dept.
An anonymous reader writes "According to BBC News, RIM has announced that the cause of this week's network failure for the Blackberry wireless e-mail device was an insufficiently tested software upgrade. Blackberry said in a statement that the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space. The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options."

Related Stories

[+] IT: RIM Offers BlackBerry Service Without the BlackBerry 80 comments
TheCybernator writes "RIM has announced that they're essentially planning to offer BlackBerry service ... without the BlackBerry. The company plans an app suite that will turn its push e-mail technology into a platform for Windows Mobile 6 devices. Less than a week after a network outage crippled BlackBerry users across North America, Research In Motion announced an application pack for Windows Mobile 6 devices that Canadian software developers said will intensify the competition for push e-mail. The firm has said that the BlackBerry Application suite will appear as an icon on the screen of the Mobile Windows device and load BlackBerry applications such as e-mail, phone, calendar, address book, tasks, memos, browser, and instant messaging. RIM said users will easily be able to toggle between the two platforms, one of which would have a BlackBerry-style interface."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • perhaps (Score:5, Interesting)

    by geekoid (135745) <dadinportland.yahoo@com> on Friday April 20 2007, @10:26AM (#18812141) Homepage Journal
    a routine that can take down the system is a tad more critical than you think?
  • I'd really hate to be the guy that signed off on the quality of this software update. And apparently they didn't adequately test their recovery system. Oh, well. I hope they learn from this and improve!
    • by Mr Pippin (659094) on Friday April 20 2007, @10:31AM (#18812203)
      More importantly, they apparently had no backout plan, or a very bad one.

      It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.

      If that's the case, you can be pretty sure upper management WON'T take the blame.
      • by spells (203251) on Friday April 20 2007, @10:38AM (#18812311)
        You can tell this is a geek site. Bad software rollout, first post wants to blame the QA manager, second wants to blame "Upper Management." How about a little blame for the devs?
        • How about a little blame for the devs?

          Blasphemer!

        • Clearly bugs originate with devs, the same way typos and spelling errors originate with authors. The occurrence of such errors is inevitable. The process as a whole is what is responsible for eliminating them. To the extent that the devs failed to contribute to that process then yes, they also deserve blame.
          • I couldn't agree more. Yes, the developers should be responsible for their errors, but still, they're only human. Even the best dev makes a serious mistake from time to time. That's why it's essential to have good coders and good QA folks and good management for any project, especially one as large as the Blackberry network. Sometimes redundancy is a good thing.
        • This is blasphemy! This is MADNESS!
        • by roman_mir (125474) on Friday April 20 2007, @10:53AM (#18812499) Homepage
          I am not sure if you are trying to be funny or insightful; probably you are aiming for a bit of both. However, while bugs in software are (inevitably) the developers' fault, the release of software with bugs into a production system is always management's fault. There must be a process in place to catch bugs before release for mission-critical systems (isn't this one of them?). There must be a process in place for a quick rollback for such systems. There must be some form of backup. How about running both the new and old systems in parallel for a while, with the ability to switch to the old one if the new one fails?

          Whatever it is, the production problems are due to bad process, which is what management is supposed to control. They are not even responsible for coming up with the technicalities of the process, they are responsible for making sure that there is a sufficient process (sufficient in terms that it is agreed by all parties, DEVs, QAs, BAs, client that it is good enough.) They are responsible to make sure that the process is followed.

          Over a year ago now in Toronto, ON, Canada, the Royal Bank of Canada had a similar problem, but of course with a bank it is much more dangerous: it is lots of money belonging to lots of people. Heads rolled at the management level only.
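The parallel-run idea in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_with_fallback`, `old_delivery` and `new_delivery` are hypothetical names invented here, and nothing in it reflects how RIM's infrastructure actually works.

```python
from typing import Callable, Tuple


def run_with_fallback(new_system: Callable[[str], str],
                      old_system: Callable[[str], str],
                      message: str) -> Tuple[str, str]:
    """Try the new system first; switch to the old one if the new one fails."""
    try:
        return ("new", new_system(message))
    except Exception:
        # Any failure in the new code path is answered by the old, known-good path.
        return ("old", old_system(message))


def old_delivery(message: str) -> str:
    # Hypothetical stand-in for the existing, proven routine.
    return f"delivered: {message}"


def new_delivery(message: str) -> str:
    # Hypothetical stand-in for the upgraded routine, with a deliberate bug
    # that only shows up on longer messages.
    if len(message) > 10:
        raise RuntimeError("cache error")
    return f"delivered (v2): {message}"
```

Calling `run_with_fallback(new_delivery, old_delivery, msg)` returns which system handled the message along with the result, so the share of traffic falling back to the old path can be watched before the old system is retired.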
        • by jimicus (737525) on Friday April 20 2007, @11:08AM (#18812713) Homepage
          How about a little blame for the devs?

          Because that's not how change should happen in large/business critical applications.

          What should happen is that the update is thoroughly tested, a change control request is raised and at the next change control meeting the change request is discussed.

          The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong). None of these should necessarily be vastly complicated - but if the risk analysis is "our entire network falls apart horribly" and the rollback plan is "er... we haven't got one. Suppose we'll have to go back to backups. We have tested those, haven't we?" then the change request should be denied.

          As much as anything else, this process forces the person who's going to be making the change to think about what they're going to be doing in a clear way and make sure they've got a plan B. It also serves as a means to notify the management that a change is going to be taking place, and that a risk is attached to it.

          And if a change is made but hasn't been approved through that process, then it's a disciplinary issue.

          Of course, it's entirely possible that such a process was in place and someone did put a change through without approval. In which case, I don't envy their next job interview.... "Why did you leave your last job?"
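The change control gate described in the comment above could look roughly like the sketch below. The `ChangeRequest` fields and the `review` function are hypothetical names chosen for illustration, and the "rollback must have been tested" rule is an assumption read into the comment's aside about untested backups, not a statement of any real company's policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ChangeRequest:
    summary: str
    benefit: str          # what's the benefit in making this change
    risk: str             # what could happen if it goes wrong
    rollback_plan: str    # what we do if it goes wrong
    rollback_tested: bool = False  # assumption: the plan has been rehearsed


def review(cr: ChangeRequest) -> Tuple[bool, str]:
    """Deny the request unless all three analyses exist and rollback was tested."""
    for name in ("benefit", "risk", "rollback_plan"):
        if not getattr(cr, name).strip():
            return (False, f"missing {name}")
    if not cr.rollback_tested:
        return (False, "rollback plan has not been tested")
    return (True, "approved")
```

A request with an empty `rollback_plan` is rejected with the reason "missing rollback_plan", which mirrors the point that a change with no plan B should not reach production.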
          • The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong).

            And what if there was? What if, gasp, this software upgrade had an "unexpected" impact? Risk analysis almost certainly would not have listed "worldwide operations will grind to a halt, cats and dogs start sleeping together, all the molecules in your person fly apart in exciting ways", and the

        • by mutterc (828335) on Friday April 20 2007, @01:12PM (#18814499)

          How many people here have checked in buggy code that neither management nor QA knew was buggy? (crickets)

          How many people here have been on projects where management shoved the code out the door despite major bugs that they knew about? (thunderous applause)

          How many people here have tried to get time on The Schedule to do something The Right Way, only to be told by management to do it half-assed, because that's all there's time/resources for? (applause, hooting)

          There you go.

          • Re: (Score:3, Insightful)

            I am a dev and my motto is "all software engineers are liars and idiots" and I include myself in this. If you want to know how something is supposed to work in theory, ask the dev. If you want to know the actual behavior, ask QA.
    • Re: (Score:3, Insightful)

      As a QA guy, I can't tell you how many times I've been told, on a Monday, "Do whatever is required to make sure this software is stable, as long as you release it on Friday."

      We're lucky we can get through a single pass of functionality testing; forget about load/stress/performance/long-term stability. We're lucky we have a test environment composed of hardware retired from production, because it was deemed insufficient to meet the needs of the production environment.

      True story: I was supposed to be testing
      • I complained to the VP of Engineering that our tests were blocked because I couldn't get the video bridge to come up on our lab network.

        Did you try setting CardboardEthernet0/0 to "100/full" instead of "auto/auto"? :^)

  • by Mockylock (1087585) on Friday April 20 2007, @10:28AM (#18812179) Homepage
    This is all just technical jargon for, "I tripped over the power cord. MY BAD."
  • by Anonymous Coward on Friday April 20 2007, @10:30AM (#18812191)
    This is obviously some new definition of the word "non-critical" with which I was previously unfamiliar.

    bkd
  • Buying time (Score:5, Funny)

    by faloi (738831) on Friday April 20 2007, @10:30AM (#18812197)
    The irony is that the SEC couldn't do any more investigating during the outage because they had no email access!
  • Their tubes were clogged and the plumber wasn't responding. Damn Canadian plumbers...
    • Their tubes were clogged and the plumber wasn't responding.

      That's probably because dispatch couldn't reach him on his Blackberry.

  • So, an outage affecting a core part of the business was caused by a 'non-critical' upgrade. Someone needs to redefine what non-critical actually is. As far as my experience goes (mostly in mission critical datacentres), most of the testing was actually done by the engineers installing and fixing on-the-fly. Engineers are more likely to look in the right places to find a bug, due to hands-on real life experience.
    • by Red Flayer (890720) on Friday April 20 2007, @10:48AM (#18812427) Journal

      Someone needs to redefine what non-critical actually is.
      A non-critical upgrade is one that isn't critical that it be performed.

      Increasing storage capacity (when current capacity not close to exhaustion)? Non-critical.

      Fixing the shut-down system that resulted from the upgrade? Critical.

      Watching the sales reps in my office apoplectically try to figure out how to get in touch with their clients? Priceless.
  • The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
    And this is relevant how? Do you expect the SEC to fine them for downtime?
    • Because the journo concerned had some space to fill, probably?
    • The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
      And this is relevant how? Do you expect the SEC to fine them for downtime?
      Both would likely have a negative impact on stock prices (and you can add ongoing patent troubles and competition from iPhone to the list as well).
  • Ah ha! (Score:5, Funny)

    by Grashnak (1003791) on Friday April 20 2007, @10:36AM (#18812273)
    So that is where the missing 5 million White House emails went! Sneaky Canadians!
    • So that is where the missing 5 million White House emails went! Sneaky Canadians!

      Damn! We thought these were emails for whitehouse.com, eh?

  • And let me tell you, I have no problem believing they have buggy software.
  • ...they just became famous as a lesson in what not to do

    all publicity is good publicity, right?

    as the other poster said:- boy I would hate to be their QA at this time.
  • by TheBishop613 (454798) on Friday April 20 2007, @10:53AM (#18812509)
    Am I the only one who thinks they actually survived this pretty well? I mean sure, the goal is to try to make sure that the system never goes down and is up 24/7, but sometimes shit happens in large systems. It seems to me that getting everything back to normal within 12 hours is pretty reasonable. Did they have an instant fix? Well no, of course not, but they got the system back to a working state relatively quickly and hopefully didn't lose data.


    Yeah, they've got areas to tighten up their QA and patch processes, but on the whole they got it all back up and running faster than most enterprises get their email functioning after a worm.

    • "BlackBerry goes down, it's headline news. Exchange goes down, it must be Friday"
    • RIM is not a regular company. They have specifically created a centralized system where the email for millions of people depends on the uptime of their two (?!?!) data centres. Delivering email is literally their business and uptime is a critical part of that. IMHO, a half an hour of system wide downtime is pushing RIM's luck.

      Several hours of email downtime is "OKish" if you are talking about a medium sized company that only has a handful of servers and a few IT guys. This is not the same at all.

      Prior to this, I never realized that the RIM system was THIS centralized. It's kind of concerning really. And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).

  • by toupsie (88295) on Friday April 20 2007, @10:54AM (#18812517) Homepage
    Mistakes in QA do happen and everyone can do more testing but RIM's biggest failure during the outage was not their QA but their PR. How many BES Admins wasted an hour or two trying to figure out why their servers were not delivering properly to their users' handhelds? If there was a statement on their website or a message on their support line, a lot of wasted time would have been averted. If it were not for a few of the independent blackberry forums, I would not have known there was a nationwide outage during my troubleshooting.
    • Yeah... they should have just sent out an email to all the BlackBerries saying email would be disrupted for a while....
  • Pop quiz! (Score:3, Insightful)

    by 8127972 (73495) on Friday April 20 2007, @10:55AM (#18812523)
    Which is worse:

    A) The fact one piece of software took down their environment.
    B) Their failover plan didn't work.
    C) All of the above.
    D) None of the above.

    Personally, I vote for "B". Face it, s**t happens. But when you plan for s**t happening and the plan doesn't work, that's a VERY bad thing.
  • It's an imperfect world. Now, show Dick some respect!

  • by Fritz T. Coyote (1087965) on Friday April 20 2007, @11:05AM (#18812659) Homepage
    I love the (Friday) morning quarterbacks who will now proceed to beat up RIM for a system outage after a 'non critical' upgrade.

    And a bunch of suits will want the heads of the technicians responsible.

    I feel for them, I really do.

    A few years ago I put in a minor maintenance change that made headlines for my employer.

    This is a natural result of the budgetary constraints we have to live with in the real world. Testing and certification is expensive, and the more complex the environment, the more expensive it gets. It is difficult to justify a full blown certification test for minor, routine maintenance, unless you are talking about health and safety systems. So a worst-case event occurred, RIM suffers some corporate embarrassment, some low-level techs will get yelled at, and possibly fired, and a bunch of people had to suffer crackberry withdrawal.

    Nobody died. No planes crashed. No reactors melted down.

    RIM will work up some new and improved testing standards, and tighten the screws on system maintenance so much that productivity will suffer, they may even spend a bunch of money on the equipment needed to do full-production-parallel certification testing. And then in a year or so cut the budget to upgrade the certification environment as 'needless expense', and come up with work-arounds to reduce the time it takes to get trivial changes and bugfixes rolled out.

    I wish them luck. Especially to the poor sods who did the implementation.

    At least when I did my 'headline-making-minor-maintenance' it only made the local papers for a couple of days.

    • Re: (Score:3, Insightful)

      Nobody died. No planes crashed. No reactors melted down.

      You are safe on the planes crashing and on the meltdowns. I didn't hear of any such incidents.

      However, I will argue that the outage may have contributed to deaths. There are many hospitals which use Blackberries instead of pagers (2-way comms), so paging a surgeon or doctor or other staff to an emergency may not work well. I am sure there are other examples of critical applications (which should or should not use blackberries) that may have been affected. The obvious thing is that I cannot provide stats,

  • by markana (152984) on Friday April 20 2007, @11:08AM (#18812697)
    >...the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space.
        :
    >The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.

    Hmmm... so when they wiped the incriminating e-mails from the system (which would certainly create more space), they took the rest of the system down (which prevented anyone else from grabbing copies).

    I'm reading WAY too many conspiracy novels these days :-)

    (Not that I think this actually happened - but it makes for a great plotline).
  • More details (Score:4, Informative)

    by kbahey (102895) on Friday April 20 2007, @11:11AM (#18812735) Homepage
    I live in Waterloo, and have friends and acquaintances who work at RIM. Talking to one of them who got called that night, he says that it started with a vendor issue, and then RIM's software did not react well to that issue.

    Of course he would not elaborate more on what it is.

    This Computer World article [computerworld.com] has more detail.

    The outage lasted about 12 hours overnight Tuesday for BlackBerry users mainly in North America, RIM and users reported.

    RIM said a fail-over system designed to stop the impact of such a problem did not work as expected, either. The company apologized to its 8 million users. RIM added that security and capacity issues were not the cause of the outage.

    "RIM has determined that the incident was triggered by the introduction of a new, noncritical system routine that was designed to provide better optimization of the system's cache," RIM officials said in a statement.

    "The system routine was expected to be nonimpacting with respect to the real-time operation of the BlackBerry infrastructure, but the pretesting of the system routine proved to be insufficient," the statement said.

    The new system routine "produced an unexpected impact and triggered a compounding series of interaction errors between the system's operational database and cache," according to the statement. "After isolating the resulting database problem and unsuccessfully attempting to correct it, RIM began its fail-over process to a backup system."

    RIM described the backup system inadequacies this way: "Although the backup system and fail-over process had been repeatedly and successfully tested previously, the fail-over process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue."


    • it started with a vendor issue, and then RIM's software did not react well to that issue.

      Given the nature of the technology I find the explanation of a 'fail-over' system failing to kick in a tad disingenuous. It's not like a generator kicking in when the mains electricity stops. And what kind of design decisions led to an upgrade triggering outages for all of North America?

      I would have thought they had multiple nodes at multiple locations with no single point of failure. Or at least three redund
  • "insufficiently tested software upgrade" => "untested software upgrade" => "some superstar at RIM changed the CRASH_NETWORK constant from 0 to 1."
  • Their use of the term "non critical" is most likely referring to the nature of the patch. It was an "optional" patch that did not fix any "critical vulnerabilities" or anything like that.

    It is quite obvious they were not referring to the criticalness of the system which was affected.
  • Whatever happened to no single point of failure? And since when do you update a live system? Has no one learned anything in the past decade?

    Reminds me of when a mobile phone company upgraded over the weekend and everyone discovered you could make long distance phone calls for free.
  • ...somebody forgot the ~ in rm -rf ~/

    Adding storage space to a single system shouldn't be a problem, since you take your system down for that anyway (or put it in spare mode or so) even if it's a hotplug-always-on-superfast-resizing-raid-with-automatic-failover-and-d2d2t2brain system. That it takes the whole network down, is a problem.
  • by Ralph Spoilsport (673134) on Friday April 20 2007, @12:25PM (#18813771) Journal
    If the product had been properly tested (and face it - outside of medical and military applications, how much of ANYTHING is properly tested?) they'd have found, reported, and fixed the bug weeks earlier.

    You can't expect programmers to do perfect work, even with unit testing and all the other basic amenities of software development. It requires QA, and that is something sorely lacking in contemporary software products. From the smallest OSX widget to MS Vista, testing matters.

    RS