
Certificate Expiry Leads to Total Outage For Microsoft Azure Secured Storage

Posted by timothy
from the keeping-the-lights-on dept.
rtfa-troll writes "There has been a worldwide (all locations) total outage of storage in Microsoft's Azure cloud. Apparently, 'Microsoft unwittingly let an online security certificate expire Friday, triggering a worldwide outage in an online service that stores data for a wide range of business customers,' according to the San Francisco Chronicle (also Yahoo and the Register). Perhaps too much time has been spent sucking up to storage vendors and not enough looking after the customers? This comes directly after a week-long outage of one of Microsoft's SQL server components in Azure. This is not the first time that we have discussed major outages on Azure and probably won't be the last. It's certainly also not the first time we have discussed Microsoft cloud systems making users' data unavailable."

  • Lolwut? (Score:4, Funny)

    by Anonymous Coward on Saturday February 23, 2013 @10:21AM (#42988913)

    What's an expirty?

  • Expirty? (Score:1, Insightful)

    by Anonymous Coward

    Timothy!! It's your fucking JOB!

  • Somebody

    by Anonymous Coward

    Had better get fired. I normally don't condone firing over mistakes, but this is pretty huge.

    Although, it's also a point of proof of the cloud's inability to be reliable if not set up right.

    • It seems to be a point of poof...
    • Re:Somebody (Score:5, Insightful)

      by Glendale2x (210533) <slashdot@ninjam[ ]ey.us ['onk' in gap]> on Saturday February 23, 2013 @11:23AM (#42989225) Homepage

      Eh, don't put anything too important that you can't live without on systems outside of your control.

      • Re:Somebody (Score:5, Interesting)

        by Nerdfest (867930) on Saturday February 23, 2013 @12:35PM (#42989635)

        On the other hand, I've worked at places where the worst thing you could do is leave things that the company can't live without *in* the control of the company. Sometimes certain areas of expertise require specializations that the company just doesn't have and isn't interested in acquiring. Of course handing the responsibility of those things off to *Microsoft* is not necessarily any better.

        • by Kalriath (849904)

          Yeah, but who is? AWS has more outages than I care to remember, Rackspace has had its share of outages, Google goes down like once a month, even Apple can't keep a service up - and that's pretty much all the big players counted out.

  • Typical. (Score:5, Funny)

    by berchca (414155) on Saturday February 23, 2013 @10:28AM (#42988963) Homepage

    Not the first time they've made such blunders:
    http://slashdot.org/story/03/11/06/1540257/microsoft-forgets-to-renew-hotmailcouk

    If only Redmond had some sort of calendar system to help them remember this stuff.

    • Re:Typical. (Score:5, Funny)

      by Stormthirst (66538) on Saturday February 23, 2013 @10:35AM (#42988983)

      Does MS not have a credit card its vendor can keep on file?

      • Re:Typical. (Score:5, Interesting)

        by Charliemopps (1157495) on Saturday February 23, 2013 @11:38AM (#42989277)

        You'd think that, but there's contract stuff. The thing is, you basically need a department in charge of renewing shit like this when you have enterprise level services. We've got a site with millions of hits daily and still manage to let it expire every couple of years. You try the credit card thing, but credit cards expire. You try recurring billing and then you get into a contractual nightmare with the registrar. The registrar isn't going to do you any favors, you might get millions of hits daily, but they still only get $5/year even from google.com so fuck you, figure out the billing yourself.

        The only real way to do it effectively is to build yourself a database of all the crap you need to renew regularly, then hire someone to renew that stuff. But who are you going to hire? It usually ends up being some assistant that doesn't know a damned thing about tech... and it's still going to cost you $60k a year in pay and benefits to retain them. That's an expensive way of keeping track of such things... ah, the website admins can remember, right?

        • by drinkypoo (153816)

          It seems like a competent registrar would send a bill[ing statement] to the billing contact.

        • by Kalriath (849904)

          Except that companies like Microsoft and Google register domains through "Enterprise" registrars like MarkMonitor, who charge upwards of a few hundred (possibly even thousand) dollars per year for their service - which supposedly includes "not letting the fucking things expire" and "making sure other people don't register our damn marks".

          Microsoft actually has even less excuse in this instance, believe it or not - Microsoft's certificate vendor is itself. All MS certificates are chained up to a Microsoft s

    • Re:Typical. (Score:5, Interesting)

      by hsmith (818216) on Saturday February 23, 2013 @10:39AM (#42988999)
      It is almost a year ago to the day Azure was down for a day because no one accounted for leap year for validating certificates, lol. AWS seems to have issues too, but they don't seem to revolve around blatant stupidity and result in an entire day of downtime.
      • Re:Typical. (Score:5, Insightful)

        by rtb61 (674572) on Saturday February 23, 2013 @11:20AM (#42989179) Homepage

        M$ has a history of lacking customer focus, hence it will fail at any industry that demands the highest levels of customer focus. For cloud services to be down for a day is inexcusable, and seriously, any IT management staff that fails to acknowledge these failures and uses or recommends Azure should be fired. Any downtime should be measured in minutes, not days; this should be considered a catastrophic failure. M$ is far too used to its EULAs, a warranty without a warranty, and has become woefully complacent about actually guaranteeing a supply of service. "Meh, it mostly works" is their motto, and "we'll fix it next time round, for sure this time."

  • Tip of the iceberg (Score:5, Insightful)

    by gmuslera (3436) on Saturday February 23, 2013 @10:38AM (#42988997) Homepage Journal
    If you can't trust Microsoft with this kind of small but essential thing, should you trust them with bigger ones?
    • by pr0nbot (313417)
      For me the confusing thing is that there was a single point of failure. I thought that much of what the cloud was about was resilience; I would expect that someone designing cloud infrastructure would have done an analysis of failure points, and implemented failover mechanisms (or at least monitoring and recovery procedures). Ok, maybe not a cloud-startup-du-jour, but certainly a big enterprise-style entity like Microsoft.
      • by Junta (36770) on Saturday February 23, 2013 @12:10PM (#42989499)

        The reality is, if you outsource your hosting to a single company, there will always be single points of failure.

        There will be architectural ones, like root of trust expiring resulting in security framework taking everything down.

        There will be bugs that can bite all of their instances in the same way at the same time.

        There will be business realities like failing to pay electric bills, or collapsing, or simply closing down their hosting business for the sake of other business interests.

        Ideally:
        -You must keep ownership of all data required to set up anywhere at all times. Even if you host nothing publicly yourself, you must ensure all your data exists on storage that you own.
        -You either do not outsource your hosting (in which case your single point of failure business-wise would take you out anyway) or else you outsource to financially independent companies. "Everything to EC2" is a huge mistake, just as much as "everything to Azure" is a huge mistake.
        -Never trust a provider's security promises beyond what they explicitly accept liability for. If you consider the potential risk to be "priceless", then you cannot host it. If you do know what your exposure is (e.g. you could be sued for 20 million), then only host it if the provider will assume liability to the tune of 20 million.

      • by dbIII (701233)
        I think there are multiple single points of failure, such as the leap year problem that caused an entire day of downtime last year.
  • by crt (44106) on Saturday February 23, 2013 @10:56AM (#42989069)

    The really amazing thing is that if you look at their service dashboard, it took them 12 hours to update the certificates on their site:
    http://www.windowsazure.com/en-us/support/service-dashboard/ [windowsazure.com]

    They spent several hours doing "test deployments" ... while it's great to make sure you aren't going to make something worse, updating an SSL cert isn't exactly rocket science. I'd hate to see how long it took to recover from a more serious service issue triggered by a software bug.

    • Maybe they tried rolling back to an older version of the cert first.

      (Yes, that was sarcasm.)

      • by sribe (304414)

        Maybe they tried rolling back to an older version of the cert first.

        No, first they would have tried reinstalling the current cert. Three times. Only then would they have moved on to rolling back to the prior version.

      • by gweihir (88907)

        Maybe they tried rolling back to an older version of the cert first.

        (Yes, that was sarcasm.)

        You know, from their track record, I would consider this entirely possible....

      • by rjr162 (69736)

        Pretty sure they tried rebooting first to solve the problem, which caused Windows System Repair to start on boot-up. System Repair ran the whole time (since there's a grayed-out cancel button you can't click), after which it reported that System Repair was unable to repair the system.

    • by dbIII (701233)
      It's not that amazing when you consider the service level of their hosted email. A week to correct an internal DNS entry, and meanwhile a customer with sixteen thousand email users just had to wait in the queue to get it fixed. The large print pretends to give, but the fine print says you just have to wait for as long as it takes, and SLAs be damned.
  • by dargaud (518470) <slashdot2@gdargau d . n et> on Saturday February 23, 2013 @11:07AM (#42989129) Homepage
    I wonder how long it will be before there's a major failure loop in the cloud, something like the certificate for cloud X is stored in service Y, which actually uses cloud X as its backend. So when certificate for X stops, the whole thing grinds to a halt with no way to restart it (unless backdoors)...
    • by gweihir (88907)

      Hehehehehe, nice!

      I expect we _will_ see things like this though.

    • by Njovich (553857)

      And I wonder when Slashdot commenters will get how certificate infrastructures work these days... I guess neither of us will get lucky.

  • Anyone have the link?

  • An out of reach place where you give other people your stuff and hope they will hand it to you when you ask.
    I don't want my head in the clouds.
  • Microsoft's Azure could!

  • by johnlcallaway (165670) on Saturday February 23, 2013 @11:41AM (#42989297)
    ... this is what you get. Sure, it's possible the same thing can happen for any company. But at least then you can fire your incompetent staff.

    Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another. One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. Takes over 90 minutes, and constantly fails because of losing the connection. We've looked at deploying systems to various clouds, and the costs were not worth it.

    I will NEVER put any critical business system in someone else's cloud. At worst, I might put it in someone's data center on *MY* servers. The cloud seems to be fine for small business startups and non-important data for personal use. Businesses who no one would even notice if their site was down for a day.

    BTW .. 'Cloud' computing is just remote virtual servers over the Internet. It's really not something new and original. People act like it's some amazing new 'thing'. Well .. it's not. It's just another way of letting companies with limited or no tech skills put up a web site or store data. It's expensive, proprietary, and I doubt very cost effective in the long run.
    • by Alioth (221270) <no@spam> on Saturday February 23, 2013 @11:54AM (#42989385) Journal

      Actually, there's a bit more to being "cloudy" than just virtual servers over the internet (indeed, they need not even be over the internet: you can have your own local cloud, and many companies have internal clouds). Virtual servers over the internet is merely client/server. For a service to be "cloudy", it will generally be HTTP-like (in other words, it has RESTful interfaces and treats each request no differently from the first one; the service doesn't hold state from request to request, just like with HTTP) and distributable. The main benefit of "cloudiness" is that, because of this, you can easily scale up services when demand is high and scale them back when demand is low. It makes it easier to build a resilient service than the traditional client/server type of service where the server side has to keep state. Infrastructures like Amazon's EC2 allow you to scale things up and down easily and economically because you can turn the "virtual server over the internet" part on and off very rapidly, and you only pay for the instances you've instantiated. But just using Amazon's EC2 doesn't automatically make your service "cloudy" if it does not have all the other necessary attributes.

      • by Viol8 (599362)

        "The main benefit of "cloudiness" is because of this you can easy scale up services when demand is high, and scale them back when demand is low."

        Do you genuinely think this wasn't done until some marketdroid thought up the term "cloud"?

        This is supposed to be a tech website FFS, at least pretend to have some sort of tech nous. Scaling available services up and down has been done since the days of fscking mainframes!

        • Yes and it was done by buying a shit ton of hardware and all the complexities and expenses that come with it. The problem is that 90% of the time that hardware was sitting around idle. Or that you would have to purchase a bunch of hardware for a one time project and then hope and pray that someone would buy that hardware from you when you were done. It doesn't take a tech website genius to realize how incredibly inefficient that is.
          • by Todd Knarr (15451)

            And you think the cloud works differently? It's just that someone else is buying all that hardware to have sitting around idle until you need it. You hope. But, being a business, I'll bet one of their policies is to not buy more hardware than their projected needs, to avoid having any more sitting around idle than they absolutely have to to cover their own short-term needs. Anything else increases their costs without providing any revenue, so as a business they're going to avoid it just like you are.

            What ma

            • It's just that someone else is buying all that hardware to have sitting around idle until you need it.

              That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

              What makes it work is that they have so many customers that when one needs more capacity they can take a bit away from everybody else and each customer's share will be so small they won't notice.

              Nooo... when you reserve a VM that VM is yours whether you use it or not. You are paying for it after all. I have a very tough time buying that any of the major cloud platforms are oversubscribed. You will have to back up that claim.

              It doesn't matter anyways. If you have grown to such a monstrous scale that you start to outgrow the capabilities of these cloud platforms, the capital cost of rol

              • by Todd Knarr (15451)

                That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

                Exactly. Now, answer me this: you've decided that you can't afford that large up-front capital expense and having that capacity sitting around unused to deal with the occasional large spike in demand. So why is your cloud provider not following exactly the same business logic that you find sound? Why are they not trying to avoid exactly the same large capital expenditure that you're trying to avoi

                • You seem to have only read the first 2 sentences of my post. I'm going to go ahead and let you read that again because it's relevant to your post.
          • The only people who bought a bunch of hardware and had it sitting around idle were people that didn't know how to manage data centers. You still have to project loads for the cloud, and you still have to pay for the ability to scale up. In fact, in our cost estimating, the cost of moving data into and out of someone else's cloud, and the cost of having those large data sets on their servers, was the reason it was more pricey than having our own servers locally even if we had to buy extra servers.

            And of c
    • Re: (Score:2, Interesting)

      by Anonymous Coward

      Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another.

      RHEL is CentOS is RHEL is Amazon Linux wherever you are. A basic of the cloud is that, as you migrate to it you migrate almost everything to Linux.

      One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. Takes over 90 minutes, and constantly fails because of losing the connection. We've looked at deploying systems to various clouds, and the costs were not worth it.

      There have been outages in Amazon, but almost none has ever crossed from one Availability Zone to another, and one spanning multiple countries has never happened. At the same time there have been many total outages in Azure. Microsoft regularly loses data, whereas every time a Google system fails totally, it turns out they have a tape backup. These are not "minor issues betw

  • Back in the bad old days, IBM had a solution for down time in mission critical systems - such as for United Airlines. It was called redundancy - a complete dual system. Or as we described it: when one of the two parallel systems detected an error, it automatically sent a signal to the second system so that it could go down too.

    • by gweihir (88907)

      I think this design was also used in the first Ariane 5 flight! You know the one where 800 Million Euros in solar-research satellites went up in smoke, because some manager was too stupid to understand that you cannot just plug-in an Ariane 4 guidance module and expect it to work.

  • The system works! Certificates work! Yeah!

    Now fire the idiot who forgot to update the certs and we can get on with life.

    • by fatphil (181876)
      Yes, the single point of failure works!

      But I thought "the cloud" wasn't supposed to have a single point of failure, otherwise it would be just a "remote server" rather than "the cloud"?
      • by kqs (1038910)

        There are always single points of failure. Always. In this case, it was that x509 is poorly designed, but there are others.

        The point of "the cloud" was never to have no single points of failure. It is to avoid any single points of failure it can, and hire smart people to avoid and fix the SPoFs it cannot, all at a far lower price than you could afford. And it works well (unless you choose to use an incompetent cloud provider). Most companies screw up certificate expirations at some point, then spend da

      • by Virtucon (127420)

        Well the cloud works on open web standards and while certificate servers can have redundancy built in, the underlying certificate would still essentially be a single point of failure in the design. Any TLS that relies on certs will have to take this into account. The good news is that while somebody goofed at MSFT, the underlying principles of Certs prevailed and people were denied access to resources because their clients wouldn't trust the MSFT resources protected by those certs. Now, I would be more c

  • Monitoring Fail (Score:4, Insightful)

    by HTMLSpinnr (531389) on Saturday February 23, 2013 @11:58AM (#42989405) Homepage
    I find it hard to believe anyone who maintains such a large fleet of services wouldn't have set up some sort of trivial monitoring (I know they own a product or two) that would include SSL certificate expiration warnings. 30+ days out, a ticket (or some sort of actionable tracking mechanism) should have been generated, alerting those responsible to start taking action. Said ticket should have become progressively higher in severity as the expiration date loomed (meaning nothing had been updated), which, in any sane company, would have implied higher and higher visibility.

    That way, if an extensive test plan for such a simple operation was required, they had plenty of time to execute upon it and still not miss the boat.
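
The escalation scheme described above can be sketched roughly as follows. This is a minimal illustration, not anything Microsoft is known to run: the thresholds, the severity labels and the `ticket_severity` name are all assumptions, and the dates are illustrative.

```python
from datetime import datetime, timezone

# Hypothetical thresholds (days remaining -> ticket severity), tightest first.
SEVERITY_THRESHOLDS = [(3, "critical"), (7, "high"), (14, "medium"), (30, "low")]


def ticket_severity(not_after, now):
    """Return a severity label for a certificate expiring at `not_after`, or None."""
    days_left = (not_after - now).total_seconds() / 86400
    if days_left <= 0:
        return "expired"
    for limit, severity in SEVERITY_THRESHOLDS:
        if days_left <= limit:
            return severity
    return None  # more than 30 days out: no ticket yet


# Illustrative dates only: a certificate expiring 28 days after the check runs.
expiry = datetime(2013, 2, 22, tzinfo=timezone.utc)
print(ticket_severity(expiry, datetime(2013, 1, 25, tzinfo=timezone.utc)))
```

Running the same check daily means the ticket's severity rises on its own as the date approaches, without anyone having to remember to escalate it.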

    Having worked with MS in other ways, I'd say the lack of foresight combined with the inability to act quickly just shows that this sort of customer-forward thinking doesn't exist inside the MS mind.
    • by ageoffri (723674)
      Believe it. When I worked at IBM, there was a certain automation team who let the critical SSL certificate for an ID provisioning tool expire not just once, but two years in a row causing a major outage to a large client.
    • Re:Monitoring Fail (Score:5, Insightful)

      by rabbitfood (586031) on Saturday February 23, 2013 @03:28PM (#42990763)

      Simple operation? You've clearly never worked for a large company.

      Even if a warning wasn't trickled down a month ago, and we've no reason to assume it wasn't, the person whose job it is to act on it, provided they weren't on vacation, won't have simply thrown five dollars at a registrar. They'll have had to put in a request to the finance department, probably via a cost-management chain of command, with a full description of what needed to be paid to whom and why, with payee reference, cost-center code, expense code and departmental authorization, and hoped it would arrive in time to be allocated to the next monthly rubber-stamp meeting. Assuming the application contained no errors, was suitably endorsed and was made against an allocated budget that hadn't been over-spent and wasn't under review, then, perhaps, in the fullness of time, it might have received approval and have been sent back down the chain for subsequent escalation to the bought-ledger department, who'd have looked at the due date, added ninety days and put it on the bottom of the pile. After those ninety days, when the finance folk began to take a view to assessing its urgency, unless they found a proper purchase order from the supplier, and a full set of signed terms and conditions of purchase, non-disclosure agreements, sustainability declarations and ethical supply-chain statements, as now required by any self-respecting outfit, it'll have been put aside and, eventually, sent back round to be done properly. Or, if it all checked out first time, it'll have been put on the system for calendaring into the next round of payment processing.

      I'm sure it might be possible to streamline aspects of such mechanisms, but to suggest there's anything trivial about them is a touch hasty. But you never know. Perhaps they're already thinking of planning a meeting to discuss it, and are working on a framework for identifying the stakeholders as I write.

    • by jader3rd (2222716)
      I would be shocked if some sort of monitoring didn't fire. The problem would be that it would have gotten lost in the noise of all of the other monitors firing for other issues.
    • by TheLink (130905)

      After the infamous Feb 29th incident MS should have set up an Azure cluster identical to production stuff but with all the clocks set to 1 week or more ahead. Have it continuously running regression tests. Certs even getting close to 3 days before expiring is stupid.

      Microsoft has billions of dollars, so if this 12 hour downtime is the best MS can do when they're "All In" (Ballmer's words not mine), it's not a good sign.

  • by Skiron (735617) on Saturday February 23, 2013 @12:18PM (#42989551) Homepage

    I guess MS somewhere in their licensing of this stuff have a clause that states they are not liable. Basically, 'bollocks to the Customers' when we fuck up [again].

    So I cannot understand why people use them at all (once bitten, twice shy, twice bitten.. etc.).

    • Actually, Microsoft has a wide variety of SLAs with financial penalties covering the Azure cloud. I expect customers will be able to claim at least a 10% service credit on this, as it's definitely an issue within Microsoft's control and definitely would cause a miss of the monthly availability number.

      Review http://www.windowsazure.com/en-us/support/legal/sla/ [windowsazure.com] if you're interested in the Azure SLAs. Interestingly, Amazon has a much less tough SLA, as it's calculated on a yearly basis and doesn't have as brut

      • by Skiron (735617)

        99.9% is stated there a lot of times. Is that over a 1000 years?

        If not, that is about 1 day a year outage (when Customers go tits-up).

        They are keeping their promise, it seems.

        • by fatphil (181876)
          99.9% allows roughly 43 minutes of down-time a month (which is the unit they measure in). If it took them 12 hours to get new certificates up, then they are not keeping their promise, they are failing.

          Of course, if that downtime coincides with your working hours, that's an entire working day down. It's a shitty level of service. Nobody hosting their own services, and having skilled staff managing their systems, would find that acceptable. I will admit that 99.999% uptime/connectivity is hard (we've had it one y
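
A quick check of the arithmetic, assuming a 30-day month (an assumption; the SLA's exact measurement window may differ): a 99.9% target leaves roughly 43 minutes a month, while about 7 hours would correspond to 99%. The function name below is made up for illustration.

```python
def allowed_downtime_minutes(availability_pct, period_hours):
    """Minutes of downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours * 60


MONTH_HOURS = 30 * 24  # assumes a 30-day billing month

for target in (99.0, 99.9, 99.999):
    print(target, round(allowed_downtime_minutes(target, MONTH_HOURS), 1))

# Availability actually achieved in a month that includes one 12-hour outage.
achieved = 100 * (1 - 12 / MONTH_HOURS)
print(round(achieved, 2))
```

On those assumptions a single 12-hour outage puts the month at roughly 98.3%, well below a 99.9% target.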
        • by symbolset (646467) *
          The Azure SQL Database reporting facility just completed a 5-day outage this month [theregister.co.uk] so they may be a couple years over their downtime quota this month. Or as somebody else put it recently: "Five nines: 9.9999".
      • by Todd Knarr (15451)

        My problem with those SLAs is that they're for a credit for a fraction of the cost of the service for that month. Which is fine if your business doesn't depend on the service and you suffer no disruption when the service is down. But if you're hosting a Web site on the service, or using it for anything business-critical? The cost of the service is going to be the smallest part of the cost to you of the disruption (that's why you went with the service after all, because it was so much cheaper than doing it i

  • by ejoe_mac (560743) on Saturday February 23, 2013 @01:30PM (#42989999)
    So wrong in so many ways. Any reason you wouldn't purchase a 100 year certificate and just roll with it? Too bad about 1/3 of all Azure disk space is used for endpoint backup. This reminds me of the leap-year calculating bug - Feb 29 2012, you couldn't generate a site because the default is to generate a certificate for 1 year, and well, Feb 29 2013 just doesn't exist. http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx [msdn.com]
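
The class of bug described above (adding one to the year of a Feb 29 date) is easy to reproduce. The sketch below is not Microsoft's code, only an illustration in Python with hypothetical function names; falling back to Feb 28 is one possible policy among several.

```python
from datetime import date


def naive_plus_one_year(d):
    """The fragile approach: bump the year and keep month/day as-is."""
    return d.replace(year=d.year + 1)  # raises ValueError if the date doesn't exist


def safe_plus_one_year(d):
    """Fall back to Feb 28 when the target year has no Feb 29."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


issued = date(2012, 2, 29)
print(safe_plus_one_year(issued))  # 2013-02-28
```

Expressing validity as a number of days (for example `timedelta(days=365)`) sidesteps the problem entirely, at the cost of the expiry not landing on the same calendar day.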
  • by gweihir (88907) on Saturday February 23, 2013 @02:02PM (#42990219)

    From a business perspective, it makes perfect sense: If Azure were reliable, secure and fast, customers could start to wonder why the other products by MS are not. This could heighten customer expectations, and that would be bad as MS really does not have the engineering capabilities to build, say, a good OS or a good office productivity suite and then customers may leave for the alternatives. So I applaud them for their foresight in making Azure just as bad as their other things are. This may actually be quite beneficial for their bottom-line.

  • by Sloppy (14984) on Saturday February 23, 2013 @02:48PM (#42990509) Homepage Journal

    Imagine if someone's signature on your PGP identity expired. It might be a bit of a blow, but people would still have other trust pathways toward you. Then you get a new signature from 'em, or someone else.

    Certs can fail in so many ways, both false positives (compromised CAs) or false negatives (such as this expiration), and a myriad of subjective failures since different people have different reasons to trust (or not trust) different CAs. The risks aren't even theoretical. Failure really happens, to the extent that it's almost routine and we see a story about it here on Slashdot every month.

    And Phil Zimmermann totally solved the problem(!) in, what, 1991? Why are we still using obsolete-the-day-it-came-out single-signer systems? So brittle. So unrealistic.

    The only reason I can think of, is that it would work too well. MitM attacks would become nearly impossible for even the most powerful governments. Certs would become so competitive and cheap that the CA business would collapse.

  • My perception of Ballmer and Dell is that they virtually started with their companies and neither person has wide-ranging training in business management & the psychology of managing. Ballmer is famous for his chair throwing and vicious firing with a loud voice, sometimes for trivial reasons, & for banning Apple products in most places inside the company. Dell has been reported to become physically withdrawn when competitor Apple is mentioned.

    Neither of those responses to common activities speak good of

  • by ei4anb (625481) on Saturday February 23, 2013 @05:37PM (#42991491)
    $ curl -vIs https://www.windowsazure.com/ 2>&1 >/dev/null | grep "expire date"
    * expire date: 2013-11-15 18:15:53 GMT

    Call this from a cronjob script which should then take suitable action if the date is too close.
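
The "suitable action" step could look something like the sketch below. It assumes the line format shown in the output above (other curl builds and TLS backends may print the date differently); `WARN_DAYS` and the function names are hypothetical, and the "action" here is just a printed message.

```python
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical threshold; pick whatever lead time suits you


def parse_expire_date(line):
    """Parse a curl line such as '* expire date: 2013-11-15 18:15:53 GMT'."""
    stamp = line.split("expire date:", 1)[1].strip()
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S GMT")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(line, now):
    """Whole days between `now` and the expiry date on the curl line."""
    return (parse_expire_date(line) - now).days


line = "* expire date: 2013-11-15 18:15:53 GMT"
now = datetime(2013, 2, 23, 17, 37, tzinfo=timezone.utc)  # illustrative check time
remaining = days_until_expiry(line, now)
if remaining < WARN_DAYS:
    print(f"certificate expires in {remaining} days: renew now")
```

In a real cron job the line would come from the curl pipeline's output and `now` from `datetime.now(timezone.utc)`; both are passed in explicitly here so the logic can be checked without a network call.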
