Cloud, Businesses, Data Storage, Microsoft, IT

Certificate Expiry Leads to Total Outage For Microsoft Azure Secured Storage

rtfa-troll writes "There has been a worldwide (all locations) total outage of storage in Microsoft's Azure cloud. Apparently, 'Microsoft unwittingly let an online security certificate expire Friday, triggering a worldwide outage in an online service that stores data for a wide range of business customers,' according to the San Francisco Chronicle (also Yahoo and the Register). Perhaps too much time has been spent sucking up to storage vendors and not enough looking after the customers? This comes directly after a week-long outage of one of Microsoft's SQL server components in Azure. This is not the first time that we have discussed major outages on Azure and probably won't be the last. It's certainly also not the first time we have discussed Microsoft cloud systems making users' data unavailable."


  • Lolwut? (Score:4, Funny)

    by Anonymous Coward on Saturday February 23, 2013 @10:21AM (#42988913)

    What's an expirty?

  • Expirty? (Score:1, Insightful)

    by Anonymous Coward

    Timothy!! It's your fucking JOB!

  • by Anonymous Coward

    Had better get fired. I normally don't condone firing over mistakes, but this is pretty huge.

    Although, it's also a point of proof of the cloud's inability to be reliable if not set up right.

    • It seems to be a point of poof...
    • Re:Somebody (Score:5, Insightful)

      by Glendale2x ( 210533 ) <slashdot@ninjam o n k ey.us> on Saturday February 23, 2013 @11:23AM (#42989225) Homepage

      Eh, don't put anything you can't live without on systems outside of your control.

      • Re:Somebody (Score:5, Interesting)

        by Nerdfest ( 867930 ) on Saturday February 23, 2013 @12:35PM (#42989635)

        On the other hand, I've worked at places where the worst thing you could do is leave things that the company can't live without *in* the control of the company. Sometimes certain areas of expertise require specializations that the company just doesn't have and isn't interested in acquiring. Of course handing the responsibility of those things off to *Microsoft* is not necessarily any better.

        Yeah, but who is? AWS has more outages than I care to remember, Rackspace has had its share of outages, Google goes down like once a month, even Apple can't keep a service up - and that's pretty much all the big players counted out.

  • Typical. (Score:5, Funny)

    by berchca ( 414155 ) on Saturday February 23, 2013 @10:28AM (#42988963) Homepage

    Not the first time they've made such blunders:
    http://slashdot.org/story/03/11/06/1540257/microsoft-forgets-to-renew-hotmailcouk

    If only Redmond had some sort of calendar system to help them remember this stuff.

    • Re:Typical. (Score:5, Funny)

      by Stormthirst ( 66538 ) on Saturday February 23, 2013 @10:35AM (#42988983)

      Does MS not have a credit card its vendor can keep on file?

      • Re:Typical. (Score:5, Interesting)

        by Charliemopps ( 1157495 ) on Saturday February 23, 2013 @11:38AM (#42989277)

        You'd think that, but there's contract stuff. The thing is, you basically need a department in charge of renewing shit like this when you have enterprise-level services. We've got a site with millions of hits daily and still manage to let it expire every couple of years. You try the credit card thing, but credit cards expire. You try recurring billing and then you get into a contractual nightmare with the registrar. The registrar isn't going to do you any favors: you might get millions of hits daily, but they still only get $5/year even from google.com, so fuck you, figure out the billing yourself.

        The only real way to do it effectively is to build yourself a database of all the crap you need to renew regularly, then hire someone to renew that stuff. But who are you going to hire? It usually ends up being some assistant who doesn't know a damned thing about tech... and it's still going to cost you $60k a year in pay and benefits to retain them. That's an expensive way of keeping track of such things... ah, the website admins can remember, right?
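
        A minimal sketch, in Python, of the kind of renewal "database" described above: a tiny list of things that expire, each with an owner and a due date, plus a report of anything coming up soon. The items, owners, and dates are invented examples.

        from datetime import date, timedelta

        # Hypothetical inventory of things that quietly expire: certs, domains, support contracts.
        RENEWALS = [
            {"item": "storage frontend TLS cert", "owner": "ops-team",   "due": date(2013, 11, 15)},
            {"item": "example.com domain",        "owner": "it-admin",   "due": date(2014, 2, 1)},
            {"item": "CDN support contract",      "owner": "purchasing", "due": date(2013, 3, 10)},
        ]

        def due_soon(renewals, today=None, window_days=30):
            """Return every item whose due date falls inside the warning window."""
            today = today or date.today()
            cutoff = today + timedelta(days=window_days)
            return [r for r in renewals if r["due"] <= cutoff]

        if __name__ == "__main__":
            for r in due_soon(RENEWALS, today=date(2013, 2, 23)):
                print("RENEW SOON: %s (owner: %s, due %s)" % (r["item"], r["owner"], r["due"]))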

        • It seems like a competent registrar would send a bill[ing statement] to the billing contact.

        • Except that companies like Microsoft and Google register domains through "Enterprise" registrars like MarkMonitor, who charge upwards of a few hundred (possibly even thousand) dollars per year for their service - which supposedly includes "not letting the fucking things expire" and "making sure other people don't register our damn marks".

          Microsoft actually has even less excuse in this instance, believe it or not - Microsoft's certificate vendor is itself. All MS certificates are chained up to a Microsoft s

    • Re:Typical. (Score:5, Interesting)

      by hsmith ( 818216 ) on Saturday February 23, 2013 @10:39AM (#42988999)
      It was almost a year ago to the day that Azure was down for a day because no one accounted for the leap year when validating certificates, lol. AWS seems to have issues too, but they don't seem to revolve around blatant stupidity and result in an entire day of downtime.
      • Re:Typical. (Score:5, Insightful)

        by rtb61 ( 674572 ) on Saturday February 23, 2013 @11:20AM (#42989179) Homepage

        M$ has a history of lacking customer focus, hence it will fail in any industry that demands the highest levels of customer focus. For cloud services to be down for a day is inexcusable, and seriously, any IT management staff that fails to acknowledge these failures and uses or recommends Azure should be fired. Any downtime should be measured in minutes, not days; this should be considered catastrophic failure. M$ is far too used to its EULAs, a warranty without a warranty, and has become woefully complacent about actually guaranteeing a supply of service. "Meh, it mostly works" is their motto, "and we'll fix it next time round, for sure this time."

  • Tip of the iceberg (Score:5, Insightful)

    by gmuslera ( 3436 ) on Saturday February 23, 2013 @10:38AM (#42988997) Homepage Journal
    If you can't trust Microsoft with small but essential things like this, should you trust them with bigger ones?
    • by pr0nbot ( 313417 )
      For me the confusing thing is that there was a single point of failure. I thought that much of what the cloud was about was resilience; I would expect that someone designing cloud infrastructure would have done an analysis of failure points, and implemented failover mechanisms (or at least monitoring and recovery procedures). Ok, maybe not a cloud-startup-du-jour, but certainly a big enterprise-style entity like Microsoft.
      • by Junta ( 36770 ) on Saturday February 23, 2013 @12:10PM (#42989499)

        The reality is, if you outsource your hosting to a single company, there will always be single points of failure.

        There will be architectural ones, like a root of trust expiring and the security framework taking everything down.

        There will be bugs that can bite all of their instances in the same way at the same time.

        There will be business realities like failing to pay electric bills, or collapsing, or simply closing down their hosting business for the sake of other business interests.

        Ideally:
        -You must keep ownership of all data required to set up anywhere, at all times. Even if you host nothing publicly yourself, you must ensure all your data exists on storage that you own.
        -You either do not outsource your hosting (in which case your single point of failure, business-wise, would take you out anyway), or else you outsource to financially independent companies. "Everything to EC2" is a huge mistake, just as much as "everything to Azure" is a huge mistake.
        -Never trust a provider's security promises beyond what they explicitly accept liability for. If you consider the potential risk to be "priceless", then you cannot host it. If you do know what your exposure is (e.g. you could be sued for 20 million), then only host it if the provider will assume liability to the tune of 20 million.

      • by dbIII ( 701233 )
      I think there are multiple single points of failure, such as the leap-year problem that caused an entire day of downtime last year.
  • by crt ( 44106 ) on Saturday February 23, 2013 @10:56AM (#42989069)

    The really amazing thing is that if you look at their service dashboard, it took them 12 hours to update the certificates on their site:
    http://www.windowsazure.com/en-us/support/service-dashboard/ [windowsazure.com]

    They spent several hours doing "test deployments" ... while it's great to make sure you aren't going to make something worse, updating an SSL cert isn't exactly rocket science. I'd hate to see how long it would take to recover from a more serious service issue triggered by a software bug.

    • by Glendale2x ( 210533 ) <slashdot@ninjam o n k ey.us> on Saturday February 23, 2013 @11:21AM (#42989203) Homepage

      Maybe they tried rolling back to an older version of the cert first.

      (Yes, that was sarcasm.)

      • by sribe ( 304414 )

        Maybe they tried rolling back to an older version of the cert first.

        No, first they would have tried reinstalling the current cert. Three times. Only then would they have moved on to rolling back to the prior version.

      • by gweihir ( 88907 )

        Maybe they tried rolling back to an older version of the cert first.

        (Yes, that was sarcasm.)

        You know, from their track record, I would consider this entirely possible....

      • by rjr162 ( 69736 )

        Pretty sure they tried rebooting first to solve the problem, which caused Windows System Repair to start on boot-up. System Repair ran the whole time (since there's a grayed-out cancel button you can't click), after which it reported that System Repair was unable to repair the system.

    • by dbIII ( 701233 )
      It's not that amazing when you consider the service level of their hosted email. A week to correct an internal DNS entry, and meanwhile a customer with sixteen thousand email users just had to wait in the queue to get it fixed. The large print pretends to give, but the fine print says you just have to wait for as long as it takes, and SLAs be damned.
  • by dargaud ( 518470 ) <slashdot2@@@gdargaud...net> on Saturday February 23, 2013 @11:07AM (#42989129) Homepage
    I wonder how long it will be before there's a major failure loop in the cloud, something like the certificate for cloud X being stored in service Y, which actually uses cloud X as its backend. So when the certificate for X expires, the whole thing grinds to a halt with no way to restart it (unless there are backdoors)...
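
    A toy illustration, in Python, of the failure loop described above - service Y holds the certificate for cloud X while itself running on X. The dependency graph is invented; the point is only that such loops are easy to detect before they bite.

    def find_cycle(deps, start, path=None):
        """deps maps each service to the services it needs in order to come back up."""
        path = path or []
        if start in path:
            return path[path.index(start):] + [start]
        for nxt in deps.get(start, []):
            cycle = find_cycle(deps, nxt, path + [start])
            if cycle:
                return cycle
        return None

    if __name__ == "__main__":
        deps = {
            "cloud-X": ["cert-service-Y"],   # X needs its certificate from Y to start
            "cert-service-Y": ["cloud-X"],   # Y is hosted on cloud X
        }
        print(find_cycle(deps, "cloud-X"))   # ['cloud-X', 'cert-service-Y', 'cloud-X']
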
    • by gweihir ( 88907 )

      Hehehehehe, nice!

      I expect we _will_ see things like this though.

    • by Njovich ( 553857 )

      And I wonder when Slashdot commenters will get how certificate infrastructures work these days... I guess neither of us will get lucky.

  • Anyone have the link?

  • An out of reach place where you give other people your stuff and hope they will hand it to you when you ask.
    I don't want my head in the clouds.
  • Microsoft's Azure could!

  • by johnlcallaway ( 165670 ) on Saturday February 23, 2013 @11:41AM (#42989297)
    ... this is what you get. Sure, it's possible the same thing can happen for any company. But at least then you can fire your incompetent staff.

    Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another. One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. It takes over 90 minutes and constantly fails because the connection drops. We've looked at deploying systems to various clouds, and the costs were not worth it.

    I will NEVER put any critical business system in someone else's cloud. At worst, I might put it in someone's data center on *MY* servers. The cloud seems to be fine for small business startups and non-important data for personal use, or for businesses where no one would even notice if their site was down for a day.

    BTW .. 'Cloud' computing is just remote virtual servers over the Internet. It's really not something new and original. People act like it's some amazing new 'thing'. Well .. it's not. It's just another way of letting companies with limited or no tech skills put up a web site or store data. It's expensive, proprietary, and I doubt it's very cost-effective in the long run.
    • by Alioth ( 221270 ) <no@spam> on Saturday February 23, 2013 @11:54AM (#42989385) Journal

      Actually, there's a bit more to being "cloudy" than just virtual servers over the internet (indeed, they need not even be over the internet - you can have your own local cloud, and many companies run internal clouds). Virtual servers over the internet is merely client/server. For a service to be "cloudy", it will generally have attributes like statelessness (RESTful interfaces, with each request treated no differently from the first, so the service doesn't hold state from request to request, just like HTTP itself) and distributability. The main benefit of "cloudiness" is that, because of this, you can easily scale services up when demand is high and scale them back when demand is low. It makes it easier to build a resilient service than the traditional client/server model, where the server side has to keep state. Infrastructures like Amazon's EC2 let you scale things up and down easily and economically because you can turn the "virtual server over the internet" part on and off very rapidly, and you only pay for the instances you've instantiated. But just using Amazon's EC2 doesn't automatically make your service "cloudy" if it doesn't have the other necessary attributes.
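
      A toy sketch of the statelessness point, using only the Python standard library; "store" below is just a dict standing in for an external shared store such as a cache or database, and the handlers are invented examples, not anyone's API.

      # Not "cloudy": per-process state is lost when this instance is replaced,
      # and a second instance behind a load balancer would never see it.
      _sessions = {}

      def handle_stateful(session_id, item):
          _sessions.setdefault(session_id, []).append(item)
          return len(_sessions[session_id])

      # "Cloudy": all state travels with the request or lives in a shared external
      # store, so any instance can serve any request and instances can come and go.
      def handle_stateless(store, session_id, item):
          cart = store.get(session_id, []) + [item]
          store[session_id] = cart
          return len(cart)

      if __name__ == "__main__":
          shared_store = {}                                      # stand-in for a cache or database
          print(handle_stateless(shared_store, "abc", "book"))   # 1
          print(handle_stateless(shared_store, "abc", "lamp"))   # 2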

      • by Viol8 ( 599362 )

        "The main benefit of "cloudiness" is because of this you can easy scale up services when demand is high, and scale them back when demand is low."

        Do you genuinely think this wasn't done until some marketdroid thought up the term "cloud"?

        This is supposed to be a tech website FFS, at least pretend to have some sort of tech nous. Scaling available services up and down has been done since the days of fscking mainframes!

        • Yes and it was done by buying a shit ton of hardware and all the complexities and expenses that come with it. The problem is that 90% of the time that hardware was sitting around idle. Or that you would have to purchase a bunch of hardware for a one time project and then hope and pray that someone would buy that hardware from you when you were done. It doesn't take a tech website genius to realize how incredibly inefficient that is.
          • And you think the cloud works differently? It's just that someone else is buying all that hardware to have sitting around idle until you need it. You hope. But, being a business, I'll bet one of their policies is to not buy more hardware than their projected needs, to avoid having any more sitting around idle than they absolutely have to to cover their own short-term needs. Anything else increases their costs without providing any revenue, so as a business they're going to avoid it just like you are.

            What ma

            • It's just that someone else is buying all that hardware to have sitting around idle until you need it.

              That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

              What makes it work is that they have so many customers that when one needs more capacity they can take a bit away from everybody else and each customer's share will be so small they won't notice.

              Nooo... when you reserve a VM that VM is yours whether you use it or not. You are paying for it after all. I have a very tough time buying that any of the major cloud platforms are oversubscribed. You will have to back up that claim.

              It doesn't matter anyways. If you have grown to such a monstrous scale that you start to outgrow the capabilities of these cloud platforms, the capital cost of rol

              • That's no longer my problem. It's now an operating expense for me instead of a massive up front capital expense.

                Exactly. Now, answer me this: you've decided that you can't afford that large up-front capital expense and having that capacity sitting around unused to deal with the occasional large spike in demand. So why is your cloud provider not following exactly the same business logic that you find sound? Why are they not trying to avoid exactly the same large capital expenditure that you're trying to avoi

                • You seem to have only read the first 2 sentences of my post. I'm going to go ahead and let you read that again because it's relevant to your post.
          • The only people who bought a bunch of hardware and had it sitting around idle were people that didn't know how to manage data centers. You still have to project loads for the cloud, and you still have to pay for the ability to scale up. In fact, in our cost estimating, the cost of moving data into and out of someone else's cloud, and the cost of having those large data sets on their servers, was the reason it was more pricey than having our own servers locally even if we had to buy extra servers.

            And of c
    • Re: (Score:2, Interesting)

      by Anonymous Coward

      Once you deploy to a vendor, you are stuck. From what I've seen, you can't easily move data and code from one vendor to another.

      RHEL is CentOS is RHEL is Amazon Linux wherever you are. A basic feature of the cloud is that, as you migrate to it, you migrate almost everything to Linux.

      One of our clients is in the UK Azure cloud and we have to BCP about 6M rows from their server to our system every week. Takes over 90 minutes, and constantly fails because of losing the connection. We've looked at deploying systems to various clouds, and the costs were not worth it.

      There have been outages at Amazon, but almost none has ever crossed from one availability zone to another, and an outage spanning multiple countries has never happened. At the same time there have been many total outages of Azure. And whilst Microsoft regularly loses data, every time a Google system fails totally it turns out they have a tape backup. These are not "minor issues betw

  • Back in the bad old days, IBM had a solution for downtime in mission-critical systems - such as for United Airlines. It was called redundancy - a complete dual system. Or, as we described it: when one of the two parallel systems detected an error, it automatically sent a signal to the second system so that it could go down too.

    • by gweihir ( 88907 )

      I think this design was also used in the first Ariane 5 flight! You know, the one where 800 million euros' worth of solar-research satellites went up in smoke because some manager was too stupid to understand that you cannot just plug in an Ariane 4 guidance module and expect it to work.

  • The system works! Certificates work! Yeah!

    Now fire the idiot who forgot to update the certs and we can get on with life.

    • by fatphil ( 181876 )
      Yes, the single point of failure works!

      But I thought "the cloud" wasn't supposed to have a single point of failure, otherwise it would be just a "remote server" rather than "the cloud"?
      • by kqs ( 1038910 )

        There are always single points of failure. Always. In this case, it was that X.509 is poorly designed, but there are others.

        The point of "the cloud" was never to have no single points of failure. It is to avoid any single points of failure it can, and hire smart people to avoid and fix the SPoFs it cannot, all at a far lower price than you could manage yourself. And it works well (unless you choose to use an incompetent cloud provider). Most companies screw up certificate expirations at some point, then spend da

      • Well the cloud works on open web standards and while certificate servers can have redundancy built in, the underlying certificate would still essentially be a single point of failure in the design. Any TLS that relies on certs will have to take this into account. The good news is that while somebody goofed at MSFT, the underlying principles of Certs prevailed and people were denied access to resources because their clients wouldn't trust the MSFT resources protected by those certs. Now, I would be more c

  • Monitoring Fail (Score:4, Insightful)

    by HTMLSpinnr ( 531389 ) on Saturday February 23, 2013 @11:58AM (#42989405) Homepage
    I find it hard to believe that anyone who maintains such a large fleet of services wouldn't have set up some sort of trivial monitoring (I know they own a product or two) that would include SSL certificate expiration warnings. 30+ days out, a ticket (or some sort of actionable tracking mechanism) should have been generated, alerting those responsible to start taking action. Said ticket should have become progressively higher in severity as the expiration date loomed (meaning nothing had been updated), which, in any sane company, would have implied higher and higher visibility.

    That way, if an extensive test plan was required for such a simple operation, they had plenty of time to execute on it and still not miss the boat.

    Working with MS in other ways, combined with both the lack of foresight and the inability to act quickly here, just shows that this sort of customer-first thinking doesn't exist inside the MS mind.
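
    A rough sketch of that escalation in Python. It assumes something else already supplies the days-to-expiry per certificate; the thresholds, severity labels, and certificate names are made up for illustration.

    def severity(days_left):
        """Map remaining certificate lifetime to an escalation level."""
        if days_left > 30:
            return None                                    # nothing to do yet
        if days_left > 14:
            return "SEV-4: open a tracking ticket"
        if days_left > 3:
            return "SEV-2: escalate to the service owner"
        return "SEV-1: page on-call, renewal is overdue"

    if __name__ == "__main__":
        inventory = [("storage frontend cert", 45),
                     ("internal API cert", 10),
                     ("SQL gateway cert", 2)]
        for name, days in inventory:
            action = severity(days)
            if action:
                print("%s: %s" % (name, action))
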
    • Believe it. When I worked at IBM, there was a certain automation team who let the critical SSL certificate for an ID provisioning tool expire not just once but two years in a row, causing a major outage for a large client.
    • Re:Monitoring Fail (Score:5, Insightful)

      by rabbitfood ( 586031 ) on Saturday February 23, 2013 @03:28PM (#42990763)

      Simple operation? You've clearly never worked for a large company.

      Even if a warning was trickled down a month ago, and we've no reason to assume it wasn't, the person whose job it is to act on it, provided they weren't on vacation, won't have simply thrown five dollars at a registrar. They'll have had to put in a request to the finance department, probably via a cost-management chain of command, with a full description of what needed to be paid to whom and why, with payee reference, cost-center code, expense code and departmental authorization, and hoped it would arrive in time to be allocated to the next monthly rubber-stamp meeting. Assuming the application contained no errors, was suitably endorsed and was made against an allocated budget that hadn't been over-spent and wasn't under review, then, perhaps, in the fullness of time, it might have received approval and have been sent back down the chain for subsequent escalation to the bought-ledger department, who'd have looked at the due date, added ninety days and put it on the bottom of the pile. After those ninety days, when the finance folk began to take a view to assessing its urgency, unless they found a proper purchase order from the supplier, and a full set of signed terms and conditions of purchase, non-disclosure agreements, sustainability declarations and ethical supply-chain statements, as now required by any self-respecting outfit, it'll have been put aside and, eventually, sent back round to be done properly. Or, if it all checked out first time, it'll have been put on the system for calendaring into the next round of payment processing.

      I'm sure it might be possible to streamline aspects of such mechanisms, but to suggest there's anything trivial about them is a touch hasty. But you never know. Perhaps they're already thinking of planning a meeting to discuss it, and are working on a framework for identifying the stakeholders as I write.

    • I would be shocked if some sort of monitoring didn't fire. The problem would be that it would have gotten lost in the noise of all of the other monitors firing for other issues.
    • by TheLink ( 130905 )

      After the infamous Feb 29th incident, MS should have set up an Azure cluster identical to the production one but with all the clocks set a week or more ahead, and had it continuously running regression tests. Letting certs get even within 3 days of expiring is stupid.

      Microsoft has billions of dollars, so if this 12-hour downtime is the best MS can do when they're "All In" (Ballmer's words, not mine), it's not a good sign.
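
      Applied to certificates alone, the "clocks a week ahead" idea boils down to validating the inventory against a simulated future date, so anything that would expire inside the window fails today. A small Python sketch, with an invented inventory rather than anything Azure actually runs:

      from datetime import datetime, timedelta

      CERT_EXPIRIES = {                         # hypothetical: cert name -> notAfter
          "storage frontend": datetime(2013, 11, 15, 18, 15, 53),
          "SQL gateway":      datetime(2013, 3, 1, 0, 0, 0),
      }

      def certs_expired_by(simulated_now):
          return [name for name, not_after in CERT_EXPIRIES.items()
                  if not_after <= simulated_now]

      if __name__ == "__main__":
          # Pretend today is a week later than it really is; a regression test
          # would simply assert this list is empty.
          future = datetime(2013, 2, 23) + timedelta(days=7)
          print("would expire within a week:", certs_expired_by(future))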

  • by Skiron ( 735617 ) on Saturday February 23, 2013 @12:18PM (#42989551)

    I guess MS somewhere in the licensing of this stuff has a clause that states they are not liable. Basically, 'bollocks to the customers' when we fuck up [again].

    So I cannot understand why people use them at all (once bitten, twice shy; twice bitten... etc.).

    • Actually, Microsoft has a wide variety of SLAs with financial penalties covering the Azure cloud. I expect customers will be able to claim at least a 10% service credit on this, as it's definitely an issue within Microsoft's control and definitely would cause a miss of the monthly availability number.

      Review http://www.windowsazure.com/en-us/support/legal/sla/ [windowsazure.com] if you're interested in the Azure SLAs. Interestingly, Amazon has a much less tough SLA, as it's calculated on a yearly basis and doesn't have as brut

      • by Skiron ( 735617 )

        99.9% is stated there a lot of times. Is that over 1000 years?

        If not, that is about 9 hours of outage a year (when customers go tits-up).

        They are keeping their promise, it seems.

        • by fatphil ( 181876 )
          99.9% works out to about 43 minutes of downtime a month (which is the unit they measure in). If it took them 12 hours to get new certificates up, then they are not keeping their promise; they are failing.

          Of course, if that downtime coincides with your working hours, that's an entire working day down. It's a shitty level of service. Nobody hosting their own services, and having skilled staff managing their systems, would find that acceptable. I will admit that 99.999% uptime/connectivity is hard (we've had it one y
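
          The arithmetic behind those figures, as a quick Python sanity check (assuming a 30-day month and a 365-day year):

          def allowed_downtime_hours(sla_percent, period_hours):
              return period_hours * (100.0 - sla_percent) / 100.0

          if __name__ == "__main__":
              for sla in (99.9, 99.95, 99.99, 99.999):
                  per_month = allowed_downtime_hours(sla, 30 * 24) * 60   # minutes
                  per_year = allowed_downtime_hours(sla, 365 * 24)        # hours
                  print("%g%%: %6.1f min/month, %6.2f h/year" % (sla, per_month, per_year))
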
        • The Azure SQL Database reporting facility just completed a 5-day outage this month [theregister.co.uk] so they may be a couple years over their downtime quota this month. Or as somebody else put it recently: "Five nines: 9.9999".
      • My problem with those SLAs is that they're for a credit for a fraction of the cost of the service for that month. Which is fine if your business doesn't depend on the service and you suffer no disruption when the service is down. But if you're hosting a Web site on the service, or using it for anything business-critical? The cost of the service is going to be the smallest part of the cost to you of the disruption (that's why you went with the service after all, because it was so much cheaper than doing it i

  • by ejoe_mac ( 560743 ) on Saturday February 23, 2013 @01:30PM (#42989999)
    So wrong in so many ways. Any reason you wouldn't purchase a 100-year certificate and just roll with it? Too bad about 1/3 of all Azure disk space is used for endpoint backup. This reminds me of the leap-year calculation bug on Feb 29 2012: you couldn't create a new service because the default is to generate a certificate valid for 1 year, and, well, Feb 29 2013 just doesn't exist. http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx [msdn.com]
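
    A toy reproduction of that trap in Python (not Azure's actual code): naively "adding a year" by reusing the same month and day blows up when the start date is Feb 29, because Feb 29 of the following year usually doesn't exist.

    from datetime import date

    def naive_one_year_later(d):
        return d.replace(year=d.year + 1)            # raises ValueError for Feb 29

    def safe_one_year_later(d):
        try:
            return d.replace(year=d.year + 1)
        except ValueError:                           # Feb 29 -> fall back to Feb 28
            return d.replace(year=d.year + 1, day=28)

    if __name__ == "__main__":
        issued = date(2012, 2, 29)
        print(safe_one_year_later(issued))           # 2013-02-28
        try:
            naive_one_year_later(issued)
        except ValueError as err:
            print("naive version fails:", err)       # day is out of range for month
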
  • by gweihir ( 88907 ) on Saturday February 23, 2013 @02:02PM (#42990219)

    From a business perspective, it makes perfect sense: if Azure were reliable, secure and fast, customers could start to wonder why the other products by MS are not. This could heighten customer expectations, and that would be bad, as MS really does not have the engineering capabilities to build, say, a good OS or a good office productivity suite, and then customers might leave for the alternatives. So I applaud them for their foresight in making Azure just as bad as their other things are. This may actually be quite beneficial for their bottom line.

  • by Sloppy ( 14984 ) on Saturday February 23, 2013 @02:48PM (#42990509) Homepage Journal

    Imagine if someone's signature on your PGP identity expired. It might be a bit of a blow, but people would still have other trust pathways toward you. Then you get a new signature from 'em, or someone else.

    Certs can fail in so many ways, both false positives (compromised CAs) and false negatives (such as this expiration), plus a myriad of subjective failures, since different people have different reasons to trust (or not trust) different CAs. The risks aren't even theoretical. Failure really happens, to the extent that it's almost routine and we see a story about it here on Slashdot every month.

    And Phil Zimmermann totally solved the problem(!) in, what, 1991? Why are we still using obsolete-the-day-it-came-out single-signer systems? So brittle. So unrealistic.

    The only reason I can think of, is that it would work too well. MitM attacks would become nearly impossible for even the most powerful governments. Certs would become so competitive and cheap that the CA business would collapse.

  • My perception of Ballmer and Dell is that they virtually started with their companies, and neither has wide-ranging training in business management and the psychology of managing. Ballmer is famous for his chair-throwing and for viciously firing people at high volume, sometimes for trivial reasons, and for banning Apple products in most places inside the company. Dell has been reported to become physically withdrawn when competitor Apple is mentioned.

    Neither of those responses to common activities speak good of

  • by ei4anb ( 625481 ) on Saturday February 23, 2013 @05:37PM (#42991491)
    $ curl -vIs https://www.windowsazure.com/ 2>&1 >/dev/null | grep "expire date"
    * expire date: 2013-11-15 18:15:53 GMT

    Call this from a cron job script, which should then take suitable action if the date is too close.
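
    A slightly sturdier take on the same idea, using only the Python standard library instead of parsing curl's output; the host list and the 30-day threshold are examples, and wiring the warning into mail or paging is left to taste.

    import socket
    import ssl
    import sys
    from datetime import datetime, timezone

    WARN_DAYS = 30
    HOSTS = ["www.windowsazure.com"]        # whatever endpoints you care about

    def days_left(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]   # e.g. 'Nov 15 18:15:53 2013 GMT'
        expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
        return (expires - datetime.now(timezone.utc)).days

    if __name__ == "__main__":
        expiring = [h for h in HOSTS if days_left(h) < WARN_DAYS]
        for host in expiring:
            print("WARNING: certificate for %s expires in under %d days" % (host, WARN_DAYS))
        sys.exit(1 if expiring else 0)      # non-zero exit so cron's MAILTO flags it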
