More Uptime Problems For Amazon Cloud

1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
This discussion has been archived. No new comments can be posted.


  • by Anonymous Brave Guy ( 457657 ) on Saturday June 30, 2012 @01:48PM (#40505809)

    It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

    You can only argue that the extra costs and admin involved with self-hosting and paying competent IT staff outweigh the costs of cloud hosting for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud = reliable, and Google's/Amazon's/whoever's people are smarter than your in-house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?

  • by Anonymous Coward on Saturday June 30, 2012 @01:53PM (#40505845)

    So this is the second time this month that Amazon's cloud has gone down. Serious questions should be asked about the sustainability of this service, given its extremely poor uptime record and very large customer base.

    They would have spent millions of dollars installing diesel or gas generators and/or battery banks, and who knows how much money maintaining and testing them, but when it comes time to actually use that equipment in an emergency, the entire system fails.

    You would think having redundant power would be a fundamental, crucial thing to get right when owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

    Now, before people say "well, this was a major storm system that killed 10 people, what do you expect?", my response is that cloud computing is expected to do work for customers hundreds or thousands of kilometres/miles from the actual data centre, so this is a crucial thing we're talking about - millions of people literally depend on these services. That's my first point.

    My second point is that it's not as if anything happened to the data centre itself; it simply lost mains power. There was no fire, no flood, the roof didn't blow off the building, or anything like that; they simply lost power, and all their millions of dollars in equipment failed to pick up the load.

    If I were a corporate customer, or even a regular consumer, I would be seriously questioning the sustainability of Amazon's cloud computing in particular. Google and Facebook seem to be able to handle it, but not Amazon - granted, they don't offer identical products, but their data centres overall seem to stay up 100% or 99.9999999% of the time, unlike Amazon's.
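
    For context on what that many nines would actually mean, a quick back-of-envelope script (plain arithmetic; the 99.9999999% figure is just the number quoted above, and the other rows are common reference points, not anyone's published SLA):

        # Downtime per year allowed by a given availability percentage.
        SECONDS_PER_YEAR = 365.25 * 24 * 3600

        for availability in (99.9, 99.95, 99.99, 99.999, 99.9999999):
            downtime = SECONDS_PER_YEAR * (1 - availability / 100)
            print(f"{availability}% uptime allows about {downtime:.3f} seconds of downtime per year")

    Nine nines works out to roughly three hundredths of a second of downtime per year, which is why that kind of claim deserves skepticism in either direction.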

  • by Anonymous Coward on Saturday June 30, 2012 @01:58PM (#40505881)
    You realise that this took out one data center? That is, all of those other AWS data centers are still working just fine? If anything, this is proving the reliability of cloud providers!

    Why not save a fortune and host everything in-house?

    You really think hosting your own hardware in your own data centers spread across the world will save you a fortune? Have you even bothered to run those figures?

    Even if you have more money than sense, once you've got your hardware spread across the globe, you've still got to build the systems on top to survive an outage in one of them - i.e. exactly what you have to do if you use a cloud provider anyway. So what have you saved, precisely?
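
    Whichever way you go - renting a cloud or running your own racks - that "systems on top" work looks roughly the same. A minimal sketch, with hypothetical endpoint URLs and timeouts, of the kind of cross-region health check and failover either approach ends up needing:

        # Try regional endpoints in order of preference and use the first one that
        # answers its health check. URLs and the timeout are hypothetical
        # placeholders; this is not any provider's actual failover tooling.
        import urllib.error
        import urllib.request

        REGION_ENDPOINTS = [
            "https://us-east.api.example.com/health",   # primary (hypothetical)
            "https://us-west.api.example.com/health",   # failover (hypothetical)
            "https://eu-west.api.example.com/health",   # last resort (hypothetical)
        ]

        def first_healthy_endpoint(endpoints, timeout=2.0):
            """Return the first endpoint whose health check answers 200, or None."""
            for url in endpoints:
                try:
                    with urllib.request.urlopen(url, timeout=timeout) as resp:
                        if resp.status == 200:
                            return url
                except (urllib.error.URLError, OSError):
                    continue  # that region is unreachable; try the next one
            return None

        if __name__ == "__main__":
            target = first_healthy_endpoint(REGION_ENDPOINTS)
            print("routing traffic to:", target or "nowhere - everything is down")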

  • by tnk1 ( 899206 ) on Saturday June 30, 2012 @01:59PM (#40505889)

    And this is ridiculous. How are they not in a datacenter with backup diesel generators and redundant internet egress points? Even the smallest service business I have worked for had this. All they need to do is buy space in a place like Qwest or, even better, Equinix, and it's all covered. A company like Amazon shouldn't be taken out by power issues, of all things. They are either cheaping out or their systems/datacenter leads need to be replaced.

  • by bugs2squash ( 1132591 ) on Saturday June 30, 2012 @01:59PM (#40505891)
    However "Netflix, which uses an architecture designed to route around problems at a single availability zone." seems to have efficiently spread the pain of a North Eastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.
  • That really shouldn't matter, though, as long as the data center's generators are running and they can get fuel. It seems that they are not performing the proper testing and maintenance on their switchgear and generators if they are having this much trouble. The last time the data center in the building where I work went down for a power outage was when we had an arc flash in one of the UPS battery cabinets and they had to shut down the data center (and the rest of the building's power, for that matter).

  • by fuzzyfuzzyfungus ( 1223518 ) on Saturday June 30, 2012 @02:29PM (#40506119) Journal

    The problem is that a lot of people cheap out on their backup power. Generators and UPSes are expensive.

    I wonder, comparing the price/performance numbers on the invoices from Dell with those from APC (hint: one of these has Moore's law at its back, the other... doesn't), what it would take in terms of hardware pricing and software system reliability design to make these backup power systems economically obsolete for most of the 'bulk' data-shoveling and HTTP cruft that keep the tubes humming...

    Obviously, if your software doesn't allow any sort of elegant failover, or you paid a small fortune per core, then redundant PSUs, UPSes, generators, and all the rest make perfect sense. If, however, your software can tolerate a hardware failure, and the price of silicon and storage is plummeting while the price of electrical gear that spends most of its life generating heat and maintenance bills isn't, it becomes interesting to consider the point at which the 'Eh, fuck it. Move the load to somewhere where the lights are still on until the utility guys figure it out.' theory of backup power becomes viable.
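
    To make that trade-off concrete, here is a toy placement sketch - site names, capacities and load figures are entirely made up - of draining a site that has gone dark and re-packing its load onto sites that still have power and headroom:

        # Toy model of the "skip the generators, move the load" approach.
        # Real placement logic would also care about data locality, replication
        # lag, and so on; this only shows the basic capacity bookkeeping.
        SITES = {
            "ashburn":  {"capacity": 100, "load": 70, "has_power": False},  # storm-hit
            "portland": {"capacity": 100, "load": 40, "has_power": True},
            "dublin":   {"capacity": 100, "load": 55, "has_power": True},
        }

        def drain_dark_sites(sites):
            """Move load off powered-down sites onto live sites with spare headroom."""
            displaced = sum(s["load"] for s in sites.values() if not s["has_power"])
            for s in sites.values():
                if not s["has_power"]:
                    s["load"] = 0
            for s in sites.values():
                if displaced <= 0:
                    break
                if s["has_power"]:
                    take = min(s["capacity"] - s["load"], displaced)
                    s["load"] += take
                    displaced -= take
            return displaced  # whatever is left over is load you simply have to shed

        if __name__ == "__main__":
            shed = drain_dark_sites(SITES)
            for name, site in SITES.items():
                print(name, site)
            print("load shed for lack of capacity:", shed)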

  • by PTBarnum ( 233319 ) on Saturday June 30, 2012 @04:04PM (#40506701)

    There is a gap between technical and marketing requirements here.

    The Amazon infrastructure was initially built to support Amazon retail, and Amazon put a lot of pressure on its engineers to make sure their apps were properly redundant across three or more data centers. At one point, the Amazon infrastructure team used to do "game days" where they would randomly take a data center offline and see what broke. The EC2 infrastructure is mostly independent of retail infrastructure, but it was designed in a similar fashion.

    However, Amazon can't tell their customers how to build apps. The customers build what is familiar to them and make assumptions about the uptime of individual servers or data centers. As the OP says, it's "the standard people are used to". Since the customer is always right, Amazon has a marketing need to respond by bringing availability up to those standards, even though it isn't technically necessary.
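
    The "game day" idea scales down to any shop. A toy drill in that spirit - service names and data-center labels are hypothetical, and this is obviously not Amazon's internal tooling - that knocks one data center out on paper and reports which services would be left with no surviving replica:

        # Pick one data center at random, pretend it is offline, and list the
        # services that had every replica there. Placements are hypothetical.
        import random

        PLACEMENTS = {
            "checkout":  ["iad-1", "iad-2", "pdx-1"],
            "catalog":   ["iad-1", "pdx-1", "dub-1"],
            "reporting": ["iad-2"],                    # oops: single copy, single DC
        }

        def game_day(placements, rng=random):
            """Simulate losing one data center; return it plus the services that break."""
            all_dcs = sorted({dc for dcs in placements.values() for dc in dcs})
            victim = rng.choice(all_dcs)
            broken = [svc for svc, dcs in placements.items()
                      if all(dc == victim for dc in dcs)]
            return victim, broken

        if __name__ == "__main__":
            victim, broken = game_day(PLACEMENTS)
            print(f"took {victim} offline; services with no surviving replica: {broken}")

    Like the real exercise, any single run may miss the weak spot; the value is in running it repeatedly and fixing whatever it turns up.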

  • Re:Infrastructure (Score:5, Interesting)

    by tyler_larson ( 558763 ) on Saturday June 30, 2012 @10:47PM (#40508497) Homepage

    In my past two jobs and over the past 20 years, we've worked with dozens of independent and unrelated vendors with locations around the country, including Virginia. Of all the locations where these companies have operations, the ones in Virginia have been dramatically, almost comically, more disaster-prone than the rest of the country and even the rest of the world. The running joke in the office is that whenever any vendor or service provider drops offline, we first check the weather in Virginia before checking whether any of our own systems are offline. Every time, we see a post-mortem a few days later disclosing some failed system or backup or contingency, and every time, they say it's a problem that will never happen again.

    You'd think that all the failing locations would share an operations center or a service provider or at least a single city, but it turns out that the only thing these disaster-prone operations have in common is that they're in Virginia. I have no idea why this is the case. But our company now has a policy singling out Virginia: no mission-critical components are allowed to be based there.
