
More Uptime Problems For Amazon Cloud

Posted by Soulskill
from the stormy-weather dept.
1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
This discussion has been archived. No new comments can be posted.

  • by AlienIntelligence (1184493) on Saturday June 30, 2012 @01:43PM (#40505785)

    Nuf said

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Here's what's going on - Amazon's us-east-1 datacenter has been having some issues with its Relational Database Services (RDS), which is the database system holding all of the chumby data.

      What appears to be happening is frequent premature disconnects between the EC2 instances running the web servers and the main database. MySQL has a trigger in it that when too many premature disconnects occur without a successful connection, it assumes it's being hacked and blocks incoming connections from that server unt
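If that is the mechanism, it matches MySQL's documented host blocking: the server counts interrupted connection attempts per host and, once the count passes `max_connect_errors` without a successful connect in between, refuses that host until an administrator runs `FLUSH HOSTS`. A rough Python model of that behavior (the threshold and method names here are illustrative, not MySQL internals):

```python
# Rough model of the host-blocking behavior described above: a per-host
# counter of interrupted connects; exceed max_connect_errors without a
# successful connect and the host is refused until FLUSH HOSTS.

class HostCache:
    def __init__(self, max_errors=10):   # 10 was MySQL's long-time default
        self.max_errors = max_errors
        self.errors = {}                 # host -> interrupted-connect count
        self.blocked = set()

    def interrupted_connect(self, host):
        if host in self.blocked:
            return "blocked"
        self.errors[host] = self.errors.get(host, 0) + 1
        if self.errors[host] > self.max_errors:
            self.blocked.add(host)
            return "blocked"
        return "counted"

    def successful_connect(self, host):
        if host in self.blocked:
            return "blocked"             # success no longer helps once blocked
        self.errors[host] = 0            # a good connect resets the counter
        return "ok"

    def flush_hosts(self):               # the FLUSH HOSTS escape hatch
        self.errors.clear()
        self.blocked.clear()
```

A web tier flapping through a power event racks up exactly this kind of error count, then stays locked out even after everything else recovers, which fits the symptom described.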

    • by CptNerd (455084)
      Well, they did say there was a lot of "cloud to cloud" lightning in that storm...
  • by Anonymous Coward on Saturday June 30, 2012 @01:44PM (#40505797)

    I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.

    • That really shouldn't matter though as long as the Data center's generators are running and they can get fuel. It seems that they are not performing the proper testing and maintenance on their switchgear and generators if they are having this much trouble. The last time the data center in the building where I work went down for a power outage was when we had an arc flash in one of the UPS battery cabinets and they had to shut the data center (and the rest of the building's power for that matter) down.

      • by John Bresnahan (638668) on Saturday June 30, 2012 @02:05PM (#40505939)
        Of course, the network only works if every router in between the data center and the customer has power. In a power outage of this size, it's entirely possible that more than one link is down.
        • by bhcompy (1877290)
          Isn't this the point of routing the way we do? It's self-healing, of a sort, as long as another path exists.
          • by timeOday (582209)
            I hope an insider will weigh in on this, but I don't think the Internet is all that self-healing at the upper levels, as when dealing with Netflix, Amazon, google etc. At those levels links are not an abstraction; they are statically routing across specific fiber segments, and there probably isn't enough overcapacity in the infrastructure to simply route around without interruption. Think about when an Interstate is closed through a metropolitan area - yes, you can still get there along side streets event
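The "as long as another path exists" caveat above is the crux. A toy shortest-path sketch (hypothetical topology, not any real network) shows both the healing and the failure mode:

```python
# Toy illustration of "self-healing, as long as another path exists":
# shortest-path routing over a hypothetical topology. Lose one fiber
# segment and traffic reroutes; lose them all and no routing helps.

from heapq import heappush, heappop

def shortest_path(links, src, dst):
    """Dijkstra over {(a, b): cost} undirected links; returns hop list or None."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue, seen = [(0, src, [src])], set()
    while queue:
        cost, node, path = heappop(queue)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, hop_cost in graph.get(node, []):
            if nxt not in seen:
                heappush(queue, (cost + hop_cost, nxt, path + [nxt]))
    return None  # no surviving path

# Hypothetical topology: customer reaches the datacenter via two transits.
links = {
    ("cust", "transitA"): 1, ("transitA", "dc"): 1,
    ("cust", "transitB"): 2, ("transitB", "dc"): 2,
}
```

The overcapacity objection is the `None` case: if the surviving links have no spare capacity, the "alternate path" exists on paper only.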
        • Well, as of current reports... 2.5 million are without power in Virginia [foxnews.com], 800 thousand in Maryland [chicagotribune.com], 400+ thousand in DC [wtop.com]. I've seen numbers in the 3.5 million region between Ohio and New Jersey. We got power back early this morning ~0400, but we STILL don't have phone, net, or cable at home. The real question, since some areas in DC Metro are not supposed to get power back for nearly a week, is: do the emergency generators have sufficient fuel bunkers?
          • Natural gas generators will likely be OK. Gasoline may be a problem, however, since stations can't pump fuel. There was power in Fredericksburg, VA, and it seems the surrounding areas didn't have any, going by the mobs at the gas stations.
        • by baegucb (18706)

          I asked a maintenance person at work how long we could go in the event of a power outage. I got a blank look like they couldn't fathom the question, and then told we'd go forever. My workplace has 6 generators with 400? gallons of diesel for each one. One generator will handle the current load. It's all tested monthly. (and the odd times city power gets cut)
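For scale, the back-of-envelope math on a setup like the one described works out as follows (the burn rate is a guess, not from the post; a mid-size diesel genset at partial load might burn somewhere around 15-25 gallons/hour):

```python
# Back-of-envelope fuel math: 6 generators with roughly 400 gallons of
# diesel each, only one generator needed to carry the load.

def runtime_hours(tanks=6, gallons_per_tank=400, burn_gal_per_hour=20):
    """Hours of runtime if all fuel can feed the one running generator."""
    return tanks * gallons_per_tank / burn_gal_per_hour

# 2400 gallons at an assumed 20 gal/hr is 120 hours, about five days,
# which is why a week-long utility outage makes "do the generators have
# sufficient fuel bunkers?" exactly the right question.
```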

      • I drove through the affected areas today, there were swaths of I-95 that didn't have any cell phone service. I'd say that's pretty bad considering I still had service during the 2003 blackout. The cloud outage is the least of these folks worries, 100+ degree (f) weather forecasted the next few days with no A/C and water conservation measures in some areas is a concern right now
    • But then the question must be asked...

      [cue Psycho screeching violins]
      How are you posting this now?!

    • by Mashiki (184564)

      I'm guessing you're talking about population? I was up in Northern Ontario back during the last major ice storm we had. That hit the area, along with southern and mid-northern Quebec. There were places without power 4 months later. In the dead of winter, let me know how well you're going to survive when it's -38C outside will ya? 7 days is bad, no doubt and I know what you're going through, but try 3 months with no power.

      Damn was it fucking cold. We ended up living with 4 other families in the asshole

  • Infrastructure (Score:5, Insightful)

    by TubeSteak (669689) on Saturday June 30, 2012 @01:47PM (#40505807) Journal

    We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
    If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.

    • Re:Infrastructure (Score:4, Insightful)

      by rubycodez (864176) on Saturday June 30, 2012 @02:05PM (#40505935)

      war is the basic building block of our particular civilization. if we waste money on your frivolities, how will we afford war & keep war machine shareholder value?

      • by Anonymous Coward

        What are you, 14? Democracies don't like War, because they don't like their sons, fathers, brothers, and husbands getting killed. It generally takes quite a lot to motivate Democracies into war, because of the hatred of casualties. Even when it is the best option. Example: going to war against Hitler in 1934, or 1936, or in 1938.

        Out here in the real world, the sum total of human experience suggests a strong military is like insurance or a seat belt. You hope you never have to use it, but its a godsend if yo

    • I've had the idea in mind for years that we need to bury every single electrical line in this country to create jobs for stimulus and to ensure service during severe climate disruptions. In my area (South-West Ohio) we had winds, little rain for ehh....10 minutes. Clobbered the hell out of us, and then it was gone. I was down for over twelve hours and many will not get power until Monday at midnight. While burying these lines we can run additional fiber owned by the people and covering the last mile. I

  • by Anonymous Brave Guy (457657) on Saturday June 30, 2012 @01:48PM (#40505809)

    It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

    You can only argue that the extra costs and admin involved with cloud hosting outweigh the extra costs of self-hosting and paying competent IT staff for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud=reliable, and Google's/Amazon's/whoever's people are smarter than your in-house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?
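The cost argument above reduces to simple arithmetic. Every number in this sketch is hypothetical; the point is the shape of the comparison, not the values:

```python
# Toy break-even arithmetic for the cloud-vs-in-house argument.
# All figures are made up for illustration.

def annual_cost_cloud(instances, hourly_rate, hours=24 * 365):
    return instances * hourly_rate * hours

def annual_cost_inhouse(hardware, amortize_years, staff, power_cooling):
    return hardware / amortize_years + staff + power_cooling

cloud = annual_cost_cloud(instances=40, hourly_rate=1.00)
inhouse = annual_cost_inhouse(hardware=300_000, amortize_years=3,
                              staff=120_000, power_cooling=30_000)
# With these made-up numbers, cloud runs ~$350k/yr vs ~$250k/yr in-house.
# Flip the staff line or the instance count and the conclusion flips too,
# which is why the debate keeps happening.
```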

    • It seems that recently, anything can take down the cloud,

      It wasn't just anything that took down the cloud: it was another cloud.

    • by tnk1 (899206) on Saturday June 30, 2012 @01:59PM (#40505889)

      And this is ridiculous. How are they not in a datacenter with backup diesel generators and redundant internet egress points? Even the smallest service business I have worked for had this. All they need to do is buy space in a place like Qwest or even better, Equinix and it's all covered. A company like Amazon shouldn't be taken out by power issues of all things. They are either cheaping out or their systems/datacenter leads need to be replaced.

      • How are they not in a datacenter with backup diesel generators and redundant internet egress points?

        Something about maximizing profits... by cutting corners... perhaps.

      • by Joe_Dragon (2206452) on Saturday June 30, 2012 @02:11PM (#40505973)

        it seems like the switching system failed and/or the backup power generators did not kick on.

        Maybe natural gas ones are better. The firehouses have them. I've also seen them at a big power substation.

        • by tnk1 (899206)

          While failure of the backup systems is a possibility (just look at Fukushima), the backup systems are usually fairly redundant and tested as well. I know most datacenters I have been in test their generators periodically, something like every month or two. Unless there's a fairly large natural disaster, or someone sets off a very large bomb, backup power should be available for at least 24-48 hours. At that point, things could start breaking down because you have to start getting fuel shipped in, but aft

          • Hehe - normally I'd agree, but Pepco did all right last night as far as am concerned. I flickered for about 2 seconds last night, but I'm in the downtown Capitol Wastelands - I don't know what grid I'm on but it seems to be a good one. Oh - and yay for personal UPSes! They did what they should have done.
        • by Relayman (1068986)
          Natural gas fails when there is an earthquake. Depending where your data center is located, diesel may be a better choice.
          • by drinkypoo (153816)

            Natural gas fails when there is an earthquake.

            Natural gas generators (or even fuel cells) are commonly used within city limits for a broad number of reasons. First and foremost, you're not permitted to store quantities of flammables in most cities. Another is that the emissions are relatively benign.

            OUTSIDE of a city, you can use a propane generator, which can be a converted gasoline generator if you prefer. You can even convert one to be dual-mode so it will run on either gasoline or propane, but that's quite a bit more work. Common dual-mode generato

            • whoops, I forgot to say OUTSIDE of a city you can use a propane generator FROM A PROPANE TANK. Which, of course, means it can still function after a 'quake. And if you live in someplace where it's legal to have a tank AND where you can get city gas, you can get the best of both worlds.

      They expect the customers to pay for the redundancy by using multiple servers in different geographical locations. People buying one server, or a bunch in only one datacentre, are already taking a risk. I'm assuming someone at Amazon said let's build a few datacentres and skimp on the redundancy at each one. The redundancy is at the multi-datacentre level, not at the multi-UPS, multi-connection level at each datacentre.

    • by hawguy (1600213) on Saturday June 30, 2012 @02:02PM (#40505907)

      It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

      I think it's more because a cloud outage affects thousands of customers, so it has more visibility. When Amazon has problems, the news is reported on Slashdot. When a smaller colocation center has an accidental fire-suppression discharge taking hundreds of customers offline, it doesn't get any press coverage at all.

      But the biggest takeaway from this is: never put all of your assets in one region. No matter how much redundancy Amazon builds into a region, a local disaster can still take out the datacenter. That's why they have availability zones *and* regions. I have some servers in us-east-1a and they weren't affected at all. If they were down, I could bring up my servers in us-west within about an hour. (I could even automate it, but a few hours or even a day of downtime for these servers is no big deal.)
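The "I could even automate it" part can be little more than a health check driving a region switch. A simulated sketch (the region names are real AWS regions, but the health probe and the "switch" here are stand-ins; in practice the probe might be an HTTP health endpoint and the switch a DNS update):

```python
# Sketch of automated region failover driven by health checks.

REGIONS = ["us-east-1", "us-west-2"]   # preference order: primary first

def failover_target(health, regions=REGIONS):
    """Return the first healthy region in preference order, else None.

    `health` maps region -> bool, standing in for a real probe."""
    for region in regions:
        if health.get(region, False):
            return region
    return None

def reconcile(current, health, regions=REGIONS):
    """Decide where traffic should point right now."""
    target = failover_target(health, regions)
    if target is not None and target != current:
        return target        # fail over, or fail back to the primary
    return current           # already correct, or nothing healthy to move to
```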

      • Almost spot on. In fact, don't even put all of your assets into the same cloud, because the day IS going to come when an infrastructure issue takes out even the largest of providers.

    • While the nimbostratus salesweasels are(obviously, these are salesweasels) lying, an incident where a datacenter gets taken down good and hard by weather won't do the in-house guys much good either... 'Cloud' or not, a datacenter(and probably a fair few smaller ones, and a veritable legion of various converted-broom-closet small business setups) was taken down by weather.

      It certainly has become increasingly hard to hide that most of the 'cloud' providers do, er, rather less magic-distributed-reliability
    • by andy1307 (656570)

      I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

      This sort of thing never ever happens when you host everything in-house?

      • I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.

        This sort of thing never ever happens when you host everything in-house?

        Obviously they do. But at least you have some control over the recovery, rather than sitting around watching for carefully-worded email and Twitter updates from Amazon about when you just might get access to the shit you are paying for again. That makes communicating real information to your customers a bit easier.

        Of course, you can always use the excuse that it's not your fault and blame Amazon ("see...look at all the other people who are down"). But that's largely a marketing decision I suppose.

        • by codepunk (167897)

          Hmm, no, you don't usually have much control over the recovery either. I was involved in an outage once because some guys trenching cable cut clean through our fiber bundle. There is no controlling anything that happens after that; you are just down until the fiber is repaired.

          In a cloud environment, given that you have a DR plan, you press a button and you are back online.

            Hmm, no, you don't usually have much control over the recovery either. I was involved in an outage once because some guys trenching cable cut clean through our fiber bundle. There is no controlling anything that happens after that; you are just down until the fiber is repaired.

            Diverse utility paths are pretty much required for any datacenter. And even that may not be enough, which I will respond to in the next point.

            In a cloud environment, given that you have a DR plan, you press a button and you are back online.

            Two things: that whole concept is not a "cloud environment" thing, it's the way things have been done for a long time. Also, if you have to "press a button" (or perform any action) you are doing it pretty much wrong and have nothing to be smug about. None of this is magic, not unique to "cloud computing". Stop letting your brain fall out of your ear when you hear

  • by ebunga (95613) on Saturday June 30, 2012 @01:49PM (#40505821) Homepage

    Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.

    • by rubycodez (864176)

      The laugh is that those 1960s systems had, for additional money, configurations for 24x7 uptime. Here we supposedly design for that with the cloud architecture, and fail. I would not be surprised at all if the modern mainframe were a cost-effective alternative to this bloated, expensive cloud.

    • Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.

      Cool. Can we get those old Teletype terminals back? The clattering ones that left little round bits of paper all over the place?

      And 8-track tapes while we're at it.

    • Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.

      Cloud computing is a little more than 1960s timesharing services. Some minuscule differences, such as being accessible from anywhere in the world, providing enormously more power and exponentially more capacity, and being priced by the penny; but those are tiny differences that matter. Not to mention that, as other commenters have mentioned, the Amazon Cloud does provide more redundancy; the people using it just didn't want to pay for it.

      The parent is the single stupidest comment possible for this thread and it's m

  • by Anonymous Coward on Saturday June 30, 2012 @01:53PM (#40505845)

    So this is the second time this month Amazon's cloud has gone down; serious questions should be asked about the sustainability of this service, given its extremely poor uptime record and extremely large customer base.

    They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

    You would think having redundant power would be a fundamental, crucial thing to get right when owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

    Now before people say "well this was a major storm system that killed 10 people, what do you expect", my response is that cloud computing is expected to do work for customers hundreds and thousands of kilometres/miles from the actual data centre so this is a somewhat crucial thing that we're talking about - millions of people literally depend on these services; that's my first point.

    My second point is that it's not like anything happened to the data centre; it simply lost mains power. It's not like there was a fire, or flood, or the roof blew off the building, or anything like that; they simply lost power and failed to bring all their millions of dollars in equipment up to the task of picking up the load.

    If I were a corporate customer, or even a regular consumer, I would seriously question the sustainability of at least Amazon's cloud computing. Google and Facebook seem able to handle it, but not Amazon; granted, they don't offer identical products, but their data centres overall seem to stay up 100 or 99.9999999% of the time, unlike Amazon's.

    • A datacenter is a datacenter is a datacenter. You are not in "the cloud" if you can't escape from a datacenter-level incident.

      Given that there is no "cloud" provider (not yet, at least) that will automagically protect your services from a datacenter-level incident, it's up to you, the customer, to do it.

      It's certainly possible with current technology, but it's neither cheap nor straightforward, no matter what the "cloud" providers insist on selling and the PHBs insist on believing.

    • by hawguy (1600213) on Saturday June 30, 2012 @02:10PM (#40505967)

      So this is the second time this month Amazon's cloud has gone down; serious questions should be asked about the sustainability of this service, given its extremely poor uptime record and extremely large customer base.

      They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

      You would think having redundant power would be a fundamental, crucial thing to get right when owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

      Well, the entire system didn't fail, my servers in us-east-1a weren't affected at all.

      Hardware fails, even well-tested hardware... especially in extreme conditions. Don't forget that this storm has left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors, and fuel delivery contracts, but when a storm like this hits, those vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can itself cause a failure), Amazon isolates datacenters into availability zones and keeps datacenters geographically dispersed.

      Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they choose not to, they shouldn't blame Amazon.

      • by ahodgson (74077)

        ELB issues last night did cause problems for services with zone redundancy. We had services with zone redundancy that were experiencing issues because the ELB addresses being served were not functional, even though they had working instances connected to them.

        Amazon has also had at least one other outage in the last 18 months that affected more than one availability zone.

        Region redundancy would be good. But it's quite a bit more complex and costly, what with security groups and ELBs not crossing regions and h

    • by dbrueck (1872018) on Saturday June 30, 2012 @03:09PM (#40506387)

      Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.

      Having managed our own infrastructure in the past, it's these sort of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.

      Firing up a VM on EC2 (or any other provider) != architecting for the cloud.
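One concrete piece of "architecting for the cloud" is spreading instances so that no single zone holds too much of your capacity. A placement sketch (zone names and the round-robin policy are arbitrary examples, not AWS best-practice verbatim):

```python
# Spread instances across availability zones so a single-zone failure
# takes out at most its share of capacity.

from itertools import cycle

def spread(instances, zones):
    """Round-robin instances across zones; returns zone -> [instances]."""
    placement = {zone: [] for zone in zones}
    for instance, zone in zip(instances, cycle(zones)):
        placement[zone].append(instance)
    return placement

def survives_zone_loss(placement, min_alive):
    """True if losing any one zone still leaves at least min_alive instances."""
    total = sum(len(group) for group in placement.values())
    return all(total - len(group) >= min_alive
               for group in placement.values())
```

The second function is the test the night of the storm actually ran for everyone: would losing one zone leave enough capacity to serve?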

  • by bugs2squash (1132591) on Saturday June 30, 2012 @01:59PM (#40505891)
    However "Netflix, which uses an architecture designed to route around problems at a single availability zone." seems to have efficiently spread the pain of a North Eastern outage to the rest of the country. Sometimes I think redundancy in solutions is better left turned off.
  • Instagram's servers in that cloud region were also affected, and more people griped about that on my Facebook feed than about Netflix.

    as for "an electrical storm", that's a bit of an understatement. The issue was actually more the 80 mph wind gusts, as well as the lightning continuing for 2 hours after the wind and rain had passed (meaning crews couldn't get out there overnight).

    The result is some 2 million people without power, 1 million around DC alone. Dominion Power (which serves the area where the data center resides, about 5 miles from my house) lost power for more than half of its Northern Virginia customers, and even now has only restored power to about 60,000* out of 461,000 that lost it. On the Maryland/DC side of the Potomac, half a million people may be without power for days through a heat wave of 100 degrees each day (and more storms like last night's coming...).

    * fortunately that would include me... though I'm writing this via my Sprint phone as a wifi hotspot 'cause our cable modem is still down ;-)

    • by Onuma (947856)
      You lucked out, then. I've driven around Fairfax, Arlington and PG counties as well as DC today. I haven't seen a major road without some kind of debris blocking it, nor an area which has 100% power restored at this point.

      This was a bad storm, but could certainly have been far worse. Even still, the grocers and stores are out of ice and people are swarming out of their homes like rats abandoning ship in some areas. These same people would be fucked if the S really HTF.
  • by gman003 (1693318) on Saturday June 30, 2012 @02:16PM (#40506001)

    I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).

    According to TFA, they were only without power for half an hour, and the ongoing problems were related to recovery, not actual power loss. So their problems are more "bad disaster planning" than "bad disaster".

    Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.

    • by CptNerd (455084)

      I was in it, and barely missed getting hit by multiple tree branches of the 6+inch diameter variety as I drove the final half-mile home. I lost power long enough to make my UPS whine, but that was before I got there. Every street around me had branches down, some completely blocking main streets. I live in Alexandria near Potomac Yard, got hit by the weather driving through Shirlington.

      I had managed to not be out in a storm of this size before, usually I stay in or get back before one hits (I'm a

  • Which is the problem. Not the power outage itself.
    If the power outage happened and the servers were back, let's say, in 30 minutes or an hour... alright, but 9 freakin' hours?

    In my specific case I didn't suffer as much because I have another instance in a different zone, with db replication and all that, serving as a backup server, and my project there, although very critical (20 people are getting wages out of it), is very low on resource usage... I can imagine there were quite a lot of people that lost quite

    • by PTBarnum (233319) on Saturday June 30, 2012 @04:04PM (#40506701)

      There is a gap between technical and marketing requirements here.

      The Amazon infrastructure was initially built to support Amazon retail, and Amazon put a lot of pressure on its engineers to make sure their apps were properly redundant across three or more data centers. At one point, the Amazon infrastructure team used to do "game days" where they would randomly take a data center offline and see what broke. The EC2 infrastructure is mostly independent of retail infrastructure, but it was designed in a similar fashion.

      However, Amazon can't tell their customers how to build apps. The customers build what is familiar to them, and make assumptions about up time of individual servers or data centers. As the OP says, it's "the standard people are used to". Since the customer is always right, Amazon has a marketing need to respond by bringing availability up to those standards, even though it isn't technically necessary.
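The "game day" drill described above is simple to model: replicas in several datacenters, knock one out at random, check the service still answers. A toy harness (not Amazon's tooling, just the shape of the exercise; datacenter names are made up):

```python
# Toy "game day" harness: randomly fail one datacenter and see what breaks.

import random

class Service:
    """A service with one replica per datacenter."""
    def __init__(self, datacenters):
        self.up = {dc: True for dc in datacenters}

    def handle_request(self):
        # Any surviving replica can serve; None means a total outage.
        for dc, alive in self.up.items():
            if alive:
                return dc
        return None

def game_day(service, rng=random):
    """Knock out one random datacenter, report whether the service survived."""
    victim = rng.choice(list(service.up))
    service.up[victim] = False
    return victim, service.handle_request() is not None
```

Run regularly, this is exactly the pressure the post describes Amazon putting on its retail engineers: apps that only survive when every datacenter is up fail the drill immediately.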

  • My company uses Amazon Web Services to host some of our product, and I got a call at 7 am to help bring our stuff back up. A bunch of our instances were stopped, and a bunch of Elastic Block Store volumes were marked Impaired. We're working on making our environment more "cloudy" to make better use of multiple availability zones, regions, and automation to better survive an outage like this, but we're not there yet.
  • by gelfling (6534) on Saturday June 30, 2012 @05:28PM (#40507111) Homepage Journal

    Didn't you get the memo? Netflix barely runs now and this is working as planned. Time Warner had four internet outages in Raleigh THIS WEEK.

    Everything everywhere is slowly grinding to a halt. So let's send more work to China and India. Who cares anymore.

  • We don't have downtime. We have "uptime problems."
  • To migrate Click Here!

    At least for those that have a DR migration plan.
