Cloud Businesses Power

More Uptime Problems For Amazon Cloud

Posted by Soulskill
from the stormy-weather dept.
1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Saturday June 30, 2012 @12:44PM (#40505797)

    I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.

  • by hawguy (1600213) on Saturday June 30, 2012 @01:10PM (#40505967)

    So this is the second time this month Amazon's cloud has gone down. There should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.

    They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.

    You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.

    Well, the entire system didn't fail, my servers in us-east-1a weren't affected at all.

    Hardware fails, even well-tested hardware... especially in extreme conditions. Don't forget that this storm has left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors and fuel delivery contracts, but when a storm like this hits, those vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can cause a failure itself), Amazon isolates datacenters into availability zones, and has geographically dispersed datacenters.

    Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they choose not to, they shouldn't blame Amazon.

  • by Joe_Dragon (2206452) on Saturday June 30, 2012 @01:11PM (#40505973)

    It seems like the switching system failed and/or the backup power generators did not kick on.

    Maybe natural gas ones are better. The firehouses have them. I also see them at a big power substation.

  • by gman003 (1693318) on Saturday June 30, 2012 @01:16PM (#40506001)

    I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).

    According to TFA, they were only without power for half an hour, and the ongoing problems were related to recovery, not the actual power loss. So their problems are more "bad disaster planning" than "bad disaster".

    Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.

  • The automatic transfer switch(es) would be the first component I would check even without knowing anything. In order to maintain the UL listing on the transfer switch, it must be tested monthly. The idea is that if it is tested monthly, everything is operated and is less likely to seize and fail than if the device is not tested. Modern systems can be designed so that the generators start BEFORE the transfer switch operates when in test mode, to reduce the impact of the test (milliseconds without power versus 30 seconds or so).

  • by dbrueck (1872018) on Saturday June 30, 2012 @02:09PM (#40506387)

    Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.

    Having managed our own infrastructure in the past, it's these sorts of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.

    Firing up a VM on EC2 (or any other provider) != architecting for the cloud.
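
The zone failover described in the comment above can be pictured with a small sketch. It is an illustration only, not AWS's mechanism or the commenter's actual setup: the zone list, the URLs, and the `pick_endpoint` helper are all hypothetical, and a real deployment would more likely rely on a load balancer or DNS health checks than on application code.

```python
from typing import Callable, Iterable, Tuple


def pick_endpoint(
    endpoints: Iterable[Tuple[str, str]],
    is_healthy: Callable[[str], bool],
) -> str:
    """Return the URL of the first zone that passes the health check.

    endpoints:  (zone_name, url) pairs in order of preference (hypothetical).
    is_healthy: caller-supplied check; here it just takes a zone name.
    """
    for zone, url in endpoints:
        if is_healthy(zone):
            return url
    raise RuntimeError("no healthy zone available")


# Hypothetical deployment spread over two zones and a second region.
ENDPOINTS = [
    ("us-east-1a", "https://east-a.example.internal"),
    ("us-east-1b", "https://east-b.example.internal"),
    ("us-west-1a", "https://west-a.example.internal"),
]

# Pretend one zone has lost power; traffic moves to the next healthy one.
down = {"us-east-1a"}
chosen = pick_endpoint(ENDPOINTS, lambda zone: zone not in down)
```

The point of the sketch is only that a single-zone outage becomes a routing decision rather than a service interruption, provided the service was deployed to more than one zone in the first place.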

  • Well, as of current reports: 2.5 million are without power in Virginia [foxnews.com], 800 thousand in Maryland [chicagotribune.com], 400+ thousand in DC [wtop.com]. I've seen numbers in the 3.5 million region between Ohio and New Jersey. We got power back early this morning ~0400, but we STILL don't have phone, net, or cable at home. The real question, since some areas in DC Metro are not supposed to get power back for nearly a week, is: do the emergency fuel generators have sufficient fuel bunkers?
  • by Anonymous Coward on Saturday June 30, 2012 @04:26PM (#40507099)

    Here's what's going on - Amazon's us-east-1 datacenter has been having some issues with its Relational Database Service (RDS), which is the database system holding all of the chumby data.

    What appears to be happening is frequent premature disconnects between the EC2 instances running the web servers and the main database. MySQL has a trigger in it that when too many premature disconnects occur without a successful connection, it assumes it's being hacked and blocks incoming connections from that server until a command is explicitly given to it to clear the error and resume accepting connections.

    During all of the time the system appeared to be down, it really wasn't - the database was actually running and completely operational from a parallel web server hosted under "insignia.chumby.com", which we use to provide a branded experience for Infocast and Insignia TV users. It had just blocked the systems that are used most frequently. All of the web servers, the forum, wiki, content servers were all up and running.

    To compound the problem there was a storm on Friday night that greatly impaired RDS at that datacenter, and as it came back up, it ended up producing the same kind of disconnect errors, and the same trigger happened.

    As of this writing, that issue is still ongoing and the RDS service in us-east-1 is still impaired. Note that several other companies - Pinterest, Heroku, Instagram and others are being similarly impaired.
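
The blocking behavior described in the comment above can be modeled in a few lines. This is a toy simulation, not MySQL's code: the `HostBlocker` class and its method names are hypothetical. In MySQL itself the threshold is the `max_connect_errors` system variable and the block is cleared with `FLUSH HOSTS`; whether that is exactly what was tripped in this incident is an assumption based on the commenter's description.

```python
class HostBlocker:
    """Toy model of per-host blocking after repeated failed connections."""

    def __init__(self, max_connect_errors: int = 10) -> None:
        self.max_connect_errors = max_connect_errors
        self.errors = {}      # host -> consecutive interrupted connections
        self.blocked = set()  # hosts currently refused

    def connect(self, host: str, handshake_ok: bool) -> str:
        if host in self.blocked:
            return "blocked"
        if handshake_ok:
            # A successful connection resets the host's error count.
            self.errors[host] = 0
            return "ok"
        self.errors[host] = self.errors.get(host, 0) + 1
        if self.errors[host] >= self.max_connect_errors:
            self.blocked.add(host)
        return "error"

    def flush_hosts(self) -> None:
        """Explicit operator action that clears all counts and blocks."""
        self.errors.clear()
        self.blocked.clear()


db = HostBlocker(max_connect_errors=3)
results = [db.connect("web1", handshake_ok=False) for _ in range(3)]
after_block = db.connect("web1", handshake_ok=True)   # refused even though healthy
other_host = db.connect("insignia", handshake_ok=True)  # unaffected host still works
db.flush_hosts()
after_flush = db.connect("web1", handshake_ok=True)
```

This matches the symptom in the comment: the database stays up and serves a host that never hit the limit, while the busiest web servers remain locked out until someone clears the block by hand.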

  • Re:Infrastructure (Score:4, Informative)

    by Sir_Sri (199544) on Saturday June 30, 2012 @04:33PM (#40507135)

    In the case of Panama it's control of the Panama Canal Zone, which by itself isn't a natural economic resource, but it saves a crap load of them in reduced shipping costs.

    Though true, wars are generally fought for gold, glory, and god, as one of my past history teachers used to say. I think what she meant is that wars are *started* for gold, glory, or god. Afghanistan was very much god and glory (for Al Qaeda and the Taliban at least), and it was for them in part about natural resources and control, benefit, and possession of the Islamic caliphate's (yes, that doesn't actually exist, but that's the kind of level they were thinking at) resources.

    The invasion of Grenada is more tricky. By itself Grenada isn't anything, but a major military airfield in Grenada could cover all of the oil export ports from Venezuela, and there was the matter of US prestige on the issue.
