More Uptime Problems For Amazon Cloud
1sockchuck writes "An Amazon Web Services data center in northern Virginia lost power Friday night during an electrical storm, causing downtime for numerous customers — including Netflix, which uses an architecture designed to route around problems at a single availability zone. The same data center suffered a power outage two weeks ago and had connectivity problems earlier on Friday."
Cloud takes down cloud (Score:5, Funny)
Nuf said
Re: (Score:2, Informative)
Here's what's going on - Amazon's us-east-1 datacenter has been having some issues with its Relational Database Services (RDS), which is the database system holding all of the chumby data.
What appears to be happening is frequent premature disconnects between the EC2 instances running the web servers and the main database. MySQL has a safeguard whereby, when too many premature disconnects occur without a successful connection, it assumes it's being hacked and blocks incoming connections from that server until the block is cleared.
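For anyone chasing the same symptom: this is MySQL's max_connect_errors protection kicking in. A rough sketch of checking and clearing the block (assuming the mysql-connector-python package and an account with the RELOAD privilege - the host and credentials here are placeholders):

```python
# Sketch: inspect and clear MySQL's "host is blocked" state caused by too many
# aborted connects from one host (max_connect_errors). Host/credentials are made up.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",  # hypothetical database endpoint
    user="admin",
    password="secret",
)
cur = conn.cursor()

# How many aborted connects does MySQL tolerate before blocking a host?
cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connect_errors'")
print(cur.fetchone())

# Reset the per-host error counters so the blocked web servers can reconnect.
# (Requires the RELOAD privilege.)
cur.execute("FLUSH HOSTS")

cur.close()
conn.close()
```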
Re: (Score:2)
Re: (Score:3)
And Linux shouldn't ever be used for mission critical applications.
Posted using the Linux kernel version 2.2.13
Largest non-hurricane related power outage ever (Score:5, Informative)
I live in the affected area and that's what they're saying. May take 7 days for the last person to have their power restored.
Re:Largest non-hurricane related power outage ever (Score:5, Interesting)
That really shouldn't matter, though, as long as the data center's generators are running and they can get fuel. It seems they are not performing proper testing and maintenance on their switchgear and generators if they are having this much trouble. The last time the data center in the building where I work went down for a power outage was when we had an arc flash in one of the UPS battery cabinets and they had to shut down the data center (and the rest of the building's power, for that matter).
Re:Largest non-hurricane related power outage ever (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
Re:Largest non-hurricane related power outage ever (Score:5, Informative)
The automatic transfer switch(es) would be the first component I would check, even without knowing anything else. In order to maintain the UL listing on the transfer switch, it must be tested monthly. The idea is that if it is tested monthly, everything gets exercised and is less likely to seize and fail than if the device is never operated. Modern systems can be designed so that the generators start BEFORE the transfer switch operates when in test mode, to reduce the impact of the test (milliseconds without power versus 30 seconds or so).
Re: (Score:3)
I don't know if the whole state or just the city is without power, but it is quite possible the ISPs in the area are borked too. After all, why bother with too much redundancy? If your customers don't have power for their computers, they aren't using the internet anyway. Then Amazon plops down a 200M datacentre in town and ... shit happens.
with cable the nodes need power and their batteries (Score:2)
With cable, the nodes need power, and their batteries will run down; then the cable co needs to have on-site portable generators at the nodes without power.
The phone systems have RTs (fewer of them than cable systems) that work the same way.
Re: (Score:2)
Why exactly would a cable operator bother with backup power? I mean, if the neighborhood has no power, then people aren't running TVs or computers (laptops maybe, but their modem would still be down). It is probably a different beast with something the size of an Amazon datacentre, though; they can probably go to the ISP and say, "hey look, we'll buy 5M a month of internet from you, but we need redundancy. Piss on all your home users for all we care, but we get internet no matter what."
Re: (Score:2)
Well, there are long runs from the headend to each neighborhood, so some areas may have power but lose their cable hours later as the lines pass through areas that don't have power.
Re: (Score:3)
Because that cable operator also provides phone service.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Chromatic aberration? Are you sure you don't mean numerical aperture?
Re:Largest non-hurricane related power outage ever (Score:4, Interesting)
The problem is that a lot of people cheap out on their backup power. Generators and UPSes are expensive.
I wonder, comparing the price/performance numbers on the invoices from Dell and the invoices from APC (hint: one of these has Moore's law at its back, the other... doesn't), what it would take in terms of hardware pricing and software system reliability design to make these backup power systems economically obsolete for most of the 'bulk' data-shoveling and HTTP cruft that keep the tubes humming...
Obviously, if your software doesn't allow any sort of elegant failover, or you paid a small fortune per core, redundant PSUs, UPSes, generators, and all the rest make perfect sense. If, however, your software can tolerate a hardware failure and the price of silicon and storage is plummeting and the price of electrical gear that is going to spend most of its life generating heat and maintenance bills isn't, it becomes interesting to consider the point at which the 'Eh, fuck it. Move the load to somewhere where the lights are still on until the utility guys figure it out.' theory of backup power becomes viable.
Re: (Score:3)
Re: (Score:2)
"So the problem, to me, is that data center redundancy is often an afterthought, and IaaS hardly has easy answers to this problem yet."
It won't, for a very basic physical reason: it's always cheaper to move data a short distance than a long one. If you have a given piece of data in one place, you will either lose it if that place goes nuts, or you will need to spend heavily to make sure that piece of data is replicated out of that place fast enough.
IaaS can help commoditize compute and storage resources, but it has nothing to offer for that particular problem.
Re: (Score:2)
it becomes interesting to consider the point at which the 'Eh, fuck it. Move the load to somewhere where the lights are still on until the utility guys figure it out.' theory of backup power becomes viable.
The answer mostly depends on the cost of downtime for you.
The real problem is getting your (customer) data to the same place as your failover solution.
Some websites generate enormous amounts of data and it's not trivial or cheap for them to constantly keep it backed up at another data center.
A station wagon full of hard drives is still faster than any link 99% of us could afford.
Re:Largest non-hurricane related power outage ever (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
I asked a maintenance person at work how long we could go in the event of a power outage. I got a blank look like they couldn't fathom the question, and then told we'd go forever. My workplace has 6 generators with 400? gallons of diesel for each one. One generator will handle the current load. It's all tested monthly. (and the odd times city power gets cut)
Re: (Score:3)
Pepco still has 400,000 people without power (Score:3)
http://www.pepco.com/home/emergency/maps/stormcenter/ [pepco.com]
Re: (Score:3)
But then the question must be asked...
[cue Psycho screeching violins]
How are you posting this now?!
Re: (Score:2)
I'm guessing you're talking about population? I was up in Northern Ontario back during the last major ice storm we had, which hit the area along with southern and mid-northern Quebec. There were places without power 4 months later. In the dead of winter, let me know how well you're going to survive when it's -38C outside, will ya? 7 days is bad, no doubt, and I know what you're going through, but try 3 months with no power.
Damn was it fucking cold. We ended up living with 4 other families in the asshole
Infrastructure (Score:5, Insightful)
We need to invest trillions in roads, water, and electrical infrastructure to keep this country going.
If you let the basic building blocks of civilization rot, don't be surprised when everything else follows suit.
Re:Infrastructure (Score:4, Insightful)
war is the basic building block of our particular civilization. if we waste money on your frivolities, how will we afford war & keep war machine shareholder value?
Stupid: Military is Insurance (Score:3, Insightful)
What are you, 14? Democracies don't like War, because they don't like their sons, fathers, brothers, and husbands getting killed. It generally takes quite a lot to motivate Democracies into war, because of the hatred of casualties. Even when it is the best option. Example: going to war against Hitler in 1934, or 1936, or in 1938.
Out here in the real world, the sum total of human experience suggests a strong military is like insurance or a seat belt. You hope you never have to use it, but it's a godsend if you ever need it.
Re: (Score:2, Insightful)
I would say Laos would argue otherwise... The most bombed country in the world because America felt like it and had a lot of extra stock! Oh and they were officially a neutral country.
GO USA!
Re: (Score:2)
Re: (Score:3)
They engage in war to gain control of the natural resources the other country has.
The distinction is subtle, but significant.
Tell us again what natural resources the US wished to control when it engaged in war against Grenada [wikipedia.org] in 1983, or when it engaged in war against Panama [wikipedia.org] in 1989, or when it engaged in war against Afghanistan [wikipedia.org] starting in 2001.
There are many reasons for one state to go to war against another. Gaining control of natural resources is only one (e.g. Iraq's invasion of Kuwait [wikipedia.org]), and is not the commonest.
Re:Infrastructure (Score:4, Informative)
In the case of Panama, it's control of the Panama Canal Zone, which by itself isn't a natural economic resource, but it saves a crapload of them in reduced shipping costs.
True, though wars are generally fought for gold, glory, and god, as one of my past history teachers used to say. I think what she meant is that wars are *started* for gold, glory, or god. Afghanistan was very much god and glory (for Al Qaeda and the Taliban at least), and for them it was in part about natural resources and the control, benefit, and possession of the Islamic caliphate's resources (yes, the caliphate doesn't actually exist, but that's the level they were thinking at).
The invasion of Grenada is more tricky. By itself Grenada isn't anything, but a major military airfield in Grenada could cover all of the oil export ports from Venezuela, and there was the matter of US prestige on the issue.
Re: (Score:2)
No, I'm not an American. That was at the University of Guelph, though it wouldn't surprise me if the instructor herself was American. Unfortunately, I can't remember her name or what she looked like well enough to know, 10 years on, if she's one of the people listed on the history department faculty page.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Insightful)
Dude, if you think a datacenter in Northern Virginia was plopped down here because of the insanely attractive price of real estate or energy, or because of the business-friendly tax rates you're out of your freaking mind. Datacenters are built here because of pre-existing backbone access. Period.
Re:Infrastructure (Score:5, Interesting)
In my past two jobs and over the past 20 years, we've worked with dozens of independent and unrelated vendors with locations around the country, including Virginia. Of all the locations where these companies have operations, the ones in Virginia have been dramatically, almost comically, more disaster-prone than the rest of the country and even the rest of the world. The running joke in the office is that whenever any vendor or service provider drops offline, we first check the weather in Virginia before checking to see if any of our own systems are offline. Every time, we see a post-mortem a few days later disclosing some failed system or backup or contingency, and every time, they say it's a problem that will never happen again.
You'd think that all the failing locations would share an operations center or service provider or at least a single city, but it turns out that the only thing these disaster-prone operations have in common is that they're in Virginia. I have no idea why this is the case. But our company has a policy singling out Virginia, saying that no mission-critical components are allowed to be based there.
Re: (Score:2)
"Being in a rural area does not make you statistically more likely to be hit by a tornado.. Tornadoes don't have any sort of inborn preference. Tornado danger is a function of geography, not population density."
You can't be that dense, can you? Don't you think that being in a tornado area might have something to do with people avoiding such a place - especially given that, because of the geography required, tornado areas tend to be in the middle of nowhere?
"The only drawback of being in the sticks is it is harder to access mul
Re: (Score:2)
It's not really economical to bury those.
Fixing something like this [nola.com] is apparently not easy and takes time.
Seems like anything takes down the cloud... (Score:5, Interesting)
It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
The cost argument - that self-hosting and paying competent IT staff is more expensive than cloud hosting - only holds up for so long. If you read the various forums after an event like this, the mantra from cloud evangelists already seems to have changed from a general "cloud = reliable, and Google's/Amazon's/whoever's people are smarter than your in-house people" to a much more weasel-worded "cloud is reliable as long as you've figured out exactly how to set it all up with proper redundancy etc." If you're going to pay people smart enough to figure that out, and you're not one of the few businesses whose model really does benefit disproportionately from the scalability at a certain stage in its development, why not save a fortune and host everything in-house?
Re: (Score:2)
It seems that recently, anything can take down the cloud,
It wasn't just anything that took down the cloud: it was another cloud.
Re:Seems like anything takes down the cloud... (Score:4, Interesting)
And this is ridiculous. How are they not in a datacenter with backup diesel generators and redundant internet egress points? Even the smallest service business I have worked for had this. All they need to do is buy space in a place like Qwest or, even better, Equinix, and it's all covered. A company like Amazon shouldn't be taken out by power issues, of all things. They are either cheaping out or their systems/datacenter leads need to be replaced.
Re: (Score:2)
How are they not in a datacenter with backup diesel generators and redundant internet egress points?
Something about maximizing profits... by cutting corners... perhaps.
it seems like the switching system failed (Score:4, Informative)
It seems like the switching system failed and/or the backup power generators did not kick on.
Maybe natural gas ones are better. The firehouses have them. I also see them at a big power substation as well.
Re: (Score:2)
While failure of the backup systems is a possibility (just look at Fukushima), the backup systems are usually fairly redundant and tested as well. I know most datacenters I have been in test their generators periodically, something like every month or two. Unless there's a fairly large natural disaster, or someone sets off a very large bomb, backup power should be available for at least 24-48 hours. At that point, things could start breaking down because you have to start getting fuel shipped in, but aft
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Natural gas fails when there is an earthquake.
Natural gas generators (or even fuel cells) are commonly used within city limits for a broad number of reasons. First and foremost, you're not permitted to store quantities of flammables in most cities. Another is that the emissions are relatively benign.
OUTSIDE of a city, you can use a propane generator, which can be a converted gasoline generator if you prefer. You can even convert one to be dual-mode so it will run on either gasoline or propane, but that's quite a bit more work. Common dual-mode generato
uh forgot something important (Score:2)
whoops, I forgot to say OUTSIDE of a city you can use a propane generator FROM A PROPANE TANK. Which, of course, means it can still function after a 'quake. And if you live in someplace where it's legal to have a tank AND where you can get city gas, you can get the best of both worlds.
Re: (Score:3)
They expect the customers to pay for the redundancy by using multiple servers in different geographical locations. People buying one server, or a bunch only in one datacentre, are taking a risk already. I'm assuming someone at Amazon said let's build a few datacentres and skimp on the redundancy at each one. The redundancy is at the multi-datacentre level, not at the multi-UPS, multi-connection, etc. level at each datacentre.
Re:Seems like anything takes down the cloud... (Score:5, Insightful)
It seems that recently, anything can take down the cloud, or at least cause a serious disruption for any of the major cloud providers. I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
I think it's more because a cloud outage affects thousands of customers, so it has more visibility. When Amazon has problems, the news is reported on Slashdot. When a smaller colocation center has an accidental fire-suppression discharge taking hundreds of customers offline, it doesn't get any press coverage at all.
But the biggest takeaway from this is: never put all of your assets in one region. No matter how much redundancy Amazon builds into a region, a local disaster can still take out the datacenter. That's why they have availability zones *and* regions. I have some servers in us-east-1a and they weren't affected at all. If they were down, I could bring up my servers in us-west within about an hour. (I could even automate it, but a few hours or even a day of downtime for these servers is no big deal.)
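Automating that sort of cross-region failover doesn't take much. A rough sketch using the boto3 library, with made-up instance IDs and a deliberately crude health check - an illustration of the idea rather than a complete DR plan (repointing DNS, e.g. via Route 53, is left out):

```python
# Sketch: if the primary region looks dead, start pre-built standby instances
# in another region. Instance IDs are placeholders.
import boto3

PRIMARY_REGION = "us-east-1"
STANDBY_REGION = "us-west-2"
STANDBY_INSTANCE_IDS = ["i-0123456789abcdef0"]  # hypothetical standby servers

def primary_is_healthy():
    """Crude check: can we still talk to the EC2 API in the primary region?"""
    try:
        ec2 = boto3.client("ec2", region_name=PRIMARY_REGION)
        ec2.describe_instance_status(IncludeAllInstances=True)
        return True
    except Exception:
        return False

def fail_over():
    """Start the standby instances in the other region."""
    ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
    ec2.start_instances(InstanceIds=STANDBY_INSTANCE_IDS)

if __name__ == "__main__":
    if not primary_is_healthy():
        fail_over()
```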
Re: (Score:2)
Almost spot on - in fact, don't put all of your assets into the same cloud either, because the day IS going to come when an infrastructure issue takes out even the largest of providers.
Re: (Score:2)
It certainly has become increasingly hard to hide that most of the 'cloud' providers do, er, rather less magic-distributed-reliability than advertised.
Re: (Score:2)
I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
This sort of thing never ever happens when you host everything in-house?
Re: (Score:2)
I wonder how many more of these it takes before the cloud-skeptics start winning the debates with management a lot more often.
This sort of thing never ever happens when you host everything in-house?
Obviously they do. But at least you have some control over the recovery, rather than sitting around watching for carefully-worded email and Twitter updates from Amazon about when you just might get access to the shit you are paying for again. That makes communicating real information to your customers a bit easier.
Of course, you can always use the excuse that it's not your fault and blame Amazon ("see...look at all the other people who are down"). But that's largely a marketing decision I suppose.
Re: (Score:2)
Hmm, no, you don't usually have much control over the recovery either. I was involved in an outage once because some guys trenching cable cut clean through our fiber bundle. There is no controlling anything that happens after that; you are just down until the fiber is repaired.
In a cloud environment, given that you have a DR plan, you press a button and you are back online.
Re: (Score:2)
Hmm, no, you don't usually have much control over the recovery either. I was involved in an outage once because some guys trenching cable cut clean through our fiber bundle. There is no controlling anything that happens after that; you are just down until the fiber is repaired.
Diverse utility paths are pretty much required for any datacenter. And even that may not be enough, which I will respond to in the next point.
In a cloud environment, given that you have a DR plan, you press a button and you are back online.
Two things: that whole concept is not a "cloud environment" thing; it's the way things have been done for a long time. Also, if you have to "press a button" (or perform any action), you are doing it pretty much wrong and have nothing to be smug about. None of this is magic, and none of it is unique to "cloud computing". Stop letting your brain fall out of your ear when you hear the word "cloud".
Re: (Score:3)
Cloud computing brings availability to the "small guys". It also allows for quick scalability. You can't really accomplish similar things in-house unless you use 100s of servers.
Sure, but probably 99% of small businesses don't actually need to scale that fast, or anywhere close. The cloud hosting proposition for most (not all, but most) small businesses is an appeal to wishful thinking, like the bank guy who tells you how they can give you a starter current account today, but they do have several tiers of service and once you're making over 10,000,000 in a year you'll have a dedicated account manager available to make you a coffee any time you want one.
Re: (Score:2)
You realise that this took out one data center? That is, all of those other AWS data centers are still working just fine?
Well, OK then, next time I'll just tell all of those people who can't use their home-grown Heroku-based apps for a few hours to go watch a movie on Netflix instead. It's probably just the little guys who got in trouble on this one, and it's their own dumb fault for not setting up more than one AZ or using different regions or something. Oh, no, wait, loads of people couldn't watch the movie either, and Netflix are HUGE AWS customers with an army of people to maintain a redundant infrastructure.
You really think hosting your own hardware in your own data centers spread across the world will save you a fortune?
False dichotomy.
Re: (Score:2)
"Several times and for multiple businesses. Have you?"
I'd actually be interested in hearing your analysis and experience. I'm looking at this myself and finding that the cost advantages differ depending on the scenario - there just doesn't seem to be a clear-cut point at which one solution costs less than the other for all but the most trivial scenarios.
Re: (Score:2)
OK. Obviously I'm posting pseudonymously so I can't give a lot of specifics, but FWIW...
I agree that this isn't a straightforward question, and I think one big problem is that people sometimes start by assuming a false dichotomy: either we're hosting in the cloud or we're kitting out a whole new server room. In reality, there is a broad scale to consider, with all kinds of managed hosting and colo options where a lot of the sysadmin overhead can be outsourced but you basically get to use real hardware with
Re: (Score:2)
"Several times and for multiple businesses. Have you?"
I'd actually be interested in hearing your analysis and experience. I'm looking at this myself and finding that the cost advantages differ depending on the scenario - there just doesn't seem to be a clear-cut point at which one solution costs less than the other for all but the most trivial scenarios.
Because it really depends on the business and the application. It also depends on how much bandwidth you use and if you have geographical limitations which would make accessing that bandwidth more costly in one or more locations.
If you are in it for the long haul, why not have control over your own cheap commodity machines and "scale into the cloud" for overages until you acquire more hardware? Then you can actually have control of those little things that let you switch between datacenters easily.
Re: (Score:2)
So your argument is: Netflix fucked up, so cloud is shit?
No, my argument is that saying this only affected one AWS data center and people elsewhere are fine is clearly not the whole story.
Cloud is usually cheaper and easier at small to medium scale
Cheaper and easier than what? Cloud technologies are basically useful for two things: outsourcing hardware and staff resources so you can adapt to very fast changes in the level of requirements, and being a glorified CDN. What proportion of small/medium businesses ever need to scale so fast that doing it in-house is impractical, or need the generalised capabilities of services l
What, you thought "cloud" meant "no outage"? (Score:5, Insightful)
Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.
Re: (Score:3)
The laugh is that those 1960s systems had, for additional money, configurations for 24x7 uptime. Here we supposedly design for that with the cloud architecture, and fail. I would not be surprised at all if the modern mainframe were a cost-effective alternative to this bloated, expensive cloud.
Re: (Score:2)
Are you too stupid to research before spouting off? Cutting "all the power" was rather difficult, as it came from two utilities plus onsite generation.
Re: (Score:2)
Are you too stupid to research before spouting off? Cutting "all the power" was rather difficult, as it came from two utilities plus onsite generation.
Never underestimate the power of the universe to shit on you. It's still quite possible to get a perfect storm of problems that takes things offline, such as the main onsite generator being down for scheduled maintenance that overruns, the backup generator only having limited capacity, and a major storm wiping out the power grid completely for 20 miles. At that point, stuff will go down, and at some point it becomes cheaper to have insurance to deal with the losses arising (including reputational losses) in
Re: (Score:2)
I think most are just cheap bastards who are upset that their one-server, $30/month setup didn't buy a redundant datacentre, and that oops, maybe they should have listened when people said of geo-redundancy: "It's a good thing"(TM).
Re: (Score:2)
I think most are just cheap bastards who are upset that their one-server, $30/month setup didn't buy a redundant datacentre, and that oops, maybe they should have listened when people said of geo-redundancy: "It's a good thing"(TM).
Yeah....Netflix is totally one of those places. Oh...wait....no they aren't and they were down anyway.
Re: (Score:2)
I suspect there is a lot of resistance to the concept due to the general early experiences of SaaS and hosting solutions being cloud-washed....
Re: (Score:2)
Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.
Cool. Can we get those old Teletype terminals back? The clattering ones that left little round bits of paper all over the place?
And 8-track tapes while we're at it.
Re:What, you thought "cloud" meant "no outage"? (Score:5, Funny)
And 8-track tapes while we're at it.
We need those tape machines. Stick them in front of the real machines and get something hacked from a Raspberry Pi to spin them back and forth in an interesting pattern, with some extra blinkenlights for good measure, and we'll be able to once again prove to all the management types that we're doing serious computing so they can leave us alone and go back to their golf handicap.
Re: (Score:2)
Cloud computing is nothing more than 1960s timesharing services with modern operating systems. Unless you design for resilience, you're not resilient to problems.
Cloud computing is a little more than 1960s timesharing services. Some minuscule differences, such as being accessible from anywhere in the world, providing enormously more power and exponentially more capacity, and being priced by the penny - but those are tiny differences that matter. Not to mention that, as other commenters have mentioned, the Amazon cloud does provide more redundancy; the people using it just didn't want to pay for it.
The parent is the single stupidest comment possible for this thread and it's m
Re: (Score:2)
No, it really isn't. Modern day cloud computing isn't much more advanced than it was in the 1960's.
All except for the data volumes, timescales, connectivity and pricing. In the '60s, timesharing services didn't ever have to deal with anything like the volume of data that would be found on a modern PC. They'd have a turnaround time of a few days, and connectivity was by courier if you were in a hurry, or driving over there yourself with your stack of punched cards (or paper tape) otherwise. I suppose it would be possible to think that pricing was comparable, especially if you were to ignore inflation, but
Millions of dollars spent for nothing. (Score:5, Interesting)
So this is the second time this month Amazon's cloud has gone down; there should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.
They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.
You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.
Now before people say "well, this was a major storm system that killed 10 people, what do you expect", my response is that cloud computing is expected to do work for customers hundreds and thousands of kilometres/miles from the actual data centre, so this is a somewhat crucial thing that we're talking about - millions of people literally depend on these services; that's my first point.
My second point is it's not like anything happened to the data centre, it simply lost mains energy. It's not like there was a fire, or flood, or the roof blew off the building, or anything like that; they simply lost power and failed to bring all their millions of dollars in equipment up to the task of picking up the load.
If I were a corporate customer, or even a regular consumer, I would be seriously questioning the sustainability of at least Amazon's cloud computing. Google and Facebook seem to be able to handle it, but not Amazon - granted they don't offer identical products, but their data centres overall seem to stay up 100 or 99.9999999% of the time, unlike Amazon's.
Re: (Score:3)
A datacenter is a datacenter is a datacenter. You are not in "the cloud" if you can't escape from a datacenter-level incident.
Given that there is no "cloud" provider (not yet, at least) that will automagically protect your services from a datacenter-level incident, it is up to you, the customer, to do it.
It's certainly possible with current technology, but it's neither cheap nor straightforward, no matter what the "cloud" providers insist on selling and the PHBs insist on believing.
Re: (Score:2)
"Any decent IaaS cloud provider will offer CDN and GSLB products at a reasonable price."
Which helps you with your authoritative dynamic data exactly how?
And even with your mostly read-only data you will get only the smallest advantage if going the "automagical" route: to get the most benefit from CDN or GSLB you need to engineer and develop your apps with those services in mind - which is exactly what I already said.
Re:Millions of dollars spent for nothing. (Score:5, Informative)
So this is the second time this month Amazon's cloud has gone down; there should be serious questions asked about the sustainability of this service, given the extremely poor uptime record and extremely large customer base.
They would have spent millions of dollars installing diesel or gas generators and/or battery banks and who knows how much money maintaining and testing it, but when it comes time to actually use it in an emergency, the entire system fails.
You would think having redundant power would be a fundamental crucial thing to get right in owning and operating a data centre, yet Amazon seems unable to handle this relatively easy task.
Well, the entire system didn't fail; my servers in us-east-1a weren't affected at all.
Hardware fails, even well-tested hardware... especially in extreme conditions - don't forget that this storm left millions of people without power, killed at least 10, and caused 3 states to declare an emergency. Amazon may have priority maintenance contracts with their generator and UPS system vendors, and fuel delivery contracts, but when a storm like this hits, those vendors are busy keeping government and medical customers online. Rather than spend millions more dollars building redundancy for their redundancy (which adds complexity that can itself cause a failure), Amazon isolates datacenters into availability zones and has geographically dispersed datacenters.
Customers are free to take advantage of availability zones and regions if they want to (which costs more money), but if they choose not to, they shouldn't blame Amazon.
Re: (Score:2)
ELB issues last night did cause problems for services with zone redundancy. We had services with zone redundancy that were experiencing issues because the ELB addresses being served were not functional, even though they had working instances connected to them.
Amazon has also had at least one other outage in the last 18 months that affected more than one availability zone.
Region redundancy would be good. But it's quite a bit more complex and costly, what with security groups and ELBs not crossing regions and h
Re:Millions of dollars spent for nothing. (Score:5, Informative)
Sorry, but "Amazon's cloud has gone down" is wildly incorrect. From the sounds of it, *one* of their many data centers went down. We run tons of stuff on AWS and some of our servers were affected but most were not. Most important of all is that we had *zero* service interruption because we deployed our service according to their published best practices, so our traffic was automatically handled in different zones/regions.
Having managed our own infrastructure in the past, it's these sort of outages at AWS that make us grateful we switched and that continue to convince us it was a good move. It might not be for everybody, but for us it's been a huge win. When we started getting alarms that some of our servers weren't responding, it was so cool to see that the overall service continued on its merry way. I didn't even bother staying up late to babysit things - checked it before bed and checked it again this morning.
Firing up a VM on EC2 (or any other provider) != architecting for the cloud.
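To make that concrete: one of the basic practices in Amazon's published guidance is spreading capacity across several availability zones so a single-zone outage doesn't take the service down. A rough boto3 sketch of just that part, with made-up names and AMI ID (load balancer wiring, health checks, and region-level redundancy are left out):

```python
# Sketch: an Auto Scaling group spread over three availability zones, so losing
# one zone still leaves instances serving traffic. Names and AMI are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Describe the instances to launch.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-launch-config",
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
    InstanceType="t3.micro",
)

# Keep at least three instances running, one per zone; replace lost capacity
# automatically if an instance or an entire zone goes away.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",
    MinSize=3,
    MaxSize=9,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```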
I live nowhere near Va (Score:5, Interesting)
not just netflix, and not just "electrical storm" (Score:3)
Instagram's servers in that data center were also affected, and more people griped about that on my Facebook feed than about Netflix.
as for "an electrical storm", that's a bit of an understatement. The issue was actually more the 80 mph wind gusts as well as the lightning continuing on for 2 hours after the wind and rain had passed (meaning crews couldn't get out there overnight).
The result is some 2 million people without power, 1 million around DC alone. Dominion Power (which serves the area where the data center resides, about 5 miles from my house) lost power for more than half of its northern Virginia customers, and even now has only restored power to about 60,000* out of the 461,000 that lost it. On the Maryland/DC side of the Potomac, half a million people may be without power for days through a heat wave of 100-degree days (and more storms like last night's coming...).
* fortunately that would include me...though i'm writing this via my sprint phone as a wifi hotspot 'cause our cable modem is still down ;-)
Re: (Score:2)
This was a bad storm, but could certainly have been far worse. Even still, the grocers and stores are out of ice and people are swarming out of their homes like rats abandoning ship in some areas. These same people would be fucked if the S really HTF.
Wasn't even a big storm (Score:5, Informative)
I was in it - it was not a particularly bad storm. Heavy winds, lots of cloud-to-cloud lightning, but very little rain or cloud-to-ground lightning. I lost power repeatedly, but it was always back up within seconds. And I'm located way out in a rural area, where the power supply is much more vulnerable (every time a major hurricane hits, I'm usually without power for about a week - bad enough that I bought a small generator).
According to TFA, they were only without power for half an hour, and the ongoing problems were related to recovery, not actual power-lossage. So their problems are more "bad disaster planning" than "bad disaster".
Still, you'd think a major data center would have the usual UPS and generator setup most major data centers have - half an hour without power is something they should have been able to handle. Or at least have enough UPS capacity to cleanly shut down all the machines or migrate the virtual instances to a different datacenter.
Re: (Score:2)
I was in it, and barely missed getting hit by multiple tree branches of the 6+inch diameter variety as I drove the final half-mile home. I lost power long enough to make my UPS whine, but that was before I got there. Every street around me had branches down, some completely blocking main streets. I live in Alexandria near Potomac Yard, got hit by the weather driving through Shirlington.
I had managed to not be out in a storm of this size before, usually I stay in or get back before one hits (I'm a
My instance was down for 9hrs... (Score:2)
Which is the problem - not the power outage itself. If the power outage happened and the servers were back in, let's say, 30 minutes, 1 hour... alright, but 9 freakin' hrs?
In my specific case I didn't suffer as much, because I have another instance in a different zone with db replication and all that, serving as a backup server, and my project there, although very critical (20 people are getting wages out of it), is very low on resource usage... I can imagine there were quite a lot of people that lost quite
Re:My instance was down for 9hrs... (Score:5, Interesting)
There is a gap between technical and marketing requirements here.
The Amazon infrastructure was initially built to support Amazon retail, and Amazon put a lot of pressure on its engineers to make sure their apps were properly redundant across three or more data centers. At one point, the Amazon infrastructure team used to do "game days" where they would randomly take a data center offline and see what broke. The EC2 infrastructure is mostly independent of retail infrastructure, but it was designed in a similar fashion.
However, Amazon can't tell their customers how to build apps. The customers build what is familiar to them, and make assumptions about up time of individual servers or data centers. As the OP says, it's "the standard people are used to". Since the customer is always right, Amazon has a marketing need to respond by bringing availability up to those standards, even though it isn't technically necessary.
Defined My Saturday Morning (Score:2)
Shitty is the new Acceptable (Score:3)
Didn't you get the memo? Netflix barely runs now and this is working as planned. Time Warner had four internet outages in Raleigh THIS WEEK.
Everything everywhere is slowly grinding to a halt. So let's send more work to China and India. Who cares anymore.
Positive spin (Score:2)
Click (Score:2)
To migrate Click Here!
At least for those that have a DR migration plan.
Re: (Score:2)
"If they don't have proper backup generators, they have no business running a data center."
*Or* they are in a business that recognizes that shit happens, even at the datacenter level, and provides services so you can spread your load across more than one datacenter, making the 10x expenditure needed to go from a "decent" datacenter to a "top notch" one moot and avoidable.
Hey, doesn't that look like that funny "cloud" concept they are waving around so often?
Re: (Score:3)
"No, if you are a professional stuff doesn't 'happen'"
No, if you are a professional you evaluate risks, adjust your behaviour to an acceptable level, and don't spend a bazillion to protect half a bazillion.
For example, Google designed their applications to withstand a failing server: what's the benefit, in their case, of going with RAID10, doubled PSUs, and hot-swappable RAM and CPUs? What does that bring to the table but lost money?
Amazon offers people outside the Fortune 100 the ability to do the same, only
Re: (Score:2)
Well, this is America; you are welcome to your belief, even if it's horribly wrong.
Re: (Score:2)
I personally think it's funny that people would even say that (the "if you're a professional, stuff doesn't happen" BS). As someone who works in the infrastructure business, I can tell you with 100% certainty that no design, location, or setup will be perfect. Regardless of how well you plan, you are one natural disaster away from a service interruption, and any single point in the system can be taken down by some guy with a backhoe digging where he shouldn't.
Even if you designed a data center with 100 layers of redundancy