
Data Center Power Failures Mount

Posted by timothy
from the send-money-drugs-and-sealed-lead-acid-batteries dept.
1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."


  • Be Redundant! (Score:5, Insightful)

    by drewzhrodague (606182) <drew@zhrodague.net> on Monday July 06, 2009 @08:29PM (#28602063) Homepage Journal
    Anyone seriously concerned about their web applications will have redundant sites and a way to share the load. Few people pay attention to the fact that DNS requires geographically disparate DNS servers, such that even in the event of a datacenter fire (or nuclear attack) there will still be an answer for your zone. Couple this with a few smaller server farms in separate places, and there won't be any problems. I went to look it up on Wikipedia, but couldn't find where it says authoritative DNS servers have to be in separate geographic regions. Where did I read this, DNS and BIND?
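
    If you want a rough sanity check that your own zone's name servers aren't all sitting in the same building, something like this will do; a quick sketch, assuming the dnspython package and using example.com as a stand-in for your zone:

        # Rough check that a zone's authoritative name servers are not all
        # in the same network (a weak proxy for geographic diversity).
        # Assumes the dnspython package (pip install dnspython).
        import dns.resolver

        def ns_networks(zone):
            networks = set()
            for ns in dns.resolver.resolve(zone, "NS"):
                for addr in dns.resolver.resolve(ns.target.to_text(), "A"):
                    # Bucket each name server address by its /24.
                    networks.add(".".join(addr.to_text().split(".")[:3]))
            return networks

        nets = ns_networks("example.com")  # substitute your own zone
        print(len(nets), "distinct /24s:", sorted(nets))

    Different /24s are only a crude proxy, of course; really disparate means different providers, ASNs, and regions.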
  • by Neanderthal Ninny (1153369) on Monday July 06, 2009 @08:42PM (#28602201)

    My wild guess is that they are deferring preventive maintenance on these data centers, so we are seeing these major outages now. Fire suppression, UPS, transfer switches, generators, distribution panels, transformers, network gear, servers, storage devices and other gear will fail if you don't maintain them properly. As loads increase, the equipment fails earlier, and my guess is that people have pushed this equipment past its load rating and expected lifespan.

  • by Anonymous Coward on Monday July 06, 2009 @09:06PM (#28602441)

    Surprise surprise...there's a downside to consolidation. Hey morons, the internet was invented as a means to ensure redundant communications paths given nuclear warfare. The old central switch (physical switching) was seen as too cumbersome and vulnerable. Now that we have wonderfully redundant communications, and have done away with most of the downsides of physically distributed systems, morons are building logically centralized systems.

    NEWSFLASH - Redundant communications and physical virtualization do very little for you if you build a logical mainframe.

    Truly distributed systems must be physically AND logically DISTRIBUTED, with redundant comms paths, in order to gain the full benefits of decentralization. (e.g. Distributed isn't distributed if all your authentication is done at one site or all your traffic must pass through a single chokepoint.)

    by Velox_SwiftFox (57902) on Monday July 06, 2009 @10:32PM (#28603057)

    "Major" data center or not, the one used by the company employing you at the time is the one that matters.
    In my experience, data center backup power fails about a third of the time that power is interrupted somewhere.

    Servers in an Oakland, California center were victims of the loss of one of three power phases, while the monitoring that would have switched over to the diesel generators was watching the power level of the other phases. The UPS systems ran out of power. An extra level of redundancy in the form of rack-mount UPSes allowed servers to shut down properly despite the data center's loss of routing.

    Data center #2 was the victim of a simple power outage and immediate failure of the main data center UPS system. According to a security guard I talked to, "it exploded". The diesel backup never had a chance to start.

    Then the doubly-sourced Power Distribution Unit supplying a rack at a third ISP failed in a way that turned off both sources supplying the servers.

    Hint: Add an extra level of UPS redundancy and safe shutdown software daemons, at least. Multiple data centers if you need more nines.
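
    The "safe shutdown daemon" part doesn't have to be fancy; it just has to live on the server rather than in the facility's switchgear. A minimal sketch, assuming Network UPS Tools is installed and the rack UPS is configured in it under the (hypothetical) name "myups":

        # Poll a NUT-managed UPS and halt the box cleanly before the
        # battery runs out. Assumes the `upsc` client from Network UPS
        # Tools and a UPS registered locally as "myups" (both assumptions).
        # Needs to run as root to be able to shut the machine down.
        import subprocess
        import time

        UPS = "myups@localhost"      # hypothetical NUT UPS name
        LOW_CHARGE = 20              # shut down at or below this percent

        def ups_var(name):
            out = subprocess.run(["upsc", UPS, name],
                                 capture_output=True, text=True, check=True)
            return out.stdout.strip()

        while True:
            status = ups_var("ups.status")        # e.g. "OL", "OB DISCHRG"
            charge = int(float(ups_var("battery.charge")))
            if "OB" in status and charge <= LOW_CHARGE:
                # On battery and nearly drained: flush disks, power off.
                subprocess.run(["sync"])
                subprocess.run(["shutdown", "-h", "now"])
                break
            time.sleep(30)

    In practice NUT's own upsmon already does exactly this job; the point is only that the last line of defense belongs on the box itself.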

  • by afidel (530433) on Monday July 06, 2009 @10:32PM (#28603065)
    authorize.net are apparently complete idiots; if they are that large and all their equipment is in one datacenter, that's bordering on insane. Heck, my little company of under 1k employees has two facilities. Anyone running a site with 100k+ customers should know better.
  • by ls671 (1122017) * on Monday July 06, 2009 @11:34PM (#28603589) Homepage

    I would not try to convince him. Just write a memo describing the issues without sounding alarmist. It is up to the boss to evaluate the risks and make the decision. Once you have written your memo, you are basically covered.

    Now could be a good time to write this memo; just remember not to sound alarmist and simply describe the possible issues, even if the risk is slim. You could say that you were inspired by recent events at big data centers ;-))

    As for licensing issues, call your Oracle/MS representatives; they offer special deals for fail-over sites. That will be a good point to mention in your memo (cost).

  • by ls671 (1122017) * on Monday July 06, 2009 @11:39PM (#28603645) Homepage

    >> "morons are building logically centralized systems"

    I have worked with such a moron doing architecture on a big government project ;-)) unbelievable...

    His argument was that "The government likes centralized systems" ;-))

  • by BBCWatcher (900486) on Tuesday July 07, 2009 @05:28AM (#28605351)

    No, I think you have it exactly backwards, or at least you're missing an important nuance. It's really, really expensive to duplicate everything across two (or more) data centers. And it's a full-scope increase in IT costs: most or all cost categories increase. We're talking more than double the costs, in round numbers. Beyond the cost, it's technically very hard to recover hundreds or thousands of servers simultaneously or even near-simultaneously, because you are typically trying to recover not hundreds or thousands of atomistic, independent servers but all the moment-in-time state and functional dependencies among them. Very, very difficult, which also means hugely expensive and prone to error. Unfortunately, service interruptions are also extremely expensive. What to do?

    You could just buy a pair of mainframes, one at site one and the other (configured with reserve capacity, which is lower cost) at site two. (Buy more only if you need the capacity; they then operate like a single machine.) That all works really, really well. As in, credit card holders would have no clue that site #1 just burned to the ground -- the credit cards keep working. That particular form of consolidation makes disaster recovery a relative breeze. DR is thoroughly baked into the DNA of such equipment, and the computing model itself supports rapid recovery. (Effectively zero interruption and zero data loss, if that's what you need; or, in DR lingo, an RPO and RTO of zero.)

    The critical nuance here is that if you only consolidate sites, which a lot of businesses have done, you're reducing business resiliency, ceteris paribus. Yes indeed, if you merely forklift your hundreds or thousands of servers into a smaller number of data centers and do basically nothing to consolidate applications, databases, operating system images, etc. onto better DR-protected assets, then disaster recovery will be much tougher and much more expensive. Site-wide disasters will be more disastrous. The game-changer (otherwise known as re-learning time-tested lessons :-)) is when you untangle the mess and do real consolidation onto a much smaller number of robust, well-protected servers, with some decent DR investments and realistic rehearsals. That'd be mainframes and mainframe IT discipline, basically, or at least something that resembles mainframes (if such a thing exists).
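
    To put the "more nines" argument elsewhere in this thread into numbers, here is a back-of-the-envelope sketch; the availability figure is an illustrative assumption, not anyone's real SLA:

        # Back-of-the-envelope downtime math: one site vs. two independent
        # sites. The 99.9% figure is an illustrative assumption.
        HOURS_PER_YEAR = 8766

        single_site = 0.999                       # ~8.8 hours down per year
        # Two independent sites are only down when both fail at once:
        paired = 1 - (1 - single_site) ** 2

        for label, a in [("one site", single_site),
                         ("two independent sites", paired)]:
            downtime = (1 - a) * HOURS_PER_YEAR
            print("%-22s availability %.6f, ~%.2f h/yr down" % (label, a, downtime))

    That second line only holds if the failures really are independent, which is exactly the moment-in-time state and dependency problem described above.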

