
Data Center Power Failures Mount

Posted by timothy on Monday July 06, @06:58PM
from the send-money-drugs-and-sealed-lead-acid-batteries dept.
power
internet
1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by BillyMays (1587805) on Monday July 06, @06:58PM (#28601737)
    I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!] [wikipedia.org].

    -B. Mays
    • I'm guessing that the majority of these were caused by leaks or spilled drinks. If only you guys had listened to me and gotten Zorbeez(tm)[SOAKS UP 10x ITS OWN WEIGHT!] [wikipedia.org].

      Even that wouldn't work. What you have here is your textbook Pepsi Syndrome and only a President in yellow booties can fix it.

  • by StaticEngine (135635) on Monday July 06, @07:16PM (#28601939) Homepage

    "A blown transformer appears to be the culprit"

    I'd heard the new movie was crude, but I didn't realize how crude it actually was!

  • Outages (Score:3, Interesting)

    by Solokron (198043) on Monday July 06, @07:27PM (#28602049)
    Outages happen more often than that. We have been in several data centers; ThePlanet and The Fortress have both had major outages in the last two years that affected business.
    • Re:Outages (Score:5, Interesting)

          I've had equipment and/or worked in many datacenters over the last decade or so. I've worked with even more clients who have had equipment in other datacenters.

          I've only experienced 3 power related outages that I can think of.

          One was a brownout in that area, which cooked the contactors that switched between grid power and their own DC room.

          One was an accident, where a contractor accidentally shorted out a subpanel, and took out about a row of cabinets. I was there for that one. I saw the flash out of the corner of my eye, and by the time I turned my head, he was just flying into the row of cabinets.

          One was a mistake in the colo, where there was a mislabeled circuit, so they cut power to 1/3 of one of our racks.

          There have been even more outages related to connectivity problems. With one major provider who was just terrible (and is now out of business), they had a fault about once a week or less. Every time we called, they said "there was a train derailment that cut a section of fiber in [arbitrary state], which affected their whole network." It was funny at first, but annoying when we started questioning them about why there was no news about all these train derailments. We had to make up our own excuses for the customers, because we couldn't keep telling them the BS story the provider gave. We were smart about it though, and at least had decent excuses, and the whole staff knew which BS story to give for a particular day. The sad part was, we had a T3, and that was huge at the time.

          At my last job, they wanted a full post-mortem done on any fault. If a customer across the country suffered bad latency or packet loss, it was our job to find out why and "fix" it. The management wouldn't accept that there are 3rd party providers who handle some of the transit. So, we'd call our provider demanding it be fixed (which they couldn't do), and then call the broken provider (who hung up since we weren't their customer), and then get reamed by the boss because we couldn't fix it. Delay tactics worked best after a while. If you're "investigating" a problem long enough, and hold the phone up to your ear enough, the problem will likely be fixed by those who really can. We'd still log a ticket with our provider, because the boss would eventually call the provider referencing the ticket number, and find out there was still nothing that could be done.

          There's pretty much guaranteed to be a fault of some sort between two points on the Internet every day. All anyone can really do is make sure it isn't with your own equipment. That's something I always did before calling to complain about anything. It's embarrassing to hear "did you reboot your router?" and that turns out to really be the problem.

          The only real solution to this is redundancy. Not just in one facility, but across multiple facilities. If you spread things out enough, sure, an isolated problem will affect some people, but not everyone. If you want a service to be reliable, redundant machines in each datacenter are the only way to go. When I was running the network (and everything technical) at one job, a datacenter outage wasn't a concern, it was just a minor annoyance. I filed a trouble ticket, and told them to call me when it was fixed. We'd demand reimbursement on the outage time, and made them handle the difference on our 95th percentile bandwidth charges at the end of the month. I wasn't going to take a hit on the bill just because they had an outage in a city, and my other cities had to take the traffic during the outage. When your bill is measured in multiple Gb/s, you have a little more say in how they handle the billing. :)
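For readers unfamiliar with 95th percentile billing, here is a minimal sketch of why failover traffic can raise the bill. It assumes the common scheme of sorting the month's usage samples and billing on the highest one left after the top 5% are discarded; the nearest-rank calculation, the function name, and the sample figures are illustrative assumptions, not any particular provider's method.

```python
def percentile_95(samples_mbps):
    """Return the highest sample left after discarding the top 5% (nearest rank)."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical month of 100 samples at a steady 200 Mb/s, plus failover bursts.
short_burst = [200.0] * 95 + [900.0] * 5   # burst fits inside the discarded 5%
long_burst = [200.0] * 94 + [900.0] * 6    # burst spills past the discarded 5%

print(percentile_95(short_burst))
print(percentile_95(long_burst))
```

Under these assumptions, a burst that occupies no more than 5% of the samples is ignored, while a slightly longer one sets the billable rate for the whole month, which is the hit the poster refused to absorb.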

      • ... which cooked the contactors that switched between grid power and their own DC room.

        I read that as contractors. I apparently saw 'contractor' in the next sentence and did the switcheroo. I was going to call you callous for using the term 'cooked'.

        FYI, arc flash [wikipedia.org] is not something to be taken lightly (no pun intended). It's dangerous as all get-out in high voltage panels that have a lot of available fault current. A typical 480V, 20kA fault [wikipedia.org] can release the same amount of energy as 1.5 lbs of TNT.

  • Anyone seriously concerned about their web applications will have redundant sites and a way to share the load. Few people pay attention to the fact that DNS requires geographically disparate DNS servers *, such that even in the event of a datacenter fire (or nuclear attack), there will still be an answer for your zone. Couple this with a few smaller server farms in separate places, and there won't be any problems. I went to look it up on Wikipedia, but couldn't find where it is required for authoritative DNS servers to be in separate geographic regions. Where did I read this, DNS and BIND?
    • Re:Be Redundant! (Score:5, Informative)

      by W3bbo (727049) on Monday July 06, @07:54PM (#28602333)
      The DNS RFCs advise that zone nameservers should be in separate subnets. Specifically, RFC 2182 recommends that secondary DNS services be spread around geographically.
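A rough way to sanity-check the "separate subnets" advice is to group a zone's nameserver addresses by enclosing network. The sketch below uses Python's standard `ipaddress` module. The /24 prefix length, the function names, and the documentation-range addresses are assumptions for illustration. Passing this check says nothing about geographic or topological diversity, which has to be verified separately.

```python
import ipaddress

def distinct_networks(addresses, prefix_len=24):
    """Return the set of IPv4 networks (of the given prefix length) covering the addresses."""
    networks = set()
    for addr in addresses:
        # strict=False accepts a host address and returns its enclosing network.
        networks.add(ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False))
    return networks

def shares_a_subnet(addresses, prefix_len=24):
    """True if at least two distinct addresses fall inside the same network."""
    return len(distinct_networks(addresses, prefix_len)) < len(set(addresses))

# Hypothetical nameserver addresses taken from the documentation ranges.
print(shares_a_subnet(["192.0.2.10", "192.0.2.20"]))
print(shares_a_subnet(["192.0.2.10", "198.51.100.5"]))
```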
    • In the event of a nuclear attack, you probably have more pressing issues to deal with than your server uptime.

      • But then how will you know who is attacking you, and where to go? Not to mention how to best shield yourself from radiation...
    •     Be nice, people don't read the books or RFCs any more.

          At the biggest operation I ran, I had redundant servers in multiple cities, and DNS servers in each city. If we lost a city, it was never a big deal, other than the others needing to handle the load. With, say, 3 cities, a one-city outage only shifted about 16.7% of the total traffic onto each of the other two. Each city was set up to handle >100% of the typical peak day traffic, so it was never a big deal. I don't think we ever suffered a two-city simultaneous failure, even though we simulated them by shutting down a city for a few minutes. Testing days were always my favorite. I loved to prove what we could or couldn't do. I peaked out one provider in a city once. We had the capacity as far as the lines went, but they couldn't handle the bandwidth. It was entertaining when they argued, so I dumped the other two cities to the one in question, and they were begging me to stop. "Oh, so there is a fault. Care to fix it?"

          I could quantify anything (and everything) at that place. I could tell you a month or so in advance what the peak bandwidth would be on a given day, and how many of which class of servers we needed to have operating to handle it. I classed servers by CPU and memory, which in turn gave how many users and how much bandwidth each could do. I only wanted our machines to ever peak out at 80%, but sometimes it was fun to run them up through 100%. I set the limits a little low, so we could run at say 105% without a failure.

          Such information let us know if we had a server problem, before we knew we did. I'd notice a server was running 10% low, and that really means that it is going to fail. We'd watch for a little while, and it would. :) We'd power it down, and leave it in the datacenter until we had another scheduled site visit.
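The three-city figure in this comment is easier to follow as arithmetic. The sketch below assumes traffic is split evenly across sites, which the comment does not state, and the function name is made up for illustration. It shows that losing one of three cities moves about 16.7 percentage points of total traffic onto each survivor, which is a 50% jump relative to what each was already carrying.

```python
def share_per_site(total_sites, failed_sites=0):
    """Fraction of total traffic each surviving site carries, assuming an even split."""
    surviving = total_sites - failed_sites
    if surviving <= 0:
        raise ValueError("no surviving sites")
    return 1.0 / surviving

before = share_per_site(3)                  # one third of total traffic per city
after = share_per_site(3, failed_sites=1)   # one half of total traffic per city

added_points = (after - before) * 100            # percentage points of total traffic
relative_increase = (after / before - 1) * 100   # growth relative to normal load

print(f"{added_points:.1f} points of total traffic added per surviving city")
print(f"{relative_increase:.1f}% more load than normal per surviving city")
```

A site sized for more than 100% of the typical peak, as described above, has room for either figure.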

    • It's required that you have two name servers when you register a domain name.

      Physical separation is not required. It's just good practice. (I do, in separate cities on different ISP networks) Having separate nameservers in different geo regions is implicit because you have to register at least two for each domain name. I've seen some people game this by having a single nameserver with two IP addresses, which strikes me as the height of stupidity, but it's not happening on my watch.

      • I've seen some people game this by having a single nameserver with two IP addresses, which strikes me as the height of stupidity

        If everything referenced by the DNS records (web and email services or whatever) is hosted on the same machine as the name server, then it isn't particularly stupid. It's just a small operation that has a single point of failure; redundant DNS isn't going to change that.

        • If everything referenced by the DNS records (web and email services or whatever) is hosted on the same machine as the name server, then it isn't particularly stupid. It's just a small operation that has a single point of failure; redundant DNS isn't going to change that.

          WITH the single exception I know of, that incoming email will bounce with something like "domain not found" if there is no DNS response at all, vs if there is DNS but the MX record servers can't be reached it'll silently retry. Some totally brain-dead MTAs will bounce, but anything remotely usable will transparently retry later and no one will know it happened.

          And it's not so much a "small operation", as a non-relevant risk. People have a certain expectation of how (un-)reliable email is, due to filtering

          • WITH the single exception I know of, that incoming email will bounce with something like "domain not found" if there is no DNS response at all, vs if there is DNS but the MX record servers can't be reached it'll silently retry.

            Common myth but quite untrue (try it for yourself). If there is no response from any DNS server then it will be considered a temporary failure and delivery attempts will continue at intervals in the background just as if the MX target(s) were not responding.

            Only if a server can be re

    • While geographic diversity is certainly an excellent goal, it's not always that simple. My ISP's network core was located in the Peer 1 suite at 151 Front (whose UPS caused the fire). Power was cut to Peer 1's suite, but not the rest of the building (151 Front has independent power/cooling/etc. per-suite to the extent where each tenant is responsible for getting their own solution).

      Redundant power sources could have mitigated the issue had there not been a fire; running two independent circuits to critical

    • Re: (Score:3, Interesting)

      Best solution for big outfits is to have at least this setup:

      1) One party being the main contractor. This party doesn't do ANY hosting per se but only manages the fail-over strategy, doing the relevant testing once in a while.

      2) A second party being involved in hosting and managing data centers.

      3) A third party, completely independent from party 2, a competitor of 2 is preferable, which also does hosting and manages data centers.

      It is the same principle when you bring redundant internet connectivity to a b

      • 1) Have the fiber from one provider come into the building from, say, the north side of the building.

        2) Have a competitor, unrelated business wise, that doesn't use the same upstream providers bring his fibers in from the South side of the building.

        3) Discover that both fiber runs connect to the same L.E.C. vault 100 feet away and then run parallel the whole way back to the same central office, and/or they both are carried on the same SONET ring just connected to different ADMs (which would at least give you ADM redundancy).

        Seriously though, step 3) is get a copy of the DLR / CLR of the local loop, and have someone analyze them. Of course how the circuit is designed is not necessarily how it is actually routed, which is even funnier.

        Everyone in the t

    • It wasn't me!!
  • by Neanderthal Ninny (1153369) on Monday July 06, @07:42PM (#28602201)

    My wild guess is they are deferring preventative maintenance on these data centers, so we are seeing these major outages now. Fire suppression, UPS, transfer switches, generators, distribution panels, transformers, network gear, servers, storage devices and other gear will fail if you don't maintain them properly. As loads increase, the equipment will fail earlier, and my guess is that people have pushed this equipment beyond its lifespan or load rating.

    • Frankly, if data centers are going to proclaim their redundancy, they should test by power failing the entire data center once every two weeks at a minimum. A data center that goes down twice in a month would get ahead of any issue pretty fast. Lessons learned from the staff and the management are very valuable.

      The marketing messaging:

      "We power fail our data center every two weeks to ensure our backups work..."

      Sound scary? Just think about the data center that has never been through this process. at tha

      • Semi-monthly pull-the-plug tests would reduce reliability. Monthly load tests on the generator and a battery monitoring system ensure electrical reliability quite effectively. Only the most inadequate facilities fail to do this.

        The larger problems come from improper change control, a lack of scripting, or an abnormal failure mode. Lack of testing and maintenance is a real problem, and in data centers it is far too often caused by the IT team not understanding the risks of inaction. If you have an ac

      • Re: (Score:3, Insightful)

        I would not try to convince him. Just write a memo describing the issues without sounding alarmist. It is up to the boss to evaluate the risks and to take the decision. Once you have written your memo, you are basically covered.

        Now could be a good time to write this memo, just remember not to sound alarmist, just describe the possible issues although the risk is slim. You could say that you have been inspired by recent events in big data centers ;-))

        As per licensing issues, call your Oracle/MS representativ

  • by Anonymous Coward on Monday July 06, @08:06PM (#28602441)

    Surprise surprise...there's a downside to consolidation. Hey morons, the internet was invented as a means to ensure redundant communications paths given nuclear warfare. The old central switch (physical switching) was seen as too cumbersome and vulnerable. Now that we have wonderfully redundant communications, and have done away with most of the downsides of physically distributed systems, morons are building logically centralized systems.

    NEWSFLASH - Redundant communications and physical virtualization do very little for you if you build a logical mainframe.

    Truly distributed systems must be physically AND logically DISTRIBUTED with redundant comms paths in order to gain the full benefits of decentralization. (e.g. Distributed isn't distributed if all your authentication is done at one site or all your traffic must pass through .)

    • Re: (Score:3, Insightful)

      >> "morons are building logically centralized systems"

      I have worked with such a moron doing architecture on a big government project ;-)) unbelievable...

      His argument was that "The government likes centralized systems" ;-))

    • 1) The internet wasn't redundant, ARPANET was redundant. The internet hasn't been able to withstand a nuclear attack since it was put online.

      Putting all your eggs in one basket is nothing new under the sun. You ever see Ma Bell's idea of a "redundant" circuit? Two wires in the same conduit. But at least Ma Bell was doing it out of thriftiness and laziness, not ignorance and superstition.

    • Re: (Score:3, Insightful)

      No, I think you have it exactly backwards, or at least you're missing an important nuance. It's really, really expensive to duplicate everything across two (or more) data centers. And it's a full-scope increase in IT costs: most or all cost categories increase. We're talking more than double the costs, in round numbers. Beyond the cost, it's very hard technically to recover hundreds or thousands of servers simultaneously or even near-simultaneously, because you are typically trying to recover not hundreds or

  • ... saying that it's time to reconsider cost cutting measures. In 15 years in the field I never saw a well designed and well maintained critical power system drop its load. I saw many poorly designed and/or poorly maintained systems drop loads, even catching fire in the process. One such fire in a poorly designed and poorly maintained system took the entire building with it, data center and all. The fire suppression system in that one was never upgraded to meet the needs of the "repurposed space" which was originally a light industrial/office space.

  • Even worse... (Score:5, Informative)

    by Anonymous Coward on Monday July 06, @08:28PM (#28602583)

    I'm one of the guys that services the security system in Fisher Plaza. The damn sprinklers killed half my panels near the scene. Turns out they use gas suppression methods in the data centers, not so much in the utility closets. And the city of Seattle REQUIRES sprinklers throughout the building, even right over the precious, precious servers. In defense of the staff there however, they do not keep them all charged 24/7. Other than that, I have no more info, as they're pretty locked down.

  • by Dirtside (91468) on Monday July 06, @09:05PM (#28602833) Journal
    ...what is the normal (historical) rate of data center power failures, and how does the recent spate compare? Five in a week sounds severe, but what's the normal worldwide average? I can imagine that with thousands of data centers around the globe, there's likely a serious failure occurring somewhere in the world once every couple of days.
  • by Velox_SwiftFox (57902) on Monday July 06, @09:32PM (#28603057)

    "Major" data center or not, the one your employer is using at the time is the important one.
    In my experience, data center backups fail about a third of the time power is interrupted somewhere.

    Servers in an Oakland California center were the victim of the loss of one of three power phases, while the monitoring that would have switched over to the diesel generators was looking at the power level of other phases. UPS systems ran out of power. An extra level of redundancy in the form of rack mount UPSes allowed servers to shut down properly despite the data center's loss of routing.

    Data center #2 was the victim of a simple power outage and immediate failure of the main data center UPS system. According to a security guard I talked to, "it exploded". The diesel backup never had a chance to start.

    Then the doubly-sourced Power Distribution Unit supplying a rack at a third ISP failed in a way that turned off both sources supplying the servers.

    Hint: Add an extra level of UPS redundancy and safe shutdown software daemons, at least. Multiple data centers if you need more nines.
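The Oakland failure described above, where one phase dropped while the monitoring watched other phases, comes down to a transfer decision that does not inspect every phase. The sketch below is hypothetical: the function name, the 277 V nominal (line-to-neutral on a 480 V three-phase system), and the 10% tolerance are illustrative assumptions, not any vendor's transfer-switch logic.

```python
def should_transfer_to_generator(phase_volts, nominal=277.0, tolerance=0.10):
    """Return True if ANY monitored phase is outside nominal +/- tolerance."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    # Every phase is checked, so a healthy phase cannot mask a dead one.
    return any(not (low <= volts <= high) for volts in phase_volts.values())

# Hypothetical readings: phase B has been lost, A and C look healthy.
readings = {"A": 277.0, "B": 0.0, "C": 276.0}
print(should_transfer_to_generator(readings))

# A monitor that sampled only phase A would see nothing wrong.
print(should_transfer_to_generator({"A": readings["A"]}))
```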

  • Rackspace in Dallas (Score:5, Informative)

    by Thundersnatch (671481) on Monday July 06, @10:01PM (#28603323) Journal

    We're a Rackspace customer in their DFW datacenter. This is the third power-related outage they've had in the last two years at that supposedly world-class facility.

    The first wasn't really their fault: truck driver with health condition runs into their transformers. Generators kick in, but chillers don't re-start quickly enough. Temps skyrocket in minutes, emergency shutdowns. Maybe the transformers should have had some $50 concrete pylons surrounding them?

    The second outage was the result of a botched generator upgrade.

    This latest outage was the result of a botched UPS maintenance.

    None of the outages was long enough to trigger our failover policy to our DR site, but our customers definitely noticed.

    While their messaging has been very open and honest about the problems, and the SLA credits have been immediate, we pay them nearly $20K per month. Needless to say, we are shopping, and looking into a "multiple cheap colos" architecture instead of "Tier-1 managed hosting". Nothing beats geographic redundancy.

    • by zonky (1153039) on Monday July 06, @10:40PM (#28603659)
      That isn't quite right, re: their 2007 outage.

      It wasn't a power issue as such, but the way their chillers responded to two quick power fluctuations in succession:

      This is what they said:

      Without notifying us, the utility providers cut power, and at that exact moment we were 15 minutes into cycling up the data center's chillers. Our back up generators kicked in instantaneously, but the transfer to backup power triggered the chillers to stop cycling and then to begin cycling back up again, a process that would take on average 30 minutes. Those additional 30 minutes without chillers meant temperatures would rise to levels that could irreparably damage customers' servers and devices. We made the decision to gradually pull servers offline before that would happen. And I know we made the right decision, even if it was a hard one to make.
  • Sunspots, Anyone? (Score:3, Interesting)

    by Craig Milo Rogers (6076) on Tuesday July 07, @04:24AM (#28605331) Homepage

    All these data centers failed at roughly the same time as the sunspots returned [space.com], but that's just a coincidence, right?

  • I'm almost thinking of taking UPS out of the loop here. They cause nearly all the downtime we have. It would be better to just let the machines power off rather than allowing the UPSs to CAUSE the machines to be taken offline. At least if the UPS isn't in circuit, the machines power back up again when the power comes back, but if there's a fault with the UPS or its batteries, then the machines stay offline until the batteries have been replaced.

    Why the hell the idiots that design UPSs seem to think it's a

    • Yes, that's clearly Twitter territory.
    • by Anonymous Coward on Monday July 06, @08:30PM (#28602595)

      Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.

      Can I sign up for broken water main notices here, too, or do I need to go to another website?

      100+ million people daily are "serviced" by these 5 data centers.

      Companies such as authorize.net were COMPLETELY unavailable for payments to hundreds of thousands of webmasters' sites (ya know, the people who make money)

      If you don't think this is serious news then you are still living at home.

      Ya that's what I thought.

      • Re: (Score:3, Insightful)

        authorize.net are apparently complete idiots; if they are that large and all their equipment is in one datacenter, then that's bordering on insane. Heck, my little company of under 1k employees has two facilities. Anyone who should be running a site with 100k+ customers knows better.
      • The Fisher Plaza story is big. I happened to be walking by right after it happened, noticed the generators running and went, 'Hm-m-m'. We've toured their facility in the past, and wanted to use them, but they didn't have capacity at the time. They seemed first rate. If a first tier provider can have this happen...

        • by right after it happened, noticed the generators running and went, 'Hm-m-m

          You're some kind of witch aren't you? You broke my internet!

          BURN THE WITCH!!!

      • Re: (Score:2, Funny)

        by Anonymous Coward

        Indeed, 18465. And we shall get off your lawn as well.

    • Safety measures drastically reduce the chance of accidents, while being unprepared, especially if it's just a brief period of unpreparedness, greatly increases the chance of an accident. This makes you wonder if the safety measures were really worth it, but at least you won't have any accidents as long as you remain prepared.