Cooling Challenges an Issue In Rackspace Outage

Posted by Zonk
from the getting-a-touch-warm-in-here dept.
miller60 writes "If your data center's cooling system fails, how long do you have before your servers overheat? The shrinking window for recovery from a grid power outage appears to have been an issue in Monday night's downtime for some customers of Rackspace, which has historically been among the most reliable hosting providers. The company's Dallas data center lost power when a traffic accident damaged a nearby power transformer. There were difficulties getting the chillers fully back online (it's not clear if this was equipment issues or subsequent power bumps) and temperatures rose in the data center, forcing Rackspace to take customer servers offline to protect the equipment. A recent study found that a data center running at 5 kilowatts per server cabinet may experience a thermal shutdown in as little as three minutes during a power outage. The short recovery window from cooling outages has been a hot topic in discussions of data center energy efficiency. One strategy being actively debated is raising the temperature set point in the data center, which trims power bills but may create a less forgiving environment in a cooling outage."
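For a rough feel of why the recovery window is so short, here is a minimal back-of-the-envelope sketch. It assumes all of the cabinet's power goes into heating a fixed volume of room air and ignores the thermal mass of the hardware, walls, and floor, so it should overstate how fast things heat up. The air volume per cabinet and the temperature limits are hypothetical numbers, not figures from the study mentioned in the summary.

```python
AIR_DENSITY_KG_M3 = 1.2            # approximate density of air near room temperature
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # approximate specific heat of air at constant pressure


def air_temp_rise_c_per_min(heat_watts, air_volume_m3):
    """Rate at which a sealed volume of air warms when heat_watts is dumped into it."""
    heat_capacity_j_per_k = AIR_DENSITY_KG_M3 * air_volume_m3 * AIR_SPECIFIC_HEAT_J_KG_K
    return heat_watts / heat_capacity_j_per_k * 60.0


def minutes_until_limit(heat_watts, air_volume_m3, start_c, limit_c):
    """Minutes for the air to climb from start_c to limit_c with no cooling at all."""
    return (limit_c - start_c) / air_temp_rise_c_per_min(heat_watts, air_volume_m3)


if __name__ == "__main__":
    # Hypothetical example: one 5 kW cabinet with 30 cubic metres of room air to itself,
    # starting at 22 C and shutting down at 35 C.
    minutes = minutes_until_limit(5000, 30, 22, 35)
    print(f"air-only estimate: {minutes:.1f} minutes")
```

Real rooms take longer than this air-only estimate because the equipment and building soak up heat too, but the order of magnitude is minutes, not hours.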

  • by Dynedain (141758) <slashdot2.anthonymclin@com> on Tuesday November 13, 2007 @12:08PM (#21337899) Homepage
    Actually this brings up an interesting point of discussion for me at least. Our office is doing a remodel and I'm specifying a small server room (finally!) and the contractors are asking what AC unit(s) we need. Is there a general rule for figuring out how many BTUs of cooling you need for a given wattage of power supplies?
  • Re:Which only shows (Score:3, Interesting)

    by lb746 (721699) on Tuesday November 13, 2007 @12:18PM (#21338051)
    I actually use a vent duct to suck in cold air from outside during the winter to help cool a server in my house. Originally I was more concerned with random objects/bugs/leaves, so I made it a closed system (like water cooling) to help protect the actual system. It works nicely, but only for about 1/3 or less of the year when the temperature is cold enough to make a difference. I've always wondered about a larger scale version of something like this, such as the parent's suggestion of servers in a colder/arctic region.
  • Re:Which only shows (Score:3, Interesting)

    by Ironsides (739422) on Tuesday November 13, 2007 @12:18PM (#21338059) Homepage Journal
    Yes, actually. This was looked into by multiple companies during the late 90's. I'm not sure if any were ever built. I think one of the considerations was weighing the savings of not having to run chillers against the cost of getting fibre and power laid to the facility.
  • by trolltalk.com (1108067) on Tuesday November 13, 2007 @12:19PM (#21338077) Homepage Journal

    Believe it or not, in one of those "life coincidences", pi is a handy approximation. Take the number of watts your equipment, lighting, etc., use, multiply by pi, and that's the number of BTUs per hour of cooling you need. Don't forget to include 100 watts per person for body heat.

    It'll be 90F degrees outside, and you'll be a cool 66F.

  • Re:Which only shows (Score:3, Interesting)

    by Azarael (896715) on Tuesday November 13, 2007 @12:19PM (#21338081) Homepage
    Some data centers also have multiple incoming power lines (which hopefully don't have a single transformer bottleneck). Anyway, I know for sure that at least one data center in Toronto had 100% uptime during the big August 2003 blackout, so it is possible to prevent these problems.
  • by MROD (101561) on Tuesday November 13, 2007 @12:21PM (#21338099) Homepage
    I've never understood why data centre designers haven't used a different cooling strategy to re-circulated cooled air. After all, for much of the temperate latitudes for much of the year the external ambient temperature is at or below that needed for the data centre so why not use conditioned external air to cool the equipment and then exhaust it (possibly with a heat exchanger to recover the heat for other uses such as geothermal storage and use in winter)? (Oh, and have the air-flow fans on the UPS.)

    The advantage of this is that even in the worst-case scenario, where the chillers fail totally during mid-summer, there is no runaway, closed-loop, self-reinforcing heat cycle. The data centre temperature will rise, but it will do so more slowly, and the maximum equilibrium temperature will be far lower (and dependent upon the external ambient temperature).

    In fact, as part of the design for the cluster room in our new building I've specified such a system, though due to the maximum size of the ducting space available we can only use this for half the heat load.
  • Re:Which only shows (Score:4, Interesting)

    by blhack (921171) * on Tuesday November 13, 2007 @12:21PM (#21338111)
    I think the problem is availability of power. When you are talking about facilities that consume so much power that, when built, their proximity to a power station is taken into account, you can't just slap one down at the poles and call it good. I would imagine that lack of bandwidth is a MAJOR issue as well... One field where I think storing servers at the poles would be amazing is supercomputing. Supercomputers don't require the massive amounts of bandwidth that webservers etc. do. You send a cluster a chunk of data for processing, it processes it, and it gets sent back. For really REALLY large datasets (government stuff)... just fill a jet with hard disks and have it at the server center in a few hours.
  • Re:Which only shows (Score:3, Interesting)

    by afidel (530433) on Tuesday November 13, 2007 @12:22PM (#21338137)
    It sounds like they DID have backup power for the cooling but that the switchover to backup power caused some problems. This isn't really all that unusual, because cooling is basically never on UPS power, so the transition to backup power may not go completely smoothly unless everything is set up correctly, tested, and there are few or no unusual circumstances during the switchover. I've seen even well designed systems have problems in the real world. One time we lost one leg of a three-phase power system, so the automatic transfer switch failed to flip over and start up the generator. The UPS realized it wasn't getting good power, so it flipped over to battery power. Luckily the UPS sent out its notification and we were able to manually switch over to generator and get the cooling online, but there is almost no chance of that working if we only had 3 minutes to get things corrected.
  • Re:Which only shows (Score:4, Interesting)

    by NickCatal (865805) on Tuesday November 13, 2007 @12:31PM (#21338275)
    I can't stress this enough. When I talk to people about hosting and they rely on 100% availability they NEED to go with geographically diverse locations. Even if it is a single backup somewhere you have to have something.

    For example, Chicago's primary datacenter facility is in 350 E. Cermak (right next to McCormick Place) and the primary interconnect facility in that building is Equinix (which has the 5th and now 6th floors.) A year or so ago there was a major outage there (that mucked up a good amount of the internet in the midwest) when a power substation caught on fire and the Chicago Fire Department had to shut off power to the entire neighborhood. So the backup system started like it should, with the huge battery rooms powering everything (including the chillers) for a bit while the engineers started up the generators. Only thing is, the circuitry that controls the generators shorted out, so while the generators themselves were working, the UPS was working, the chillers were working, this one circuit board blew at the WRONG moment. And this isn't the only time this circuit has been used, they test the generators every few weeks.

    Long story short, once the UPSes started running out of power the chillers started going, lights flickered, and for a VERY SHORT period of time the chillers went out before all of the servers did. Within a minute or two it got well over 100 degrees in that datacenter. Thank god the power cut out as quick as it did.

    So yes, Equinix in that case did everything by the book. They had everything setup as you would set it up. It was no big deal. But something went wrong at the worst time for it to go wrong and all hell broke loose.

    It could be worse, your datacenter could be hit by a tornado [nyud.net]

  • by Leebert (1694) on Tuesday November 13, 2007 @12:39PM (#21338397)
    A few weeks ago the A/C dropped out in one of our computer rooms. I like the resulting graph: http://leebert.org/tmp/SCADA_S100_10-3-07.JPG [leebert.org]
  • by Animats (122034) on Tuesday November 13, 2007 @12:42PM (#21338447) Homepage

    Most large refrigeration compressors have "short-cycling protection". The compressor motor is overloaded during startup, and needs time to cool. So there's a timer that limits the time between two compressor starts. 4 minutes is a typical delay for a large unit. If you don't have this delay, compressor motors burn out.

    Some fancy short-cycling protection timers have backup power, so the "start to start" time is measured even through power failures. But that's rare. Here's a typical short-cycling timer. [ssac.com] For the ones that don't have backup power, like that one, a power failure restarts the timer, so you have to wait out the timer after a power glitch.

    The timers with backup power, or even the old style ones with a motor and cam-operated switch, allow a quick restart after a power failure if the compressor was already running. Once. If there's a second power failure, the compressor has to wait out the time delay.

    So it's important to ensure that a data center's chillers have time delay units that measure true start-to-start time, or you take a cooling outage of several minutes on any short power drop. And, after a power failure and transfer to emergency generators, don't go back to commercial power until enough time has elapsed for the short-cycling protection timers to time out. This last appears to be where Rackspace failed.

    Dealing with sequential power failures is tough. That's what took down that big data center in SF a few months ago.
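The timer behaviour described above can be sketched as a tiny model. This is only an illustration under stated assumptions: the class name, its methods, and the 240 second default are made up for the example (the 4 minutes comes from the comment), and real controllers will differ.

```python
class AntiShortCycleTimer:
    """Toy model of a compressor short-cycling protection timer (hypothetical API)."""

    def __init__(self, min_start_to_start_s=240.0, keeps_time_through_outage=False):
        self.delay = min_start_to_start_s
        self.keeps_time = keeps_time_through_outage
        self.reference = None  # moment from which the delay is counted

    def compressor_started(self, now):
        # Every start begins a new start-to-start interval.
        self.reference = now

    def power_restored(self, now):
        # A timer without backup power forgets the last start time,
        # so the full delay is counted again from the moment power returns.
        if not self.keeps_time:
            self.reference = now

    def earliest_restart(self, now):
        # The compressor may start once the delay since the reference has elapsed.
        if self.reference is None:
            return now
        return max(now, self.reference + self.delay)


if __name__ == "__main__":
    for keeps_time in (False, True):
        timer = AntiShortCycleTimer(keeps_time_through_outage=keeps_time)
        timer.compressor_started(0)      # chiller has been running for an hour
        timer.power_restored(3605)       # first power bump, 5 s long
        first = timer.earliest_restart(3605)
        timer.compressor_started(first)
        timer.power_restored(first + 50)  # second bump shortly after the restart
        second = timer.earliest_restart(first + 50)
        print(keeps_time, first - 3605, second - (first + 50))
```

In this model the timer that keeps time restarts immediately after the first bump but has to wait out most of the delay after the second one, while the timer without backup power waits the full delay every time.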

  • by arth1 (260657) on Tuesday November 13, 2007 @01:17PM (#21338989) Homepage Journal
    (Disregarding your blatant karma whoring by replying to the top post while changing the subject)

    There's several good reasons why the servers are located where they are, and not, say, in Alaska.
    The main one is light speed through fiber, and a cable from Houston to Fairbanks would induce a best case of around 28 ms latency, each way. Multiply by several billion packets.

    This is why hosting near the customer is considered a Good Thing, and why companies like Akamai have made it their business of transparently re-routing clients to the closest server.
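The latency figure can be sanity-checked with a short calculation. This is a minimal sketch, assuming a straight great-circle fiber run and a refractive index of about 1.47 for silica fiber. The city coordinates are my own approximations, and a real cable route would be longer.

```python
import math

EARTH_RADIUS_KM = 6371.0
C_VACUUM_KM_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for silica fiber


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_fiber_latency_ms(distance_km, refractive_index=FIBER_REFRACTIVE_INDEX):
    """Propagation delay only; ignores routers, repeaters, and queuing."""
    speed_km_s = C_VACUUM_KM_S / refractive_index
    return distance_km / speed_km_s * 1000.0


if __name__ == "__main__":
    # Approximate coordinates (assumed): Houston and Fairbanks.
    distance = great_circle_km(29.76, -95.37, 64.84, -147.72)
    print(f"{distance:.0f} km, {one_way_fiber_latency_ms(distance):.1f} ms one way")
```

Under these assumptions the result comes out in the same range as the roughly 28 ms quoted above, before any routing detours or equipment delays are added.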

    Back to cooling. A few years ago, I worked for a telephone company, and the local data centre there had a 15 degree C ambient baseline temperature. We had to wear sweaters if working for any length of time in the server hall, but had a secure normal temperature room outside the server hall, with console switches and a couple of ttys for configuration.
    The main reason why the temperature was kept so low was to be on the safe side -- even if a fan should burn out in one of the cabinets, opening the cabinet doors would provide adequate (albeit not good) cooling until it could be repaired, without (and this is the important part) taking anything down.
    A secondary reason was that the backup power generators were, for security reasons, inside the server hall itself, and during a power outage these would add substantial heat to the equation.
  • by Nf1nk (443791) <nf1nk@noSpAM.yahoo.com> on Tuesday November 13, 2007 @01:18PM (#21339007) Homepage
    Personal energy output is a function of a number of variables, but the most important are the ambient temperature and the movement of air through the room. The 100 watts per person is a conservative estimate based on a roughly 75 F room.

    The Prof in a box experiment has a large issue that contributes to error. He is breathing through a tube, but the heat exchange in your lungs is a convection exchange and has too large a magnitude to ignore. If you have doubts about how much heat flows out through breathing, next time you are cold in bed pull the covers up over your head and breathe under the covers. You will find that the bed gets nice and warm in a very short time.
  • Re:Which only shows (Score:3, Interesting)

    by spun (1352) <loverevolutionary&yahoo,com> on Tuesday November 13, 2007 @01:52PM (#21339547) Journal
    Hmph. We have backup power for the cooling in our server room, but we had to deal with a fun little incident two weeks ago. Trane sent out a new HVAC monkey a month ago for routine maintenance. I was the one who let this doofus in, and let me tell you, he was a slack-jawed mouth-breathing yokel of tender years. He took one look at our equipment and said, I quote, "I ain't never seen nutin' like this'un before, hee-yuck!" I was a bit taken aback, but he seemed to go through all the proper motions.

    Fast forward to three weeks ago. The temp is fine, but the humidity keeps going down. I tell management, but this is a state agency and everything around here takes three times as long as it should. For a state agency, that's outstanding, by the way. Anyway, nothing gets done. Then we find out WHY the humidity is going down: seems the HVAC monkey didn't screw in the water bottle all the way and the entire 5 ton fills up with water, until it shorts out at 4 pm on a Friday afternoon and dumps water everywhere.

    Well, we got our four emergency portable coolers in with little tubes leading out into the hall, the fans on, and the doors open right quick, but the temp still shot up to over 100 in under ten minutes. Well, I told them something was up, and anyway, I'm on the VMware/BladeCenter server consolidation team, and this is just more of an argument to fund us better. But I guess the moral of the story is, don't let slack-jawed mouth-breathing yokels fix your mission critical systems.
  • Re:Which only shows (Score:3, Interesting)

    by afidel (530433) on Tuesday November 13, 2007 @01:59PM (#21339649)
    Ok, I specifically said UPS power, as in it takes time to spin up the generators, and switching from one source to the other does not always go perfectly in the real world. One factor is minimum cycle time on the compressors. The 3 minute time frame was from TFA, which says that at a density of 5 kVA per cabinet thermal shutdown can happen in 3 minutes due to thermal load.

    Oh and as far as the one leg collapsing thing, yes we were VERY pissed at everyone involved in that little problem. It turns out it was a design flaw in the transfer switch. Because it happened during the day we ended up taking more of an outage for replacement of the switch than we did from the incident, but it just proves that even a well designed system can have problems. That datacenter was small enough to only have single source power. My current datacenter has dual feed including dual generator and fully redundant cooling, so a single transfer switch malfunction wouldn't take it down, but you have to work within the parameters set by budget and need.
  • by trolltalk.com (1108067) on Tuesday November 13, 2007 @02:02PM (#21339695) Homepage Journal
    Think for 2 secs ... each kW of electricity eventually gets converted to heat. Resistive heating generates ~3,400 BTUs per hour per kilowatt, so multiplying electrical consumption in watts by pi gives you a decent cooling capacity in BTUs per hour. Add an extra 10% and you're good to go (you *DO* remember to add in a fudge factor of between 10 and 20% for "future expansion", right?)
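For anyone who wants to plug in their own numbers, here is a minimal sketch of the rule of thumb next to the exact conversion (1 W is about 3.412 BTU/h). The function names and the 5 kW example load are made up for illustration. The 100 W per person and the 10 to 20% margin come from the comments above.

```python
import math

BTU_PER_HOUR_PER_WATT = 3.412  # 1 W of dissipated power is about 3.412 BTU/h
WATTS_PER_PERSON = 100         # body-heat allowance suggested earlier in the thread


def cooling_btu_per_hour(equipment_watts, people=0, headroom=0.10):
    """Estimate required cooling in BTU/h.

    equipment_watts: total electrical draw of servers, lighting, etc.
    people: number of people normally in the room
    headroom: fractional margin for future expansion (0.10 to 0.20 in the thread)
    """
    total_watts = equipment_watts + people * WATTS_PER_PERSON
    return total_watts * BTU_PER_HOUR_PER_WATT * (1 + headroom)


def pi_rule_btu_per_hour(equipment_watts, people=0):
    """The thread's shortcut: multiply watts by pi, with no margin added."""
    return (equipment_watts + people * WATTS_PER_PERSON) * math.pi


if __name__ == "__main__":
    load = 5000  # hypothetical 5 kW server room with one person in it
    print(f"exact + 10%: {cooling_btu_per_hour(load, people=1):,.0f} BTU/h")
    print(f"pi shortcut: {pi_rule_btu_per_hour(load, people=1):,.0f} BTU/h")
```

Note that pi on its own comes out roughly 8% below the exact conversion factor, which is one more reason to add the margin rather than treat the shortcut as conservative.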
  • by markjl (151828) on Tuesday November 13, 2007 @02:40PM (#21340281)

    Disclaimer: I work with SGI, so I can shed some light on their customer's perspective (NASA, gov't, research labs, etc.) and solution to this problem.

    The increasing density of servers is exacerbating the problem of power and cooling in every data center. This week is the SuperComputing trade show [supercomputing.org], where the new edition of the top 500 supercomputers [top500.org] list was released with "Big Turnover Among the Top 10 Systems," and where you can see the first examples of systems built to address these issues.

    SGI's new ICE blade system was launched a few months ago. It was designed to address the power consumption, real estate density, and cooling issues everyone will probably experience on their next server cycle. ICE has shipped and one installation is now #3 on the Top 500. It's a welcome sign that SGI is back from bankruptcy. I'm sorry if this seems like an advert, so I'm not going to link to SGI -- you can go find out more easily if you want.

  • Re:Which only shows (Score:3, Interesting)

    by Critical Facilities (850111) on Tuesday November 13, 2007 @05:18PM (#21342547) Homepage

    Although our servers are on uninterrupted power (same as the Air Con)


    I guarantee your HVAC systems are NOT on UPS power. If by some massive failure during construction and commissioning they were and it was missed, I'd recommend firing your entire engineering department and any development contractors involved with building and maintaining your facility. There is no reason to put HVAC systems (chillers, pumps, air handlers, CRACs) on UPS as they can all manage just fine with losing their power and restarting once power is restored (either from utility or generator). To subject your UPS system(s) to the massive inrush current that would occur when various HVAC component loads are thrust on it would be....well, stupid at best.

    Your power systems sound pretty consistent with what is in most Data Centers (the "Essential Power" is often referred to as Emergency Power in Data Center environments). 30 seconds is a pretty good turnaround time for generators to start up, although 15 seconds is better (and very attainable).

    So to answer your question, no, Data Centers do not have a "slacker" design than hospitals. They are actually quite similar in their requirements in terms of HVAC and of course power.
  • by RockDoctor (15477) on Wednesday November 14, 2007 @04:15AM (#21347369) Journal

    the local data centre there had a 15 degree C ambient baseline
    Well that's just incompetent. For one thing, commercial electronics experience increased failure as you move away from an ambient 70 degrees F regardless of which direction you move. Running them at 59 degrees F (15 C) is just as likely to induce intermittent failures as running it at 80 degrees F.
    I was considering asking why the GP poster was bothering with a sweater when working (as opposed to sleeping) in his server room at 15 centigrade, but decided that he must just be one of those people who can't stand normal temperatures. But electrical engineers know that a lot of their equipment is going to be used in "ambient" conditions which are not the "ambient" of their climate-controlled office. In my work, for example, 20C would be an abnormally hot temperature for our sensor equipment; -20C would be by no means unknown; -50C quite credible. On the other hand, some of our analytic equipment has to run for months between service visits at +50C in 90%+ condensing humidity and with forced ventilation carrying salt dust and oil spray. You design your equipment for the conditions that it's going to face, not the conditions in your office today.
    Additionally, you appear to be conflating the air temperature in the data centre (15C) with the temperature of the components. Since having a heat flux requires having a thermal gradient, then the components will be warmer than your heat sink.
    In this town, we can tell the nationality of the boss of any office instantly on walking in - European bosses keep the HVAC (heating, ventilation, air-conditioning, or climate control) set to about 20C; American bosses have it re-set to 25C (until over-ruled for wasting money).

    For another, you're supposed to design your cooling system to accommodate all of the planned heat load in the environment. If your generators will be adding heat then the A/C needs to have sufficient capacity to take that heat back out.
    There's an Indian HVAC company (in Abu Dhabi), and an instrumentation engineer (last heard of in Houston, America) who need to be taught this lesson. Again. If you meet them, please apply the clue-bat before agreeing to take the equipment they design out to the Empty Quarter to rig it up.

    [your generators] should be walled off from the data center with exterior air exchange. Otherwise an error in the exhaust ducting risks killing your operators with CO poisoning.
    Your carbon dioxide flood for fire suppression would be as effectively lethal. Operators would need to be kept out of the controlled zone while enclosed generators are running; the fire suppression system should be overridden while operators are in the controlled zone, or you need to be rigged up with cascade air supplies and work-pack SCBA while working in the control zone. This isn't rocket science - there are plenty of corpses that point the way to proper management of work in potentially lethal atmospheres. (Of course, there are plenty of work places that like to cut corners and put their workers at risk. Don't work there and do report them to the relevant authorities.)
