Data Center Power Failures Mount

1sockchuck writes "It was a bad week to be a piece of electrical equipment inside a major data center. There have been five major incidents in the past week in which generator or UPS failures have caused data center power outages that left customers offline. Generators were apparently the culprit in a Rackspace outage in Dallas and a fire at Fisher Plaza in Seattle (which disrupted e-commerce Friday), while UPS units were cited in brief outages at Equinix data centers in Sydney and Paris on Thursday and a fire at 151 Front Street in Toronto early Sunday. Google App Engine also had a lengthy outage Thursday, but it was attributed to a data store failure."
  • Outages (Score:3, Interesting)

    by Solokron ( 198043 ) on Monday July 06, 2009 @08:27PM (#28602049) Homepage
    Outages happen more often than that. We have been in several data centers; ThePlanet and The Fortress have both had major outages in the last two years that affected business.
  • Re:Outages (Score:5, Interesting)

    by JWSmythe ( 446288 ) <jwsmytheNO@SPAMjwsmythe.com> on Monday July 06, 2009 @09:05PM (#28602421) Homepage Journal

        I've had equipment in, and/or worked in, many datacenters over the last decade or so. I've worked with even more clients who have had equipment in other datacenters.

        I've only experienced 3 power-related outages that I can think of.

        One was a brownout in that area, which cooked the contactors that switched between grid power and their own DC room.

        One was an accident, where a contractor accidentally shorted out a subpanel, and took out about a row of cabinets. I was there for that one. I saw the flash out of the corner of my eye, and by the time I turned my head, he was just flying into the row of cabinets.

        One was a mistake in the colo, where there was a mislabeled circuit, so they cut power to 1/3 of one of our racks.

        There have been even more outages related to connectivity problems. One major provider, who was just terrible (and is now out of business), had a fault at least once a week. Every time we called, they said "there was a train derailment that cut a section of fiber in [arbitrary state], which affected their whole network." It was funny at first, but annoying once we started questioning them about why there was no news about all these train derailments. We had to make up our own excuses for the customers, because we couldn't keep telling them the BS story the provider gave. We were smart about it, though: we at least had decent excuses, and the whole staff knew which BS story to give on a particular day. The sad part was, we had a T3, and that was huge at the time.

        At my last job, they wanted a full post-mortem done on any fault. If a customer across the country suffered bad latency or packet loss, it was our job to find out why and "fix" it. Management wouldn't accept that there are 3rd-party providers who handle some of the transit. So we'd call our provider demanding that it be fixed (which they couldn't do), then call the broken provider (who hung up, since we weren't their customer), and then get reamed by the boss because we couldn't fix it. After a while, delay tactics worked best. If you "investigate" a problem long enough, and hold the phone up to your ear enough, it will likely be fixed by the people who actually can fix it. We'd still log a ticket with our provider, because the boss would eventually call the provider referencing the ticket number, and find out there was still nothing that could be done.

        There's pretty much guaranteed to be a fault of some sort between two points on the Internet every day. All anyone can really do is make sure the fault isn't with your own equipment. That's something I always did before calling to complain about anything. It's embarrassing to hear "did you reboot your router?" and have that turn out to really be the problem.

        The only real solution to this is redundancy. Not just in one facility, but across multiple facilities. If you spread things out enough, sure, an isolated problem will affect some people, but not everyone. If you want a service to be reliable, redundant machines in each datacenter are the only way to go. When I was running the network (and everything technical) at one job, a datacenter outage wasn't a concern, it was just a minor annoyance. I filed a trouble ticket and told them to call me when it was fixed. We'd demand reimbursement for the outage time, and make them handle the difference on our 95th percentile bandwidth charges at the end of the month. I wasn't going to take a hit on the bill just because they had an outage in one city and my other cities had to take the traffic during the outage. When your bill is measured in multiple Gb/s, you have a little more say in how they handle the billing. :)
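
        For anyone who hasn't dealt with burstable billing, here's a minimal sketch of how the 95th percentile charge is usually computed (the sample numbers below are made up; real providers record a 5-minute average in Mb/s all month long):

            # Sort a month's worth of 5-minute bandwidth samples, throw away the
            # top 5%, and bill at the highest remaining sample.
            samples_mbps = [120, 135, 180, 950, 140, 160, 175, 130, 145, 155,
                            165, 170, 185, 190, 200, 125, 138, 142, 158, 300]

            samples_mbps.sort()
            cutoff = int(len(samples_mbps) * 0.95)
            billable_mbps = samples_mbps[cutoff - 1]
            print(f"billable rate: {billable_mbps} Mb/s (the 950 Mb/s spike is discarded)")

        The reason to make the provider eat the difference: a short spike falls into the discarded top 5%, but traffic shifted onto a city for days on end raises the 95th percentile itself.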

  • Re:Be Redundant! (Score:5, Interesting)

    by JWSmythe ( 446288 ) <jwsmytheNO@SPAMjwsmythe.com> on Monday July 06, 2009 @09:15PM (#28602505) Homepage Journal

        Be nice; people don't read the books or the RFCs any more.

        At the biggest operation I ran, I had redundant servers in multiple cities, and DNS servers in each city. If we lost a city, it was never a big deal, other than the others needing to handle the load. With, say, 3 cities, a one-city outage only added about 16.6% of total traffic to each of the other two. Each city was set up to handle >100% of the typical peak-day traffic, so it was never a big deal. I don't think we ever suffered a two-city simultaneous failure, though we did simulate failures by shutting down a city for a few minutes. Testing days were always my favorite. I loved to prove what we could or couldn't do. I peaked out one provider in a city once: we had the capacity as far as the lines went, but they couldn't handle the bandwidth. When they argued, I dumped the other two cities' traffic onto the one in question, and they were soon begging me to stop. "Oh, so there is a fault. Care to fix it?"
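
        The arithmetic behind that 16.6% figure, as a rough sketch (shares are fractions of total traffic, with every city assumed to carry an equal load):

            # With n equally loaded cities, a failed city's 1/n share is split
            # among the n - 1 survivors, so each one absorbs 1/(n*(n-1)) of the
            # total traffic.
            def extra_share_per_survivor(n_cities: int) -> float:
                return 1.0 / (n_cities * (n_cities - 1))

            print(extra_share_per_survivor(3))   # ~0.166 -> the 16.6% mentioned above
            print(extra_share_per_survivor(4))   # ~0.083 -> a fourth city roughly halves the hit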

        I could quantify anything (and everything) at that place. I could tell you a month or so in advance what the peak bandwidth would be on a given day, and how many servers of which class we needed to have operating to handle it. I classed servers by CPU and memory, which in turn gave how many users and how much bandwidth each could handle. I only wanted our machines to ever peak out at 80%, but sometimes it was fun to run them up through 100%. I set the limits a little low, so we could run at, say, 105% without a failure.
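
        A hypothetical sketch of that head-count math (the per-class numbers here are invented, not the ones we actually used):

            import math

            # Peak capacity per server, by class, in Mb/s -- illustrative only.
            CLASS_MBPS = {"small": 200, "medium": 450, "large": 900}
            TARGET_UTILIZATION = 0.80   # never plan to run a box past 80%

            def servers_needed(forecast_peak_mbps, server_class):
                usable = CLASS_MBPS[server_class] * TARGET_UTILIZATION
                return math.ceil(forecast_peak_mbps / usable)

            print(servers_needed(5000, "medium"))   # 14 medium boxes for a 5 Gb/s peak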

        Such information let us spot a server problem before it turned into a failure. I'd notice a server running 10% low, which really meant it was going to fail. We'd watch it for a little while, and it would. :) We'd power it down and leave it in the datacenter until we had another scheduled site visit.
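
        The "running 10% low" check is simple enough to sketch (hostnames and numbers below are made up):

            # Compare each server's current throughput to its own recent baseline
            # and flag anything that drops below 90% of normal for a closer look.
            baseline_mbps = {"www1": 340, "www2": 355, "www3": 348}
            current_mbps  = {"www1": 338, "www2": 310, "www3": 352}

            THRESHOLD = 0.90

            for host, usual in baseline_mbps.items():
                if current_mbps[host] < usual * THRESHOLD:
                    drop = 100 * (1 - current_mbps[host] / usual)
                    print(f"{host} is running {drop:.0f}% below baseline -- watch it")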

  • by asackett ( 161377 ) on Monday July 06, 2009 @09:16PM (#28602519) Homepage

    ... saying that it's time to reconsider cost-cutting measures. In 15 years in the field, I never saw a well-designed and well-maintained critical power system drop its load. I saw many poorly designed and/or poorly maintained systems drop loads, even catching fire in the process. One such fire in a poorly designed and poorly maintained system took the entire building with it, data center and all. The fire suppression system in that one was never upgraded to meet the needs of the "repurposed space," which was originally light industrial/office space.

  • by Anonymous Coward on Monday July 06, 2009 @09:21PM (#28602549)

    See the recent ZDNet story on the Qld Health datacentre disaster:
    http://www.zdnet.com.au/news/hardware/soa/Horror-story-Qld-Health-datacentre-disaster/0,130061702,339297206,00.htm

  • by Anonymous Coward on Monday July 06, 2009 @09:30PM (#28602595)

    Because out of all of the data centers in the world, there were problems at five? Riiiiight. Good reporting, Slashdot.

    Can I sign up for broken water main notices here, too, or do I need to go to another website?

    100+ million people are "serviced" daily by these 5 data centers.

    Companies such as authorize.net were COMPLETELY unavailable for payments to hundreds of thousands of webmasters' sites (ya know, the people who make money).

    If you don't think this is serious news then you are still living at home.

    Ya that's what I thought.

  • by Dirtside ( 91468 ) on Monday July 06, 2009 @10:05PM (#28602833) Journal
    ...what is the normal (historical) rate of data center power failures, and how does the recent spate compare? Five in a week sounds severe, but what's the normal worldwide average? I can imagine that with thousands of data centers around the globe, there's likely a serious failure occurring somewhere in the world once every couple of days.
  • Power Fail Often (Score:3, Interesting)

    by blantonl ( 784786 ) on Monday July 06, 2009 @10:38PM (#28603111) Homepage

    Frankly, if data centers are going to proclaim their redundancy, they should test it by power-failing the entire data center at least once every two weeks. A data center that goes down twice in a month would get ahead of any issues pretty fast. The lessons learned by the staff and the management are very valuable.

    The marketing messaging:

    "We power fail our data center every two weeks to ensure our backups work..."

    Sound scary? Just think about the data center that has never been through this process. At that point, the wet paper bag you tried to market your way out of has dried rather quickly, and you're now faced with the prospect of slapping around inside a zip-lock.

  • Re:Be Redundant! (Score:3, Interesting)

    by ls671 ( 1122017 ) * on Monday July 06, 2009 @11:25PM (#28603507) Homepage

    The best solution for big outfits is to have at least this setup:

    1) One party being the main contractor. This party doesn't do ANY hosting per se, but only manages the fail-over strategy, doing the relevant testing once in a while.

    2) A second party being involved in hosting and managing data centers.

    3) A third party, completely independent from party 2 (a competitor of party 2 is preferable), which also does hosting and manages data centers.

    It is the same principle when you bring redundant internet connectivity to a building:

    1) Have the fiber from one provider come into the building from, say, the north side of the building.

    2) Have a competitor, unrelated business-wise and not using the same upstream providers, bring its fiber in from the south side of the building.

    Putting all your eggs in the same basket by dealing with only one business entity constitutes a less robust solution.
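
    As a minimal sketch of the kind of check party 1 could run against both hosting parties (the hostnames are hypothetical, and a real fail-over test would obviously go further than a TCP connect):

        import socket

        # One test endpoint per hosting party, each reached over that party's own path.
        ENDPOINTS = {
            "party2-dc": ("status.party2.example.net", 443),
            "party3-dc": ("status.party3.example.net", 443),
        }

        def reachable(host, port, timeout=3.0):
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        for name, (host, port) in ENDPOINTS.items():
            if not reachable(host, port):
                print(f"{name} unreachable -- trigger the agreed fail-over procedure")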

  • Sunspots, Anyone? (Score:3, Interesting)

    by Craig Milo Rogers ( 6076 ) on Tuesday July 07, 2009 @05:24AM (#28605331) Homepage

    All these data centers failed at roughly the same time as the sunspots returned [space.com], but that's just a coincidence, right?

"Engineering without management is art." -- Jeff Johnson

Working...