Multiple Sites Down In SF Power Outage 423
corewtfux writes with word of a major outage apparently centered on 365 Main, a datacenter on the edge of San Francisco's Financial District. Valleywag initially claimed that a drunken person had gotten in and damaged 40 racks, but an update from Technorati's Dave Sifry says the problem is a widespread power outage. Sites affected include Technorati, Netflix (these display nice "We're Dead" pages), Typepad, LiveJournal, Sun.com, and Craigslist (these just time out).
I work in the Financial District (Score:5, Interesting)
Comment removed (Score:5, Funny)
Oblig.... (Score:5, Funny)
trashin ur racks
Re:Oblig.... (Score:5, Funny)
>
> trashin ur racks
Lizzie Borden did teh h4x,
Got drunk and unplugged 40 racks.
When she saw what she had done,
She unplugged number 41.
(Lawn. Off. Git.)
Re:Oblig.... (Score:5, Funny)
Millions were paged, and cried out in despair (Score:3, Interesting)
http://tastic.brillig.org/~jwb/dorks.jpg [brillig.org]
Redundant? (Score:5, Insightful)
Re: (Score:2)
Re: (Score:3, Funny)
Perhaps they could begin their vengeful wrath by hiring a few (more?) winos...
They should run this on OLPCs (Score:2)
Re:Redundant? (Score:5, Informative)
Re:Redundant? (Score:5, Funny)
I think I see the flaw in your plan.
Re:Redundant? (Score:5, Interesting)
Sun.com going down is a good example of someone totally screwing up. They have absolutely NO excuse. The others - maybe they can get away with it and we won't care. If Sun can't keep their own site up, how can I expect them to keep mine up?
Re:SAN? Huh? (Score:4, Interesting)
HP has a nice overview of building systems which can failover between widely distributed nodes called Designing Disaster Tolerant High Availability Clusters [hp.com]. It's a bit old, and is focused on ServiceGuard [hp.com], but is still interesting.
Re: (Score:2)
Re:Redundant? (Score:4, Insightful)
The thing is, letting something happen may be a better decision than trying to stop it.
If you're going to have a fully redundant setup, it's going to cost you twice as much as having just one. And if you don't make the backup fully sized, it's going to buckle under the full load of normal traffic the moment you fail over to it anyway.
The correct business decision might just be "I just saved a bunch of money on my data center insurance," and if you lose a day's business, oh well, that was still cheaper than keeping a backup data center around.
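That tradeoff can be put in rough expected-value terms. A back-of-the-envelope sketch; every figure below is invented for illustration, and it ignores harder-to-price costs like reputation:

```python
# Back-of-the-envelope comparison of "pay for full redundancy" vs. "eat the
# occasional outage". All numbers are made up for illustration.

redundant_site_cost  = 2_000_000   # extra cost per year of a second, fully mirrored site
outage_probability   = 0.5         # chance per year of a serious outage at the single site
outage_duration_days = 1           # expected length of each outage
revenue_per_day      = 1_500_000   # business lost per day of downtime

expected_outage_loss = outage_probability * outage_duration_days * revenue_per_day

if expected_outage_loss < redundant_site_cost:
    print(f"Self-insure: expected loss ${expected_outage_loss:,.0f}/yr "
          f"< redundancy cost ${redundant_site_cost:,.0f}/yr")
else:
    print(f"Build the second site: expected loss ${expected_outage_loss:,.0f}/yr "
          f"> redundancy cost ${redundant_site_cost:,.0f}/yr")
```

With these made-up numbers the "oh well, we lost a day" answer wins, which is exactly the parent's point; plug in your own figures and the answer can easily flip.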
Re:Redundant? (Score:5, Informative)
For what it's worth, the datacenter which is adjacent to 365 Main, called 360 Spear, did not suffer from this outage.
Other sites.. (Score:3, Informative)
Re: (Score:3, Informative)
Anyway, PG&E says it's over now, but they still don't have an explanation as to why. Shyeah (rolls eyes)
Re: (Score:2)
Redundant power supply? (Score:3, Interesting)
Re: (Score:2)
Re:Redundant power supply? (Score:4, Informative)
Re: (Score:3, Informative)
*I live out in the middle of nowhere and I get a power failure exceeding 5 minutes about once per year. The longest I've had at my current location was just over 2 hours.
Re:Redundant power supply? (Score:5, Interesting)
They have the HiTec rotary UPSs in all their facilities, which link a generator to a flywheel UPS. It's stupid not to have backup fuel for that type of system; the flywheel can only carry the load for about 13 seconds before the diesel has to pick it up, or the load crashes.
It's possible they took a number of short hits and the generators failed to restart after a few of them. Good procedure is to stay on generator until the utility stabilizes if you've had more than one "hit."
Be interesting to find out what happened.
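For a sense of scale on that 13-second figure, here's a rough ride-through estimate from flywheel kinetic energy. Every parameter below is invented for illustration and is not taken from the vendor's spec sheets:

```python
import math

# Rough ride-through estimate for a flywheel UPS.
# All flywheel parameters below are invented for illustration.
mass_kg  = 5000.0    # flywheel mass
radius_m = 0.75      # flywheel radius (solid disc assumed)
rpm_full = 1800.0    # normal operating speed
rpm_min  = 1600.0    # minimum speed that still holds output frequency/voltage
load_kw  = 400.0     # critical load carried by this unit

inertia = 0.5 * mass_kg * radius_m**2      # I = 1/2 m r^2 for a solid disc

def kinetic_kj(rpm):
    omega = rpm * 2 * math.pi / 60.0       # rad/s
    return 0.5 * inertia * omega**2 / 1000.0

usable_kj = kinetic_kj(rpm_full) - kinetic_kj(rpm_min)
ride_through_s = usable_kj / load_kw       # kJ divided by kJ/s
print(f"Usable energy: {usable_kj:,.0f} kJ -> ride-through ~{ride_through_s:.0f} s at {load_kw:.0f} kW")
```

With these toy numbers you get roughly 13 seconds, which is why the diesel has to be turning over almost immediately.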
Re: (Score:3, Informative)
There were 5 individual power failures, each no longer than 5 minutes, over a roughly 30 minute period. A couple of them were in quick succession.
Re:Redundant power supply? (Score:4, Funny)
Re: (Score:2)
Five minutes a year is nearing "five nines" of reliability (and the power feed isn't your only possible source of downtime, so you can't let it eat the whole budget on its own). I'm not sure whether their customers have "99.999% uptime guaranteed" in their contracts, but if so, I'm sure they did have their tanks in working order. Some old press releases of theirs [365main.com] are touting 100% uptime.
I realize that a press release from 2004 is hardly relevant, but this is slashdot... so here is a choice paragraph:
By surpassing the five-nin
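For reference, the arithmetic behind the "five minutes a year is nearing five nines" figure; a quick sketch:

```python
# Allowed downtime per year for N nines of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(2, 7):
    availability = 1 - 10 ** (-nines)            # e.g. 5 nines -> 0.99999
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability * 100:.4f}%): "
          f"{downtime_min:8.2f} minutes of downtime per year")
```

Five nines works out to about 5.26 minutes a year, so a single five-minute power event pretty much spends the entire annual budget.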
Re: (Score:3, Funny)
Well, how about one from today [365main.net] ?
Re: (Score:2)
also m
Re: (Score:2)
how many data centers? (Score:4, Interesting)
netflix.com is working (Score:2)
Re: (Score:3, Funny)
Protrade.com also down. (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
LiveJournal?? (Score:2, Funny)
Re:LiveJournal?? (Score:5, Funny)
Re: (Score:3, Funny)
The Scoop from SFGate.com (Score:3, Informative)
Re:The Scoop from SFGate.com (Score:5, Funny)
"Officials say the power outage may affect some websites, including the site that hosts Slashdot.org's preview button."
It all seems to be back up now.
- RG>
From Technocrati: (Score:5, Funny)
I think that's admin speak for:
I warned these idiots eight months ago during my review that the datacenter had outgrown its generator capacity. But did they listen? Fuck no, they just kept counting money and worrying about the bottom line. The beancounters looked at me like I'd asked them for a blowjob from their grandmothers when I submitted the workup for additional generator capacity. And now that the shit's hit the fan, whose ass are they screaming for? Screw this, I'm applying at Taco Bell.
Re: (Score:3, Insightful)
Whaddya bet some poor mid-level admin gets blamed and tossed for this? And the upper-management guy who ignored the recommendations for testing or redundancy still gets his bonus for good fiscal performance.
Re: (Score:3, Funny)
It was renamed to +1 Insightful to appease the people who hate curse words.
Re: (Score:2)
Thanks for the laughs, even if they led to a sad realization.
Re:From Technocrati: (Score:5, Funny)
Re:From Technocrati: (Score:5, Funny)
Tell my family I loved them.
- RG>
Netflix outage seems unrelated (Score:2)
Libel, anyone? (Score:2)
Someone came in shitfaced drunk, got angry, went berserk, and fucked up a lot of stuff. There's an outage on 40 or so racks at minimum.
Libel lawsuit in 3...2...
LOLcurrent (Score:3, Funny)
One Market going on and off (Score:2)
Just called a friend at One Market, the big office tower downtown at the end of Market Street, and she says the power has been going on and off there for hours. Building alarms were sounding, but nothing serious was happening other than power loss.
Kiss of Death? (Score:5, Funny)
UPS system - it's a Hytec flywheel/diesel combo (Score:4, Interesting)
Data sheet for 365 Main [365main.com]:
The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running.
They use a Hytec Continuous Power System [pageprocessor.nl], which is a motor, generator, flywheel, clutch, and Diesel engine all on the same shaft. They don't use batteries.
With this type of equipment, if for some reason you lose power and the generator doesn't start before the flywheel runs down, you're dead. There's no way to start the thing without external power. Unless you buy the optional Black Start feature [pageprocessor.nl], which has an extra battery pack for starting the Diesel. "Usually the black start facility will not be often needed but it won't hurt to consider installing one. Just imagine if you were unable to start up your UPS system because the mains supply is not available.". Did 365 Main buy that option?
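The parent's failure mode boils down to a simple decision: does the diesel come up before the flywheel runs down, and if not, is there any stored energy left on site to crank it? A toy sketch of that logic (not the vendor's actual control system; the states and wording are invented):

```python
# Toy model of a flywheel/diesel continuous-power system losing utility power.
# Not the vendor's actual control logic; outcomes are invented for illustration.

def outage_outcome(diesel_up_before_flywheel_runs_down: bool,
                   black_start_battery: bool) -> str:
    if diesel_up_before_flywheel_runs_down:
        return "load rides through: the flywheel carries it until the diesel picks up"
    if black_start_battery:
        return "load drops, but the black-start battery can crank the diesel and restore power"
    return "load drops and stays down: nothing on site can restart the system without utility power"

print(outage_outcome(diesel_up_before_flywheel_runs_down=True,  black_start_battery=False))
print(outage_outcome(diesel_up_before_flywheel_runs_down=False, black_start_battery=False))
print(outage_outcome(diesel_up_before_flywheel_runs_down=False, black_start_battery=True))
```

The middle case is the one you never want to be in, and it's exactly the one the optional Black Start package exists to avoid.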
Re:UPS system - it's a Hytec flywheel/diesel combo (Score:5, Interesting)
And this was in addition to the 48VDC battery backup.
In the entire history of electromechanical switching in the Bell System, no central office was ever down for more than 30 minutes for any reason other than a natural disaster. That record has not been maintained in the computer era.
If you have to build reliable systems, it's worth understanding electromechanical telephone switching. Because the components weren't that reliable, the systems had to be engineered so that the system as a whole was far more reliable than the components. Read up on Number Five Crossbar. [wikipedia.org] The Wikipedia article isn't really enough to understand the architecture, but other references are available.
Re:UPS system - it's a Hytec flywheel/diesel combo (Score:4, Informative)
The battery systems installed in the "bomb cages," as we called them because the larger ones were often underground and looked like a three-person bomb shelter, were quite impressive. Typically they were two full banks of twenty-four 2-volt, 375-amp batteries, each physically twice the size of a truck battery. They were most often lead-acid mammoths at the time, since lead-acid was reliable for a predictable period and inexpensive compared to lithium-ion cells of the same capacity.
The batteries were always rated by the manufacturer for a 10-year life, but the telephone companies had tested them in real-world environments and rotated the cells out at 4-year intervals instead, since network downtime to replace a power system was far more expensive than being prepared. After all, each of these cabinets would typically handle as many as 15,000 telephone lines and often contained fibre repeaters for the higher-speed lines connecting the boxes together and back to the central office.
The biggest problem with these installations was that if a single battery in a shipment showed signs of early fatigue, most typically visible as bubbling in the plastic walls, it was policy to replace the entire batch of cells immediately, not just the one displaying fatigue. The reasoning was that if one battery in the group showed fatigue, all the cells in the bank were probably susceptible to the same issue, whether that was a manufacturing screw-up, a cooling-system problem in the box, or any of a lot of other environmental factors.
It's really quite impressive what cost and effort the telephone company would go to just to maintain the UPS system and prevent issues with it, a system which, thankfully, rarely ever gets exercised in places where people are intelligent enough not to live on fault lines or in high-risk hurricane paths.
The greatest flaw in the design of the batteries systems was that they were always trickle-charged. The chargers were unintelligent and simply kept the batteries topped off. This caused "memory issues" as we're all familiar with, especially thanks to notebook batteries.
What we learned about the cells where I was doing the engineering was that if a cell could physically survive as long as 7 years without environment-related damage (bubbles), then it should be possible to detect the early stages of design-related fatigue within a single cell.
We also found that if a weekly or monthly power cycle of a bank of cells were to be performed, the batteries would last substantially longer than the 4 years expectancy. So, in the case of Bomb Cages where at least two full banks of cells were available (that's pretty much a minimum configuration), on a proper schedule, using a huge-ass resistor bank, we would fully drain a bank of cells until we could detect nearly 0 current across the resistor. Then we would perform a full charge on the cells again, monitoring each cell more than 10 times per second. Batteries that failed to charge in sync with the other cells were typically early replacement candidates.
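A rough sketch of that last check in code: sample every cell's voltage during the recharge and flag any cell that keeps lagging the rest of the bank. The thresholds, readings, and 24-cell example are invented for illustration; real gear reads per-cell sense leads many times a second:

```python
import statistics

# Toy version of "flag cells that fail to charge in sync with the rest of the bank".
LAG_THRESHOLD_V = 0.05   # how far below the bank median a 2 V cell may sit before we worry
STRIKES_TO_FLAG = 3      # consecutive lagging samples before a cell is a replacement candidate

def lagging_cells(samples):
    """samples: list of readings, each a list of per-cell voltages taken at one instant."""
    strikes = [0] * len(samples[0])
    flagged = set()
    for reading in samples:
        median_v = statistics.median(reading)
        for i, v in enumerate(reading):
            if median_v - v > LAG_THRESHOLD_V:
                strikes[i] += 1
                if strikes[i] >= STRIKES_TO_FLAG:
                    flagged.add(i)
            else:
                strikes[i] = 0
    return sorted(flagged)

# Example: a 24-cell bank where cell 7 charges noticeably slower than its siblings.
readings = [[2.10] * 24 for _ in range(5)]
for t, sag in enumerate([0.12, 0.11, 0.10, 0.09, 0.08]):
    readings[t][7] = 2.10 - sag
print("Replacement candidates:", lagging_cells(readings))
```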
Well, all that being said, one thing I'm 100% confident of is that data centers lack the experience and the interest to budget this kind of research for their systems. The telephone companies are amazingly well prepared in comparison.
On a side note, just last week I installed my first 48V DC powered RAID rack. I designed a high-efficiency hard drive case that contained no fans. Each case was 1U and shallow enough to install two back-to-back in a rack. We installed 96 units in a single rack with 4 drives each and no air conditioning in the room. The design was extremely simple.
1) Use Telco
No wonder technorati wasn't working for me... (Score:2)
So I edited my hosts.conf so technorati points at my localhost.
Can't say that's degraded my blog-reading experience in the least.
Valleywag's Guess (Score:2, Funny)
July 24th: RedEnvelope Press Release by 365 Main (Score:4, Interesting)
It was released today....
About Emergency Power (Score:5, Informative)
As to diesel storage, diesel is widely used for emergency power everywhere from hospitals to emergency services. Those systems are run regularly - typically weekly. Biocides, stabilizers, mobile fuel-scrubbing services, and extra filtration can maintain the fuel quality. Our colo currently maintains a one-week fuel supply and has multiple quick-refuel contracts in place. I can't imagine any colo having less than 24-48 hours in the tank with quick refill on call.
But one thing that is missing is cooling. Our colo has a typical contract that says something like blah-blah won't exceed 80F for more than 4 hours blah blah. OK, but a rack full of blade servers can crank out 15-20kW of heat load and a data center can heat up real quick without AC. By contract, 150F for 3.5 hours would be in-spec.
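To put numbers on how fast it gets ugly: a crude estimate that counts only the air's heat capacity (walls and the servers' own metal soak up heat too, so a real room rises more slowly). Room size and rack count are invented:

```python
# Rough estimate of temperature rise in a machine room with the cooling off.
# Only the air's heat capacity is counted; numbers are invented for illustration.

room_volume_m3   = 20 * 15 * 3        # a modest colo room
air_density      = 1.2                # kg/m^3
air_heat_cap     = 1.005              # kJ/(kg*K)
racks            = 50
heat_per_rack_kw = 15.0               # the parent quotes 15-20 kW for dense blade racks

air_mass_kg  = room_volume_m3 * air_density
heat_load_kw = racks * heat_per_rack_kw                  # kJ per second into the room
rise_per_min = heat_load_kw * 60 / (air_mass_kg * air_heat_cap)

print(f"{heat_load_kw:.0f} kW into {air_mass_kg:.0f} kg of air "
      f"-> roughly {rise_per_min:.0f} K per minute with no cooling at all")
```

Even allowing for all the thermal mass this ignores, a dense room goes from in-spec to thermal shutdown in minutes, not hours, which is why "power is up but the chillers aren't" is still an outage.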
"We're Dead" (Score:2, Funny)
Reply from 199.185.137.3: bytes=32 time=239ms TTL=236
Pinging freebsd.org [69.147.83.40] with 32 bytes of data:
Reply from 69.147.83.40: bytes=32 time=191ms TTL=47
Pinging netbsd.org [204.152.190.12] with 32 bytes of data:
Reply from 204.152.190.12: bytes=32 time=213ms TTL=241
Lost irony.
Google street view out? (Score:2)
Doesn't seem to be showing airborne/satellite images either.
I need to change my reading order (Score:2)
So then I checked my Netflix queue and couldn't get to it (I got a 404 error there, though, not a nice "we're dead" message) - two sites in a row made me think the problem might be local.
Good thing slashdot was my next stop, not one of the many others. I had no idea all those sites were run out of the same location in SF.
San Francisco has always seemed to m
Sun (Score:2)
Not that uncommon (Score:3, Interesting)
A client of mine had a number of servers in a Sterling, Virginia data center managed by Verio/NTT. It's a good data center and seems to be well-run.
Last September, the data center experienced two complete power failures in the span of three days. To their immense credit, data center management was straight with customers about what had happened. For those who might be interested, their statements about the problem appear here. [dedicatedserver.com]
My point? Make sure you know how to bring your systems back up from a completely cold start, and that you find a way to test this periodically. While we work to ensure that this sort of situation occurs rarely, the fact remains that these sorts of failures DO occur, and they're not as uncommon as the sales and marketing folks would like you to believe.
Phil
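One practical way to act on Phil's advice is to write the cold-start dependencies down where a script can check them and print the bring-up order you would follow (and rehearse). A minimal sketch; the service names and dependencies are made up:

```python
from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical cold-start dependencies: service -> things that must be up first.
deps = {
    "core-switches":  [],
    "storage-array":  ["core-switches"],
    "dns":            ["core-switches"],
    "database":       ["storage-array", "dns"],
    "app-servers":    ["database", "dns"],
    "load-balancers": ["app-servers"],
}

print("Cold-start order:", list(TopologicalSorter(deps).static_order()))
```

Keeping the graph in a file and regenerating the order beats a stale runbook, and it gives you something concrete to walk through during a scheduled cold-start drill.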
365 Main deletes press release about uptime (Score:3, Informative)
The press release "RedEnvelope Reports Two Years of Continuous Uptime at 365 Main's San Francisco Data Center", which was on the 365 Main web site earlier today, has disappeared from there. [365main.com]
But they sent the press release to PR Newswire, [prnewswire.com] and you can still read it there.
The word directly from 365 (Score:4, Informative)
At 1:49 p.m. on Tuesday, July 24, 365 Main's San Francisco data center was affected by a power surge caused when a PG&E transformer failed in a manhole under 560 Mission St.
An initial investigation has revealed that certain 365 Main back-up generators did not start when the initial power surge hit the building. On-site facility engineers responded and manually started the affected generators, allowing stable power to be restored at approximately 2:34 p.m. across the entire facility.
As a result of the incident, continuous power was interrupted for up to 45 mins for certain customers. We're certain colo rooms 1, 3 and 4 were directly affected, though other colocation rooms are still being investigated. We are currently working with Hitec, Valley Power Systems, Cupertino Electric and PG&E to further investigate the incident and determine the root cause.
All generators will continue to operate on diesel until the root cause of the event has been identified and corrected. Generators are currently fueled with over 4 days of fuel and additional fuel has already been ordered.
We understand the seriousness of this issue and will provide full details once they become available. We sincerely apologize for the impact this has had on your operations.
Regards,
Vice President, Security
365 Main
"The World's Finest Data Centers"
Just send me a big fat check and all is forgiven.
Re: (Score:3, Funny)
Wow, on-site engineers took 45 minutes just to be able to turn on generators? The generator for our facility has a master switch and a big green button. I think a monkey could get it running in 20 seconds by slinging poo at it. So, what other problems did they have that they aren't telling us? Someone else mentioned a flywheel system.
Re: (Score:2, Interesting)
Re:No Generators? (Score:5, Insightful)
If the "power outage" theory is correct and the "drunken employee" theory is incorrect, as a customer I'd be pissed that the data center I pay tons of money to can't keep my site up in the event of a power outage, which is one of the main perks of hosting at a data center in the first place.
Re:No Generators? (Score:5, Insightful)
For me it would be other way around. A technology failure I could understand. Letting a drunk employee near my server rack, I could not.
Re: (Score:3, Insightful)
Re: (Score:3, Funny)
*swipe*
*bip* *beep* *beep* *boop* *bleep*
[deep breath]
*whoosh*
Alcohol Level: 0.15
*beeeeeeeep*
Damnit!!
Re: (Score:3, Interesting)
Re: (Score:3, Informative)
I posted an ad the day BEFORE the outage and it never showed up on the site, nor in search.
On their status page (before the outage), they acknowledged they had problems and were promising to fix them sometime "before fall". Really competent...not.
If you have problems with your ad being pulled at random by idiots flagging it for lame excuses like all caps headlines (the rules say AVOID all caps, not "we will pull your ad for it"), the only recourse you have is to get sent to the help forum
Re: (Score:2)
Re:No Generators? (Score:4, Insightful)
I would think these large sites would understand the concept of not putting all your eggs (servers) in one basket. There is a reason why smart companies use replication and clustering, and datacenters spread across the country.
Re:No Generators? (Score:5, Funny)
Re: (Score:3, Informative)
Re:No Generators? (Score:4, Funny)
Sheepshagger Intel (Score:3, Funny)
I don't follow your math. Did you do it with an Intel chip, by any chance?
Poor Intel (boo hoo!), they messed up 13 years ago and people are still making jokes about it. Reminds me of the old joke (stolen from here [bofh.org.uk]):
A man goes into a pub in a small town and, for whatever reason, gets introduced to the clientele. There's Farmer Jack, Barman Jim, Maurice "Dancer" and Sheepshagger John. After a few pints, the visitor's curiosity gets the better of him and he asks John what's with the nickname.
"See this pub?" asks John, "I built it, but do they call me Pubbuilder John? I'm the local doctor, I saved Barman Jim's life once when he choked on a peanut, but they don't call me Lifesaver John. Every year, I supply a huge Christmas tree for the village green, but they don't call me Christmas Tree John.
"But you shag one lousy sheep..."
(Note: since that Austin Powers film came out, I assume that you Yanks know what "shagging" is now.)
Re: (Score:3, Informative)
Re: (Score:3, Informative)
Many datacenters didn't expect the growth they experienced. As a result, many UPS and generator sets are undersized, or the entire load isn't on them. In some cases the critical servers stay up to post the "we are down" page, but the HVAC system and main floor are down. What good is having a datacenter up if the building AC is down? Sometimes you are forced to shut down simply because the support AC is down and
Re:No Generators? (Score:5, Informative)
Brownouts sometimes fail to trigger generators, even though they should. If only one phase goes down, depending on the design, it may not trip (and would cause a somewhat random outage, like some drunk shutting down racks).
If the generator runs on diesel, they usually only plan for a few hours of backup. If they didn't recalculate the generator runtime as they added equipment, fuel consumption may have climbed higher than anticipated. Is it hot in SF today? Air handlers may be straining to keep the place cool, or maybe the generator was running too hot.
Oftentimes, as equipment is added, the load gets out of balance between phases. It's usually a good idea to keep the load as even as possible, but in a high-traffic data center there's a lot of gear moving in and out, expanding and contracting, and it can become hard to keep track of the loads across phases. A good facilities manager should be able to tell you the current load off the top of his head, but too often these details get left out.
This is just stuff I've seen in cable TV headends over the years. Granted, this facility should have a power manager/engineer on staff, but so often the power is one of the first things to get cut from the budget.
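The "nobody recalculated the runtime as gear was added" trap is plain arithmetic. A sketch with invented figures; a real plan would use the generator's published fuel-consumption curve:

```python
# How adding load quietly shortens generator runtime. Figures are invented;
# check your generator's published fuel-consumption curve for real numbers.

tank_litres = 4000.0

def runtime_hours(load_kw: float, litres_per_kwh: float = 0.30) -> float:
    """Crude linear fuel model: consumption scales with electrical load."""
    return tank_litres / (load_kw * litres_per_kwh)

for load in (500, 800, 1200):      # kW: as-built load vs. load after years of growth
    print(f"{load:4d} kW -> about {runtime_hours(load):4.1f} h on one tank")
```

The same tank that was sized for a comfortable day of runtime at the original load can quietly turn into half a shift once the floor fills up.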
Re:No Generators? (Score:5, Interesting)
No kidding. Years ago, in my former job on traffic systems, we had a great UPS with a generator on site and the ability to keep it fueled up indefinitely. A security contractor came in on the weekend to install something and tried to wire up a new circuit hot. He slipped with a screwdriver and shorted the white phase to the chassis of the breaker panel. I don't think the tip of the driver actually touched ground, but the burn mark is still there to show how close he got.
The resulting current spike blew the 100A fuses (heavy metal strips) both going into and out of the UPS. With the UPS effectively broken, the generator set failed to start and the system gracefully shut down 40 minutes after the incident. That's not bad. The batteries were only specified to last long enough for the genny to settle at 50Hz.
In the process of blowing the fuses, a spike got back into the power supply of one of our DEC Alphas and took it out. The system was redundant at the software level, so I didn't notice immediately.
The UPS guy came out and didn't have enough fuses to replace the blown ones, but we found that with a bit of brute force and some filing, others could be made to fit.
zombies .... (Score:5, Funny)
Re: (Score:2)
Re:No Generators? (Score:5, Informative)
Re: (Score:2)
Re: (Score:3, Informative)
Funnily enough, there was a press release put out today talking about how the 365 Main facility had delivered 100% uptime over the past two years. Yes, 100% uptime for a facility is very possible - all it has to do is stay online and keep providing power and cooling.
According to their own press release... (Score:4, Funny)
The irony of issuing a press release like that and then being hit, later that same day, with a power outage and an apparent simultaneous failure of all the backup systems is beyond measure.
I don't know about God, but it's enough to make me believe in karma.
Insane level of backup... (Score:5, Interesting)
The only places I've actually seen the insane levels of backup that some would like is in some telco central offices. The one I was associated with the longest had eight-hour-plus battery backup and 8 days of fuel for the diesels. Some of our really remote microwave sites had 24 hour battery and 30 day diesel.
Of course one of those sites failed high up in a mountain range in a mid-winter storm (Tieton, 1978) when the commercial power failed, and the starter battery for the diesel froze. When one of the techs finally got there (after burying his Sno-Cat and walking the last couple miles), he had to chip ice off the steel door to get inside, where he was able to get the diesel started with a little "rewire" of one of the backup battery sets. Oh, his two-way radio also failed during his hike, since it was outside his snowsuit, and the lack of communication caused the company to start two more Sno-Cats and a helicopter in that direction.
The site was out for nearly six hours, IIRC.
Even the BEST designs are subject to failure.
--
Tomas
Re:Insane level of backup... (Score:5, Funny)
On Black Butte in Oregon, a communications site went out in the middle of winter following a power outage. The generator ran a short while and then shut down because it overheated: the air intake, 20 feet up in the air, was covered in snow.
Re: (Score:3, Interesting)
--
Tomas
Re: (Score:3, Insightful)
Re:GameFAQs (Score:5, Informative)
Re: (Score:3, Funny)
Re: (Score:3)
Re: (Score:2)