Extreme California Heat Knocks Key Twitter Data Center Offline (cnn.com) 62
Extreme heat in California has left Twitter without one of its key data centers, and a company executive warned in an internal memo obtained by CNN that another outage elsewhere could result in the service going dark for some of its users. CNN reports: "On September 5th, Twitter experienced the loss of its Sacramento (SMF) datacenter region due to extreme weather. The unprecedented event resulted in the total shutdown of physical equipment in SMF," Carrie Fernandez, the company's vice president of engineering, said in an internal message to Twitter engineers on Friday. Major tech companies usually have multiple data centers, in part to ensure their service can stay online if one center fails; this is known as redundancy.
As a result of the outage in Sacramento, Twitter is in a "non-redundant state," according to Fernandez's Friday memo. She explained that Twitter's data centers in Atlanta and Portland are still operational but warned, "If we lose one of those remaining datacenters, we may not be able to serve traffic to all Twitter's users." The memo goes on to prohibit non-critical updates to Twitter's product until the company can fully restore its Sacramento data center services. "All production changes, including deployments and releases to mobile platforms, are blocked with the exception of those changes required to address service continuity or other urgent operational needs," Fernandez wrote. In a statement about the Sacramento outage, a Twitter spokesperson told CNN, "There have been no disruptions impacting the ability for people to access and use Twitter at this time. Our teams remain equipped with the tools and resources they need to ship updates and will continue working to provide a seamless Twitter experience."
Instead... (Score:5, Interesting)
Re: (Score:2)
Facebook has real redundancy. Twitter does not.
Actually Twitter is very redundant. Same crap every single day when I have looked at it.
Re: (Score:2)
Maybe be like old television broadcasts, and just shut down periodically. It's not like those services are necessary for human life. "OMG I can't get to Twitter!" said no one unironically.
(What's the opposite of iron? It would be what you use for the opposite of irony. Cheese, he suggests, cheesely.)
Re: (Score:3)
Some people really need Twitter at 2am. But then some people are idiots.
Re: (Score:2)
Re: (Score:2)
Pretty sure Facebook's West Coast region is in Oregon, not California.
Re: (Score:2)
I like how you deadpan responded to an anti-creimer troll and got modded up to +5 Interesting doing so. Sometimes it's fun putting trolls on the radar by responding to them with high karma, but you really took the cake with this one!
Re: (Score:2)
Maybe California could take unessential datacenters (twitter, facebook, etc..) offline before asking the citizens to endure power restrictions.
They do. One of the stages of response is having datacenters switch over to their backup generators. They don't shut things down, but they do go off-grid for a few hours.
Twatter (Score:5, Funny)
As a result of the outage in Sacramento, Twitter is in a "non-redundant state"
Oh no, no, no, no, no. It's redundant, baby. It's redundant.
Climate Change (Score:5, Funny)
Re: (Score:2, Redundant)
Same! Global warming is good for something after all!
Re: (Score:3)
Well gee looks like I have redundancy after all! Unlike Twitter. Go me?!?!
Re:Climate Change (Score:4)
Yup..if the other data centers go out and twitter goes fully down...
The world will be a much nicer place for a while.
Re: (Score:2)
Would be better to have a global fan,
What margin did they build in to their cooling ? (Score:2, Insightful)
Was the temperature SO extreme that they could not possibly CONCEIVE of such a high temperature when they designed their cooling system ?
And of course, now everyone else is forewarned by their mistake, so the next time it happens it is DEFINITELY incompetence or corner-cutting.
Re:What margin did they build in to their cooling (Score:5, Insightful)
Was the temperature SO extreme that they could not possibly CONCEIVE of such a high temperature when they designed their cooling system ?
And of course, now everyone else is forewarned by their mistake, so the next time it happens it is DEFINITELY incompetence or corner-cutting.
I was thinking the same thing, but apparently nobody could have foreseen cooling problems when building a data centre in an area prone to heatwaves, any more than anybody could have foreseen a meltdown when building a nuclear plant in an earthquake & tsunami zone. Sounds like a whole bunch of engineers and business people need some serious lessons in obvious problem prediction.
Re: (Score:3, Funny)
Re: (Score:2)
Sounds like a whole bunch of engineers and business people need some serious lessons in obvious problem prediction.
The entire business model was built around nothing but hot air from the start...
Re:What margin did they build in to their cooling (Score:5, Insightful)
Sounds like a whole bunch of engineers and business people need some serious lessons in obvious problem prediction.
Typically in these cases, the engineers did a good job but made a key mistake: They presented "options" to management with different probabilities of things failing. And management predictably took the cheapest one that still seemed somewhat reasonable to them, but, as usual, management did not understand what was going on.
I learned from my mother how to do this right: Present 3 options to the customer. Make one so obviously bad the customer will see that themselves. Make one the option you want them to choose. And then add one that is more expensive with non-necessary frills and some gold-plating. The customer will usually choose the one you wanted and sometimes the gold-plated one. Only the most stupid customers will select the bad one and usually you can still convince them otherwise.
Re: (Score:2)
Sounds like a whole bunch of engineers and business people need some serious lessons in obvious problem prediction.
Typically in these cases, the engineers did a good job but made a key mistake: They presented "options" to management with different probabilities of things failing. And management predictably took the cheapest one that still seemed somewhat reasonable to them, but, as usual, management did not understand what was going on.
I learned from my mother how to do this right: Present 3 options to the customer. Make one so obviously bad the customer will see that themselves. Make one the option you want them to choose. And then add one that is more expensive with non-necessary frills and some gold-plating. The customer will usually choose the one you wanted and sometimes the gold-plated one. Only the most stupid customers will select the bad one and usually you can still convince them otherwise.
And what happens when they pick option #1 because they are greedy morons? It's easy to lay the blame at the feet of 'management'; those people usually do their homework before investing huge quantities of money in a project, but sometimes you run into a bunch of greedy salivating morons. The fact that any of this was allowed to happen represents an epic chain of failures all the way from engineering, through risk assessment, business planning and up to CEO level. If management wants you to do something that
Re: (Score:2)
If these people are building datacentres for twitter instead of key infrastructure like nuclear plants, we're making progress. Maybe we can get them jobs building datacentres for facebook and tiktok too!
Re: (Score:2)
If these people are building datacentres for twitter instead of key infrastructure like nuclear plants, we're making progress. Maybe we can get them jobs building datacentres for facebook and tiktok too!
Given the history of cost overruns, engineering failures, meltdowns and other disasters in the history of nuclear power it seems a fair few of them are indeed building nuclear power plants.
Re:What margin did they build in to their cooling (Score:5, Insightful)
Was the temperature SO extreme that they could not possibly CONCEIVE of such a high temperature when they designed their cooling system ?
That's not how you engineer things. You make an assessment on how likely extreme events are and do a cost benefit analysis. And "we already have redundant data centres, so losing one temporarily isn't a big deal" comes into that equation.
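To make that cost-benefit comparison concrete, here is a minimal sketch. Every number in it is made up for illustration; nothing here comes from Twitter or from the memo. The function name `worth_hardening` and its parameters are hypothetical, and the model is deliberately crude: it only compares the yearly cost of extra cooling margin against the expected yearly loss it avoids.

```python
def worth_hardening(extra_cost_per_year: float,
                    prob_before: float,
                    prob_after: float,
                    outage_cost: float) -> bool:
    """Return True if the extra yearly spend is smaller than the
    expected yearly outage loss it removes.

    prob_before / prob_after are the assumed annual probabilities of the
    outage without and with the extra hardening (hypothetical inputs).
    """
    expected_loss_avoided = (prob_before - prob_after) * outage_cost
    return extra_cost_per_year < expected_loss_avoided


# Hypothetical case 1: other data centres absorb the traffic, so losing
# this one is cheap. Hardening does not pay for itself.
print(worth_hardening(50_000, 0.01, 0.001, 1_000_000))

# Hypothetical case 2: no redundancy left, so the same outage is ten
# times as costly. Now the same hardening is worth it.
print(worth_hardening(50_000, 0.01, 0.001, 10_000_000))
```

The point of the two cases is that redundancy elsewhere lowers the cost of an outage, which in turn changes whether extra margin at one site is worth paying for.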
Re: (Score:2)
Was the temperature SO extreme that they could not possibly CONCEIVE of such a high temperature when they designed their cooling system ?
That's not how you engineer things. You make an assessment on how likely extreme events are and do a cost benefit analysis. And "we already have redundant data centres, so losing one temporarily isn't a big deal" comes into that equation.
And if the company is less than honest, the cost of a rare but likely event that can take the entire system offline is balanced against extra profits from not having to engineer for it. Thus the company is more profitable, and the consumers may be without a necessary service (not talking twitter here, actual necessary services) for a period of time long enough to be a problem, but the company is more profitable.
Re: (Score:2)
Moreover, when you design a data center for high efficiency you lose a few buffers. Things like air cooled chillers are often used for backup/redundant/low-water operations, but they have a hard upper limit on outside air temperature where they can still provide meaningful cooling.
When you design a facility with a cold aisle temperature of 82-85F you don't have much time if it creeps up to 90F.
Re: (Score:3)
could not possibly CONCEIVE of such a high temperature
Planning for every eventuality is not possible, which is why one tries to develop a fault flexible architecture. One product I worked on was designed to survive nine out of eleven simultaneous data center outages with zero downtime. They simulated three out of six, then called that "good enough". I would have tested it to the full 9's but I don't have to pay the bills either.
Almost every answer to "Why don't they..." is the same.
Cost, effort, tim
Re: (Score:2)
Yes. Yes it was. If your data center is designed to handle X days of 40C+ and you get X+30 days of 40C+, at some point you're not going to be able to cool properly anymore.
Unless you want everything to be a lot more expensive to be able to handle what were, at the time of construction, extremely low chance events.
Re:What margin did they build in to their cooling (Score:4)
You can conceive anything. What you can't do is ever justify the cost of building something to withstand everything you can conceive. When someone says something was designed to withstand up to a 1/100 year event, what they mean is there's a 1% chance it may fail in any given year.
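A quick sketch of what that 1% per year adds up to over the life of a building. This assumes each year is independent and the annual probability stays fixed at 1%, which is a simplification (a warming climate would push it up). The function name `prob_at_least_one` and the year counts are just illustrative choices.

```python
def prob_at_least_one(annual_prob: float, years: int) -> float:
    """Chance of at least one exceedance over `years` years,
    assuming independent years with the same annual probability."""
    return 1.0 - (1.0 - annual_prob) ** years


# Cumulative chance of seeing a "1-in-100-year" event at least once.
for years in (1, 10, 30, 100):
    print(f"{years:>3} years: {prob_at_least_one(0.01, years):.1%}")
```

Under those assumptions the chance is roughly one in four over 30 years and well over half across a full century, so "1-in-100" does not mean "won't happen to us".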
Re: (Score:2)
I was bemoaning a product I helped work on. The product manager said to not feel bad, it has a 99% success rate. I responded that we sold millions of them.
Re: (Score:2)
And? Absolute numbers themselves are not sufficient information. What did the failure of your product do, that is the relevant part. Did it mildly inconvenience people like a McDonalds toy which broke causing a kid to cry? Did it leave someone without internet for a week during an RMA process? Did it kill tens of thousands of your customers?
99% success rate (or in some cases even far lower) is perfectly acceptable for a low consequence product. If on the other hand we were talking about the brakes of a car
Re: (Score:2)
A high consequence product, sold to industries not consumers. Failure means a high expense to the customer, possibly other consequences.
Re: (Score:1)
True, you can indeed conceive anything.
But I find it hard to believe that - knowing that global warming is occurring - they did not conceive that they would encounter the temperatures that they did just encounter. That those temperatures were soooo far outside the norm.
Re: (Score:2)
Taking that to absurd conclusions: should they have also built the place with a 3 meter thick reinforced concrete dome over it too, because meteor strikes can't be ruled out and can be conceived of as well, and could take the whole facility offline too. Or, maybe they should have built it with protection from electromagnetic pulses - you never know what those crazy Russians are going to do...
At some point the probability of an event gets low enough that it's not worth the extra cost to harden against it, e
Re: (Score:1)
"At some point the probability of an event gets low enough [...]"
That is the point. Was the probability of encountering the temperature that they have just encountered *really* so low that they could assume they would never have to deal with it, knowing that global warming is occurring ?
I find that hard to believe.
Re: (Score:2)
Certainly never in Sacramento, a city founded when they discovered molten gold flowed there in the summer.
Re: (Score:2)
They probably would have had to utilize something like geocooling to keep the systems running. That's not always feasible.
And yet (Score:5, Funny)
Re: And yet (Score:2)
Redundant vs Replication and "going dark" (Score:3, Interesting)
Usually on a weekend, slashdot editors take a wild joyride with word choice. This time it extends past the weekend.
Redundancy is not replication. Redundancy means some effort at play to mitigate loss. Replication means
duplication with the goal of maintaining sustained availability in the face of outages. The Sacramento
datacenter is an example of the latter.
Finally, the "going dark" thing. This is a Fear Uncertainty Doubt (FUD) thing that is used by everyone
from LEOs and encryption to ermagod FB having to turn down a datacenter. It's just become absurd.
We don't care about "going dark" if all that means is less twitter, less FB, and less BeauHD. We care
when the underlying data, article, UGC, or site is of some value. None of these are.
Replication is the key to avoid these issues, and it's soundly in place.
You can go back to sleep now, knowing no matter what happens in California, the swill you crave
will be online just fine.
Re: (Score:1)
We don't care about "going dark" if all that means is less twitter, less FB, and less BeauHD...
When you sustain a product that answers to shareholders, outages matter.
Stop making stupid assumptions.
Re: (Score:1)
> When you sustain a product that answers to shareholders, outages matter.
Products don't answer to shareholders. Period.
Product outages don't answer to shareholders. Period.
Management's abilities to have duplicate datacenters such that RANDOM NONPAYING PEOPLE (note: not shareholders) can access RANDOM DATA uploaded by other RANDOM PEOPLE (UGC and again nothing to do with shareholders) has nothing to do with anything.
In a public company management's role and goal is to increase shareholder value. In so
Re: (Score:2)
good, i dont see a problem (Score:1)
Nothing of value was lost (Score:1)
This is mother earth... (Score:1)
...telling us that Twitter is toxic.
Pray for more extreme events at twitter data centers. We still might be able to save humanity before twitter becomes skynet.
Say what you want about republicans (Score:2)
But they know how to play a long game. Spending years to fight against anti-climate change initiatives and support ever increasing pollution all to knock offline a datacentre who blocked their favourite president.
Why California? (Score:2)
Redundancy works (Score:2)
Oh, the horror! (Score:2)
Hate and stupidity halted in their tracks for who knows how long.
yea... (Score:1)
And there was much rejoicing!
More technical details please? (Score:2)
Was this datacentre shut off because it couldn't get sufficient power for cooling, or was it shut off because it couldn't get any power at all?
Re: (Score:2)
It was shut off because the existing cooling system was insufficient to keep the equipment cool enough.
Re: (Score:2)
Kind of makes you wonder if they could just let the CPUs run in lower p-states? Or would they be too slow, operating that way, to be worth keeping online?
Idiots (Score:2)
You would think an SF company would realize that Sacto is like the hottest place in the summer, while SF is the coolest. But, SF electric power is provided by expensive capitalist PG&E, while Sacto has socialist electric power, provided by cheap SMUD.