Can Maintenance Make Data Centers Less Reliable?
miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of risk science consultancy MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."
In between maybe? (Score:5, Insightful)
Maybe there's a sweet spot between "no testing at all" and "replacing everything every three months"? In my experience, there is a lot of work to do in most places to make sure that proper testing is done, or at least that emergency procedures are known and people are well trained in them. Very often documentation is lacking and the onsite support staff have no clue where that circuit breaker is. That is the most common scenario in my experience, not overzealous maintenance.
Can faulty logic make data centers less reliable? (Score:5, Insightful)
From TFS:
"... poorly documented maintenance can lead to conflicts with automated systems ..."
That doesn't mean maintenance makes datacenters less reliable. It means cluelessness makes datacenters less reliable.
Sheesh.
Maintenance-induced failure. (Score:5, Insightful)
There's something to be said for this. Back when Tandem was the gold standard of uptime (they ran 10 years between crashes, and had a plan to get to 50), they reported that about half of failures were maintenance-induced. That's also military experience.
The future of data centers may be "no user serviceable parts inside". The unit of replacement may be the shipping container. When 10% or so of units have failed, the entire container is replaced. Inktomi ran that way at one time.
You need the ability to cut power off of units remotely, very good inlet air filters to prevent dust buildup, and power supplies which meet all UL requirements for not catching fire when they fail. Once you have that, why should a homogeneous cluster ever need to be entered during its life?
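A rough back-of-the-envelope for the "replace the container at 10% failed" model above. This sketch assumes independent unit failures at a constant annualized failure rate (the 3% AFR is an illustrative number, not from the post):

```python
import math

def years_until_fraction_failed(afr: float, fraction: float) -> float:
    """Time until `fraction` of units have failed, assuming independent
    exponential failures with annualized failure rate `afr`.

    Cumulative failure fraction at time t is 1 - exp(-afr * t);
    solving for t gives -ln(1 - fraction) / afr.
    """
    return -math.log(1.0 - fraction) / afr

# With a 3% annualized failure rate per unit, how long until 10% of a
# sealed container's units have failed and the whole box gets swapped?
t = years_until_fraction_failed(afr=0.03, fraction=0.10)
print(f"{t:.1f} years")  # roughly 3.5 years
```

So even with no-touch operation, a modest per-unit failure rate gives the container a service life of a few years before hitting the replacement threshold.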
Re:The key to achieving high uptime ... (Score:5, Insightful)
Which means for every online server you need an offline test machine -- and a way to simulate the operating environment in order to test. Not many companies have the skill or cash to do that.
Re:Can faulty logic make data centers less reliabl (Score:4, Insightful)
vigorous maintenance
excessive maintenance
poorly documented maintenance
Those are all qualified as out of the ordinary. Anything out of balance, whether too much or too little, is a problem. Of course maintenance must be performed, but I guess some data centers have a strange idea of best practices, or simply don't follow them.
Re:Maintenance-induced failure. (Score:5, Insightful)
There's also been a shift in the mentality of how well computers operate. It went from not tolerating any kind of downtime to the Windows mentality of crashing and "That's just how computers are".
Re:Can faulty logic make data centers less reliabl (Score:5, Insightful)
Maintenance, like anything else you do in a datacenter or wherever you work, must be done correctly. If maintenance reduces the reliability of the maintained entity, then by definition it was not performed correctly.
Doing something correctly requires knowledge, planning and training. Just like everything else.
Re:soft vs hard reboot (Score:4, Insightful)
You must not deal with any Oracle database servers. They leak like a sieve.
[John]
Re:In between maybe? (Score:5, Insightful)
Do you know why satellites last so long in a hostile environment?... because nobody touches them.
"If it's not broken, don't fix it."
Actually, I'm pretty sure it's the millions that are spent engineering each individual one so that it specifically can survive many years in said hostile environment.
If we spent anywhere near that much time and money on proper engineering, everyday crap would be pretty damn reliable too, just not nearly as cost-effective.
Re:In between maybe? (Score:5, Insightful)
if that's the case, you don't have CONTROL over your equipment.
That was acceptable for Windows 95, but it isn't acceptable even for desktop PCs anymore, let alone server equipment. My opinion is that your equipment isn't stable UNTIL you can turn it off and on again reliably. And yes... that is an ENORMOUS amount of work.
If you can't reliably replace individual pieces, then you don't have control for maintenance... sure, you can stick your head in the sand and just not touch anything... but that's just piling up all the things you didn't take time to figure out until some critical time later.