Can Maintenance Make Data Centers Less Reliable?
miller60 writes "Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts. 'The most common threat to reliability is excessive maintenance,' said Steve Fairfax of 'science risk' consultant MTechnology. 'We get the perception that lots of testing improves component reliability. It does not.' In some cases, poorly documented maintenance can lead to conflicts with automated systems, he warned. Other speakers at the recent 7x24 Exchange conference urged data center operators to focus on understanding their own facilities, and then evaluating which maintenance programs are essential, including offerings from equipment vendors."
Maintenance and prevention are not always the same (Score:4, Interesting)
Reliability Centered Maintenance (Score:5, Interesting)
===
"Is preventive maintenance on data center equipment not really that preventive after all? With human error cited as a leading cause of downtime, a vigorous maintenance schedule can actually make a data center less reliable, according to some industry experts."
===
It isn't just human error: the very act of performing intrusive tasks under the theory of "preventative maintenance" can greatly reduce the reliability of systems built of reasonably reliable components. This was studied extensively by the US airlines, the US FAA, and later the USAF in the 1970s, when the concept of reliability centered maintenance was developed for turbine engines and eventually full airliners. Look up the classic report by Nowlan & Heap. It is very counter-intuitive if one has been trained to believe in the classics of "preventative teardowns" and fully known failure probability distribution functions, but it matches up well with what experienced field mechanics have been saying since the days of the pyramid construction.
sPh
Of course, today there is a huge "RCM" consulting industry (7-step programs, etc.) that bears little resemblance to the original research and theories; don't confuse that with the core work.
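To make the argument above concrete, here is a minimal sketch of the arithmetic, not a reproduction of the original RCM analysis. It assumes a component whose failures are purely random (a constant failure rate, so a teardown does not "reset" anything) and assumes each intrusive overhaul carries some small chance of introducing a fault. The function name and every number below are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 8760


def expected_failures_per_year(base_rate_per_hour,
                               overhaul_interval_hours=None,
                               induced_failure_prob=0.0):
    """Expected failures per year for a component with a constant failure rate.

    Assumption: failures are random (memoryless), so an overhaul does not
    lower the base rate. Each overhaul has probability `induced_failure_prob`
    of causing a failure (wrong reassembly, damaged connector, and so on).
    """
    random_failures = base_rate_per_hour * HOURS_PER_YEAR
    if overhaul_interval_hours is None:
        # No intrusive maintenance: only the random failures remain.
        return random_failures
    overhauls_per_year = HOURS_PER_YEAR / overhaul_interval_hours
    return random_failures + overhauls_per_year * induced_failure_prob


if __name__ == "__main__":
    # Hypothetical figures: one random failure per 100,000 hours,
    # a monthly teardown, and a 1% chance of breaking something each time.
    left_alone = expected_failures_per_year(1 / 100_000)
    torn_down = expected_failures_per_year(1 / 100_000,
                                           overhaul_interval_hours=730,
                                           induced_failure_prob=0.01)
    print(f"left alone:       {left_alone:.4f} failures/year")
    print(f"monthly teardown: {torn_down:.4f} failures/year")
```

Under these assumptions the teardown schedule can only add failures. The picture changes for components that genuinely wear out with age, which is why the original work stressed knowing the actual failure pattern before scheduling anything.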
More MBA Consultant BS... (Score:3, Interesting)
Seriously...I sometimes think the average IQ is dropping on a daily basis (and, yes, I get the irony)...Both with what I read, and my own experiences working in IT, I become more and more convinced that society will eventually collapse under the weight of bad advice from consultants (and, no, I don't own a fallout shelter)...and I spend more and more time thinking about ways that I can profit off of the stupidity of leadership.
Re:In between maybe? (Score:5, Interesting)
I suppose that I'd agree. Back in the early 90s, I inherited from a friend a fear of rebooting, turning off, or performing maintenance on a computer. Half the time he opened the case, the computer would become unbootable or never turn back on. Luckily, as a talented engineer, he could usually fix whatever the problem was, but it was a huge pain in the ass. Of course, back then, commodity computer hardware was hugely unreliable, with vast gaps in quality between price ranges, and we were working with pretty cheap stuff. Still, to this day, I dread the thought of turning off a computer that has been working reliably. You never know when some piece of crap component is nearing the end of its life, and the stress of a power cycle could be what pushes it over the edge into oblivion (or highly unreliable behavior). I used to be fond of constantly messing with everything, fixing it until it broke, but his influence moderated that impulse in me, to the point where I usually freak out when anyone suggests unnecessarily rebooting a computer. Surely, there's something to be said for preventative maintenance, and I'd rather be caught with an unbootable PC during regularly scheduled maintenance than suddenly experience a catastrophic failure at random, but there's also something to be said for just leaving the shit alone and not messing with it. Every time you touch that computer, there's a slight chance that you'll accidentally delete a critical file directory, pull out a cable, or knock loose a power connector. The fewer times you come into contact with the thing, the better. If I could build a force field around every PC, I probably would.
Transfer switch ratings (Score:5, Interesting)
Check your transfer switch ratings. I guarantee it will be spec'd much lower than you think. The electricians think it'll only be switched a couple of times in its life. The diesel service provider thinks you're running it twice a week. Whoops. If you run it once a week, it'll only survive a couple of years, then you'll get a facility-wide multi-hour outage. I've personally seen it over and over again over the past two decades. The best part is "we have a procedure", so it'll only be run during maint hours, and the desk jockeys 200 miles away will run it rain or shine, so it's guaranteed that the xfer switch destroys itself at 2 am during a blizzard and it'll take half a day to repair.
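A rough sketch of that arithmetic, in case it helps when reading a datasheet. The rated operation count below is hypothetical: it is simply back-solved from the "once a week, a couple of years" claim above, assuming each generator test moves the switch twice (to the generator and back). Real ratings vary by model, so substitute the figure from your own switch's documentation.

```python
WEEKS_PER_YEAR = 52


def years_until_rating_exhausted(rated_operations, tests_per_week,
                                 operations_per_test=2):
    """Years of scheduled testing before the rated operation count is used up.

    Assumption: each test transfers the load to the generator and back again,
    which counts as `operations_per_test` switch operations.
    """
    if tests_per_week <= 0:
        return float("inf")
    operations_per_year = tests_per_week * operations_per_test * WEEKS_PER_YEAR
    return rated_operations / operations_per_year


if __name__ == "__main__":
    HYPOTHETICAL_RATED_OPERATIONS = 200  # illustrative only, not from a datasheet
    for tests_per_week in (0.25, 1, 2):
        years = years_until_rating_exhausted(HYPOTHETICAL_RATED_OPERATIONS,
                                             tests_per_week)
        print(f"{tests_per_week:>4} tests/week -> {years:.1f} years")
```

The point of the sketch is only that the electrician's assumption (a handful of transfers, ever) and the generator contractor's assumption (weekly or twice-weekly runs) differ by orders of magnitude, and the rating gets consumed at the higher rate.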
Very few xfer switches are more reliable than commercial utility power. Installing a UPS actually lowers reliability in almost all professional situations.
My favorite power outage was caused by a gas leak a couple blocks away, where the utility co shut down the AC and then threatened to take an axe to the gen/UPS if not also shut off. This was not in the official written report, just word of mouth.
Re:Maintenance and prevention are not always the s (Score:4, Interesting)
Planned obsolescence has been promoted in all aspects of life since post WW2 and now it is hard to imagine the world without it. That line of thinking has been creeping into everything even in areas where it doesn't seem to apply.
Does this play a factor on the perception of preventative maintenance or its frequent application? I think it probably does in at least a couple ways, don't you?
Useless article with no data. (Score:5, Interesting)
I read through the entire article, and saw zero data to support his assertion. I'm sure he has the data, but the article didn't reference a single piece of it. Without any data to support the theory all we have is a fluff opinion piece. Shame on Data Center Knowledge for writing an article about a scientific investigation, and not presenting a single piece of scientific evidence.
This is well known from Formula One (Score:5, Interesting)
The purpose was partly to stop qualifying being its own arms race, with cars in completely different specification than for the race, and partly to reduce costs and the number of travelling staff. At the same time, "T Cars" --- a third car, available as a spare --- were banned, so that if a driver destroys a car in practice the team either have to rebuild it or not race. They're allowed to travel with a spare monocoque, but it cannot be built-up and it does not get pit space.
There were endless howlings from the teams, claiming that without a complete strip-down after qualifying, with a large crew working overnight to check everything on the car, reliability would go through the floor and races would finish with only a handful of stragglers fighting a durability battle (our US viewers may find this ironic in light of a certain US Grand Prix, of course).
The same argument was advanced, mutatis mutandis, over limitations on engines and gearboxes, limitations on the number of gear clusters available, limitations on certain forms of telemetry and a wide variety of "the cars can't just be left to run themselves, you know" interventions.
In fact, reliability is now far greater than ten years ago. It's not uncommon for there to be no mechanical retirements, certainly not from the longer-standing teams, and the days of engines imploding on the track are long gone. A front-running driver will probably only have one, if even that, mechanical DNF per season. The teams deliver a functioning car when the pit lane opens at 1pm Saturday, and that car then runs twenty or thirty laps in qualifying and sixty or seventy in the race, a total of perhaps 250 miles, without much maintenance work beyond tyres, fluids and batteries (section 34.1 on page 18 of the sporting regulations [fia.com]).
So again, we see that "preventative maintenance" turns out to really be "provocative maintenance", and leaving working machines alone is the best medicine for them.
Re:In between maybe? (Score:5, Interesting)
"If it's not broken, don't fix it."
Re:In between maybe? (Score:2, Interesting)
If the new or refurbished electronics you're buying are THAT unreliable, why the !%!@#$!@%! are you using them?
If a router fails to come up because a cap is ready to blow, what happens when it blows WHILE IT'S RUNNING?
I had that happen with 2 Cisco ASA firewalls. One was 5 years old, the other was a few months old. They were using HSRP and decided that fighting each other for control was a great idea because one of the ports was going out. We took the old one offline; it wouldn't turn on anymore. The new one? Worked fine.
Over a long enough time-line the failure rate for equipment is 100%. Equipment is usually rated with an MTBF; there's LOTS of documentation on when to replace. You replace laptops every 2 years, desktops and servers every 3, networking equipment every 4, appliances per the manufacturer's specs, and the LAN copper & fiber either when you're doing a major rebuild or when the kit is being replaced.
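For readers unsure how an MTBF figure relates to a replacement schedule, here is a small sketch. It assumes the simplest textbook model, a constant failure rate, under which the chance of at least one failure within time t is 1 - exp(-t / MTBF). The MTBF value used is hypothetical, not taken from any vendor's datasheet, and real hardware with wear-out behaviour will not follow this curve exactly.

```python
import math

HOURS_PER_YEAR = 8760


def failure_probability(mtbf_hours, years_in_service):
    """Probability of at least one failure within the service period.

    Assumption: constant failure rate (exponential model), so
    P(failure by t) = 1 - exp(-t / MTBF).
    """
    hours = years_in_service * HOURS_PER_YEAR
    return 1.0 - math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    HYPOTHETICAL_MTBF_HOURS = 100_000  # illustrative only
    # Replacement intervals quoted in the comment above.
    for label, years in (("laptop", 2), ("server", 3), ("network gear", 4)):
        p = failure_probability(HYPOTHETICAL_MTBF_HOURS, years)
        print(f"{label:<13} replaced at {years} years: {p:.1%} chance of failing first")
```

Note that an MTBF of 100,000 hours (over 11 years) does not mean a unit lasts 11 years; under this model roughly a quarter of such units would fail within the first three.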
If management is too incompetent to tell what the TCO for a mission critical project is and budget the cash for replacements, why are you working for them?
Rebooting servers is something that needs to happen, depending on the OS, monthly, quarterly and, for high-end enterprise systems, biannually. What happens if you don't reboot and purge errors on a schedule? E.g., for a Windows fileserver, you reboot monthly, run chkdsk, export settings via config files (or run it in a VM) at the BARE minimum, and run backups. When you build a database you need to build a routine to purge bad data every once in a while. For a web server, a nightly reboot is commonplace.
I worked at a warehouse a few years back; 500k+ sq feet, 500+ employees. They didn't invest in their tech, and when their Oracle DB went corrupt, they didn't even have backups. Someone at corporate devised a way to use the corporate records to rebuild their records; 2 weeks later they were back up and running, but not before losing 2 vendors. The cost of three 9's for them was right around 80k for the install and ~20k/year thereafter. The cost of the failure was nearly 2 million; the vendors that did stay required that they provide expedited shipping to their customers. Did I mention it went down during the Christmas shipping season?
Who paid for that?
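Putting the warehouse numbers side by side, as a quick sketch: the 80k install, 20k/year, and roughly 2 million failure cost come from the story above, while the function names are made up for illustration and "three 9's" is taken to mean 99.9% availability over a full year.

```python
HOURS_PER_YEAR = 8760


def allowed_downtime_hours(availability):
    """Downtime budget per year implied by an availability target."""
    return (1 - availability) * HOURS_PER_YEAR


def cumulative_cost(install_cost, annual_cost, years):
    """Total spend on the reliability program after a number of years."""
    return install_cost + annual_cost * years


def break_even_years(install_cost, annual_cost, failure_cost):
    """Years of running costs that add up to the price of a single failure."""
    return (failure_cost - install_cost) / annual_cost


if __name__ == "__main__":
    install, annual, failure = 80_000, 20_000, 2_000_000
    print(f"three 9's downtime budget: {allowed_downtime_hours(0.999):.2f} hours/year")
    print(f"ten-year spend: {cumulative_cost(install, annual, 10):,}")
    print(f"years of spend equal to one failure: {break_even_years(install, annual, failure):.0f}")
```

With those figures, the one outage cost about as much as the install plus 96 years of upkeep, and that is before counting the two lost vendors.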
If you're running in an environment that badly maintained, you're the managerially-acceptable fall guy to justify their bonuses; if the equipment is in such a bad state that you're afraid of it, you should be looking for work at a company that does things right.
Re:In between maybe? (Score:5, Interesting)
i can't agree. i used to but now i cannot afford to.
we recently experienced 2 catastrophes (datacentre-wide downtimes, you know things that NEVER happen) and the results were unbelievable. GRUBs failed to load OSes, machines were without a bootloader (due to emergency disk hotswaps), some machines simply didn't turn on, services didn't autostart, a few virtual servers autostarted on multiple hosts (instead of just one), fsck on some of our volumes took hours to finish, 30% of supermicro IPMI cards were unresponsive, etc. it revealed that almost nobody had followed procedures properly.
after that, every single service we have is built in a clustered manner with nodes spread across multiple datacentres. I now restart machines and pull cables at regular intervals to test bgp/ospf, clustering, recoveries, to check filesystems, etc. i am now also ABLE TO SLEEP.