Are Data Center "Tiers" Still Relevant?
miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."
pointless marketing (Score:5, Informative)
Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators
I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms and be more reliable than the commercial power they are supposedly backing up. And yes, I've been involved in numerous power failure incidents (dozens) at numerous companies, and have seen the backup successfully cover a commercial power loss only twice.
Transfer switches that don't switch. Generators that don't start below 50 degrees. Generators with empty fuel tanks staffed by smirking employees with diesel vehicles. When you're adding capacity to battery string A, and the contractor shorts out the mislabeled B bus while pulling cable for the "A" bus.
Experience shows that if a company's core competency is not running power plants, it would be better off not trying to build and maintain a small electrical power plant. Microsoft has conditioned users to expect failure and unreliability; use that conditioning to your advantage... the users don't particularly care if it's down because of an OS patch or a loss of -48VDC...
Re:RAID (Score:3, Informative)
Why go with a huge, multiple 9's datacenter, when you can go the way of Google, and have a RAID: Redundant Array of Inexpensive Datacenters. Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?
That all depends. A 5 9s datacenter is a full ten times more reliable than a 4 9s datacenter (mathematically speaking). So, all things being equal (again, mathematically), you would need ten 4-9 centers to be as reliable as your one 5-9 center. However, geographic dispersion, outage recovery lead time, bandwidth costs, maintenance, etc. can all factor in to sway the equation either way. It really comes down to itemizing your outage threats, pairing that with the cost of redundancy for each threatened component, and then looking at the cost of downtime as part of the business process. It's rarely as simple as "why not just build two at twice the price".
Re:But it's never the software... (Score:3, Informative)
Perhaps this TDWTF article [thedailywtf.com] is what you were thinking of?
--- Mr. DOS
Re:No (Score:3, Informative)
Because of this, our data center has redundant UPSes and redundant generators. All but the least critical servers have dual power supplies, plugged into independent circuits.
We have multiple ACs but they are not strictly set up to be redundant. When one breaks down we have to haul standing fans to the area to keep the machines cool enough while the AC is repaired.
The stupid thing, though, is that most of the smaller switches have a single power supply, and most machines are plugged into a single switch. So our last UPS failure resulted in two whole racks of servers being inaccessible for 15 minutes while I ran over there, figured out what the problem was, and plugged the switch into a neighboring rack.
Re:RAID (Score:1, Informative)
Inaccurate math aside, "4 nines" is about 4 minutes per month, i.e., restart the machine at midnight on the first of the month. "5 nines" is about 5 minutes a year, a restart every Jan 1st. Properly managed, neither of these is particularly disruptive.
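As a rough check of those figures, here is a minimal sketch that turns a count of nines into a downtime budget. The function name `downtime_minutes`, the 30-day month, and the 365-day year are assumptions made for illustration.

```python
def downtime_minutes(nines: int, period_minutes: float) -> float:
    """Allowed downtime for an availability with `nines` nines over a period."""
    unavailability = 10.0 ** -nines  # e.g. 4 nines -> 0.0001
    return unavailability * period_minutes


MINUTES_PER_MONTH = 30 * 24 * 60   # assumed 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60   # assumed 365-day year

print(f"4 nines: {downtime_minutes(4, MINUTES_PER_MONTH):.2f} min/month")
print(f"5 nines: {downtime_minutes(5, MINUTES_PER_YEAR):.2f} min/year")
```

Under those assumptions the budgets come out to roughly 4.3 minutes per month and 5.3 minutes per year.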
If your concern is unplanned outages, then two independent "4 nines" data centers have eight nines of reliability, because there's a 99.99% probability that the second data center will be functional when the first one goes down. Of course, you can't predict susceptibility to unplanned outages, so "4 nines" or "5 nines" in that context is a made-up number.
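The eight-nines figure comes from multiplying the two failure probabilities, which only holds if the sites really do fail independently (a shared utility feed, software bug, or operator would break that). A minimal sketch under that assumption, with `combined_availability` as a hypothetical helper name:

```python
import math


def combined_availability(availabilities):
    """Availability when service survives as long as any one site is up.

    Assumes site failures are statistically independent.
    """
    # Service is down only when every site is down at the same time.
    p_all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - p_all_down


two_sites = combined_availability([0.9999, 0.9999])
print(f"two 4-nines sites: {two_sites:.10f}")
```

Two sites at 0.9999 each give an unavailability of 0.0001 * 0.0001 = 0.00000001, which is where the eight nines come from.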