Data Storage IT

Are Data Center "Tiers" Still Relevant? 98

Posted by timothy
from the german-datacenters-have-tieren dept.
miller60 writes "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices? That question is at the heart of an ongoing industry debate about the merits of the tier system, a four-level classification of data center reliability developed by The Uptime Institute. Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators. Uptime says that many industries continue to require mission-critical data centers with high levels of redundancy, which are needed to perform maintenance without taking a data center offline. Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar."
This discussion has been archived. No new comments can be posted.

  • by pyster (670298)
    And they never were.
    • by aaarrrgggh (9205)

      The Tier Guidelines, as the Uptime Institute presents them, are utterly useless beyond C-Suite penis posturing. However, it is important for a company to establish what its needs are.

      Most serious players customize the tiers based on their own needs and risk assessments. Redundant UPS systems, for example, provide more benefit than redundant utility transformers. Redundant generators offer less benefit than redundant starting batteries and proper maintenance and testing. Mechanically, over-sizing so

      • Re: (Score:3, Informative)

        by Forge (2456)
        Sometimes people do irrational things in data centers. For example, where I live and work the electricity company is notoriously unreliable. We had a 5-minute outage this morning for no apparent reason, and three last week of varied durations. This is in the heart of the business district, where power is most reliable.

        Because of this, our data center has redundant UPSes and redundant generators. All but the least critical servers have dual power supplies, plugged into independent circuits.

        We have multiple ACs but they ar
  • It depends (Score:5, Interesting)

    by afidel (530433) on Tuesday September 22, 2009 @12:26PM (#29505457)
    If you are large enough to survive one or more site outages then sure you can go for a cheaper $/sq ft design without redundant power and cooling. If on the other hand you are like most small to medium shops then you probably can't afford the downtime because you haven't reached the scale where you can geographically diversify your operations. In that case downtime is probably still much more costly than even the most expensive of hosting facilities. I know when we looked for a site to host our DR site we were only looking at tier-IV datacenters because the assumption is that if our primary facility is gone we will have to timeshare the significantly reduced performance facilities we have at DR and so downtime wouldn't really be acceptable. By going that route we saved ~$500k on equipment to make DR equivalent to production at a cost of a few thousand a month for a top tier datacenter, those numbers are easy to work.
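A rough check of the arithmetic in the comment above, as a minimal sketch. The $500k figure comes from the comment; the $3,000 monthly fee is an assumed stand-in for "a few thousand a month", and `breakeven_months` is a hypothetical helper name.

```python
def breakeven_months(upfront_savings, monthly_fee):
    """Months of hosting fees that the avoided equipment spend would cover."""
    return upfront_savings / monthly_fee


# ~$500k saved on DR equipment (from the comment); $3,000/month is an
# assumed value for "a few thousand a month".
months = breakeven_months(500_000, 3_000)
print(f"{months:.0f} months, about {months / 12:.1f} years")
```

Under that assumption the hosting fees take well over a decade to add up to the avoided equipment cost, which is longer than most server hardware stays in service.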
    • It's very confusing to me. These guys say they are the only Tier III in the US according to ATAC, whoever they are. Apparently there are no Tier IVs that outsource. So I would think your statement is correct: "haven't reached the scale where you can geographically diversify your operations" probably means a shop doesn't need anything above Tier II plus its own backups.

      http://www.onepartner.com/v3/atac_tier.html [onepartner.com]

  • by CherniyVolk (513591) on Tuesday September 22, 2009 @12:28PM (#29505475)

    Infrastructure is more important than "best practices". Infrastructure is more of a physical, concrete aspect. Practices really aren't that important once the critical, physical disasters begin. As an example, good hardware will continue to run for years. Most of the downtime in regards to good hardware will most likely be due to misconfiguration, human error, that sort of thing. A Sys Admin banks on some wrong assumption, messes up a script or hits the wrong command, but nonetheless the hardware is still physically able and therefore the infrastructure has not been jeopardized. If the situation is reversed, top notch paper plans and procedures... with crappy hardware. Well... the realities of physical discrepancies are harder to argue than our personal, nebulous, intangible, inconsequential philosophies of "good/better/best" management procedures/practices.

    So to me the question "In their efforts at uptime, are data centers relying too much on infrastructure and not enough on best practices?" is best translated as "To belittle the concept of uptime and its association with reliability, are data centers relying too much on the raw realities of the universe and the physical laws that govern it and not enough on some random guy's philosophies regarding problems that only manifest within our imaginations?"

    Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on science and not enough on voodoo/religion?"

    • It's really a false dichotomy. You need a little of both. You need to have your procedures matched to the reliability, performance, and architecture of your infrastructure.

      Or, as a medical analogy... "In their efforts in curing cancer, are doctors relying too much on surgery and not enough on chemotherapy?"

  • by japhering (564929) on Tuesday September 22, 2009 @12:30PM (#29505521)

    Data center redundancy is a needed thing. However, most data center designs forget to address the two largest causes of downtime: people and software. People are people and will always make mistakes; even so, there are still things that can be done to reduce the impact of human error.

    Software is very rarely designed for use in redundant systems. More likely, the design is for use in a hot-cold or hot-warm recovery scenario. Very rarely is it designed to run hot at multiple sites across multiple data centers.

    Remember, good disaster avoidance is always cheaper than disaster recovery when done right.

    • by dkleinsc (563838)

      Disaster avoidance is good, for sure, but that's not what your DR efforts are really for.

      Here's the story (fairly well-covered by /. at the time) of why you have a disaster recovery system and plan in place: A university's computing center burned to the ground. The entire place. All the servers, all the onsite backups, all the UPS units, gone. Within 48 hours, they were back up and running. Not at 100% capacity, but they were running.

      • Re: (Score:3, Insightful)

        by japhering (564929)

        And if you had two identical data centers, where each in and of itself was redundant with software designed to function seamlessly across the two in a hot-hot configuration .. there would have been NO downtime.. the university would have been up the entire time with little to no data loss.

        So say I'm Amazon and my data center burns down.. 48 hours with ZERO sales for a disaster recovery scenario vs normal operations for the time it takes to rebuild/move the burned data center..

        I think I'll take disaster avoi

        • Re: (Score:3, Insightful)

          by aaarrrgggh (9205)

          Unless you were doing maintenance in the second facility when a problem hit the first. That is what real risk management is about; when you assume hot-hot will cover everything, you have to make sure that is really the case. Far too often there are a few things that will either cause data loss or significant recovery time even in a hot-hot system when there is a failure.

          Even with hot-hot systems, all facilities should be reasonably redundant and reasonably maintainable. Fully redundant and fully maintain

          • by Vancorps (746090)

            I agree. As someone who runs two data centers that are both hot and both independently capable of handling the company load, I think it is still wise to have a scaled-back DR at yet another location. For me it's as simple as a tier 3 storage server with a 100 TB tape library backing things up properly. This gives me email archiving and records compliance. Of course the tapes aren't stored at the DR site.

            The problem is that there is no effective limit to how much redundancy you can have. The industry default in

          • by japhering (564929) on Tuesday September 22, 2009 @04:19PM (#29508365)

            Precisely. I spent the last 12 years (prior to being laid off) working in a hot-hot-hot solution. Each center was fully redundant and ran at no more than 50% dedicated utilization. Each data center got 1 week's worth of planned maintenance every quarter for hardware and software updates, during which that data center was completely offline, leaving a hot-hot solution. If something else happened we still had a "live" data center while scrambling to recover the other two.

            We ran completely without change windows, as we would simply deadvertize an entire data center, do the work, readvertize, then move on to the next data center. In the event of high importance, say a CERT advisory requiring an immediate update, we would follow the same procedures just as soon as all the requisite mgmt paperwork was complete.

            And yes, we were running some of the most visible and highest traffic websites on the internet.
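The deadvertize / work / readvertize cycle described above can be sketched as a simple loop. This is only an illustration under assumptions: `rolling_maintenance`, the site names, and the `do_work` callback are hypothetical, and real traffic steering (DNS, BGP, load balancers) is not modeled.

```python
def rolling_maintenance(sites, do_work):
    """Take one site at a time out of rotation, work on it, put it back.

    Returns the smallest number of sites that were still advertised
    at any point during the run.
    """
    advertised = set(sites)
    min_live = len(advertised)
    for site in sites:
        advertised.discard(site)                  # deadvertize
        min_live = min(min_live, len(advertised))
        if not advertised:
            raise RuntimeError("no site left serving traffic")
        do_work(site)                             # hardware/software updates
        advertised.add(site)                      # readvertize
    return min_live


updated = []
live = rolling_maintenance(["site1", "site2", "site3"], updated.append)
print(live, updated)
```

With three sites the loop never drops below two serving sites, which matches the "hot-hot while one is in maintenance" description; with a single site it refuses to proceed.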

            • Were you running relational databases? What did you do about schema changes?

              (i.e. presumably if you were running relational DBs then there would be one big data set which would be shared between all three sites; you couldn't e.g. deadvertize one site, change the schema, then readvertize, as then the schemas would be different...)

              • by japhering (564929)

                Actually, we would deadvertize and stop the synchronization, then change the code and the schema in the database and readvertize, leaving the sync off. Then we would move to the second site and do the same thing, but restart the sync between sites 1 and 2.

                When site 3 was done, all three sites would, after a few minutes, be back in sync.

                • But surely if you readvertize and leave the sync off, then data inconsistencies will start to occur? (e.g. a modification to the one database, and a different modification e.g. to the same row in another database). How are these inconsistencies then reconciled?
                  • by japhering (564929)

                    Depends on the schema change involved. I never saw any inconsistencies when adding or deleting a column, which was 95+% of what happened in my environment. As for the remaining changes, I don't ever remember things becoming inconsistent. It might have been pure luck, really good design and implementation, or just bad memory on my part (I'll have to query some of my former colleagues).

                    One thing to remember is that while all three sites were running the same application, the end user never, ever switched sites (unless the site failed

        • by sjames (1099)

          Not necessarily. If the VALUE of those 48 hours worth of sales is less than the COST of a hot-hot configuration then you're wasting money. You also have to consider the number of sales NOT lost in the 48 hours. Depending on your reputation, value, and what you're selling some people will just try again in a day or two. In other cases potential customers will just go to the next seller on Google. You need to know which scenario is more likely.
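The comparison in the comment above can be written out as an expected-value check. A minimal sketch: every number below is an assumed placeholder, and the function names are hypothetical.

```python
def expected_annual_outage_loss(hourly_revenue, outage_hours,
                                fraction_lost, outages_per_year):
    """Expected yearly revenue lost to site outages.

    fraction_lost is the share of sales that never come back
    (customers who buy elsewhere instead of retrying later).
    """
    return hourly_revenue * outage_hours * fraction_lost * outages_per_year


def hot_hot_pays_off(expected_loss, hot_hot_annual_cost):
    """True when the second live site costs less than the loss it avoids."""
    return expected_loss > hot_hot_annual_cost


# Assumed inputs: $10k/hour in sales, a 48-hour recovery, 60% of those
# sales lost for good, and one such disaster every 20 years.
loss = expected_annual_outage_loss(10_000, 48, 0.6, 0.05)
print(round(loss), hot_hot_pays_off(loss, 300_000))
```

With these made-up inputs the second site does not pay for itself; raise the hourly revenue or the fraction of customers who never return and the answer flips, which is the point of knowing which scenario applies.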

  • "A stick of RAM costs how much? $50?"

    I don't remember the source of that quote, but it was in relation to a company spending money (far more than $50) to reduce the memory use of their program. Sure, there's a lot of talk in computer science curricula about using efficient algorithms, but from what I've seen and heard, companies almost always respond to performance problems by buying bigger and better hardware. If software weren't grossly inefficient, how would that affect data centers? Less power consumpti

    • by alen (225700)

      With the new Intel CPUs it's still cheaper to buy hardware than pay coders. Our devs need more space. Turns out it's cheaper to buy a new HP ProLiant G6 server than just more storage for their G4 server. And if we spent a bit more we could buy the power-efficient CPU, which will run an extra few hundred $$$. A coder will easily run you $100 per hour for salary, taxes, benefits and the environmentals. A bare-bones HP ProLiant DL380 G6 server is $200 more than the lowest priced iMac

      • by Maximum Prophet (716608) on Tuesday September 22, 2009 @12:54PM (#29505873)
        Code scales, hardware doesn't. If you have one machine, yes, it's cheaper to get a bigger, better machine, or to wait for one to be released.

        If you have 20,000 machines, even a 10% increase in efficiency is important.
      • by Sarten-X (1102295)

        There's the CPU, plus the energy cost to produce it, the environmental waste of disposing the old unit, the fuel to ship it, the labor to install it... Somebody pays for all of it, even if it isn't put on the 'new hardware' budget.

        Also, I'm not suggesting paying for more programmers, or even demanding much more from existing programmers. All I suggest is that companies ought to push programmers to produce slightly-better programs, especially when they're going to be deployed in a data center environment.

        Giv

    • Re: (Score:3, Insightful)

      That works if you have one program that you have to run every so often to produce a report. If your datacenter is more like Google, where you have 100,000+ servers, a 10% increase in efficiency could eliminate 10,000 servers. Figure $1,000 per server and it would make sense to offer a $1,000,000 prize to a programmer that can increase the efficiency of the Linux kernel by > 10%.

      B.t.w Adding one stick of RAM might increase the efficiency of a machine, but in the case above, the machines are probably
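The numbers in the comment above work out as follows. This is a sketch of the arithmetic only; the second reading of "10% more efficient" is an added assumption, not something the comment states.

```python
servers = 100_000
cost_per_server = 1_000      # dollars, figure from the comment
prize = 1_000_000            # dollars, figure from the comment

# Reading 1: the same workload needs 10% fewer servers.
removed_simple = servers * 10 // 100

# Reading 2 (assumption): each server does 10% more work,
# so the fleet shrinks to servers / 1.10.
removed_throughput = round(servers - servers / 1.10)

for removed in (removed_simple, removed_throughput):
    savings = removed * cost_per_server
    print(removed, savings, savings > prize)
```

Either way the hardware saved is worth roughly nine to ten times the prize, before counting power and cooling.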
    • Re: (Score:3, Informative)

      by Mr. DOS (1276020)

      Perhaps this TDWTF article [thedailywtf.com] is what you were thinking of?

            --- Mr. DOS

      • by Sarten-X (1102295)

        I do believe it is. Thanks!

        The case presented there is ridiculously going to the other extreme, but the principle is sound. A few rare memory leaks aren't a problem, but using a bubble sort on a million-item list is.
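To put a number on the bubble sort remark, here is a small sketch that counts comparisons. The `bubble_sort` below is a plain textbook version written for illustration, and the n log2 n figure is only an order-of-magnitude estimate for a comparison sort such as Python's built-in `sorted`.

```python
import math


def bubble_sort(items):
    """Plain bubble sort; returns (sorted copy, number of comparisons)."""
    data = list(items)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data, comparisons


result, count = bubble_sort([5, 3, 4, 1, 2])
print(result, count)                 # count is n * (n - 1) / 2 for n items

n = 1_000_000
bubble_estimate = n * (n - 1) // 2   # comparisons for this bubble sort
nlogn_estimate = n * math.log2(n)    # rough scale for an O(n log n) sort
print(bubble_estimate, round(nlogn_estimate))
```

For a million items that is about 5 x 10^11 comparisons against roughly 2 x 10^7, a gap that no amount of extra RAM closes.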

  • by jeffmeden (135043) on Tuesday September 22, 2009 @12:36PM (#29505593) Homepage Journal

    Given the recent series of data center outages and the current focus on corporate cost control, the debate reflects the industry focus on how to get the most uptime for the data center dollar.

    Repeat after me: There is no replacement for redundancy. There is no replacement for redundancy. Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*. Redundancy is irreplaceable. If you rely on your servers (the servers housed in one place) you had better have redundancy for EVERY. SINGLE. OTHER. ASPECT. If not, you can expect downtime, and you can expect it to happen at the worst possible moment.

    • There is no replacement for redundancy..

      Sorry, I had to add a 3rd one to repeat.. I'm a bit more risk averse than you!
    • by Jared555 (874152)

      The issue is when the systems designed to create redundancy actually cause the failure (a transfer switch causing a short, etc.). Also, with a couple of seconds of searching, I was able to find one extended downtime caused by safety procedures and not lack of redundancy:

      http://www.datacenterknowledge.com/archives/2008/06/01/explosion-at-the-planet-causes-major-outage/ [datacenterknowledge.com]

      I have seen other cases where entire datacenters were shut down because some idiot hit the shutdown control (required by fire departments for safet

      • by aaarrrgggh (9205)

        There is also N/2 redundancy when you talk about EPO systems-- each button only kills one cord per server, so you have to actually hit two buttons to shut everything down...

        Increased complexity increases risk; the most elegant redundant systems are never tied together, and provide the greatest simplicity. The others ensure job security until the outage happens...

        • by afidel (530433)
          EPO buttons aren't allowed to work like that in most jurisdictions, the fire fighters want to know that when they hit the red button that ALL power to the room is off.
          • by aaarrrgggh (9205)

            That used to be the case, but we have successfully argued for it in every jurisdiction we have tried. With the 2008 NEC, claiming it is a COPS system will quickly let you eliminate an EPO in the traditional sense.

            Dating back to 1993, there was never an NFPA requirement for a single button to kill everything; they allowed you to combine HVAC and power into a single button if desired.

      • by jeffmeden (135043)
        While it's hard to argue that outages wouldn't still occur from things like fires and explosions in a fully redundant environment, it's easy to connect the dots and notice that fully redundant systems rarely experience fires or explosions, if only for the fact that they spend almost all of their service lives operating at less than 50% capacity. Many "bargain basement" hosting companies (I won't name names) choose to run far closer to 80% or 90% of nameplate capacity because it's cheaper. Also, the question
        • by afidel (530433)
          Exactly, unless the short in the transfer switch somehow gets through the UPS how is it going to affect a truly redundant setup? I know if one of my transfer switches dies it wouldn't do anything as the systems would just go along powered by the other power feed. If there is ANY single point of failure in your design it WILL fail at some point, that's why the design guidelines matter.
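The single-point-of-failure argument can be illustrated with textbook availability math. A minimal sketch: the 0.99 figures are assumed, and the parallel formula assumes independent failures, which neglected maintenance or a shared fault can break.

```python
def parallel(*availabilities):
    """Availability of redundant units where any one is enough.

    Assumes the units fail independently of each other.
    """
    p_all_fail = 1.0
    for a in availabilities:
        p_all_fail *= (1.0 - a)
    return 1.0 - p_all_fail


def series(*availabilities):
    """Availability of a chain where every element must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


feed = 0.99                                  # assumed availability of one power feed
redundant_feeds = parallel(feed, feed)       # two independent feeds
with_spof = series(redundant_feeds, 0.99)    # plus one shared, non-redundant element
print(f"{redundant_feeds:.4f} {with_spof:.4f}")
```

Two 99% feeds give about 99.99% together, but one shared 99% component in the path drags the whole chain back below 99%.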
        • by sjames (1099)

          I saw a datacenter go down because one of the batteries in one of the UPSes burst. The fire department then came in and hit the EPO. There exists no point where 100% of everything is fully accounted for. Just when you think you've covered every last contingency, some country that's afraid of boobies will black hole your IPs for you.

          Meanwhile, the cost of each 9 is exponentially higher than the last one was.

          • by mokus000 (1491841)

            Meanwhile, the cost of each 9 is exponentially higher than the last one was.

            And its value is exponentially smaller.
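The shrinking value of each extra nine is easy to see in minutes per year. A small sketch using a 365-day year; `allowed_downtime_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of N nines (2 -> 99%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


previous = None
for nines in range(2, 6):
    budget = allowed_downtime_minutes(nines)
    gained = None if previous is None else previous - budget
    print(nines, round(budget, 2), None if gained is None else round(gained, 2))
    previous = budget
```

Going from two nines to three removes thousands of minutes of downtime a year; going from four to five removes less than an hour, while the thread's point is that the price of that last step is far higher.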

      • "The issue is when the systems designed to create redundancy actually cause the failure" - exactly.

        For example, we had two Oracle systems (hot-cold) and one disk array connected to both systems. The second Oracle was triggered to start automatically when the first Oracle died. One time the second Oracle thought the first Oracle had died and started, even though the first Oracle hadn't died. (We never knew why it started.) Then we had two live instances writing to the same set of data files, and not knowin

        • by mindstrm (20013)

          What was missing is colloquially called STONITH: Shoot The Other Node In The Head.
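The STONITH idea can be sketched as a toy model: the standby may only start writing after the other node is confirmed powered off. Everything here is hypothetical (the `Node` and `PowerSwitch` classes, the `take_over` function); a real cluster would use its cluster manager's fencing agents rather than code like this.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.powered = True
        self.writing = False   # True while this node writes to the shared disks


class PowerSwitch:
    """Stand-in for an out-of-band power controller (hypothetical)."""

    def power_off(self, node):
        node.powered = False
        node.writing = False
        return True            # report that the node is confirmed off


def take_over(standby, primary, heartbeat_ok, switch):
    """Start the standby only if the primary has been fenced first."""
    if heartbeat_ok:
        return False           # primary looks healthy; do nothing
    # A lost heartbeat cannot distinguish "dead" from "unreachable",
    # so cut power to the primary before touching the shared storage.
    if not switch.power_off(primary):
        return False           # fencing unconfirmed: refuse to start
    standby.writing = True
    return True


primary, standby = Node("db1"), Node("db2")
primary.writing = True
started = take_over(standby, primary, heartbeat_ok=False, switch=PowerSwitch())
print(started, primary.writing, standby.writing)
```

In the scenario described above, the false "primary is dead" signal would have cost a restart of the first instance instead of two instances writing to the same data files.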

    • Re: (Score:3, Insightful)

      Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.

      No, I've also heard about cases where both redundant systems failed at the same time (due to poor maintenance) and where the fire department won't allow the generators to be started. Everything within the datacenter can be redundant, but the datacenter itself still is a single physical location.

      Redundancy is irreplaceable.

      Distributed fault-tolerant systems are "better", but they're also harder to build. Likewise redundancy is more expensive than lack of redundancy, and if you have to choose between $300k/year for a redundant location

    • by R2.0 (532027)

      Redundancy is a necessary condition for uptime, but not a sufficient condition. You can have N+a kagillion levels of redundancy, but if the equipment is neglected or procedures aren't followed, it means jack shit.

      Added levels of redundancy can actually hurt overall reliability, if it encourages maintenance to delay repairs and ignore problems because "we have backups for that".

      One facility I worked on had half again as much processing equipment as needed on the floor. Why? "Well, when one fails we just

    • by drsmithy (35869)

      Every outage you read about involves a failure in a feature of the datacenter that was not redundant and was assumed to not need to be redundant... assumed *incorrectly*.

      IME, most outages are due to software or process failures, not hardware.

    • by sjames (1099)

      The question is where to put the redundancy. If you have a DR site and the ability to do hot cut-over, you now have a redundant everything (assuming it actually works). While you wouldn't likely want to have no further redundancy, realistically you just need enough UPS time to make a clean cut-over. If you skip the N+1 everything else you might even be able to afford the much more valuable N+2 data centers.

    • Our company, OnePartner, was referenced in the article. I agree wholeheartedly. There is no replacement for redundancy. There's a point made in the original article that I find interesting. The folks on the "con" side of Certification argue that every application doesn't require Tier III or IV. That *might* make sense if the cost of a Tier III were substantially higher than an uncertified data center. Our cabinet rates are lower than many of the uncertified data centers with established brands. If the
  • pointless marketing (Score:5, Informative)

    by vlm (69642) on Tuesday September 22, 2009 @12:43PM (#29505693)

    Critics assert that the historic focus on Uptime tiers prompts companies to default to Tier III or Tier IV designs that emphasize investment in redundant UPSes and generators

    I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up. And yes I've been involved in numerous power failure incidents (dozens) at numerous companies, and only experienced two incidents of successful backup of commercial power loss.

    Transfer switches that don't switch. Generators that don't start below 50 degrees. Generators with empty fuel tanks staffed by smirking employees with diesel vehicles. When you're adding capacity to battery string A, and the contractor shorts out the mislabeled B bus while pulling cable for the "A" bus.

    Experience shows that if a company's core competency is not running power plants, they would be better off not trying to build and maintain a small electrical power plant. Microsoft has conditioned users to expect failure and unreliability; use that conditioning to your advantage... the users don't particularly care if it's down because of an OS patch or a loss of -48VDC...

    • Re: (Score:3, Insightful)

      by Ephemeriis (315124)

      I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.

      A lot of folks don't really contemplate what a loss of power means to their business.

      Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity

      • by afidel (530433)
        We have a remote telco shelf powered by enough batteries to last 48 hours (not that they have ever been drained down past 30 seconds except during a battery test) and the equipment they talk to is likewise powered by two sets of such batteries (but only 1 generator). Soon we are going to have feeds to two COs which take different egress paths from the city (one East, one West). We have dual generators, dual transfer switches, dual UPSes and all equipment is obviously dual power supplied. The only potential
      • I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible

        Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
        Sure, if you lose power for a single site, your customers at that single site will

        • Many businesses have dozens or hundreds of remote offices / branches / stores. If those stores depend on the HQ site to be running (as many or most do), then having a very reliable generator is critical.
          Sure, if you lose power for a single site, your customers at that single site will be forgiving and don't expect you to have a generator at every store.
          However, if your HQ is in Chicago and loses power for 12 hours from an ice storm, your customers that can't shop at your Palm Beach location are going to be pissed that you are now closed nationwide.

          If you're that big, I'd expect you to have multiple data centers distributed geographically. If your data center in Chicago loses power for 12 hours from an ice storm, I'd expect the Palm Beach store to be accessing a data center somewhere else.

          Even with generators and whatnot... If there's an ice storm in Chicago you're likely looking at an outage. You'll have lines down, trees falling over, issues with your ISP and whatever else. Just keeping your data center up in the middle of that kind of havoc isn

          • I work in the midwest, and the last two companies I worked for had 20-40 locations each. At each place (5 years at one, 4 at the next), I had an 8-hour power outage. The first place didn't have generators, so all of their retail stores lost a day's worth of sales. The second place did have a generator, and everything worked fine. In both cases, the fiber optic service didn't fail. Since it was 100% underground, that was expected. Power lines just go down more than SONET fiber optic does, pure and simple.
            I'd sa
      • by mindstrm (20013)

        Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.

        Plenty of places have proper backup facilities.

        The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes o

        • The main problem, at least in most of the 1st world, is that people are so used to reliable grid power that they don't think about it or see the risk. Look at any operation running somewhere where the power goes out on a frequent basis, and you'll find the above mentioned scenario very common.

          That may very well be true... I've never done any work outside of the US, so I have no idea what kind of scenario is common elsewhere. And maybe I've just been exposed to some fairly clueless people... But I've yet to see a backup power system do what people thought it was going to do - allow them to stay open for business while the grid goes down.

          Basically impossible? All it takes is an adequate UPS setup, with a proper transfer switch and a diesel generator - and a proper maintenance plan to go with it. There's nothing hard or magical about it - it just costs more. Maintenance and fuel.

          The actual quote was "From what I've seen that's basically impossible." I never claimed to be omniscient or omnipresent. I'm just basing my statements on my

      • I've been involved in this field for about 15 years. The funniest misconception I've run into, time and time again, is that an unmaintained UPS, unmaintained battery bank, unmaintained transfer switch, and unmaintained generator will somehow act as magical charms so as to be more reliable than the commercial power they are supposedly backing up.

        A lot of folks don't really contemplate what a loss of power means to their business.
        Some IT journal or salesperson or someone tells them that they need backup power for their servers, so they throw in a pile of batteries or generators or whatever... And when the power goes out they're left in dark cubicles with dead workstations. Or their manufacturing equipment doesn't run, so it doesn't really matter if the computers are up. Or all their internal network equipment is happy, but there's no electricity between them and the ISP - so their Internet is down anyway.
        I'll stand behind a few batteries for servers... Enough to keep them running until they can shut down properly... But actually staying up and running while the power is out? From what I've seen that's basically impossible.

        I've never had the headache of maintaining a business infrastructure, but must cope with our small setup at home. The LAN printer is the only IT thing without UPS power. The server, router, and optical switch are on one UPS. Two PCs each have their own smaller UPS which also power ethernet switches, and there's a laptop which obviously has battery power built-in. All of the computers, including the server, are configured to shut down if the batteries go down to 20% (for the laptop, it's 10%).
        We live in the

        • The server, router, and optical switch are on one UPS.

          The optical fiber never seems to go down, so I guess they have good power at the other end and at any intermediate units.

          I love how everyone else on the planet has fiber to their home now. Even folks in the countryside.

          We moved out of town while two of the local ISPs were in the process of rolling out fiber all over town. We're only about 1 mile outside of the city, and all we have available is dial-up, cable, or satellite. It sucks.

          We live in the countryside, so power outages happen (too often), especially the annoying 1-10 minute outages which mean someone is working on the power line.

          I'm in a similar situation at home. I've got the individual desktops on batteries, and our server, and the network hardware. Pretty much everything except the printer. But our cable Internet

    • Well, my experience is the opposite of your anecdata - our remote sites often experience grid power failures and the building UPS keeps the equipment running the whole time. However, those are smaller sites, not full size datacenters I'm talking about.

      I will however say this about "high availability is hard": Often the redundancy mechanisms themselves are the source of outages. Not just power, but equipment, software, protocols... Maybe your RAID controller fails, instead of the drive. Maybe the HSRP/V

    • by R2.0 (532027) on Tuesday September 22, 2009 @02:23PM (#29507027)

      It's not just in IT. I work for an organization that uses a LOT of refrigeration in the form of walk-in refrigerators and freezers. Each one can hold product worth up to $1M and all can be lost in a temperature excursion. So we started designing in redundancy: 2 separate refrigeration systems per box, backup controller, redundant power feeds from different transfer switches over diverse routing (Browns Ferry lessons learned). Oh, and each facility had twice as many boxes as needed for the inventory.

      After installation, we began getting calls and complaints about how our "wonder boxes" were pieces of crap, that they were failing left and right, etc. We freaked out and did some analysis. Turns out that, in almost every instance, a trivial component had failed in 1 compressor and the system had failed over to the other system, run for weeks to months, and then that failed too. When we asked why they never fixed the first failure, they said "What failure?" When we asked about the alarm the controller gave due to mechanical failure, we were told that it had gone off repeatedly but was ignored because the temperature readings were still good and that's all Operations cared about. In some instances the wires to the buzzer were cut, and in one instance, a "massive controller failure" was really a crash due to the system memory being filled by the alarm log.

      Yes, we did some design changes, but we also added another base principle to our design criteria: "You can't engineer away stupid."
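
The "memory filled by the alarm log" crash mentioned above is the kind of failure a bounded log avoids. A minimal sketch in Python, assuming a hypothetical `AlarmLog` class (nothing here reflects the actual controller): old entries are discarded once a fixed capacity is reached, and a counter records how many were dropped.

```python
from collections import deque


class AlarmLog:
    """Hypothetical bounded alarm log: memory use is capped by `capacity`."""

    def __init__(self, capacity: int = 1000):
        # deque with maxlen silently discards the oldest entry when full
        self._entries = deque(maxlen=capacity)
        self.dropped = 0  # how many old alarms were overwritten

    def record(self, message: str) -> None:
        # Count the entry that is about to be pushed out, if any
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1
        self._entries.append(message)

    def recent(self) -> list:
        # Oldest-to-newest view of what is still retained
        return list(self._entries)


if __name__ == "__main__":
    log = AlarmLog(capacity=3)
    for i in range(5):
        log.record(f"compressor fault {i}")
    print(log.recent(), "dropped:", log.dropped)
```

This only keeps the controller alive; it does nothing about people ignoring the alarm in the first place.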

      • Re: (Score:1, Redundant)

        by iamhigh (1252742)
        Used my points already, but that was interesting to read.
      • I wouldn't call it "stupid" at all. Failure to consider the human element and designing something for yourself is a classic mistake. Hell yeah if I'm getting paid crap, I'm sure as hell not caring about some guy's alarm. Is it broke? No, then don't fix it. Not surprising to me it turned out that way at all. You should look at old telco systems, they knew how to design around people.
      • by Part`A (170102)

        How about IBM's approach? Have the system contact and request a technician directly and charge them for a support contract or call out fee?

    • three things to say to this:

      - unmaintained UPS is worse than none
      - you need actual risk assessment to decide what quality of power backup you need
      - a good line filter is essential (unless you don't care if all your equipment gets toasted)

      If you are in an area where mains power is very reliable, your UPS will need to be very good to beat it, i.e. be rather expensive (so only useful if outages are very expensive for you); if you're looking at two outages from storms a year, at least getting something that will

  • RAID (Score:5, Interesting)

    by QuantumRiff (120817) on Tuesday September 22, 2009 @12:43PM (#29505703)
    Why go with a huge, multiple 9's datacenter, when you can go the way of Google, and have a RAID:
    Redundant Array of Inexpensive Datacenters..

    Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?
    • Re: (Score:3, Informative)

      by jeffmeden (135043)

      Why go with a huge, multiple 9's datacenter, when you can go the way of Google, and have a RAID: Redundant Array of Inexpensive Datacenters.. Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?

      That all depends. A 5 9s datacenter is a full ten times more reliable than a 4 9s datacenter (mathematically speaking). So, all things being equal (again, mathematically), you would need ten 4-9 centers to be as reliable as your one 5-9 center. However, geographic dispersion, outage recovery lead time, bandwidth costs, maintenance, etc. can all factor in to sway the equation either way. It really comes down to itemizing your outage threats, pairing that with the cost of redundancy for each threatened comp

      • I think your probability calculation might be a bit off. The math doesn't go through.

        I should say ahead of time, I don't know much about these 4-9s vs 5-9s. I interpret them as probability of not failing. IE, 4-9, means 99.99%, which means the probability of failure is .0001. If that's wrong, the rest of this doesn't work out.

        Let's try different numbers. Choice A has a probability of 25% of failing, Choice B has a probability of 1% of failing.

        How many A do we need such that the probability of t
      • by arcade (16638)

        Wrong.

        0.01*0.01 = 0.0001

        Which is ten times better than 0.001

      • by aaarrrgggh (9205)

        Even that can over-simplify the problem; when you have to take one system offline, what redundancy do you have left? Will one drive failure take you down?

        To the GP's point, the problem isn't going from 1x 5x9s to 2x 4x9's; usually companies try to do 2x 3x9's facilities instead.

        Redundancy is not Reliability is not Maintainability.

      • Re: (Score:1, Informative)

        by Anonymous Coward

        Inaccurate math aside, "4 Nines" is about 4 minutes per month, i.e. a restart of the machine at midnight on the first of the month. "5 nines" is about 5 minutes a year, a restart every Jan 1st. Properly managed, neither of these is particularly disruptive.

        If your concern is unplanned outages, then two independent "4 nines" data centers have eight nines of reliability, because there's a 99.99% probability that the second data center will be functional when the first one goes down. Of course, you can't predict susceptibilit

      • Actually, I think your math is a bit off.

        A 4 9s datacenter is down 0.01% of the time (a probability of 0.0001). The chance of two independent 4 9s datacenters being down simultaneously is 0.0001 squared (0.00000001, or 0.000001%). A 5 9s datacenter is down 0.001% of the time (0.00001). Therefore, two 4 9s datacenters are about a thousand times less likely to be down than one 5 9s datacenter (assuming independent failures, and assuming I did my math right). That's why RAID works.
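
For anyone who wants to check the nines arithmetic being argued over in this subthread, here is a rough Python sketch. It assumes "N nines" means a downtime probability of 10^-N and that the sites fail independently, which real datacenters sharing a network or a region may not; the function names are made up for illustration.

```python
def downtime_probability(nines: int) -> float:
    """Probability that a site rated at `nines` nines is down (assumed 10**-nines)."""
    return 10.0 ** (-nines)


def all_down_probability(nines_per_site) -> float:
    """Probability that every site is down at once, assuming independent failures."""
    p = 1.0
    for nines in nines_per_site:
        p *= downtime_probability(nines)
    return p


if __name__ == "__main__":
    one_five_nines = all_down_probability([5])
    two_four_nines = all_down_probability([4, 4])
    print(f"one 5-nines site down:    {one_five_nines:.1e}")
    print(f"two 4-nines sites down:   {two_four_nines:.1e}")
    print(f"ratio (single / paired):  {one_five_nines / two_four_nines:.0f}")
```

The independence assumption carries the whole result; any shared dependency (automated failover, a common upstream link) pulls the paired figure back toward that of a single site.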
    • by drsmithy (35869)

      Why go with a huge, multiple 9's datacenter, when you can go the way of Google, and have a RAID: Redundant Array of Inexpensive Datacenters..

      Because most systems don't scale horizontally and most businesses don't have the resources of Google to create their own that do.

    • by dkf (304284)

      Is it really better to have 1000 machines in a 5-9's location, or 500 systems each in a 4-9's, with extra cash in hand?

      Remember that the main problems with these datacenters are in networking (because that can propagate failures) and automated failover systems. Given that, go for the cash in hand, since you can do other stuff with that (including buying disaster recovery insurance if appropriate).

  • uptime matters (Score:3, Insightful)

    by Spazmania (174582) on Tuesday September 22, 2009 @12:54PM (#29505861) Homepage

    Designing nontrivial systems without single points of failure is difficult and expensive. Worse, it has to be built in from the ground up. Which it rarely is: by the time a system is valuable enough to merit the cost of a failover system, the design choices which limit certain components to single devices have long since been made.

    Which means uptime matters. 1% downtime is more than 3 days a year. Unacceptable.

    The TIA-942 data center tiers are a formulaic way of achieving satisfactory uptime. They've been carefully studied and statistically tier-3 data centers achieve three 9's uptime (99.9%) while tier-4 data centers achieve four 9's. Tiers 1 and 2 only achieve two 9's.

    Are there other ways of achieving the same or better uptime? Of course. But they haven't been as carefully studied, which means you can't assign as high a confidence to your uptime estimate.

    Is it possible to build a tier-4 data center that doesn't achieve four 9's? Of course. All you have to do is put your eggs in one basket (like buying all the same brand of UPS) and then have yourself a cascade failure. But with a competent system architect, a tier-4 data center will tend to achieve at least 99.99% annual uptime.
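
To put the uptime percentages in this comment into hours and days, a small Python sketch follows. It assumes a 365-day year and treats uptime as a simple annual fraction; the helper name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def annual_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a service is down at the given uptime percentage."""
    return MINUTES_PER_YEAR * (1.0 - uptime_percent / 100.0)


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(uptime)
        print(f"{uptime}% uptime: {minutes:.1f} min/year "
              f"({minutes / 60:.2f} h, {minutes / 1440:.3f} days)")
```

At 99% this works out to a little over three and a half days a year, consistent with the "more than 3 days" figure above.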

  • European bank IT people are some of the most conservative and risk-averse people on the planet. If you ask them which is more important, infrastructure or best practices, they will answer "Yes."
    ----------
    Change is inevitable. Progress is not.
    • Re: (Score:1, Interesting)

      by Anonymous Coward

      I work for a very very large European bank. And yes - we're highly risk averse.

      Here's the interesting thing - we built a bunch of Tier 3 and Tier 4 datacenters because the infrastructure guys thought that it was what the organization needed.

      But they didn't talk to the consumers of their services - the application development folks.

      So what do we have -

      Redundant datacenters with redundant power supplies with redundant networks with redundant storage networks with redundant WAN connections with redundant data

  • On a strict IT budget cost-effectiveness basis, the most uptime for your dollar will be Windows (Windows admins practically grow on trees, so they are cheap) on some commodity Pizza Box servers, connected to some cheap NAS storage and networked with crap switches. If you are an IT manager looking for your short-term bonus before you move on to greener pastures, this is a great idea! There is a good chance you will be able to hold things together long enough to get your bonus, and then get outta there.

    Of co

    • Or, to be slightly more robust, Windows or Linux on redundant commodity boxes, with mid grade disk and network components, set up in redundant locations, will serve a lot of needs for lower cost. Not to go all MBA on you or anything, but a smart management team would look at the cost of providing the last 9 of reliability, against the cost of x days of outage, multiplied by some reasonable percentage of the likelihood of the outage, and then ask, does it make financial sense to insure against the extremely
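
The comparison described above (cost of the last nine versus outage cost times likelihood) can be written down in a few lines of Python. This is only a sketch: every figure is a made-up placeholder, and the function names are hypothetical.

```python
def expected_outage_loss(outage_days: float,
                         cost_per_day: float,
                         annual_probability: float) -> float:
    """Expected yearly loss: outage length x daily cost x chance it happens in a year."""
    return outage_days * cost_per_day * annual_probability


def extra_nine_pays_off(annual_upgrade_cost: float,
                        outage_days: float,
                        cost_per_day: float,
                        annual_probability: float) -> bool:
    """True if the expected loss avoided exceeds what the extra reliability costs per year."""
    loss = expected_outage_loss(outage_days, cost_per_day, annual_probability)
    return loss > annual_upgrade_cost


if __name__ == "__main__":
    # Placeholder numbers only: a 2-day outage costing 50,000 per day,
    # with a 10% chance per year, against a 25,000 per year upgrade.
    loss = expected_outage_loss(2, 50_000, 0.10)
    print(f"expected annual loss: {loss:.0f}")
    print("upgrade pays off:", extra_nine_pays_off(25_000, 2, 50_000, 0.10))
```

It ignores reputational damage and correlated failures, so treat the answer as a starting point for the conversation rather than a verdict.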

