Why Power Failures Can Always Lead To Data Loss

Posted by timothy on Wed Jul 23, 2008 12:03 PM
from the when-velcro-snags-shoelaces dept.
bigsmoke writes "So, all your servers run on RAID. You back up religiously. You're even sure that your backups are recoverable. But do you also need a UPS? According to Halfgaar (on Slashdot before to promote better Linux backup practices), yes, usually you do. He argues that despite technological advancements such as file system journaling, power failures can still cause data loss in most setups."

Related Stories

[+] Backing up a Linux (or Other *nix) System 134 comments
bigsmoke writes "My buddy Halfgaar finally got sick of all the helpful users on forums and mailing lists who keep suggesting backup methods and strategies to others which simply don't, won't and can't work. According to him, this indicates that most of the backups made by *nix users simply won't help you recover, while you'd think that disaster recovery is the whole point of doing backups. So, now he explains to the world once and for all what's involved in backing up *nix systems."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by Skyshadow (508) * on Wednesday July 23 2008, @12:05PM (#24306965) Homepage

    Power losses can cause data loss? Gee, you mean that my system that relies on electricity for everything it does can be adversely affected by power outages even if I take precautions? That's some good admin work there, Lou -- if only there was some sort of law that covered the tendency of things that can go wrong to go wrong...

    Next week: Fires can make things warm, floods can make things wet.

    • by Anonymous Coward on Wednesday July 23 2008, @12:08PM (#24307031)

      I don't know about you, but my servers run on the power of cotton candy and happy thoughts.

      • by Skyshadow (508) * on Wednesday July 23 2008, @12:10PM (#24307069) Homepage

        I don't know about you, but my servers run on the power of cotton candy and happy thoughts.

        As a former sysadmin, I would think that any machine reliant on 'happy thoughts' would be the most crash-prone system in the history of computing.

        • by Anonymous Coward on Wednesday July 23 2008, @12:36PM (#24307523)

          I can offer you a Happy Thought UPS. It's a box of puppies. Be careful though, it only has 500 puppy Amps of capacity.

          • by mweather (1089505) on Wednesday July 23 2008, @03:05PM (#24310031)
            I tried one of those. You gotta keep adding food to it or it stops working after a week or two. Starts stinking, too.
              • Mmmm! Puppies!!! (Score:5, Interesting)

                by Gription (1006467) on Wednesday July 23 2008, @02:20PM (#24309381)
                Less filling but tastes great!


                Ok back on subject
                A UPS isn't even a panacea... I had a server lose 3 out of 4 HDs in a 4 hour period. (The 3rd drive went at 4:57 PM Thursday Dec 11th 1997. Not that I would remember...) When I looked at the service history on it, it had been losing drives for 8 months at an accelerating rate.

                Turns out that the 3000 VA rack mount wonder UPS from that big, well known vendor was the problem. The switching unit in it was sending spikes into the equipment.

                They wouldn't warranty it so I ended up putting a Tripp Lite Isobar surge suppressor between it and the server in our test environment and it was in service for years after that.

                Never trust any piece of equipment...
                • by maglor_83 (856254) on Wednesday July 23 2008, @06:50PM (#24312799)

                  They wouldn't warranty it so I ended up putting a Tripp Lite Isobar surge suppressor between it and the server in our test environment and it was in service for years after that.

                  Never trust any piece of equipment...

                  You mean like a Tripp Lite Isobar surge suppressor?

        • by ArsonSmith (13997) on Wednesday July 23 2008, @12:46PM (#24307725) Journal

          Except the server that runs http://youporn.com/ [youporn.com]

      • by NFN_NLN (633283) on Wednesday July 23 2008, @12:24PM (#24307291)

        My servers run on Electricity but the RAID controller has battery backed up RAM so any cached data will persist a power failure and the disks are in writethrough mode.

        I like this setup, but please, tell me more about this cotton candy technology. Is it superior?

        • by MightyMartian (840721) on Wednesday July 23 2008, @12:32PM (#24307443) Journal

          My servers run on Electricity but the RAID controller has battery backed up RAM so any cached data will persist a power failure and the disks are in writethrough mode.

          That is, until the 10,000-volt spike from the power company improperly bringing the grid back up bakes the RAM, the battery, the RAID controller and the hard drives.

          • Voltage Spikes (Score:5, Informative)

            by natoochtoniket (763630) on Wednesday July 23 2008, @01:12PM (#24308259)

            The typical small UPS system has some amount of surge protection built in, but it's typically only good for a couple thousand joules at most. And if you get a spike that is big enough to blow a varistor, you also get to buy a new UPS.

            A better solution is to put a "whole house" surge protector on the circuit-breaker panel. It protects everything, with a much higher number of joules. Five or six pounds of varistors can absorb a lot more shock than one ounce of varistors. They cost about $100, and can be found at most big hardware stores or electrical supply houses. That doesn't eliminate the need for a UPS. It does protect the UPS, along with the other equipment, from most voltage spikes.

            Last year, lightning hit the power pole 20 feet from my house. We know where it hit because the pole caught fire. My next-door neighbors on both sides lost every single piece of electrical equipment -- not just computers, TV's, and stereos, but also fridge, microwave, water heater, and range. All of it was damaged beyond repair. We barely noticed the hit, except for the bright flash of light, and had no damage at all.

              • Re:Voltage Spikes (Score:5, Informative)

                by natoochtoniket (763630) on Wednesday July 23 2008, @03:26PM (#24310331)

                The path-to-ground is really important, as is the quality of the ground. The length of the path is the reason why whole-house devices are installed at the service entrance panel. But, that assumes that your service-entrance ground is a good ground.

                If your ground is not good, shorting to ground won't do much good. A lot of houses around here are grounded to plumbing pipe that is buried just 12" deep. During a dry spell a few years ago, I detected variable voltage where it shouldn't have been. The voltage problems cleared up after I added an 8-foot vertical ground rod to the system.

                The thing that kills a surge protector is too many amps for too long. If it shorts the power to ground (low-resistance), but the ground is not really well-grounded, then the whole thing can float close to line voltage. In that case, that voltage can destroy your other devices, while the surge unit never gets enough current to burn the varistors.

              • by jcochran (309950) on Wednesday July 23 2008, @02:41PM (#24309667)

                All you need to do is have the grid power feed some high wattage light bulbs. And near the light bulbs are some solar cells. The output from the solar cells is used to charge batteries which feed an inverter that actually powers the computer. Of course there is some power loss in the conversion process, and you need to have some (ok, a lot) of the input power to the system committed towards running a cooling unit to keep things at a reasonable temperature. But the resulting device provides clean power with no possibility of any surges getting thru to the protected equipment.

                Of course, if you go to this level of trouble for your power source, then I'd also suggest opto-isolating all signal lines to and from the server. And enclose the server in a well-grounded Faraday cage. And it wouldn't be a bad idea to have a dedicated comm link to a duplicate server located elsewhere. Preferably on a different tectonic plate.

                • Re:Ah, that's easy (Score:4, Interesting)

                  by rcw-work (30090) on Wednesday July 23 2008, @03:01PM (#24309959)

                  All you need to do is have the grid power feed some high wattage light bulbs. And near the light bulbs are some solar cells.

                  You now have a 1% efficient power supply.

                  A slightly more practical option (with better isolation than a standard electromagnetic transformer, but unfortunately also some inductive effects) would be to couple two motors with an insulative shaft.

        • by supersat (639745) on Wednesday July 23 2008, @01:55PM (#24309025)
          Are you sure your disks are in write-through mode? Have you checked [livejournal.com]? Brad Fitzpatrick (of LiveJournal, memcached, OpenID, etc. fame) discovered that many disks lie about being in write-through mode, and wrote a utility to check it.
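
A rough way to get a first hint, short of pulling the plug: time how many write+fsync cycles per second the storage completes. This is not the utility linked above, only a timing heuristic. The function names, the 7200 RPM default and the factor-of-two threshold are assumptions made for illustration. SSDs and battery-backed controllers will legitimately exceed the threshold, so a high number is a reason to investigate, not proof that the disk is lying.

```python
import os
import tempfile
import time


def fsync_rate(directory, writes=200, block=b"x" * 512):
    """Return write+fsync cycles per second achieved on the storage behind `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to push the data down to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed if elapsed > 0 else float("inf")


def looks_cached(rate, rpm=7200):
    """Heuristic (assumption): a bare spinning disk commits roughly one small
    synchronous write per rotation, i.e. about rpm / 60 per second. A rate far
    above that suggests a volatile write cache is answering the fsync."""
    return rate > 2 * (rpm / 60)


if __name__ == "__main__":
    rate = fsync_rate(tempfile.gettempdir())
    print(f"{rate:.0f} fsyncs/sec, suspiciously fast: {looks_cached(rate)}")
```

Run it against a directory on the disk you care about; the temp directory is often on a different device or on tmpfs.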
    • by Anonymous Coward on Wednesday July 23 2008, @12:11PM (#24307091)

      Ok, people who don't just read the executive summary knew this all along, but perhaps it's necessary that someone spells it out for the rest: Journaling and RAID do not prevent data loss in case of a power outage (and many more circumstances). If you know why, just skip the article. If you're wondering how you can lose data if you write everything to two disks and your filesystem guarantees its own consistency, then perhaps this is the wake up call that you need.

    • if only there was some sort of law that covered the tendency of things that can go wrong to go wrong.

      I hear Murphy might have one :)

    • No, it really does have some interesting observations, with some very scary implications:

      One of the first things that will happen, is that the memory DIMMs will no longer be refreshed properly (DRAM needs to be refreshed constantly otherwise it will loose it's data) and very rapidly, the memory will contain only garbage. The hard drives and DMA controller however, will run a bit longer; so if data is being written to disk, the DMA controller will keep reading data from memory, but it has no idea that this data is corrupted.

      However, we've recently seen that RAM holds state well enough to preserve crypto keys thru a power cycle [hackaday.com]. This has very scary implications: the RAM knows what's happening, and behaves differently (loses data immediately on power-off or remembers it for several seconds) in order to cause the most difficulty for the owner of the machine.

      Not only are computer components intelligent and self-aware, they're also out to get us!

  • Illiteracy (Score:5, Funny)

    by carou (88501) on Wednesday July 23 2008, @12:06PM (#24307005) Homepage Journal

    From TFA:

    (DRAM needs to be refreshed constantly otherwise it will loose it's data)

    Fly, little data! Be free!

  • by internerdj (1319281) on Wednesday July 23 2008, @12:07PM (#24307009)
    Definitely maybe?
  • by Zebadias (861722) on Wednesday July 23 2008, @12:07PM (#24307011)
    A UPS smooths out all those nasty spikes as well as stopping your servers from going down during a 1-second power cut.

    UPS is more than just saving your data.

    • by linuxpyro (680927) on Wednesday July 23 2008, @12:17PM (#24307173) Homepage

      It's also important to get a decent UPS if you're using it for something like a server. I think the cheap ones basically just use a transfer relay, whereas the higher-end ones actually run the hardware off of the battery via the inverter all the time. While I would think that with the former (called "standby" UPSs maybe?) the transfer time wouldn't be enough to cause too many problems, you still don't have the buffer that you'd get with a true uninterruptible power supply.

      I think a lot of the cheaper ones don't put out a true sine wave either, though for their intended purpose of letting you shut down your desktop cleanly they're probably fine.

      • by SuperQ (431) * on Wednesday July 23 2008, @12:45PM (#24307709) Homepage

        Yup the 3 major types of battery UPSs I know of:

        Offline - Relay or simple failover. (APC Back-UPS)

        Line Interactive - Can correct line over/under voltage to a point (APC Smart-UPS)

        Online - Full AC -> DC -> AC conversion. (APC Symmetra, Liebert, anything that doesn't suck)

        Basically outside of home use you want an online type UPS.

        There are other systems like motor/generator flywheel types, but they need a very fast backup generator to sustain anything more than 30 seconds of outage. But they're great for smoothing out some types of line issues.

  • Duh! (Score:5, Insightful)

    by mlwmohawk (801821) on Wednesday July 23 2008, @12:08PM (#24307029)

    I remember a discussion on the PostgreSQL hacker's list about recoverability and transaction logs.

    You can't make a system that will not lose data, you can only make a system that knows the last save point of 100% integrity.

    There are too many variables and too much randomness on a cold hard power failure. You absolutely need a UPS that gives you time to shut down cleanly.
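
To make the "last save point of 100% integrity" idea concrete, here is a minimal sketch of the usual write-to-temp, fsync, rename pattern for a single file. It is not how PostgreSQL's transaction log works; it only shows the same principle on a small scale. The function name is made up for the example, the directory fsync is POSIX-specific, and the whole thing assumes the disk honors fsync, which is exactly what a cold power failure puts to the test.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` so that a reader sees either the old or the new content,
    never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # new content is on disk before it becomes visible
        os.replace(tmp, path)     # atomic rename: this is the save point
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Make the rename itself durable by syncing the directory entry (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If power fails before the rename, the old file is still intact; if it fails after, the new one is. What is lost is whatever had not reached a save point yet.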

    • Re:Duh! (Score:4, Insightful)

      by sm62704 (957197) on Wednesday July 23 2008, @12:31PM (#24307433) Journal

      You're still hosed if your server's power supply goes titsup. Or if your hard drive crashes. Or if the building burns down.

      Gotta love these slashvertisements, I wonder whose UPSes they're pimping? It's not like we don't all know you need a UPS. What's next, a FA about how you need fire insurance?

  • by pembo13 (770295) on Wednesday July 23 2008, @12:12PM (#24307103) Homepage
    APC is the only UPS maker on the market that has at least spent some small effort so that their UPSs can be properly integrated with a Linux machine. I made the mistake of purchasing an Ultra UPS as it was cheaper than the APC.
  • by JesseL (107722) on Wednesday July 23 2008, @12:13PM (#24307121) Homepage Journal

    is a weak spot in the design of most computers.

    Computer power supplies should be built with enough spare capacitance to run things long enough for the computer to save critical data, and operating systems and critical apps should be able to handle an emergency shutdown and save critical data in very short order.

    This is old hat in embedded systems.

    • by mlwmohawk (801821) on Wednesday July 23 2008, @12:19PM (#24307219)

      Computer power supplies should be built with enough spare capacitance to run things long enough for the computer to save critical data

      Here's a question for you: Calculate the size of the capacitor needed that can hold enough power to run a 200W load for 5 minutes and maintain a voltage level within a specific usable range.

      Hint: it's BIG. Batteries are more space efficient, but the chemicals and outgassing make them inappropriate for location INSIDE the computer box.
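
For anyone who wants to run the numbers, a back-of-the-envelope sketch. It assumes an ideal capacitor sitting directly on a 12 V rail that may sag to 11.4 V (a 5% window picked only for illustration) and ignores conversion losses; real supplies store their hold-up energy elsewhere, so treat the result as an order of magnitude. The usable energy of a capacitor discharging from V_start to V_min is 0.5 * C * (V_start^2 - V_min^2), and the load needs P * t.

```python
def holdup_capacitance(load_watts, seconds, v_start, v_min):
    """Farads needed to carry `load_watts` for `seconds` while the capacitor
    voltage sags from v_start to v_min (ideal capacitor, lossless conversion)."""
    energy_joules = load_watts * seconds
    usable_joules_per_farad = 0.5 * (v_start**2 - v_min**2)
    return energy_joules / usable_joules_per_farad


# 200 W for 5 minutes inside the assumed 12.0 V -> 11.4 V window
farads = holdup_capacitance(200, 5 * 60, v_start=12.0, v_min=11.4)
print(f"{farads:,.0f} F")
```

Under these assumptions the answer lands in the thousands of farads. The required capacitance scales linearly with runtime, so plugging in a shorter duration shrinks it proportionally.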

      • by JesseL (107722) on Wednesday July 23 2008, @12:27PM (#24307319) Homepage Journal

        Who the hell is talking about 5 minutes!? I'm saying you should be able to get a clean shutdown in 5 seconds if you prioritize it correctly.

      • by Locklin (1074657) on Wednesday July 23 2008, @12:28PM (#24307355) Homepage

        Why 5 minutes? It usually takes less than a second to run a sync on the disks depending on how active they are. A couple seconds of runtime should be enough to do an "emergency shutdown" and avoid data corruption.

        ####@johncash:~$ time sync

        real 0m0.004s
        user 0m0.004s
        sys 0m0.000s

        • by Firehed (942385) on Wednesday July 23 2008, @12:45PM (#24307705) Homepage

          Other than the lack of communication at present between the PSU and the rest of the system (on a hardware and software level), what you're describing really seems to be the computer equivalent of throwing your hands in front of your nuts as you spot the incoming baseball. It helps the immediate problem of data (or testicle) loss, but it's really just a small amount of damage control.

          This is why a proper UPS that can trigger a full system shutdown once you hit a certain power remaining threshold is far preferable. Granted I'd rather have a controlled crash than the risky nonsense that would come from the power cord being yanked, but (right now) computers can only go so far to help themselves in a couple-second window.

        • by jimicus (737525) on Wednesday July 23 2008, @02:09PM (#24309241) Homepage

          Why 5 minutes? It usually takes less than a second to run a sync on the disks depending on how active they are. A couple seconds of runtime should be enough to do an "emergency shutdown" and avoid data corruption.

          ####@johncash:~$ time sync

          real 0m0.004s
          user 0m0.004s
          sys 0m0.000s

          That will sync the disks, but it won't stop the database from accepting incoming data. It won't stop cron jobs which might be just about to trigger. It won't deal with tasks that are in the middle of a big operation which involves a lot of writing to disk.

    • by Macman408 (1308925) on Wednesday July 23 2008, @12:27PM (#24307321)

      This is old hat in embedded systems.

      Yes, but embedded systems usually have lower power requirements, or at the very least, a smaller range of power requirements. You can't add 3 PCIe cards, a few extra drives, and a few more GB of RAM to most embedded systems.

      I worked on the design of an embedded system a few years ago that had a holdup spec - I think it was supposed to survive for 50 ms with no power. So a 50 ms power interruption would result in continued operation, while an outage longer than that was allowed to reset the board. However, the power draw on the board was around 200 Watts; being able to supply that much power for that long in a fairly compact form factor was a huge hurdle. It also caused airflow problems, because the giant capacitors would prevent air from getting to other components on the board, like the CPU. In the next version of the spec, I believe the holdup requirement was eliminated - apparently we weren't the only ones having trouble meeting that requirement.

    • Our Tandem (Score:5, Interesting)

      by PIPBoy3000 (619296) on Wednesday July 23 2008, @12:37PM (#24307541)
      This reminds me of my favorite power loss story. The facility was doing a generator test, where we were supposed to switch over from city power to the generator. Unfortunately it didn't happen smoothly and the UPS kicked in. Sadly it turned out that so many servers had been added since the original design, the UPS was really only good for fifteen minutes or so. The final problem was that our operator didn't notice the issue quickly enough, and so the next thing everyone in IT knew was that our main data center had just lost power.

      We spent most of the day getting our servers back up from various states of disrepair (confirming the article, power loss is superbad). It turns out that our main medical software ran on a Tandem. Though the drives and such lost power, the CPU had a backup of D-batteries and survived the power loss just fine. Needless to say, we stopped making fun of their seemingly primitive emergency backup power.
  • by Joebert (946227) on Wednesday July 23 2008, @12:13PM (#24307123) Homepage
    The funny part is someone had to have thought they were safe without a UPS for this to become news.
  • by sco_robinso (749990) on Wednesday July 23 2008, @12:14PM (#24307133)
    In my company, everything is behind UPSs. Our SAN is even behind 2 separate UPSs. We thought everything was configured properly, but you'd be surprised what comes to roost when you test everything.

    We recently had a test night where all we did was test the UPS system and shutdown procedures, and there were a couple of gotchas. Interestingly, the APC PowerChute app we were using defaulted to shutting down the UPS completely after the [first] server went down - not good. This was buried fairly deeply in the configuration.

    Equally important to any protection measure, be it RAID, Power Protection, whatever - is testing!
    • by Darkk (1296127) on Wednesday July 23 2008, @12:31PM (#24307419)

      I 100% agree with the idea of testing under controlled conditions. The oops you guys discovered is a good thing to catch early on. I can imagine the look on your support team's faces when the UPS suddenly turned itself off while the remaining servers were still trying to perform a safe shutdown. I'm sure the secondary UPS was left running as a precaution until the test was successful.

      I have seen a screw-up where somebody cut into a live power cord, thinking it was a tie wrap, and caused a major short in the PDU. The guy thought he was safe until he discovered that whoever installed the servers hadn't double-checked the power connections and loads, so it created a cascade failure in several racks and lost several tons of data. Recovery took a while.

      Needless to say, it was not a good day.

  • Get a UPS (Score:4, Insightful)

    by Chemisor (97276) on Wednesday July 23 2008, @12:20PM (#24307229) Journal

    I really can't understand people who don't have a UPS. Don't you care about your data? At all? The UPS is not very expensive (My BackUPS 900 is very nice and only $100), and will last a long time (you just replace the batteries now and then). Once you are on UPS, you can stop worrying about any power issues, journalling file systems, crash recovery, and all that. The computer will never fail due to power. If you run Linux, it will also never fail due to the OS. If you are a normal user, that means your computer will never fail, period. Seriously, there is no excuse for not having a UPS. Go and get one right now!

  • by alta (1263) on Wednesday July 23 2008, @12:23PM (#24307279) Homepage Journal

    Ok, now everyone has something to give to their kid for the sysadmin-in-training class.

    For the rest of us... back to work, nothing here you didn't learn your first year.

    For the poster... Shame shame... Turn in your card.

  • by E-Lad (1262) on Wednesday July 23 2008, @12:31PM (#24307425) Homepage

    ...by design. TFA doesn't delve into too much detail, but a sudden power loss on such software RAID systems is a condition that ZFS accounts for. Its copy-on-write (COW) and write-length striping strategy prevents things such as the RAID5 write hole [sun.com], a condition that is most likely to occur when a power loss event happens.
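
For readers who haven't met the write hole, a toy illustration of the problem in classic single-parity RAID5. It does not model ZFS at all; the two-data-block stripe and the names are made up for the example. The point is only that if power fails between the data write and the parity write, a later rebuild silently produces garbage for a block that was never touched.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate_write_hole():
    # A stripe with two data blocks and their parity, all consistent.
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor_blocks(d0, d1)

    # Rebuilding d1 from a consistent stripe works.
    rebuilt_ok = xor_blocks(d0, parity)

    # Now d0 is rewritten, but power fails before the parity update lands.
    d0_new = b"CCCC"
    stale_parity = parity

    # Later the disk holding d1 dies and is rebuilt from d0_new + stale parity.
    rebuilt_bad = xor_blocks(d0_new, stale_parity)
    return d1, rebuilt_ok, rebuilt_bad


if __name__ == "__main__":
    original, ok, bad = simulate_write_hole()
    print("consistent rebuild matches:", ok == original)
    print("rebuild after interrupted write matches:", bad == original)
```

Note that the corrupted block is d1, which nobody was writing to at the time of the power cut.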

  • Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.

    At work, power is a whole enterprise within the company I work for.

    Dual gas powered Generators at each location, Rooms full of Batteries for the Telecoms gear (most is straight DC) and Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)

    We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.

    Yep. That's right. With sufficient fuel we can be online indefinitely. Which we will have to do if we get hit by a major hurricane.

    Which means the phone network is a lot more reliable than the Power grid where I live.

    As for data loss, I have over the years done a lot of recovery work. "Murphy" of "Murphy's Law" fame isn't a guy or a girl. He is a demon from the darkest pits of hell sent to torment the souls of IT workers everywhere.

    Imagine a server where UPS #2 is down for repairs and UPS #1 fails during a power cut. When everything comes back up, we find 2 failed hard drives in the RAID 5 on the email server.

    Despite previous testing and confirmation that the backups work, the most recent tapes failed to read.

    Eventually we sent the failed drives off to a Data recovery company in Florida because

    #1. The customer can afford it.
    #2. Simply "skipping" a few days of Email is not an option for a bank (hence the ability to afford data recovery).

    So yeah. A UPS is essential. Just like RAID, Clustering and Backups but in the end it can all fail.

    Best advice? Memorize all your important data. That way if you lose your mind, you are not responsible for the lost data (or anything else).

  • by rwa2 (4391) * on Wednesday July 23 2008, @12:34PM (#24307481) Homepage Journal

    UPS units are relatively cheap, it's well worthwhile to invest in one, not just to protect from data loss:

    * Hardware loss: I've seen a lot of hardware blown up from power interruptions. Do you trust your power company that much to provide clean power to you? Sure, surge protectors help a bit, but a decent UPS costs maybe twice as much as a good surge protector.

    * Time lost restoring your session after blackouts / brownouts: OK, maybe you're used to restarting your computer every morning anyway. But I like to leave things open and return to my desktop just the way I left it arranged.

    * Stats: Using NUT and Munin, you get to monitor and log your power, so you can see things like exactly when your electricity went out and for how long, what load your PC is drawing after that last upgrade, etc. e.g.: http://hairball.bumba.net/cgi-bin/nut/upsstats.cgi?host=apc@localhost [bumba.net]

    * Graceful shutdown: you have a chance to tell your buddies that your power just went out, and you'll be coming back once it's restored.

    Frankly, I'm a little surprised a backup battery isn't built into PC power supplies already, so they'd work a bit more like laptops. Same with networking gear.
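
As a small illustration of the stats and graceful-shutdown points, here is a sketch that reads NUT's `upsc` output (lines of the form `key: value`) and decides whether it is time to shut down. The UPS name `apc@localhost` is taken from the URL above; the 30% threshold and the function names are assumptions for the example. NUT ships its own monitor for triggering shutdowns, so this is only meant to show what the data looks like, not to replace it.

```python
import subprocess


def parse_upsc(text):
    """Turn `upsc` output ("key: value" per line) into a dict of strings."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def should_shut_down(values, min_charge=30.0):
    """True when the UPS is on battery (OB) and either reports low battery (LB)
    or has dropped below `min_charge` percent (threshold is an assumption)."""
    flags = values.get("ups.status", "").split()
    charge = float(values.get("battery.charge", "100"))
    return "OB" in flags and ("LB" in flags or charge < min_charge)


def read_ups(name="apc@localhost"):
    """Query a running NUT server through the `upsc` command-line client."""
    result = subprocess.run(["upsc", name], capture_output=True, text=True, check=True)
    return parse_upsc(result.stdout)


if __name__ == "__main__":
    status = read_ups()
    print("load %:", status.get("ups.load"), "charge %:", status.get("battery.charge"))
    print("shut down now:", should_shut_down(status))
```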

  • by jalet (36114) on Wednesday July 23 2008, @12:58PM (#24308007) Homepage

    This morning we had a planned shutdown of 100 servers for electricity works; all were on the same 40 kVA UPS. All went fine: we shut down all servers to be safe, and kept some stuff online for monitoring and the like, then main power was shut off. The UPS gladly took the load, with an estimated battery life of 75 minutes, more than what was needed for the electrical work. Once this was done, the electrician put the main power back on, and... the UPS shut down!

    Since all servers were stopped already we didn't lose anything, but we had to put the UPS in bypass mode for a while, then back on, and now we hope for the best waiting for the UPS to be repaired, crossing most of our fingers because of the holidays...

    In summary : testing that the UPS can handle the power coming back is as important as testing for it to be able to handle the power shutting down.

  • by SleptThroughClass (1127287) on Wednesday July 23 2008, @01:16PM (#24308311) Journal
    The author did not mention having the system set up to have the UPS trigger an automatic shutdown.

    If you're not at the machine, or don't know how to shut down without a CRT, the disk can get messed up when the UPS runs out of power. Unless you only have a desktop machine with no network applications writing to disk (no BitTorrent); then you might be OK if you just walk away from your keyboard and let the system become quiescent before it loses power.