Data Storage IT

Why Mirroring Is Not a Backup Solution 711

Posted by kdawson
from the pointed-lesson dept.
Craig writes "Journalspace.com has fallen and can't get up. The post on their site describes how their entire database was overwritten through either some inconceivable OS or application bug, or more likely a malicious act. Regardless of how the data was lost, their undoing appears to have been that they treated drive mirroring as a backup and have now paid the ultimate price for not having point-in-time backups of the data that was their business." The site had been in business since 2002 and had an Alexa page rank of 106,881. Quantcast said they had 14,000 monthly visitors recently. No word on how many thousands of bloggers' entire output has evaporated.
This discussion has been archived. No new comments can be posted.

  • rm -rf / (Score:5, Informative)

    by corsec67 (627446) on Friday January 02, 2009 @01:29PM (#26301341) Homepage Journal

    rm -rf /

    That is one reason why mirroring isn't a backup, and why backups should ideally be off-line.

  • Ouch (Score:4, Informative)

    by scubamage (727538) on Friday January 02, 2009 @01:30PM (#26301357)
    We do data hosting, and I can't imagine how catastrophic that would be. Jebus. Let this be an ultimate example of why numerous backups are needed. Always. Without question.
  • by zaibazu (976612) on Friday January 02, 2009 @01:31PM (#26301367)
    Mirroring is inexpensive protection against total hard disk failure, and it is effective at that. It can't help when software goes rogue or a user deletes the wrong files.
  • The rules of backups (Score:5, Informative)

    by Anonymous Coward on Friday January 02, 2009 @01:40PM (#26301511)

    The rules of backups:

    1. Back up all your data
    2. Back up frequently
    3. Take some backups off-site
    4. Keep some old backups
    5. Test your backups
    6. Secure your backups
    7. Perform integrity checking
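Rules 5 and 7 above (test your backups, perform integrity checking) can be sketched roughly as follows. This is only an illustration, assuming the backup is a plain directory copy; the function names `sha256_of`, `build_manifest`, and `verify_backup` are hypothetical and do not come from any poster's setup.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(source_manifest: dict[str, str], backup_root: Path) -> list[str]:
    """Return the relative paths that are missing or differ in the backup."""
    backup_manifest = build_manifest(backup_root)
    return sorted(
        name
        for name, digest in source_manifest.items()
        if backup_manifest.get(name) != digest
    )
```

Building the manifest when the backup is taken and checking it again later would catch silent corruption of the backup media, though it says nothing about whether a full restore actually works.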

  • by uncledrax (112438) on Friday January 02, 2009 @01:41PM (#26301521) Homepage

    It's more an issue that some people think HA == DR, which, as this story reminds us, is not the case.

    Mirroring / RAID == HA. If one of your HDDs lets the smoke out, you still don't incur downtime. If you have a hot spare, you're even better off. All it does is give you a little time to correct the issue (i.e., "It can wait until morning").

    Also, one other very important thing: mirroring doesn't prevent or repair data corruption. If you're mirroring your rm -rf (as pointed out by corsec67 above), your RAID will happily do what it does and span your command to all your disks. Congrats, you just successfully gave yourself HA for your disk erasing! :]

    Backups are DR. If your RAID croaks, you're SOL if you don't have off-machine backups. If you accidentally nuke your disks with an rm or something, you can still go back and restore data. Sure, you'll likely lose -some- data, but -some- is better than all in this case.
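The HA versus DR distinction above can be made concrete with a toy model. This is a deliberately simplified sketch, not how any real RAID controller or backup product works; `MirroredVolume` and `BackupStore` are hypothetical names invented for the illustration.

```python
import copy


class MirroredVolume:
    """Two replicas that receive every write, including destructive ones."""

    def __init__(self):
        self.replicas = [{}, {}]

    def write(self, name, data):
        for replica in self.replicas:
            if replica is not None:
                replica[name] = data

    def delete_all(self):
        # The equivalent of rm -rf: faithfully applied to every live replica.
        for replica in self.replicas:
            if replica is not None:
                replica.clear()

    def fail_disk(self, index):
        # Hardware failure: one replica is lost, the other keeps serving.
        self.replicas[index] = None

    def read(self, name):
        for replica in self.replicas:
            if replica is not None and name in replica:
                return replica[name]
        raise KeyError(name)


class BackupStore:
    """Holds independent point-in-time copies, untouched by live writes."""

    def __init__(self):
        self.snapshots = {}

    def take(self, label, volume):
        live = next(r for r in volume.replicas if r is not None)
        self.snapshots[label] = copy.deepcopy(live)

    def restore(self, label, volume):
        for i, replica in enumerate(volume.replicas):
            if replica is not None:
                volume.replicas[i] = copy.deepcopy(self.snapshots[label])
```

In this model a failed disk costs nothing because the surviving replica still answers reads, while `delete_all` empties both replicas at once and only a snapshot taken earlier brings the data back.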

  • by emag (4640) <slashdot@gREDHATurski.org minus distro> on Friday January 02, 2009 @01:51PM (#26301695) Homepage

    Even the greenest IT employee knows that mirroring is to protect against hard drive failure and not software corruption.

    I only wish that were true. I've given up arguing with friends about this, who insist that their mirrors are good enough backups. I just stare at colleagues who think such, especially those who SHOULD know better. And I *know* coworkers are doing this @ work, too, and I'm just waiting for about 50TB of data to suddenly go missing...

  • Professionalism (Score:1, Informative)

    by theskipper (461997) on Friday January 02, 2009 @01:51PM (#26301699)

    From TFL:

    The data server had only one purpose: maintaining the journalspace database. There were no other web sites or processes running on the server, and it would be impossible for a software bug in journalspace to overwrite the drives, sector by sector.

    The list of potential causes for this disaster is a short one. It includes a catastrophic failure by the operating system (OS X Server, in case you're interested), or a deliberate effort. A disgruntled member of the Lagomorphics team sabotaged some key servers several months ago after he was caught stealing from the company; as awful as the thought is, we can't rule out the possibility of additional sabotage.

    First, it's somewhat lame/unprofessional to list "sabotage" as a possibility. Even if it's the strongest possibility. Adding the OSX comment and that a bug in their code is impossible is even lamer.

    More importantly, if the key servers were sabotaged months ago, the first thing that I'd want to do is make a full image stored in multiple offsite locations. Ignorance of the RAID/backup issue is one thing, but knowing that the saboteur could have sprinkled the db with crap is even scarier.

    Smells like there's more to the story than this. Or not.

  • by computersareevil (244846) on Friday January 02, 2009 @01:52PM (#26301709)

    They also purposely blocked archive.org via a robots.txt exclusion, so the bloggers can't use that to try and recover some of their blogs.

  • by z_gringo (452163) <.z_gringo. .at. .hotmail.com.> on Friday January 02, 2009 @01:55PM (#26301749)
    Nightly dumps of the database and rsync of the data directories to servers in different locations should be adequate. If you have lots of data, I don't see how tapes are really going to handle the daily backup jobs.

    Backing up nightly to a large mirrored NAS, plus a periodic copy to a removable device, seems like a good way to go these days. I haven't used tapes for years.
  • Re:Just give up? (Score:3, Informative)

    by IMarvinTPA (104941) <IMarvinTPA@NoSPAM.IMarvinTPA.com> on Friday January 02, 2009 @02:10PM (#26302015) Homepage Journal

    The article says the data recovery company has found the drives wiped. There is no recoverable data.

    It seems like the actual site failure was on the 23rd or so.

    IMarv

    PS, the internet archive was blocked by their robots, so there isn't even that to look at. http://web.archive.org/web/*/http://www.journalspace.com [archive.org]

  • Google cache diving (Score:5, Informative)

    by Chris Pimlott (16212) on Friday January 02, 2009 @02:10PM (#26302021)

    Looks like at least some content is still in Google's cache [google.com]; those looking to salvage their journals should act quickly.

    You can limit google's search results to a particular site by using the "site:domainname.com" search term (example [google.com]) and then click the "Cached" [209.85.173.132] links of each result to see Google's copy.

    There's also a Greasemonkey script [userscripts.org] for Firefox that can automatically add Google Cache links next to page links, so you can navigate from one cached page to another easier.

  • by xyphor (151066) on Friday January 02, 2009 @02:14PM (#26302079)

    DR is Disaster Recovery

    HA is High Availability

  • by ba_hiker (590565) on Friday January 02, 2009 @02:15PM (#26302091)
    how 'bout this though.. hot-swap mirrored drives. pull 1/2 of the mirror at any time to make a backup. replace the pulled drives with blanks. keep a short stream of backup drives, say 8 or 9. drives are cheap. store in well padded metal boxes, offsite.
  • Re:Professionalism (Score:3, Informative)

    by moderatorrater (1095745) on Friday January 02, 2009 @02:21PM (#26302185)

    Adding the OSX comment and that a bug in their code is impossible is even lamer.

    The drives were overwritten sector by sector on a machine that didn't have any of their code running on it. Their application couldn't have done it because it couldn't execute arbitrary code on that server. The "impossible" comment makes sense to me.

    As for it being lame/unprofessional to name the possibilities, I disagree. He states the OS it was running on and said that it was either an OS problem or sabotage. There might be a few possibilities, but that about sums them up right there. He was being thorough and open; what's the problem with that?

  • Re:Just give up? (Score:3, Informative)

    by girlintraining (1395911) on Friday January 02, 2009 @02:35PM (#26302427)

    Since you're the only poster to reply without yelling "idiot" (thanks, btw) -- Zeroing the drive makes software recovery impossible. It doesn't make data recovery impossible. There are ways to read the offset data, though this is getting harder as magnetic densities increase every year. Ontrack data recovery specializes in that kind of thing. I've seen them do it. Granted, it's not a 100% thing -- you don't get back something that even resembles a filesystem. At least a third of it is uselessly garbled binary.

  • by Trixter (9555) on Friday January 02, 2009 @03:01PM (#26302855) Homepage

    Amen, and why I just love ZFS (or any filesystem that supports instant snapshots). I use mirroring to cover drive failures, and I use weekly snapshots as backups. Once every three months, I offline to external disk. Minimum cost, more than reasonable protection.

  • Re:Double Duh! (Score:3, Informative)

    by CarpetShark (865376) on Friday January 02, 2009 @03:11PM (#26302989)

    All they needed was a large enough USB attached disk

    Correction: all they needed was a large enough, functional, external disk.

    Finding functional external drive products isn't so easy, I've discovered.

  • Re:Double Duh! (Score:2, Informative)

    by Anonymous Coward on Friday January 02, 2009 @03:13PM (#26303009)

    Not saying OS X is not pretty good on the desktop/laptop, just that for your servers you should use Linux or possibly Solaris or BSD, but not OS X or Windows.

    So, you say they should use BSD (among other options) yet say they shouldn't use BSD or windows?

    I do agree with your Windows point, but as to your confusing BSD comment, I'd have to say you were right the first time and wrong the second.

    And since I know this joke will go right over your head, here's a tip:
    OS X is BSD

  • by mortonda (5175) on Friday January 02, 2009 @03:17PM (#26303065)

    Backups must be:

    1) Automated - if you need human intervention, it will fail

    2) Point-in-time - the system must be able to provide restores for a set of times, as fitting for the turn around on your data. A good default is: daily backups for a week, weekly for a month, and monthly for a year

    3) TESTED: You must fully test the restoration process (if this can be automated, even better). Backups that you can't restore from a bare machine are worthless.

    For better disaster recovery, backups should be:

    4) offsite - if a fire or tornado hits, is the backup somewhere else?

    5) easily accessible - how long will it take to get the restore going?
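The default in point 2 (daily for a week, weekly for a month, monthly for a year) could be turned into a pruning rule along these lines. This is one possible reading of that rule of thumb, and the cutoffs of 7, 31, and 365 days and the function name `backups_to_keep` are assumptions made for the sketch.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today):
    """Pick which backups to retain under a daily/weekly/monthly scheme.

    Assumed policy:
      - every backup from the last 7 days
      - the newest backup of each ISO week within the last 31 days
      - the newest backup of each calendar month within the last 365 days
    """
    keep = set()
    newest_per_week = {}
    newest_per_month = {}
    # Ascending order means later dates overwrite earlier ones in each bucket.
    for day in sorted(backup_dates):
        age = (today - day).days
        if age < 0:
            continue  # ignore backups dated in the future
        if age < 7:
            keep.add(day)
        if age < 31:
            newest_per_week[tuple(day.isocalendar())[:2]] = day
        if age < 365:
            newest_per_month[(day.year, day.month)] = day
    keep.update(newest_per_week.values())
    keep.update(newest_per_month.values())
    return sorted(keep)
```

Anything not returned is a candidate for deletion, so a scheme like this keeps a year of restore points at the cost of a couple of dozen stored backups rather than hundreds.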

  • by The Blue Meanie (223473) on Friday January 02, 2009 @03:22PM (#26303125)

    People who care about their data and their business know what they mean.

    Although, at my particular shop, we use the term "BC" instead of "HA".
    BC = Business Continuance (HA = High Availability)
    DR = Disaster Recovery

    BC = "Looks like we just lost a drive in the array. Better replace that right away." or "Oops, broke one of the multiple fibers to the SAN. Where's the spare again?"
    BC also applies to our load-balanced clusters of web servers and application servers that allow for the offlining or loss of entire machines without losing functionality. You need more than your data existing on media to Continue Business - you and your customers need to be able to GET to it somehow.

    DR = Your building just burned to the ground, taking every single piece of furniture, equipment, paper, and magnetic media inside along with it. Now what?
    Please note that the coolest, slickest, snapshotted NAS with terabytes and terabytes of awesome cheap SATA storage in it is worth exactly JACK in this scenario if it's in the same building as the source material. Offsite backups are not optional, and offsite storage of hard drives isn't exactly the easiest thing to do.

  • Re:Double Duh! (Score:5, Informative)

    by MarkRose (820682) on Friday January 02, 2009 @03:22PM (#26303129) Homepage

    Not quite. Backing up a live database can be a bit tricky. By the time you finish copying part of the database, the first bit can change again. So you have to create a snapshot of some kind. And that has to be supported in the database setup (at the application or server level) in order for the backup to be in a consistent state. And you don't want your backup process to degrade site performance, either. So a simple file copy is totally inadequate.

    A common solution is replication. Backup is then performed by creating a replication point on the slave database machine, then taking a snapshot and copying that while the master database machine continues serving at full speed. Replication can then catch up when the backup is complete. Another advantage of replication is duplication at the machine level: if the master fails, go live on the slave with minimal to no downtime. Set both machines up in a master-master configuration and you can swap back and forth as needed, allowing live maintenance and backup with no performance degradation.
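The point about needing a consistent snapshot rather than a plain file copy can be illustrated with SQLite, used here only as a stand-in since the thread does not say which database was involved. The sketch below does not implement replication; it relies on Python's `sqlite3.Connection.backup`, which wraps SQLite's online backup API, and `snapshot_database` is a hypothetical helper name.

```python
import sqlite3


def snapshot_database(live: sqlite3.Connection, snapshot_path: str) -> None:
    """Copy a live SQLite database to snapshot_path as a consistent copy."""
    target = sqlite3.connect(snapshot_path)
    try:
        # With default arguments the whole database is copied in one step,
        # so the snapshot reflects a single state of the source.
        live.backup(target)
    finally:
        target.close()
```

Writes made to the live database after the call do not appear in the snapshot file, which is what makes it usable as a point-in-time backup. Other database servers offer their own dump or snapshot tools for the same purpose.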

  • by jbezorg (1263978) on Friday January 02, 2009 @03:37PM (#26303299)

    The temptation for "But Macs can't fail!" bashing was strong, but I resisted. It did lead me to a question though. That is: Had they been misled by the Mac culture? Could there have been something in Apple's ads or documentation that would lead them to this mistake?

    The answer? No. At least not from Apple.

    From page 32 of TFM: http://images.apple.com/server/macosx/docs/Server_Administration_v10.5_2nd_Ed.pdf [apple.com]

    Defining Backup and Restore Policies
    All storage systems will fail eventually. Either through equipment wear and tear, accident, or disaster, your data and configuration settings are vulnerable to loss. You should have a plan in place to prevent or minimize your data loss.

  • Re:Double Duh! (Score:5, Informative)

    by MBCook (132727) <foobarsoft@foobarsoft.com> on Friday January 02, 2009 @04:09PM (#26303589) Homepage

    *BZZZZT*

    The GP was 100% correct. If you had kept reading, you'd see that the suggestion was to use replication so you can lock the DB into a consistent state while backing up. When the backup is done, the box starts replicating again. If you didn't have the backup box, you'd have to lock the production database while your backup was going on.

    He was suggesting replication purely as a way to avoid having to pause the application during backup, not as the backup itself.

  • Re:Double Duh! (Score:2, Informative)

    by mksql (452282) on Friday January 02, 2009 @04:12PM (#26303625)

    > Backing up a live database can be a bit tricky.

    Seriously? If your database of choice is a chore to backup while live, you need to rethink your choice.

    Full or incremental backups should be a trivial operation, with support for intra-backup change capture only a little more effort (log shipping, replication, etc.)

    Of all the reasons to lose data, "Backups are hard!" should not be in the list.

  • Re:Double Duh! (Score:3, Informative)

    by mabhatter654 (561290) on Friday January 02, 2009 @04:16PM (#26303701)

    Except if somebody issued a drop table... then the replicants get dropped too... nice faithful mirroring!!! Been there, done that.

    A good solution is to use mirroring like this, and then take the replicant offline to do real, full backups without taking down the production box. Then you have a live copy if drives or processors go bad to bring up immediately, and a backup tape to cover boo boos like this one. I believe that's what parent is getting at.

  • Re:rm -rf / (Score:3, Informative)

    by TheRaven64 (641858) on Friday January 02, 2009 @04:37PM (#26303959) Journal
    What happened to deltree?
  • by DamnStupidElf (649844) <Fingolfin@linuxmail.org> on Friday January 02, 2009 @06:07PM (#26304999)

    ACID compliant databases use a log, much like a filesystem journal, that contains all the changes made to the database before those changes are actually written out to the main database storage. When you back up the raw database, you also back up all the logs from at least the time you started backing up the raw files until the time the backup finished, and when you need to restore the database you put the raw data back and then let the database replay the logs.
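The raw-copy-plus-log-replay idea described above can be sketched with a toy key-value store. This is a simplification: the raw copy here is taken atomically, whereas a real database copies files while writes continue and depends on the log to fix up the resulting inconsistencies. `TinyStore` and `restore` are hypothetical names, not any database's API.

```python
import copy


class TinyStore:
    """Key-value store that logs every change before applying it."""

    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (sequence, key, value)

    def put(self, key, value):
        seq = len(self.log) + 1
        self.log.append((seq, key, value))  # write the log entry first
        self.data[key] = value              # then apply the change

    def raw_backup(self):
        """Copy the raw data and note the log position at that moment."""
        return copy.deepcopy(self.data), len(self.log)


def restore(raw_data, backed_up_seq, log):
    """Rebuild state from a raw copy plus the log entries written after it."""
    data = copy.deepcopy(raw_data)
    for seq, key, value in log:
        if seq > backed_up_seq:
            data[key] = value
    return data
```

Restoring the raw copy and replaying every later log entry reproduces the state at the end of the log, which is why the logs must be backed up along with the raw files.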

  • Hanlon's Razor (Score:1, Informative)

    by Anonymous Coward on Friday January 02, 2009 @09:23PM (#26307429)

    I'm the guy you reach, assuming you have a valid support contract, in situations like this. In real life, I work as a backline support engineer for midrange disk arrays, which range in price from 70K bux to over 500K bux. I've also taken operating system, network and security cases in the past with previous employers. Our team takes these Severity-Cluster F*ck-data loss cases daily. We can salvage the data, sometimes. Frequently, there is nothing that can be done. After that, I write the Root Cause Analysis document and assist with the presentation to the customer's management.

    I can understand where this guy is coming from. This is a potential career limiting move, and no one ever admits they screwed up. When you're truly scared, you lose the capacity for rational analysis, and grasp at straws. Unfortunately, this is when you are most likely to make mistakes.

    A Microsoft filesystem support engineer once gave a really good analogy. Assume the big expensive disk array is a brand new Ferrari sports car. Just because it's really expensive with all the bells and whistles doesn't make it immune to flat tires or rear end collisions, does it?

    Malicious actions *are not* the most likely cause of data loss or corruption. The specific situations below never all occurred in the same data loss event, but I've personally seen each.

    1) The array firmware is over two years out of date, because uptime was so important that maintenance was never scheduled. Same for the host OS and HBA drivers.
    2) Failure to heed a published service alert requiring an upgrade or workaround.
    3) Failure to save the current array configuration information.
    4) The site does not have tape backup, instead the data is remote replicated to a similar array at the DR site.
    4a) Someone convinced management that a disaster recovery site situated below mean sea level in New Orleans is a good idea. Oh, the date is early September 2005, a week after Katrina hit.
    4b) Alternatively, the DR site is in Florida, just after another hurricane. Someone forgot to buy a diesel fuel contract to top off the emergency generators every three days. After a week, the site goes dark.
    4c) The telco routed data center primary and alternate fibre lines through the same physical conduit under the street. The utility crew with a ditch witch severs both. The ditch witch *always* wins! Anyways, your DR site is now out of the picture.
    5) If tape backups are available, the cassettes are stored on top of a cabinet marked "Danger High Voltage".

    6) A minor failure triggers the chain of events. For example, a drive fails and reconstruction to a hotspare drive begins. This is ignored.
    7) The array's "call home for help" feature was never configured.
    8) The array's data scrubbing feature was not active.

    The array continues to operate in a degraded state, but since data availability is maintained, no one notices.
    9) A reconstruction read failure on another spindle (second fault) in the RAID 5 volume occurs, taking the entire volume group offline. This read failure also kills off all other hotspares. This situation would have been prevented with data scrubbing.
    Up to here, support can almost always recover the existing data on the array drives without too much work or restoring from backup.

    10) Someone runs to the array, hears the array alarming and sees the flashing lights. End users are complaining. Management wants something done "right now". The replacement drives are a few hours away. It's time to make a command decision- call for help, sit tight and wait for the cavalry or go into kamikaze sysadmin mode and save the day?
    11) The sysadmin recalls there is another identical array, which is running less important applications. He decides to yank parts from this "installed spare".
    12) The stolen drives have not fixed the problem. A replacement controller or two from the other array didn't either. End users report additional problems with previously unaffected hosts.
    13) Someone finally decided to call support, b

  • by bigtallmofo (695287) * on Friday January 02, 2009 @11:36PM (#26308513)

    The company that runs Journalspace (or used to, anyway) is Lagomorphics. They will host your site for you...

    http://www.lagomorphics.com/hosting/ [lagomorphics.com]

    At Lagomorphics, we're OS X hosting experts. We've been using the Mac mini and Xserve platforms for years, and we're proud to offer you the opportunity to use our colocation facility. Just send us your Mac mini, or let us provide the hardware.
