Why Mirroring Is Not a Backup Solution
Craig writes "Journalspace.com has fallen and can't get up. The post on their site describes how their entire database was overwritten through either some inconceivable OS or application bug, or more likely a malicious act. Regardless of how the data was lost, their undoing appears to have been that they treated drive mirroring as a backup and have now paid the ultimate price for not having point-in-time backups of the data that was their business." The site had been in business since 2002 and had an Alexa traffic rank of 106,881. Quantcast recently put them at 14,000 monthly visitors. No word on how many thousands of bloggers' entire output has evaporated.
DUH! (Score:5, Insightful)
DUH!
Re:DUH! (Score:5, Funny)
Re: (Score:3, Insightful)
We can only hope they remain silent.
Re:DUH! (Score:5, Funny)
Journalspace CTO: We don't need an expensive off-site backup solution b/c we mirror all of our data real-time. It's genius!
-entire database gets overwritten-
Journalspace CTO: Ohhhhhh...now I get it.
Re:DUH! (Score:5, Funny)
Re:DUH! (Score:4, Insightful)
Fixed that for you. ;)
Double Duh! (Score:5, Interesting)
Re:Double Duh! (Score:4, Insightful)
Or attach a 4 TB Drobo to it and then use Time Machine.
Then make a backup and test the restore.
Their admin is criminally incompetent.
Re: (Score:3, Informative)
Correction: all they needed was a large enough, functional, external disk.
Finding functional external drive products isn't so easy, I've discovered.
Re:Double Duh! (Score:5, Informative)
Not quite. Backing up a live database can be a bit tricky. By the time you finish copying part of the database, the first bit can change again. So you have to create a snapshot of some kind. And that has to be supported in the database setup (at the application or server level) in order for the backup to be in a consistent state. And you don't want your backup process to degrade site performance, either. So a simple file copy is totally inadequate.
A common solution is replication. Backup is then performed by creating a replication point on the slave database machine, then taking a snapshot and copying that while the master database machine continues serving at full speed. Replication can then catch up when the backup is complete. Another advantage to having replication is duplication on the machine level -- if the master fails, go live to the slave with minimal to no downtime. Set both machines up in a master-master configuration and you can swap back and forth as needed, allowing live maintenance and backup with no performance degradation.
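As a rough sketch of that pattern -- assuming MySQL master-slave replication, with made-up paths, run on the slave -- pause the slave's SQL thread, dump at leisure, then let replication catch up:

#!/bin/sh
# Runs on the slave; the master keeps serving at full speed throughout.
set -e
mysql -e "STOP SLAVE SQL_THREAD;"    # freeze the slave at a consistent point
mysqldump --all-databases --single-transaction | \
    gzip > /backup/db-$(date +%Y%m%d).sql.gz
mysql -e "START SLAVE SQL_THREAD;"   # replication now catches up on its own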
Re: (Score:3, Informative)
Except if somebody issues a DROP TABLE... then the replicas get dropped too... nice faithful mirroring!!! Been there, done that.
A good solution is to use mirroring like this, and then take the replica offline to do real, full backups without taking down the production box. Then you have a live copy to bring up immediately if drives or processors go bad, and a backup tape to cover boo-boos like this one. I believe that's what the parent is getting at.
Archive/redo logs too (Score:4, Informative)
ACID-compliant databases use a log, much like a filesystem journal, that contains all the changes made to the database before those changes are actually written out to the main database storage. When you back up the raw database files, you also back up all the logs from at least the time you started backing up the raw files until the time the backup finished; when you need to restore the database, you put the raw data back and then let the database replay the logs.
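PostgreSQL is one concrete example of the scheme (paths here are made up); a hot copy of the raw files is only consistent because the archived WAL -- its journal -- gets replayed on restore:

#!/bin/sh
# Assumes postgresql.conf already archives the journal:
#   archive_command = 'cp %p /backup/wal/%f'
set -e
psql -c "SELECT pg_start_backup('nightly');"   # checkpoint, mark backup start
tar -czf /backup/base-$(date +%Y%m%d).tar.gz /var/lib/pgsql/data
psql -c "SELECT pg_stop_backup();"             # mark end, archive final WAL segment
# To restore: unpack the tarball, point recovery.conf at the archive with
#   restore_command = 'cp /backup/wal/%f %p'
# and start the server; it replays the logs up to a consistent state.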
Re:Double Duh! (Score:5, Informative)
*BZZZZT*
The GP was 100% correct. If you had kept reading, you'd see that the suggestion was to use replication so you can lock the DB into a consistent state while backing up. When the backup is done, the box starts replicating again. If you didn't have the backup box, you'd have to lock the production database while your backup was going on.
He was suggesting replication purely as a way to avoid having to pause the application during backup, not as the backup itself.
Re: (Score:3, Insightful)
Re:DUH! (Score:5, Funny)
What about archive.org?
Ah, apparently not... [archive.org] :-D
Re:DUH! (Score:5, Funny)
Again a frost post to a red story (Score:5, Funny)
While this mirrors previous comments, it's not really a backup solution.
When is backing up *not* an option? (Score:5, Interesting)
Mirroring, RAID, grid, whatever. At some point, you want your data safe and secure on something not physically attached to any power source.
Re:When is backing up *not* an option? (Score:5, Insightful)
This is the minimum, people. Come on!
Re: (Score:3, Informative)
Backing up nightly to a large mirrored NAS, plus a periodic copy to a removable device, seems like a good way to go these days. I haven't used tapes for years.
Re:When is backing up *not* an option? (Score:4, Insightful)
Re: (Score:3, Informative)
Amen, and why I just love ZFS (or any filesystem that supports instant snapshots). I use mirroring to cover drive failures, and I use weekly snapshots as backups. Once every three months, I copy everything to an offline external disk. Minimum cost, more than reasonable protection.
Re: (Score:3, Interesting)
that's where trixter needs to zfs send/recv the snapshots to an offsite location (and probably roll snapshots more frequently)
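The incremental send/recv is a one-liner; the pool, dataset, and host names below are made up:

# weekly snapshot, then ship only the delta since last week offsite
zfs snapshot tank/data@2009-01-05
zfs send -i tank/data@2008-12-29 tank/data@2009-01-05 | \
    ssh backuphost zfs recv backup/data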
Re:When is backing up *not* an option? (Score:5, Insightful)
That's not my company's policy, that's *my* policy. I can take a 3-month hit to my personal data. AND YET MY LAX PERSONAL POLICY WOULD HAVE SAVED JOURNALSPACE.
My *company's* policy is daily offsiting. Expensive, but very many of our locations could become a smoking hole in the ground and we'd still be able to restore and operate.
Re:When is backing up *not* an option? (Score:4, Interesting)
Nope.
Mirrors are fine, just snapshot them and store them offsite regularly. Do delta backups as needed but close-in for fast restoration.
There is no rational justification for tape anymore, what with the cost per TB stored on hard disks now under $130, total. Random accessibility, unless you're stalling a subpoena, is just mandatory on backup media.
Re:When is backing up *not* an option? (Score:5, Insightful)
Even accepting your price, that's about 12.7 cents per gigabyte, while you can get 800GB native LTO-4 tapes for about $50, which comes out to about 6.3 cents per gigabyte.
But quoting costs for desktop grade SATA drives severely understates the true cost. For any non-trivial site installation you're talking near-line rated drives, drive caddies, storage shelves and additional SAN fabric. Then price out the additional power, cooling and rack space. Then price offsite shipping and storage for the bulkier, heavier and more delicate disk option.
Mirroring has its place. Snapshotting has its place. And backups to stable media still have their place, too.
Re:When is backing up *not* an option? (Score:5, Insightful)
Fine. Get the cartridges, but what about the capital cost minus depreciation of the drive? What about random access?
Random access is why snapshots also have their place. :) Archival backups and nearline backups solve different sets of problems.
Now weigh those against an inexpensive JBOD frame with a 2Gb FC backplane.
What kind of capacity are we talking about? For a small site you can pick up a little 2U unit that'll store 6.4TB uncompressed for under $5k. Or if you're running a larger site you can snag a 4U unit with two drives for about $15k that'll handle 30.4TB, with optional expansion to 60.8TB native.
What's the write speed of LTO vs a tasty little SAS drive?
120MB/sec per drive without compression. And now that you're talking about SAS drives, your per-TB cost is hopelessly optimistic. Even OEM-packaged terabyte SAS drives are going to run you about a quarter per gigabyte, which is four times the media cost of an LTO-4 solution.
Rackspace? You can put a dozen into about 4U.
So about 12TB in 4U compared to the 30TB unit I mention above.
Cooling? Although I'll grant you the green cost, random accessibility dramatically outclasses tape's seek time plus the cost of a human inserting tapes.
Have you never heard of a tape library?
Stable media? Tape? Sometimes.
Properly handled tape is incredibly stable.
Shelf space?
If you're doing off-site storage, that's going to be an issue regardless of what media you're using. And as I pointed out, tape is far more compact and far lighter than disks.
No need to use tape anymore. Get out of the reality distortion field, but do the right thing by testing what you have and running drills to ensure that whatever you have works and that the procedure is understood by all.
I'm not the one dismissing an entire class of technology while demonstrating ignorance of its costs and benefits.
Re: (Score:3, Insightful)
I'm not sure what planet you're on, but I wish the rest of us were there with you.
Backup media should be, and must be, transported offsite every freakin' day. You'd do that with a hard disk? Or, more correctly, you'd do that with a STACK of hard disks? Or is your building fire-, flood- (including broken sprinkler pipes), gas-leak-, and drunken-truck-driver-proof?
Re: (Score:3, Insightful)
Can you restore a RAID with different hardware? With LTO-3 tape I have several drive choices... notably, I can buy a NEW drive and know the tape will work even 3-4 years out. What happens when the maker of your RAID solution moves on and wants to send you next year's model? Will the encryption and striping still line up on different hardware made by a different company?
Re: (Score:3)
Can a stored disk drive sit on a shelf for a year or two, get tussled about, and still work? Tapes have nearly all the moving parts external and replaceable -- in the drive, not the media. A cardboard box of backup tapes will survive storage pretty well... a box of disk drives, not so much.
Too many shops think HA==DR (Score:5, Informative)
It's more an issue that some people think that HA == DR.. which obviously this story reminds us that it is not the same thing.
Mirroring / RAID == HA.. if one of your HDDs lets the smoke out, you still don't incur downtime. If you have a hot-spare, you're even better off.. all it does is buy you a little time to correct the issue (ie: "It can wait until morning").
Also, one other very important thing.. mirroring doesn't prevent or undo data corruption. If you're mirroring your rm -rf (as pointed out by Corsec67 below), your RAID will happily do what it does.. and apply your command to all your disks.... Congrats, you just gave yourself highly available disk erasing! :]
Backups are DR.. if your RAID croaks, you're SOL if you don't have off-machine backups. If you accidentally nuke your disks with an rm or something, you can still go back and restore data.. sure, you'll likely lose -some- data, but -some- is better than all in this case.
Re:Too many shops think HA==DR (Score:5, Informative)
DR is Disaster Recovery
HA is High Availability
Re:Too many shops think HA==DR (Score:5, Funny)
I tried Googling, but the only results I got were a medical office in Chinatown.....
Re: (Score:3, Informative)
People who care about their data and their business know what they mean.
Although, at my particular shop, we use the term "BC" instead of "HA".
BC = Business Continuance (HA = High Availability)
DR = Disaster Recovery
BC = "Looks like we just lost a drive in the array. Better replace that right away." or "Oops, broke one of the multiple fibers to the SAN. Where's the spare again?"
BC also applies to our load-balanced clusters of web servers and application servers that allow for the offlining or loss of entir
Dear Every Corporate Tool in the Universe: (Score:5, Insightful)
Re:Dear Every Corporate Tool in the Universe: (Score:5, Insightful)
And that's why your IT department actually needs funding. Sleep tight.
They've had the site live for 6 years.
This wasn't a lack of funding, it was just sheer stupidity.
6 years and nobody ever thought it'd be a good idea to back everything up to DVD or an external hard drive. HTML compresses really well, in case they didn't know.
Re:Dear Every Corporate Tool in the Universe: (Score:5, Insightful)
Re:Dear Every Corporate Tool in the Universe: (Score:5, Insightful)
Being too stupid to recognize your own shortcomings is also a form of stupidity. Or hubris, whichever is more appropriate.
Re: (Score:3, Funny)
Pay the salary of someone smart enough to handle your data correctly if you have no interest in becoming smart yourself.
The first step is admitting you are stupid. That is hard for most people. Of course today they are having NO trouble making that cognitive leap...
Re:Dear Every Corporate Tool in the Universe: (Score:4, Insightful)
Never underestimate the beancounter's desire to save every cent possible. If your site's working perfectly fine, well, what's the point of having backups? Seriously, I see this happen all the time with small businesses. "Oh, it's never failed before, why do we need backups?" Then the server implodes.
Course, they then get pissed at us for not preventing it, but what do they expect us to do, shell out for a tape drive with our own cash? I think not.
Re: (Score:3, Interesting)
Never underestimate the beancounter's desire to save every cent possible.
That's contrary to my experience. Other expenses have been skimped on occasionally, but just mention the word "backup" and the funding was there.
Re:Dear Every Corporate Tool in the Universe: (Score:5, Funny)
Re:Dear Every Corporate Tool in the Universe: (Score:5, Funny)
Re:Dear Every Corporate Tool in the Universe: (Score:5, Insightful)
Hell, they could have spent $50 on a USB hard drive (i.e., half-assed it) and been better off!
Re:Dear Every Corporate Tool in the Universe: (Score:4, Insightful)
A USB drive is an excellent non-archival backup. Two or more in rotation is even better. That plus a decent RAID for the primary storage will cover most data losses. Even better if the drive goes home with the admin at night.
Re: (Score:3, Insightful)
Don't send tapes home with admins... send them to the bank to be put into a safety deposit box with the day's checks if you have to. Admins don't want tapes in their homes; it's a corporate security risk, and the admin WILL forget to bring some back, because one or two is no big deal... until they're not at your company anymore. I know I wouldn't do that, because I wouldn't want to be the guy whose laptop bag gets ripped off with customer data on tapes inside it. It's just bad mojo waiting to happen.
Treat data
Re: (Score:3, Insightful)
If so, he can do so anyway.
Re:Dear Every Corporate Tool in the Universe: (Score:5, Funny)
Screw that!! IT Departments are cost centers and have absolutely no benefit to the bottom line of a company... none at all... nope.
Re:Dear Every Corporate Tool in the Universe: (Score:4, Funny)
rm -rf / (Score:5, Informative)
That is one reason why mirroring isn't a backup, and why backups should ideally be off-line.
Re:rm -rf / (Score:5, Funny)
C:\>rm -rf /
'rm' is not recognized as an internal or external command,
operable program or batch file.
Everything's still running here...
Re: (Score:3, Funny)
Re: (Score:3, Funny)
Ha! I store all my data in directory names!
Re: (Score:3, Funny)
$ del C:\*.* /s /q /y
-bash: del: command not found
Guys...
Re: (Score:3, Informative)
Re: (Score:3, Funny)
Operators make mistakes... and the more highly they're paid, the more likely they are to make one. Take the manager covering for somebody on vacation... they jump in and mistype something... boom! Data gone. Happens to the best of us... the really good ones have safeguards in place to make sure mistakes can be undone.
How many AS/400 operators typed PWRDWNSYS *IMMED and got a surprise!
Ouch (Score:4, Informative)
Re:Ouch (Score:5, Insightful)
Or even one, stale, backup.
Re:Ouch (Score:4, Insightful)
This story put the fear of god into me. The first thing I did after reading it was back up the website I admin (for my dad) locally. I'd always assumed our host would have good backups, but that seems naïve now.
Re: (Score:3, Interesting)
Having worked at several hosting places, I would say you are correct. Never trust a hosting service's backup. I always told our customers to never trust our backup. Sometimes backups just never happened. They are not high on the list of things to keep working.
You need more than backups ... (Score:5, Insightful)
Re:You need more than backups ... (Score:4, Informative)
Backups must be:
1) Automated - if you need human intervention, it will fail
2) Point-in-time - the system must be able to provide restores for a set of times, as fits the turnaround needs of your data. A good default is: daily backups for a week, weekly for a month, and monthly for a year (a sketch of this rotation follows below)
3) TESTED: You must fully test the restoration process (if this can be automated, even better). Backups that you can't restore from a bare machine are worthless.
For better disaster recovery, backups should be:
4) offsite - if a fire or tornado hits, is the backup somewhere else?
5) easily accessible - how long will it take to get the restore going?
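Here's a minimal sketch of rule 2's rotation -- the paths are made up, the dump command assumes MySQL, and it still needs rules 3-5 (offsiting, accessibility, and above all restore testing) layered on top:

#!/bin/sh
# backup-rotate.sh -- run daily from cron
set -e
DEST=/backup
STAMP=$(date +%Y%m%d)
mkdir -p "$DEST/daily" "$DEST/weekly" "$DEST/monthly"
mysqldump --all-databases --single-transaction | \
    gzip > "$DEST/daily/db-$STAMP.sql.gz"
# Promote a copy on Sundays, and another on the 1st of the month.
if [ "$(date +%u)" = 7 ]; then cp "$DEST/daily/db-$STAMP.sql.gz" "$DEST/weekly/"; fi
if [ "$(date +%d)" = 01 ]; then cp "$DEST/daily/db-$STAMP.sql.gz" "$DEST/monthly/"; fi
# Expire: keep a week of dailies, a month of weeklies, a year of monthlies.
find "$DEST/daily"   -name 'db-*.sql.gz' -mtime +7   -delete
find "$DEST/weekly"  -name 'db-*.sql.gz' -mtime +35  -delete
find "$DEST/monthly" -name 'db-*.sql.gz' -mtime +400 -delete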
Re: (Score:3, Interesting)
The best way I have found to test the backup is to nuke the data and restore.
Seriously, if you know what files store the data (and that you are backing up), just stop services and rename a directory or two so the data is "gone". Then, restore from backup, start the service, and see how things look. Another good way is to restore the data to a VM that runs the same software as the production server. You can sandbox a simulation of the entire Internet inside a few VMs if you want, and test what happens.
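A minimal sketch of such a drill -- the service name and paths are hypothetical; the point is that the restore path gets exercised, not the specific commands:

#!/bin/sh
# restore-drill.sh -- practice losing the data and getting it back
set -e
service myapp stop
mv /srv/myapp/data /srv/myapp/data.hidden    # the data is now "gone"
tar -C /srv/myapp -xzf /backup/myapp-latest.tar.gz data
service myapp start
echo "Check the app now; keep data.hidden around until you're satisfied."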
Excellent! (Score:5, Funny)
Re:Excellent! (Score:5, Funny)
Ironically, it's more useful than the entire collection of blogs that they stored.
Mirroring is not intended as a data backup (Score:3, Informative)
That's what backups are for (Score:5, Interesting)
It's really unfortunate that this happened. If they had simply had a backup snapshot of the DB, they could have restored it. RAID only saves you from disk failures; it doesn't protect against OS or user failures.
Unfortunately this is the kind of thing you tend to learn from experience (either yours or someone else's). It's very easy to think "RAID 1 = disks are safe".
Likewise, a database cluster wouldn't have saved them. A clustered database can save you from load, or you can swap servers if a disk goes bad. But when someone issues "DELETE FROM...", the other cluster nodes happily run the same thing, and now you have 2 (or 3, or 10, or...) empty database boxes.
I hope those bloggers had a backup of some sort of their own.
Re:That's what backups are for (Score:4, Interesting)
Ah, it totally depends on the type of database cluster. For example, with Oracle, if you're using Oracle DataGuard, even in synchronous replication mode you can define an "apply delay" - basically, "Don't acknowledge this commit until it is written locally, and copied and acknowledged on the remote side, but don't actually apply the transaction for two hours"
That way, if someone does a delete from blogs;, it will be reflected immediately on the primary, but you've got a nice window to sort it out.
Plus, if you've got database flashback turned on, you can simply say, "Flash my database back to what it looked like before someone was an idiot", and all your data comes back.
These features are expensive in Oracle, but they can be very useful when you actually need them.
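Roughly what those two knobs look like from SQL*Plus -- a sketch matching the two-hour example above, not a runbook; the first statement runs on the standby, the flashback runs on the database being rewound, and flashback logging must already be enabled:

sqlplus / as sysdba <<'EOF'
-- Standby: apply redo two hours behind the primary (DELAY is in minutes)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DELAY 120 DISCONNECT;
EOF

sqlplus / as sysdba <<'EOF'
-- After someone was an idiot: rewind the whole database
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO TIMESTAMP SYSTIMESTAMP - INTERVAL '2' HOUR;
ALTER DATABASE OPEN RESETLOGS;
EOF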
Re:That's what backups are for (Score:5, Insightful)
My guess (and this is a guess; I'd never heard of the site before yesterday) is that this is some guy who started his own little site and it got bigger and bigger. Basically he never designed the backup; the system was just slowly pieced together until it reached its current state.
The comments in the messages from the site's operator about the cost of the drive recovery, and about thinking both drives just died at once, indicate to me that this site was basically a hobby for him and he isn't experienced as an admin.
El Oh El (Score:4, Insightful)
That's all I can say to this. I'm really surprised that, with all the users they had, they are so quick to say "everything is gone and we're giving up" instead of just starting over and maybe implementing procedures to make sure this doesn't happen again.
Re:El Oh El (Score:5, Insightful)
Considering how complete and unrecoverable the loss is, they have no idea who their users are. The accounts would have to be recreated from scratch, but who would try? Their users have no reason to ever trust them again. Journalspace would have a difficult time wooing back their original users, and no new user would seriously consider using them.
Bowing out is the only recourse, but I'm glad they're considering releasing their source code.
Re:El Oh El (Score:4, Funny)
Thank you (Score:3, Funny)
Inconceivable? (Score:4, Funny)
I do not think it means what you think it means.
How hard is it to remember: (Score:5, Insightful)
Mirroring: High availability
Backups: High reliability
The rules of backups (Score:5, Informative)
The rules of backups:
1. Backup all your data
2. Backup frequently
3. Take some backups off-site
4. Keep some old backups
5. Test your backups
6. Secure your backups
7. Perform integrity checking
Re: (Score:3, Insightful)
1. Backup all your data
2. Test your backups
3. Backup frequently
4. Test your backups
5. Take some backups off-site
6. Test your backups
7. Keep some old backups
8. Test your backups
9. Secure your backups
10. Test your backups
11. Perform integrity checking
12. Test your backups
Every company I've worked at has had a backup plan. Exactly zero have had a recovery plan.
Only 2 drives? (Score:4, Insightful)
BUT, according to the site "the server which held the journalspace data had two large drives in a RAID configuration". Only TWO drives.
All they had to do was pull one of the drives, replace it, and lock up the original off site. In a couple of hours the drives would have been mirrored again.
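With Linux md RAID-1 that rotation is three commands (illustrative only -- the box in question ran OS X, and the device names here are made up):

mdadm /dev/md0 --fail /dev/sdb1      # drop one half of the mirror
mdadm /dev/md0 --remove /dev/sdb1
# ...lock that disk up off-site, slot in a blank one...
mdadm /dev/md0 --add /dev/sdc1       # resyncs the mirror onto the new disk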
To the HR department (Score:5, Funny)
Re:To the HR department (Score:4, Insightful)
The only problem with that idea is that it may not have been the IT guy's decision to save money by not having a true backup system. I have seen companies skimp on backup systems because they thought their RAID system was enough.
A lesson for admins, and users too (Score:5, Insightful)
No doubt this incident is the admin's fault. He confused mirroring with backup and carried the mistake along until it was too late, as pointed out in other comments.
Now what about the user's angle? The moral is that you can never assume your data is safe just because it's "in the cloud". If you value your blog and your readers, you *should* save a copy of your work, as well as the readers' info, *locally*, somewhere you have control over.
There's no place like $HOME.
Re:A lesson for admins, and users too (Score:5, Insightful)
And a corollary to the parent's good advice: if you can't easily get a complete copy of your work, find another host. Manual one-by-one downloads don't cut it.
No Archive.org either (Score:5, Informative)
They also purposely blocked archive.org via a robots.txt exclusion, so the bloggers can't use that to try to recover some of their blogs.
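For reference, keeping the Wayback Machine out takes only two lines of robots.txt, since archive.org's crawler identifies itself as ia_archiver:

User-agent: ia_archiver
Disallow: /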
Google cache diving (Score:5, Informative)
Looks like at least some content is still in Google's cache [google.com]; those looking to salvage their journals should act quickly.
You can limit google's search results to a particular site by using the "site:domainname.com" search term (example [google.com]) and then click the "Cached" [209.85.173.132] links of each result to see Google's copy.
There's also a Greasemonkey script [userscripts.org] for Firefox that can automatically add Google Cache links next to page links, so you can navigate from one cached page to another more easily.
Re:No Archive.org either-Compound Foolishness (Score:3, Interesting)
This is just compound foolishness. I gather they did it in an attempt to control bandwidth costs since it's hard to imagine any other reason.
There is a denial going on (Score:5, Insightful)
In today's world, where primary storage and protection storage are well-defined and an entire industry has grown up around them (examples: NetApp, Data Domain), one is hard-pressed to understand the reason for such a debacle. Reading the note referred to in the article [journalspace.com] leads me to believe, unfortunately, that Journalspace's IT department did not understand the difference.
It is sometimes considered bad form to say something bad about fellow techies. We prefer to look for 'outside' causes. Still, to learn and avoid the same problems in the future, one has to admit one's mistakes first. This paragraph from Journalspace's page:
The value of such a setup is that if one drive fails, the server keeps running, using the remaining drive. Since the remaining drive has a copy of the data on the other drive, the data is intact. The administrator simply replaces the drive that's gone bad, and the server is back to operating with two redundant drives.
makes me believe there is a denial going on.
Re: (Score:3, Insightful)
Yeah, right. If there's anything professionals love to do, it's talk trash about their peers. What's the first thing a computer guy says when you bring him in to fix a broken system? "My god, what idiot spec'd/built/installed/configured this piece of garbage? It's a miracle it ever worked at all!" Ditto every other kind of professional, from plumber to surgeon to architect to accountant.
(As such a professional, I often discover
Someone needs to be FIRED (Score:4, Funny)
Mirroring (Score:5, Insightful)
Re:Mirroring (Score:5, Funny)
There's a major flaw in your analogy. See, if I stick a fork in my right eye, the mirror image will stick a fork in his left eye. Between the two of us, however, we still have one good left AND right eye. So ipso fatso, I have a complete backup.
Personal backups of online data (Score:5, Insightful)
Do the big kahunas of the "Web 2.0" world give users that option? Gmail, Myspace, Facebook, Twitter etcetera ad nauseam?
OS X Server (Score:3, Interesting)
The site was run on OS X Server... I think this may be indicative of the level of IT effort within the company. Look, *I* run an OS X Server... but *I* am a Biology major who knows approximately dick about the UNIX command line, and I use it to run a server that I probably wouldn't be able to run any other way. I also have it back up nightly to a cheap NAS, archiving old backups, and I've tested a restore to make sure it works.
This is probably just a couple guys who ran a website in their spare time... not a huge IT effort that failed.
Darwin awards (Score:4, Funny)
Re:stunned silence (Score:5, Funny)
I am experiencing a strange phenomenon. The jaw-drop reflex has been popping my mouth open for several minutes and won't stop. If I focus I can close it, but then it pops open again. wow.
Re:Noobs. No, really. (Score:5, Informative)
Even the greenest IT employee knows that mirroring is to protect against hard drive failure and not software corruption.
I only wish that were true. I've given up arguing with friends about this, who insist that their mirrors are good enough backups. I just stare at colleagues who think so, especially those who SHOULD know better. And I *know* coworkers are doing this @ work, too, and I'm just waiting for about 50TB of data to suddenly go missing...
Re: (Score:3, Informative)
The article says the data recovery company has found the drives wiped. There is no recoverable data.
It seems like the actual site failure was on the 23rd or so.
IMarv
PS, the internet archive was blocked by their robots, so there isn't even that to look at. http://web.archive.org/web/*/http://www.journalspace.com [archive.org]
Re: (Score:3, Informative)
Since you're the only poster to reply without yelling "idiot" (thanks, btw) -- zeroing the drive makes software recovery impossible. It doesn't make data recovery impossible. There are ways to read the offset data, though this is getting harder as magnetic densities increase every year. Ontrack data recovery specializes in that kind of thing. I've seen them do it. Granted, it's not a 100% thing -- you don't get back something that even resembles a filesystem. At least a third of it is uselessly garbled binary
Re: (Score:3, Interesting)
That's bullshit, and has been for decades.
It's a myth. Just read up on it. Even with our newest AFM or XMCD microscopy, you won't see an overwritten byte on any drive from the last 5 years. And even the last decade is very doubtful (basically, ever since GMR drives have been around).
There IS NO SPACE between tracks anymore. Bits are right next to each other. If you overwrite, nothing above the superparamagnetic limit is left.
Not even the NSA could get anything useful out of a single overwrite with zeros (well, excep
Re: (Score:3, Informative)
Adding the OS X comment, and claiming that a bug in their code is impossible, is even lamer.
The drives were overwritten sector by sector on a machine that didn't have any of their code running on it. Their application couldn't have done it because it couldn't execute arbitrary code on that server. The "impossible" comment makes sense to me.
As for it being lame/unprofessional to name the possibilities, I disagree. He states the OS it was running on and said that it was either an OS problem or sabotage. There might be a few possibilities, but that about sums them up right there. He was being thorough
Re: (Score:3, Interesting)