Tech Magazine Loses June Issue, No Backup
Gareth writes "Business 2.0, a magazine published by Time, has been warning its readers against the hazards of not backing up computer files. So much so that in a 2003 article it 'likened backups to flossing — everyone knows it's important, but few devote enough thought or energy to it.' Last week, Business 2.0 got caught forgetting to floss when the magazine's editorial system crashed, wiping out all the work that had been done on its June issue. The backup server failed to back up."
We've all been there. Don't be too pious, here. (Score:3, Insightful)
err... (Score:3, Insightful)
*coughs
TFA:
sorry, their MAIN problem is not in any way a dysfunctional backup system. ever heard of verifying backed-up data?
They probably still have most of it (Score:5, Insightful)
What was the nature of the crash? (Score:2, Insightful)
But whatever the case - there is a useful lesson here. Make sure your backups are backing something up.
High profile SNAFUs (Score:5, Insightful)
I don't know about you people, but after reading this (and giving it the "haha" tag) I'm going home and catching up on a couple of backups I've been slacking off on for a while.
Re:What was the nature of the crash? (Score:3, Insightful)
Usually that is the case, but it has happened to me: one of my backups failed one night and someone needed a file restored from the previous day. If that company never checked its backups, or never configured some kind of notification on failure or success, then they are very lame.
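That kind of notification is easy to wire up. A minimal sketch, assuming a hypothetical rsync-based backup job and a pluggable notify callback (the command, paths, and reporting channel are all placeholders; in practice `notify` might send mail or page the on-call admin):

```python
import subprocess

# Placeholder backup command -- substitute your real job.
BACKUP_CMD = ["rsync", "-a", "/data/", "/mnt/backup/data/"]

def run_backup(cmd=None, notify=print):
    """Run the backup and always report the outcome, success or failure.
    `notify` takes a single status string; swap in smtplib or a pager."""
    result = subprocess.run(cmd or BACKUP_CMD, capture_output=True, text=True)
    status = "SUCCESS" if result.returncode == 0 else "FAILURE"
    notify(f"backup {status}: {result.stderr.strip() or 'no errors reported'}")
    return result.returncode == 0
```

The point is that the job reports on every run, success included, so silence itself becomes an alarm.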
Re:After the swearing stopped. (Score:5, Insightful)
honestly, they CAN'T have competent IT. The FIRST thing you do in the morning is check the backups.
I have an HP DAT jukebox here and I STILL check the backup logs to make sure the backup and verify succeeded last night. If they didn't, I mirror the important files right away and then run a manual backup so as not to lose the last 24 hours of data.
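That morning check can be scripted. A minimal sketch, assuming the backup software writes ISO-dated log lines such as `2007-05-01 02:40 verify succeeded` (the date format and success phrases are assumptions; adapt the pattern to your software's actual output):

```python
import datetime
import re

# Assumed success phrases -- match these to your backup software's logs.
SUCCESS_RE = re.compile(r"\b(backup|verify)\b.*\b(completed|succeeded)\b",
                        re.IGNORECASE)

def last_night_ok(log_text, today=None):
    """True only if both a backup pass and a verify pass succeeded on
    lines dated today or yesterday (overnight jobs span midnight)."""
    today = today or datetime.date.today()
    recent = {today.isoformat(),
              (today - datetime.timedelta(days=1)).isoformat()}
    seen = set()
    for line in log_text.splitlines():
        if line[:10] in recent:  # assumes each line starts "YYYY-MM-DD"
            m = SUCCESS_RE.search(line)
            if m:
                seen.add(m.group(1).lower())
    return {"backup", "verify"} <= seen
```

Requiring both the backup and the verify lines is the key: a backup that ran but never verified counts as a failure.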
I hope Business 2.0 learned that paying top $$$ for competent IT is a good idea, and they should run an article about it.
Wrong problem (Score:5, Insightful)
The problem was, as always, not the backup. I've rarely seen problems resulting from the backup process. The troublesome process is the restore. Or as a friend once put it:
Nobody wants backups, what everybody wants is a restore.
In my twenty years in IT I've seen several companies making backups like a well-oiled machine. The backup process was well documented and everyone was trained to the degree that they could do it with their eyes closed. But everything fell apart in the critical moment, because all they had planned for was making the backup. Nobody ever rehearsed a restore at full scale. So they ended up with a big stack of tapes full of unusable data.
Backup is the means, not the goal.
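A small-scale restore drill can be scripted. Here's a minimal sketch assuming plain tar archives (a real drill should use your actual backup software and a scratch server, not just a temp directory); the point is that success means the restored tree matches the original byte for byte, not that the backup command exited zero:

```python
import hashlib
import pathlib
import tarfile
import tempfile

def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = pathlib.Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def restore_drill(source, archive):
    """Back up `source` to `archive`, restore into a scratch directory,
    and verify the restored tree against the original."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)  # archive we just wrote, so trusted
        return checksums(source) == checksums(scratch)
```

If this ever returns False, you've found out on a quiet Tuesday instead of during the disaster.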
Regards, Martin
Re:How does this actually happen? (Score:4, Insightful)
Re:After the swearing stopped. (Score:5, Insightful)
Not a week goes by without me getting an issue from one of our regular analysts with a question about how the customer can salvage their data because they don't have a backup. My standard answer is that we may be able to save some data, but it's going to cost a lot of $$$. And I also say: "When you don't have a backup, you have either decided that you can easily recreate the data or that it is not important to the company."
And these are not mom & pop companies but big multi-million/billion dollar companies.
Re:How does this actually happen? (Score:3, Insightful)
Perhaps, although my experience is that IT people are incredibly bad at framing business cases in terms more compelling than my daughter's request for a mobile phone for her birthday: a few vague reasons, followed by a sulk when asked for specifics.
I keep 20TB on RAID5, and replicate it daily to a RAID5 array that has no components or software above the spindle level in common (Solaris/EMC and Pillar Data). The data we really care about is on RAID 0+1, in some cases with three-way mirroring. We take it out to tape, in case the filesystem pukes over all the copies or the RAID controller decides to go bonkers. We're about to put ten miles between the two file servers. At no point have I had much pushback from management over the money, once the risks and rewards are explained. Too often, IT people convince themselves that some Dilbert-esque stereotype of a manager is going to say no, and therefore make their case in a passive-aggressive style that will make anyone say no.
ian
Re:After the swearing stopped. (Score:4, Insightful)
HP DAT? You'd better do more than check the logs. A test restore (if your users don't already test for you by deleting files) at least a few times a week might save your butt one day. Actually DAT or not, test restores are a must. Logs lie.
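One way to script that spot check, assuming the backup is a browsable on-disk mirror (the paths here are hypothetical; with tape you would first restore the sampled files through your backup tool, then compare):

```python
import hashlib
import pathlib
import random

def spot_check(live, mirror, samples=5):
    """Compare a few randomly chosen live files against the backup mirror.
    Returns the relative paths that are missing or differ; empty means OK."""
    live, mirror = pathlib.Path(live), pathlib.Path(mirror)
    files = [p for p in live.rglob("*") if p.is_file()]
    bad = []
    for p in random.sample(files, min(samples, len(files))):
        rel = p.relative_to(live)
        twin = mirror / rel
        if not twin.is_file() or (
            hashlib.sha256(p.read_bytes()).digest()
            != hashlib.sha256(twin.read_bytes()).digest()
        ):
            bad.append(str(rel))
    return bad
```

Sampling different random files each day is exactly the "not the same stuff each time" discipline: the logs can lie, but a byte-for-byte comparison can't.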
Re:Rag (Score:5, Insightful)
I wish. I wish people didn't read Time, either (the publisher), but they do. Time's writing style is the dumbed down, try-to-be-hip crap I wouldn't have gotten away with in sixth grade. Seriously. Like I said before [slashdot.org], to understand why its writing is like fingernails on a blackboard for me, consider how the same information would be conveyed by two sources:
8-year-old: "6 divided by 3 is 2."
Time magazine: "Okay, imagine you've got a half-dozen widgets, churned out of the ol' Widget Factory on Fifth and Main. Now, say you've gotta divvy 'em up into little chunklets -- a doable three, let's say -- and each chunklet has the same number that math professor Gregory Beckens at Overinflated Ego University calls a 'quotient'. The so-called 'quotient' in this case? Dos."
Based on how that post got modded, I'm not alone in this.
Re:High profile SNAFUs (Score:5, Insightful)
We can't backup, its too expensive. (Score:5, Insightful)
Re:We can't backup, its too expensive. (Score:1, Insightful)
Re:After the swearing stopped. (Score:3, Insightful)
First, no one really understands best practices for backup, and a lot of systems that are backed up "successfully" can't be restored anyway (most commonly Microsoft Exchange, the most important system in most companies!). Second, tape sucks! You MUST have disk-to-disk backups to have any true recoverability in today's world. Third, check your logs EVERY day; there's no excuse! Fixing a failing backup should be the number 1 priority, second only to an actual failed server you are recovering. Next, nobody spends enough on IT disaster recovery, and no one documents the recovery process properly. Your DR spending should be approximately 25% or more of your total IT budget for server systems, and at least one day per month should go to practicing system recovery or updating the documentation that covers it. Next, nothing should ever be considered backed up until the server has been test-recovered, completely from scratch, at least once. Some data should be recovered from backup media every day just to be certain it can be done when needed, and the test recovery should be of a random critical data folder or database, not the same stuff each time.
Off-site DR is also important. Making sure that your entire data set for all critical systems is moved off site every 24 hours is a must. Included in this should be any media required to process a restore: not just the backups, but the install CDs, bare-metal recovery disks, license keys for all servers and applications, the DR documentation itself, network architecture information, the hardware and software configuration of each server, and all information regarding your ISP contract and system warranties from each manufacturer. If you don't have all this, contract someone who knows what they are doing to assemble it for you.
For each unique mission-critical system you have (mail, a critical database server that allows the business to operate, a point application server, a Citrix box, etc.) you should have a complete spare system meeting the system requirements, so that system can be restored immediately in the event of an outage. Your system recovery tests should be performed regularly on that hardware. Best practice is also to keep those test boxes off-site when possible, but near enough to get to in a jiffy. If you don't have spare lab equipment, and don't have the budget for it, you can't afford to have those critical systems in house, and should consider outsourcing to a data center that does have those resources. Clustering is complicated and expensive, but spare chassis and a few spare drives don't amount to a huge IT burden. You don't need one for each server, just one that can handle the job of each unique mission-critical system (if you have 5 SQL servers, 1 Exchange, 1 Citrix, and 4 file servers, you only need 4 spare systems total).
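The spare-count arithmetic at the end is just counting unique system types rather than servers; using the example fleet above:

```python
# One spare per *unique* mission-critical system type, not per server.
fleet = ["sql"] * 5 + ["exchange"] + ["citrix"] + ["file"] * 4
spares_needed = len(set(fleet))  # 4 unique types, so 4 spare systems
```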
The average business that goes through a critical system disaster interrupting business for more than 48 hours needs a month of revenue to overcome the loss of each day of downtime, and 40% of businesses that suffer a site disaster lasting more than 3 days go bankrupt within 90 days of the event. How much money will your business lose if you have to roll your purchase database back 2 days and lose all records of those transactions? How will your business survive if e-mail is out for 3 days? How much will you lose if your online store is gone for several days? How many customers will you lose if your support department is off-line for 2 days? How much will you be sued for if you miss a contractual deadline due to data loss? Can you afford NOT to spend the money to make sure this doesn't happen?!
Re:err... (Score:4, Insightful)
"we will buy new when those fail" is what we were told
"Your successor will buy new when these fail." is the correct response to this.
CYA for Management Mistakes (Score:2, Insightful)
If management is going to brag about cost savings, make sure that you get documentation on their comments and your warnings. That way, if/when things turn to slime, you can put it in your resume that you tried to warn them.
This may be needed for your personal recovery plan. It may also be needed if lawyers get involved and you end up facing charges.