Data Storage Japan

University Loses 77TB of Research Data Due To Backup Error (bleepingcomputer.com) 74

An anonymous reader quotes a report from BleepingComputer: The Kyoto University in Japan has lost about 77TB of research data due to an error in the backup system of its Hewlett-Packard supercomputer. The incident occurred between December 14 and 16, 2021, and resulted in 34 million files from 14 research groups being wiped from the system and the backup file. After investigating to determine the impact of the loss, the university concluded that the work of four of the affected groups could no longer be restored. All affected users have been individually notified of the incident via email, but no details were published on the type of work that was lost.

At the moment, the backup process has been stopped. To prevent data loss from happening again, the university has scraped the backup system and plans to apply improvements and re-introduce it in January 2022. The plan is to also keep incremental backups -- which cover files that have been changed since the last backup happened -- in addition to full backup mirrors. While the details of the type of data that was lost weren't revealed to the public, supercomputer research costs several hundreds of USD per hour, so this incident must have caused distress to the affected groups. The Kyoto University is considered one of Japan's most important research institutions and enjoys the second-largest scientific research investments from national grants. Its research excellence and importance is particularly distinctive in the area of chemistry, where it ranks fourth in the world, while it also contributes to biology, pharmacology, immunology, material science, and physics.


  • by Anonymous Coward on Thursday December 30, 2021 @05:14PM (#62129373)
    A match made in heaven.
    • Considering we're coming up on ten months since we placed orders with them for computers and haven't received anything, the fact their data backup didn't work is unsurprising.

      At this point HP is just blowing smoke up people's asses and lying on their financials.

      • HP Enterprise is responsible for supercomputers. If you bought HP PCs, you bought them from HP Inc., the printer and PC arm of the company that was spun off into an independent entity a few years ago.

        Same cultural heritage, but technically separate companies at this point.

  • How do you say "I shat myself" in Japanese?

    • Re:How do you say (Score:5, Interesting)

      by AmiMoJo ( 196126 ) on Thursday December 30, 2021 @05:48PM (#62129525) Homepage Journal

      Kusou detta.

    • I am wondering how you "scrape" a backup system.

      • Assuming that the data was lost in the traditional Microsoft fashion of changing the first character of the filename to indicate "deleted", this may be just reassembling files based on data still in the directory structure. If the directories were zeroed but the data file sectors left intact, scraping might refer to reading all those sectors and reassembling files based on context and position on the disk. 77 TB is a lot of data and even if it's human readable it might not be worth the human effort required
        • My post was meant to be a joke. I am fairly confident that the word in the article should have been "scrapped", not "scraped".

  • by mwfischer ( 1919758 ) on Thursday December 30, 2021 @05:15PM (#62129381) Journal

    If you can easily your backup, then you did not have a backup.

    • by Anonymous Coward on Thursday December 30, 2021 @05:17PM (#62129391)

      If you can erase your backup, then you did not have a backup.

      FTFY

    • We are not given much detail ... how old was that 77TB? How precious was the data, i.e. was it something that can be re-computed?

      This also suggests that they do not have off-site backups - if the data is 'precious' this is silly as a fire/catastrophe could be very expensive. They prolly have more than 77TB disk space in total, but 77TB off-site would only cost a few £thousand and need a fast network connection for rsync/whatever.

      This is why users of centrally provided IT services should always ensure

      • by AmiMoJo ( 196126 )

        They might have off-site backups, it depends how this disaster happened. Did the backup system erase the source files for some reason? Did they think that the source files were backed up and so deleted them, only to discover that the backup system then removed them from the backup set as well?

      • > We are not given much detail ... how old was that 77TB? How precious was the data, i.e. was it something that can be re-computed?

        Considering,
        FTA "Fugaku cost $1.2 billion to build and has so far been used for research on COVID-19, diagnostics, therapeutics, and virus spread simulations."
        I'd say the value of the data was pretty low, COVID-19 diagnostics are settled science, therapeutics are dis-allowed without regard to effectiveness due to political narratives and virus spread simulations are worse than useless.

  • by registrations_suck ( 1075251 ) on Thursday December 30, 2021 @05:17PM (#62129387)

    They waited this long to determine their backups suck?

    They waited this long to implement incremental and overlapping backups?

    They didn't periodically test their restore process?

    What a bunch of idiots.

    • only four of the affected groups lost data?
      Is there any info with more of the tech info? or did HP just fuck up and they don't want to say how they did it?

      • Was HP directly involved I wonder, or were they just the hardware vendor ?
        Had the Uni outsourced the IT operation to some 3rd party ?

        Backup software needs a process established to run it
        The process needs people who operate it correctly.
        The people who operate it must have management that checks they're doing it right.
        The management must be able to say 'hey!, it's not working'.

        I wonder if this was another Fukushima where nobody dared say anything for fear of losing face?
        Then when someone wanted a file back ... Boom!

    • by Culture20 ( 968837 ) on Thursday December 30, 2021 @05:33PM (#62129465)
      This reads as worse to me; over the course of two days, their backup system had write access to the source filesystems, and the backup system not only wiped itself of the previous backups, but also the source filesystems. So they weren't keeping anything offsite or offline, and the backup system didn't have restrictive read-only access to the source filesystems. I don't get how a design gets to that point when you're dealing with double digit TB of data.
      • My guess is the source data became corrupted and that bad data was being backed up long enough to overwrite any usable backups.

    • Likely forced by a bunch of bean counters.

  • A similar thing happened to me ages ago, not as serious though. I lost source for a mainframe application system, disk crashed and the backup tape had crc errors. Had to start over.

    Kind of like this take on Star Wars:

    I felt a great disturbance in Cyber Space, as if hundreds of voices suddenly cried out in terror

    • Re:It happens (Score:4, Insightful)

      by guygo ( 894298 ) on Thursday December 30, 2021 @06:49PM (#62129691)

      Back in the stone age, I was in charge of a small DEC cluster (a PDP 11/70, an 11/44, and an 11/34) all running RSX-11m and DECNetted together. The supplied RSX backup utility was called "BRU" (for Backup and Restore Utility, so creative). After a disk crash, we discovered that the months and months of incremental and full backups we had on tape were unreadable by the very utility that generated them. Shortly afterwards DEC announced there was a "bug" in BRU that used the wrong encrypt key vs decrypt key. Hundreds of manhours and tape dollars were wasted generating backup tapes that couldn't be read. After that, part of the daily incremental procedure was to also restore some files (different each time) to make sure the damn utility could read what it had written. Forever after the RSX community renamed BRU to BLU (the Backup and Lose Utility, hey... we can do it too!). I learned my lesson: Backup is only half the story, always make sure you can Restore. Oh, and use a 3rd-party utility; unlike the megaDEC of the time its authors might actually care about such things as Restore.
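The "restore a few files every day" habit described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions: `tar` stands in for whatever backup tool is actually in use, and the function names `make_backup` and `verify_backup` are hypothetical.

```bash
#!/usr/bin/env bash
set -u

# Create a tar archive of a directory (a stand-in for any backup tool).
make_backup() {
    local src="${1:?source dir required}" archive="${2:?archive path required}"
    tar -cf "$archive" -C "$src" .
}

# Restore the archive into a scratch directory and compare it with the source.
# Returns 0 only if the restore succeeds AND the two trees are identical.
verify_backup() {
    local src="${1:?source dir required}" archive="${2:?archive path required}"
    local scratch status=0
    scratch=$(mktemp -d) || return 1
    tar -xf "$archive" -C "$scratch" 2>/dev/null \
        && diff -r "$src" "$scratch" >/dev/null \
        || status=1
    rm -rf "$scratch"
    return "$status"
}
```

The point of the design is that a backup only counts as good once the restore path has been exercised; a successful exit code from the write step alone proves nothing. On a live system the source may have changed since the backup ran, so a real check would compare against a snapshot or stored checksums instead.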

      • by tlhIngan ( 30335 )

        > Backup is only half the story, always make sure you can Restore

        Any idiot can write a backup utility. Really, any idiot.

        However, the harder challenge is writing a restore utility. It's completely trivial to write a backup utility, but writing a utility to restore a backup takes a lot more skill and experience.

        • Apparently, the difficult part is writing a backup utility that has the power to destroy the production data.

          That's pretty impressive.

          I mean, "copy" won't ever delete the original. "rsync", you need to get really fancy to intelligently destroy the original, or really sloppy to forget the one argument to avoid touching the original.

          Looks like they were doing more than backing up. They were deleting/rotating log files -- based on argument-defined paths. A common rookie mistake that ultimately creates the

      • by Agripa ( 139780 )

        I had something similar happen with a CMS QIC-80 drive which had soft heads. Tapes would write and verify, somehow, but could not be read. I did not know it for months.

        The lesson was learned; verify backups as a separate operation. CMS screwed us over warranty replacement by delaying until the warranty expired, and then going bankrupt.

  • by nospam007 ( 722110 ) * on Thursday December 30, 2021 @05:18PM (#62129399)

    ...from the other article.

  • I am currently running scripts to construct a regression suite of all the defects submitted against my module over the last 20 years. It is not a supercomputer, but I have already cleared my 1 TB disk of temporary scratch files several times. I expect to create 5 to 8 TB of simulation data. So a supercomputer can easily create 100 TB in just a week or two. If the input files have been backed up, the lost files can be recreated in a week or two of run time.

    Usually we do not store any long lasting thing in

    • > I am currently running scripts to construct a regression suite of all the defects submitted against my module over the last 20 years. It is not a supercomputer, but I have already cleared my 1 TB disk of temporary scratch files several times. I expect to create 5 to 8 TB of simulation data. So a supercomputer can easily create 100 TB in just a week or two. If the input files have been backed up, the lost files can be recreated in a week or two of run time.

      I'm not sure this is a particularly meaningful statement.

      If I do 'cat /dev/urandom > /tmp/foo' I can generate 1 TB of data in just a few minutes... but I don't think that's useful data.

      If I'm trying to decrypt a file... well, generating a few KB of data might take the lifetime of the universe.

      I expect the rate at which these groups generated data is highly dependent on the particular workloads.

    • What is your supercomputer doing? Were you looking for an answer or looking for an endless string of intermittent data? Did you need to follow the path of individual particles? Or did you book 7.5 million supercomputer years just for it to spit out the number 43?

      The article tells us nothing about the nature, importance or the complexity involved in generating the data. All we can do is guess.

      • *42

        • This is just a run of the mill university supercomputer, not Deep Thought. Expect rounding errors to creep in ;-)

      • Typical for any finite element method simulations. The mesh will have several hundred million elements (common in CFD), billion or two degrees of freedom. Depending on physics the matrix equation might be solved by a form of LU decomposition (more common in Maxwell's equations). All these will create terabytes of temporary data. But after post processing lots of the data can be cleared. Once you get a solution you can delete the L and U matrices etc.

        We need extremely fast disk in super computers, we even

        • Of course. I wasn't asking why you're generating data, just giving examples that there's a wide range of different things a supercomputer can be used to compute, and not all of them generate huge datasets.

        • by jabuzz ( 182671 )

          I have seen either Gaussian or Gromacs (it's a bit fuzzy now which one) generate ~1TB temporary files. It was creating issues for the backup until I excluded them.

          We have some large memory nodes with 3TB of RAM and have seen those top out a bit over 2TB of usage when generating large meshes for CFD.

          The TL;DR here is if you don't do HPC you likely have not the foggiest what is involved.

          All that said, full daily backups of several PB are perfectly feasible *IF* you choose the right file system. Where the right file sy

  • by dogsbreath ( 730413 ) on Thursday December 30, 2021 @05:33PM (#62129463)

    Backing 77TB onto floppies [mainichi.jp] is fraught with difficulty even if they are 1.44MB each and from HP.

  • the amount of incompetency in regards to data security just astounds me.

  • Real super. Dude, you're getting a Dell, lol.
  • rm -rf * worked faster than admin could kill the process.

  • I just feel for the users. We can all make comments about the vendor or the admins, but at the end of the day, we've all let the users down and our profession has a black eye.
    • by Tyr07 ( 8900565 )

      Yeah but let's be honest.

      Often we lay out the risks, if it's a 2% risk or whatever, it still means this scenario could happen.
      We can offer solutions that drop it down to way less, to near impossible to have a complete failure statistically.

      Then they look at the cost, and say, you know what, that 2% risk is acceptable, and just hope they aren't part of that 2%.

      Technology changes and improves as well. It's a hard sell to people making financial decisions why you need to scrap the old system that was working a

    • by CWCheese ( 729272 ) on Thursday December 30, 2021 @06:07PM (#62129587)
      I would hope the users, at least some of them might have been forward thinking enough to keep their own copies of the data

      As a computer scientist & engineer for over 40 years, I've nearly always kept copies of data and code (etc etc etc) on my own storage devices because I've lost enough over the years to know you can't have enough backup methods.
  • by stikves ( 127823 ) on Thursday December 30, 2021 @06:01PM (#62129567) Homepage

    I think they were using "alias rsync=rm -rf" to speed up their backups.

    Joking aside, it needs to be a perfect storm, so that
    - Researchers have no local copies on workstations
    - The disks and RAID mirrors are both affected
    - They did not use snapshots / copy-on-write for sensitive data (no ZFS/btrfs/WinRE)
    - They overwrite past backups with broken data
    - They cannot use any recovery software to bring the data back (testdisk + control board change worked for me in the past)

    This is really sad...

    • by hey! ( 33014 )

      Can anyone explain to me why COW and snapshotting haven't become standard practice yet?

      • > Can anyone explain to me why COW and snapshotting haven't become standard practice yet?

        There's a performance penalty, especially for rewrites.

        Many workloads don't notice but some do.

        • by hey! ( 33014 )

          Granted, but it seems like the default approach should be the safer one, because it's simpler to buy more performance than it is to buy more security.

        • CoW and snapshots shouldn't have any performance impact if you're using enterprise-grade systems and set them up rationally. But it does mean a lot of added cost, especially if you have high change rates in your dataset. You can also cache and destage to a timed immutable backup location, which prevents data alteration (intentional or accidental), but that's even more expensive.
    • by jabuzz ( 182671 )

      It's a supercomputer so the chances of using ZFS, btrfs or any file system you have experience with is around h.

      It's HP, so my guess is they were using Lustre, which in my experience is prone to eating itself and for which backup options suck.

      The smart money uses GPFS (or Spectrum Scale as IBM likes to call it now), at which point the smart money also uses TSM (or Spectrum Protect as IBM likes to call it now) and does daily backups. For something like this with 77TB of data I would expect to restore it in we

  • of run time on a modern supercomputer, I'm going to guess. Re-run the program, go grab a cup of coffee from the vending machine down the hall. Problem solved.
  • In a follow up notice, Kyoto University apologizes to the human race for losing the cure for all cancer, highly functional self driving vehicles, the fist self aware AI, and a working fusion reactor design. The notice goes on to say the University is planning lawsuits against HP for loss of intellectual property for $77 trillion dollars.
    • > In a follow up notice, Kyoto University apologizes to the human race for losing [...] the fist self aware AI [...]

      "Fist", indeed.

  • by bugs2squash ( 1132591 ) on Thursday December 30, 2021 @06:56PM (#62129709)
    The result after all that number crunching is 42, just write it on a post-it note for heaven's sake
  • horrible memories (Score:5, Interesting)

    by bloodhawk ( 813939 ) on Thursday December 30, 2021 @07:14PM (#62129761)
    Brings back horrible memories. Was doing my honours research in the early 90's into AI with neural networks, I had 6 months of Neural network training and modelling wiped out as one of the admins noticed a lot of processing power and space being used by something he didn't recognize (each model took 8-12 hours to process) and they hadn't been doing backups properly (no I could not backup this myself as was in days of floppy disks and we had no access to extract data from the system and even if I did it would have taken hundreds of floppies). My professor at the time shrugged and said "oh well you just have to start again", instead I made up a heap of generic garbage and handed that in, biggest wasted year of my life.
  • by Another Random Kiwi ( 6224294 ) on Thursday December 30, 2021 @07:58PM (#62129863)
    From the document that the university posted (in Japanese), from HPE, about the reason for the loss (rough translation):
    ---
    3. Cause of file loss
    The backup script uses the find command to delete log files that are older than 10 days. As part of improving the script, the name of the variable passed to the find command's delete step was changed to improve visibility and readability, but the release procedure for the modified script was not given enough consideration.
    bash reads a shell script incrementally, as needed, while the script is running. The modified script was released by overwriting the file while an instance of it was still running, without awareness of this side effect, so the modified shell script was reloaded from the middle. As a result, a find command containing an undefined variable was executed, and instead of deleting the files saved in the log directory it deleted the files in /LARGE0.
    ---
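A minimal sketch of the usual guards against this failure mode, assuming a cleanup job like the one the report describes. The function name `purge_old_logs`, the `*.log` pattern, and the `main` wrapper are hypothetical; nothing here is taken from the actual HPE script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: names and patterns are illustrative only.
set -u   # expanding an unset variable becomes a fatal error, not ""

# Delete *.log files more than 10 days old under the given directory.
purge_old_logs() {
    # ${1:?...} aborts when the argument is unset OR empty, so find is
    # never handed an empty path by accident.
    local log_dir="${1:?purge_old_logs: log directory is required}"
    [ -d "$log_dir" ] || { echo "not a directory: $log_dir" >&2; return 1; }
    find "$log_dir" -type f -name '*.log' -mtime +10 -exec rm -f -- {} +
}

# bash parses a whole function definition before running any of it, so
# putting the work inside main() means an in-place overwrite of the file
# cannot change the commands already being executed. A real script would
# typically follow the call with `exit` so nothing after it is ever read.
main() {
    if [ "$#" -gt 0 ]; then
        purge_old_logs "$1"
    fi
}

main "$@"
```

On the release side, the complementary habit is to write the new version to a temporary file and `mv` it into place, which gives the new script a fresh inode, instead of editing or copying over the file a running instance still has open.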
    • bash reads each line from disk as it executes?

      If so I've never guarded against that but gdi shell scripting is fragile.

      I usually save myself the agitation and start with perl. That's a lie - I start with bash then get frustrated and rewrite it in perl but if it gets to look like a real program I rewrite it in python or god forbid go.

      At least any of those compile before execution.

      • by Another Random Kiwi ( 6224294 ) on Thursday December 30, 2021 @10:04PM (#62130085)
        set -u, in bash, is your friend, too. Seems they got burned by "undefined" variables evaluating to empty strings, but then people who write the logical equivalent of rm -rf $FOO/ kinda get what they deserve, or, as we see in this case, someone else gets what the author deserved...
      • Yup, same here. Always always wind up in perl.

        But honestly, I'm a master programmer because it's been three decades since I, too, learned the hard way that you never ever ever ever use a passed-in argument for a destructive command.

        bad:
        delete $dir
        better:
        my $dirs = {'one'=>'/foo', 'two'=>'/bar', 'three'=>'/foo/bar'};
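The same lookup-table idea, sketched in shell for comparison. The keys and paths simply mirror the placeholder values in the snippet above, and `resolve_dir` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -u

# Map a short, fixed key to a directory. Callers pass a key, never a path,
# so a mangled or empty argument cannot turn into an arbitrary delete target.
resolve_dir() {
    local key="${1:?key required}"
    case "$key" in
        one)   printf '%s\n' /foo ;;
        two)   printf '%s\n' /bar ;;
        three) printf '%s\n' /foo/bar ;;
        *)     echo "resolve_dir: unknown key: $key" >&2; return 1 ;;
    esac
}
```

A caller would do something like `target=$(resolve_dir "$1") || exit 1` and only then hand `"$target"` to the destructive command.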

        • ...and if you're willing to go the multi-pass route, your "delete" function merely renames the files with a to-be-deleted flag.

          Another routine, hopefully a day lagged if you can spare it, goes through and deletes the flagged files -- subject to simple quantity checks, or your approval of the previous night's flag report -- even if it's only a simple count.
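The two-pass scheme described above could look roughly like this. The `.to-be-deleted` suffix, the function names, and the count limit are all assumptions made for the sketch.

```bash
#!/usr/bin/env bash
set -u

# Pass 1: "delete" only renames each file with a to-be-deleted flag.
flag_for_deletion() {
    local f
    for f in "$@"; do
        if [ -f "$f" ]; then
            mv -- "$f" "$f.to-be-deleted"
        fi
    done
}

# Pass 2 (ideally run a day later): remove flagged files, but refuse if
# there are more of them than the caller says is plausible.
purge_flagged() {
    local dir="${1:?directory required}" max="${2:?max count required}"
    local count
    # Counting lines miscounts names containing newlines; fine for a sketch.
    count=$(( $(find "$dir" -type f -name '*.to-be-deleted' | wc -l) ))
    if [ "$count" -gt "$max" ]; then
        echo "refusing: $count flagged files exceeds limit $max" >&2
        return 1
    fi
    find "$dir" -type f -name '*.to-be-deleted' -exec rm -f -- {} +
}
```

A job that normally flags a few hundred logs and suddenly flags millions of files trips the limit, and everything is still on disk under its flagged name to be renamed back.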

  • spawn a copy and run it to tape every once in a while.

    those copies on disk are always vulnerable to error human or tech.

    I always assumed my management, programmers & users were their own worst enemies and made mistakes in judgement.

    But then again I am an old fart BOFH now.

    Tape is cheap and reliable.

    Someone listened to a vendor and believed them.

    • Tape backup isn't always more reliable. In fact, I remember being very surprised that HP Data Protector would allow you to "backup" data to tape that turned out to be completely useless in the event of a restore. This is why you don't just backup regularly, you TEST your backups regularly. If the University of Kyoto failed to do that, then honestly, it's a human error.
      • Or you could just get a better backup system. The only testing I do is the occasional restore when some numpty user deletes a file or directory by mistake. Then again, as part of a storage upgrade I restored the whole lot from backup earlier in the year. Mind you, I did have to throw a wobbler at IBM because TSM was returning error codes if there were any ACLs and they were only offering to fix it in the next point release. Restored everything fine, just the exit code was nonzero, but how was I to know if that

  • by Anonymous Coward

    Many decades ago, I worked at place where one group did regular backups using removable disk packs (back in the days of "washing machine like" disk drives not unlike the IBM 1311) and kept the backups on site in the same area. They kept several iterations of full backups.

    One night, something bad happened and the operator decided they needed to restore the disk with the group's software on it due to some catastrophic mistake. Not a giant deal as backups were done (IIRC) nightly so only a days work would be l

  • Once we had a 3rd party script that always ended successfully, but when a restore for the European Commission was needed the data on virtual tape was only zeroes. Imagine our surprise.

    Fortunately the storage guys miraculously found the needed data.
  • No actually, don't trust anyone to back up your data. Also, always check your parachute before you jump out of an airplane.
