Bug | Data Storage | Upgrades | IT | Linux

Linux 4.0 Has a File-System Corruption Problem, RAID Users Warned

An anonymous reader writes: For the past few days kernel developers and Linux users have been investigating an EXT4 file-system corruption issue affecting the latest stable kernel series (Linux 4.0) and the current development code (Linux 4.1). It turns out that Linux users running the EXT4 file-system on a RAID0 configuration can easily destroy their file-system with this newest "stable" kernel. The cause and fix have materialized, but the fix hasn't yet worked its way into the mainline kernel, so users should be warned before quickly upgrading to the new kernel on systems with EXT4 and RAID0.
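For anyone unsure whether a machine is in the affected configuration, a quick pre-upgrade check might look like this (a sketch; device layouts vary, and the exact affected point releases are per the linked reports):

    uname -r            # 4.0.x kernels before the fix, and 4.1 development kernels, are affected
    cat /proc/mdstat    # look for active "raid0" md arrays
    mount -t ext4       # ext4 filesystems that may sit on top of those arrays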
  • by Anonymous Coward on Thursday May 21, 2015 @09:26AM (#49742923)

    I'll stick with Windows Vista, thanks.

  • ... need to be debugged, so using Raid® is probably the cause of this.

  • stable (Score:5, Funny)

    by rossdee ( 243626 ) on Thursday May 21, 2015 @09:34AM (#49742965)

    this is obviously some strange usage of the word "stable" that I wasn't previously aware of.

    • Re:stable (Score:5, Funny)

      by Anonymous Coward on Thursday May 21, 2015 @09:38AM (#49743009)

      If you'd ever owned horses, you would understand what "stable" means in this context.

    • This. My first thought upon reading TFS was, how did this ever pass peer review and testing to get into the "stable" kernel? They do still perform peer review and unit testing, don't they?
      • This. My first thought upon reading TFS was, how did this ever pass peer review and testing to get into the "stable" kernel? They do still perform peer review and unit testing, don't they?

        Testing? Who does that anymore? That is the user's job.

        MMO's and Microsoft have made it so.

    • Re:stable (Score:5, Informative)

      by Trevelyan ( 535381 ) on Thursday May 21, 2015 @10:20AM (#49743341)
      It's stable in terms of features and changes, i.e. no longer under development; it will only receive fixes.

      However! Kernels from kernel.org are not for end users; anyone using these kernels directly does so at their own risk.
      They are intended for integrators (distributions), whose integration includes their own patches/changes, testing, QA and end-user support.

      There is a reason that RHEL 7 is running Kernel 3.10 and Debian 8 is running 3.16. Those are the 'stable' kernels you were expecting.

      When kernel development moved from 2.5 to 2.6 (which later became 3.0), they stopped the odd/even development/stable-release numbering cycle. Now there is only development, and the integrators are expected to take the output of that to create stable releases.
  • Warning: RAID 0 (Score:3, Interesting)

    by Culture20 ( 968837 ) on Thursday May 21, 2015 @09:37AM (#49742991)
    RAID 0 is unstable to begin with. The medium-case scenario here (for legitimate use) is that some data gets corrupted on a compute node. Run the program on two nodes; if you get the same result on both, that result is probably fine. If you're running RAID0 on any filesystem that isn't temporary or at least easily replaceable, you're doing it wrong.
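    A checksum comparison is enough for that two-node sanity check (a sketch; the paths are hypothetical):

      # run the same job independently on two nodes, then compare outputs
      sha256sum /scratch/nodeA/result.dat /scratch/nodeB/result.dat
      # matching digests suggest neither run was bitten by silent corruption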
    • Re: (Score:3, Insightful)

      by Enry ( 630 )

      RAID 0 is only as unstable as its least stable component. In this case it's most likely a drive failure, and most drives have fairly long MTBFs. The chances of a disk failure increase as a function of time and number of drives deployed. A two-drive RAID 0 will be more stable than a five-drive RAID 0, which will be more stable than a 10-drive RAID 0 that's three years old. In the case of higher RAID levels, you can remove a single (or multiple) drive failure as the point of failure. In this case, the poin
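      To put rough numbers on that: a RAID 0 array is lost if any member fails, so with a per-drive annual failure probability p, the array's annual failure probability is 1 - (1 - p)^n. A quick illustration, assuming an illustrative 1% per-drive rate:

        awk 'BEGIN { p = 0.01;                     # assumed 1% annual failure rate per drive
                     for (n = 2; n <= 10; n += 4)
                         printf "%2d drives: %.1f%% chance of array loss per year\n",
                                n, (1 - (1 - p)^n) * 100 }'
        # prints roughly 2.0% for 2 drives, 5.9% for 6, 9.6% for 10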

      • Re:Warning: RAID 0 (Score:5, Insightful)

        by nine-times ( 778537 ) <nine.times@gmail.com> on Thursday May 21, 2015 @10:39AM (#49743517) Homepage

        Would you say the same thing if the bug affected RAID 1 or RAID 5?

        I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.

        It doesn't really make it ok for a bug to exist that destroys RAID 0 volumes, but it does mitigate the seriousness of the damage caused. And it's true: don't use RAID 0 to store data that you care about. I don't care if the MTBF is long, because I'm not worried about the mean time but the shortest possible time between failures. If we take 1,000,000 drives and the average failure rate is 1% for the first year, that's not that comforting to the 1% of people whose drives fail in that first year.

        • Would you say the same thing if the bug affected RAID 1 or RAID 5?

          I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.

          Exactly. About the only reason I would ever use RAID 0 is for some sort of temp data drive where for some reason I wanted to string multiple drives together. You've basically taken a bunch of drives that each would be vulnerable without redundancy and have produced one big drive that will fail whenever any component does, thereby greatly increasing failure rate over individual drive failure rate. There are only a limited set of use cases where this is a helpful thing, and basically all of them are situat

          • Well, it mitigates the seriousness of the damage a bug would cause, assuming that people use RAID reasonably.

            I'm going to go ahead and say that it mitigates the seriousness of the damage caused in actuality, since most IT people entrusted with serious and important data aren't going to be that stupid. I mean, yes, I've seen some pretty stupid things, and I've seen professional IT techs set up production servers with RAID 0, but it's a bit of a rarity. There could still be some serious damage, but much less than if it were a bug affecting RAID 5 volumes.

            • I'm going to go ahead and say that it mitigates the seriousness of the damage caused in actuality, since most IT people entrusted with serious and important data aren't going to be that stupid.

              And that's where your assumptions are different from mine. I was discussing people who are probably NOT "entrusted with serious and important data," but nevertheless have their own personal data (which they think is at least somewhat valuable) and choose to run a RAID 0 setup because of some stupid reason, like it makes their system run a bit faster.

              (Well, that's not a completely stupid reason, but it is a reason to have a good backup strategy for essential files and to segregate your data so only the mi

              • If you doubt such people exist, do an internet search or read some gamer forums.

                I think you missed my point. I don't doubt such people exist. I doubt such people are generally safeguarding information that I think is important.

        • by Enry ( 630 )

          I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.

          I meant, what if there was a bug in the RAID 5 code that caused similar corruption? This is (almost) equivalent to blaming the victim. Yes, you engaged in risky behavior, but the problem wasn't caused by the risky behavior.

          • I meant, what if there was a bug in the RAID 5 code that caused similar corruption?

            Yes, I understood. And I was saying: yes, it seems clear that we would all care more if it were a problem with RAID 5.

            I understand that you think "we would respond differently if this were RAID 5" is a sign of hypocrisy or something. But it's not really that.

            It's a little like saying, "There was a design flaw in trash cans that causes items stored in the trash can to be damaged." And people respond by saying, "Yeah, well... that's not great, but it could be worse. Things stored in trash cans are usuall

            • by Enry ( 630 )

              I understand that you think "we would respond differently if this were RAID 5" is a sign of hypocrisy or something. But it's not really that.

              Yes it is, and that's a very short-sighted approach. I hope you're not a developer.

              • As I said:

                It doesn't really make it ok for a bug to exist that destroys RAID 0 volumes, but it does mitigate the seriousness of the damage caused.

      • by TheCarp ( 96830 )

        I have been running a 4 disk RAID 5 array for a few years now at home, and did a replacement upgrade a couple of years back.

        Overall I find that in a 4-disk scenario I lose a bit less than one disk per year; maybe one disk every year and a half.

        So when you say RAID 0 that is 3 years old, that sounds about right. I would call such an array in serious danger of loss.

        • by Enry ( 630 )

          I was really just throwing out drive counts and ages. I had name-brand systems that were in a RAID 0 to consolidate two drives (the drive contents were expendable, since this was just scratch space) and they ran for many years with few failures.

    • There is no valid reason for corruption to occur on RAID0 any more than on any other setup. The problem with RAID0 is data loss (drive failure).
    • by iONiUM ( 530420 )

      For the record, I have a 6-year-old machine running Windows 7 with a RAID-0 setup (Asus P5K-E motherboard, WD 250GB drives), and it has never had an issue. It is typically on 24/7, but it has gone through many power outages where the UPS ran out of battery and it hard-reset.

      I do, of course, keep all data on a separate regular drive, along with an external back-up of that. So if the RAID-0 did die, it wouldn't be a big deal (and I could finally move to SSD!).

      Anyways, the point I am trying to make is that RAI

  • by silas_moeckel ( 234313 ) <silas@@@dsminc-corp...com> on Thursday May 21, 2015 @09:38AM (#49743001) Homepage

    If you're running a brand spanking new kernel with data you do not care about, why an old FS? Plenty of newer, better FSes to choose from.

    • Name one that actually boots the Linux kernel, and doesn't just run in user space. (Yes, I am a fan of ZFS, but not the Linux implementation.)
      • XFS, for starters; it's the default nowadays on RHEL/CentOS.

      • Re: (Score:3, Informative)

        by fnj ( 64210 )

        Name one that actually boots the Linux kernel, and doesn't just run in user space. (Yes, I am a fan of ZFS, but not the Linux implementation.)

        You really should get out more. ZFS on Linux is not to be confused with the ZFS Fuse project. You can boot [zfsonlinux.org] from a ZoL filesystem. In general ZoL is about as stable, complete, and reliable [clusterhq.com] as any ZFS.

      • by sjames ( 1099 )

        You're thinking of the ZFS that goes through FUSE. There is also ZFS on Linux that runs as kernel modules like any other fs.

        There's also btrfs.

        Of course, neither of those needs the md driver at all, they have their own raid like systems.
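        Both are easy to check or try (a sketch; the device names are illustrative):

          lsmod | grep zfs        # ZoL shows up as ordinary kernel modules, unlike the FUSE port
          # btrfs striping data across two drives, no md driver involved:
          mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc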

  • There seems to be a fix in RAID code [brown.name] and a fix in Ext4 code [kernel.org].

    The latter was incorporated in Linux 4.0.3 (changelog [kernel.org]), and according to the Phoronix article [phoronix.com] the RAID bug is still unfixed.

  • New version ... (Score:5, Insightful)

    by JasterBobaMereel ( 1102861 ) on Thursday May 21, 2015 @09:40AM (#49743023)

    This is the new 4.0 kernel, a major version update less than a month old that most Linux systems will not have yet... and the issue has already been patched.

    Bleeding-edge builds get what they expect; stable builds don't even notice.

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      The last major Linux version update that actually meant something was 1->2. The "major version" bumps in the kernel are now basically just Linus arbitrarily renumbering a release. The workflow no longer has a notion of the next major version.

    • The downside is that since no one runs business-critical loads on new stuff, business-critical tools do not get tested as well as simple stuff.
      • by jedidiah ( 1196 )

        No. They just don't run PRODUCTION on the bleeding-edge code. That doesn't mean that this stuff isn't being tested with non-trivial use cases. Any reputable IT shop is going to be putting version n+1 through its paces before it does anything important, because everyone wants to keep their jobs.

        The last time I used RAID0 for anything it was a high volume R&D project. The OS vendor probably got a couple of good bug fixes out of us.

        • Most places I know do not have identical hardware for testing; they use retired production hardware, so it is older stuff with older drivers.
    • by Yunzil ( 181064 )

      Uh, 4.0 is a stable build, chief.

    • I'll wait for 4.1, and then I'll wait for 4.1.2 just to be safe.

  • It also looks like dropping the discard mount option will avoid being hit by this serious issue.

    There's very little good reason to use 'discard' on Linux, and many reasons not to. (This isn't the first data corruption problem, and there are several performance issues as well.) fstrim in a cron job is the way to go.
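    Something like this, for example (a sketch; the UUID and schedule are placeholders):

      # /etc/fstab: mount without the 'discard' option
      UUID=...   /   ext4   defaults,noatime   0 1

      # /etc/cron.weekly/fstrim: batch TRIM instead of trimming on every delete
      #!/bin/sh
      /sbin/fstrim -v /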

    • by Rashkae ( 59673 )

      Having said that, considering the nature of this bug, I wouldn't be surprised if using fstrim would also trigger this particular bug.

    • by marsu_k ( 701360 )

      There's very little good reason to use 'discard' on Linux

      Care to elaborate on that? My bible [archlinux.org] says that discard is the first choice, fstrim when that isn't applicable for whatever reason. Bear in mind that I use Linux mostly as a desktop OS, so whatever caveats there may be in server use do not affect me.

      • by Rashkae ( 59673 )

        This is the first time I've found someone suggesting discard as the first choice over fstrim. The reason to use fstrim is stated right in that article: performance bottlenecks when there are file delete operations. (And no real benefit to trimming on the fly vs. trimming in a batch process.) However, while I usually have nothing against debating my betters and making a spectacular fool of myself, I'm not going to go out of my way to contradict the Arch Linux documentation.

        • by marsu_k ( 701360 )
          Oh, I'm not saying the Arch Wiki is infallible (although it is correct pretty much all the time). I was just looking for a rationale for whether to discard or not to discard. As a personal anecdote, this Zenbook has been running discard since day 1 (24GB SSD and 500GB HDD, discard on the first drive only, of course) - the OS partition (the 24GB drive, ext4) is still spanking fast. Although, it has never been close to running out of space (/var is on the HDD).
  • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Thursday May 21, 2015 @12:13PM (#49744281) Homepage Journal

    Tunneling down into the articles, http://git.neil.brown.name/?p=... [brown.name] has the patch. I'm building a system with 4.0.4 right now, so this was material to me.
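    In case it saves anyone a step, the build goes roughly like this (a sketch, assuming the commit has been saved locally as md-fix.patch):

      cd linux-4.0.4
      patch -p1 < md-fix.patch       # the md/raid0 fix from Neil Brown's tree
      make olddefconfig              # reuse the existing .config, defaults for new options
      make -j"$(nproc)" && sudo make modules_install install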

  • by TeknoHog ( 164938 ) on Thursday May 21, 2015 @01:46PM (#49745027) Homepage Journal
    Well, there goes that slogan [raidkillsbugs.com].
