Follow Slashdot blog updates by subscribing to our blog RSS feed

Linux 4.0 Has a File-System Corruption Problem, RAID Users Warned 226

Posted by timothy on Thursday May 21, 2015 @09:23AM from the don't-store-the-ark-there dept.

An anonymous reader writes: For the past few days kernel developers and Linux users have been investigating an EXT4 file-system corruption issue affecting the latest stable kernel series (Linux 4.0) and the current development code (Linux 4.1). It turns out that Linux users running the EXT4 file-system on a RAID0 configuration can easily destroy their file-system with this newest "stable" kernel. The cause and fix have materialized but it hasn't yet worked its way out into the mainline kernel, thus users should be warned before quickly upgrading to the new kernel on systems with EXT4 and RAID0.

This discussion has been archived. No new comments can be posted.

Linux 4.0 Has a File-System Corruption Problem, RAID Users Warned

Load All Comments

Search 226 Comments Log In/Create an Account

Comments Filter:

Linux is clearly unstable! (Score:5, Funny)

by Anonymous Coward writes: on Thursday May 21, 2015 @09:26AM (#49742923)

I'll stick with Windows Vista, thanks.

Share
twitter facebook
The disks usually don't ... (Score:2, Offtopic)

by CaptainDork ( 3678879 ) writes:

... need to be debugged, so using Raid® is probably the cause of this.
- Re: (Score:2)
  
  by Culture20 ( 968837 ) writes:
  
  You laugh, but the first computer bug was a moth.
  https://www.youtube.com/watch?... [youtube.com]
  - - Bob Wilson knows they're real! (Score:2)
      
      by Thud457 ( 234763 ) writes:
      
      Way to go Barney Buzzkill!
      Next you're gonna try to tell us that gremlins aren't real.
stable (Score:5, Funny)

by rossdee ( 243626 ) writes: on Thursday May 21, 2015 @09:34AM (#49742965)

this is obviously some strange usage of the word "stable" that I wasn't previously aware of.

Share
twitter facebook
- Re:stable (Score:5, Funny)
  
  by Anonymous Coward writes: on Thursday May 21, 2015 @09:38AM (#49743009)
  
  If you ever owned horses, you would understand what "stable" means in this context
  
  Parent Share
  twitter facebook
- Re: (Score:3)
  
  by Deep Esophagus ( 686515 ) writes:
  
  This. My first thought upon reading TFS was, how did this ever pass peer review and testing to get into the "stable" kernel? They do still perform peer review and unit testing, don't they?
  - Re: (Score:2)
    
    by duke_cheetah2003 ( 862933 ) writes:
    
    This. My first thought upon reading TFS was, how did this ever pass peer review and testing to get into the "stable" kernel? They do still perform peer review and unit testing, don't they?
    Testing? Who does that anymore? That is the user's job.
    MMO's and Microsoft have made it so.
- Re:stable (Score:5, Informative)
  
  by Trevelyan ( 535381 ) writes: on Thursday May 21, 2015 @10:20AM (#49743341)
  
  It's stable as in terms of features and changes. i.e. No longer under development and will only receive fixes.
  
  However! Kernels from kernel.org are not for end users, if someone is using these kernels directly then they do so at their own risk.
  They are intended for integrators (distributions), whose integration will include their own patches/changes, testing, QA and end user support
  
  There is a reason that RHEL 7 is running Kernel 3.10 and Debian 8 is running 3.16. Those are the 'stable' kernels you were expecting.
  
  When kernel development moved from 2.5 to 2.6 (that later became 3.0), they stopped their odd/even number development/stable-release cycle. Now there is only development, and the integrators are expected to take the output of that to create stable-releases.
  
  Parent Share
  twitter facebook
- - Re:stable (Score:4, Insightful)
    
    by dave420 ( 699308 ) writes: on Thursday May 21, 2015 @11:01AM (#49743729)
    
    I understand if you are emotionally attached to Linux to the point where someone accidentally criticising it makes you feel uncomfortable, but you really should be able to figure out that "but... but... they're worse!" is no rational response :)
    
    Parent Share
    twitter facebook
    - - Re: (Score:2, Insightful)
        
        by andydouble07 ( 2344014 ) writes:
        
        Meanwhile, my Win keeps BSOD.
        Really? Sounds like you're screwing something up pretty bad, haven't seen one of those in about 6 or 7 years.
        
        Re: (Score:3)
        
        by jbengt ( 874751 ) writes:
        
        I can routinely cause a BSOD (about 1/3 of the time) on my HP laptop running Windows 7 Pro if I use the touchpad at the log-in screen on start-up. It's apparently a known bug in the touchpad driver that will not get fixed.
        
        Re: stable (Score:4, Insightful)
        
        by oobayly ( 1056050 ) writes: on Thursday May 21, 2015 @03:30PM (#49745765)
        
        It's not. However it isn't beyond a reasonable expectation that a dodgy touchpad driver shouldn't be able to kill an OS.
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by jeremyp ( 130771 ) writes:
        
        I can write you a touchpad driver that will crash Linux even if you don't use the touchpad on start up. That Linus Torvalds really doesn't know what he's doing.
Warning: RAID 0 (Score:3, Interesting)

by Culture20 ( 968837 ) writes: on Thursday May 21, 2015 @09:37AM (#49742991)

RAID 0 is unstable to begin with. Medium case scenario here (for legitimate use) is some data gets corrupted on a compute node. Run the program on two nodes; if you get the same result on both, that result is probably fine. If you're running RAID0 on any filesystem that isn't temporary or at least easily replaceable, you're doing it wrong.

Share
twitter facebook
- Re: (Score:3, Insightful)
  
  by Enry ( 630 ) writes:
  
  RAID 0 is only as unstable as its least stable component. In this case it's most likely a drive failure, and most drives are fairly long MTBFs. The chances of a disk failure increase as a function of time and number of drives deployed. A two-drive RAID 0 will be more stable than a five-drive RAID 0 which will be more stable than a 10 drive RAID 0 that's three years old. In the case of higher RAID levels, you can remove a single (or multiple) drive failure as the point of failure. In this case, the poin
  - Re:Warning: RAID 0 (Score:5, Insightful)
    
    by nine-times ( 778537 ) writes: <nine.times@gmail.com> on Thursday May 21, 2015 @10:39AM (#49743517) Homepage
    
    Would you say the same thing if the bug affected RAID 1 or RAID 5?
    I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.
    It doesn't really make it ok for a bug to exist that destroys RAID 0 volumes, but it does mitigate the seriousness of the damage caused. And it's true: Don't use RAID 0 to store data that you care about. I don't care if the MTBF is long, because I'm not worried about the mean time, but the shortest possible time between failures. If we take 1,000,000 drives and the average failure rate is 1% for the first year, it's that that comforting to the 1% of people whose drives fail in that first year.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by AthanasiusKircher ( 1333179 ) writes:
      
      Would you say the same thing if the bug affected RAID 1 or RAID 5?
      I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.
      
      Exactly. About the only reason I would ever use RAID 0 is for some sort of temp data drive where for some reason I wanted to string multiple drives together. You've basically taken a bunch of drives that each would be vulnerable without redundancy and have produced one big drive that will fail whenever any component does, thereby greatly increasing failure rate over individual drive failure rate. There are only a limited set of use cases where this is a helpful thing, and basically all of them are situat
      - Re: (Score:3)
        
        by nine-times ( 778537 ) writes:
        
        Well, it mitigates the seriousness of the damage a bug should cause, assuming that people use RAID reasonably.
        I'm going to go ahead and say that it mitigates the serious of the damage caused in actuality since most IT people entrusted with serious and important data aren't going to be that stupid. I mean, yes, I've seen some pretty stupid things, and I've seen professional IT techs set up production servers with RAID 0, but it's a bit of a rarity. There could still be some serious damage, but much less than if it were a bug affecting RAID 5 volumes.
        
        Re: (Score:2)
        
        by AthanasiusKircher ( 1333179 ) writes:
        
        I'm going to go ahead and say that it mitigates the serious of the damage caused in actuality since most IT people entrusted with serious and important data aren't going to be that stupid.
        And that's where your assumptions are different from mine. I was discussing people who are probably NOT "entrusted with serious and important data," but nevertheless have their own personal data (which they think is at least somewhat valuable) and choose to run a RAID 0 setup because of some stupid reason, like it makes their system run a bit faster.
        (Well, that's not a completely stupid reason, but it is a reason to have a good backup strategy for essential files and to segregate your data so only the mi
        
        Re: (Score:2)
        
        by nine-times ( 778537 ) writes:
        
        If you doubt such people exist, do an internet search or read some gamer forums.
        I think you missed my point. I don't doubt such people exist. I doubt such people are generally safeguarding information that I think is important.
    - Re: (Score:2)
      
      by Enry ( 630 ) writes:
      
      I suspect not, since his point seemed to be that you shouldn't be using RAID 0 for data that you care about anyway.
      I meant, what if there was a bug in the RAID 5 code that caused similar corruption? This is equivalent (almost) to blaming the victim. Yes, you did risky behavior, but the problem wasn't caused because of the risky behavior.
      - Re: (Score:2)
        
        by nine-times ( 778537 ) writes:
        
        I meant, what if there was a bug in the RAID 5 code that caused similar corruption?
        Yes, I understood. And I way saying, yes, it seems clear that we would all care more if it were a problem with RAID 5.
        I understand that you think "we would respond differently if this were RAID 5" is a sign of hypocrisy or something. But it's not really that.
        It's a little like saying, "There was a design flaw in trash cans that cause items stored in the trash can to be damaged." And people respond by saying, "Yeah, well... that's not great, but it could be worse. Things stored in trash cans are usuall
        
        Re: (Score:2)
        
        by Enry ( 630 ) writes:
        
        I understand that you think "we would respond differently if this were RAID 5" is a sign of hypocrisy or something. But it's not really that.
        Yes it is, and that's a very short sighted approach. I hope you're not a developer.
        
        Re: (Score:2)
        
        by nine-times ( 778537 ) writes:
        
        As I said:
        It doesn't really make it ok for a bug to exist that destroys RAID 0 volumes, but it does mitigate the seriousness of the damage caused.
  - Re: (Score:2)
    
    by TheCarp ( 96830 ) writes:
    
    I have been running a 4 disk RAID 5 array for a few years now at home, and did a replacement upgrade a couple of years back.
    Overall I find in a 4 disk scenario I lose just a bit less than one disk per year. Maybe one disk every year and a half.
    So when you say RAID 0 that is 3 years old, that sounds about right. I would call such an array in serious danger of loss.
    - Re: (Score:2)
      
      by Enry ( 630 ) writes:
      
      I was really just throwing out drives and times. I had name-brand systems that were in a RAID 0 to consolidate two drives (the drive contents were expendable since this was just scratch space) and they ran for many years with few failures.
    - - Re: (Score:2)
        
        by TheCarp ( 96830 ) writes:
        
        Hard to say but it will happen eventually. I have seen it go a few years, then lose 2 within a few months. Always make sure monitoring works and will alert you if its degraded. You can run degraded mode for a long time without monitoring.....till the next one fails.
        They are mechanical, so manufacturing quality and environment will factor in. My drives likely see a lot of shake and heat being on the third floor of a 100 year old house, between the wind, the washing machine and seasonal heat.... its no data c
- Re: (Score:2)
  
  by danbob999 ( 2490674 ) writes:
  
  There is no valid reason for corruption to occurs on RAID0 anymore than on any other setup. The problem of RAID0 is data loss (drive failure).
- Re: (Score:2)
  
  by iONiUM ( 530420 ) writes:
  
  For the record, I have a 6 year old machine running Windows 7 with a RAID-0 setup (asus p5k-e motherboard, WD 250gb drives), and it has never had an issue. It it typically on 24/7, but it has gone through many power outages where the UPS ran out of battery and it hard-reset.
  I do, of course, keep all data on a separate regular drive, along with an external back-up of that. So if the RAID-0 did die, it wouldn't be a big deal (and I could finally move to SSD!).
  Anyways, the point I am trying to make is that RAI
- - Re: (Score:2)
    
    by styrotech ( 136124 ) writes:
    
    Linux md RAID10 [wikipedia.org] is a 'non standard' single level layout that does not have a RAID0 component/layer.
    There are 3 layouts available for it, one of which can mimic the underlying block layout of the 'standard' layered/nested RAID10.
Why ext4 (Score:3)

by silas_moeckel ( 234313 ) writes: <silas@@@dsminc-corp...com> on Thursday May 21, 2015 @09:38AM (#49743001) Homepage

If your running a brand spanky new kernel, with data you do not care about why an old FS. Plenty of newer better FS's to choose from.

Share
twitter facebook
- Re: (Score:2)
  
  by houstonbofh ( 602064 ) writes:
  
  Name one that actually boots the Linux kernel, and doesn't just run in user space. (Yes, I am a fan of ZFS, but not the Linux implementation.)
  - Re: (Score:2)
    
    by silas_moeckel ( 234313 ) writes:
    
    XFS for starters it's the default nowadays on rhel/centos.
    - - Re: (Score:2)
        
        by Pope Hagbard ( 3897945 ) writes:
        
        It's newer than ext2, which ext4 is based on.
      - Re: (Score:2)
        
        by kthreadd ( 1558445 ) writes:
        
        XFS isn't newer than ext4.
        Depends on how you look at. XFS is continuously developed and improved, they just don't stick a version number after it like the ext developers.
  - Re: (Score:3, Informative)
    
    by fnj ( 64210 ) writes:
    
    Name one that actually boots the Linux kernel, and doesn't just run in user space. (Yes, I am a fan of ZFS, but not the Linux implementation.)
    You really should get out more. ZFS on Linux is not to be confused with the ZFS Fuse project. You can boot [zfsonlinux.org] from a ZoL filesystem. In general ZoL is about as stable, complete, and reliable [clusterhq.com] as any ZFS.
    - - Re: (Score:3)
        
        by houstonbofh ( 602064 ) writes:
        
        If you trace it back, all of that fear originates on one post from the freenas forums. A post from one of the key developers says that you should use ecc for any server with critical data, but zfs is neither more or less sensitive to it.
  - Re: (Score:2)
    
    by sjames ( 1099 ) writes:
    
    You're thinking of the ZFS that goes through FUSE. There is also ZFS on Linux that runs as kernel modules like any other fs.
    There's also btrfs.
    Of course, neither of those needs the md driver at all, they have their own raid like systems.
- - Re: (Score:2)
    
    by pla ( 258480 ) writes:
    
    Ext4 should in theory be the best choice. It's widely used and has a large enterprise support. Lots of business people get angry if it does not work properly.
    
    On a modern system with multiple disks you want to configure as some variety of soft-RAID, ZFS hands-down counts as the clear best choice (short of going for a "cluster" FS). It allows an arbitrary number of extra parity drives (think "RAID 8"), as well as arbitrarily many hot spares; it quickly and easily recovers from having someone pull out all
    - Re: (Score:2)
      
      by Enry ( 630 ) writes:
      
      You can't remove drives from a ZFS pool - once they're in (even if you have free space on other drives), the number of drives can't go down. Which really bothers me. With LVM you can evacuate data off of drives and shrink the pv. LVM in itself isn't a filesystem, but if you think of a pool as an LVM volume the functionality is somewhat similar.
      - Re: (Score:3)
        
        by Rich0 ( 548339 ) writes:
        
        The problem is that the feature-list for ZFS is very enterprise-oriented.
        Why would you want to add just one drive to a server with 5x 6-drive RAID6 arrays? Just add another 6 drives at a time.
        On the other hand, if you have a PC with 3 drives in RAID5, you could easily want to turn that into a 4-drive RAID5 or a 5-drive RAID6 in-place.
        Btrfs has a lot of features that are useful for smaller deployments, like being able to modify the equivalent of a vdev in-place. ZFS on the other hand has a lot of features
        
        Re: (Score:2)
        
        by goarilla ( 908067 ) writes:
        
        Why would you want to add just one drive to a server with 5x 6-drive RAID6 arrays? Just add another 6 drives at a time.
        
        ZFS isn't ideal for growing like that since it doesn't do rebalancing. Your younger raid arrays will always have more data on them.
        Also zfs destroy is very expensive.
        
        Re: (Score:2)
        
        by goarilla ( 908067 ) writes:
        
        *Your older raid arrays*
        
        Re: (Score:2)
        
        by Rich0 ( 548339 ) writes:
        
        Why would you want to add just one drive to a server with 5x 6-drive RAID6 arrays? Just add another 6 drives at a time.
        ZFS isn't ideal for growing like that since it doesn't do rebalancing. Your younger raid arrays will always have more data on them.
        Also zfs destroy is very expensive.
        Perhaps, but my point was more that if you want to grow ZFS this is the ONLY way to actually do it, as far as I'm aware. You can't add individual drives to individual "vdevs."
        
        Re: (Score:2)
        
        by goarilla ( 908067 ) writes:
        
        Well you can grow the underlying component devices of the vdev. But yes even the "perfect" ZFS has its weaknesses.
        
        Re: (Score:2)
        
        by Rich0 ( 548339 ) writes:
        
        Agree, as the other reply pointed out as well. And you can do the same with mdadm raid too (though obviously with none of the benefits btrfs/zfs bring for data integrity like checksumming and copy-on-write). Mdadm will also let you reshape an array in place (that is change raid levels or number of disks), though with mdadm that will often result in messing up your stripe alignment and of course it is more likely to eat your data if something goes wrong since if it finds a parity mismatch it has no way to
        
        Re: (Score:2)
        
        by pla ( 258480 ) writes:
        
        Perhaps, but my point was more that if you want to grow ZFS this is the ONLY way to actually do it, as far as I'm aware. You can't add individual drives to individual "vdevs."
        
        You can replace all the drives in the array with bigger ones, resilvering after each replacement, and when you get to the last one, poof, you magically have a bigger pool. I certainly won't claim that as terribly efficient, though. :)
        
        It has its shortcomings, no doubt. But compared to old-school RAID or even LVM, it takes a huge
        
        Re: (Score:2)
        
        by Rich0 ( 548339 ) writes:
        
        Sure, but with btrfs you can just add one drive and sometimes get its entire capacity added to your array - it works fine with mixed-size disks.
        Of course, it might just decide not to boot the next day, and that is the downside to btrfs. It does tend to be a bit more friendly in scenarios where you have a small number of disks, though, which was my main point.
      - Re: (Score:2)
        
        by pla ( 258480 ) writes:
        
        I will readily admit that as a "shortcoming" of ZFS, but honestly, I don't quite see any obvious use cases for it. On the short term (months), I've only ever needed to *add* storage, never remove it.
        
        On the longer term (years), I have found that I go back and forth on how many drives I need, but when I do eventually upgrade my home NAS to bigger and better hardware, I don't even try to salvage old drives 1/10th the size of modern ones - I bring up the new system, with however many brand new drives I consi
        
        Re: (Score:2)
        
        by sjames ( 1099 ) writes:
        
        In a production server, I can see value in stepwise evacuating old drives and then swapping them for new larger drives only once the data is stable on the filesystem. Done right you could pull it off with zero downtime without opening a window where a single failure brings you down.
      - Re: (Score:2)
        
        by sjames ( 1099 ) writes:
        
        I'm using ZFS in production for now but I'm actively testing btrfs for that reason among others.
- - - Re: (Score:3)
      
      by wed128 ( 722152 ) writes:
      
      ReiserFS predates ext4, and it's hard to be an active software developer in prison.
      - Re: (Score:2)
        
        by kthreadd ( 1558445 ) writes:
        
        If the file system actually was great I'm sure someone else would pick it up, but I don't know if it was that great.
        
        Re: (Score:2)
        
        by sjames ( 1099 ) writes:
        
        It had some REALLY ugly failure modes.
Two issues in play? (Score:2)

by jones_supa ( 887896 ) writes:

There seems to be a fix in RAID code [brown.name] and a fix in Ext4 code [kernel.org].
The latter was incorporated in Linux 4.0.3 (changelog [kernel.org]), and according to the Phoronix article [phoronix.com] the RAID bug is still unfixed.
New version ... (Score:5, Insightful)

by JasterBobaMereel ( 1102861 ) writes: on Thursday May 21, 2015 @09:40AM (#49743023)

This is the new 4.0 kernel, A Major version update , less than a month old, that most Linux systems will not have yet ...and the issue has already been patched
Bleeding edge builds get what they expect, stable builds don't even notice

Share
twitter facebook
- Re: (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  The last major Linux version update that actually meant something was 1->2. The "major version" bumps in the kernel are now basically just Linus arbitrarily renumbering a release. The workflow no longer has a notion of the next major version.
  - Re: (Score:2)
    
    by torqer ( 538711 ) writes:
    
    Agreed... Tovalds having a google+ poll for changing major version numbers is as arbitrary as not having windows 9.
    https://plus.google.com/+Linus... [google.com]
- Re: (Score:2)
  
  by houstonbofh ( 602064 ) writes:
  
  The down side is that since no one runs business critical loads on new stuff, business critical tools do not get tested as well as simple stuff.
  - Re: (Score:2)
    
    by jedidiah ( 1196 ) writes:
    
    No. They just don't run PRODUCTION on the bleeding edge code. That doesn't mean that this stuff isn't being tested with non-trivial use cases. Any reputable IT shop is going to be putting version n+1 through it's paces before it does anything important because everyone wants to keep their jobs.
    The last time I used RAID0 for anything it was a high volume R&D project. The OS vendor probably got a couple of good bug fixes out of us.
    - Re: (Score:2)
      
      by houstonbofh ( 602064 ) writes:
      
      Most places I know do not have identical hardware for testing. They have retired production hardware for testing, so it is older stuff, with older drivers.
- Re: (Score:2)
  
  by Yunzil ( 181064 ) writes:
  
  Uh, 4.0 is a stable build, chief.
  - Re: (Score:2)
    
    by sound+vision ( 884283 ) writes:
    
    And which stable distros are using 4.0? My Debian stable box that gets updated at least weekly is on 3.2.
  - - Re: (Score:2)
      
      by Yunzil ( 181064 ) writes:
      
      kernel.org:
      stable: 4.0.4 2015-05-17
- Re: (Score:2)
  
  by KingMotley ( 944240 ) writes:
  
  I'll wait for 4.1, and then I'll wait for 4.1.2 just to be safe.
- - Re: (Score:2)
    
    by F.Ultra ( 1673484 ) writes:
    
    No it wasn't. The patch that caused this problem was a fix to another problem that where introduced in 3.14-rc1.
From the Article... (Score:2)

by Rashkae ( 59673 ) writes:

It also looks like if dropping the discard mount option you will also avoid being hit by this serious issue.
There's very little good reason to use 'discard' on Linux, and many reasons not to. (This isn't the first data corruption problem, and there are several performance issues as well.) Fstrim in a con job is the way to go.
- Re: (Score:2)
  
  by Rashkae ( 59673 ) writes:
  
  Having said that, considering the nature of this bug, I wouldn't be surprised is using fstrim would also trigger this particular bug.
  - - Re: (Score:2)
      
      by unrtst ( 777550 ) writes:
      
      Good thing all those 'eyes on the source code' caught this little nasty before it went out the door.
      Name one distro where this is out the door.
      For example, the latest ubuntu, released just a month ago, is using 3.19.
      - Re: (Score:2)
        
        by marsu_k ( 701360 ) writes:
        
        4.0.2-1-ARCH #1 SMP PREEMPT Thu May 7 06:47:54 CEST 2015 x86_64 GNU/Linux
        
        Arch, at least.
        
        Re: (Score:2)
        
        by unrtst ( 777550 ) writes:
        
        Ok, I stand corrected.
        Still, not many (I don't count gentoo, as that's just whatever you compile; and unstable (ex. debian unstable) shouldn't count either).
        Running down the list in distrowatch:
        mint: 3.13
        ubuntu: 3.19.0
        debian (stable/testing): 3.16.7
        mageia: 3.19.8
        fedora: 3.17.4
        opensuse: 3.16.6
        arch: 4.0.4
        centos: 3.10
        pclinuxos 2014.12: 3.18.1
        slackware: 3.18.11
        freebsd: ... it's freebsd, not linux ...
        So, out of the top 11 (I don't know why freebsd is even on there), arch is the only one whose current release is
        
        Re: (Score:2)
        
        by marsu_k ( 701360 ) writes:
        
        If that is your criteria, then perhaps Arch shouldn't count either - sure, it is not a source-based distro (the only package I compile frequently is Firefox, and that is due to it being the KDE-friendly fork, kudos to OpenSUSE for that), but still very much bleeding edge. Remarkably stable at that, but comparable to to Debian Unstable.
    - Re: (Score:2)
      
      by stooo ( 2202012 ) writes:
      
      >> I hope there are lawsuits from people who lose their valuable data from this.
      ????? is there a law against bugs in the US ?
      can you accuse source code ?
- Re: (Score:2)
  
  by marsu_k ( 701360 ) writes:
  
  There's very little good reason to use 'discard' on Linux
  Care to elaborate on that? My bible [archlinux.org] says that discard is the first choice, fstrim when that isn't applicable for whatever reason. Bear in mind that I use Linux mostly as a desktop OS, so whatever caveats there may be in server use do not affect me.
  - Re: (Score:2)
    
    by Rashkae ( 59673 ) writes:
    
    This is the first time I've found someone suggesting discard as the first choice over fstrim. The reasons to use fstrim is stated right in that article. Performance bottlenecks when there are file delete opperations. (And no real benefit to trimming on the fly vs trimming in a batch process.) However, while I usually have nothing against debaing my betters and making a spectacular fool of myself, I'm not going to go out of my way to contractict the Arch Linux documentation.
    - Re: (Score:2)
      
      by marsu_k ( 701360 ) writes:
      
      Oh, I'm not saying Arch Wiki is infallible (although, it is correct pretty much the whole time). I was just looking for rationalization to discard or not to discard. As a personal anecdote, this Zenbook has been running discard since day 1 (24GB SSD and 500GB HDD, discard on the first drive only of course) - the OS partition (the 24GB drive, ext4) is still spanking fast. Although, it has never been close to running out of space (/var is on the HDD).
Ahh there it is (Score:3)

by drinkypoo ( 153816 ) writes: <drink@hyperlogos.org> on Thursday May 21, 2015 @12:13PM (#49744281) Homepage Journal

Tunneled down into the articles, http://git.neil.brown.name/?p=... [brown.name] has the patch. I'm building a system with 4.0.4 right now so this was material to me

Share
twitter facebook
- Re: (Score:2)
  
  by TeknoHog ( 164938 ) writes:
  
  I'm building a system with 4.0.4 right now
  Where's that? For some reason I can't find that version anywhere.
  - - Re: (Score:2)
      
      by TeknoHog ( 164938 ) writes:
      
      I love replies like this, especially the whooshing sound [hoo.dy.fi] they make as they go by.
Raid kills bugs dead! (Score:5, Funny)

by TeknoHog ( 164938 ) writes: on Thursday May 21, 2015 @01:46PM (#49745027) Homepage Journal

Well, there goes that slogan [raidkillsbugs.com].

Share
twitter facebook
- Re:Which RAID are they referring to? (Score:5, Informative)
  
  by bakaorg ( 870848 ) writes: on Thursday May 21, 2015 @09:46AM (#49743069) Homepage
  
  md raid. The bug was in commit md/raid0: fix bug with chunksize not a power of 2 [kernel.org] causing, you guessed it, a bug with a chunksize not a power of two. I guess "fix" was a bit diversionary.
  
  The real problem was a macro modifying arguments that were later expected to be the unmodified version.
  
  Parent Share
  twitter facebook
  - Or just use a power of 2 chunk size? (Score:4, Insightful)
    
    by tlambert ( 566799 ) writes: on Thursday May 21, 2015 @12:16PM (#49744301)
    
    Or just use a power of 2 chunk size?
    What idiot configuration did someone have to have to trigger this bug?
    
    Parent Share
    twitter facebook
  - Re:Which RAID are they referring to? (Score:4, Informative)
    
    by MSG ( 12810 ) writes: on Thursday May 21, 2015 @12:55PM (#49744587)
    
    That fix is actually in the wrong place. The fix for that is tracked in kernel.org's bugzilla # 98501. I'm not linking directly as linking to bugzilla tends to place too high a load on those systems. It's impolite.
    Neil Brown said that he'd push the fix to Linus "shortly" at 2015-05-20 23:06:58 UTC. I still don't see the fix in Linus' tree.
    Watch for a fix titled "md/raid0: fix restore to sector variable in raid0_make_request"
    
    Parent Share
    twitter facebook
  - - Re:Which RAID are they referring to? (Score:5, Informative)
      
      by msauve ( 701917 ) writes: on Thursday May 21, 2015 @10:39AM (#49743513)
      
      No. There was a minor bug introduced at 3.14. The patch to fix that, completely different issue, went into 4.0 and caused this corruption issue.
      
      Parent Share
      twitter facebook
    - Re: (Score:2)
      
      by nightsky30 ( 3348843 ) writes:
      
      Hmmmm, let's call it RAID PI
- Re: (Score:3)
  
  by Enry ( 630 ) writes:
  
  I have 4 drives in a RAID 10, so two RAID 1 arrays of two drives each combined together in a RAID 0. I did it mostly because I can add new drive at any time and just chain them onto the RAID 0.
  - Re: (Score:2)
    
    by Qzukk ( 229616 ) writes:
    
    Linux's MD raid10 isn't the same as RAID 1+0, so I'm not entirely sure it would be affected by this.
    - Re:It's RAID 0 (Score:5, Informative)
      
      by Forever Wondering ( 2506940 ) writes: on Thursday May 21, 2015 @07:40PM (#49747473)
      
      Based on the commit fixes, it's in a function called raid0_make_request, which is only used in raid0.c
      raid 10 is in raid10.c, so it doesn't use this function.
      The bug is based on the fact that a macro "sector_div" modifies it's first argument [and returns the remainder]. I've removed the obligatory backslashes for clarity:
      # define sector_div(n, b)(
      {
      int _res;
      _res = (n) % (b);
      (n) /= (b);
      _res;
      }
      )
      This is used in some fifty files. Some just want the remainder [and they don't want the first arg changed so they do]:
      sector_t tmp = sector;
      rem = sector_div(tmp,blah);
      This is effectively what the code wanted, but the actual fix was to do a restore afterwards:
      sector_t sector = myptr->sector;
      ...
      rem = sector_div(sector,blah);
      ...
      sector = myptr->sector;
      ... // use sector [original value only please ;-)]
      The last line to restore sector with the original value was the fix.
      They should do a full code audit as their may be other places that could be a problem. I've reviewed half the files that use this macro and while they're not broken, some of the uses are fragile. I paraphrase: "sector_div considered harmful"
      What they really need are a few more variants which are pure functions that could be implemented as inlines:
      rem = sector_rem_pure(s,n)
      s2 = sector_div_pure(s1,n)
      Or, a cleaner sector_div macro:
      sector_div_both(s,n,sector_return,rem_return)
      
      Parent Share
      twitter facebook
  - Re: (Score:2)
    
    by Ksevio ( 865461 ) writes:
    
    You should switch to a RAID5! Then you'd get a little extra capacity while still being protected against 1 drive failing.
    - Re: (Score:2)
      
      by KingMotley ( 944240 ) writes:
      
      Raid 10 can survive SOME 2-drive failures (in a 4-drive raid 10), and has significantly faster write speeds than Raid 5.
      Personally, I use a combination of RAID-0 and RAID-6 (not the same array), because Raid-5 for large arrays is almost useless. I've seen too many raid-5's die when the bad drive is replaced and the added stress of the rebuild then kills a second drive. Ouch.
- In particular, NO redundancy. Reliability drops. (Score:5, Informative)
  
  by Ungrounded Lightning ( 62228 ) writes: on Thursday May 21, 2015 @02:05PM (#49745161) Journal
  
  Losing data goes with the territory if you're going to use RAID 0.
  In particular, RAID 0 combines disks with no redundancy. It's JUST about capacity and speed, striping the data across several drives on several controllers, so it comes at you faster when you read it and gets shoved out faster when you write it. RAID 0 doesn't even have a parity disk to allow you to recover from failure of one drive or loss of one sector.
  That means the failure rate is WORSE than that of an individual disk. If any of the combined disks fails, the total array fails.
  (Of course it's still worse if a software bug injects additional failures. B-b But don't assume, because "there's a RAID 0 corruption bug", that there is ANY problem with the similarly-named, but utterly distinct, higher-level RAID configurations which are directed toward reliability, rather than ONLY raw speed and capacity.)
  
  Parent Share
  twitter facebook
- - Re: (Score:3, Insightful)
    
    by kthreadd ( 1558445 ) writes:
    
    Or it could work just fine. RAID 0 is not dangerous, you may just as well loose your data even if you only use a single drive. Hard drives and SSDs don't go bad that often that it's a problem.
    - Re: (Score:2)
      
      by sound+vision ( 884283 ) writes:
      
      It's not quite "as well", with RAID 0 if you lose either of two drives, the data is gone. That effectively doubles the chance of failure.
      If you only have the budget to play with 2 drives, you should be using one drive normally and the second drive as an external backup. Not RAID 1, that leaves you open to software/file system errors and the like. Having the backup as a separate drive that's not plugged in except when running backups negates a lot of those failure modes.
- - - Re: (Score:2)
      
      by Swave An deBwoner ( 907414 ) writes:
      
      The standard mantra you are chanting is correct. But given the human propensity for failing to do something that they planned to do (regular fine-grained backups), a redundant array (pretty much anything except RAID 0) can mean the difference between losing some valuable data or development work and not losing it if a disk fails.
- Re: (Score:2)
  
  by Penguinisto ( 415985 ) writes:
  
  That's a good rule of thumb for Windows and Linux. Not sure about Apple :)
  On the Apple side, the rule seems to be an iOS-only thing (and even then only recently... thanks iOS 8!)
  On the OSX side? 10.0 sucked pretty hard, and (IIRC) 10.2 had some problems, but it's been rather rock-stable since then (at least from my POV - I've used OSX from the ill-fated 10.0 all the way up to Yosemite, but YMMV).
- - Re: (Score:2)
    
    by houstonbofh ( 602064 ) writes:
    
    And 2.6 was the worst since both numbers were even!
- - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
  - - Re: (Score:2)
      
      by lister king of smeg ( 2481612 ) writes:
      
      best response to troll feeding ever
- - - Re: (Score:2)
      
      by tshawkins ( 1239974 ) writes:
      
      You need to feed him MOAR BRAINZZ

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Linux is clearly unstable! (Score:5, Funny)

The disks usually don't ... (Score:2, Offtopic)

Re: (Score:2)

Bob Wilson knows they're real! (Score:2)

stable (Score:5, Funny)

Re:stable (Score:5, Funny)

Re: (Score:3)

Re: (Score:2)

Re:stable (Score:5, Informative)

Re:stable (Score:4, Insightful)

Re: (Score:2, Insightful)

Re: (Score:3)

Re: stable (Score:4, Insightful)

Re: (Score:2)

Warning: RAID 0 (Score:3, Interesting)

Re: (Score:3, Insightful)

Re:Warning: RAID 0 (Score:5, Insightful)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Why ext4 (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Two issues in play? (Score:2)

New version ... (Score:5, Insightful)

Re: (Score:2, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

From the Article... (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Ahh there it is (Score:3)

Re: (Score:2)

Re: (Score:2)