Your Hard Drive Lies to You

fenderdb writes "Brad Fitzpatrick of LiveJournal fame has written a utility and a quick article on how hard drives, from the consumer level up to the highest 'enterprise'-grade SCSI and SATA drives, do not obey the fsync() function. Manufacturers are blatantly sacrificing integrity in favor of scoring higher on 'pure speed' performance benchmarking."
  • Re:What's this? (Score:3, Insightful)

    by maxwell demon ( 590494 ) on Friday May 13, 2005 @03:33AM (#12517279) Journal
    Not to mention the 1.44 "Megabyte" floppy disk where "Megabyte" means 1024000 Bytes ...
  • by rice_burners_suck ( 243660 ) on Friday May 13, 2005 @03:36AM (#12517287)
    Why am I not surprised at this? First, they decide that a kilobyte = 1000 bytes, rather than the correct value of 1024. This leads the megabyte to be 1000 kilobytes, again, rather than 1024. The gig is likewise 1000 megabytes. You might think, ok, big deal, right?

    Yeah. In the days when the biggest hard drive you could get was 2 gigs, you got 147,483,648 bytes less storage than advertised, unless you read the fine print somewhere. That's only about 140 megs less than advertised. Today, when you can get 200 gig hard drives, the difference is much larger: 14,748,364,800 bytes less storage than advertised. This means that now you get almost FOURTEEN GIGABYTES less storage than advertised (the arithmetic is worked out in the sketch after this comment). That's bigger than any hard drive that existed in 1995. That is a big deal.

    I'm bringing up the size issue in a thread on fsync() because it is only one more area where hard drive manufacturers are cheating to get "better" performance numbers, instead of being honest and producing a good product. As a result, journaling filesystems and the like cannot be guaranteed to work properly.

    If the hard drive mfgs really want good performance numbers, this is what they should do: Hard drives already have a small amount of memory (cache) in the drive electronics. Unfortunately, when the power goes away, the data therein becomes incoherent within nanoseconds. So, embed a flash chip on the hard drive electronics, along with a small rechargeable battery. If the battery is dead or the flash is fscked up, both of which can easily be tested today, the hard drive obeys every fsync() more religiously than the pope and works slightly more slowly. If the battery is alive and the flash works, then in the event of a power-off with data remaining in the cache (now backed by the battery), that data gets written to the flash chip. Upon the next powerup, the hard drive initializes as normal, but before it accepts any incoming read or write commands, it first records the information from flash to the platter. This is a good enough guarantee that data will not be lost, as the reliability of flash memory exceeds that of the magnetic platter, provided the flash is not written too many times, which it won't be under this kind of design; and as I said, nothing will be written to flash if the flash doesn't work anymore.
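
A quick sketch of the decimal-versus-binary arithmetic from the comment above, using C for concreteness; the 200 GB figure is just the example size already quoted:

```c
/* Decimal vs. binary "gigabytes" for an advertised 200 GB drive.
 * Just the arithmetic from the comment above, nothing drive-specific. */
#include <stdio.h>

int main(void)
{
    unsigned long long advertised = 200ULL * 1000 * 1000 * 1000;  /* 200 GB (decimal) */
    unsigned long long expected   = 200ULL * 1024 * 1024 * 1024;  /* 200 GiB (binary) */
    unsigned long long shortfall  = expected - advertised;

    printf("advertised: %llu bytes\n", advertised);
    printf("expected:   %llu bytes\n", expected);
    printf("shortfall:  %llu bytes (~%.1f binary gigabytes)\n",
           shortfall, shortfall / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```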

  • Not just fsync() (Score:1, Insightful)

    by Anonymous Coward on Friday May 13, 2005 @03:40AM (#12517310)
    Lots of stuff relies on knowing when blocks hit the disk. Think about it... knowing that something is on the disk means you can make assertions about write ordering. What relies on ordering? Databases and filesystems (e.g. BSD soft updates), for starters. If the disk lies to the OS about when data is written, bad stuff will happen sooner or later.
  • by Yokaze ( 70883 ) on Friday May 13, 2005 @03:47AM (#12517333)
    No. If you had no cache, there would be no need for a flush command. The flush command exists purely to flush the buffers and caches on the hard disk. ATA-5 specifies the command as E7h (and as mandatory).

    The command is specified in practically all storage interfaces for exactly the reason the author cited: integrity. Otherwise, you can't assure integrity without sacrificing a lot of performance.
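
For what it's worth, here is a hedged sketch of issuing that E7h FLUSH CACHE command from Linux user space via the legacy HDIO_DRIVE_CMD ioctl, roughly what hdparm's flush options do. The device path is only an example, and whether a particular controller/driver passes the command through to the drive is an assumption:

```c
/* Sketch: ask an ATA drive to flush its on-board write cache by sending
 * the FLUSH CACHE command (E7h, per ATA-5) through Linux's HDIO_DRIVE_CMD
 * ioctl. Requires root; /dev/hda is only an example device node, and the
 * driver actually passing the command through is an assumption. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(void)
{
    int fd = open("/dev/hda", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open /dev/hda"); return 1; }

    /* args[0] is the ATA command opcode; the remaining taskfile bytes
     * are all zero for FLUSH CACHE */
    unsigned char args[4] = { 0xE7, 0, 0, 0 };

    if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
        perror("HDIO_DRIVE_CMD (FLUSH CACHE)");
        close(fd);
        return 1;
    }

    printf("drive acknowledged FLUSH CACHE\n");
    close(fd);
    return 0;
}
```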
  • by Erik Hensema ( 12898 ) on Friday May 13, 2005 @03:50AM (#12517345) Homepage

    We need it because of journalling filesystems. A JFS needs to be sure the journal has been flushed out to disk (and resides safely on the platters) before continuing to write the actual (meta)data. Afterwards, it needs to be sure the (meta)data is written properly to disk in order to start writing the journal again.

    When both the journal and the data are in the write cache of the drive, the data on the platters is in an undefined state. Loss of power means filesystem corruption -- just the thing a JFS is supposed to avoid.

    Also, switching off the machine the regular way is a hazard. As an OS you simply don't know when you can safely signal the PSU to switch itself off.
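
To make that ordering requirement concrete, here is a minimal sketch of the write barrier a journalling filesystem needs; it is not any real filesystem's code, "journal.log" and "data.db" are made-up names, and it assumes fsync() really reaches stable storage, which is exactly the assumption the article says the drive can break:

```c
/* Minimal sketch of journal-before-data ordering. The scheme only works
 * if fsync() really pushes data to stable storage; a drive that
 * acknowledges writes from its volatile cache silently breaks step 1. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* hypothetical helper: write a buffer, then wait for it to be "on disk" */
static int write_and_sync(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    return fsync(fd);  /* flushes the OS cache; the drive cache is another story */
}

int main(void)
{
    int jfd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int dfd = open("data.db",     O_WRONLY | O_CREAT, 0644);
    if (jfd < 0 || dfd < 0) { perror("open"); return 1; }

    const char journal_rec[] = "intend to update block 42\n";
    const char data_rec[]    = "new contents of block 42\n";

    /* 1. The journal entry must be stable BEFORE the data is touched... */
    if (write_and_sync(jfd, journal_rec, sizeof journal_rec - 1) != 0) {
        perror("journal write");
        return 1;
    }

    /* 2. ...only then is it safe to write the actual (meta)data. */
    if (write_and_sync(dfd, data_rec, sizeof data_rec - 1) != 0) {
        perror("data write");
        return 1;
    }

    close(jfd);
    close(dfd);
    return 0;
}
```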

  • by Sparr0 ( 451780 ) <sparr0@gmail.com> on Friday May 13, 2005 @03:54AM (#12517359) Homepage Journal
    You have no grasp of what 'kilo', 'mega', and 'giga' mean. They have meant the same thing for 45 years, computers did not change that. There is a standard for binary powers, you simply refuse to use it.
  • by daikokatana ( 845609 ) on Friday May 13, 2005 @04:04AM (#12517388)
    Ok, fair enough. Now step into any of the 99% of all computer shops out there and ask for a hard drive, 160 gibibyte in size.

    If they don't laugh until you exit the store, I'll pay for your disk. Please make sure you record the event and share it on the net.

  • Re:What's this? (Score:3, Insightful)

    by KiloByte ( 825081 ) on Friday May 13, 2005 @04:07AM (#12517400)
    No, the gibi crap is a new invention, going against established practice. And, it sounds awful.
  • by Basehart ( 633304 ) on Friday May 13, 2005 @04:44AM (#12517510)
    "I remember that MS had a fix for this (for laptops etc)... Which just made Windows wait a duration (~30s)..."

    This turned into the "my computer isn't doing what I want it to do, which is turn the F off" situation, at which point the consumer simply reached down and yanked the power cord.

    Try writing a routine for this routine!
  • by Alioth ( 221270 ) <no@spam> on Friday May 13, 2005 @05:05AM (#12517577) Journal
    Ah, so now we know your 3GB of space and 100GB of transfer advertised in your sig aren't binary gigabytes, but decimal, just like the hard drive manufacturers :-)
  • by bgog ( 564818 ) * on Friday May 13, 2005 @05:08AM (#12517584) Journal
    The author is specifically talking about the fsync() function, not the ATA sync command. fsync() is an OS call telling the system to flush its write caches to the physical device. This writes to the disk's write cache, but I don't believe it actually issues the sync command to the drive.

    In the case of a journaling file system, it issues the sync command to the drive to flush the data out.

    I work on a block-level transactional system that requires blocks to be synced to the platters. There were two options: modify the kernel to issue syncs to the ATA drive on all writes (to the disk in question), or just disable the physical write cache on the drive. It turned out to be a touch faster to just disable the cache, but the two are effectively equal.

    However, drives operate fine under normal conditions: applications write to file systems, which take care of forcing the disks to sync. fsync() (which the author is talking about) is an OS call and not directly related to the disk sync command.
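
The "just disable the physical write cache" option the parent mentions is normally done with hdparm -W0. As a hedged sketch, this is roughly what that amounts to at the ioctl level: a SET FEATURES command (EFh) with the "disable write cache" feature value (82h). The device path and whether the driver passes the command through are assumptions:

```c
/* Sketch: disable an ATA drive's volatile write cache, roughly what
 * `hdparm -W0 /dev/hda` does: SET FEATURES (EFh) with the "disable write
 * cache" feature value (82h), sent via HDIO_DRIVE_CMD. Device path and
 * driver pass-through are assumptions; requires root. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(void)
{
    int fd = open("/dev/hda", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open /dev/hda"); return 1; }

    /* args[0] = command opcode (SET FEATURES), args[2] = feature register:
     * 0x82 disables the write cache, 0x02 would re-enable it */
    unsigned char args[4] = { 0xEF, 0, 0x82, 0 };

    if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
        perror("HDIO_DRIVE_CMD (SET FEATURES: disable write cache)");
        close(fd);
        return 1;
    }

    printf("drive write cache disabled\n");
    close(fd);
    return 0;
}
```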
  • by pe1chl ( 90186 ) on Friday May 13, 2005 @05:25AM (#12517631)
    SCSI drives usually make use of the latest technology. ATA uses whatever older technology has been cost-engineered to a suitable price-point

    SCSI drives usually are a couple of years behind in drive capacity relative to ATA drives. This seems to contradict the above.
  • by Dogtanian ( 588974 ) on Friday May 13, 2005 @05:50AM (#12517740) Homepage
    Maybe using kilo to mean 1024x is wrong.

    The fact of it is that *anyone* who knew enough about computers for it to matter would have known and agreed on this standard anyway, right or wrong.

    They came along and messed up a standard that everyone had agreed upon and was happy with. Don't even *think* of saying that using decimal kilobytes et al had any purpose other than making drives seem bigger than they were; that trick only worked because everyone had previously agreed that a kilobyte was 1024 bytes.

    If the industry was *so* damn keen to get the 'correct' meaning of the words, they wouldn't still be using the 'incorrect' versions when selling memory.

    Simple fact; anyone who wants to be pedantic about it can correctly argue that the 1024 definition of kilobyte is wrong. What they can't do is give any proper justification for changing a definition that everyone knew and understood to mean 1024 bytes.

    Marketing bullshit, pure and simple; in fact, I propose the phrase "marketing gigabyte", just to make it absolutely clear which definition is in use...
  • by Dogtanian ( 588974 ) on Friday May 13, 2005 @06:05AM (#12517818) Homepage
    Actually, it is. The standard was updated in 1998 to avoid confusion. Having different names for different things can avoid an awful lot of confusion, so I would very much recommend using them.

    Which is more important? The de facto standard that slightly misuses the 'kilo-' prefix, but *everyone* knows what it means; or something that was forced into place by marketing?

    As I argued in more depth elsewhere [slashdot.org], anyone who used computers *knew* what "kilobyte" and friends meant.

    There was no confusion, because only the 1024-byte definition was widely used.

    The 'need' to use the '1000 byte' definition was created by marketing, not computer people. THEY caused the confusion for their (short term) gain by exploiting the old meaning of 'kilobyte' to make their drives seem larger.

    Marketing do not give a flying **** about correctness or clarity; if there was any problem, *they* created it. Computer people knew what kilobyte meant.
  • by Crayon Kid ( 700279 ) on Friday May 13, 2005 @06:35AM (#12517940)

    Marketing do not give a flying **** about correctness or clarity; if there was any problem, *they* created it. Computer people knew what kilobyte meant.

    I'm sure they took advantage of the blurry meanings for a while. But in the long run, you gotta admit the change makes sense from a standardisation point of view. Every other measuring unit uses kilo/mega/giga to mean powers of ten. The computer world was the odd one out, and it should rightly be labelled as such.

  • by swmccracken ( 106576 ) on Friday May 13, 2005 @06:50AM (#12517983) Homepage
    This writes to the disks write cache but I don't believe it actually issues the sync command to the drive.

    Yeah - that's the point of this thing - what's supposed to happen with fsync? From memory, sometimes it will guarantee the data is all the way to the platters and sometimes it will not, depending on what storage system you're using and how easy such a guarantee is to make.

    Linus in 2001 [iu.edu] discussing this issue - it's not new. That whole thread was about comparing SCSI against IDE drives, and it seemed that the IDE drives were either breaking the laws of physics, or lying, but the SCSI drives were being honest.

    From hazy memory, one problem is that without tagged command queuing (TCQ) or native command queuing (NCQ), one process issuing a sync will cause the hard drive and related software to wait until all I/O "in flight" has fully synced, holding up any other I/O tasks for other processes!

    That's why fsync often lies: it's not practical to let programs that call fsync all the time to flush their buffers screw around with the whole I/O subsystem, and apparently some programs were overzealous about calling fsync when they shouldn't.

    However, with TCQ, commands that are synched overlap with other commands, so it's not that big a deal (other i/o tasks are not impacted any more than they would by other, unsynchronised, i/o). (Thus, with TCQ, fsync might go all the way to the platters, but without it it might just go to the IDE bus.) SCSI has had TCQ from day one, which is why a SCSI system is more likely to sync all the way than IDE.

    If I'm wrong, somebody correct me please.

    Brad's program certainly points out an issue - it should be possible for a database engine to write to disk and guarantee that it gets written; perhaps fsync() isn't good enough - whether the fault is in the drives, the IDE spec, the IDE drivers or the OS.
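
Brad's test can be approximated from user space in a few lines: time a burst of fsync()ed writes to the same spot and compare against what the platters can physically do (a 7200 RPM disk can't truly commit much more than about 120 in-place writes per second). This is only a rough sketch of the idea, not his actual utility; the file name and write count are arbitrary:

```c
/* Rough sketch of the fsync-lie test: if a disk "completes" far more
 * fsync()ed writes per second than its rotation rate allows (~120/s at
 * 7200 RPM), something in the chain is caching instead of committing. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "fsync_test.dat";
    const int   n    = 1000;              /* number of synchronous writes */
    char        block[512];
    memset(block, 0, sizeof block);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval start, end;
    gettimeofday(&start, NULL);

    for (int i = 0; i < n; i++) {
        /* rewrite the same 512-byte block and ask for it to be flushed */
        if (pwrite(fd, block, sizeof block, 0) != (ssize_t)sizeof block ||
            fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
    }

    gettimeofday(&end, NULL);
    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1e6;

    printf("%d fsync()ed writes in %.2f s = %.0f writes/s\n", n, secs, n / secs);
    printf("(a 7200 RPM drive can honestly do roughly 120/s to one sector)\n");
    close(fd);
    return 0;
}
```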
  • by hyfe ( 641811 ) on Friday May 13, 2005 @07:23AM (#12518110)
    You have no grasp of what 'kilo', 'mega', and 'giga' mean. They have meant the same thing for 45 years, computers did not change that. There is a standard for binary powers, you simply refuse to use it.

    Being able to keep two thoughts in your head simultaneously is a nice skill.

    Sure, the scientific meanings of kilo, mega and giga never changed, but kilo, mega and giga in computer science started out as the binary values. They are still in use: when reporting free space left on your hard drive, both Windows and Linux use binary thousands. Saying this is a clear-cut case is just ignoring reality, as using 1024 really does simplify a lot of the math.

    Secondly, if the manufacturers had actually come out and said 'we have decided to adhere to scientific standards and use regular 1000s' and clearly marked their products as such, we wouldn't have any problems now. The problem is, they didn't. They just silently changed it, causing shitloads of confusion along the way. Of all the alternatives in this mess, they chose the one that could ruin an engineer's day, only for the purpose of having your drive look a few % larger.

    Some fool let the marketers in on the engineering meetings and we all lived to rue that day.

  • by quantum bit ( 225091 ) on Friday May 13, 2005 @07:23AM (#12518113) Journal
    Every measuring unit uses kilo/mega/giga to mean powers of ten. Computer world was the odd one out, and it should rightly be labeled specifically.

    Oh, the computer world uses those prefixes to mean powers of 10 too. They just mean powers of 10 in base 2 math :)
  • by enosys ( 705759 ) on Friday May 13, 2005 @07:48AM (#12518205) Homepage
    IMHO having the drive hide bad sectors is a good idea. That way you don't have to enter any bad sector lists, you don't have to scan for them when formatting, and the OS doesn't have to worry about them.

    What would you do if you had full control over bad sectors? You're still able to keep trying to read a new bad sector that contains data. The drive will try to repair it when you write to it and if it can't then it will remap it. It seems to me the only thing you can't do is force the drive to try to repair bad sectors that it gave up on earlier.

    Also, consider how hard it would be to make a perfect hard drive. Would you be willing to pay for that? Bad sectors that were there all along don't even hurt reliability. It's only a problem when new ones go bad.

  • Re:What's this? (Score:3, Insightful)

    by Shinobi ( 19308 ) on Friday May 13, 2005 @07:48AM (#12518206)
    Ever since they started using the Giga prefix. Giga is explicitly defined as 10^9 in base 10; the metric prefixes have been standardized since 1873, and giga itself was added to the SI prefixes in 1960.

    Ergo, 1 GigaByte=1 000 000 000 Bytes.

    Anything else is a result of comp sci people fucking up their standards compliance.
  • by Anonymous Coward on Friday May 13, 2005 @08:50AM (#12518565)
    SCSI drives usually are a couple of years behind in drive capacity relative to ATA drives.

    Not really, but with 300GB drives costing $700 for the "low end" and 500GB drives in the thousands of dollars range, you're not going to see them in your local computer shop.
  • by Hammer ( 14284 ) on Friday May 13, 2005 @08:57AM (#12518621) Journal
    Seems you don't get it. fsync() flushes to the device, not to the physical media! The specs clearly say that all the data should be sent to the storage device; they do not say that the storage device should flush its internal cache too! Do you see the difference?

    I think you missed the point here buddy... In the case of Linux, after sending the data, the driver explicitly issues a hardware command to tell the device to write to media and STFU until done!
    Do you see the difference?
  • Exactly - the author of this "test" made a bad assumption: fsync() (or rather the windows equivalent) means it's on the disk. Understandable, and once upon a time it was true in Unix. fsync() doesn't (that I know of) issue ATA sync commands, though.

    I used to beta-test SCSI drives, and write SCSI and IDE drivers (for the Amiga). Write-caching is (except for very specific applications) mandatory for speed reasons.

    If you want some performance and total write-safety, tagged queuing (SCSI or ATA) could provide that (with write caching turned off). You'll still give up some performance, since a single-threaded write application/FS will wait for data to be on disk before continuing. If the FS/app writes (say) 3 chunks of data that fill a track, with write caching off and tagged queuing it's probably a minimum of 3 rotations (probably more like 4.5 or more) to write the data. With write caching, it's a minimum of 1, more like an average of 1.5 rotations. With a LOT of pain, you could break the single-threadedness of this in some cases by not waiting for tagged write completions before reporting success, while marking the VM pages as copy-on-write or some equivalent so the app won't overwrite the data that you're still writing (or, you could only return success to the app/FS when the data has been sent to the drive, but before the drive reports success). This (in a way) moves the write cache into the disk driver and thus gives you control over it. Perf will still be lower than letting the drive do it, perhaps a lot lower in some cases.

    If you want _real_ performance and safety, turn on write caching, and when you hit a "safety checkpoint", tell the drive to flush the write cache to disk. I don't currently believe that ATA or SCSI drives generally ignore that command - please provide links if you know differently. It's not a benchmarking advantage to subvert that unless the OS/app is using it - but maybe OSes are turning fsync()/etc. into ATA/SCSI sync commands, and the drive makers are lying.
  • by Viceice ( 462967 ) on Friday May 13, 2005 @10:00AM (#12519230)
    It's called Not Keeping Info from the User(tm).

    All that needs to be done is, instead of simply displaying "Windows is Shutting Down...", display what's going on, like "Flushing Disc Buffers..." then "Awaiting Disc OK".

    And people won't assume the PC has hung and yank the cord (and if they did, they took an informed gamble and deserve the consequences).
  • Re:What's this? (Score:1, Insightful)

    by Anonymous Coward on Friday May 13, 2005 @10:10AM (#12519325)
    Yes, but that number has NO importance. No one has ever needed to refer to 8*(10^9) bits of data, and no one ever will. We could either say I have 512MB of RAM, or that I have 536.870912MB of RAM.

    Knowing the sort of people who argue for mebi, I'd imagine you'll next suggest that's the RAM manufacturers problem, and they should stop making address spaces powers of two...
  • by barawn ( 25691 ) on Friday May 13, 2005 @11:26AM (#12520129) Homepage
    As I argued in more depth elsewhere, anyone who used computers *knew* what "kilobyte" and friends meant.

    Except Ethernet card manufacturers, modem manufacturers, PCI card manufacturers... oh, hell, just about anyone who transfers something with a clock.

    10baseT ethernet transfers data at 10 Mbps. That means 10 x 10^6 bits per second. IDE buses running at 66 MHz list their theoretical maximum as 66 MB/s.

    kilo = 1024 is retarded. It only makes sense for things that have to scale in powers of two, like memory. For a long while, "data rate" meant "kilo=1000, mega=1000 kilo", whereas in storage, "kilo=1024". Talk about a recipe for disaster.

    Just as an example: here's an article describing Ultra320 SCSI, and PCI bus bandwidth:

    Under standard PCI the host bus has a maximum speed of 66 MHz. This allows for a maximum transfer rate of 533 MB/sec across a 64-bit PCI bus.


    66 2/3 MHz (M here means what? oh, right, 10^6) times 8 bytes is 533 1/3 MB/s. Where here, "M" means "1000*1000". In MiB/s, it'd be 508.6263 MiB/s.

    Is this a problem? Yes. I shouldn't have to pull out a freaking calculator to figure out how long it should take to dump 2 GB of RAM across a 2 GB/s link. It should be one second, not 1.0737418 seconds.

    Computer people knew what kilobyte meant.

    No we didn't. We've never used kilo consistently. See above - we've talked about CPU speeds in terms of kHz and MHz, meaning 10^3, 10^6, and talked about kilobits/second meaning 10^3 bits per second, talked about kilobytes/second meaning 10^3 bytes/second, and turned around and talked about file sizes where kilobyte means 1024 bytes.

    We've never been consistent. The IEC finally owned up to it and admitted it, and asked us to all finally stop being so damned sloppy, and I'm quite glad they did.
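
For the curious, here is the arithmetic from this comment worked out explicitly, using nothing beyond the figures already quoted:

```c
/* The unit mismatches quoted above, worked out explicitly. */
#include <stdio.h>

int main(void)
{
    /* 64-bit PCI at 66 2/3 MHz moves 8 bytes per clock */
    double pci_bytes_per_sec = (200e6 / 3.0) * 8.0;
    printf("PCI: %.1f decimal MB/s = %.4f MiB/s\n",
           pci_bytes_per_sec / 1e6,
           pci_bytes_per_sec / (1024.0 * 1024.0));

    /* 2 GiB of RAM pushed over a "2 GB/s" (decimal) link */
    double dump_secs = (2.0 * 1024 * 1024 * 1024) / 2e9;
    printf("2 GiB over a 2 GB/s link: %.7f s, not 1 s\n", dump_secs);
    return 0;
}
```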
  • by dgatwood ( 11270 ) on Friday May 13, 2005 @12:53PM (#12521154) Homepage Journal
    I'm sure they took advantage of the blurry meanings for a while. But in the long run, you gotta admit the change makes sense, from a standardisation point of view.

    No, I don't admit it. Volume and distance measures are standardized to base 10 because they have no inherent natural unit. Computers have a natural unit---powers of two. In much the same way, we don't standardize time to base 10. Can you imagine if we decided we wanted to have 100 days in a year? It wouldn't work well because Earth doesn't go around the sun every 100 days. It goes around the sun every 365.25 days.

    For the same reason the base-10 standardization of time was rejected, the base-10 bastardization of computing units should also be rejected. A megabyte (2^20) is a natural unit that expresses both the underlying addressing of the computer and the fundamental organization of RAM that corresponds to that addressing system. A megabyte (10^6) represents an arbitrary grouping that (at least with modern design standards) CANNOT ACTUALLY EXIST IN HARDWARE.

    So how does the SI "standard" make sense again?
