Your Hard Drive Lies to You

Posted by CowboyNeal on Friday May 13, 2005 @03:18AM from the say-it-ain't-so dept.

fenderdb writes "Brad Fitzpatrick of LiveJournal fame has written a utility and a quick article on how all hard drives, from consumer-level models to the highest-level 'enterprise'-grade SCSI and SATA drives, do not obey the fsync() function. Manufacturers are blatantly sacrificing integrity in favor of scoring higher on 'pure speed' performance benchmarking."

This discussion has been archived. No new comments can be posted.

Hardly a new thing... (Score:4, Funny)

by |>>? ( 157144 ) writes: on Friday May 13, 2005 @03:19AM (#12517212) Homepage

Since when do computers do what you mean?

- Re:Hardly a new thing... (Score:5, Funny)
  
  by Clay Pigeon -TPF-VS- ( 624050 ) writes: on Friday May 13, 2005 @04:17AM (#12517439) Journal
  
  You must be new here. Computers always do what you tell them to do in the command line. What, you're using a gui? Well that's your fault then.
  
  - Re:Hardly a new thing... (Score:5, Funny)
    
    by pyrrhonist ( 701154 ) writes: on Friday May 13, 2005 @04:28AM (#12517467)
    
    Computers always do what you tell them to do in the command line.
    They sure do.
    
    $ rm -rf * .o
    $ ls -a
    . ..
    $
    
    FUCK!!!!!
    
    - Re:Hardly a new thing... (Score:3, Informative)
      
      by indifferent children ( 842621 ) writes:
      
      Windows is WYSIWYG; Linux is YAFIYGI (You asked for it, you got it).
      This is an old quote, but not everyone has seen it. This is much like Neal Stephenson comparing Linux to the Hole Hawg drill in "In the Beginning Was the Command Line". Great read!
    - - Re:Hardly a new thing... (Score:3, Informative)
        
        by Afrosheen ( 42464 ) writes:
        
        You must be new here. That's the whole point of meta-moderation. If you do it, it improves your chances of moderating in the future, because it basically reviews the moderating quality of previous moderation from other people.
        
        Re:Hardly a new thing... (Score:5, Funny)
        
        by Fulcrum of Evil ( 560260 ) writes: on Friday May 13, 2005 @10:43AM (#12519654)
        
        Is there a difference? ;-)
        
        I don't know about you, but when I mod slashdot, I'm almost always drunk or stoned. Really, it's the only way to fit in.
        
        
        Lucky for me, (Score:3, Informative)
        
        by EvilStein ( 414640 ) writes:
        
        I'm both drunk *and* stoned.
        
        Should be a lot of moddin' fun today, lemme tell ya..
- - Re:Hardly a new thing... (Score:5, Funny)
    
    by pyropunk51 ( 819247 ) writes: on Friday May 13, 2005 @05:32AM (#12517658) Homepage
    
    I really hate this damned machine!
    I wish that I could sell it.
    It never does quite what I want,
    But only what I tell it!
    
Err... "lying" is the default setting. RTFM. (Score:3, Informative)

by Tetard ( 202140 ) writes: on Friday May 13, 2005 @03:24AM (#12517243)

Write Cache enable is default on most IDE/ATA
drives. Most SCSI drives don't enable it.
If you don't like it, turn it off. There's
no "lying", and I'm sure the fsync() function
doesn't know diddly squat about the cache of
your disk. Maybe the ATA/device abstraction layer does, and I'm sure there's a configurable registry/sysctl/frob you can twiddle to make it DTRT (like FreeBSD has).

Move along, nothing to see...

- Re:Err... "lying" is the default setting. RTFM. (Score:5, Informative)
  
  by ewhac ( 5844 ) writes: on Friday May 13, 2005 @03:35AM (#12517281) Homepage Journal
  
  Yes, except there is a 'sync' command packet that is supposed to make the drive commit outstanding buffers to the platters, and not signal completion until those writes are done. It would appear, at first blush, that the drives are mis-handling this command when write-caching is enabled.
  There is historical precedent for this. There were recorded incidents of drives corrupting themselves when the OS, during shutdown, tried to flush buffers to the disk just before killing power. The drive said, "I'm done," when it really wasn't, and the OS said Okay, and killed power. This was relatively common on systems with older, slower disks that had been retrofitted with faster CPUs.
  
  However, once these incidents started occurring, the issue was supposed to have been fixed. Clearly, closer study is needed here to discover what's really going on.
  
  Schwab
  
  - Re:Err... "lying" is the default setting. RTFM. (Score:4, Informative)
    
    by frinkazoid ( 880013 ) writes: on Friday May 13, 2005 @07:25AM (#12518122)
    
    This is true. Install a fresh Windows 98 SE on a fairly new PC and run Windows Update, and there is an update with this description:
    
    The Windows IDE Hard Drive Cache Package provides a workaround to a recently identified issue with computers that have the combination of Integrated Drive Electronics (IDE) hard disk drives with large caches and newer/faster processors. Computers with this combination may risk losing data if the hard disk shuts down before it can preserve the data in its cache.
    
    This update introduces a slight delay in the shutdown process. The delay of two seconds allows the hard drive's onboard cache to write any data to the hard drive.
    
    I found it nice to see how M$ worked around it: just waiting 2 seconds, how ingenious!
    Link to the M$ update site: http://www.microsoft.com/windows98/downloads/contents/WUCritical/q273017/Default.asp [microsoft.com]
    
    - Re:Err... "lying" is the default setting. RTFM. (Score:3, Interesting)
      
      by c_oflynn ( 649487 ) writes:
      
      >I found it nice to see how M$ worked around it,
      >just waiting 2 seconds, how ingenious !
      
      What would you have done? Verifying all data would probably take longer than 2 seconds, and you can't trust the disk to tell you when it's written the data.
      
      So you'd either have to figure out all the data that was in the cache, verify that against the disk surface, and only cut power when all that is done, or wait a bit. Making some assumptions about buffer size and transfer speed, then adding a safety factor, is p
  - Re:Err... "lying" is the default setting. RTFM. (Score:4, Informative)
    
    by jonwil ( 467024 ) writes: on Friday May 13, 2005 @07:51AM (#12518227)
    
    The right answer is for the drive not to respond to the "Sync" command with "Done" until it really is done (however long it takes) and for the OS to not continue until it sees the "Done" response from the drive.
    
  - - - Re:Err... "lying" is the default setting. RTFM. (Score:5, Insightful)
        
        by Basehart ( 633304 ) writes: on Friday May 13, 2005 @04:44AM (#12517510)
        
        "I remember that MS had a fix for this (for laptops etc)... Which just made Windows wait a duration (~30s)..."
        
        This turned into the "my computer isn't doing what I want it to do, which is turn the F off" at which point the consumer simply reached down and yanked the power cord.
        
        Try writing a routine for this routine!
        
        
        Re:Err... "lying" is the default setting. RTFM. (Score:3, Insightful)
        
        by Viceice ( 462967 ) writes:
        
        It's called Not Keeping Info from the User(tm).
        
        All that needs to be done is instead of simply displaying "Windows is Shutting Down..." display what's going on.. Like "Flushing Disc Buffers..." then "Awaiting Disc OK "
        
        And people won't assume the PC has Hung and yank the cord (and if they did, they took an informed gamble and deserve the consequences.)
- Re:Err... "lying" is the default setting. RTFM. (Score:5, Insightful)
  
  by Yokaze ( 70883 ) writes: on Friday May 13, 2005 @03:47AM (#12517333)
  
  No. If you had no cache, there would be no need for a flush command. The flush command exists purely for the purpose of flushing buffers and caches on the hard disk. ATA-5 specifies the command as E7h (and as mandatory).
  
  The command is specified in practically all storage interfaces for exactly the reason the author cited: integrity. Otherwise, you can't assure integrity without sacrificing a lot of performance.
  
  - - - Re:Err... "lying" is the default setting. RTFM. (Score:3, Insightful)
        
        by Hammer ( 14284 ) writes:
        
        Seems you don't get it. fsync() flushes to the device, not to the physical media! The spec clearly says that all the data should be sent to the storage device; it does not say that the storage device should flush its internal cache too! Do you see the difference?
        I think you missed the point here buddy... In the case of Linux, after sending the data, the driver explicitly issues a hardware command to tell the device to write to media and STFU until done!
        Do you see the difference?
- Re:Err... "lying" is the default setting. RTFM. (Score:5, Informative)
  
  by Everleet ( 785889 ) writes: on Friday May 13, 2005 @05:13AM (#12517596)
  
  fsync() is pretty clearly documented to cause a flush of the kernel buffers, not the disk buffers. This shouldn't come as a surprise to anyone.
  From Mac OS X --
  
  DESCRIPTION
  
  Fsync() causes all modified data and attributes of fd to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.
  
  Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence. Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present while earlier writes are not. This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
  
  For applications that require tighter guarantees about the integrity of their data, MacOS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications such as databases that require a strict ordering of writes should use F_FULLFSYNC to ensure their data is written in the order they expect. Please see fcntl(2) for more detail.
  
  From Linux --
  
  NOTES In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.
  
  From FreeBSD's tuning(7) --
  
  IDE WRITE CACHING
  
  FreeBSD 4.3 flirted with turning off IDE write caching. This reduced write bandwidth to IDE disks but was considered necessary due to serious data consistency issues introduced by hard drive vendors. Basically the problem is that IDE drives lie about when a write completes. With IDE write caching turned on, IDE hard drives will not only write data to disk out of order, they will sometimes delay some of the blocks indefinitely under heavy disk load. A crash or power failure can result in serious file system corruption. So our default was changed to be safe. Unfortunately, the result was such a huge loss in performance that we caved in and changed the default back to on after the release. You should check the default on your system by observing the hw.ata.wc sysctl variable. If IDE write caching is turned off, you can turn it back on by setting the hw.ata.wc loader tunable to 1. More information on tuning the ATA driver system may be found in the ata(4) man page.
  
  There is a new experimental feature for IDE hard drives called hw.ata.tags (you also set this in the boot loader) which allows write caching to be safely turned on. This brings SCSI tagging features to IDE drives. As of this writing only IBM DPTA and DTLA drives support the feature. Warning! These drives apparently have quality control problems and I do not recommend purchasing them at this time. If you need performance, go with SCSI.
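
For readers who want to see the distinction these man pages draw, here is a minimal Python sketch. It assumes a POSIX-like system; `durable_write` is a hypothetical helper name, not something from the article or from any of the quoted documentation. It calls `os.fsync()` (host to drive) and then, only where the platform exposes `F_FULLFSYNC` (macOS, per the excerpt above), also asks the drive to flush its own cache. On other systems it cannot tell you whether the data reached the platters.

```python
import os
import tempfile

try:
    import fcntl  # POSIX only; absent on Windows
except ImportError:
    fcntl = None


def durable_write(path, data):
    """Write bytes to path, then request the strongest flush available.

    Returns "F_FULLFSYNC" if a drive-level flush was requested successfully,
    otherwise "fsync" (kernel buffers flushed; drive cache state unknown).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # host -> drive; may stop at the drive's cache
        full_sync = getattr(fcntl, "F_FULLFSYNC", None) if fcntl else None
        if full_sync is not None:
            try:
                fcntl.fcntl(fd, full_sync)    # macOS: ask the drive to flush its cache
                return "F_FULLFSYNC"
            except OSError:
                pass                      # some filesystems reject the request
        return "fsync"
    finally:
        os.close(fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "example.bin")
    print(durable_write(target, b"important record\n"))
```

The return value only reports which call was issued; as the FreeBSD excerpt warns, a drive with write caching enabled may still acknowledge before the data is on the platters.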
  
  - Re:Err... "lying" is the default setting. RTFM. (Score:3, Insightful)
    
    by jesup ( 8690 ) * writes:
    
    Exactly - the author of this "test" made a bad assumption: fsync() (or rather the windows equivalent) means it's on the disk. Understandable, and once upon a time it was true in Unix. fsync() doesn't (that I know of) issue ATA sync commands, though.
    
    I used to beta-test SCSI drives, and write SCSI and IDE drivers (for the Amiga). Write-caching is (except for very specific applications) mandatory for speed reasons.
    
    If you want some performance and total write-safety, tagged queuing (SCSI or ATA) could prov
  - - Re:Err... "lying" is the default setting. RTFM. (Score:4, Interesting)
      
      by pv2b ( 231846 ) writes: on Friday May 13, 2005 @08:21AM (#12518391)
      
      Right. And the author is implementing a program that sends raw commands to ATA drives... in Perl. Right. He does no such thing, at least not that I can see from glancing at the source code of the Perl script. Granted, I'm not fluent in Perl, but it doesn't seem to do anything other than an fsync() equivalent. Please do correct me if I'm wrong.
      
      The truth is that he doesn't know wtf he's talking about. I've decided to cut him some slack though, because the FreeBSD 4 man pages at least are very misleading, and I don't know which man pages he read.
      
      By the way, I sent him an e-mail. It's available on my web space [altunderctrl.se]. I'm not posting it in full here, because it's a little long and it would be redundant, since a lot of the surrounding posts discuss pretty much the same thing as I said.
      
What's this? (Score:5, Funny)

by binaryspiral ( 784263 ) writes: on Friday May 13, 2005 @03:24AM (#12517244)

Hard drive manufacturers screwing over customers? Why, who would have thought?

1 billion bytes equals 1 gigabyte - since when?

Dropped MTBF right after reducing the standard 3-year warranty to 1 year - good timing.

Now this?

Wow what a track record of consumer loving...

- Re: (Score:3, Informative)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re:What's this? (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  1 billion bytes equals 1 gigabyte - since when?
  
  Since 1960 [wikipedia.org]. Since 1998 [wikipedia.org], 2^30 bytes = 1 gibibyte.
  - Re:What's this? (Score:5, Funny)
    
    by pyrrhonist ( 701154 ) writes: on Friday May 13, 2005 @03:57AM (#12517366)
    
    2^30 bytes = 1 gibibyte.
    AaARaaGGgHHhh! I simply loathe the IEC binary prefix names.
    
    Kibibits sounds like dog food [kibblesnbits.com].
    
    "Kibibits, Kibibits, I'm gonna get me some Kibibits..."
    
    - - Re:What's this? (Score:3, Funny)
        
        by pyrrhonist ( 701154 ) writes:
        
        So you prefer ambiguity? I'm sorry, but "pyrrhonist doesn't like the sound of the word" is NO reason to continue using ambiguous language.
        Relax, it was supposed to be a jo....
        
        waitaminute....
        
        You're the guy who came up with these prefixes aren't you?
  - Re:What's this? (Score:3, Insightful)
    
    by KiloByte ( 825081 ) writes:
    
    No, the gibi crap is a new invention, going against established practice. And, it sounds awful.
    - - Being right doesn't stop you being a pedant (^_^) (Score:3, Insightful)
        
        by Dogtanian ( 588974 ) writes:
        
        Maybe using kilo to mean 1024x is wrong.
        
        Fact of it is that *anyone* who knew enough about computers for it to matter would have known and agreed on this standard anyway, right or wrong.
        
        They came along and messed up a standard that everyone had agreed upon and was happy with. Don't even *think* of saying that using decimal kilobytes et al had any purpose other than making drives seem bigger than they were; that trick only worked because everyone had previously agreed that a kilobyte was 1024 bytes.
        
        If t
- Damn processor industry... (Score:4, Funny)
  
  by fo0bar ( 261207 ) writes: on Friday May 13, 2005 @03:50AM (#12517348)
  
  using the wrong definitions to make their products seem bigger. I bought a P4 2.4GHz CPU the other day, and was shocked to find it wasn't 2,576,980,377.6Hz like it should be! Lying thieves...
  
  - Re:Damn processor industry... (Score:3, Informative)
    
    by spectecjr ( 31235 ) writes:
    
    using the wrong definitions to make their products seem bigger. I bought a P4 2.4GHz CPU the other day, and was shocked to find it wasn't 2,576,980,377.6Hz like it should be! Lying thieves...
    
    Sad to see this post being marked "insightful". 2.4GHz has always meant 2,400,000,000,000 cycles per second, and nothing else. No matter what speed your crystal clocks at.
    
    The original poster was being ... kind of sarcastic.
    - Re:Damn processor industry... (Score:4, Funny)
      
      by spectecjr ( 31235 ) writes: on Friday May 13, 2005 @04:07AM (#12517403) Homepage
      
      Ooops. Make that 2,400,000,000 not 2,400,000,000,000. That's the problem with big numbers - it's like spelling bananananananananananana - once you start you can't stop.
      
    - Re:Damn processor industry... (Score:3, Funny)
      
      by ShagratTheTitleless ( 828134 ) writes:
      
      2,400,000,000,000 cycles per second
      Fucking Trillahertz! or is it Terrahertz! Either way, imagine a Beowulf Clu....[Post terminated by Karma Police]
- Re: (Score:3, Insightful)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- - Re:What's this? (Score:3, Insightful)
    
    by maxwell demon ( 590494 ) writes:
    
    Not to mention the 1.44 "Megabyte" floppy disk where "Megabyte" means 1024000 Bytes ...
    - Re:What's this? (Score:3, Informative)
      
      by KiloByte ( 825081 ) writes:
      
      Well, they are usually (correctly) labelled as 1440KB instead of 1.44MB. They have 2 sides and 80 tracks with 18 512-byte sectors on each track.
      
      It's real 1440KB, without cheating on the sector's headers and/or inter-sector gaps. If you format the floppy yourself, you can shave quite a bit of space from those gaps, and this was a quite popular thing to do.
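
The geometry quoted above can be checked with a few lines of Python. This is just the arithmetic from the post (2 sides, 80 tracks, 18 sectors of 512 bytes); the function name is made up for illustration.

```python
def floppy_capacity_bytes(sides=2, tracks=80, sectors_per_track=18, sector_size=512):
    """Formatted capacity of a standard PC high-density floppy, in bytes."""
    return sides * tracks * sectors_per_track * sector_size


total = floppy_capacity_bytes()
print(total)                   # 1474560 bytes
print(total / 1024)            # 1440.0  -> "1440KB" with 1 KB = 1024 bytes
print(total / 1024**2)         # 1.40625 -> binary megabytes (MiB)
print(total / 1000**2)         # 1.47456 -> decimal megabytes
print(total / (1000 * 1024))   # 1.44    -> the mixed "megabyte" of 1,024,000 bytes
```

So the familiar "1.44 MB" figure only appears when a megabyte is taken to be 1000 × 1024 bytes, which is the point made a couple of posts up.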
      - Re:What's this? (Score:3, Informative)
        
        by KiloByte ( 825081 ) writes:
        
        Wrong. You're confusing low-level formatting (laying physical sectors onto the disk's surface) with creating a filesystem -- this is what's usually called formatting these days.
        
        If you obey the standard PC format, you'll get 18 sectors per track, leaving quite a lot of margin space. The margins are needed as the drive doesn't really care whether the new data is put in exactly the same place as the old sector was. Still, the standard is way too conservative, and many programs like fdformat let you reduce
        
        Re:What's this? (Score:3, Informative)
        
        by some guy I know ( 229718 ) writes:
        
        The Amiga could format high-density disks to 1,760 Kb, given a high-density disk drive (which only came as standard in the high-end machines). However, if I remember rightly, to do this it had to slow the drive down.
        
        The drive itself was not slowed down.
        Instead, the Amiga had the drive write an entire track at a time, rather than just one sector at a time.
        (This meant that it could store more sectors per track, because there were no inter-sector gaps, just a lead-in and lead-out for the track as a whole.)
  - Re:What's this? (Score:2)
    
    by binaryspiral ( 784263 ) writes:
    
    Come on, you all know what I'm talking about, don't be a chode about it.
    
    1,000,000,000 bytes != 1 Gigabyte, unless you put a little legal disclaimer on the box so you can sell smaller hard drives with big numbers on them.
    
    So when I plug my 160GB hard drive in, linux, windows, and mac all say I have 152.66GB - this is the screw I don't enjoy. And no - it's not all lost to formatting.
    - Re:What's this? (Score:5, Informative)
      
      by thsths ( 31372 ) writes: on Friday May 13, 2005 @04:10AM (#12517414)
      
      > 1,000,000,000 bytes != 1 Gigabyte
      
      Actually, it is. The standard was updated in 1998 to avoid confusion (Standard IEC 60027-2). Giga is 10^9, and it is constant, which means it does not change just because you use it for hard disks or memory.
      
      If you mean 2^30, then you have to say gigabinary, abbreviated as gibi or Gi. Having different names for different things can avoid an awful lot of confusion, so I would very much recommend using them.
      
      And now please put the following events into the correct order: America goes metric, hell freezes over, people use Gibi correctly.
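
To make the two prefixes concrete, here is a small Python sketch of the conversion. The `describe` helper is hypothetical, and the 160 × 10^9 byte count is an assumption for a drive sold as "160 GB"; real drives differ slightly in their exact byte count, so an OS may report a somewhat different figure.

```python
GIGA = 10**9    # SI prefix: giga
GIBI = 2**30    # IEC binary prefix: gibi


def describe(num_bytes):
    """Express a byte count in both decimal gigabytes and binary gibibytes."""
    return f"{num_bytes / GIGA:.2f} GB = {num_bytes / GIBI:.2f} GiB"


print(describe(160 * GIGA))   # 160.00 GB = 149.01 GiB
print(describe(GIBI))         # 1.07 GB = 1.00 GiB
```

The gap is about 7% at the giga level and widens with each larger prefix.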
      
      - Re:What's this? (Score:4, Informative)
        
        by Crayon Kid ( 700279 ) writes: on Friday May 13, 2005 @04:59AM (#12517559)
        
        I was also under the (wrong) impression that gigabit was the good old binary thing, and that gibi was something they made to express decimal alternatives. And in fact I find out it's quite contrary, thanks to the parent poster. :)
        
        Having repented, I point you to this reference [absoluteastronomy.com] which does a very nice job of summing everything up.
        
      - Marketing created the 'confusion' (Score:5, Insightful)
        
        by Dogtanian ( 588974 ) writes: on Friday May 13, 2005 @06:05AM (#12517818) Homepage
        
        Actually, it is. The standard was updated in 1998 to avoid confusion. Having different names for different things can avoid an awful lot of confusion, so I would very much recommend using them.
        
        Which is more important? The de facto standard that slightly misuses the 'kilo-' prefix, but *everyone* knows what it means; or something that was forced into place by marketing?
        
        As I argued in more depth elsewhere [slashdot.org], anyone who used computers *knew* what "kilobyte" and friends meant.
        
        There was no confusion, because only the 1024-byte definition was widely used.
        
        The 'need' to use the '1000 byte' definition was created by marketing, not computer people. THEY caused the confusion for their (short term) gain by exploiting the old meaning of 'kilobyte' to make their drives seem larger.
        
        Marketing do not give a flying **** about correctness or clarity; if there was any problem, *they* created it. Computer people knew what kilobyte meant.
        
        
        Re:Marketing created the 'confusion' (Score:5, Insightful)
        
        by Crayon Kid ( 700279 ) writes: on Friday May 13, 2005 @06:35AM (#12517940)
        
        Marketing do not give a flying **** about correctness or clarity; if there was any problem, *they* created it. Computer people knew what kilobyte meant.
        
        I'm sure they took advantage of the blurry meanings for a while. But in the long run, you gotta admit the change makes sense, from a standardisation point of view. Every measuring unit uses kilo/mega/giga to mean powers of ten. Computer world was the odd one out, and it should rightly be labeled specifically.
        
        
        Re:Marketing created the 'confusion' (Score:5, Insightful)
        
        by quantum bit ( 225091 ) writes: on Friday May 13, 2005 @07:23AM (#12518113) Journal
        
        Every measuring unit uses kilo/mega/giga to mean powers of ten. Computer world was the odd one out, and it should rightly be labeled specifically.
        
        Oh, the computer world uses those prefixes to mean powers of 10 too. They just mean powers of 10 in base 2 math :)
        
        
        Re:Marketing created the 'confusion' (Score:3, Insightful)
        
        by dgatwood ( 11270 ) writes:
        
        I'm sure they took advantage of the blurry meanings for a while. But in the long run, you gotta admit the change makes sense, from a standardisation point of view.
        No, I don't admit it. Volume and distance measures are standardized to base 10 because they have no inherent natural unit. Computers have a natural unit---powers of two. In much the same way, we don't standardize time to base 10. Can you imagine if we decided we wanted to have 100 days in a year? It wouldn't work well because Earth doesn't
        
        Re:Marketing created the 'confusion' (Score:5, Insightful)
        
        by barawn ( 25691 ) writes: on Friday May 13, 2005 @11:26AM (#12520129) Homepage
        
        As I argued in more depth elsewhere, anyone who used computers *knew* what "kilobyte" and friends meant.
        
        Except Ethernet card manufacturers, modem manufacturers, PCI card manufacturers... oh, hell, just about anyone who transfers something with a clock.
        
        10baseT ethernet transfers data at 10 Mbps. That means 10 x 10^6 bits per second. IDE buses running at 66 MHz list their theoretical maximum as 66 MB/s.
        
        kilo = 1024 is retarded. It only makes sense for things that have to scale in powers of two, like memory. For a long while, "data rate" meant "kilo=1000, mega=1000 kilo" whereas in storage, "kilo=1024". Talk about a recipe for disaster.
        
        Just as an example: here's an article describing Ultra320 SCSI, and PCI bus bandwidth:
        
        Under standard PCI the host bus has a maximum speed of 66 MHz. This allows for a maximum transfer rate of 533 MB/sec across a 64-bit PCI bus.
        
        66 2/3 MHz (M here means what? oh, right, 10^6) times 8 bytes is 533 1/3 MB/s. Where here, "M" means "1000*1000". In MiB/s, it'd be 508.6263 MiB/s.
        
        Is this a problem? Yes. I shouldn't have to pull out a freaking calculator to figure out how long it should take to dump 2 GB of RAM across a 2 GB/s link. It should be one second, not 1.0737418 seconds.
        
        Computer people knew what kilobyte meant.
        
        No we didn't. We've never used kilo consistently. See above - we've talked about CPU speeds in terms of kHz and MHz, meaning 10^3, 10^6, and talked about kilobits/second meaning 10^3 bits per second, talked about kilobytes/second meaning 10^3 bytes/second, and turned around and talked about file sizes where kilobyte means 1024 bytes.
        
        We've never been consistent. The IEC finally owned up to it and admitted it, and asked us to all finally stop being so damned sloppy, and I'm quite glad they did.
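
The figures in this post are easy to reproduce. The sketch below uses only the numbers given above (a 66 2/3 MHz clock, an 8-byte-wide bus, 2 GiB of RAM over a 2 GB/s link); the function names are invented for illustration.

```python
MEGA = 10**6    # SI mega, as used for clock and data rates
MEBI = 2**20    # binary mebi, as used for memory sizes
GIBI = 2**30


def bus_bytes_per_second(clock_hz, bus_width_bytes):
    """Theoretical peak: one transfer of bus_width_bytes per clock cycle."""
    return clock_hz * bus_width_bytes


def transfer_seconds(num_bytes, bytes_per_second):
    return num_bytes / bytes_per_second


pci = bus_bytes_per_second(200_000_000 / 3, 8)   # 66 2/3 MHz, 64-bit bus
print(round(pci / MEGA, 2))    # 533.33   (MB/s, decimal)
print(round(pci / MEBI, 4))    # 508.6263 (MiB/s, binary)

# 2 GiB of RAM pushed across a link rated at 2 * 10^9 bytes per second
print(transfer_seconds(2 * GIBI, 2 * 10**9))     # 1.073741824
```

The last line is the "freaking calculator" case: mixing a binary size with a decimal rate leaves a 7.4% surprise.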
        
  - - - Re:What's this? (Score:3, Funny)
        
        by krymsin01 ( 700838 ) writes:
        
        Down with the kilometer. Up with the thoumeter!
        
        Re:What's this? (Score:3, Funny)
        
        by darien ( 180561 ) writes:
        
        No need to get all holier-than-thou.
...and Statistics. (Score:4, Funny)

by Kaenneth ( 82978 ) writes: on Friday May 13, 2005 @03:25AM (#12517246) Journal

So, do you think someone typed "Nuclear weapons are being developed by the government of Iraq.^H^Hn." just before the power went out?

Why do we need it? (Score:4, Interesting)

by Godman ( 767682 ) writes: on Friday May 13, 2005 @03:25AM (#12517247) Homepage Journal

If we are just now figuring out that fsync's don't work, then the question is, why do we care? Have we been using them, and they just haven't been working or something?

If we've made it this far without it, why do we need it now?

I'm just curious...

- Re:Why do we need it? (Score:5, Insightful)
  
  by Erik Hensema ( 12898 ) writes: on Friday May 13, 2005 @03:50AM (#12517345) Homepage
  
  We need it because of journalling filesystems. A JFS needs to be sure the journal has been flushed out to disk (and resides safely on the platters) before continuing to write the actual (meta)data. Afterwards, it needs to be sure the (meta)data is written properly to disk in order to start writing the journal again.
  
  When both the journal and the data are in the write cache of the drive, the data on the platters is in an undefined state. Loss of power means filesystem corruption -- just the thing a JFS is supposed to avoid.
  
  Also, switching off the machine the regular way is a hazard. As an OS you simply don't know when you can safely signal the PSU to switch itself off.
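
The ordering described above can be sketched in Python. This is a toy illustration under loud assumptions: the record format (`BEGIN`/`DONE`), the file layout, and the function name are all invented here and do not correspond to any real journalling filesystem. The point is only where the flushes sit: the journal entry must be durable before the data is touched, and the data must be durable before the journal entry is retired.

```python
import os
import tempfile


def journaled_update(journal_path, data_path, offset, payload):
    """Toy write-ahead update: journal, flush, apply, flush, retire.

    Hypothetical format; real filesystems journal blocks, not text records.
    """
    # 1. Record the intent in the journal and flush it.
    with open(journal_path, "ab") as journal:
        journal.write(b"BEGIN %d %d\n" % (offset, len(payload)))
        journal.write(payload + b"\n")
        journal.flush()
        os.fsync(journal.fileno())    # barrier 1: journal before data

    # 2. Apply the change to the data file and flush it.
    fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, payload)
        os.fsync(fd)                  # barrier 2: data before retiring the entry
    finally:
        os.close(fd)

    # 3. Mark the journal entry as complete.
    with open(journal_path, "ab") as journal:
        journal.write(b"DONE\n")
        journal.flush()
        os.fsync(journal.fileno())


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    journaled_update(os.path.join(workdir, "journal"),
                     os.path.join(workdir, "data"), 0, b"hello")
```

If the drive acknowledges either `fsync` while the bytes are still in its write cache, the barriers mean nothing, which is exactly the failure mode this post describes.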
  
  - Re:Why do we need it? (Score:3, Informative)
    
    by spectecjr ( 31235 ) writes:
    
    When both the journal and the data are in the write cache of the drive, the data on the platters is in an undefined state. Loss of power means filesystem corruption -- just the thing a JFS is supposed to avoid. ... except most drives use the angular momentum of the drive, the power left in the PSU and any spare voltage in the on-board capacitors to provide the power to finish writing and park the drive heads.
    
    At least, that was the state of the art in the early 90s.
    - Re:Why do we need it? (Score:3, Interesting)
      
      by pe1chl ( 90186 ) writes:
      
      But since then, the angular momentum of drives has decreased, and cache size has increased.
      Of course write speed has increased as well, but typical cache size of 8MB and write speed of 50MB/s would mean 160ms of continuous writing when the head already is positioned correctly.
      Assuming the cache can contain blocks scattered over the entire disk, it does not seem realistic to write everything back on power failure.
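
A quick back-of-the-envelope version of this estimate in Python. The 8 MB cache and 50 MB/s figures come from the post; the 4 KB block size and 10 ms per-block seek cost in the second case are assumptions picked purely for illustration, and transfer time is ignored there.

```python
def sequential_drain_ms(cache_mb, write_mb_per_s):
    """Best case: the whole cache is one contiguous run under the head."""
    return cache_mb * 1000 / write_mb_per_s


def scattered_drain_ms(cache_kb, block_kb, seek_ms):
    """Worst case sketch: every cached block needs its own seek."""
    blocks = cache_kb // block_kb
    return blocks * seek_ms


print(sequential_drain_ms(8, 50))            # 160.0 ms
print(scattered_drain_ms(8 * 1024, 4, 10))   # 20480 ms, i.e. about 20 seconds
```

Under those assumed numbers the scattered case is two orders of magnitude longer than the sequential one, which is why flushing everything on stored energy alone looks unrealistic.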
  - Re:Why do we need it? (Score:5, Insightful)
    
    by bgog ( 564818 ) * writes: on Friday May 13, 2005 @05:08AM (#12517584) Journal
    
    The author is specifically talking about the fsync function, not the ATA sync command. fsync is an OS call notifying the system to flush its write caches to the physical device. This writes to the disk's write cache but I don't believe it actually issues the sync command to the drive.
    
    In the case of a journaling file system they issue the sync command to the drive to flush the data out.
    
    I work on a block-level transactional system that requires blocks to be synced to the platters. There were two options: modify the kernel to issue syncs to the ATA drives on all writes (to the disk in question), or just disable the physical write cache on the drive. It turned out to be a touch faster to just disable the cache, but the two are effectively equal.
    
    However drives operate fine under normal conditions, applications write to file systems which take care of forcing the disks to sync. fsync (which the author is talking about) is an OS command and not directly related to the disk sync command.
    
    - Re:Why do we need it? (Score:5, Insightful)
      
      by swmccracken ( 106576 ) writes: on Friday May 13, 2005 @06:50AM (#12517983) Homepage
      
      This writes to the disk's write cache, but I don't believe it actually issues the sync command to the drive.
      
      Yeah - that's the point of this thing - what's supposed to happen with fsync? From memory, sometimes it will guarantee it's all the way to the platters, sometimes it will not, depending on what storage system you're using, and how easy such a guarantee is to make.
      
      Linus in 2001 [iu.edu] discussing this issue - it's not new. That whole thread was about comparing SCSI against IDE drives, and it seemed that the IDE drives were either breaking the laws of physics, or lying, but the SCSI drives were being honest.
      
      From hazy memory, one problem is that without tagged-command-queuing or native-command-queuing, one process issuing a sync will cause the hard drive and related software to wait until it has fully synced all i/o "in flight", holding up any other i/o tasks for other processes!
      
      That's why fsync often lies: it's not practical to let people who fsync all the time to flush buffers screw around with the whole i/o subsystem, and apparently some programs were overzealous with calling fsync when they shouldn't.
      
      However, with TCQ, commands that are synched overlap with other commands, so it's not that big a deal (other i/o tasks are not impacted any more than they would by other, unsynchronised, i/o). (Thus, with TCQ, fsync might go all the way to the platters, but without it it might just go to the IDE bus.) SCSI has had TCQ from day one, which is why a SCSI system is more likely to sync all the way than IDE.
      
      If I'm wrong, somebody correct me please.
      
      Brad's program certainly points out an issue - it should be possible for a database engine to write to disk and guarantee that it gets written; perhaps fsync() isn't good enough - be this a fault in the drives, the IDE spec, IDE drivers or the OS.
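
      For reference, this is roughly what "write it and make sure it's written" looks like from an application, sketched in Python. The file name and helper are made up for illustration. Note that os.fsync() only asks the OS to flush its own caches to the device; whether the drive then empties its write cache is exactly the part nobody can see from here:

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it to the storage device.

    os.fsync() flushes the OS cache for this file. Whether the drive then
    moves the data from its own write cache to the platters is outside the
    program's control.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # flush Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device


# Hypothetical usage with a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "record.bin")
durable_write(path, b"committed\n")
with open(path, "rb") as f:
    contents = f.read()
```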
      
      - Flaw in the ATA specification + manufacturers (Score:3, Informative)
        
        by tlambert ( 566799 ) writes:
        
        Actually, it's a flaw in the ATA specification: ATA drives can do a disconnected read, but there is no way to do a disconnected write.
        
        Because of this, you can have a tagged command queue for read operations, but there is no way to provide a corresponding one for write operations.
        
        SCSI does not have this limitation, but the bus implementation is much more heavyweight, and therefore more expensive.
        
        The problem is exacerbated, in that ATA does not permit new disconnected read requests to be issued while the n
    - Re:Why do we need it? (Score:3, Informative)
      
      by bill_mcgonigle ( 4333 ) * writes:
      
      I work on a block-level transactional system that requires blocks to be synced to the platters. There were two options: modify the kernel to issue syncs to the ATA drives on all writes (to the disk in question), or just disable the physical write cache on the drive. It turned out to be a touch faster to just disable the cache, but the two are effectively equal.
      
      Just to clarify - use hdparm -W to fiddle with the write cache on the drive. I've built linux-based network appliances that go out in the field,
- Re:Why do we need it? (Score:2)
  
  by Vellmont ( 569020 ) writes:
  
  If we've made it this far without it, why do we need it now?
  
  Maybe you've made it this far, but I'm sure there are other people who have mysteriously lost data, or had it corrupted. They probably blamed the OS, faulty hardware, drivers, whatever.
  
  Data security is based on assumptions (a contract if you will). If you assume the contract hasn't been broken, you look elsewhere for blame when something goes wrong. Up until now I'm sure no one questioned whether fsync() was doing what it was supposed to (a
Of course it does! (Score:5, Interesting)

by grahamsz ( 150076 ) writes: on Friday May 13, 2005 @03:26AM (#12517253) Homepage Journal

Having written some diagnostic tools for a smaller hard disk maker (who i'll refrain from naming) it's amazing to me that disks work at all.

Most systems can identify and patch out bad sectors so that they aren't used. What surprised me is that the manufacturers have their own bad sector table, so when you get the disk it's fairly likely that there are already bad areas which have been mapped out.

Secondly the raw error rate was astoundingly high. It's been quite a few years but it was somewhere between one error in every 10E5 to 10E6 bits. So it's not unusual to find a mistake in every megabyte read. Of course CRC picks up this error and hides that from you too.

Granted this was a few years ago, but i wouldn't be surprised if it's as bad (or even worse) now.
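
To put rough numbers on that, here is a small Python sketch. It assumes "10E5 to 10E6" means one error per 10^5 to 10^6 bits and that a megabyte is 8,000,000 bits; both readings are assumptions:

```python
# Expected raw bit errors per megabyte read, before any error correction.
BITS_PER_MEGABYTE = 8 * 10**6  # decimal megabyte (assumed)


def expected_errors_per_megabyte(bits_per_error):
    """Average number of raw bit errors in one megabyte at the given rate."""
    return BITS_PER_MEGABYTE / bits_per_error


low = expected_errors_per_megabyte(10**6)   # one error per 10^6 bits
high = expected_errors_per_megabyte(10**5)  # one error per 10^5 bits

print(f"{low:.0f} to {high:.0f} raw errors per megabyte")
```

Under those assumptions the raw rate works out to somewhere between 8 and 80 errors per megabyte, all of which the drive has to hide.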

- Re:Of course it does! (Score:2)
  
  by Nutria ( 679911 ) writes:
  
  It's been quite a few years but it was somewhere between one error in every 10E5 to 10E6 bits. So it's not unusual to find a mistake in every megabyte read.
  
  I'm surprised, but not that surprised.
  
  Areal densities are so high these days, the r/w heads are so small, and prices are so low, that I also am truly amazed that modern HDDs are made to work.
  
  But then, I remember 13" removable 5MB platters, and 8" floppy drives.
- Re:Of course it does! (Score:3, Informative)
  
  by ArbitraryConstant ( 763964 ) writes:
  
  "What surprised me is that the manufacturers have their own bad sector table, so when you get the disk it's fairly likely that there are already bad areas which have been mapped out."
  
  Can't you get the count with SMART?
  - Re:Of course it does! (Score:5, Informative)
    
    by cowbutt ( 21077 ) writes: on Friday May 13, 2005 @05:00AM (#12517560) Journal
    
    Sort of, yes:
    
    # smartctl -a /dev/hde | grep 'Reallocated_Sector_Ct'
    5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
    
    This indicates that /dev/hde is far from exhausting its supply of reserved blocks (the first 100) and never has been (the second 100, which is 'worst'). When it crosses the threshold (36) (or the threshold of any of the other 'Pre-fail' attributes for that matter), failure is imminent.
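
    If you want to check this from a script, a minimal Python sketch of reading that row follows. The column order (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw) is taken from the sample line above; the class and function names are made up for illustration:

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int      # current normalized value
    worst: int      # lowest normalized value ever recorded
    threshold: int  # failure threshold
    raw: str        # vendor-specific raw value


def parse_attribute(line):
    """Parse one row of the attribute table printed by `smartctl -a`."""
    fields = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        threshold=int(fields[5]),
        raw=fields[9],
    )


def has_crossed_threshold(attr):
    """True once the normalized value has fallen to the threshold or below."""
    return attr.value <= attr.threshold


line = "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0"
attr = parse_attribute(line)
print(attr.name, attr.value, attr.worst, attr.threshold, has_crossed_threshold(attr))
```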
    
- Re:Of course it does! (Score:3, Interesting)
  
  by pyropunk51 ( 819247 ) writes:
  
  As anybody who's ever used (or had to use :-( ) SpinRite [grc.com] will tell you, your HDD not only lies to you, it cheats and steals as well. To wit: It makes it seem there are no bad sectors, when in fact the surface is riddled with them, only the manufacturer hides this fact from you by having a bad sector table. Also errors are corrected on the fly by some CRC checking. You can ask the SMART for the stats, but you can do very little about the results it gives you, other than maybe buying a new disk (which most li
  - Re:Of course it does! (Score:3, Insightful)
    
    by enosys ( 705759 ) writes:
    
    IMHO having the drive hide bad sectors is a good idea. That way you don't have to enter any bad sector lists, you don't have to scan for them when formatting, and the OS doesn't have to worry about them.
    What would you do if you had full control over bad sectors? You're still able to keep trying to read a new bad sector that contains data. The drive will try to repair it when you write to it and if it can't then it will remap it. It seems to me the only thing you can't do is force the drive to try to re
- - Sadly unpredictable (Score:5, Interesting)
    
    by grahamsz ( 150076 ) writes: on Friday May 13, 2005 @03:46AM (#12517328) Homepage Journal
    
    i know all disks ultimately fail, but it's frustrating that some can be really abused and run for years, when others die abruptly.
    
    While working at said hard disk company i had one of their smaller disks sitting on the end of a steel ruler on my desk. I spun round on my chair, as i do when i'm thinking, and hit the other end of the ruler with my elbow. This of course launched the disk across the room, slamming it against the wall.
    
    Given that I was in the process of writing software to diagnose failures, I was quite excited about this accident. Of course I returned the disk to the test setup and there was nothing wrong.

    In my experience, the only surefire way to have a disk fail is to place any piece of important, but un-backed-up, work on it.
    
- - Re:Of course it does!-Perfect world. (Score:3, Interesting)
    
    by grahamsz ( 150076 ) writes:
    
    Obviously everything will ultimately fail. I know that the semiconductor industry makes the same part, tests it to see how fast it is, then sells it as different models based on the test results.
    
    I was surprised that some reasonable proportion of hard drives sold have errors on them at that point in time.
    
    Part of me wonders if this explains the anecdotal stories that SCSI disks are more reliable than their cheaper ATA counterparts - even when they use the same physical hardware. Perhaps (and this is blind spec
    - Re:Of course it does!-Perfect world. (Score:5, Informative)
      
      by cowbutt ( 21077 ) writes: on Friday May 13, 2005 @05:15AM (#12517603) Journal
      
      Part of me wonders if this explains the anecdotal stories that SCSI disks are more reliable than their cheaper ATA counterparts - even when they use the same physical hardware. Perhaps (and this is blind speculation) the drives with fewer errors get sold to the customers willing to pay more.
      Sort of. According to this paper from Seagate [seagate.com], the main differences between SCSI and ATA are:
      
      - SCSI drives are individually tested, rather than tested in batch
      - SCSI drives typically have a 5 year warranty, rather than 1 year for ATA (note that Seagate's ATA drives also have 5 years, and WD's Special Edition -JB ATA drives have 3 years).
      - SCSI drives usually have higher rotational speeds (i.e. 10K or 15K RPM vs. 7200RPM)
      - SCSI drives usually make use of the latest technology. ATA uses whatever older technology has been cost-engineered to a suitable price-point
      - The physical and programming interface
      
      I also suspect that SCSI drives have a larger number of reserved blocks for remapping, and that they remap blocks on read operations when the ECC indicates that a block has crossed some threshold of near-unreadability. This would account for a) SCSI drives' lower capacities and b) a report I had from a SCSI-using friend running BSD who reports that a 'remapping' message turned up in his syslog without needing any special action to invoke.
      
      By contrast, in my experience, ATA drives only remap failed blocks on write operations. Lots of people think that when a drive returns a read error on a file, it's only fit for the bin, but I've forced the remapping to take place by writing to the affected blocks (either by zeroing the entire partition or drive using dd or badblocks -w, or by removing the affected file then creating a large file that fills all unallocated space in a partition, then removing it to reclaim the space).
      
      - Re:Of course it does!-Perfect world. (Score:3, Insightful)
        
        by pe1chl ( 90186 ) writes:
        
        SCSI drives usually make use of the latest technology. ATA uses whatever older technology has been cost-engineered to a suitable price-point
        
        SCSI drives usually are a couple of years behind in drive capacity relative to ATA drives. This seems to contradict the above.
      - Re:Of course it does!-Perfect world. (Score:4, Informative)
        
        by pe1chl ( 90186 ) writes: on Friday May 13, 2005 @05:35AM (#12517666)
        
        a report I had from a SCSI-using friend running BSD who reports that a 'remapping' message turned up in his syslog without needing any special action to invoke.
        
        SCSI drives can be set up to return "warning" codes like "I had trouble reading this sector but eventually I could read a good copy". When the driver is careful it will enable this, and when it occurs it will write back the sector to make sure a fresh copy is on the disk and/or it is remapped.
        Apparently BSD does this.
        
        By default, corrected sectors are just returned as OK. It is also possible to enable "auto remap on read" and the drive would be triggered to do the rewrite or remap by itself. Of course this means you have less control and less logging.
        (but you can read the remap tables)
        
        There are many details that can improve error handling but not all of them are fully worked out. For example, in Linux RAID-1, when a read error occurs the action is to take the drive offline, read the sector from the other disk and continue with 1 disk. Of course the proper handling would be to try writing the correct copy from the good disk back on the failed disk, and see if that fixes it. Only after several failures should the disk be taken offline, on the assumption that it has crashed.
        
        This has been like this for years, and is relatively easy to fix. I would be prepared to try fixing it but it seems one has to jump over many hurdles to get a fix in the kernel while not being the maintainer of the subsystem, and a mail to said person was not answered.
        
Corporate Integrity (Score:2)

by dj245 ( 732906 ) writes:

Manufacturers are blatantly sacrificing integrity in favor of scoring higher on 'pure speed' performance benchmarking.
Corporate Integrity, not data integrity. I've read through the article and don't see how you can lose data integrity unless you disable all caching, from the OS to the disk itself. In this day and age, nobody does that. Sure, something's broken. But I fail to see how it's very useful these days anyway. Maybe someone with a better grasp of why you would need fsync could help out?
- Re:Corporate Integrity (Score:4, Informative)
  
  by Dorsai65 ( 804760 ) writes: <dkmerriman AT gmail DOT com> on Friday May 13, 2005 @03:42AM (#12517320) Homepage Journal
  
  What the article is saying is that the drive (or sometimes the RAID card and/or OS) is lying (with fsync) when it answers that it wrote the data: it didn't; so when you lose power, the data that was in cache (and should have been written) gets lost. It isn't a question of whether caching is turned on or not, but the drive truthfully saying whether or not the data was actually written.
  
- Here's how (Score:5, Informative)
  
  by Moraelin ( 679338 ) writes: on Friday May 13, 2005 @03:44AM (#12517322) Journal
  
  For example, don't think "home user losing the last porn pic", think for example "corporate databases using XA transactions".
  
  The semantics of XA transactions say that at the end of the "prepare" step, the data is already on the disc (or whatever other medium), just not yet made visible. That is, basically everything that could possibly fail has in fact had its chance to fail. And if you got an OK, then it didn't.
  
  Introducing a time window (likely extending not just past "prepare", but also past "commit") where the data is still in some cache and God knows when it'll actually get flushed, throws those whole semantics out the window. If, say, power fails (e.g., PSU blows a fuse) or shit otherwise hits the fan in that time window, you have fucked up the data.
  
  The whole idea of transactions is ACID: Atomicity, Consistency, Isolation, and Durability:
  
  - Atomicity - The entire sequence of actions must be either completed or aborted. The transaction cannot be partially successful.
  
  - Consistency - The transaction takes the resources from one consistent state to another.
  
  - Isolation - A transaction's effect is not visible to other transactions until the transaction is committed.
  
  - Durability - Changes made by the committed transaction are permanent and must survive system failure.
  
  That time window we introduced makes it at least possible to screw 3 out of 4 there. An update that involves more than one hard drive may not be Atomically executed in that case: only one change was really persisted. (E.g., if you booked a flight online, maybe the money got taken from your account, but not given to the airline.) It hasn't left the data in a Consistent state. (In the above example some money has disappeared into nowhere.) And it's all because it wasn't Durable. (An update we thought we committed hasn't, in fact, survived a system failure.)
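
  A minimal sketch of the Atomicity and Consistency parts, using Python's sqlite3 module and an in-memory database. This is a single-database stand-in, not XA, and the account names and amounts are made up. An in-memory database says nothing about Durability, which is the property that depends on the disk telling the truth:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("customer", 100), ("airline", 0)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "customer", "airline", 100)
failed = transfer(conn, "customer", "airline", 50)  # overdraft: rejected as a whole
final = balances(conn)
print(ok, failed, final)
```

  The second transfer credits the airline first and then fails the balance check on the customer, so the credit is rolled back as well; no money appears or disappears.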
  
  - Re:Here's how (Score:3, Informative)
    
    by arivanov ( 12034 ) writes:
    
    And this is the exact reason why any good SQL based system must have means of integrity checking.

    As someone who has been writing database stuff for 10+ years now, I get really pissed off when I see lunatics raving on Acid about ACID. ACID in itself is not enough.

    You must have reference checking, offline integrity tests as well as ongoing online integrity tests. Repeating your example a transaction for buying tickets for a holiday must insert a record in the Requests table, Tickets table, Holidays table, e
    - And your point is? (Score:4, Informative)
      
      by Moraelin ( 679338 ) writes: on Friday May 13, 2005 @06:02AM (#12517804) Journal
      
      Yes, nothing by itself is enough, not even XA transactions, but it can make your life a _lot_ easier. Especially if not all records are under your control to start with.
      
      E.g., the bank doesn't even know that the money is going to reserve a ticket on flight 705 of Elbonian United Airlines. It just knows it must transfer $100 from account A to account B.
      
      E.g., the travel agency doesn't even have access to the bank's records to check that the money has been withdrawn from your account. And it shouldn't ever have.
      
      So you propose... what? That the bank gets full access to the airline's business data, and that the airline can read all bank accounts, for those integrity checks to even work? I'm sure you can see how that wouldn't work.
      
      Yes, if you have a single database and it's all under your control, life is damn easy. It starts getting complicated when you have to deal with 7 databases, out of which 5 are in 3 different departments, and 2 aren't even in the same company. And where not everything is a database either: e.g., where one of the things which must also happen atomically is sending messages on a queue.
      
      _Then_ XA and ACID become a lot more useful. It becomes one helluva lot easier to _not_ send, for example, a JMS message to the other systems at all when a transaction rolls back, than to try to bring the client's database back in a consistent state with yours.
      
      It also becomes a lot more expensive to screw up. We're talking stuff that has all the strength of a signed contract, not "oops, we'll give you a seat on the next flight".
      
      Yes, your tools discovered that you sent the order for, say, 20 trucks in duplicate. Very good. Then what? It's as good as a signed contract the instant it was sent. It'll take many hours of some manager's time to negotiate a way out of that fuck-up. That is _if_ the other side doesn't want to play hardball and remind you that a contract is a contract.
      
      Wouldn't it be easier to _not_ have an inconsistency to start with, than to detect it later?
      
      Basically, yes, please do write all the integrity tests you can think of. Very good and insightful that. But don't assume that it suddenly makes XA transactions useless. _Anything_ that can reduce the probability of a failure in a distributed system is very much needed. Because it may be disproportionately more expensive to fix a screw-up, even if detected, than not to do it in the first place.
      
In other news.... (Score:5, Funny)

by ToraUma ( 883708 ) writes: on Friday May 13, 2005 @03:31AM (#12517269)

96% of Livejournal users replied, "What's a hard drive? Is that like a modem?"

- Re:In other news.... (Score:4, Funny)
  
  by ameoba ( 173803 ) writes: on Friday May 13, 2005 @04:44AM (#12517509)
  
  No. It's memory. I just can't figure out why all these games that say 512MB is optimal are running so slow when I have 120GB.
  
Seems fair enough: We lie to our hard drives too (Score:2, Funny)

by MonsieurCoward ( 639908 ) writes:

... "Swear to you there's no pr0n there !!"
An acceptable alternative. (Score:3, Insightful)

by rice_burners_suck ( 243660 ) writes: on Friday May 13, 2005 @03:36AM (#12517287)

Why am I not surprised at this? First, they decide that a kilobyte = 1000 bytes, rather than the correct value of 1024. This leads the megabyte to be 1000 kilobytes, again, rather than 1024. The gig is likewise 1000 megabytes. You might think, ok, big deal, right?
Yeah. In the days when the biggest hard drive you could get was 2 gigs, you would get 147,483,648 bytes less storage than advertised, unless you read the fine print located somewhere. This is only about 140 megs less than advertised. Today, when you can get 200 gig hard drives, the difference is much larger: 14,748,364,800 bytes less storage than advertised. This means that now, you get almost FOURTEEN GIGABYTES less storage than advertised. That's bigger than any hard drive that existed in 1995. That is a big deal.
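
A quick check of those figures in Python, assuming the buyer expects binary gigabytes (2^30 bytes) while the label counts decimal ones (10^9 bytes):

```python
GB = 10**9   # decimal gigabyte, as used on drive labels
GIB = 2**30  # binary gigabyte (gibibyte)


def shortfall_bytes(advertised_gb):
    """Bytes 'missing' if binary gigabytes are expected but decimal ones are sold."""
    return advertised_gb * (GIB - GB)


small = shortfall_bytes(2)    # the 2 gig drive
large = shortfall_bytes(200)  # the 200 gig drive

print(f"{small:,} bytes short on a 2 gig drive")
print(f"{large:,} bytes short on a 200 gig drive ({large / GIB:.2f} binary gigs)")
```

Both byte counts match the ones quoted above; the 200 gig shortfall comes to a bit under fourteen binary gigabytes.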

I'm bringing up the size issue in a thread on fsync() because it is only one more area where hard drive manufacturers are cheating to get "better" performance numbers, instead of being honest and producing a good product. As a result, journaling filesystems and the like cannot be guaranteed to work properly.

If the hard drive mfgs really want good performance numbers, this is what they should do: Hard drives already have a small amount of memory (cache) in the drive electronics. Unfortunately, when the power goes away, the data therein becomes incoherent within nanoseconds. So, embed a flash chip on the hard drive electronics, along with a small rechargeable battery. If the battery is dead or the flash is fscked up, both of which can easily be tested today, the hard drive obeys every fsync() more religiously than the pope and works slightly more slowly. If the battery is alive and the flash works, then in the event of power-off with data remaining in the cache (now backed by battery), the hard drive will write that data to the flash chip. Upon the next powerup, the hard drive will initialize as normal, but before it accepts any incoming read or write commands, it will first record the information from flash to the platter. This is a good enough guarantee that data will not be lost, as the reliability of flash memory exceeds that of the magnetic platter, provided the flash is not written too many times, which it won't be under this kind of design; and as I said, nothing will be written to flash if the flash doesn't work anymore.

- Re:An acceptable alternative. (Score:3, Informative)
  
  by Johan Veenstra ( 61679 ) writes:
  
  kilo = 10^3 = 1,000
  mega = 10^6 = 1,000,000
  giga = 10^9 = 1,000,000,000
  
  kibi = 2^10 = 1,024
  mebi = 2^20 = 1,048,576
  gibi = 2^30 = 1,073,741,824
  
  So it's not the harddrive manufacturers that are wrong. You get 1 gigabyte harddisk space for every gigabyte advertised. When you're buying 1 gigabyte of memory you get 74 megabytes for free (because you actually get 1 gibibyte).
  - Re:An acceptable alternative. (Score:3, Insightful)
    
    by daikokatana ( 845609 ) writes:
    
    Ok, fair enough. Now step into any of the 99% of all computer shops out there and ask for a hard drive, 160 gibibyte in size.
    If they don't laugh until you exit the store, I'll pay for your disk. Please make sure you record the event and share it on the net.
- Re:An acceptable alternative. (Score:5, Insightful)
  
  by Sparr0 ( 451780 ) writes: <sparr0@gmail.com> on Friday May 13, 2005 @03:54AM (#12517359) Homepage Journal
  
  You have no grasp of what 'kilo', 'mega', and 'giga' mean. They have meant the same thing for 45 years, computers did not change that. There is a standard for binary powers, you simply refuse to use it.
  
  - Re:An acceptable alternative. (Score:3, Insightful)
    
    by Alioth ( 221270 ) writes:
    
    Ah, so now we know your 3GB space and 100GB of transfer advertised in your sig aren't binary gigabytes, but decimal, just like the hard drive manufacturers :-)
  - Re:An acceptable alternative. (Score:4, Insightful)
    
    by hyfe ( 641811 ) writes: on Friday May 13, 2005 @07:23AM (#12518110)
    
    You have no grasp of what 'kilo', 'mega', and 'giga' mean. They have meant the same thing for 45 years, computers did not change that. There is a standard for binary powers, you simply refuse to use it.
    Being able to keep two thoughts in your head simultaneously is a nice skill.

    Sure, the scientific meanings of kilo, mega and giga never changed, but kilo, mega and giga in computer science started out as the binary values. They are still in use: when reporting free space left on your hard drive, both Windows and Linux use binary thousands. Saying this is a clear-cut case is just ignoring reality, as using 1024 really does simplify a lot of the math.

    Secondly, if the manufacturers actually had come out and said 'we have decided to adhere to scientific standards and use regular 1000's' and clearly marked their products as such, we wouldn't have any problems now. The problem is, they didn't. They just silently changed it, causing shitloads of confusion along the way. Of all the alternatives in this mess, they chose the one which could ruin an engineer's day, only for the purpose of having your drive look a few % larger.
    
    Some fool let the marketers in on the engineering meetings and we all lived to rue that day.
    
- Re:An acceptable alternative. (Score:2, Informative)
  
  by Rinzwind ( 870478 ) writes:
  
  Why am I not surprised at this? First, they decide that a kilobyte = 1000 bytes, rather than the correct value of 1024. This leads the megabyte to be 1000 kilobytes, again, rather than 1024. The gig is likewise 1000 megabytes. You might think, ok, big deal, right?
  Wrong. If you start ranting get your FACTS STRAIGHT. It's been solved in 1998 already.
  The Standards
  
  Although computer data is normally measured in binary code, the prefixes for the multiples are based on the metric system. The nearest
- - Re:An acceptable alternative. (Score:2, Informative)
    
    by kasperd ( 592156 ) writes:
    
    It's not flash (EEPROM), it's battery-backed RAM.
    
    The suggestion was to use both, which I agree is a good idea, because you get the best of both worlds. Flash has a problem with being overwritten many times, which the suggested design solves by only using it in case of loss of power. Battery-backed RAM has a problem with potential data loss if it needs to keep the data for longer than there is power, which the suggested design solves by writing data to flash as soon as main power is lost. I hope
- - - Re:An acceptable alternative. (Score:3, Interesting)
      
      by Kiryat Malachi ( 177258 ) writes:
      
      Correct is the definition that follows standard usage, and usage in EVERY OTHER BRANCH OF THE COMPUTER WORLD.
      
      How fast is a kilobit per second data transmission? Is it 1024 bits/s or 1000 bits/s?
      
      As much as it pains me, because I know they did it to screw customers, moving to the standard was correct. It *ought* to match everything else for reasons of consistency; it is more important to have current consistency across all current measurements inside of the computer than it is to have historical consisten
More information (Score:5, Interesting)

by Halo1 ( 136547 ) writes: on Friday May 13, 2005 @03:39AM (#12517305)

There was an interesting discussion [apple.com] on this topic a while ago on Apple's Darwin development list.

Author lied when implied that DRIVES are the issue (Score:5, Informative)

by Anonymous Coward writes: on Friday May 13, 2005 @03:42AM (#12517319)

The author lied when he implied that DRIVES are the issue.

ATA-IDE, SCSI, and S-ATA drives from all major manufacturers will accept commands to flush the write buffer including track cache buffer completely.

These commands are critical before cutting power and "sleeping" in machines that can perform a complete "deep sleep" (no power at all whatsoever sent to the ATA-IDE drive).

Such OSes include Apple's OS 9 on a G4 tower, and some versions of OS X on machines not supplied with certain naughty video cards.

Laptops, for example need to flush drives... AND THEY do.

All drives conform.

As for DRIVER AUTHORS not heeding the special calls sent to them.... he is correct.

Many driver writers (other than me) are loser shits that do not follow standards.

As for LSI RAID cards, he is right, and other RAID cards too... that is because the products are defective. But the drives are not, and the drivers COULD be written to honor a true flush.

As for his "discovery" of sync not working.... DUH!!!!!

The REAL sync is usually a privileged operation, sent from the OS, and not highly documented.

For example on a Mac the REAL sync in OS9 is a jhook trap and not the documented normal OS call which has a governor on it.

Mainframes such as PRIMOS and other old mainframes including even unix typically faked the sync command and ONLY allowed it if the user was at the actual physical systems console and furthermore logged in as a root or backup operator.

This cheating always sickened me, but all OSes do this because so many people who think they know what they are doing try to sync all the time for idiotic self-rolled journalling file systems and journalled databases.

But DRIVES, except a couple S-ATA seagates from 2004 with bad firmware, ALWAYS will flush.

This author should have explained that it's not the hard drives.

They perform as documented.

Admittedly Linux used to corrupt and not flush several years ago... but it was not the IDE drives. They never got the commands.

It's all a mess... but setting a DRIVE to not cache is NOT the solution! It's retarded to do so, and all the comments in this thread talking of setting the cache off are foolish.

As for caching device topics, there are many options.

1> SCSI WCE permanent option

2> ATA Seagate Set Features command 82h Disable write cache

3> ATA config commands sent over SCSI (RAID card) device using a SCSI CDB in passthrough. It uses 16 byte CDB with 8h, or 12 byte CDB with Ah for sending the tunneled command.

4> ATA ATAPI commands for WCE bit, as if it was SCSI

Fibre Channel drives of course honor SCSI commands.

As for mere flushing, a variety of low level calls all have the same desired effect and are documented in respective standards manuals.

- Re:Author lied when implied that DRIVES are the is (Score:3, Interesting)
  
  by Sinner ( 3398 ) writes:
  
  Parent either doesn't know what he's talking about, or is a troll. Pity there isn't an "incoherent rant" moderation option, or we could avoid the ambiguity.
- - - Just trying to figure out whose fault it is (Score:3, Informative)
      
      by leehwtsohg ( 618675 ) writes:
      
      fsync(2) man does state:
      fsync copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.
      But then it goes on to state:
      NOTES
      In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.
      
      Which, as you point out, can be a BAD THING (TM) if someone opens a window. So, who should change? fsync, and its man page's NOTES for devices that have a cache but actually are capable of flushing that cach
- - Re:drive write caching _is unsafe_. (Score:5, Informative)
    
    by putaro ( 235078 ) writes: on Friday May 13, 2005 @09:37AM (#12518958) Journal
    
    Let's try a reply with a bit less flame attached.
    
    A journaling file system will know when it needs to get everything committed to disk in order to have a consistent state. At that point it will issue a sync to the drive to flush the drive's write cache. However, not every write has to get to the disk for the filesystem to be in a consistent state.
    
    Now, you're yelling BS, BS, BS...hold on and listen for a minute. I write file systems for a living and have done so for over 15 years.
    
    What is the commitment that a journaling file system makes to you? It makes the commitment that it will not be in an inconsistent state. It doesn't make the commitment that every last write will make it to disk. For example, ext3 in journaling mode only journals metadata transactions. Any data writes that you make are not guaranteed at all, unless you make the proper sync call. As someone pointed out above, fsync is not the proper call on many OS's.
    
    The way that we have settled on to make filesystems and databases work is to create atomic transactions and move from transaction to transaction. If a transaction fails (for any reason, but let's just assume it's because the system crashed), all of the data that was written as part of it is discarded when you restart. If the partial data was not discarded then the filesystem would be in an inconsistent state AND the data that you were writing (if you care about consistency) would be in an inconsistent state. So, forcing every write to immediately go to disk is pointless as if the transaction you're doing is interrupted you'll be discarding the data anyway. It's only when you are finishing the transaction that you need to make sure that everything is on disk. By that time it might be already, especially if that transaction was large.
    
    Let's take a simple situation. Say that you have a filesystem that guarantees that every time you do a write() call, when the call returns that data will be on disk and available for you the next time, and that if the write() errors or does not return, the file will be as it was before the write() was called. Now, you do a write of 100MB with a single call. The filesystem may scatter that data all over the disk depending on how fragmented it is. Forcing each write to disk in order will bang the head a lot and reduce your performance. By letting the write cache do its job and reorder writes as necessary, your performance will be much better. (We used to do this in the driver and file system cache. However, modern disk drives provide such an abstract interface that it's nearly impossible for the OS to micromanage write ordering. In the old days the OS knew where the head was because it told the damn drive where to put it. Now, you can sort of guess and you're usually wrong.) Cache on ATA drives tops out at around 16MB, so you will definitely flush most of the data out of the cache in the course of writing anyway. Finally, at the end, before returning, the FS would sync the drive's cache to the disks and mark the transaction as closed. Were the system to crash in the middle of the write, then when the system restarts it would need to discard any data that might have been written, and it wouldn't matter which data had been written or not written. (Important note: Journaling file systems and databases have a recovery process after a crash. It's just a lot less involved than running fsck or CHKDSK over the whole disk.)
    
    So, write caching is valuable and widely used. In order to avoid data corruption it's not necessary to turn off caching but it is necessary for the cache to do what it is told, when it is told (all of the write caches too, not just the disk's). Were the disks truly lying to the OS it would be bad. More likely, this guy's Perl script is just not OS specific enough to get the OS to really do what he thinks he is asking it to do. There's a reason why serious data management apps need to be ported and certified on an OS. Getting everything to do its job right is tough.
    
Not really a Lie (Score:3, Informative)

by bgog ( 564818 ) * writes: on Friday May 13, 2005 @03:52AM (#12517355) Journal

It's not a lie. fsync syncs to a device. The device is a hard drive with a cache.

You'd expect an fsync to complete only when the data is physically written to disk. However, usually this is not the case: it completes only when the data is fully written to the cache on the physical disk.

The downside of this is that it's possible to lose data if you pull the power plug (usually not just by hitting the power switch). However, if the disks were to actually commit fully to the physical media on every fsync you would see a very, very dramatic performance degradation. Not just a little slower so you look bad in a magazine article, but incredibly slow, especially if you are running a database or similar application that fsyncs often.

Server class machines solve this problem by providing battery-backed cache on their controllers. This allows full-speed operation by fsyncing only to cache, but if power is lost the data is still safe because of the battery.

This doesn't matter too much for the average Joe, for a number of reasons. First, when the power switch is hit, the disks tend to finish writing their caches before spinning down. In the case of a power failure, journaled file systems will usually keep you safe (but not always).

This is a big issue however if you are trying to implement an enterprise class database server on everyday hardware.

So turn off the write cache if you don't want it on but don't complain when your system starts to crawl.
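
To get a feel for the cost being argued about here, a rough sketch (not from the thread) that writes the same records twice, once with an fsync after every record and once with a single fsync at the end. The function name, record size and count are arbitrary assumptions. If the drive acknowledges fsync from its cache, the two timings will be much closer than they would be on a drive that really waits for the platters.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each):
    """Write records to path; fsync after every record or only once at the end."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if sync_each:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to the device
        f.flush()
        os.fsync(f.fileno())          # one final sync in both modes
    return time.perf_counter() - start


if __name__ == "__main__":
    records = [b"x" * 512 for _ in range(200)]
    with tempfile.TemporaryDirectory() as d:
        slow = write_records(os.path.join(d, "each.bin"), records, True)
        fast = write_records(os.path.join(d, "once.bin"), records, False)
    print(f"fsync per record: {slow:.4f}s, fsync once: {fast:.4f}s")
```

The numbers depend entirely on the OS, filesystem and drive, so treat them as a local measurement and nothing more.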

- Re:Not really a Lie (Score:4, Informative)
  
  by ravenspear ( 756059 ) writes: on Friday May 13, 2005 @04:09AM (#12517407)
  
  However, if the disks were to actually commit fully to the physical media on every fsync you would see a very, very dramatic performance degradation. Not just a little slower so you look bad in a magazine article, but incredibly slow, especially if you are running a database or similar application that fsyncs often.
  
  I think you are confusing write caching with fsyncing. Having no write cache to the disk would indeed slow things down quite a bit. I don't see how fsync fits the same description though. Simply honoring fsync (actually flushing the data to disk) would not slow things down anywhere near the same level as long as software makes intelligent use of it. Fsync is not designed to be used with every write to the disk, just for the occasional time when an application needs to guarantee certain data gets written.
  
He misunderstands fsync() (Score:4, Informative)

by Dahan ( 130247 ) writes: <khym@azeotrope.org> on Friday May 13, 2005 @04:07AM (#12517401)

According to SUSv3 [opengroup.org]:

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.

If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure.

(Emphasis added). If you don't want your hard drive to cache writes, send it a command to turn off the write cache. Don't rely on fsync(). Either that, or hack your kernel so that fsync() will send a SYNCHRONIZE CACHE command to the drive. That'll sync the entire drive cache though, not just the blocks associated with the file descriptor you passed to fsync().
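
For anyone who wants to see what their own system advertises, a small sketch (not from the thread) that queries the _POSIX_SYNCHRONIZED_IO option through sysconf. The function name is made up for illustration. A positive result only means the implementation claims the synchronized I/O option; it says nothing about whether the drive's own write cache gets flushed.

```python
import os


def synchronized_io_support():
    """Return the system's _POSIX_SYNCHRONIZED_IO value, or None if unknown."""
    name = "SC_SYNCHRONIZED_IO"
    # os.sysconf and os.sysconf_names exist only on Unix-like platforms.
    if not hasattr(os, "sysconf") or name not in os.sysconf_names:
        return None
    try:
        value = os.sysconf(name)
    except (OSError, ValueError):
        return None
    # sysconf reports -1 when the option is not supported.
    return value if value != -1 else None


if __name__ == "__main__":
    print("_POSIX_SYNCHRONIZED_IO:", synchronized_io_support())
```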

fsync IS important (Score:2, Informative)

by carstenkuckuk ( 132629 ) writes:

fsync semantic is needed whenever you want to implement ACID transactions. This lies at the core of database systems and journaling file systems, for example. No fsync, no data integrity.
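
A toy illustration of that point (a sketch, not how any real database or filesystem lays out its log): an append-only journal in which a transaction only counts once its COMMIT record has been written and fsynced, and a recovery pass that throws away anything without a COMMIT. The class name, record format and helper names are all made up for the example, and keys are assumed to contain no spaces.

```python
import os


class Journal:
    """Append-only log; a transaction counts only once its COMMIT is synced."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "ab")

    def write_transaction(self, txid, updates, commit=True):
        self.f.write(f"BEGIN {txid}\n".encode())
        for key, value in updates.items():
            self.f.write(f"SET {key} {value}\n".encode())
        if commit:
            self.f.write(f"COMMIT {txid}\n".encode())
            self.f.flush()
            os.fsync(self.f.fileno())  # the one point where durability matters

    def close(self):
        self.f.close()


def recover(path):
    """Replay the log, discarding any transaction without a COMMIT record."""
    state, pending = {}, {}
    with open(path, "rb") as f:
        for line in f.read().decode().splitlines():
            parts = line.split(" ", 2)
            if parts[0] == "BEGIN":
                pending = {}
            elif parts[0] == "SET" and len(parts) == 3:
                pending[parts[1]] = parts[2]
            elif parts[0] == "COMMIT":
                state.update(pending)
                pending = {}
    return state
```

The whole scheme rests on the fsync before returning from a commit actually reaching stable storage. If that call returns early, a transaction the application believes is committed can vanish after a power cut.
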
RTFM (Score:2, Informative)

by BigYawn ( 842342 ) writes:

From the fsync man page (section "NOTES"):
In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.
When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync.
On kernels before 2.4, fsync on big files can be inefficient. An alternative might be to use the O_SYNC flag to open(2).
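
The O_SYNC alternative mentioned in that last note looks roughly like this. It is a minimal sketch with a made-up function name: the flag is looked up defensively because not every platform defines it, and the same caveat from the NOTES applies, since O_SYNC cannot see past a drive's own write cache either.

```python
import os


def write_synchronously(path, data):
    """Write data through a descriptor opened with O_SYNC where available."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    flags |= getattr(os, "O_SYNC", 0)  # absent on some platforms (e.g. Windows)
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
    finally:
        os.close(fd)
```
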
power or non-volatile memory in disk (Score:3, Informative)

by cahiha ( 873942 ) writes: on Friday May 13, 2005 @04:55AM (#12517541)

Well, it's unlikely this is going to change. The real solution is to give power long enough to the disk drive to let it complete its writes no matter what, and/or to add non-volatile or flash memory to the disk drive so that it can complete its writes after coming back up.

There is a fairly simple external solution for that: a UPS. They're good. Get one.

And even then it is not guaranteed that just because you write a block, you can read it again, because nothing can guarantee it. So, file systems need to deal, one way or another, with the possibility that this case occurs.

Examples from the World of Windows. (Score:5, Interesting)

by stereoroid ( 234317 ) writes: on Friday May 13, 2005 @05:56AM (#12517769) Homepage Journal

Microsoft have had a few problems in this area - see KB281672 [microsoft.com] for example.
Then they released Windows 2000 Service Pack 3, which fixed some previous caching bugs, as documented in KB332023 [microsoft.com]. The article tells you how to set up the "Power Protected Write Cache" option, which is your way of saying "yes, my storage has a UPS or battery-backed cache, give me the performance and let me worry about the data integrity".

I work for a major storage hardware vendor: to cut a long story short, we knew fsync() (a.k.a. "write-through" or "synchronize cache") was working on our hardware, when the performance started sucking after customers installed W2K SP3, and we had to refer customers to the latter article.

The same storage systems have battery-backed cache, and every write from cache to disks is made write-through (because drive cache is not battery-backed). In other words, in these and other Enterprise-class systems, the burden of honouring fsync() / write-through commands from the OS has switched to the storage controller(s), the drives might as well have no cache for all we care. But it still matters that the drives do honour the fsync() we send to them from cache, and not signal "clear" when they're not - if they lie, the cache drops that data, and no battery will get it back..!

Much ado about nothing (Score:5, Informative)

by jgarzik ( 11218 ) writes: on Friday May 13, 2005 @09:50AM (#12519114) Homepage

All it would have taken is ten minutes of searching on Google to discover what is going on.

You need a vaguely recent 2.6.x kernel to support fsync(2) and fdatasync(2) flushing your disk's write cache. Previous 2.4.x and 2.6.x kernels would only flush the write cache upon reboot, or if you used a custom app to issue the 'flush cache' command directly to your disk.

Very recent 2.6.x kernels include write barrier support, which flushes the write cache when the ext3 journal gets flushed to disk.

If your kernel doesn't flush the write cache, then obviously there is a window where you can lose data. Welcome to the world of write-back caching, circa 1990.

If you are stuck without a kernel that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) command, it is trivial to write a userspace utility that issues the command.

Jeff, the Linux SATA driver guy
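
For the curious, a sketch of what such a userspace utility might look like on Linux. This is not Jeff's code and has not been checked against any particular kernel or drive. It assumes the HDIO_DRIVE_CMD ioctl (0x031F in linux/hdreg.h) and the ATA FLUSH CACHE opcode (0xE7), needs root, and the device path is whatever your disk happens to be.

```python
import fcntl
import os

HDIO_DRIVE_CMD = 0x031F   # ioctl request number from <linux/hdreg.h>
ATA_FLUSH_CACHE = 0xE7    # ATA FLUSH CACHE command opcode


def flush_args():
    """Build the 4-byte HDIO_DRIVE_CMD buffer: command byte, then three zeroed parameters."""
    return bytearray([ATA_FLUSH_CACHE, 0, 0, 0])


def flush_drive_cache(device):
    """Ask the drive behind `device` (for example /dev/sda) to flush its write cache."""
    fd = os.open(device, os.O_RDONLY | os.O_NONBLOCK)
    try:
        # Raises OSError if the kernel or drive rejects the command.
        fcntl.ioctl(fd, HDIO_DRIVE_CMD, flush_args())
    finally:
        os.close(fd)
```

Something like flush_drive_cache("/dev/sda") run as root would be the usage, with the path adjusted to the actual disk. SCSI disks would need SYNCHRONIZE CACHE through a different interface, which this sketch does not cover.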

Put a capacitor on the harddrive (Score:3, Interesting)

by kublikhan ( 838265 ) writes: on Friday May 13, 2005 @01:03PM (#12521277)

Couldn't they just stick a large capacitor or small battery on the harddrive that is only used for flushing the write cache to the platters in the event of a power failure? It should be a simple enough matter, we only need a few seconds here, and it would solve this whole mess.

- Re:Which ones ? (Score:5, Interesting)
  
  by ewhac ( 5844 ) writes: on Friday May 13, 2005 @03:46AM (#12517330) Homepage Journal
  
  Can someone explain how OSes could lie?
  
  Easy. The driver gets a 'sync' command from the OS. However, the driver writer believes that most other programmers call fsync() when they don't really need to, and decides to "optimize" this case. So he passes the command on to the drive, but returns immediately (allowing the drive command to complete asynchronously). This makes his driver appear faster.
  
  Fortunately, most driver writers have their priorities straight about data integrity, so this kind of thinking isn't very common.
  
  Schwab
  
- Re:fsync question (Score:3, Informative)
  
  by tomstdenis ( 446163 ) writes:
  
  Use reiserfs?
  
  At least then the file is either there or not there.
  
  My Gentoo box has been through a few brownouts/powerouts [I have a UPS now ...] and hasn't skipped a beat. It even comes back up on its own [go Asus bios ;-)] when I'm say on another continent ;-)
  
  Tom
