Data Storage Bug

Your Hard Drive Lies to You (512 comments)

fenderdb writes "Brad Fitzgerald of LiveJournal fame has written a utility and a quick article on how hard drives, from consumer models up to the highest-grade 'enterprise' SCSI and SATA drives, do not obey the fsync() function. Manufacturers are blatantly sacrificing integrity in favor of scoring higher on 'pure speed' performance benchmarking."

  • by Tetard ( 202140 ) on Friday May 13, 2005 @03:24AM (#12517243)
    Write caching is enabled by default on most IDE/ATA drives; most SCSI drives don't enable it. If you don't like it, turn it off. There's no "lying", and I'm sure the fsync() function doesn't know diddly squat about the cache of your disk. Maybe the ATA/device abstraction layer does, and I'm sure there's a configurable registry/sysctl/frob you can twiddle to make it DTRT (like FreeBSD has).

    Move along, nothing to see...
  • Re:What's this? (Score:1, Informative)

    by Anonymous Coward on Friday May 13, 2005 @03:29AM (#12517261)
    > 1 billion bytes equals 1 gigabyte - since when?

    Billion has equalled Giga since forever.

    Then people with computers decided close enough is good enough (the LAST people who should have done such a braindead thing) and made the kilo, mega, giga, tera, etc. prefixes equal to the closest binary values (1024, 1048576, and so on), and it's confused everybody ever since.

    What's worse is that not all kilo/mega/giga in computing actually mean 1024/1048576/etc., just some. One gigabyte? One gigahertz? One gigabit/second?
  • by ewhac ( 5844 ) on Friday May 13, 2005 @03:35AM (#12517281) Homepage Journal
    Yes, except there is a 'sync' command packet that is supposed to make the drive commit outstanding buffers to the platters, and not signal completion until those writes are done. It would appear, at first blush, that the drives are mis-handling this command when write-caching is enabled.

    There is historical precedent for this. There were recorded incidents of drives corrupting themselves when the OS, during shutdown, tried to flush buffers to the disk just before killing power. The drive said, "I'm done," when it really wasn't, and the OS said Okay, and killed power. This was relatively common on systems with older, slower disks that had been retrofitted with faster CPUs.

    However, once these incidents started occurring, the issue was supposed to have been fixed. Clearly, closer study is needed here to discover what's really going on.

    Schwab

  • Re:What's this? (Score:2, Informative)

    by Anonymous Coward on Friday May 13, 2005 @03:38AM (#12517299)
    1 billion bytes equals 1 gigabyte - since when?

    Since 1960 [wikipedia.org]. Since 1998 [wikipedia.org], 2^30 bytes = 1 gibibyte.

  • by Anonymous Coward on Friday May 13, 2005 @03:42AM (#12517319)
    The author lied when he implied that the DRIVES are the issue.

    ATA-IDE, SCSI, and S-ATA drives from all major manufacturers will accept commands to completely flush the write buffer, including the track cache buffer.

    These commands are critical before cutting power and "sleeping" in machines that can perform a complete "deep sleep" (no power at all whatsoever sent to the ATA-IDE drive).

    Such OSes include Apple's OS 9 on a G4 tower, and some versions of OS X on machines not supplied with certain naughty video cards.

    Laptops, for example, need to flush drives... AND THEY DO.

    All drives conform.

    As for DRIVER AUTHORS not heeding the special calls sent to them.... he is correct.

    Many driver writers (other than me) are loser shits that do not follow standards.

    As for LSI RAID cards, he is right, and other RAID cards too... that is because those products are defective. But the drives are not, and the drivers COULD be written to honor a true flush.

    As for his "discovery" of sync not working.... DUH!!!!!

    the REAL sync is usually a privileged operation, sent from the OS, and not highly documented.

    For example on a Mac the REAL sync in OS9 is a jhook trap and not the documented normal OS call which has a governor on it.

    Mainframes such as PRIMOS and other old mainframes, including even Unix systems, typically faked the sync command and ONLY allowed it if the user was at the actual physical system console and furthermore logged in as root or a backup operator.

    This cheating always sickened me. But all OSes do this, because so many people that think they know what they are doing try to sync all the time for idiotic self-rolled journalling file systems and journalled databases.

    But DRIVES, except a couple of S-ATA Seagates from 2004 with bad firmware, will ALWAYS flush.

    This author should have explained that it's not the hard drives.

    They perform as documented.

    Admittedly Linux used to corrupt and not flush several years ago... but it was not the IDE drives. They never got the commands.

    It's all a mess... but setting a DRIVE to not cache is NOT the solution! It's retarded to do so, and all the comments in this thread talking of turning the cache off are foolish.

    As for caching device topics, there are many options.

    1> SCSI WCE permanent option

    2> ATA Seagate Set Features command 82h Disable write cache

    3> ATA config commands sent over a SCSI (RAID card) device using a SCSI CDB in passthrough. It uses a 16-byte CDB with 8h, or a 12-byte CDB with Ah, for sending the tunneled command.

    4> ATA ATAPI commands for the WCE bit, as if it was SCSI

    Fibre Channel drives of course honor SCSI commands.

    As for mere flushing, a variety of low level calls all have the same desired effect and are documented in respective standards manuals.

  • by Dorsai65 ( 804760 ) <dkmerriman.gmail@com> on Friday May 13, 2005 @03:42AM (#12517320) Homepage Journal
    What the article is saying is that the drive (or sometimes the RAID card and/or OS) is lying (with fsync) when it answers that it wrote the data: it didn't; so when you lose power, the data that was in cache (and should have been written) gets lost. It isn't a question of whether caching is turned on or not, but of whether the drive truthfully says the data was actually written.
  • Here's how (Score:5, Informative)

    by Moraelin ( 679338 ) on Friday May 13, 2005 @03:44AM (#12517322) Journal
    For example, don't think "home user losing the last porn pic", think for example "corporate databases using XA transactions".

    The semantics of XA transactions say that at the end of the "prepare" step, the data is already on the disc (or whatever other medium), just not yet made visible. That is, basically everything that could possibly fail has already had its chance to fail. And if you got an OK, then it didn't.

    Introducing a time window (likely extending not just past "prepare", but also past "commit") where the data is still in some cache and God knows when it'll actually get flushed, throws those whole semantics out the window. If, say, power fails (e.g., PSU blows a fuse) or shit otherwise hits the fan in that time window, you have fucked up the data.

    The whole idea of transactions is ACID: Atomicity, Consistency, Isolation, and Durability:

    - Atomicity - The entire sequence of actions must be either completed or aborted. The transaction cannot be partially successful.

    - Consistency - The transaction takes the resources from one consistent state to another.

    - Isolation - A transaction's effect is not visible to other transactions until the transaction is committed.

    - Durability - Changes made by the committed transaction are permanent and must survive system failure.

    That time window we introduced makes it at least possible to screw up 3 out of 4 there. An update that involves more than one hard drive may not be Atomically executed in that case: only one change was really persisted. (E.g., if you booked a flight online, maybe the money got taken from your account, but not given to the airline.) It hasn't left the data in a Consistent state. (In the above example some money has disappeared into nowhere.) And it's all because it wasn't Durable. (An update we thought we committed hasn't, in fact, survived a system failure.)
  • by Anonymous Coward on Friday May 13, 2005 @03:49AM (#12517343)

    Since 1960 [wikipedia.org], 1 kilobyte = 1000 bytes. Just like 1 kilometre = 1000 metres. Since 1998 [wikipedia.org], 2^10 bytes = 1 kibibyte.

  • by Johan Veenstra ( 61679 ) on Friday May 13, 2005 @03:50AM (#12517347)
    kilo = 10^3 = 1,000
    mega = 10^6 = 1,000,000
    giga = 10^9 = 1,000,000,000

    kibi = 2^10 = 1,024
    mebi = 2^20 = 1,048,576
    gibi = 2^30 = 1,073,741,824

    So it's not the hard drive manufacturers that are wrong. You get 1 gigabyte of hard disk space for every gigabyte advertised. When you're buying 1 gigabyte of memory you get 74 megabytes for free (because you actually get 1 gibibyte).
  • by stud9920 ( 236753 ) on Friday May 13, 2005 @03:52AM (#12517353)
    While it may be a dubious abuse of popular usage, the hardware manufacturers are correct about what a kilobyte, megabyte, or gigabyte is: in the SI system, we do not use powers of 2 (2^10, 2^20, 2^30), but powers of 10 (10^3, 10^6, 10^9). That's already what the data transmission guys do with kilobits, megabits, gigabits, and no one ever complains about them because they are correct.

    There is no reason to make an exception, the use of kilo, mega, giga was abuse in the first place (although acceptable in engineering terms, it's only a 2.4% error)

    The standards bodies have produced the horrendous prefixes kibi-, mebi-, gibi- for your binary needs. They're horrible, but they're the only correct ones.
  • Not really a Lie (Score:3, Informative)

    by bgog ( 564818 ) * on Friday May 13, 2005 @03:52AM (#12517355) Journal
    It's not a lie. fsync syncs to a device. The device is a hard drive with a cache.

    You'd expect an fsync to complete only when the data is physically written to disk. However, usually this is not the case: it completes as soon as the data is fully written to the cache on the physical disk.

    The downside of this is that it's possible to lose data if you pull the power plug (usually not just by hitting the power switch). However, if the disks were to actually commit fully to the physical media on every fsync you would see a very, very dramatic performance degradation. Not just a little slower so you look bad in a magazine article, but incredibly slow, especially if you are running a database or similar application that fsyncs often.

    Server class machines solve this problem by providing battery-backed cache on their controllers. This allows full-speed operation by fsyncing only to cache; if power is lost, the data is still safe because of the battery.

    This doesn't matter too much for the average joe for a number of reasons. First, when the power switch is hit, the disks tend to finish writing their caches before spinning down. In the case of a power failure, journaled file systems will usually keep you safe (but not always).

    This is a big issue however if you are trying to implement an enterprise class database server on everyday hardware.

    So turn off the write cache if you don't want it on but don't complain when your system starts to crawl.
  • by Rinzwind ( 870478 ) on Friday May 13, 2005 @04:01AM (#12517381)
    Why am I not surprised at this? First, they decide that a kilobyte = 1000 bytes, rather than the correct value of 1024. This leads the megabyte to be 1000 kilobytes, again, rather than 1024. The gig is likewise 1000 megabytes. You might think, ok, big deal, right?
    Wrong. If you start ranting, get your FACTS STRAIGHT. It was solved in 1998 already.
    The Standards

    Although computer data is normally measured in binary code, the prefixes for the multiples are based on the metric system. The nearest binary number to 1,000 is 2^10 or 1,024; thus 1,024 bytes was named a Kilobyte. So, although a metric "kilo" equals 1,000 (e.g. one kilogram = 1,000 grams), a binary "Kilo" equals 1,024 (e.g. one Kilobyte = 1,024 bytes). Not surprisingly, this has led to a great deal of confusion. In December 1998, the International Electrotechnical Commission (IEC) approved a new IEC International Standard. Instead of using the metric prefixes for multiples in binary code, the new IEC standard invented specific prefixes for binary multiples made up of only the first two letters of the metric prefixes and adding the first two letters of the word "binary". Thus, for instance, instead of Kilobyte (KB) or Gigabyte (GB), the new terms would be kibibyte (KiB) or gibibyte (GiB).

    Here are brief summaries of the IEC Standard:

    bit bit 0 or 1
    byte B 8 bits
    kibibit Kibit 1024 bits
    kilobit kbit 1000 bits
    kibibyte (binary) KiB 1024 bytes
    kilobyte (decimal) kB 1000 bytes
    megabit Mbit 1000 kilobits
    mebibyte (binary) MiB 1024 kibibytes
    megabyte (decimal) MB 1000 kilobytes
    gigabit Gbit 1000 megabits
    gibibyte (binary) GiB 1024 mebibytes
    gigabyte (decimal) GB 1000 megabytes
    terabit Tbit 1000 gigabits
    tebibyte (binary) TiB 1024 gibibytes
    terabyte (decimal) TB 1000 gigabytes
    petabit Pbit 1000 terabits
    pebibyte (binary) PiB 1024 tebibytes
    petabyte (decimal) PB 1000 terabytes
    exabit Ebit 1000 petabits
    exbibyte (binary) EiB 1024 pebibytes
    exabyte (decimal) EB 1000 petabytes
    Check this: http://www.romulus2.com/articles/guides/misc/bitsbytes.shtml [romulus2.com] and this: http://www.physics.nist.gov/Pubs/SP811/sec04.html#tab5 [nist.gov] Stop spreading FUD.
  • by ArbitraryConstant ( 763964 ) on Friday May 13, 2005 @04:02AM (#12517386) Homepage
    "What surprised me is that the manufacturers have their own bad sector table, so when you get the disk it's fairly likely that there are already bad areas which have been mapped out."

    Can't you get the count with SMART?
  • by spectecjr ( 31235 ) on Friday May 13, 2005 @04:05AM (#12517394) Homepage
    using the wrong definitions to make their products seem bigger. I bought a P4 2.4GHz CPU the other day, and was shocked to find it wasn't 2,576,980,377.6Hz like it should be! Lying thieves...

    Sad to see this post being marked "insightful". 2.4GHz has always meant 2,400,000,000 cycles per second, and nothing else. No matter what speed your crystal clocks at.

    The original poster was being ... kind of sarcastic.
  • by Dahan ( 130247 ) <khym@azeotrope.org> on Friday May 13, 2005 @04:07AM (#12517401)
    According to SUSv3 [opengroup.org]:
    The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.
    If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure.
    (Emphasis added). If you don't want your hard drive to cache writes, send it a command to turn off the write cache. Don't rely on fsync(). Either that, or hack your kernel so that fsync() will send a SYNCHRONIZE CACHE command to the drive. That'll sync the entire drive cache though, not just the blocks associated with the file descriptor you passed to fsync().
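    For illustration, a minimal sketch in C of the usual calling pattern (checking fsync()'s return value, which many programs skip); the file name is hypothetical, and per the SUSv3 text quoted above, how far the data actually gets is implementation-defined:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical example: append a record and ask for durability.
     * fsync() flushes the kernel's buffers to the device; whether the
     * drive's own write cache is flushed too is implementation-defined. */
    int main(void)
    {
        const char rec[] = "important record\n";
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1)) {
            perror("write"); return 1;
        }
        if (fsync(fd) < 0) {            /* push kernel buffers to the device */
            perror("fsync"); return 1;
        }
        return close(fd);
    }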
  • Re:Not really a Lie (Score:4, Informative)

    by ravenspear ( 756059 ) on Friday May 13, 2005 @04:09AM (#12517407)
    However, if the disks were to actually commit fully to the physical media on every fsync you would see a very, very dramatic performance degradation. Not just a little slower so you look bad in a magazine article, but incredibly slow, especially if you are running a database or similar application that fsyncs often.

    I think you are confusing write caching with fsyncing. Having no write cache to the disk would indeed slow things down quite a bit. I don't see how fsync fits the same description though. Simply honoring fsync (actually flushing the data to disk) would not slow things down anywhere near the same level as long as software makes intelligent use of it. Fsync is not designed to be used with every write to the disk, just for the occasional time when an application needs to guarantee certain data gets written.
  • Re:What's this? (Score:5, Informative)

    by thsths ( 31372 ) on Friday May 13, 2005 @04:10AM (#12517414)
    > 1,000,000,000 bytes != 1 Gigabyte

    Actually, it is. The standard was updated in 1998 to avoid confusion (Standard IEC 60027-2). Giga is 10^9, and it is constant, which means it does not change just because you use it for hard disks or memory.

    If you mean 2^30, then you have to say gigabinary, abbreviated as gibi or Gi. Having different names for different things can avoid an awful lot of confusion, so I would very much recommend using them.

    And now please put the following events into the correct order: America goes metric, hell freezes over, people use Gibi correctly.
  • by spectecjr ( 31235 ) on Friday May 13, 2005 @04:10AM (#12517416) Homepage
    When both the journal and the data are in the write cache of the drive, the data on the platters is in an undefined state. Loss of power means filesystem corruption -- just the thing a JFS is supposed to avoid.

    ... except most drives use the angular momentum of the drive, the power left in the PSU and any spare voltage in the on-board capacitors to provide the power to finish writing and park the drive heads.

    At least, that was the state of the art in the early 90s.
  • fsync IS important (Score:2, Informative)

    by carstenkuckuk ( 132629 ) on Friday May 13, 2005 @04:13AM (#12517424)
    fsync semantics are needed whenever you want to implement ACID transactions. This lies at the core of database systems and journaling file systems, for example. No fsync, no data integrity.
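    To make that concrete, here is a hedged sketch (not any particular database's code; the helper names are made up for illustration) of the ordering a journalled commit depends on, which a drive that acknowledges fsync() early can silently break:

    #include <sys/types.h>
    #include <unistd.h>

    /* Write-ahead logging in miniature: the journal record must be durable
     * (step 1) before the in-place update begins (step 2).  If the drive
     * acknowledges fsync() while the record still sits in its write cache,
     * a power cut between steps 2 and 3 leaves a modified data file with
     * no journal record to recover from. */
    static int durable_append(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        return fsync(fd);              /* step 1: journal on stable storage */
    }

    int commit(int journal_fd, int data_fd, off_t off, const void *rec, size_t len)
    {
        if (durable_append(journal_fd, rec, len) < 0)
            return -1;
        if (pwrite(data_fd, rec, len, off) != (ssize_t)len)   /* step 2: apply */
            return -1;
        return fsync(data_fd);         /* step 3: data on stable storage */
    }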
  • by kasperd ( 592156 ) on Friday May 13, 2005 @04:16AM (#12517433) Homepage Journal
    It's not flash (EEPROM), it's battery-backed RAM.

    The suggestion was to use both, which I agree is a good idea, because you get the best of both worlds. Flash has a problem with being overwritten many times, which the suggested design solves by only using it in case of loss of power. Battery-backed RAM has a problem with potential data loss if it needs to keep the data longer than there is battery power, which the suggested design solves by writing data to flash as soon as main power is lost. I hope what Samsung is planning [anandtech.com] will also take care of those problems.
  • by deadcujo ( 233151 ) on Friday May 13, 2005 @04:21AM (#12517449) Homepage
    His surname is actually Fitzpatrick and not Fitzgerald.
  • by Anonymous Coward on Friday May 13, 2005 @04:23AM (#12517451)
    Please note that the definition says "...all data ... is to be transferred to the storage device...", and that is what fsync() is actually doing! fsync() is not required to transfer the data to the physical disk; just sending it to the hard disk is enough. Now what happens inside the hard disk is not of interest to fsync(): the idea is to flush all buffers in the software, and the specs are not talking about the buffers in the hardware.
  • RTFM (Score:2, Informative)

    by BigYawn ( 842342 ) on Friday May 13, 2005 @04:25AM (#12517457)
    From the fsync man page (section "NOTES"):

    In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.
    When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync.
    On kernels before 2.4, fsync on big files can be inefficient. An alternative might be to use the O_SYNC flag to open(2).

  • by cahiha ( 873942 ) on Friday May 13, 2005 @04:55AM (#12517541)
    Well, it's unlikely this is going to change. The real solution is to keep power to the disk drive long enough to let it complete its writes no matter what, and/or to add non-volatile or flash memory to the disk drive so that it can complete its writes after coming back up.

    There is a fairly simple external solution for that: a UPS. They're good. Get one.

    And even then there is no guarantee that just because you wrote a block, you can read it again; nothing can guarantee that. So file systems need to deal, one way or another, with the possibility that this case occurs.
  • Re:What's this? (Score:3, Informative)

    by KiloByte ( 825081 ) on Friday May 13, 2005 @04:56AM (#12517545)
    Well, they are usually (correctly) labelled as 1440KB instead of 1.44MB. They have 2 sides and 80 tracks, with 18 512-byte sectors on each track.

    It's a real 1440KB, without cheating on the sector headers and/or inter-sector gaps. If you format the floppy yourself, you can shave quite a bit of space from those gaps, and this was quite a popular thing to do.
  • Re:What's this? (Score:4, Informative)

    by Crayon Kid ( 700279 ) on Friday May 13, 2005 @04:59AM (#12517559)
    I was also under the (wrong) impression that giga was the good old binary thing, and that gibi was something they made up to express the decimal alternatives. In fact it's quite the contrary, thanks to the parent poster. :)

    Having repented, I point you to this reference [absoluteastronomy.com], which does a very nice job of summing everything up.
  • by cowbutt ( 21077 ) on Friday May 13, 2005 @05:00AM (#12517560) Journal
    Sort of, yes:
    # smartctl -a /dev/hde | grep 'Reallocated_Sector_Ct'
    5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
    This indicates that /dev/hde is far from exhausting its supply of reserved blocks (the first 100) and never has been (the second 100, which is 'worst'). When it crosses the threshold (36) (or the threshold of any of the other 'Pre-fail' attributes for that matter), failure is imminent.
  • by Everleet ( 785889 ) on Friday May 13, 2005 @05:13AM (#12517596)
    fsync() is pretty clearly documented to cause a flush of the kernel buffers, not the disk buffers. This shouldn't come as a surprise to anyone.

    From Mac OS X --

    DESCRIPTION
    Fsync() causes all modified data and attributes of fd to be moved to a
    permanent storage device. This normally results in all in-core modified
    copies of buffers for the associated file to be written to a disk.

    Note that while fsync() will flush all data from the host to the drive
    (i.e. the "permanent storage device"), the drive itself may not physi-
    cally write the data to the platters for quite some time and it may be
    written in an out-of-order sequence.

    Specifically, if the drive loses power or the OS crashes, the application
    may find that only some or none of their data was written. The disk
    drive may also re-order the data so that later writes may be present
    while earlier writes are not.

    This is not a theoretical edge case. This scenario is easily reproduced
    with real world workloads and drive power failures.

    For applications that require tighter guarantess about the integrity of
    their data, MacOS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC
    fcntl asks the drive to flush all buffered data to permanent storage.
    Applications such as databases that require a strict ordering of writes
    should use F_FULLFSYNC to ensure their data is written in the order they
    expect. Please see fcntl(2) for more detail.

    From Linux --

    NOTES
    In case the hard disk has write cache enabled, the data may not really
    be on permanent storage when fsync/fdatasync return.

    From FreeBSD's tuning(7) --

    IDE WRITE CACHING
    FreeBSD 4.3 flirted with turning off IDE write caching. This reduced
    write bandwidth to IDE disks but was considered necessary due to serious
    data consistency issues introduced by hard drive vendors. Basically the
    problem is that IDE drives lie about when a write completes. With IDE
    write caching turned on, IDE hard drives will not only write data to disk
    out of order, they will sometimes delay some of the blocks indefinitely
    under heavy disk load. A crash or power failure can result in serious
    file system corruption. So our default was changed to be safe. Unfortu-
    nately, the result was such a huge loss in performance that we caved in
    and changed the default back to on after the release. You should check
    the default on your system by observing the hw.ata.wc sysctl variable.
    If IDE write caching is turned off, you can turn it back on by setting
    the hw.ata.wc loader tunable to 1. More information on tuning the ATA
    driver system may be found in the ata(4) man page.

    There is a new experimental feature for IDE hard drives called
    hw.ata.tags (you also set this in the boot loader) which allows write
    caching to be safely turned on. This brings SCSI tagging features to IDE
    drives. As of this writing only IBM DPTA and DTLA drives support the
    feature. Warning! These drives apparently have quality control problems
    and I do not recommend purchasing them at this time. If you need perfor-
    mance, go with SCSI.
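    Putting the Mac OS X excerpt above to use, a hedged sketch of the common pattern: prefer F_FULLFSYNC where the platform defines it, and fall back to plain fsync() with the weaker guarantees discussed in this thread.

    #include <fcntl.h>
    #include <unistd.h>

    /* Flush as far toward the platters as the platform allows. */
    static int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) != -1)   /* Mac OS X: ask the drive itself to flush */
            return 0;
        /* F_FULLFSYNC can fail on filesystems that don't support it; fall through. */
    #endif
        return fsync(fd);                   /* elsewhere: kernel buffers only */
    }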
  • by cowbutt ( 21077 ) on Friday May 13, 2005 @05:15AM (#12517603) Journal
    Part of me wonders if this explains the anecdotal stories that SCSI disks are more reliable than their cheaper ATA counterparts - even when they use the same physical hardware. Perhaps (and this is blind speculation) the drives with fewer errors get sold to the customers willing to pay more.

    Sort of. According to this paper from Seagate [seagate.com], the main differences between SCSI and ATA are:

    SCSI drives are individually tested, rather than tested in batch

    SCSI drives typically have a 5 year warranty, rather than 1 year for ATA (note that Seagate's ATA drives also have 5 years, and WD's Special Edition -JB ATA drives have 3 years).

    SCSI drives usually have higher rotational speeds (i.e. 10K or 15K RPM vs. 7200RPM)

    SCSI drives usually make use of the latest technology. ATA uses whatever older technology has been cost-engineered to a suitable price-point

    The physical and programming interface

    I also suspect that SCSI drives have a larger number of reserved blocks for remapping, and that they remap blocks on read operations when the ECC indicates that a block has crossed some threshold of near-unreadability. This would account for a) SCSI drives' lower capacities, and b) a report from a SCSI-using friend running BSD, who says that a 'remapping' message turned up in his syslog without any special action being needed to invoke it.

    By contrast, in my experience, ATA drives only remap failed blocks on write operations. Lots of people think that when a drive returns a read error on a file, it's only fit for the bin, but I've forced the remapping to take place by writing to the affected blocks (either by zeroing the entire partition or drive using dd or badblocks -w, or by removing the affected file then creating a large file that fills all unallocated space in a partition, then removing it to reclaim the space).

  • Re:Here's how (Score:3, Informative)

    by arivanov ( 12034 ) on Friday May 13, 2005 @05:28AM (#12517639) Homepage
    And this is the exact reason why any good SQL-based system must have means of integrity checking.

    As someone who has been writing database stuff for 10+ years now, I get really pissed off when I see lunatics raving on Acid about ACID. ACID in itself is not enough.

    You must have reference checking, offline integrity tests, as well as ongoing online integrity tests. Repeating your example: a transaction for buying tickets for a holiday must insert a record into the Requests table, the Tickets table, the Holidays table, etc., and you must have an offline tool (or, even better, a background thread) which checks that all records are present. In fact, for the same reason, in a well designed system you must violate 3rd normal form and have the integrity checking tool use the redundant data as well. Another alternative is a state load and a checksum across the state, storing it back in at least two different places (once again breaking 3rd normal form).

    If you do it this way you can get a working system even if ACID breaks (databases have bugs), you can recover if hardware breaks and most importantly you have a considerable level of fraud resistance.
  • by Anonymous Coward on Friday May 13, 2005 @05:29AM (#12517644)
    The idea is to flush all buffers in the software and the specs are not talking about the buffers in the hardware.

    That's nonsense. Applications that use fsync() do so in order to be certain that things are actually recorded in the hardware. It's by FAR the most important issue, and this is the whole purpose of fsync() --- a portable way of achieving it.
  • by pe1chl ( 90186 ) on Friday May 13, 2005 @05:35AM (#12517666)
    a report I had from a SCSI-using friend running BSD who reports that a 'remapping' message turned up in his syslog without needing any special action to invoke.

    SCSI drives can be set up to return "warning" codes like "I had trouble reading this sector but eventually I could read a good copy". When the driver is careful it will enable this, and when it occurs it will write back the sector to make sure a fresh copy is on the disk and/or it is remapped.
    Apparently BSD does this.

    By default, corrected sectors are just returned as OK. It is also possible to enable "auto remap on read" and the drive would be triggered to do the rewrite or remap by itself. Of course this means you have less control and less logging.
    (but you can read the remap tables)

    There are many details that can improve error handling, but not all of them are fully worked out. For example, in Linux RAID-1, when a read error occurs the action is to take the drive offline, read the sector from the other disk, and continue with 1 disk. Of course the proper handling would be to try writing the correct copy from the good disk back to the failed disk, and see if that fixes it. Only after several failures should the disk be taken offline, on the assumption that it has crashed.

    This has been like this for years, and is relatively easy to fix. I would be prepared to try fixing it but it seems one has to jump over many hurdles to get a fix in the kernel while not being the maintainer of the subsystem, and a mail to said person was not answered.
  • by Anonymous Coward on Friday May 13, 2005 @06:01AM (#12517803)
    There were many Linux defects where no track cache flush command was received by devices, but if you want one set of recent fixes for flush corruption ...

    refer to :

    -force-ide-cache-flush-on-shutdown-flush.patch
    -force-ide-cache-flush-on-shutdown-flush-fix.patch

    in Changes since 2.6.6-mm1

    ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.6/2.6.6-mm2/ [kernel.org]

    why the hell my informative parent post gets modded to only a "2" just because people do not like the truth is astounding.

    I was hoping this would happen to my INFORMATIVE post, because it just means I will not bother helping anyone on Slashdot again for another half-year absence from posting.

    I figure... why bother... the S/N ratio is such that no low-level coders seem to ever read Slashdot anymore in recent years.

    It's probably time for me to move to other sites as well.

    "2"! on the only FACTUAL and informative post in the entire damned thread!
  • And your point is? (Score:4, Informative)

    by Moraelin ( 679338 ) on Friday May 13, 2005 @06:02AM (#12517804) Journal
    Yes, nothing by itself is enough, not even XA transactions, but it can make your life a _lot_ easier. Especially if not all records are under your control to start with.

    E.g., the bank doesn't even know that the money is going to reserve a ticket on flight 705 of Elbonian United Airlines. It just knows it must transfer $100 from account A to account B.

    E.g., the travel agency doesn't even have access to the bank's records to check that the money has been withdrawn from your account. And it shouldn't ever have.

    So you propose... what? That the bank gets full access to the airline's business data, and that the airline can read all bank accounts, for those integrity checks to even work? I'm sure you can see how that wouldn't work.

    Yes, if you have a single database and it's all under your control, life is damn easy. It starts getting complicated when you have to deal with 7 databases, out of which 5 are in 3 different departments, and 2 aren't even in the same company. And where not everything is a database either: e.g., where one of the things which must also happen atomically is sending messages on a queue.

    _Then_ XA and ACID become a lot more useful. It becomes one helluva lot easier to _not_ send, for example, a JMS message to the other systems at all when a transaction rolls back, than to try to bring the client's database back in a consistent state with yours.

    It also becomes a lot more expensive to screw up. We're talking stuff that has all the strength of a signed contract, not "oops, we'll give you a seat on the next flight".

    Yes, your tools discovered that you sent the order for, say, 20 trucks in duplicate. Very good. Then what? It's as good as a signed contract the instant it was sent. It'll take many hours of some manager's time to negotiate a way out of that fuck-up. That is _if_ the other side doesn't want to play hardball and remind you that a contract is a contract.

    Wouldn't it be easier to _not_ have an inconsistency to start with, than to detect it later?

    Basically, yes, please do write all the integrity tests you can think of. Very good and insightful that. But don't assume that it suddenly makes XA transactions useless. _Anything_ that can reduce the probability of a failure in a distributed system is very much needed. Because it may be disproportionately more expensive to fix a screw-up, even if detected, than not to do it in the first place.
  • Re:What's this? (Score:3, Informative)

    by KiloByte ( 825081 ) on Friday May 13, 2005 @06:15AM (#12517866)
    Wrong. You're confusing low-level formatting (laying physical sectors onto the disk's surface) with creating a filesystem -- this is what's usually called formatting these days.

    If you obey the standard PC format, you'll get 18 sectors per track, leaving quite a lot of margin space. The margins are needed because the drive doesn't really care whether the new data is put in exactly the same place as the old sector was. Still, the standard is way too conservative, and many programs like fdformat let you reduce the margins. Even Microsoft's original Win95 install floppies used a 1.7MB format.

    That was the physical low-level format, a rough equivalent to the level 2 ISO/OSI network layer (level 1 is twiddling the bits, level 2 defines the byte and sector boundaries in the raw bit stream).

    FAT formatting (the filesystem) uses up 33 sectors (on the whole disk, not per-track), reducing the useful space to 2847 sectors, that is 1457664 bytes. And this is what you see when you check the free space on an empty floppy.
  • by leehwtsohg ( 618675 ) on Friday May 13, 2005 @06:49AM (#12517981)
    fsync(2) man does state:
    fsync copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.
    But then it goes on to state:
    NOTES
    In case the hard disk has write cache enabled, the data may not really be on permanent storage when fsync/fdatasync return.

    Which, as you point out, can be a BAD THING (TM) if someone opens a window. So, who should change? fsync, and its man page's NOTES, for devices that have a cache but actually are capable of flushing that cache? Or should there be a special really_fsync() call?
  • by Anonymous Coward on Friday May 13, 2005 @07:07AM (#12518051)
    From the UNIX spec, vol 2:

    ---
    NAME

    fsync - synchronise changes to a file

    SYNOPSIS

    #include <unistd.h>

    int fsync(int fildes);

    DESCRIPTION

    The fsync() function can be used by an application to indicate that all data for the open file description named by fildes is to be transferred to the storage device associated with the file described by fildes in an implementation-dependent manner. The fsync() function does not return until the system has completed that action or until an error is detected.

    The fsync() function forces all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronised I/O completion state. All I/O operations are completed as defined for synchronised I/O file integrity completion.

    ---

    In short, fsync() is specifically designed to flush the data in memory to the device, as well as to ensure the device writes it right fucking now and stfu until the job is done. fsync() under Linux does indeed issue command E7h for ATA5, which the drive is expected to follow immediately.

    If the device fails to do so, then it's operating out of spec and therefore is either faulty, or the manufacturer is falsely claiming compliance with the spec and selling something other than what was promised.
  • by Anonymous Coward on Friday May 13, 2005 @07:18AM (#12518091)
    No, flush is still used to dump filesystem changes from system memory to the drive even if the drive doesn't have a cache.

    However, fsync() _is_ expected to ensure that the data is committed in a way that ensures data integrity, regardless of the medium being used.

    If the drive has a hardware cache, then fsync() implementations are expected to ensure that this cache is also flushed. To this end, various incarnations of Linux and BSD employ ATA commands specifically designed for this task and which are mandatory for a drive to claim ATA compliance.

    If the drive manufacturers are failing to implement these commands as specified, then we have what amounts to dirty pool and most likely consumer fraud.
  • by frinkazoid ( 880013 ) on Friday May 13, 2005 @07:25AM (#12518122)
    This is true... Installing a fresh Windows 98 SE on a fairly new PC and then doing Windows Update, there is an update with this description:

    The Windows IDE Hard Drive Cache Package provides a workaround to a recently identified issue with computers that have the combination of Integrated Drive Electronics (IDE) hard disk drives with large caches and newer/faster processors. Computers with this combination may risk losing data if the hard disk shuts down before it can preserve the data in its cache.

    This update introduces a slight delay in the shutdown process. The delay of two seconds allows the hard drive's onboard cache to write any data to the hard drive.

    I found it nice to see how M$ worked around it, just waiting 2 seconds, how ingenious!
    Link to the M$ update site: http://www.microsoft.com/windows98/downloads/contents/WUCritical/q273017/Default.asp [microsoft.com]
  • by indifferent children ( 842621 ) on Friday May 13, 2005 @07:25AM (#12518125)
    Windows is WYSIWYG; Linux is YAFIYGI (You asked for it, you got it).

    This is an old quote, but not everyone has seen it. This is much like Neal Stephenson comparing Linux to the Hole Hawg drill in "In the Beginning Was the Command Line". Great read!

  • by jonwil ( 467024 ) on Friday May 13, 2005 @07:51AM (#12518227)
    The right answer is for the drive not to respond to the "Sync" command with "Done" until it really is done (however long that takes), and for the OS to not continue until it sees "Done" from the drive.
  • Re:fsync question (Score:3, Informative)

    by tomstdenis ( 446163 ) <tomstdenis AT gmail DOT com> on Friday May 13, 2005 @08:05AM (#12518274) Homepage
    Use reiserfs?

    At least then the file is either there or not there.

    My gentoo box has been through a few brownouts/powerouts [I have a UPS now ...] and hasn't skipped a beat. It even comes back up on its own [go Asus BIOS ;-)] when I'm, say, on another continent ;-)

    Tom
  • by bill_mcgonigle ( 4333 ) * on Friday May 13, 2005 @08:05AM (#12518275) Homepage Journal
    I work on a block-level transactional system that requires blocks to be synced to the platters. There were two options: modify the kernel to issue syncs to the ATA drives on all writes (to the disk in question), or just disable the physical write cache on the drive. It turned out to be a touch faster to just disable the cache, but the two are effectively equal.

    Just to clarify - use hdparm -W to fiddle with the write cache on the drive. I've built linux-based network appliances that go out in the field, sometimes overseas, can't be touched by a competent tech, and sometimes lose power. You have to use a journaled filesystem and turn off the write cache. The write speed starts to suck, but I was network-bound anyway.
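    For reference, a hedged sketch of what hdparm -W does under the hood on 2005-era kernels, assuming the legacy IDE driver's HDIO_GET_WCACHE/HDIO_SET_WCACHE ioctls from <linux/hdreg.h> (not every controller driver implements them, and the device path here is just an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/hdreg.h>

    /* Query and then disable the drive's write cache, roughly what
     * "hdparm -W 0 /dev/hda" does.  Needs root; Linux-specific. */
    int main(int argc, char **argv)
    {
        long enabled = 0;
        const char *dev = (argc > 1) ? argv[1] : "/dev/hda";
        int fd = open(dev, O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, HDIO_GET_WCACHE, &enabled) == 0)
            printf("%s: write cache is %s\n", dev, enabled ? "on" : "off");

        if (ioctl(fd, HDIO_SET_WCACHE, 0) != 0)     /* 0 = disable caching */
            perror("HDIO_SET_WCACHE");

        close(fd);
        return 0;
    }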
  • The Linux man page (last updated 2001-04-18) states that all data should be written to stable storage. To me, stable means that if power is pulled, the data is still there. It does, however, give a warning in the NOTES section that if write cache is enabled on the drive, "the data may not really be on permanent storage." I don't know if that warning is just there because of observed behavior, or if the various specs allow said behavior.
  • Re:Linux 2.6 and IDE (Score:2, Informative)

    by asaul ( 98023 ) on Friday May 13, 2005 @08:18AM (#12518371)
    I don't know about the Alan Cox comment, but for IDE this is a common thing. Simply put, IDE disks struggle enough for performance, so they have write caching enabled by default.

    I work for a major server vendor who creates their own firmware for their disks. By default, all SCSI and FCAL disks are configured to have write cache disabled, because data integrity is valued over performance. For ATA, apparently the disk vendors don't give any option for it, so we are unable to work around that.

    This is actually quite a pain when it comes to benchmarks, because for SOME tests it makes OSes which enable the write cache look really fast. It's not until you suffer a catastrophe that you find out the data never made the platter.

    And RAID devices don't lie about completing I/O - the device presents a "disk" interface to a slab of (hopefully) battery-backed cache, which allows write performance to be massively better. The RAID card itself takes care of syncing its cache to the disks; it just takes the data into cache and responds to the transaction immediately, flushing later. As far as the OS needs to know, the I/O is complete - firmware bugs and battery failures aside, the RAID card handles it internally.

    And the author seems to be lacking clue about what he is testing - if anything, all it is testing is the OS's ability to get data down to the disks consistently - all fsync() knows is that the calls it made to send data to the storage devices returned success; it's totally dependent on the volume manager, disk and controller drivers, as well as the actual physical storage, to get the job done.

    For all he knows his drivers might be returning immediately just to make performance look better, but actually scheduling the I/O in some manner which causes it to be lost before commitment to storage.

    If he wants to complain about the disks, I think he is going to need a much lower-level test than a perl script calling sync.
  • by putaro ( 235078 ) on Friday May 13, 2005 @09:37AM (#12518958) Journal
    Let's try a reply with a bit less flame attached.

    A journaling file system will know when it needs to get everything committed to disk in order to have a consistent state. At that point it will issue a sync to the drive to flush the drive's write cache. However, not every write has to get to the disk for the filesystem to be in a consistent state.

    Now, you're yelling BS, BS, BS...hold on and listen for a minute. I write file systems for a living and have done so for over 15 years.

    What is the commitment that a journaling file system makes to you? It makes the commitment that it will not be in an inconsistent state. It doesn't make the commitment that every last write will make it to disk. For example, ext3 in journaling mode only journals metadata transactions. Any data writes that you make are not guaranteed at all, unless you make the proper sync call. As someone pointed out above, fsync is not the proper call on many OS's.

    The way that we have settled on to make filesystems and databases work is to create atomic transactions and move from transaction to transaction. If a transaction fails (for any reason, but let's just assume it's because the system crashed), all of the data that was written as part of it is discarded when you restart. If the partial data was not discarded then the filesystem would be in an inconsistent state AND the data that you were writing (if you care about consistency) would be in an inconsistent state. So, forcing every write to immediately go to disk is pointless as if the transaction you're doing is interrupted you'll be discarding the data anyway. It's only when you are finishing the transaction that you need to make sure that everything is on disk. By that time it might be already, especially if that transaction was large.

    Let's take a simple situation. Say that you have a filesystem that guarantees that every time you do a write() call, when the call returns that data will be on disk and available for you the next time, and that if the write() errors or does not return, the file will be as it was before the write() was called. Now, you do a write of 100MB with a single call. The filesystem may scatter that data all over the disk depending on how fragmented it is. Forcing each write to disk in order will bang the head a lot and reduce your performance. By letting the write cache do its job and reorder writes as necessary your performance will be much better. (We used to do this in the driver and file system cache. However, modern disk drives provide such an abstract interface that it's nearly impossible for the OS to micromanage write ordering. In the old days the OS knew where the head was because it told the damn drive where to put it. Now, you can sort of guess and you're usually wrong.) Cache on ATA drives tops out at around 16MB, so you will definitely flush most of the data out of the cache in the course of writing anyway. Finally, at the end, before returning, the FS would sync the drive's cache to the disks and mark the transaction as closed. Were the system to crash in the middle of the write, when the system restarts it would need to discard any data that might have been written, and it wouldn't matter which data had been written or not written. (Important note: Journaling file systems and databases have a recovery process after a crash. It's just a lot less involved than running fsck or DSKCHK over the whole disk.)

    So, write caching is valuable and widely used. In order to avoid data corruption it's not necessary to turn off caching but it is necessary for the cache to do what it is told, when it is told (all of the write caches too, not just the disk's). Were the disks truly lying to the OS it would be bad. More likely, this guy's Perl script is just not OS specific enough to get the OS to really do what he thinks he is asking it to do. There's a reason why serious data management apps need to be ported and certified on an OS. Getting everything to do its job right is tough.
  • Re:What's this? (Score:3, Informative)

    by some guy I know ( 229718 ) on Friday May 13, 2005 @09:38AM (#12518966) Homepage
    The Amiga could format high-density disks to 1,760 Kb, given a high-density disk drive (which only came as standard in the high-end machines). However, if I remember rightly, to do this it had to slow the drive down.
    The drive itself was not slowed down.
    Instead, the Amiga had the drive write an entire track at a time, rather than just one sector at a time.
    (This meant that it could store more sectors per track, because there were no inter-sector gaps, just a lead-in and lead-out for the track as a whole.)
    The reason that the drive seemed slower was because to write one sector, the Amiga had to read an entire track, replace the sector to be written, and then rewrite the entire track back to disk.
    All a PC had to do to write a sector was to write the sector.
    So it was the OS and method of storage that caused the slowdown, not the drive (hardware/firmware) itself.

    What this meant was that writing random sectors would take more time, but writing sectors sequentially would not (usually).
    In fact, disk-to-disk copies were faster, because the Amiga could start reading in the middle of a track to get the whole track, whereas a PC had to wait for the particular sector that it wanted to read to come around under the read head.
  • by jgarzik ( 11218 ) on Friday May 13, 2005 @09:50AM (#12519114) Homepage
    All it would have taken is ten minutes of searching on Google to discover what is going on.


    You need a vaguely recent 2.6.x kernel to support fsync(2) and fdatasync(2) flushing your disk's write cache. Previous 2.4.x and 2.6.x kernels would only flush the write cache upon reboot, or if you used a custom app to issue the 'flush cache' command directly to your disk.


    Very recent 2.6.x kernels include write barrier support, which flushes the write cache when the ext3 journal gets flushed to disk.


    If your kernel doesn't flush the write cache, then obviously there is a window where you can lose data. Welcome to the world of write-back caching, circa 1990.


    If you are stuck without a kernel that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE (SCSI) command, it is trivial to write a userspace utility that issues the command.



    Jeff, the Linux SATA driver guy
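    A hedged sketch of the kind of userspace utility Jeff describes, assuming the legacy HDIO_DRIVE_CMD ioctl passes the ATA FLUSH CACHE opcode (E7h) through to the drive; your kernel and driver may or may not allow this, and the device path is only an example:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/hdreg.h>

    /* Issue ATA FLUSH CACHE (E7h) directly to an IDE/SATA disk.
     * args[] layout for HDIO_DRIVE_CMD: command, sector/nsect, feature,
     * and the number of sectors of data to read back (none here). */
    int main(int argc, char **argv)
    {
        unsigned char args[4] = { 0xE7, 0, 0, 0 };
        const char *dev = (argc > 1) ? argv[1] : "/dev/hda";
        int fd = open(dev, O_RDONLY | O_NONBLOCK);
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, HDIO_DRIVE_CMD, args) != 0)
            perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
        else
            printf("%s: drive acknowledged the cache flush\n", dev);

        close(fd);
        return 0;
    }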

  • by alexhs ( 877055 ) on Friday May 13, 2005 @09:57AM (#12519189) Homepage Journal
    anyone who used computers *knew* what "kilobyte" and friends meant.

    15 years ago, maybe. Nowadays, I don't think so. It's just that Windows reports sizes in 2^10 chunks and not 10^3 ones, so people are thinking someone is lying, and, you know, Microsoft never lies.

    OTOH, cfdisk happily reports disk sizes by 10^3 units.

    I don't even think that there is some marketing push to use kilo instead of kibi :

    Once upon a time, disks (like floppies) were strictly divided into cylinders, heads, and sectors, a sector being 2^9 bytes (it would be interesting to know WHY 512 bytes). You would multiply c*h*s and get your total disk capacity. But space was wasted on the outer tracks.

    Now, things have changed. You have reserved sectors for bad sector handling (unadvertised space!), and sectors per track isn't a constant. You just have a total number of (LBA) sectors, which is not a simple product of three factors. Moreover, capacities became large relative to the 512-byte unit.

    The total number of sectors is still printed on the hard disk, if you want it. And remember that all 160GB disks aren't equal (i.e. they don't have the same number of sectors). Seriously, are you going to check the exact number of sectors when you're shopping for a new ca. 200GB hard disk? rpm, noise, ... seem to me to be better criteria than the few additional sectors I might get. And what would you think about CDs or DVDs? Most CDR-80/700MB discs really are 703MiB, but there might be little differences. They still are advertised as 700MB and not 703MB. And DVDs, however, aren't 4.7GiB but 4.7GB.

    USB keys and RAM sticks still use MiB. Why? What is marketing doing there? It's just that they still use a binary scheme. The other way around, Ethernet and modem speeds have never used powers of two.

    The transition between GiB and GB was an unfortunate event but, formally speaking, it's better now with regard to (international) units.

  • by Afrosheen ( 42464 ) on Friday May 13, 2005 @10:43AM (#12519648)
    You must be new here. That's the whole point of meta-moderation. If you do it, it improves your chances of moderating in the future, because it basically reviews the moderating quality of previous moderation from other people.
  • I love that article/essay. Link: In the Beginning was the Command Line [cryptonomicon.com]. It's a plain CRLF text file in a ZIP archive.
  • Lucky for me, (Score:3, Informative)

    by EvilStein ( 414640 ) <spamNO@SPAMpbp.net> on Friday May 13, 2005 @01:36PM (#12521686)
    I'm both drunk *and* stoned.

    Should be a lot of moddin' fun today, lemme tell ya..
  • by tlambert ( 566799 ) on Friday May 13, 2005 @06:06PM (#12524887)
    Actually, it's a flaw in the ATA specification: ATA drives can do a disconnected read, but there is no way to do a disconnected write.

    Because of this, you can have a tagged command queue for read operations, but there is no way to provide a corresponding one for write operations.

    SCSI does not have this limitation, but the bus implementation is much more heavyweight, and therefore more expensive.

    The problem is exacerbated in that ATA does not permit new disconnected read requests to be issued while a non-disconnected write request is outstanding. Therefore, any write acts as a read stall barrier.

    In order to compete with SCSI on both write performance, and interleaved read/write operation performance, manufacturers added write caching by default, breaking the historical contract about when a write completes to stable storage vs. the write operation not returning until it did.

    Today, there are still a number of disks that *actually* lie, and there are a number of firewire/ATA bridge chipsets that do not propagate the FW sync into an ATA sync, even if you didn't end up with a disk that lied.

    So you can be screwed if:

    1) The disk lies about honoring the cache flush request (there was one series of Quantum ATA disks that did this, for which Quantum promptly provided a firmware update. I really like Quantum for this, and you can find the discussion on the FreeBSD-hackers mailing list archives).

    2) The controller or bridge chipset responds to the flush request, but does not propagate it to the actual devices (there is one popular bridge chip that does this; since it was not recalled by the manufacturer, and there is no firmware update fix possible, in the interests of not being sued, I'm going to avoid naming names here).

    3) The OS may not issue the command for user-perceived performance reasons relative to the competition (this is why, before the cache flush command existed in the ATA specification, FreeBSD turned the write cache back on by default, even though everyone knew that data integrity guarantees *would* go out the window).

    Unfortunately, I can no longer just say "ATA sucks; use SCSI", because a number of SCSI disk manufacturers have started doing the same pig tricks with their SCSI disks (again, not naming names), and ignore the SCSI cache flush command, or ignore the mode page setting for synchronous I/O completion with tagged write commands (writing is slow, especially if you have to read an entire track to write a block).

    Hopefully, this Slashdot article will cause the mainstream press to put enough light on this issue to shame the drive manufacturers into at least labelling actually compliant drives.

    -- Terry
