Linux Not Quite Ready For New 4K-Sector Drives
Theovon writes "We've seen a few stories recently about the new Western Digital Green drives. According to WD, their new 4096-byte sector drives are problematic for Windows XP users but not Linux or most other OSes. Linux users should not be complacent about this, because not all the Linux tools like fdisk have caught up. The result is a reduction in write throughput by a factor of 3.3 across the board (a 230% overhead) when 4096-byte clusters are misaligned to 4096-byte physical sectors by one or more 512-byte logical sectors. The author does some benchmarks to demonstrate this. Also, from the comments on the article, it appears that even parted is not ready, since by default it aligns to 'cylinder' boundaries, which are not physical cylinder boundaries and are multiples of 63."
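To see where the slowdown comes from, here is a minimal Python sketch. The function name and defaults are illustrative assumptions, not taken from the article. It counts how many 4096-byte physical sectors a write touches and how many of those are only partly covered. A partly covered physical sector has to be read, modified and written back by the drive.

```python
LOGICAL = 512    # bytes per logical sector, as presented to the OS
PHYSICAL = 4096  # bytes per physical sector on the new drives


def physical_sectors_touched(start_lba, length_bytes,
                             logical=LOGICAL, physical=PHYSICAL):
    """Return (touched, partial) for a write that starts at start_lba.

    touched: number of physical sectors the write overlaps.
    partial: how many of those are only partly overwritten.
    """
    first_byte = start_lba * logical
    last_byte = first_byte + length_bytes - 1
    first_phys = first_byte // physical
    last_phys = last_byte // physical
    touched = last_phys - first_phys + 1

    partial = 0
    if first_byte % physical != 0:
        partial += 1  # write begins in the middle of a physical sector
    if (last_byte + 1) % physical != 0:
        # Do not count the same physical sector twice when the whole
        # write sits inside a single physical sector.
        if last_phys != first_phys or partial == 0:
            partial += 1
    return touched, partial


# One 4096-byte cluster at LBA 64 (aligned) versus LBA 63 (misaligned).
print(physical_sectors_touched(64, 4096))
print(physical_sectors_touched(63, 4096))
```

With the cluster at LBA 64 the write covers exactly one physical sector. At LBA 63 it straddles two, and both are partial.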
Parted / GPT (Score:1, Interesting)
I heard using parted and GPT labels instead of MSDOS will optimize it on 4096 byte sectors automatically. Any truth to it?
oh great.. i try to make a joke and ... (Score:2, Interesting)
Interesting (Score:1, Interesting)
Check with your distribution (Score:5, Interesting)
I know that Fedora seems to have addressed this with parted 2.1.1 [fedoraproject.org] and util-linux-ng 2.1 [fedoraproject.org]. Both are scheduled for Fedora 13, but can be pulled into Fedora 12 by those getting the hardware early.
Partitions are obsolete (Score:1, Interesting)
Easiest fix: stop dividing your disks into partitions.
Re:Good thread on this. (Score:3, Interesting)
GPT wraps itself in an MBR partition map. At the very least the GPT is supposed to include an MBR map that claims the whole disk as used by GPT to avoid issues with old disk tools and the like. And if you've got a partition scheme that's compatible with the MBR scheme they can both contain the same information, assuming your disk tool supports this, so that MBR-only environments can still find your partitions.
It's also possible to format with GPT and then use an MBR-only tool (fdisk) to go back and manipulate the (fake) MBR to contain a partition that points to the same start/end points as the GPT boot partition -- GPT-aware systems will just ignore the MBR record, and non-GPT systems will at least be able to find the boot partition.
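As a rough illustration of the MBR wrapper described above, the following Python sketch parses the four classic MBR partition entries from the first 512 bytes of a disk and checks for an entry of type 0xEE, the type used by the protective entry. The function names are made up for this example, and it only looks at the MBR, not at the GPT header itself.

```python
import struct

GPT_PROTECTIVE = 0xEE  # MBR partition type of the GPT protective entry


def parse_mbr(sector0):
    """Return (type, start_lba, num_sectors) for each non-empty MBR entry."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature")
    entries = []
    for i in range(4):
        # The partition table starts at byte 446; each entry is 16 bytes.
        raw = sector0[446 + 16 * i: 446 + 16 * (i + 1)]
        ptype = raw[4]
        start_lba, num_sectors = struct.unpack_from("<II", raw, 8)
        if ptype != 0:
            entries.append((ptype, start_lba, num_sectors))
    return entries


def looks_gpt_protected(sector0):
    """True if any MBR entry has the protective type."""
    return any(ptype == GPT_PROTECTIVE for ptype, _, _ in parse_mbr(sector0))
```

To try it on a real disk you would read the first 512 bytes of the device (for example with `open("/dev/sdX", "rb").read(512)`, where the device name is a placeholder) and pass them to `looks_gpt_protected`.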
As to whether your motherboard/firmware supports GPT, it can be hard to say. Anything with EFI is required to support GPT. Some systems with a legacy BIOS pre-boot environment also have support for GPT, because it's the only way to support large disks. But I can't name particular firmware versions that do/don't support GPT.
Drive lies and future fixes (Score:5, Interesting)
There is an excellent thread talking about how recent (2.6.31+) Linux kernels try to report the underlying hard drive architecture [gmane.org] (found via the OSNews comments [osnews.com]). Alas, it looks like some of these drives are not reporting this data correctly, and thus automatic adjustment (at partitioning time) is not taking place. It looks like in the future, rather than relying only on detection by reported capability, fdisk (and hopefully gparted) will align to 1MiB boundaries by default [gmane.org] if the topology can't be found (unless your media is small).
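For anyone who wants to see what their kernel reports, here is a small Python sketch that reads the block sizes from sysfs and checks a partition's start against them. The helper names and the 1 MiB fallback policy are assumptions for illustration (the fallback mirrors the idea above of not trusting a drive that claims 512-byte physical sectors). It also ignores the `alignment_offset` attribute, so treat it as a rough check only.

```python
from pathlib import Path


def read_int(path):
    """Read a sysfs attribute as an int, or return None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def disk_topology(disk, sysfs="/sys/block"):
    """Block sizes the kernel reports for a whole disk such as 'sda'."""
    queue = Path(sysfs) / disk / "queue"
    return {
        "logical_block_size": read_int(queue / "logical_block_size"),
        "physical_block_size": read_int(queue / "physical_block_size"),
    }


def partition_is_aligned(disk, part, sysfs="/sys/block",
                         fallback=1024 * 1024):
    """Check a partition start against the reported physical block size.

    If the drive reports 512-byte physical blocks (possibly untruthfully)
    or nothing at all, fall back to a conservative 1 MiB granularity.
    """
    start = read_int(Path(sysfs) / disk / part / "start")  # 512-byte units
    if start is None:
        raise FileNotFoundError(f"no start attribute for {part}")
    physical = disk_topology(disk, sysfs)["physical_block_size"]
    granularity = physical if physical and physical > 512 else fallback
    return (start * 512) % granularity == 0
```

A call such as `partition_is_aligned("sda", "sda1")` would return False for a first partition starting at sector 63, whichever granularity ends up being used.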
Additionally, I gather that recent Fedoras will try to adjust things like LVM to match larger sectors too [storagemojo.com]. Hopefully whatever is laying out LVM will be fixed as well.
Coincidentally, it looks like Oracle have a very committed dev trying to make this stuff work by default...
Re:Interesting (Score:5, Interesting)
Re:Open Source to the rescue (Score:3, Interesting)
Now I wonder why a hard drive company feels the need to have its hardware LIE to the OS?
So the hardware is compatible with more software. For example, hard drives still report some number of cylinders, heads and sectors to the BIOS and the OS, but hard drives have been using ZBR [wikipedia.org] for 20 years now (IIRC) so the sector number is meaningless.
But, as it is now, if my old system needs a new hard drive, I do not need to find an old drive to be compatible with my system (as long as it is IDE or SCSI, I don't know of any adapter from the newer interfaces to ESDI or ST-506, but they probably exist).
They could have made it a jumper setting, set to 512B by default, though. I assume the hard drive is faster using 4KB sectors instead of true 512B sectors, so they could have offered an option to reformat the drive to 512B. Or maybe that's not possible with modern drives; I have an old 4GB SCSI drive that can be reformatted to a different sector size, though I never tried it.
Re:slashdot is not journalism (Score:4, Interesting)
I'm with you, but on the other hand that doesn't mean they should just not give a shit about the quality of their end-product. We know from experience that they can edit and correct stories as corrections arise in the comments, but how often does that happen in practice? (Hardly ever.) Somewhere between a third and half of the stories posted here are either outright lies, or extremely misleading-- I may be exaggerating, but not by much-- and almost never are they corrected.
Look, any site that posts this article: http://tech.slashdot.org/article.pl?sid=09/02/16/2259257 [slashdot.org] without a single correction simply Does. Not. Give. A. Shit.
I don't think anybody's expecting the New York Times when they visit here, but some minimum level of competence would be nice. I don't fault anybody for complaining.
DragonFly's solution (Score:5, Interesting)
We're adjusting our disklabel64 utility and kernel support to set the partition base offset such that it is physically aligned instead of slice-aligned, and we are using 32K alignment. That should fix the problem without having to mess around with fdisk.
The DragonFly 64-bit disklabel structure uses 64-bit byte offsets instead of sector addressing to specify everything. It ensures things are at least sector aligned but we wanted to make disk images more portable across devices with potentially different sector sizes. The HAMMER fs uses byte-granular addressing for the same reason, 16K aligned.
-Matt
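The byte-offset approach described above can be sketched roughly as follows in Python. This is not DragonFly's code: the helper names and the reserved-space parameter are hypothetical, and only the 32K alignment figure comes from the comment.

```python
ALIGN = 32 * 1024  # 32K alignment, the figure given in the comment


def align_up(byte_offset, alignment=ALIGN):
    """Round a byte offset up to the next multiple of alignment."""
    return -(-byte_offset // alignment) * alignment


def partition_base(slice_start_bytes, reserved_bytes, alignment=ALIGN):
    """Physically aligned base offset inside a slice (hypothetical helper).

    slice_start_bytes: absolute position of the slice on the disk.
    reserved_bytes:    label/boot space at the front of the slice.
    The result is relative to the slice start, but it is chosen so that
    the absolute on-disk offset is a multiple of alignment.
    """
    absolute = align_up(slice_start_bytes + reserved_bytes, alignment)
    return absolute - slice_start_bytes


# A slice starting at the traditional sector 63, with 4096 bytes reserved.
slice_start = 63 * 512
base = partition_base(slice_start, 4096)
print(base, (slice_start + base) % ALIGN)
```

Because everything is expressed in bytes, the same computation works whether the device presents 512-byte or 4096-byte sectors.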
Re:Set 32 sectors per track (Score:4, Interesting)
Actually this problem is potentially much worse on SSDs. Erase blocks are huge, and read-modify-write really sucks on flash.
Couldn't this be addressed (at least in part) by a battery-backed write cache like better RAID controllers use? Set it up like SAN snapshots (so it just stores the diff between what's in the actual flash storage and what's been changed so far), and then write the changed blocks when it's most advantageous (e.g. when there's an entire block's worth of data, so it would all have to be erased by the flash storage anyway).
Maybe combine that with something like a disk defrag, except instead of storing frequently-sequentially-read data in physical sequence, store frequently-written data (regardless of if it's sequentially-read or not) in physical sequence.
whoops, one thing about RawCHS (Score:4, Interesting)
I forgot, there is one thing about RawCHS nowadays: there is no proper spec for how to know if a partition in an MBR (fdisk) partition table is a valid partition. So there are heuristics that are applied to the entries to guess if they are real or to be ignored as empty. One of the heuristics that some software uses is to ignore all partition entries that don't begin on a cylinder boundary. To count as valid, the partition has to start on a sector number that is a multiple of the number of sectors per track (S in CHS). And since all drives 8GB or greater present an S of 63, the first partition on an MBR disk has always started at sector 63, which makes it unaligned when the internal sector size is 4K (8 logical sectors per internal sector, and 63 is not a multiple of 8).
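The conflict between the two requirements is easy to check with a few lines of Python. This is just arithmetic on the numbers above; the function names are made up for the example.

```python
import math

SECTORS_PER_TRACK = 63    # the S value large drives present
LOGICAL_PER_PHYSICAL = 8  # 4096-byte internal / 512-byte logical sectors


def on_track_boundary(start_lba, s=SECTORS_PER_TRACK):
    """The legacy heuristic: start must be a multiple of S."""
    return start_lba % s == 0


def is_4k_aligned(start_lba, ratio=LOGICAL_PER_PHYSICAL):
    """Start falls on the first logical sector of an internal sector."""
    return start_lba % ratio == 0


def first_start_satisfying_both(s=SECTORS_PER_TRACK,
                                ratio=LOGICAL_PER_PHYSICAL):
    """Smallest non-zero LBA that is a multiple of both s and ratio."""
    return s * ratio // math.gcd(s, ratio)


print(on_track_boundary(63), is_4k_aligned(63))
print(first_start_satisfying_both())
```

Sector 63 passes the legacy heuristic but fails the 4K check. Since 63 and 8 share no common factor, the first start that satisfies both is sector 504.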
Windows before 2000 checks the CHS alignment of MBR entries and ignores any partition entries that don't start on a multiple of S. So all disks out there are misaligned. With Windows 2000 or later, you can start the partition on any boundary you want.
Western Digital has a jumper you can put on the drive that adds 1 to all access requests, making all those misaligned first partitions aligned. But it'll also make any aligned partitions misaligned. So the real answer is just to lay out your disk differently. I would recommend using GUID disk partitioning instead of MBR anyway, because MBR doesn't work for >2TB drives. And GUID doesn't have any weird alignment requirements (and doesn't have any knowledge of CHS).
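The jumper's effect, as described above, can be modelled in a few lines of Python. This is a toy model of the "adds 1" behaviour only, not of the drive's firmware, and the function names are hypothetical.

```python
def accessed_lba(requested_lba, jumper=False):
    """LBA the drive actually uses; the jumper shifts every request by 1."""
    return requested_lba + 1 if jumper else requested_lba


def aligned(requested_lba, jumper=False, ratio=8):
    """True if the request lands on a 4K internal sector boundary."""
    return accessed_lba(requested_lba, jumper) % ratio == 0


for start in (63, 2048):
    print(start, aligned(start), aligned(start, jumper=True))
```

A partition at sector 63 becomes aligned with the jumper fitted (63 + 1 = 64), while one at sector 2048 becomes misaligned, which is why the jumper only helps disks laid out the old way.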