Linux Not Quite Ready For New 4K-Sector Drives
Posted by CmdrTaco
from the when-more-is-less dept.
Theovon writes "We've seen a few stories recently about the new Western Digital Green drives. According to WD, their new 4096-byte sector drives are problematic for Windows XP users but not Linux or most other OSes. Linux users should not be complacent about this, because not all the Linux tools like fdisk have caught up. The result is a reduction in write throughput by a factor of 3.3 across the board (a 230% overhead) when 4096-byte clusters are misaligned to 4096-byte physical sectors by one or more 512-byte logical sectors. The author does some benchmarks to demonstrate this. Also, from the comments on the article, it appears that even parted is not ready, since by default it aligns to 'cylinder' boundaries, which are not physical cylinder boundaries and are multiples of 63."
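To make the misalignment concrete, here is a minimal Python sketch that checks whether a partition's starting logical sector falls on a physical-sector boundary. It assumes the 512-byte logical and 4096-byte physical sector sizes described in the summary; the function name is made up for illustration.

```python
LOGICAL_SECTOR = 512    # bytes per sector as presented to the host
PHYSICAL_SECTOR = 4096  # bytes per sector on the platter (assumed, per the summary)


def is_aligned(start_lba, logical=LOGICAL_SECTOR, physical=PHYSICAL_SECTOR):
    """Return True if a partition starting at start_lba begins on a physical-sector boundary."""
    return (start_lba * logical) % physical == 0


if __name__ == "__main__":
    # 63 is the traditional DOS-style start of the first partition.
    for start in (63, 64, 2048):
        print(start, is_aligned(start))
```

A start at sector 63 lands 3584 bytes into a physical sector, so every 4096-byte filesystem cluster in that partition straddles two physical sectors. Any start that is a multiple of 8 logical sectors avoids this.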
Good thread on this. (Score:4, Informative)
So don't do that... (Score:2, Informative)
Author claims a massive performance drop if things aren't aligned right. Ubuntu already does it with parted and fdisk can do it manually. So, no big problem; fdisk ought to be fixed to have sane defaults with a 4096 byte block size, sure. That can't be all that difficult.
The author also seems to think that only a 30% increase in times for misaligned writes should be expected. I'm not sure why. In a naive implementation I'd expect a 100% increase in time (each block now needs to be written twice). Linux, obviously, doesn't use a naive implementation. It's expected that if the hardware violates the assumptions behind the techniques Linux uses to achieve high performance, those techniques end up making things very slow instead.
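The "written twice" intuition can be sketched as a small counting model in Python. This is only a toy under stated assumptions (4096-byte physical sectors, and a drive that must read-modify-write any physical sector it only partly overwrites). It says nothing about actual timings, and the function name is hypothetical.

```python
PHYSICAL_SECTOR = 4096  # assumed physical sector size in bytes


def physical_sectors_touched(offset, length, physical=PHYSICAL_SECTOR):
    """Return (touched, partial) for a write of `length` bytes at byte `offset`.

    touched: number of physical sectors the write overlaps.
    partial: how many of those are only partly covered and would
             therefore need a read-modify-write cycle.
    """
    if length <= 0:
        return (0, 0)
    first = offset // physical
    last = (offset + length - 1) // physical
    touched = last - first + 1
    head_partial = offset % physical != 0
    tail_partial = (offset + length) % physical != 0
    if touched == 1:
        # Head and tail are the same sector; count it once.
        partial = 1 if (head_partial or tail_partial) else 0
    else:
        partial = int(head_partial) + int(tail_partial)
    return (touched, partial)


if __name__ == "__main__":
    print(physical_sectors_touched(0, 4096))         # aligned 4K write
    print(physical_sectors_touched(63 * 512, 4096))  # first 4K block of a partition at sector 63
```

In this model an aligned 4K write touches one physical sector with no partial coverage, while the same write in a partition starting at sector 63 touches two, both partially.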
Re:Good thread on this. (Score:2, Informative)
The BIOS has no understanding of partition tables. It merely reads the first sector of the harddrive to 0x7C00 and then jumps to that location. The DOS partition table is used by convention for interoperability between operating systems. If you wanted to use a different partitioning scheme, there is no technical reason your operating system couldn't.
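For reference, the conventional DOS partition table is just four 16-byte entries at byte offset 446 of that first sector, followed by the 0x55 0xAA signature. Here is a minimal Python sketch that decodes it from a 512-byte buffer. The function name and the returned dictionary keys are made up for illustration, and it ignores extended partitions and GPT.

```python
import struct

MBR_SIZE = 512
TABLE_OFFSET = 446       # partition table starts here in the boot sector
ENTRY_SIZE = 16
SIGNATURE = b"\x55\xaa"  # last two bytes of a valid boot sector


def parse_mbr(sector):
    """Decode the four primary partition entries of a DOS-style MBR."""
    if len(sector) < MBR_SIZE or sector[510:512] != SIGNATURE:
        raise ValueError("not a valid MBR boot sector")
    partitions = []
    for i in range(4):
        entry = sector[TABLE_OFFSET + ENTRY_SIZE * i:TABLE_OFFSET + ENTRY_SIZE * (i + 1)]
        # status, CHS start (3 bytes), type, CHS end (3 bytes), LBA start, sector count
        status, _chs_start, ptype, _chs_end, start_lba, count = struct.unpack("<B3sB3sII", entry)
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append({
                "index": i,
                "bootable": status == 0x80,
                "type": ptype,
                "start_lba": start_lba,
                "sectors": count,
            })
    return partitions
```

The `start_lba` field is the number that matters for the alignment discussion: on a drive with 4096-byte physical sectors and 512-byte logical sectors, it should be a multiple of 8.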
Re:Open Source to the rescue (Score:4, Informative)
Exactly. Drives are pretending to have 512-byte sectors because Windows can't deal with 4k sectors, and then silently reducing performance when you believe them and use 512-byte sector sizes. Had the drives reported 4k sector sizes, they'd work great under Linux and not at all under Windows.
This isn't a Linux problem, it's a drive problem caused by Windows. The solution is to implement yet another workaround for stupid devices, and start aligning partitions to 4k by default.
Nitpick: SDHC card sectors are always 512 bytes, and most SD card sectors are 512 bytes too. Flash memory would benefit from larger sector sizes too, but they've probably stuck to 512 bytes for Windows compatibility.
Re:Interesting (Score:3, Informative)
While a kernel tweak may help alleviate the issue, it is primarily an issue with our current (userspace) disk partitioning and formatting utilities. I'd also disagree with you on the point where the problem is the drive microcode; drives should do what they are told, and not guess on behalf of the instructions they are given what to do. Admittedly, the microcode tweak would be minor and largely trivial, but I'd rather not fix (primarily) userspace software problems in the kernel, nor the device firmware.
Re:Good thread on this. (Score:4, Informative)
Unless your BIOS is trying to be too smart and peeking into your partitions instead of launching the MBR (sadly, some do), it won't matter. It's the MBR's job to boot your system after the BIOS hands off control to it, and on most Linux systems the bootloader is installed straight into the MBR.
Re:I just bought one of these (Score:5, Informative)
/dev/sdd:
Model=WDC WD15EARS-00Z5B1, FwRev=80.00A80, SerialNo=
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
It looks to me that this should *really* be fixed by WD with a firmware update.
Solution: Instead of fdisk, call it as fdisk -H 224 -S 56 as per Theodore Tso's blog [thunk.org].
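A quick arithmetic check of why that geometry helps, as a Python sketch. It assumes 512-byte logical sectors, 4096-byte physical sectors, and a tool that places partitions on track and cylinder boundaries of the fake geometry (the first partition conventionally starting one track in). The function names are hypothetical.

```python
LOGICAL = 512    # bytes per logical sector
PHYSICAL = 4096  # bytes per physical sector (assumed)


def cylinder_bytes(heads, sectors_per_track, logical=LOGICAL):
    """Size of one 'cylinder' of the fake geometry, in bytes."""
    return heads * sectors_per_track * logical


def geometry_is_4k_friendly(heads, sectors_per_track):
    """True if both track and cylinder boundaries land on 4096-byte boundaries."""
    track_ok = (sectors_per_track * LOGICAL) % PHYSICAL == 0
    cylinder_ok = cylinder_bytes(heads, sectors_per_track) % PHYSICAL == 0
    return track_ok and cylinder_ok


if __name__ == "__main__":
    print(geometry_is_4k_friendly(255, 63))  # the usual default geometry
    print(geometry_is_4k_friendly(224, 56))  # the -H 224 -S 56 geometry
```

With 224 heads and 56 sectors per track, a cylinder is 12544 sectors, which is a multiple of 8, and a track is 56 sectors, also a multiple of 8. With 255 and 63, neither is.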
Re:Good thread on this. (Score:3, Informative)
Even if you are using Windows, Vista and up support GPT. It's handy for servers where you expect to have partitions larger than 2 TB.
But I guess if one were using a modern version of Windows, you wouldn't have the 4K alignment problems to begin with.
Re:Set 32 sectors per track (Score:4, Informative)
Essentially we are back to the old problems of the ST412 interface, where we also had to figure out the best interleave for the drives when we were formatting them. Most drives then did have a fairly conservative interleave, but a reformat of them could improve the throughput considerably. A reformat could be done so that the whole track could be read in 2 rotations instead of 3, and what that does to performance is fairly easy to understand. C800:5 was a commonly used BIOS address where the low-level format routine resided.
But from what I understand this problem is an offset problem when the head steps from track to track, and that's also an issue to be considered. And today it's not common knowledge/practice to low level format hard drives.
And why stick at 4k sectors? Depending on the system you may want to use a different sector size. If you run Oracle on some systems the block size is 8k, and in that case you may want to have 8k disk blocks too since it would be good for performance.
Anyway - sooner or later we will have flash drives instead, and then this isn't a problem.
I was worried about this... and am still unclear (Score:5, Informative)
I just got one of the 1TB WD drives with 64MB of cache that is known to use 4KB sectors.
Here is how it shows up in dmesg:
[ 3.420488] sd 1:0:0:0: [sdb] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
and here's what hdparm -I says:
ATA device, with non-removable media
Model Number: WDC WD10EARS-00Y5B1
Serial Number: WD-WCAV55227529
Firmware Revision: 80.00A80
Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1953525168
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 953869 MBytes
device size with M = 1000*1000: 1000204 MBytes (1000 GB)
cache/buffer size = unknown
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 1
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_B
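The capacity figures in the dmesg and hdparm output above follow directly from the LBA48 sector count and the 512-byte logical sector size. A small Python check, using only numbers that appear in that output (the function name is made up for illustration):

```python
LBA48_SECTORS = 1953525168  # "LBA48 user addressable sectors" from hdparm -I
LOGICAL_SECTOR = 512        # logical sector size reported by the drive


def capacity(sectors, sector_bytes=LOGICAL_SECTOR):
    """Return the capacity in bytes, decimal MB, binary MB and GiB (integer-truncated)."""
    total = sectors * sector_bytes
    return {
        "bytes": total,
        "MB_decimal": total // (1000 * 1000),
        "MB_binary": total // (1024 * 1024),
        "GiB": total // (1024 ** 3),
    }


if __name__ == "__main__":
    print(capacity(LBA48_SECTORS))
```

The results match the 1000204 and 953869 MByte lines from hdparm and the 931 GiB reported by the kernel, which confirms the drive is presenting itself purely in 512-byte units.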
Re:Set 32 sectors per track (Score:5, Informative)
The terminal is not irrelevant. If your Cisco router is ever compromised (it happens) or if IOS becomes corrupt (or if you have an IOS install with a nasty bug where the password does not save correctly, or when an IOS upgrade goes badly) or someone fudges the configuration up, the only way you can recover it is often through the serial port. Serial ports are also very handy for integrating video surveillance with point-of-sales systems that are not IP-aware (or worse, antiquated DVR appliances which can't do POS integration over IP), for some smart switches, *NIX boxes that have been rooted (I've rescued a Solaris box through a serial connection in an enterprise environment where reinstall was not possible due to poor timing - week of finals - and backups were sabotaged by a disgruntled graduate student and logins through IP and at the console were blocked), and so forth. However, I'd rather see RS-485 or RS-422 take RS-232's place, since RS-485 and RS-422 can work over much longer distances and you can hang multiple serial devices off of a single bus.
RS-232 might be absent from a lot of consumer motherboards, but it is far from dead and certainly not irrelevant, even now in 2010.
Re:Set 32 sectors per track (Score:5, Informative)
Actually this problem is potentially much worse on SSD's. Erase blocks are huge, and read-modify-write really sucks on flash.
Couldn't this be addressed (at least in part) by a battery-backed write cache like better RAID controllers use? Set it up like SAN snapshots (so it just stores the diff between what's in the actual flash storage and what's been changed so far), and then write the changed blocks when it's most advantageous (e.g. when there's an entire block's worth of data, so it would all have to be erased by the flash storage anyway).
Maybe combine that with something like a disk defrag, except instead of storing frequently-sequentially-read data in physical sequence, store frequently-written data (regardless of if it's sequentially-read or not) in physical sequence.
That's exactly what most SSD controllers do!
Some now come with 32 to 64MB of cache, and some of the new Sandforce controller based SSDs also come with a little ultracapacitor that acts like a mini UPS. The cache is used as scratch space for reordering writes and defragging blocks.
There was a firmware patch recently for the OCZ Vertex series of SSDs that enabled background defrag. If you let the drive sit there for a few minutes, it would start getting faster until it returned to 'as new' speeds.
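The write-coalescing idea discussed in this subthread can be sketched as a toy model in Python. This is not how any particular controller works. The page and erase-block sizes are assumptions, the class and method names are hypothetical, and it only counts erase cycles rather than modelling real flash.

```python
from collections import defaultdict

PAGE_BYTES = 4096            # assumed page size
PAGES_PER_ERASE_BLOCK = 128  # assumed: 512 KiB erase blocks


class CoalescingCache:
    """Toy write cache that groups pending page writes by erase block."""

    def __init__(self):
        self.pending = defaultdict(dict)  # erase block -> {page index: data}
        self.erase_cycles = 0

    def write(self, page_no, data):
        block, index = divmod(page_no, PAGES_PER_ERASE_BLOCK)
        # A rewrite of the same page just replaces the cached copy.
        self.pending[block][index] = data
        if len(self.pending[block]) == PAGES_PER_ERASE_BLOCK:
            # A whole block's worth is cached, so nothing needs to be read back.
            self.flush(block)

    def flush(self, block):
        if self.pending.pop(block, None):
            self.erase_cycles += 1  # one erase and program per flushed block

    def flush_all(self):
        for block in list(self.pending):
            self.flush(block)
```

In this model, rewriting the same page ten times before a flush costs one erase cycle instead of ten, which is the gain the parent post is describing.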
Re:Poorly researched article. (Score:4, Informative)
This is simply a matter of fdisk from that version of util-linux-ng (which is clearly named in the article) trusting the hardware vendor to specify correct block sizes. The vendor did not. Thus fdisk does not end up with 4k block sizes, as happens for many programs. And only(?) parted apparently contains a workaround that detects the correct block size.
It's not that you can't use parted on Gentoo, though; it is just that in the world of user choices that is Gentoo, not everyone will be using that program or that particular option.
Re:Set 32 sectors per track (Score:3, Informative)
We both agreed that most of windows land involves emailing shit to yourself, and a lot of USB thumb drive use...
Explorer: \\ComputerName\c$\Documents and Settings\UserName\My Documents\
Permissions permitting, this is all you need to do. Or you just share folders.
(Of which I could fire off a good half-hour rant on how poorly windows handles mass storage devices. It's a USB THUMB DRIVE for gods sake. It's not a fucking printer! I want to plug it in, and transfer files to/from it. It doesn't need to be "installed", indexed, and have drivers downloaded for it. Just fucking open a file browser like any sane OS does. )
This is a 10 year old complaint.
I have a hard time working on windows, because I'm so much more efficient with a terminal. It's not that I can't use a gui - I'm just an order of magnitude faster using the terminal.
That and you're not using Windows properly.
a firmware update isn't realistic (Score:3, Informative)
I'll get to why in a second, but first:
RawCHS hasn't meant anything in a decade. The largest drive you can describe with CHS is 8GB.
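The roughly 8GB ceiling is easy to verify with a few lines of Python. The sketch below assumes 512-byte sectors, plugs in the 16383/16/63 geometry from the RawCHS line quoted earlier in the thread, and compares it with the classic 1024/255/63 BIOS-style limit (the latter figures are my assumption, not something from the hdparm output).

```python
SECTOR_BYTES = 512


def chs_capacity(cylinders, heads, sectors_per_track, sector_bytes=SECTOR_BYTES):
    """Bytes addressable with the given cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track * sector_bytes


if __name__ == "__main__":
    ata = chs_capacity(16383, 16, 63)   # geometry the drive reports in RawCHS
    bios = chs_capacity(1024, 255, 63)  # assumed classic BIOS-style limit
    print(ata // SECTOR_BYTES, ata)
    print(bios // SECTOR_BYTES, bios)
```

Both come out at about 8.4 billion bytes, so the 16383/16/63 value is a placeholder meaning "bigger than CHS can express", not a description of the disk.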
Track size hasn't meant anything in even longer than that. When drives went to zone bit recording (ZBR), the number of sectors per track became variable. This happened in about 1989.
The sector size does mean something, but it is the actual sector size, not the sector "grouping" size. If the drive reported a sector size of 4K, then it would expect that the host understand that sectors are actually 4K in size, not 512B in size. But really no major OS supports this, they all expect 512B sectors. That's why these drives internally use one sector size and show another size to the host. And there is no way in the ATA specification for devices to indicate their internal sector size when they are presenting a different external sector size.
So this won't be fixed with a firmware update, unless Vista, 7 and every other major OS is fixed to actually support large sectors presented to the host. Then the drive could be firmware updated to report the large sector size to the host. And the drive would then be completely unusable under any earlier OS or with any USB or FireWire adapter.
Re:Open Source to the rescue (Score:3, Informative)
And...
Total bullshit.
Linux kernel code had flexible block device sector size since the days of 1.x series of kernels. The "problem" is (and always was) with some of the user-space utilities for some of the file systems available under Linux, file systems specifically designed for ... compatibility with DOS and Windows (and through them with the original, ancient IBM PC XT BIOS).
Even then most of the same utilities have various override options that can be used to make them compliant with "unusual" (from the point of view of Windows) sector/block sizes and drive geometries, although it is not their default behavior. The very article you are responding to moans about this very thing, as if it were any news to long-time Linux users.
Windows apologists and their revisionist history are just pathetic.
Re:Poorly researched article. (Score:4, Informative)
I wrote the linked article.
I completely agree that the article is narrowly focused. VERY narrow. My objective was to demonstrate a problem and point out that Linux has not FULLY adapted. I didn't say Linux devs were idiots or that it would never be ready. I was trying to express the idea that Linux [distros in general but perhaps not all] is not QUITE ready for these drives, because not all the tools have fully adapted. Some tools make no mention of any problems in their man pages. Some (like parted's defaults) are even misleading if you mistakenly think that "track aligned" is a good thing.
And I was trying to do that in the very limited number of words I had available for a title.
Also, WD claimed that Linux is unaffected. Some distros probably are, but this could lead people to believe that the statement is universally true, which it isn't. Thus, my overall objective is to educate people to the fact that if they don't know what they're doing, they can get this wrong. There are lots of mistakes I've made where I wished that someone had mentioned some critical fact on a how-to (like, don't use dmraid/fakeraid for RAID1 because reads aren't load-balanced; use mdraid instead). I've filed plenty of bug reports on such issues.
Re:whoops, one thing about RawCHS (Score:3, Informative)
The original reason for aligning to track boundaries (a track is a cylinder-head pair) is that the first four sectors of MS-DOS' IO.SYS (IBMBIO.SYS) had to be contiguous and on a single track.