Data Storage Sun Microsystems IT

ZFS, the Last Word in File Systems? 564

guigouz writes "Sun is carrying a feature story about its new ZFS File System - ZFS, the dynamic new file system in Sun's Solaris 10 Operating System (Solaris OS), will make you forget everything you thought you knew about file systems. ZFS will be available on all Solaris 10 OS-supported platforms, and all existing applications will run with it. Moreover, ZFS complements Sun's storage management portfolio, including the Sun StorEdge QFS software, which is ideal for sharing business data."
This discussion has been archived. No new comments can be posted.

  • Cool but.... (Score:3, Interesting)

    by otis wildflower ( 4889 ) on Thursday September 16, 2004 @12:11PM (#10267179) Homepage
    ... it took them long enough.

    Perhaps they had to rewrite an LVM from scratch in order to opensource it?
  • by joshtimmons ( 241649 ) on Thursday September 16, 2004 @12:13PM (#10267217) Homepage
    We heard earlier that Solaris 10 will be open source.

    I wonder if that means that this filesystem can be included in other kernels.

  • UFS2/SU (Score:3, Interesting)

    by FullMetalAlchemist ( 811118 ) on Thursday September 16, 2004 @12:15PM (#10267239)
    I'm really happy with UFS2/SU, and have been more than happy with the original UFS in general since 1994 when I first started off with NetBSD.
    But, with ZFS, maybe we have finally found an FS to replace it with. I sure look forward to trying Solaris 10, though I'm sure that I will find that SunOS has a better feel to it, like always.

    Maybe DragonflyBSD will be the one to do this, FreeBSD is generally more restrictive to radical changes; for good reasons, you don't get that stability without reason.
  • by LowneWulf ( 210110 ) on Thursday September 16, 2004 @12:16PM (#10267260)
    COME ON! It may be a slow day, but how is this news? There's only one link, and it's to Sun's marketing info.

    Can someone please provide a link to some technical details other than it being 128-bit? What does this file system actually do that is even remotely special? What's under the covers? And, more importantly, does it actually work as described?

  • by Bobo_The_Boinger ( 306158 ) on Thursday September 16, 2004 @12:17PM (#10267272)
    I was concerned about the ability to selectively remove a disk. Say I have 3 disks and ZFS has spread my data all over those three disks. How do I say, "I need to remove disk 2, please move all that data to other disks now."? Just a minor concern really, but something to think about.
  • Re:Open source (Score:5, Interesting)

    by balster neb ( 645686 ) on Thursday September 16, 2004 @12:18PM (#10267283)
    Yes, it does look like it would be open-sourced as part of Solaris 10 (it was mentioned as one of the major new features).

    Assuming Solaris 10 will be true open source (not like Microsoft's "shared source"), as well as GPL compatible, would I be able to use ZFS on my GNU/Linux desktop? Will ZFS be a viable alternative to ext3 and ReiserFS? Or is the overhead too big?
  • by perseguidor ( 777194 ) on Thursday September 16, 2004 @12:18PM (#10267294)

    With traditional volumes, storage is fragmented and stranded. With ZFS' common storage pool, there are no partitions to manage. The combined I/O bandwidth of all of the devices in a storage pool is always available to each file system.

    Until now it does sound just like raid, but:

    When one copy is damaged, ZFS detects it via the checksum and uses another copy to repair it.

    No competing product can do this. Traditional mirrors can only handle total failure of a device. They don't have checksums, so they have no idea when a device returns bad data. So even though mirrors replicate data, they have no way to take advantage of it.

    I guess I just don't get it; I know they are talking about logical corruption and not a physical failure, but this is kind of like raid with something like SMART, or isn't it?

    And what kinds of corruption can there be? Journaling filesystems already work well for write errors and such, or so I thought.

    I know the architecture seems innovative and different (at least for me), but is there really new functionality?

    Sorry if I seem ignorant this time. I don't know if I was able to get my point across; the things this filesystem does, wouldn't they be better left on a different layer?
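The distinction the quoted passages draw, between a mirror that only survives a dead disk and one that notices bad data, can be sketched in a few lines. This is a toy model under stated assumptions, not ZFS's actual design: the class `ChecksummedMirror`, its in-memory "disks", and the choice of SHA-256 are all hypothetical and chosen only for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum used to validate a block (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(data).hexdigest()


class ChecksummedMirror:
    """Toy two-way mirror that keeps a checksum separately from each block."""

    def __init__(self):
        self.copies = [{}, {}]  # two "disks": block id -> bytes
        self.sums = {}          # block id -> expected checksum

    def write(self, block_id, data):
        for disk in self.copies:
            disk[block_id] = data
        self.sums[block_id] = checksum(data)

    def read(self, block_id):
        expected = self.sums[block_id]
        for disk in self.copies:
            data = disk[block_id]
            if checksum(data) == expected:
                # A good copy was found: overwrite any copy that fails the check.
                for other in self.copies:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError("all copies of block %r are corrupt" % (block_id,))


mirror = ChecksummedMirror()
mirror.write(0, b"payroll records")
mirror.copies[0][0] = b"payroll recordz"  # simulate silent corruption on disk 0
repaired = mirror.read(0)
```

A plain mirror would have returned the corrupted bytes from disk 0 without complaint, because the disk itself reported no error. Here the read falls through to the second copy and rewrites the first.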
  • Re:Hmf. (Score:5, Interesting)

    by elmegil ( 12001 ) on Thursday September 16, 2004 @12:21PM (#10267335) Homepage Journal
    "You'll never need more than 640K of memory". The point would be to be ready as storage densities increase. In the last 8 years we've gone from a terabyte filling a room to a terabyte on a desktop, and I'm sure there are more density breakthroughs coming.

    It's your density, Luke.

  • Curious points (Score:4, Interesting)

    by tod_miller ( 792541 ) on Thursday September 16, 2004 @12:25PM (#10267392) Journal
    "Sun's patent-pending "adaptive endian-ness" technology"

    ok, that aside. First 128bit file system, and get this: transactional object model

    I think this means it is optimistic but they figure it has blazing fast performance, who am I to argue. Fed up with killing this indexing garbage on the work machine, bloody microsoft, disabled it and everything and every full moon it seems to come out and graze on my HDD platter.

    From the MS article : This perfect storm is comprised of three forces joining together: hardware advancements, leaps in the amount of digitally born data, and the explosion of schemas and standards in information management.

    Then I started to suspect they would rant about Moore's law and sure e-bloody-nough

    Everyone knows Moore's law--the number of transistors on a chip doubles every 18 months. What a lot of people forget is that network bandwidth and storage technologies are growing at an even faster pace than Moore's law would suggest.

    That is like saying, everyone knows the number 9 bus comes at half 3 on Wednesdays, but no one expects 3 taxis sat there doing nothing at half past 3 on a Tuesday.

    Can we put this madness to rest? Ok back to the articles.

    erm... lost track now....
  • Shared data pools... (Score:4, Interesting)

    by vspazv ( 578657 ) on Thursday September 16, 2004 @12:37PM (#10267539)
    So what are the chances that someone could accidentally wipe the shared data pool for an entire company and how hard is recovery on a volume striped across a few hundred hard drives?
  • by twitter ( 104583 ) on Thursday September 16, 2004 @12:37PM (#10267540) Homepage Journal
    Opensource is useless when it's patent encumbered. While it's nice that the details will be available, it sucks to think that I can't use them except to serve Sun for the next 17 years. Such disclosure, of course, is what the patent system is supposed to provide but does not. What the patent is providing is ownership of ideas. How obvious those ideas are and if there's prior art is impossible to say from the linked puff piece.

    This article is shocking. I'm used to much less hype and far more technical details from Sun. Software patents and bullshit are not what I expect when I follow a link to them.

    I don't like any of this.

  • Re:billion billion? (Score:1, Interesting)

    by Anonymous Coward on Thursday September 16, 2004 @12:43PM (#10267627)
    Back on topic, so what licenses is ZFS compatible with? Is it OK for Linux systems to read them, or is it like FAT32, where it's just a landmine to sucker the Linux community into stepping on more IP risk?

  • by Anonymous Coward on Thursday September 16, 2004 @12:46PM (#10267676)
    For that matter, is anyone sure a 128 bit FS is even needed? 2^64 is an extraordinarily big number - a few years back, when I was a BeOS fanatic, I read an essay estimating the total amount of data generated by the human race in all of history, and concluding that all of it could be put into a 64-bit FS like BeFS without coming anywhere close to the theoretical storage capacity (i.e. using something like a millionth or a billionth of the potential space, don't remember the details). We are very, very far away from having any storage device that cannot be indexed with 64 bits. It may well be true that a device that outstrips the abilities of a 128 bit FS is a physical impossibility. For comparison, I remember hearing that IPv6 (a 128-bit scheme) could provide 5,000 distinct IP addresses for every atom on the surface of the earth.
  • by Too Much Noise ( 755847 ) on Thursday September 16, 2004 @12:49PM (#10267695) Journal
    The funny thing is, by the time a 128-bit FS is really needed, any patents Sun has on ZFS will have expired. So whatever that day's Open Source OS of choice will be, it will at least support ZFS (and probably that time's 128-bit incarnation of several of today's FS's).

    Somehow, an alternate history where the 80286 was 64-bit instead of 16-bit (with everything else staying the same) comes to mind when reading Sun's marketing on this.
  • Re:Open source (Score:5, Interesting)

    by GileadGreene ( 539584 ) on Thursday September 16, 2004 @12:50PM (#10267708) Homepage
    From the article:
    More important, ZFS is endian-neutral. You can easily move disks from a SPARC server to an x86 server. Neither architecture pays a byte-swapping tax due to Sun's patent-pending "adaptive endian-ness" technology, which is unique to ZFS. [emphasis mine]
    So while it might be open-sourced, you're not likely to see it migrating to Linux or the BSDs any time soon.
  • Re:Hmf. (Score:2, Interesting)

    by BJH ( 11355 ) on Thursday September 16, 2004 @12:51PM (#10267718)
    The limitation on storage systems is, and has been for a while, the speed of transferring data in and out of the system, rather than the overall capacity.

    The highest-speed systems currently available can (maybe) transfer data at 300MB/s or so. To transfer a dataset that fills a 40-bit address space (2^40 bytes, about a terabyte), it'd take approximately an hour. A dataset that fills a 64-bit space is more than 16 million times as large - which means it'd take nearly two millennia to transfer on today's best systems.

    Even if transfer rates are increased by two orders of magnitude (effectively unthinkable for the foreseeable future without the development of entirely new and currently unknown technologies), you've still only reduced that time from 2000 years to 20 years.
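The arithmetic in this comment is easy to check. The sketch below assumes the "40 bits" and "64 bits" figures mean address spaces of 2^40 and 2^64 bytes, and takes 300 MB/s as 300 × 10^6 bytes per second; the function name `transfer_seconds` is made up for illustration.

```python
RATE = 300 * 10**6                     # bytes per second (the 300MB/s quoted above)
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_seconds(address_bits, rate=RATE):
    """Time to move 2**address_bits bytes at a fixed transfer rate."""
    return 2**address_bits / rate


hours_40 = transfer_seconds(40) / SECONDS_PER_HOUR   # roughly one hour
years_64 = transfer_seconds(64) / SECONDS_PER_YEAR   # a little under 2000 years
ratio = 2**64 // 2**40                               # 2**24, about 16.8 million

# Two orders of magnitude faster still leaves roughly two decades.
years_64_fast = transfer_seconds(64, rate=RATE * 100) / SECONDS_PER_YEAR
```

Under those assumptions the comment's figures hold up: about an hour for 2^40 bytes, just under two millennia for 2^64 bytes, and about 20 years at a hundred times the speed.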
  • British or American? (Score:3, Interesting)

    by abb3w ( 696381 ) on Thursday September 16, 2004 @12:54PM (#10267767) Journal
    Billion billion is a perfectly valid number.

    True. However, it is more ambiguous than "million million million", as absent-minded Brits might interpret it as a "million million million million".

    Or would you rather they say 6.0 × 10^18?


    Most people can't imagine that.

    Most people can't imagine it anyway, whether you call it "six billion billion", "6.0 x 10^18", "6 x 2^60", or "1.27 x e^43". Or understand any number higher than the number of dollars they carry in their wallet, for that matter. Anyone who needs to make any decisions in life based on this ZFS number ought to be able to understand it any of those ways (although getting help from a calculator for the last one or even two is understandable). Of course, many people manage things they can't understand. This is life.

  • Re:billion billion? (Score:4, Interesting)

    by mikael ( 484 ) on Thursday September 16, 2004 @12:58PM (#10267820)
    Billion billion is a perfectly valid number. Or would you rather they say 6.0 × 10^18?

    Whenever I see or hear the word "billion", the first thing I ask is: is that a US billion or a British billion?

    "six times ten raised to the power of eighteen" seems much more clear and precise.
  • Re:Hmf. (Score:2, Interesting)

    by backslashdot ( 95548 ) on Thursday September 16, 2004 @01:04PM (#10267880)
    That freaking 640k quote is overused!

    It would have been ridiculous AT THE TIME to address more data.. CPUs and software weren't there yet.

    Look, there are limits to the amount of stuff people need! Yeah, so 640k wasn't enough; that doesn't mean 6 billion terabytes isn't going to be enough for you tomorrow.

    You know what .. why not have 512 bit file systems? Or 1024 bit filesystems? After all .. they said 640k would be enough for everyone .. and what happened? Global chaos and economic meltdown. Surely we need to prevent that from happening again. Oh yeah what's that? It never happened. The world still rotates.
  • by kcbrown ( 7426 ) on Thursday September 16, 2004 @01:28PM (#10268197)
    There are several FS like this, but you don't know of them because they require completely new FS API to work with.

    Why is that? There's nothing inherently impossible about having the OS remember, via a transaction log, the changes that have taken place to a set of files made by a process, and then either committing them all or rolling back all of them at process exit time (or whenever the process does a commit() or rollback()). The file operations themselves can be identical, so all you really need are those 4 additional operations I mentioned previously.
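A minimal sketch of the idea above, assuming an in-memory stand-in for the filesystem: writes made inside a transaction land in a log and reach the committed state only on commit. The class `TxFiles` and the names `begin`, `commit`, and `rollback` are hypothetical; they are not claimed to match the operations listed in the parent comment or any real OS API.

```python
class TxFiles:
    """Toy filesystem where ordinary writes can be grouped into a transaction."""

    def __init__(self):
        self.files = {}   # committed state: path -> contents
        self.log = None   # pending changes while a transaction is open

    def begin(self):
        self.log = {}

    def write(self, path, data):
        # The write call is the same with or without a transaction;
        # inside one, the change is only recorded in the log.
        if self.log is None:
            self.files[path] = data
        else:
            self.log[path] = data

    def read(self, path):
        # A process sees its own uncommitted writes first.
        if self.log is not None and path in self.log:
            return self.log[path]
        return self.files[path]

    def commit(self):
        if self.log is not None:
            self.files.update(self.log)
        self.log = None

    def rollback(self):
        self.log = None


fs = TxFiles()
fs.write("a.txt", "v1")

fs.begin()
fs.write("a.txt", "v2")
fs.write("b.txt", "new")
fs.rollback()                    # both changes vanish together
after_rollback = dict(fs.files)

fs.begin()
fs.write("a.txt", "v2")
fs.write("b.txt", "new")
fs.commit()                      # both changes appear together
after_commit = dict(fs.files)
```

The sketch ignores the hard parts a real implementation would face, such as concurrent transactions from other processes and durability of the log across a crash.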

  • by majid ( 306017 ) on Thursday September 16, 2004 @01:30PM (#10268230) Homepage
    I was in a chat session with their engineers yesterday. It looks like they have adaptive disk scheduling algorithms to balance the load across the drives (e.g. if a drive is faster than others, it will get correspondingly more I/O). The scheduler also tries to balance I/O among processes and filesystems sharing the data pool.

    This is a good thing - queueing theory shows a single unified pool has better performance than several smaller ones. People who try to tune databases by dedicating drives to redo logs don't usually realize what they are doing is counterproductive - they optimize locally for one area, at the expense of global throughput for the entire system.

    ZFS uses copy-on-write (a modified block is written wherever the disk head happens to be, not where the old one used to be). This means writes are sequential (as with all journaled filesystems) and also since the old block is still on disk (until it is garbage collected) this gives the ability to take snapshots, something that is vital for making coherent backups now that nightly maintenance windows are mostly history. This also leads to file fragmentation so enough RAM to have a good buffer cache helps.

    Because the scheduler works best if it has full visibility of every physical disk, rather than dealing with an abstract LUN on a hardware RAID, they actually recommend ZFS be hosted on a JBOD array (just a bunch of disks, no RAID) and have the RAID be done in software by ZFS. Since the RAID is integrated with the filesystem, they have scope for optimizations that are not available if you have a filesystem trying to optimize on one side and a RAID controller (or separate LVM software) on the other side. Network Appliance does something like this with their WAFL network filesystem to offer decent performance despite the overhead of NFS.

    With modern, fast CPUs, software RAID can easily outperform hardware RAID. It is quite common for optimizations like hardware RAID made at a certain time to become counterproductive as technology advances and the assumptions behind the premature optimization are no longer valid. A long time ago, IBM offloaded some database access code in its mainframe disk controllers. It used to be a speed boost, but as the mainframe CPU speeds improved (and the feature was retained for backward compatibility), it ended up being 10 times slower than the alternative approach.
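The link between copy-on-write and cheap snapshots described earlier in this comment can be illustrated with a toy model. Everything here is an assumption made for the sketch: the class `CowVolume`, the append-only list standing in for the disk, and the block map are hypothetical and say nothing about ZFS's real on-disk structures.

```python
class CowVolume:
    """Toy copy-on-write volume: a write never overwrites an existing block."""

    def __init__(self):
        self.store = []       # append-only block store standing in for the disk
        self.live = {}        # logical block number -> index into store
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, block_no, data):
        self.store.append(data)                    # new data goes to a fresh location
        self.live[block_no] = len(self.store) - 1  # only the map entry changes

    def read(self, block_no, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[table[block_no]]

    def snapshot(self, name):
        # Only the block map is copied; the data blocks themselves are shared.
        self.snapshots[name] = dict(self.live)


vol = CowVolume()
vol.write(0, b"old contents")
vol.snapshot("nightly")
vol.write(0, b"new contents")

current = vol.read(0)
backed_up = vol.read(0, snapshot="nightly")
```

Because the old block is still in the store after the second write, the snapshot keeps a coherent view at no extra copying cost. The sketch never reclaims old blocks; a real system would free them once no snapshot refers to them.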
  • by pla ( 258480 ) on Thursday September 16, 2004 @01:32PM (#10268253) Journal
    Data integrity. Apparently it uses file checksums to error-correct files, so files will never be corrupted. About time someone did this.

    So, I take it that back in the days of DOS, you never got a CRC error trying to copy an important file off a floppy?
  • by Anonymous Writer ( 746272 ) on Thursday September 16, 2004 @01:32PM (#10268260)

    Opensource is useless when it's patent encumbered.

    The GPL states the following...

    Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

    I thought that if the patent holder distributes patented material under the GPL, it is a declaration that the holder has relinquished control over the patented material for as long as it is applied under the GPL.

  • Easy upgrades (Score:2, Interesting)

    by dTb ( 304368 ) on Thursday September 16, 2004 @01:33PM (#10268278)
    I am very impressed by some of the ideas coming from Sun regarding this file system:

    "We're absolutely trying to make disk storage more like memory, and often use that analogy in our presentations. For example, when you add DIMMS to your computer, you don't run some 'dimmconfig' program or worry about how the new memory will be allocated to various applications; the computer just does the right thing. Applications don't have to worry about where their memory comes from. Likewise with ZFS, when you add new disks to the system, their space is available to any ZFS filesystems, without the need for any further configuration. In most scenarios it's fairly straightforward for the software to make the unequivocally best choices about how to use the storage. If you want to tell the system more about how you want the storage used, you'll be able to do that too (e.g. this data should be mirrored but that not; it's more important for this data to be accessed quickly but that can be slower). We hope that with relatively modern hardware, all but the most complicated and demanding configurations will be handled adequately without any administrator intervention."

  • by Dracolytch ( 714699 ) on Thursday September 16, 2004 @01:39PM (#10268365) Homepage
    Methinks you don't understand how insanely large 128 bits is.

    340282367000000000000000000000000000000 files.
    My first computer was about.. here ^
    My system is about... here ^
    And this... ^

    A gross overestimate of every file on every computer on the internet today (250 million computers, 5 million files per computer).
    Yep. I think they might be right on this one.
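The scale comparison above can be reproduced with exact integer arithmetic. The machine and file counts are the comment's own rough estimates (250 million computers, 5 million files each); the variable names are made up for this sketch.

```python
total_128 = 2**128                          # distinct values in a 128-bit space
estimate_files = 250 * 10**6 * 5 * 10**6    # 250 million computers x 5 million files
fraction = estimate_files / total_128       # share of the space that estimate uses

digits = len(str(total_128))                # a 39-digit number, as written above
```

Even that deliberately inflated estimate of every file on the internet occupies a vanishingly small fraction of a 128-bit space, far less than one part in 10^23.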

  • Re:fileless systems (Score:5, Interesting)

    by Tony ( 765 ) on Thursday September 16, 2004 @01:49PM (#10268498) Journal
    After years of everyone saying that the relational model was the answer to all data organization needs... the hierarchical model reappeared in the form of XML, and people realized that it is convenient to organize some types of data hierarchically.

    Convenient, and flawed.

    XML isn't designed to handle changing data. It's designed to be a data markup language, which indicates it's used for presenting data, not managing data.

    So far, the relational model is the best mathematically-rigorous method of managing sets of data. There are many advantages to hierarchical data representation, but for manipulation, the relational still trumps.

    Do I want to use SQL to access my files? Not if I don't have to. There are perhaps better methods, even some transparent methods.

    But, do I want to continue to self-organize my data? Hell, no! There's just too much information stored on my computer, and on my network, these days. And, considering that much of my data has multiple relationships, the hierarchical model is growing a bit long in the tooth. Many of my documents belong in multiple hierarchies.

    But, there might be a real solution soon:

    Gnome Storage looks to be a good first step.
  • by laird ( 2705 ) on Thursday September 16, 2004 @01:53PM (#10268543) Journal
    "It would take over 500 years to fill a 64 bit filesystem written at 1GB/sec"

    This is about the same argument as IPv6 addressing: it's expensive to change the size of the address space, so make it absurdly large because bits of address space are cheap, you enable some interesting unforeseen applications, and you put off a forced migration.

    While I agree that 128-bit block addressing is overkill for a single computer, once you're going to expand past a 64-bit filesystem, there's not much point in going smaller than a 128-bit filesystem. It's not like you'd save money making it an 80-bit filesystem.

    As to your point about the speed of a hard drive vs. the addressable space in the filesystem, keep in mind that filesystems are much larger than disks. For example, it's not that unusual (in cooler UNIX environments) for everyone in a company to work in one large distributed filesystem, which may run across hundreds or thousands of hard drives. Now imagine a building full of people working with very large files (e.g. video production) where you could easily accumulate terabytes of data. Wouldn't it be nice to manage your online, nearline, and offline storage as a single, extremely large filesystem? Or, for real blue-sky thinking, imagine that everyone on the planet uses a single shared, distributed filesystem for everything. Wouldn't it be cool to address _everything_ using a single, consistent scheme no matter where you are? Cool, eh?
  • by pikine ( 771084 ) on Thursday September 16, 2004 @02:28PM (#10268974) Journal
    One of the key features of ZFS is that you can create a file system over a pool of storage. Nothing stops you from building a distributed storage pool of 18.3 million desktop drives (they don't have to be locally connected). You could apply the same concept as SETI@HOME and allow end users with excess storage space to lend it. Didn't someone talk about a peer to peer backup system a while ago?

    And c'mon, don't be so against hype. Not all numbers are evil. And the overhead of processing some extra bits is minuscule. The space and time required grow logarithmically with the size of the number set. E.g., a 128-bit space is some billion billion times the size of a 64-bit space, but an address takes only 2 times as much to store and process. And this time is already small compared to the actual I/O time, and the space small compared to the combined storage space.
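The logarithmic point can be made concrete: the number of bits an address needs grows with the logarithm of the size of the space, so doubling the address width squares the number of addressable items. A short sketch, with variable names invented for illustration:

```python
space_64 = 2**64
space_128 = 2**128

# Bits needed to hold the largest address in each space.
bits_64 = (space_64 - 1).bit_length()
bits_128 = (space_128 - 1).bit_length()

growth_in_space = space_128 // space_64   # how many times larger the 128-bit space is
growth_in_width = bits_128 / bits_64      # how many times wider each address is
```

The space grows by a factor of 2^64 (about 18 billion billion) while each stored address only doubles, from 8 bytes to 16.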
  • by Mornelithe ( 83633 ) on Thursday September 16, 2004 @08:11PM (#10272848)
    Online, nearline and offline storage aren't the same address space. I have several partitions on my computer. Some use reiserfs, some use ext3 and some use ext2 (and vfat and ntfs...). In Linux, they're all mounted to look like one large filesystem hierarchy, but they're not. Each partition has its own filesystem 'address space.'

    So you don't need larger than a 64 bit filesystem unless you're going to have a single volume (real or virtual) that uses more than 16 billion terabytes of data. That's 64 billion 250 gig hard drives. What's the population of China these days? 2.5 billion or thereabouts? If you gave everyone in China 25 250 gigabyte hard drives, you'd come close to filling up a 64 bit filesystem (you'd fall short actually).

    And that's only if everyone in China uses a single, giant RAID array for those 64 billion hard drives.

    Or everyone on the planet gets 9 such hard drives. That's 1.75 terabytes for every single human being right now, and we're still within the limits of a 64 bit filesystem.

    Your video editing analogy doesn't even come close, and the idea of a whole country using a single, centralized volume (let alone the whole planet) doesn't really make any sense. Addressing all the data in the entire world on every computer at the filesystem level seems like a very bad idea, to me.

    Maybe in 10 to 15 years we'll have individual disks large enough so that large clusters can exceed the bounds of a 64 bit filesystem, but you'll still have to buy entirely new hardware to take advantage of that capability, so a 128 bit filesystem on today's hardware offers no advantages over a 64 bit filesystem, and in fact only makes things slower. Not really very cool at all if you ask me (although the other features of the filesystem likely have merit).
