Data Storage Hardware

Garbage Collection Algorithms Coming For SSDs 156

MojoKid writes "A common concern with the current crop of Solid State Drives is the performance penalty associated with block-rewriting. Flash memory is composed of cells that usually contain 4KB pages arranged in blocks of 512KB. When a cell is unused, data can be written to it relatively quickly. But if a cell already contains some data, even if it fills only a single page in the block, the entire block must be re-written. This means that whatever data is already present in the block must be read, then combined with or replaced by the new data, and the entire block re-written. This process takes much longer than simply writing data straight to an empty block. This isn't a concern on fresh, new SSDs, but over time, as files are written, moved, deleted, or replaced, many blocks are left holding what is essentially orphaned or garbage data, and their long-term performance degrades because of it. To mitigate this problem, virtually all SSD manufacturers have incorporated, or soon will incorporate, garbage collection schemes into their SSD firmware that actively seek out and remove the garbage data. OCZ, in combination with Indilinx, is poised to release new firmware for their entire line-up of Vertex Series SSDs that performs active garbage collection while the drives are idle, in order to restore performance to like-new condition, even on a severely 'dirtied' drive."
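As a concrete illustration of that read-combine-rewrite penalty, here is a minimal Python sketch. The page/block geometry matches the summary (128 x 4KB pages per 512KB block), but the cost constants and the class itself are made up for illustration and are not taken from any real firmware.

    # Illustrative model of the block-rewrite penalty described in the summary.
    # Cost constants are arbitrary units, not real device timings.
    PAGES_PER_BLOCK = 128        # 128 x 4KB pages = one 512KB erase block
    COST_PAGE_WRITE = 1
    COST_PAGE_READ = 0.25
    COST_BLOCK_ERASE = 20

    class Block:
        def __init__(self):
            self.pages = [None] * PAGES_PER_BLOCK   # None = erased page

        def write_page(self, index, data):
            """Fast path: program an erased page. Slow path: the whole block
            must be read out, erased, and reprogrammed with the change merged in."""
            if self.pages[index] is None:
                self.pages[index] = data
                return COST_PAGE_WRITE
            snapshot = list(self.pages)              # read every page in the block
            snapshot[index] = data                   # combine/replace the data
            self.pages = snapshot                    # erase + rewrite the whole block
            return (PAGES_PER_BLOCK * COST_PAGE_READ
                    + COST_BLOCK_ERASE
                    + PAGES_PER_BLOCK * COST_PAGE_WRITE)

    blk = Block()
    print("write to an erased page:", blk.write_page(0, b"new"))      # 1 unit
    print("rewrite an occupied page:", blk.write_page(0, b"update"))  # 180 units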
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Friday August 07, 2009 @10:26PM (#28993197)

    This is the third generation; the second was to fix speed degradation caused by fragmentation.

  • by mattventura ( 1408229 ) on Friday August 07, 2009 @10:30PM (#28993217) Homepage
    I think it ends up being like NCQ. The drive's processor can be much more specialized and can do the processing much more efficiently. Not to mention, it might require standards to be changed, since some buses (like USB, IIRC) don't provide commands to zero out a sector at a low level. On an SSD, just writing a sector full of zeros doesn't work the same as blanking the memory; it just makes the drive use a still-blank sector for the next write to that sector. The problem only comes when you run out of blank sectors.
  • by zach297 ( 1426339 ) on Friday August 07, 2009 @10:48PM (#28993301)
    From the summary: "This isn't a concern on fresh, new SSDs, but over time, as files are written, moved, deleted, or replaced, many blocks are left holding what is essentially orphaned or garbage data, and their long-term performance degrades because of it." They are talking about clearing sectors of garbage data that is no longer in use. That would have to be done anyway before the sector can be reused; the new firmware is simply doing that time-consuming step early, while the drive is idle. The actual number of write cycles is not changing.
  • Re:Filesystem info (Score:5, Informative)

    by blaster ( 24183 ) on Friday August 07, 2009 @10:51PM (#28993317)

    There is an extension that was recently added to ATA: the TRIM command. TRIM allows an OS to tell the drive that a block's data is no longer useful and can be disposed of. No production firmwares support it yet, but several beta firmwares do. There are also patches for the Linux kernel that add support to the block layer, along with appropriate support in most filesystems. Windows 7 also has support for it.

    There is a lot of confusion about this on the OCZ boards, with people thinking GC somehow magically obviates the need for TRIM. As you pointed out, the GC doesn't know what is data and what is not with respect to files deleted in the FS. I wrote a blog post [blogspot.com] (with pictures and everything) explaining this just a few days ago.
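    To make that distinction concrete, here is a hedged Python sketch of a toy mapping table: garbage collection on its own can only reclaim pages the drive already knows were superseded by later writes, while TRIM lets the filesystem flag pages that still look live to the drive but belong to deleted files. The structure and names are invented for illustration, not the ATA spec or any real firmware.

        # Toy flash translation layer (FTL) mapping table -- illustrative only.
        class ToyFTL:
            def __init__(self):
                self.lba_to_page = {}      # logical block address -> physical page
                self.invalid_pages = set() # physical pages holding garbage
                self.next_page = 0

            def write(self, lba, data):
                """Every write goes to a fresh page; the old page (if any)
                becomes garbage the GC can reclaim on its own.
                (The data itself is not modeled here.)"""
                old = self.lba_to_page.get(lba)
                if old is not None:
                    self.invalid_pages.add(old)
                self.lba_to_page[lba] = self.next_page
                self.next_page += 1

            def trim(self, lbas):
                """Filesystem tells the drive these LBAs no longer hold useful
                data (roughly what ATA TRIM / Data Set Management conveys)."""
                for lba in lbas:
                    page = self.lba_to_page.pop(lba, None)
                    if page is not None:
                        self.invalid_pages.add(page)   # now GC may reclaim it too

        ftl = ToyFTL()
        ftl.write(1000, b"file data")
        # Without TRIM, deleting the file in the OS changes nothing inside the
        # drive: the page for LBA 1000 still looks live. With TRIM it becomes
        # reclaimable garbage.
        ftl.trim([1000])
        print(ftl.invalid_pages)   # {0}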

  • Re:Filesystem info (Score:4, Informative)

    by Wesley Felter ( 138342 ) <wesley@felter.org> on Friday August 07, 2009 @10:52PM (#28993325) Homepage

    You're about two months ahead of the times. The ATA TRIM command will allow the filesystem to tell the SSD which sectors are used and which are unused. The SSD won't have to preserve any data in unused sectors.

  • by CountOfJesusChristo ( 1523057 ) on Friday August 07, 2009 @11:04PM (#28993375)
    So, I delete a file off of a drive such that the filesystem no longer holds any references to the given data, and the firmware moves in and performs operations to improve the performance of the device. It's not really rearranging files into contiguous sections like defragmentation does; it's restoring unused sections to an empty state, probably using an algorithm similar to many garbage collectors -- sounds like garbage collection to me.
  • Re:At what cost? (Score:5, Informative)

    by natehoy ( 1608657 ) on Friday August 07, 2009 @11:09PM (#28993397) Journal

    Simple. Well, not really, but...

    SSDs can be written to in small increments, but can only be erased in larger increments. So you've got a really tiny pencil lead that can write data or scribble an "X" in an area to say the data is no longer valid, but a huge eraser that can only erase good-sized areas at a time, and you can't re-write an area until it's been erased. There's a good explanation for this that involves addressing and pinouts of flash chips, but I'm going to skip it to keep things simple. Little pencil lead, big eraser.

    Let's call the small increment (what you can write to) a "block" and the larger increment (what you can erase) a "chunk". There are, say, 512 "blocks" to a "chunk".

    So when a small amount of data is changed, the drive writes the changed data to a new block, then marks the old block as "unused". When all the blocks in a chunk are unused, the entire chunk can then be safely wiped clean. Until that happens, if you erase a chunk, you lose some data. So as time goes on, each chunk will tend to be a mix of current data, obsolete data, and empty blocks that can still be written to. Eventually, you'll end up with all obsolete data in each chunk, and you can wipe it.

    However, it's going to be rare that ALL the blocks in a chunk get marked as unused. For the most part, there will be some more static data (beginnings of files, OS files, etc.) that changes less, and some dynamic data (endings of files, swap/temp files, frequently-edited stuff) that changes more. You can't reasonably predict which parts are which, even if the OS were aware of the architecture of the disk, because a lot of things change on drives. So you end up with a bunch of chunks that have some good data and some obsolete data. The blocks are clearly marked, but you can't write on an obsolete block without erasing it, and you can't erase a single block - you have to erase the whole chunk.

    To fix this, SSDs take all the "good" (current) data out of a bunch of partly-used chunks and write it to a new chunk or set of chunks, then mark the originals as obsolete. The data is safe, and it's been consolidated so there are fewer unusable blocks on the drive. Nifty, except...

    You can only erase each chunk a certain number of times before it dies. Flash memory tolerates reads VERY well. Erases, not so much.

    So if you spend all of your time optimizing the drive, you're moving data around unnecessarily and doing a LOT of extra erases, shortening the drive's life.

    But if you wait until you are running low on free blocks before you start freeing up space (which maximizes the lifespan of the drive), you'll run into severe slowdowns where the drive has to make room for the data you want to write, even if the drive is sitting there almost empty from the user's perspective.

    So, SSD design has to balance between keeping the drive as clean and fast as possible at a cost of drive life, or making the drive last as long as possible but not performing at peak all the time.

    There are certain things you can do to benefit both, such as putting really static data into complete chunks where it's less likely to be mixed with extremely dynamic data. But overall, the designer has to choose somewhere on the continuum of "lasts a long time" and "runs really fast".
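    A minimal sketch of the consolidation step described above, keeping the same block/chunk vocabulary; the sizes and policy are invented for illustration, not any vendor's firmware.

        # Consolidation: copy the still-valid blocks out of partly-used chunks
        # into fresh chunks, so the old chunks can be erased and reused.
        BLOCKS_PER_CHUNK = 512

        def consolidate(chunks):
            """chunks: list of chunks, each a list whose entries are live data
            or None (obsolete). Returns (new_chunks, erases_spent)."""
            survivors = [b for chunk in chunks for b in chunk if b is not None]
            new_chunks = [survivors[i:i + BLOCKS_PER_CHUNK]
                          for i in range(0, len(survivors), BLOCKS_PER_CHUNK)]
            return new_chunks, len(chunks)      # every source chunk gets erased

        # Two half-obsolete chunks collapse into one full chunk, at the cost of
        # two erases -- the lifespan-vs-speed tradeoff described above.
        half_dirty = ["data"] * 256 + [None] * 256
        new_chunks, erases = consolidate([half_dirty, list(half_dirty)])
        print(len(new_chunks), "chunk(s) after GC,", erases, "erases spent")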

  • by natehoy ( 1608657 ) on Saturday August 08, 2009 @12:01AM (#28993673) Journal

    Right, but recall that SSD can only be erased in large blocks, though it can be written to in smaller ones. Erases are what eventually kill a block.

    So if I take a block that has only 25% garbage and I want to wipe it, I have to copy the good data over to another block somewhere before I can do that. So I've written 3/4 of a wipable sector's worth of data to a new sector to get rid of the 25% of garbage. Do that a lot, and you do a lot of unnecessary erases and the drive dies faster.

    If, instead, you take a sector that is 90% garbage, you only have to use 10% of a new sector to move off the good stuff before you can wipe it. So if you want the drive to last as long as possible, do garbage collection only when absolutely necessary.

    But allow garbage to grow too high, and you'll have to tell the operating system to wait while you rearrange data to make room when a write request comes in for a large file.

    Do you want the drive to be neatly optimized with no garbage all the time, or do you want the drive to last? I'm not saying one answer is more or less right than the other, but it's a tradeoff.
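    Putting the parent's 25%-vs-90% example into rough numbers (my own back-of-envelope arithmetic, not from the article): to reclaim space from a chunk that is a fraction g garbage, you first copy out the 1 - g of valid data, so each unit of reclaimed space costs about (1 - g) / g units of extra copying.

        # Rough copy cost of GC per unit of space reclaimed, as a function of
        # how much of the victim chunk is garbage. Back-of-envelope only.
        def copy_cost_per_unit_reclaimed(garbage_fraction):
            valid_fraction = 1.0 - garbage_fraction
            return valid_fraction / garbage_fraction

        for g in (0.25, 0.90):
            print(f"{g:.0%} garbage -> copy {copy_cost_per_unit_reclaimed(g):.2f} "
                  f"units of valid data per unit reclaimed")
        # 25% garbage -> copy 3.00 units per unit reclaimed (many extra erases)
        # 90% garbage -> copy 0.11 units per unit reclaimed (gentler on the flash)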

  • by Hal_Porter ( 817932 ) on Saturday August 08, 2009 @12:06AM (#28993685)

    How does the firmware know what sectors are empty if it doesn't understand this stuff?

    I am curious how it works, if it doesn't need knowledge of the filesystem. FAT, NTFS, UFS, EXT2/3/4, ZFS, etc are all very different.

    The filesystem tells the SSD "LBAs x to y are now not in use" using the ATA TRIM command.

    http://www.theregister.co.uk/2009/05/06/win_7_ssd/ [theregister.co.uk]

    Over-provisioned SSDs have ready-deleted blocks, which are used to store bursts of incoming writes and so avoid the need for erase cycles. Another tactic is to wait until files are to be deleted before committing the random writes to the SSD. This can be accomplished with a Trim operation. There is a Trim aspect of the ATA protocol's Data Set Management command, and SSDs can tell Windows 7 that they support this Trim attribute. In that case the NTFS file system will tell the ATA driver to erase pages (blocks) when a file using them is deleted.

    The SSD controller can then accumulate blocks of deleted SSD cells ready to be used for writes. Hopefully this erase on file delete will ensure a large enough supply of erase blocks to let random writes take place without a preliminary erase cycle.

    Actually, I used to work on an embedded system that used M-Systems' TrueFFS. There, the flash translation layer actually understood FAT well enough to work out when a cluster was freed. I.e., it knew where the FAT was, and whenever the FAT was written it would check for clusters being marked free, at which point it would mark them as garbage internally.
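    As a rough illustration of the TrueFFS trick described above (watching writes to the FAT and diffing it to spot freed clusters), here is a hedged Python sketch; the flat-list "FAT" and the hook name are simplifications of my own, not the real TrueFFS code.

        # Sketch of a FAT-aware flash translation layer: when the region holding
        # the FAT is rewritten, diff old vs. new entries and mark newly freed
        # clusters as internal garbage.
        FAT_FREE = 0x0000   # a zero entry marks a free cluster in FAT12/16/32

        class FatAwareFTL:
            def __init__(self, n_clusters):
                self.fat = [FAT_FREE] * n_clusters   # last FAT contents seen
                self.garbage_clusters = set()

            def on_fat_write(self, new_fat):
                """Called whenever the host rewrites the FAT region of the disk."""
                for cluster, (old, new) in enumerate(zip(self.fat, new_fat)):
                    if old != FAT_FREE and new == FAT_FREE:
                        # cluster was just freed by the filesystem
                        self.garbage_clusters.add(cluster)
                self.fat = list(new_fat)

        ftl = FatAwareFTL(8)
        ftl.on_fat_write([2, 3, 0xFFFF, 0, 0, 0, 0, 0])   # file uses clusters 0-2
        ftl.on_fat_write([0, 0, 0,      0, 0, 0, 0, 0])   # file deleted
        print(sorted(ftl.garbage_clusters))               # [0, 1, 2]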

  • by thePig ( 964303 ) <rajmohan_h @ y a h oo.com> on Saturday August 08, 2009 @12:10AM (#28993701) Journal

    In the article it says

    But if a cell already contains some data--no matter how little, even if it fills only a single page in the block--the entire block must be re-written

    Is this correct?
    From what I read on AnandTech [anandtech.com], it looked like we need not rewrite the entire block unless the available data is less than total - (obsolete + valid) data.

    Also, the article is light on details. How are they doing the GC? Do they wait until the overall performance of the disk drops below a threshold and then rewrite everything in a single stretch, or do they rewrite based on local optima? If the former, what sort of algorithms are used (and which are best for it)?

  • by broken_chaos ( 1188549 ) on Saturday August 08, 2009 @12:27AM (#28993769)

    I *think* you're misunderstanding how this works, actually.

    When a block is written to, the entire block (512KiB) has to be wiped and rewritten from a blank state. When a block is emptied entirely, it does not get touched - just marked as empty. When new data is written to it, the 'empty' block has to actually be wiped, and then the new data written on the just-blanked block.

    What this seems to be proposing is to, periodically, actually wipe the blocks marked as empty, when the SSD is otherwise idle - meaning deletes are still fast, and new writes would speed up. I imagine rewrites would stay comparatively slow, though.

    (I might be way off on this - someone correct me if I am.)
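    If that reading is right, the change amounts to moving the erase step off the write path and into idle time. A small sketch of the difference (invented structure, not OCZ's actual firmware):

        # Lazy vs. idle-time erasing. Structures are invented for illustration.
        import collections

        class Drive:
            def __init__(self, n_blocks):
                self.erased = collections.deque(range(n_blocks))  # ready to program
                self.dirty = collections.deque()                  # emptied, not yet erased

            def write(self, data):
                """Writes stall on an erase only when no pre-erased block exists.
                (The data itself is not modeled here.)"""
                if not self.erased:
                    if not self.dirty:
                        raise RuntimeError("drive full")
                    self.erase_one()              # slow path: erase inline, write waits
                return self.erased.popleft()      # pretend `data` is programmed here

            def mark_empty(self, block):
                self.dirty.append(block)          # fast: just bookkeeping, no erase yet

            def erase_one(self):
                self.erased.append(self.dirty.popleft())

            def idle_gc(self):
                """What the new firmware reportedly does: erase dirty blocks while
                idle, so later writes always find a pre-erased block."""
                while self.dirty:
                    self.erase_one()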

  • No, OCZ released wiper, which is a trim tool. Trim and GC are different; in particular, GC requires no tools or OS support.

  • by blaster ( 24183 ) on Saturday August 08, 2009 @12:30AM (#28993779)

    No, the actual situation is that a block consists of some number of pages (on the flash currently used in SSDs it tends to be 128). The pages can be written individually, but only sequentially (so, write page 1, then page 2, then page 3), and the pages cannot be erased individually; you need to erase the whole block.

    The consequence of this is that when the FS says "Write this data to LBA 1000," the SSD cannot overwrite the existing page where it is stored without erasing its block, so instead it finds somewhere else to store it, and in its internal tables it marks the old page as invalid. Later, when the GC is sweeping blocks for consolidation, the number of valid pages is one of the criteria it uses to figure out what to do. If a block has very few valid pages and has been completely filled, then those pages will probably be copied to another block that is mostly valid, and the block the data was originally in will be erased.
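    The "very few valid pages" criterion mentioned above is essentially a greedy victim-selection policy. A hedged sketch of one way it could look (a simplification of my own, not Indilinx's actual algorithm):

        # Greedy GC victim selection by valid-page count.
        def pick_victims(blocks, want_free):
            """blocks: dict block_id -> number of still-valid pages (out of,
            say, 128 per block). Prefer blocks with the fewest valid pages:
            erasing them reclaims the most space for the least copying."""
            ranked = sorted(blocks.items(), key=lambda kv: kv[1])
            victims, freed = [], 0
            for block_id, valid_pages in ranked:
                if freed >= want_free:
                    break
                victims.append(block_id)
                freed += 128 - valid_pages   # pages reclaimed once block is erased
            return victims

        print(pick_victims({"A": 120, "B": 10, "C": 60}, want_free=150))  # ['B', 'C']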

  • by zach297 ( 1426339 ) on Saturday August 08, 2009 @12:41AM (#28993825)
    Read http://en.wikipedia.org/wiki/Flash_memory#Programming [wikipedia.org]. Clearing out an entire block is different from a write. Writing to flash can only set bits to 0, so when I save something to the SSD it is really only writing down the 0s of my file and leaving the 1s alone. This is not the destructive part of using flash. The part that uses up actual write cycles is clearing a block back to 1s. This is explained in http://en.wikipedia.org/wiki/Flash_memory#Erasing [wikipedia.org].

    Taking from your list of actions, pick a random block:
    1. GC comes along, swoops up the block, and eliminates junk by flashing the entire block back to 1s (a while later)
    2. OS requires a write, swoops up the block, and writes only the 0s from the file, leaving everything else untouched.

    In this manner, each step does half of the writing, amounting to one write when combined. This is exactly how all SSDs work. The major difference announced in the article is that they are separating the two steps.

    Normally this is impossible because the SSD doesn't know if something can be cleared until the OS is trying to overwrite it. This makes writes take longer. The new firmware hopes to make writes faster by moving the first step into the idle time of the drive (by figuring out when an overwritten block is unused), sort of like how you can set up a download to run only when you're not using the internet connection. It allows for more efficient use of time that the drive would otherwise spend doing nothing.
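    The "writes only clear bits, erases set them back to 1" behaviour described above can be modelled with plain bitwise operations. This is a generic NAND-style sketch following the cited Wikipedia sections, not any particular chip:

        # Bit-level model of flash program vs. erase: programming can only flip
        # bits 1 -> 0, and only a block erase brings them back to 1.
        ERASED_BYTE = 0xFF   # fresh/erased flash reads as all 1s

        def program(cell, data):
            """Programming ANDs new data into the cell: 1s in `data` leave bits
            alone, 0s clear them. A 0 cannot be turned back into a 1 this way."""
            return cell & data

        def erase(_cell):
            """Only erasing (the slow, wear-inducing step) restores the 1s."""
            return ERASED_BYTE

        cell = ERASED_BYTE
        cell = program(cell, 0b10110101)   # works: only clears bits
        print(bin(cell))                   # 0b10110101
        cell = program(cell, 0b11111111)   # "writing 1s" changes nothing
        print(bin(cell))                   # still 0b10110101
        cell = erase(cell)                 # back to all 1s, at the cost of wear
        print(bin(cell))                   # 0b11111111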
  • by 644bd346996 ( 1012333 ) on Saturday August 08, 2009 @01:47AM (#28994065)
    The drives don't have to be idle, just the portion being garbage collected. Flash drives typically consist of many independent memory chips united by a single controller. If the block being erased by the GC is on a chip that isn't being read from at that time, then the controller can issue the erase command without affecting the latency of any request from outside the drive. It would take a very full, random workload (and a very fast disk interface) to be able to detect the garbage collection, and even then, it couldn't be worse than the current method of erasing on an as-needed basis.
  • Re:Oh Wow (Score:5, Informative)

    by Bryan Ischo ( 893 ) on Saturday August 08, 2009 @02:37AM (#28994241) Homepage

    You need to read up much, much more on the state of SSDs before making such sweeping, and incorrect, generalizations.

    There are algorithms in existence - clever "garbage collection" (a bad name for this process when applied to SSDs; it's only a bit like garbage collection as traditionally known as a memory management technique in languages like Java) combined with wear-levelling algorithms, plus keeping extra capacity not reported to the OS as a cache of "always ready to write to" blocks - that can keep SSD performance excellent in 90% of use cases, and very good in most of the remaining 10%. Point being that for the majority of use cases, SSD performance is excellent almost all of the time.

    Intel seems to have done the best job of implementing these smart algorithms in its drive controller, and its SSDs perform at or near the top of benchmarks when compared against all other SSDs. They have been shown to retain extremely good performance as the drive is used (although not "fresh from the factory" performance - there is some noticeable slowdown as the drive is used - it's like going from 100% of incredibly awesome performance to 85% of incredibly awesome performance: still awesome, just not quite as awesome as brand new). Except for some initial teething pains caused by flaws in their algorithms that were corrected by a firmware update, everything I have read about them - and I have done *a lot* of research on SSDs - indicates that they will always be faster than any hard drive in almost every benchmark, regardless of how much the drive is used. And they have good wear levelling, so they should last longer than the typical hard drive as well (not forever, of course - but hard drives don't last forever either).

    Indilinx controllers (which are used in newer drives from OCZ, Patriot, etc) seem to be second best, about 75% as good as the Intel controllers.

    Samsung controllers are in third place, either ahead, behind, or equal to Indilinx depending on the benchmark and usage pattern, but overall, and especially in the places where it counts the most (random write performance), a bit behind Indilinx.

    There are other controllers that aren't benchmarked as often and so it's not clear to me where they sit (Mtron, Silicon Motion, etc) in the standings.

    Finally, there's JMicron in a very, very distant last place. JMicron's controllers were so bad that they singlehandedly gave the entire early-generation SSD market a collective black eye. The one piece of advice that can be unequivocally stated for SSDs is: don't buy a drive based on a JMicron controller unless you have specific usage patterns (like rarely doing writes, or only doing sequential writes) that you can guarantee for the lifetime of the drive.

    I've read many, many articles about SSDs in the past few months because I am really interested in them. Early on in the process I bought a Mtron MOBI 32 GB SLC drive (I went with SLC because although it's more than 2x as expensive as MLC, I was concerned about performance and reliability of MLC). In the intervening time, many new controllers, and drives based on them, have come out that have proven that very high performance drives can be made using cheaper MLC flash as long as the algorithms used by the drive controller are sophisticated enough.

    Bottom line: I would not hesitate for one second to buy an Intel SSD drive. The performance is phenomenal, and there is nothing to suggest that the estimated drive lifetime that Intel has specified is inaccurate. I would also happily buy Indilinx-based drives (OCZ Vertex or Patriot Torx), although I don't feel quite as confident in those products as I do in the Intel ones; in any case they all meet or exceed my expectations for hard drives. I've already decided that I'm never buying a spinning platter hard drive again. Ever. I have the good fortune of not being a movie/music/software pirate so I rarely use more than a couple dozen gigs on any of my systems anyway, so the smal

  • by AllynM ( 600515 ) * on Saturday August 08, 2009 @08:15AM (#28995237) Journal

    I have been working closely with OCZ on this new firmware and wanted to clear things up a bit. This new firmware *does not*, *in any way at all*, remove or eliminate orphaned data, deleted files, or anything of the like. It does not reach into the partition $bitmap and figure out what clusters are unused (like newer Samsung firmwares). It does not even use Windows 7 TRIM to purge unused LBA remap table entries upon file deletions.

    What it *does* do is re-arrange in-place data that was previously write-combined (i.e., combined into shared blocks by earlier small random writes). If data was written to every LBA of the drive and then all files were deleted, all data would remain associated with those LBAs. This actually puts OCZ ahead of most of the pack, because their algorithm restores performance without needing to reclaim unused flash blocks, and does so completely independently of the data / partition type used. This is particularly useful for those concerned with data recovery of deleted files, since the data is never purged or TRIMmed.

    Slashdot-specific Translation: This firmware will enable an OCZ Vertex to maintain full speed (~160 MB/sec) sequential writes and good IOPS performance when used under Mac and Linux.

    Hardware-nut Translation: This firmware will enable OCZ Vertex to maintain full performance when used in RAID configurations.

    I'll have my full evaluation of this firmware up at PC Perspective later today. Once available, it will appear at this link:
    http://www.pcper.com/article.php?aid=760 [pcper.com]

    Regards,
    Allyn Malventano
    Storage Editor, PC Perspective

  • by Lothsahn ( 221388 ) <Lothsahn@@@SPAM_ ... tardsgooglmailcm> on Saturday August 08, 2009 @10:22AM (#28995767)

    I think you're somewhat close, but there are some inaccuracies...

    Block devices (typically HDs) have two operations (read and write). These operations are what most modern operating systems use. Flash SSDs emulate a block device, but the underlying flash memory uses three operations (read, write, and erase). The main difference, therefore, between the block device (what the OS references) and the underlying flash itself is the extra erase operation.

    To write to a flash drive, assuming a cell has already been erased, all a user must do is a write operation. This operation is typically fast and does not affect the lifespan of the flash. A write can change any or all of the bits in a block from 1 to 0. Once this is complete, the requested data is written. However, if a user wants to overwrite or change existing data, they must first perform a block erase. This sets every bit in the block back to 1, and is typically very slow (compared to a write). In addition, this is what wears out the flash block, so we really want to avoid these operations.

    Since flash blocks each have their own lifespan, we want to spread the erase operations around the disk. This is called wear leveling. To do this, the flash device appears like a block device to the operating system, but it remaps where the data is actually located at the physical flash layer with a remap table. For instance, let's say you overwrite a block in Linux. If there is an available free flash block, it may not even overwrite that block--it may allocate a new block for the file and write it there (updating the remap table). This avoids an erase command. Furthermore, there are a few files on a filesystem which change frequently, and if we did not move their location around the physical flash, we would wear out one cell in flash extremely quickly, even though the remainder of the cells had plenty of life left.

    The garbage collection comes in due to this remapping. Typical block sizes for most OS filesystems are around 4k, but flash blocks are typically 512KB. This means that data you send to an SSD may or may not take up an entire flash block, as you may only be writing 4k of actual data. Eventually, as writes are leveled around the drive and often fragmented (since we may not be occupying the entire 512KB block), future writes begin taking one (or more) erase cycles. For instance, if you request that 512KB of data be written to the drive, but all the cells in the flash are physically occupied by a small amount of data, then data from multiple cells must be combined into one cell (multiple reads + erase + write), and then the destination cell that you are writing to must also be erased and written. This is what causes flash SSD performance to significantly degrade over time.

    By performing this recombining in the background (as a garbage collection), SSDs should be able to maintain like-new performance even when containing a lot of data. In essence, they are performing background defragmentation on the SSD. As a side note, NEVER defragment an SSD from the operating system, as this defragments the filesystem but performs a ton of erase+write operations on the flash. At best, on new SSDs (Intel, Indilinx), this will wear out the drive sooner. On old SSDs, it will also increase fragmentation at the flash remap layer, causing further performance loss.

    So to address your initial comment, rewrites would also see a performance increase from this garbage collection, as "rewriting" data in flash is virtually equivalent to a new write, since the remap table essentially moves the data anyway.

    Source:
    http://en.wikipedia.org/wiki/Flash_memory#Block_erasure [wikipedia.org]
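    Tying the remap-table and wear-leveling description above together, here is a hedged Python sketch in which every logical overwrite is redirected to the least-worn free block and the superseded block is merely queued for a later erase; the structure and names are invented for illustration, not TrueFFS or any real FTL.

        # Sketch of a remap table with wear leveling: logical writes go to the
        # least-erased free block, and erases happen later, off the write path.
        import heapq

        class WearLevelingFTL:
            def __init__(self, n_blocks):
                self.remap = {}                              # logical -> physical
                self.erase_counts = [0] * n_blocks
                # free list ordered by wear, so cold blocks are reused first
                self.free = [(0, b) for b in range(n_blocks)]
                heapq.heapify(self.free)
                self.pending_erase = []                      # garbage, erased later

            def write(self, logical, data):
                old = self.remap.get(logical)
                if old is not None:
                    self.pending_erase.append(old)           # no erase on the write path
                _, block = heapq.heappop(self.free)          # least-worn free block
                self.remap[logical] = block                  # pretend `data` lands here

            def background_erase(self):
                for block in self.pending_erase:
                    self.erase_counts[block] += 1            # the wear-inducing step
                    heapq.heappush(self.free, (self.erase_counts[block], block))
                self.pending_erase = []

        ftl = WearLevelingFTL(4)
        for i in range(3):
            ftl.write(7, f"version {i}")    # same logical block, three physical homes
        ftl.background_erase()
        print(ftl.erase_counts)             # wear is spread: [1, 1, 0, 0]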
