Data Storage Hardware

IBM Speeds Storage With Flash: 10B Files In 43 Min

Posted by timothy
from the you-sure-have-a-lot-of-mp3s dept.
CWmike writes "With an eye toward helping tomorrow's data-deluged organizations, IBM researchers have created a super-fast storage system capable of scanning in 10 billion files in 43 minutes. This system handily bested their previous system, demonstrated at Supercomputing 2007, which scanned 1 billion files in three hours. Key to the increased performance was the use of speedy flash memory to store the metadata that the storage system uses to locate requested information. Traditionally, metadata repositories reside on disk, access to which slows operations. (See IBM's whitepaper.)"
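For scale, the two demos in the summary work out to these per-file scan rates (a rough back-of-the-envelope, assuming the quoted times are wall-clock):

```shell
# Scan rates implied by the numbers in the summary
awk 'BEGIN {
  old = 1e9  / (3 * 3600)   # SC07 system: 1 billion files in 3 hours
  new = 10e9 / (43 * 60)    # new system: 10 billion files in 43 minutes
  printf "old: %.0f files/s, new: %.0f files/s, speedup: %.1fx\n", old, new, new / old
}'
# → old: 92593 files/s, new: 3875969 files/s, speedup: 41.9x
```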
This discussion has been archived. No new comments can be posted.


  • by Tei (520358) on Saturday July 23, 2011 @04:05AM (#36855256) Journal

    That's very slow.

    Also, please: get people with better technical expertise writing the articles.

  • Traditional filesystems hold their metadata on disk? Ermmm... exactly what do you think the 'sync' command does? Traditionally metadata is held in memory and periodically written to disk for storage.
    • Not all of it. Just that which has been recently accessed. Enough for most purposes, as usually only a tiny bit of the stored data is ever needed at once. Doesn't hold up well in some scientific and engineering uses though, and if you need fast response times even on files that haven't been accessed in weeks then it becomes a potential problem.
      • by Demena (966987)
        There is a difference between filesystem metadata and file metadata. You mention scientific and engineering uses as being particularly bad, when it is my belief that the system architects are the cause of this. It is common to find bad architects in those fields. Directory structure is important. If you do not understand the particular filesystem architecture you cannot design for good and fast access. If you want a good, fast-access system it is absolutely necessary to understand things at that level.
    • Are you confusing a system that stores something in memory, and a system that caches a copy of a small part in memory for fast access?

      • by Demena (966987)
        I'm not confusing anything. I know exactly how it works.
        • It doesn't sound like you do. Sync is used to flush the cache of metadata back out to the disk. The metadata is actually stored on disk.

          • by Demena (966987)
            Which is precisely what I said. The filesystem metadata that is _used_ is in memory. It is periodically _saved_ to disk iff there have been changes (i-node 0 for standard unix filesystems).
            • So now you are shifting your claims. Yes, when metadata is used it is in memory - the same is true of any data. But it is held (to use your term) on disk, where it is loaded into memory on use, changed and saved back to disk. The primary store of metadata, the one that persists between boots, is held on the disk. A small local cache is changed, as with any data. So going back to your original (erroneous) claim: traditional file-systems *do* hold their metadata on disk, even if they cache a portion of it.
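The point can be demonstrated directly (a rough sketch; `/tmp/meta_demo` is a made-up path, and the `drop_caches` step needs root): the authoritative copy of a file's metadata survives eviction of every in-memory cache, because it lives on disk.

```shell
touch /tmp/meta_demo                 # create some fresh metadata
sync                                 # flush dirty pages, metadata included, to disk
echo 3 > /proc/sys/vm/drop_caches    # evict cached dentries and inodes (root only)
stat /tmp/meta_demo                  # kernel must re-read the inode from disk
```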

  • ...Is this kind of scanning performance in high demand?
  • They noted that while solid-state storage can cost 10 times as much as traditional disks, it can offer a 100 percent performance boost.

    So you get 2 times the performance for 10 times the price? I'd say that's still 5 times as expensive. What would be the performance boost with a RAID of 5 disks?

    • I think you misunderstood the point of the statement in that article.

      It's referring to using solid state as a cache: even though solid-state memory costs 10x as much, when used for caching duty it can increase the performance of the disk array by 100%. This would be in line with the numbers a lot of sites are getting from Intel's new SSD disk-caching tech.
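A sketch of why the arithmetic changes when the SSD is only a cache rather than the whole store (the 5% cache fraction below is an assumption for illustration, not a number from the article):

```shell
# Hypothetical: SSD costs 10x per GB, but the cache is only a small
# fraction of total capacity, so the added cost is modest for 2x speed.
awk 'BEGIN {
  ssd_premium = 10     # SSD price per GB relative to disk (from the thread)
  cache_frac  = 0.05   # assumed cache size as a fraction of the array
  printf "added cost: %.2fx the array price for ~2x throughput\n", ssd_premium * cache_frac
}'
# → added cost: 0.50x the array price for ~2x throughput
```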

  • Some filesystems allow you to store the journal on a different disk, such as an SSD
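For example, ext4 supports an external journal device, and XFS an external log device. A sketch (the device names are hypothetical, and these commands are destructive - do not run them against real data):

```shell
# /dev/sdb1 = small SSD partition, /dev/sda1 = bulk HDD partition
mke2fs -O journal_dev /dev/sdb1           # format the SSD as an ext4 journal device
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1   # data filesystem journals to the SSD
# XFS equivalent, using an external log device:
#   mkfs.xfs -l logdev=/dev/sdb1 /dev/sda1
```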
  • Now, some of my maths might be (a little) off, but ...

    I've just spent half the day processing financial files ... 133KB average file size, processed (by 'process', I mean every byte is looked at in C++ code) at 4000 per second. I did this on a single file (compressed tar.gz) that when expanded is 7857 files and just over 1GB in size. The compressed file is temporarily stored in /dev/shm. The parallelisation is around one thread processing the RAM-drive file while the other thread copies the next file (1GB

    • Your lack of understanding is quite simply astounding. You have completely missed the point of their research, which is to reduce the latency of randomly accessing information in a large dataset. They are not measuring throughput (or bandwidth), although the article does state that they hit 4.9GB/s. If you made your files much, much smaller and then repeated your test you would find that your performance drops drastically as your program becomes limited by a different IO bound. Instead of being bounded by the b

    • by Salamander (33735)

      Doing something for 7857 files and doing it for 10 billion are very different situations. 7857 files, including metadata, can easily be sucked into memory in one big chunk and unpacked/examined from there. That simply doesn't work for datasets larger than memory. At the higher scale, modern filesystems do tend to fall apart, badly, so different approaches are needed. Comparing your paper airplane to an F-22 doesn't make it look like you know anything about writing software properly. Quite the opposite.
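Putting the two workloads side by side as raw file rates, using only the numbers quoted in the thread, shows the scale gap:

```shell
awk 'BEGIN {
  gp  = 4000              # grandparent post: 4000 files/s (133KB average size)
  ibm = 10e9 / (43 * 60)  # IBM: 10 billion files scanned in 43 minutes
  printf "GP: %d files/s, IBM: %.0f files/s (~%.0fx)\n", gp, ibm, ibm / gp
}'
# → GP: 4000 files/s, IBM: 3875969 files/s (~969x)
```

Not an apples-to-apples comparison (the GP reads every byte; IBM scans metadata), which is rather the point.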

  • time sudo ls -lAR / | grep -E '^[ld\-]+' | wc -l

    It should give you the number of files on your filesystem and the time it took to "scan" them all.

    • by pakar (813627)

      Well, you probably need to make sure you don't have any of the files or metadata in the buffer cache before starting. Also limit the search to the actual filesystem you want to test.

      # echo 3 >/proc/sys/vm/drop_caches
      # time find / -xdev -printf "%p %y %s %n %i %m %G %U %c %b %a\\n" |wc -l
      621847

      real 0m36.738s
      user 0m6.031s
      sys 0m12.737s

      This is on a simple 40GB Intel SSD with an ext4 fs

      • FYI: "drop_caches" only drops clean pages, so you need to run "sync" first if you want to properly flush your cache.
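So the full flush sequence would be (the second command needs root):

```shell
sync                                # write dirty pages, metadata included, back to disk
echo 3 > /proc/sys/vm/drop_caches   # 3 = drop pagecache plus dentries and inodes
```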
  • by tulcod (1056476) on Saturday July 23, 2011 @07:34AM (#36855762)
    IBM throws a lot of hardware at a problem; problem gets solved.
  • I have a vague memory of Sun producing an NFS accelerator about 20 years ago. This worked by caching remote file data in non-volatile memory.
  • by Anonymous Coward

    I was wondering what "10B files" means... OK, the article talks of 10 billion files... but is a billion 10^9 or 10^12? If you have to use a symbol, use a sensible one... What about 10G files? :D


