

Data Storage Software Hardware Linux Technology

PetaBox: Big Storage in Small Boxes 295

An anonymous reader writes "LinuxDevices.com is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.
This discussion has been archived. No new comments can be posted.

  • by simrook ( 548769 ) on Wednesday June 22, 2005 @03:35AM (#12879113)
    The Internet represents a great historical tool. Case in point is what happened on 9/11. Being able to go back and see the progression, paranoia, patriotism, and early Iraq/Afghanistan/bin Laden/Hussein posts and opinions on various news sites is amazing. CNN, Fox, the NY Times, all are archived several times on 9/11 on archive.org.

    I for one think that archive.org should turn into some UN effort, with a mission to chronicle and store daily/timely snapshots of the internet and the culture at the time, preserving it for future generations. What a tool for future historians!

    The ability to look at a large representation of society at one single critical moment in time, and to have first-hand sources for all that information, is something that can truly change the way history is recorded (and not in the bad newspeak ingsoc way either). In fact, a holistic archive of what happens day-to-day, in an easily accessible format, might well help written history to be more representative of actual history (instead of, say, the history Bush wants us to believe: that the Iraq war was for human rights and not WMDs). I love Foucault.

    The internet archive rocks... really hope this project continues full blast.

    - Peace
  • Re:Good to see. (Score:3, Insightful)

    by bigberk ( 547360 ) <bigberk@users.pc9.org> on Wednesday June 22, 2005 @03:50AM (#12879145)
    People from my univ might recognize this... there was a famous guy in our engineering faculty who, back in the 90s, had written some kind of automated porn downloading app. It was running on their UNIX servers, but he left it running unattended. Apparently he had no quota, because within a few days he had filled up the entire system storage with porn: several hundred megabytes' worth, which was very substantial back then.

    I had a similar experience. I was playing around on IRC back when we were swapping video files through DCC. Apparently some downloading got out of hand and paged the admin, who contacted me and politely pointed out that I had a process running wild and filling /tmp... oops, must be an experiment gone wrong, I had to say.
  • by Anonymous Coward on Wednesday June 22, 2005 @04:34AM (#12879248)
    The 9/11 targets were chosen in a way everyone would notice. It's not exactly amazing that it's well reported on; it would have been even if it had happened 20 years ago. But that was just a single attack. If you look at the much bigger recent events that you mention, like the war on Iraq, you'll see that there really is hardly any detailed reporting. You have a lot of propaganda by the attackers, some propaganda from the Iraqi government, and some reports by angry people caught in the middle. You still have a completely unclear view of what happened.

    We already had people writing diaries and taking lots of pictures in WWII. The improvement isn't that great.
  • by PReDiToR ( 687141 ) on Wednesday June 22, 2005 @05:38AM (#12879391) Homepage Journal
    (and not in the bad newspeak ingsoc way either)

    Funny you should mention that, but this whole "Internet as history" thing has me wound up tight.

    Books cannot be changed. They can be destroyed, reprinted and banned but the first edition will always exist in a collection.
    The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.

    The IA is the MiniTruth's dream come true.

    But who cares? History has always been written by the victorious, hasn't it?
  • Re:copyright (Score:2, Insightful)

    by generic-man ( 33649 ) on Wednesday June 22, 2005 @06:51AM (#12879534) Homepage Journal
    Yes, I did. I got two responses, neither of which answered my question.

    1. FAIR USE!
    2. Google is merely providing a service. If you don't like it you can opt out.

    The Google Cache is not fair use, as it reproduces the entirety of a web page's text for none of the purposes for which Fair Use is defined. (Under Fair Use you are entitled to use a portion of a copyrighted work, not the whole thing.)

    The second one just cracks me up. I thought the Slashdot crowd didn't like being asked to opt out.

    Now, trifish, how can the Internet Archive evade copyright laws by reproducing the entirety of many copyrighted pages? Don't try and argue that they're a library. Libraries buy books; they don't photocopy them.
  • Re:No RAID?! (Score:3, Insightful)

    by iamplasma ( 189832 ) on Wednesday June 22, 2005 @06:54AM (#12879542) Homepage
    Yeah, but the thing is that the storage is spread out between lots of different 1U units, each with either 1 or 1.6 TB. So to make a RAID 5 array over 1.6 TB in size, you'd have to cross over multiple machines, adding serious overhead, especially when you have to calculate parity for the parity drive. On the other hand, if you only did RAID 5 within the individual units, it'd be pretty pointless, because with that many units you'd be crazy to rely on never having an entire machine fail.

    So yes, if it really was just one giant supercomputer with a bajillion hard drives in it, RAID 50 would be an ideal solution (as long as the stripes were large enough to prevent too many accesses crossing too many drives, the one big advantage of JBOD here). But that's not what's really in use here.
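
For readers unfamiliar with the parity calculation mentioned above, here is a minimal sketch of the XOR parity idea behind RAID 5, assuming equal-sized blocks and one parity block per stripe. The function names `xor_parity` and `rebuild_missing` are hypothetical, and real RAID 5 also rotates the parity block across drives, which this sketch does not model.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical stripe: one block per machine, all the same size.
stripe = [b"unit-one", b"unit-two", b"unit-3!!"]
parity = xor_parity(stripe)

# Pretend the second machine failed and rebuild its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
```

Every write has to update the parity block, and every rebuild has to read all surviving blocks in the stripe. When those blocks live on different machines, that traffic goes over the network, which is the overhead the comment is pointing at.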
  • by Anonymous Coward on Wednesday June 22, 2005 @07:18AM (#12879602)

    Depends heavily on your purpose of the system, of course.

    If you need something that is highly available and has good performance, then RAID is wonderful. But archives don't need to be highly available; they just need to be highly redundant and backed up to several places.

    For instance, if you have a RAID 5 array, then a single hard drive failing couldn't take it out. But a single controller failing could. If one drive starts spewing out nonsense, then that corruption could be replicated automatically between hard drives in an array before anybody notices or hardware monitors shut everything down.

    So in this sense, simply having multiple copies on different computers on different disks is actually preferable to a RAID setup. It is simpler, and as long as you have a high-quality distributed file system, it's easier to restore material. It'll be easier to access down the line.

    It just won't have the higher performance or high availability that RAID will provide... but then again, it doesn't really need it.

    And remember:
    RAID != backups.
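
The "multiple copies on different computers" approach described above could be checked along these lines. This is only a sketch under stated assumptions: each replica is small enough to hash in memory, a strict majority of copies is intact, and the host names and the `find_corrupt_replicas` helper are hypothetical. It is not a description of how the Internet Archive actually verifies its data.

```python
import hashlib
from collections import Counter


def digest(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def find_corrupt_replicas(replicas):
    """Return the hosts whose copy disagrees with the majority digest.

    `replicas` maps a host name to that host's copy of the file.
    """
    digests = {host: digest(data) for host, data in replicas.items()}
    majority_digest, count = Counter(digests.values()).most_common(1)[0]
    if count <= len(replicas) / 2:
        raise ValueError("no strict majority; cannot tell which copy is good")
    return sorted(host for host, d in digests.items() if d != majority_digest)


# Hypothetical example: three independent copies, one silently corrupted.
copies = {
    "node-a": b"snapshot of a page",
    "node-b": b"snapshot of a page",
    "node-c": b"snapsh0t of a page",
}
bad_hosts = find_corrupt_replicas(copies)
```

Because each copy sits on its own machine and is compared by content rather than rebuilt from parity, one drive writing garbage shows up as a mismatch instead of being propagated to its peers.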
  • Re:copyright (Score:3, Insightful)

    by generic-man ( 33649 ) on Wednesday June 22, 2005 @07:20AM (#12879607) Homepage Journal
    If you consider your HTML content to be unreproducible copyrighted material, might I ask why the hell is it publicly accessible on the Web in the first place?

    If you consider your music to be copyrighted material, might I ask why the hell it's being played on the radio in the first place?

    If you consider your book to be copyrighted material, might I ask why the hell it's being lent out in the library in the first place?

    If you consider your movie to be copyrighted material, might I ask why the hell it's being broadcast on HBO in the first place?

    Just because something is available for free doesn't mean that the producer has granted you a permanent license to distribute it for commercial gain, as Google does with its cache.
  • Re:No RAID?! (Score:1, Insightful)

    by Anonymous Coward on Wednesday June 22, 2005 @08:27AM (#12879828)
    Bingo. Every distributed storage scheme, whether RAID 5, GFS, or the other fascinating software variants, pays significant overhead for all that striping. And managing a RAID set that big, striped across all those machines, is kind of tough. Then when several machines fail at once, as is inevitable across arrays that large, or when a controller or two fail, you have to rebuild these wildly scattered RAID arrays.

    It basically triples the price without getting you much for such a large setup, where point replacement of lost systems without imperiling your other systems is much, much easier.
