Data Storage Software Hardware Linux Technology

PetaBox: Big Storage in Small Boxes 295

Posted by timothy
from the always-impressive dept.
An anonymous reader writes "LinuxDevices.com is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Wednesday June 22, 2005 @01:58AM (#12879004)
    For all the jokes out there about people 'downloading the internet' it's good to know someone is actually doing it.
    • by FireballX301 (766274) on Wednesday June 22, 2005 @02:06AM (#12879029) Journal
      Who the heck cares about the rest of the internet, can this thing hold all the pr0n?
      • Re:Good to see. (Score:3, Insightful)

        by bigberk (547360)
        people from my univ might recognize this... there was a famous guy in our engineering faculty who, back in the 90s, had written some kind of an automated porn downloading app. It was running on their UNIX servers but he left it running unattended. apparently he had no quota because within a few days he had filled up the entire system storage with porn, several hundreds of megabytes worth which was very substantial back then.

        I had a similar experience, I was playing around on irc back when we were swapping
      • Re:Good to see. (Score:5, Interesting)

        by Council (514577) <rmunroe.gmail@com> on Wednesday June 22, 2005 @05:39AM (#12879511) Homepage
        In one of the weirder perspective exercises I've ever conceived:

        5 petabytes of storage is enough for a brief five-minute DVD-quality sex scene for each person of legal age in the US (two to a scene). 100 petabytes would be five minutes of porn of every pair of people in the world.

        I actually wonder about this a little; how many women have posed nude on the internet? There seem to be an awful lot; I haven't been able to see them all (though I will continue to try). Where do they mostly come from, I wonder.
        • *Writes above factoid down* I love it!

          I've always wanted to know the answer to your second question as well, actually, hopefully someone else will be able to answer or at least give some interesting insights. Another question is "why do they do it?", and it's not something I've easily been able to work out. A friend of mine (one of those born-again Christian types) admitted in one of those email forward-things to posing nude, which I can't quite believe. Another friend's ex-girlfriend apparently has a sui
        • Most of them probably have an exhibitionist streak in them and tend to need their self-esteem externally reinforced. A good photographer/director can make a session almost seductive for the model and get many to go a lot farther than the model/actress intended. It's interesting how a house-mouse can turn into a wild-cat with the right push. Photographers almost always retain full rights to their photos, which can be interesting because a set of nudes can be taken of a young starving wannabe actress, forgotten fo
        • They're all the same woman. It's amazing what you can do with a false nose and glasses....
        • by Mark Hood (1630) on Wednesday June 22, 2005 @08:27AM (#12880199) Homepage
          There seem to be an awful lot; I haven't been able to see them all (though I will continue to try). Where do they mostly come from, I wonder.

          Let me get this straight, you're trying to see all the porn in the world, and you still don't know where babies come from? :)
          • Funny thing about your sig---I just noticed that, as your wishlist is on Amazon.co.uk, the items say things like "Usually dispatched within 24 hours". In US English, we say 'shipped' instead of 'dispatched'. I never knew that was a UK-ism.

            Learn something new every day, I suppose.

            --grendel drago
        • Yeah, but how many of them would you want to see naked [mercola.com]? Unless you have a chub fetish, you're unlikely to find the US demographic pool particularly attractive.

          On the other hand, you could just go grab a Livejournal account, join the communities "kaizersoze125" and "show_your_boobs", and marvel at the quantity of amateur porn folks throw out there for free.

          Seriously. There's some high quality out there. Some of it's not even members-locked (earningtails [livejournal.com], for instance).

          --grendel drago
      • That would make you a peta-file.
    • by Anonymous Coward on Wednesday June 22, 2005 @02:25AM (#12879080)
      But does it run Lin... um.

      How about a Beo.. oh damn

  • by Bananatree3 (872975) on Wednesday June 22, 2005 @01:58AM (#12879006)
    If, If only I could get a hold of one of those, I could Rival GOOGLE! Yes! I can become the next internet craze with my super, duper search engine crawling the web! I have the space, now I just need a connection in the middle of Alaska fast enough to rival google...
  • by Dancin_Santa (265275) <DancinSanta@gmail.com> on Wednesday June 22, 2005 @01:59AM (#12879009) Journal
    Michael Jackson was heard breathing a sigh of relief. He thought it was where they sent Petafiles.

    R. Kelly was scrambling to find the company's phone number.
  • archive.org (Score:5, Interesting)

    by Nasarius (593729) on Wednesday June 22, 2005 @02:01AM (#12879020)
    Internet Archive, the non-profit organization that creates periodic snapshots of the Internet.

    They do a lot more than that! I've just been downloading some Warren Zevon [archive.org] shows from their Live Music Archive.

  • copyright (Score:5, Interesting)

    by DualG5GUNZ (762655) * on Wednesday June 22, 2005 @02:02AM (#12879023)
    Not to sound like an advocate or anything... But how is it that the Internet Archives project resists claims of copyright infringement and the likes when they have copies of entire websites in their records?
    • Re:copyright (Score:4, Informative)

      by seifried (12921) on Wednesday June 22, 2005 @02:12AM (#12879050) Homepage

      You can exclude them from your website using the robots.txt:

      User-agent: ia_archiver
      Disallow: /

      For example if you go to archive.org and plug my site into the wayback machine:

      We're sorry, access to http://www.seifried.org/ [seifried.org] has been blocked by the site owner via robots.txt.

      and you can also request them to expunge your site from the archive.

      They go out of their way to make it easy to prevent your site being copied (more so than most search engines).
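The exclusion rule quoted above can be checked locally with Python's standard `urllib.robotparser`. This is only a minimal sketch: it assumes the two-line file from the comment is the entire robots.txt, the URL and the second bot name are just examples, and it says nothing about how the Archive's own crawler is implemented. Nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the comment above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.seifried.org/index.html"  # example URL only
    print("ia_archiver allowed:", is_allowed("ia_archiver", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

Because the file names only `ia_archiver`, any other crawler that honors robots.txt is unaffected by this rule.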

      • I feel they make it too easy. IA blocks not only the present version of the site, but also every page of every past version.

        I can't get older pages of a web site I operated several years ago because a robots.txt file was inadvertently added that blocks it. At the time, I didn't know about the Internet Archive, and as a result potentially years of this site's history is gone.
        • by QMO (836285)
          Shouldn't have used those backup tapes for streamers, I guess.
          Or was it backup CDs for coasters/frisbees?

          (CDs don't work well for frisbees. In my experience they break after just a few brick walls, and it costs a stroke, and makes it harder to get par.)
      • Re:copyright (Score:3, Interesting)

        by spacefight (577141)
        The Internet Archive is fucking up big time with their robots.txt stuff. If you exclude a site from being shown, it doesn't show anything, correct. But if this site goes offline, the archived pages of that former site are all available, not blocked at all.
      • You can exclude them from your website using the robots.txt:
        They should ignore robots.txt altogether if they want to be a truly useful resource.

        Particularly for a robots.txt like this [whitehouse.gov].
      • I'm afraid that the burden is on archive.org not to archive copyrighted material, not on the copyright holder to explicitly deny people permission.

        If they really wanted to go out of their way, they would ask permission before illegally copying and distributing copyrighted material for which they do not have permission.
    • "Historical Purposes"
    • I know that the US Copyright Office has granted a DMCA exemption for at least some of the material they archive.

    • Re:copyright (Score:2, Interesting)

      by trifish (826353)
      But how is it that the Internet Archives project resists claims of copyright infringement and the likes when they have copies of entire websites in their records?


      Did you ask this question when Google introduced site cache several years ago?
      • Re:copyright (Score:2, Insightful)

        by generic-man (33649)
        Yes, I did. I got two responses, neither of which answered my question.

        1. FAIR USE!
        2. Google is merely providing a service. If you don't like it you can opt out.

        The Google Cache is not fair use, as it reproduces the entirety of a web page's text for none of the purposes for which Fair Use is defined. (Under Fair Use you are entitled to use a portion of a copyrighted work, not the whole thing.)

        The second one just cracks me up. I thought the Slashdot crowd didn't like being asked to opt out.

        Now, trifi
    • They really saved my ass more than once, I'm sure I'm not special or anything.
  • Petabox? (Score:4, Funny)

    by eclectro (227083) on Wednesday June 22, 2005 @02:02AM (#12879025)

    Isn't that what naked girls climb out of to protest fur coats?

    Thank you, I'll be here all week.
    • Re:Petabox? (Score:2, Funny)

      by Anonymous Coward
      Actually, it's what geeks would like to do, but are seldom given the chance.
  • IPod? (Score:2, Funny)

    Right, sure, like anyone believes that you want that much storage for music. You just want to use it for pr0n.
    • Re:IPod? (Score:2, Funny)

      by BlackMesaLabs (893043)
      Decide to use it for "Pr0n" and you're gonna NEED a beowulf cluster of them...
  • great usage. (Score:5, Informative)

    by Bananatree3 (872975) on Wednesday June 22, 2005 @02:09AM (#12879045)
    Seriously, I think archive.org deserves such a storage system. I have very often wanted to go back to view an archive of a website from a while ago, but the cache on Google was from yesterday. It also gives multiple archives of the website based on day, which can be quite handy, especially for news-related sites. I think they quite well deserve it.
  • 'small box' (Score:5, Funny)

    by MonoSynth (323007) on Wednesday June 22, 2005 @02:24AM (#12879078) Homepage
    So the inventor of the microprocessor dies and suddenly the definition of 'small box' for computer components is again reduced to 'fits in a big room'....
  • Puppies (Score:4, Funny)

    by Sinner (3398) on Wednesday June 22, 2005 @02:27AM (#12879082)
    An anonymous reader writes "LinuxDevices.com is ... according to the article. Now to strap one of those puppies to my iPod!"
    I'm sorry, baby dogs? That's so last week. I've got an arctic seal pup strapped to my iPod. You should see the looks I get on the subway. Bling, baby, Bling.
  • by qda (678333) on Wednesday June 22, 2005 @02:28AM (#12879087) Homepage
    "nobody needs more than a perabyte of storage"
    • by Anonymous Coward
      Well, I'd hope somewhere along the line somebody will fix that typo for you. Otherwise, you'll forever be quoted as "nobody needs more than a perabyte [sic] of storage."
  • I am more than slightly concerned about the lack of RAID in the system. They said that they had some sort of painful experience with RAID 5 not scaling to petabyte-size storage and therefore recommend JBOD. I wouldn't expect RAID 5 to scale to petabyte-size storage because of the parity all being done at once and in the same place but there has to be a way around this that still allows for redundancy. Take a RAID 50, with a lot of RAID 5 arrays in the hundred-terabyte range and a RAID 0 array striping ov
    • Re:No RAID?! (Score:3, Insightful)

      by iamplasma (189832)
      Yeah, but the thing is that the storage is spread out between lots of different 1U units, each with either 1 or 1.6TB. So to make a RAID 5 over 1.6TB in size, you'd have to cross over multiple machines, adding a serious overhead, especially when you have to calculate parity for the parity drive. On the other hand, if you only did RAID 5 in the individual units, it'd be pretty pointless, because with that many units you'd be crazy to rely on no entire machine failures.

      So, while yes, if it really was just o
  • by kasnol (210803) on Wednesday June 22, 2005 @02:29AM (#12879092) Homepage
    Wow - have they calculated what the running cost per day is? I might just stay with my iPod instead for the time being~
    Haha~
    • 50kW at 10 cents per kilowatt hour = $120/day.

      I doubt it draws at a constant 50kW, though. It's probably an average (was given in TFA).

      My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily? :)
      • I doubt it draws at a constant 50kW, though. It's probably an average (was given in TFA).
        I think you meant "peak", because there isn't much difference as far as price goes between a constant 50kW and an average 50kW.

        And yes, to compute energy consumption (in kWh) you merely multiply the power drawn from the grid (in kW) by the consumption timeframe (in hours).

        Therefore if a unit draws 50kW for one hour, it consumes 50kWh worth of energy.
      • My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily? :)

        Close. It is kW * hours_used. The "daily" part is only valid if (as in your case) you are talking about the amount of energy used over the course of a day.

        Electricity here is $.15/kWh, which would put this box's operation at $180/day. In some places, electricity is as low as $.04/kWh, which would put the energy cost of these boxes at only $48/day.
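The arithmetic in this sub-thread (energy = power x time, cost = energy x price) fits in a few lines. A minimal sketch, assuming the 50kW figure is an average draw sustained for a full 24 hours; the function name is made up for illustration, and the three prices are the ones quoted in the comments above.

```python
def daily_energy_cost(power_kw: float, price_per_kwh: float, hours: float = 24.0) -> float:
    """Cost of drawing power_kw continuously for `hours` at price_per_kwh."""
    energy_kwh = power_kw * hours  # energy (kWh) = power (kW) x time (h)
    return energy_kwh * price_per_kwh


if __name__ == "__main__":
    # Prices per kWh taken from the comments above.
    for price in (0.04, 0.10, 0.15):
        cost = daily_energy_cost(50, price)
        print(f"${price:.2f}/kWh -> ${cost:.2f}/day")
```

If the 50kW were a peak rather than an average, the real bill would be lower; the formula stays the same, only the average power changes.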

  • 1.5 Petabytes? (Score:4, Interesting)

    by TheFlyingGoat (161967) on Wednesday June 22, 2005 @02:29AM (#12879093) Homepage Journal
    Where can you purchase 600GB drives these days? (1.5PB / 2500 drives)

    The math doesn't work when you multiply the number of systems out either: 600 systems * 1.6TB/system = 960TB. That's just under a petabyte, or am I missing something?

    Also, if you've got those in a RAID5 setup, you're 'only' talking about approx 800TB of usable space. That's far less than the 1.5 petabytes claimed.

    800TB is a lot of space, but there must be a cheaper/easier way than purchasing 600 systems to do it.
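The back-of-the-envelope numbers in this comment can be reproduced as follows. A rough sketch, assuming decimal units (1 PB = 1000 TB) and that a RAID 5 set gives up one drive's worth of capacity to parity; the drives-per-set values are assumptions, since neither the summary nor the comment states one.

```python
TB_PER_PB = 1000  # decimal units, as drive vendors count them
GB_PER_TB = 1000


def implied_drive_size_gb(total_pb: float, drives: int) -> float:
    """Average drive size needed to reach total_pb with the given drive count."""
    return total_pb * TB_PER_PB * GB_PER_TB / drives


def total_capacity_tb(systems: int, tb_per_system: float) -> float:
    """Raw capacity of `systems` nodes holding tb_per_system each."""
    return systems * tb_per_system


def raid5_usable_tb(raw_tb: float, drives_per_set: int) -> float:
    """Usable capacity when each RAID 5 set loses one drive to parity."""
    return raw_tb * (drives_per_set - 1) / drives_per_set


if __name__ == "__main__":
    print("implied drive size (GB):", implied_drive_size_gb(1.5, 2500))
    raw = total_capacity_tb(600, 1.6)
    print("raw capacity (TB):", round(raw, 1))
    for n in (4, 6):  # assumed set sizes, for comparison only
        print(f"RAID 5 usable with {n}-drive sets (TB):", round(raid5_usable_tb(raw, n), 1))
```

With six-drive sets the formula lands on roughly 800TB, which may be what the comment had in mind; four-drive sets would give less.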
    • They don't like RAID (Score:5, Interesting)

      by billstewart (78916) on Wednesday June 22, 2005 @04:07AM (#12879324) Journal
      I was a bit puzzled by that also - the article said the things come in racks of 40 or 64TB, and 16 racks times 64TB is about 1PB, not 1.5.

      Also, the article says they don't like RAID, due to bad experiences with RAID5, and the system is configured as JBOD (Just a Bunch Of Disks). It doesn't say why, or what users should do to get equivalent protection. My guess is that depending on RAID within a box means you're still vulnerable if the box's CPU or disk controller decides to scribble the disks, or the power supply decides to catch fire or short out and deliver 240VAC on the +5V line or whatever. So if you want a RAID-like set of redundancy, set up your applications or file system mounting or something to calculate the protection disk in software and hand it off to another 1U box for storage.

      The overhead of the motherboards here is not that high - they're about $150-200, and support 4 disks that probably cost $200-300 each, so they're only about 20% of the cost, which is not bad. The article didn't say they're using SATA, and it sounded like it's some IDE variant instead, but if you're only using 100 Mbps Ethernet to connect to the box and not the optional GigE, it's not the bottleneck anyway. If you wanted an alternative design, you could probably do something with a couple of 4-way SATA controllers per CPU, with a lot of disks stacked vertically in a 3-4U box looking like an X-serve or something. But that wouldn't necessarily have much of an advantage.

      • "Although Hitachi does not offer an 'enterprise' or '24x7' SATA drive, our testing found their drives to be as reliable as anything out there, enterprise distinction or not," Saikley said.

        I read that as SATA drives. What I wonder about is
        PetaBoxes are ~$2.00/GB per the article
        while
        Coraid, priced at $1,995.00 + (4*$314.99 hard drives) = 3918.94 + 664.00( 15U tabletop rackmount) or ~$0.41/GB per my calculations;
        looks like a price war is brewing here unless PetaBox has some serious KW in BTU out or p

  • Slashdotted .... (Score:4, Informative)

    by theoddbot (520034) on Wednesday June 22, 2005 @02:32AM (#12879101)
  • No redundancy? WTF? (Score:3, Informative)

    by melted (227442) on Wednesday June 22, 2005 @02:34AM (#12879111) Homepage
    I've actually read TFA. They recommend JBOD configurations to their clients. One drive goes titsup and you've lost 400GB of data. Do they at least offer some kind of mirroring/redundancy solution to back the data up to another array?
    • by Depili (749436) on Wednesday June 22, 2005 @03:29AM (#12879235)
      According to archive.org (http://www.archive.org/web/petabox.php [archive.org]), they do indeed have some redundancy, but not RAID. They are operating each system as a separate node, and mirroring nodes. The above link also sheds light on other questions regarding TFA.
    • by puhuri (701880)

      The archive.org [archive.org] maintains its archives in several geographically different locations, and files are mirrored between those sites. If one disk or node breaks, you still have two or more copies of that material.

      If you archive serious amounts of data, redundancy within a node is not the best solution; distributing information between systems is. For very important data, you can have as many copies as you have nodes; less important data may have just a single copy. If it gets lost, then ok, shit happens but

  • by simrook (548769) on Wednesday June 22, 2005 @02:35AM (#12879113)
    The Internet represents a great historical tool. Case and point is what happened on 9/11. Being able to go back and see the progression, paranoia, patriotism, and early Iraq/Afghanistan/bin Laden/Hussein posts and opinions on various news sites is amazing. CNN, Fox, the NY Times, all are archived several times on 9/11 on archive.org.

    I for one think that archive.org should turn into some UN effort, with a mission to chronicle and store daily/timely snapshots of the internet and the culture at the time, preserving it for future generations. What a tool for future historians!

    The ability to look at a large representation of society at one single critical moment in time, and to have first-hand sources for all that information, is something that can truly change the way history is recorded (and not in the bad newspeak ingsoc way either). In fact, a holistic archive of what happens day-to-day, in an easily accessible format, might well help written history to be more representative of actual history (instead of, say, the history Bush wants us to believe: that the Iraq war was for human rights and not WMDs). I love Foucault.

    The internet archive rocks... really hope this project continues full blast.

    - Peace

    • Yes, otherwise such cultural gems as goatse.cx would be lost into the void forever...

    • by Anonymous Coward
      The 9/11 targets were chosen in a way everyone would notice. Not exactly amazing that it's well reported on; it would have been if it happened 20 years ago. But that was just a single attack. If you look at the much bigger recent events that you mention, like the war on Iraq, you'll see that there really is hardly any detailed reporting. You have a lot of propaganda by the attackers, some propaganda from the Iraqi government, and some reports by angry people getting in the middle. You still have a completel
    • (and not in the bad newspeak ingsoc way either)

      Funny you should mention that, but this whole "Internet as history" thing has me wound up tight.

      Books cannot be changed. They can be destroyed, reprinted and banned but the first edition will always exist in a collection.
      The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.

      The IA is the MiniTruth's dream come true.

      But who cares? History has always

      • The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.


        ...unless you make a digital signature of the timestamp?

        If you want trust, use trust tools. We already knew that digital media does not leave physical traits behind, but that doesn't mean that other checking processes can't be built.
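The idea of binding a snapshot to its timestamp so later edits become detectable can be sketched with the standard library. This is a simplified stand-in, not a real digital signature: it uses an HMAC (a shared-secret tag) instead of a public-key signature or a trusted timestamping service, and the function names, key, and message layout are all made up for illustration.

```python
import hashlib
import hmac


def sign_snapshot(content: bytes, timestamp: str, key: bytes) -> str:
    """Return a tag that binds the content hash to the claimed timestamp."""
    digest = hashlib.sha256(content).hexdigest()
    message = f"{timestamp}|{digest}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_snapshot(content: bytes, timestamp: str, key: bytes, tag: str) -> bool:
    """Check that neither the content nor the timestamp changed since signing."""
    expected = sign_snapshot(content, timestamp, key)
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    key = b"example-secret-key"  # hypothetical key held by the archivist
    page = b"<html>first edition</html>"
    tag = sign_snapshot(page, "2005-06-22T02:35:00Z", key)
    print("original verifies:", verify_snapshot(page, "2005-06-22T02:35:00Z", key, tag))
    print("edited page verifies:", verify_snapshot(b"<html>edited</html>", "2005-06-22T02:35:00Z", key, tag))
```

A scheme like this only helps if the key holder is trusted and the tags are published somewhere the archive itself cannot quietly rewrite.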


        But who cares? History has always been written by the victorious, hasn't it?


        Actually yes. The originals coul
      • The IA is the MiniTruth's dream come true.

        Actually, it's so far been its nightmare come true. Many an effort to redact information or remove something embarrassing from corporate, government, and news websites has been foiled by the IA. For example, a page related to a plagiarism controversy local to me was conveniently pulled from where it was hosted, but remained on the IA--foiling the effort to suppress the ability to compare the infringing text.

    • That's "case in point". Like "under scrutiny" or "off topic". Which is what I should be modded.

      Sorry.
  • The MPAA and RIAA (Score:3, Interesting)

    by PrivateDonut (802017) <[moc.nacliam] [ta] [7735sirhc]> on Wednesday June 22, 2005 @02:35AM (#12879115)
    are going to make a killing off the IA when they have finished. It isn't like they haven't made enough money off others as it is, so they may let this one slide in the name of conserving data. On that note, is the IA downloading EVERYTHING or selectively downloading to prevent such issues as copyright infringement?
  • by mcrbids (148650) on Wednesday June 22, 2005 @02:57AM (#12879163) Journal
    Go ahead. Try Slashdot in the wayback machine.

    Slashdot has looked virtually identical since 1998!
  • A friend of mine used to work for Sony... he swears this is a true story:

    Sony had a petabyte tape backup system they wanted to sell into North America... called the "Peta-file". Thankfully, Sony NA managed to have the name changed prior to its introduction here.

    So, PetaBox is slightly better... slightly. :)

    MadCow.
  • I read the article, and the website of the company, but I couldn't find out how you're supposed to access all this data. It's hardly practical that every node exports its own NFS, is it? Is it supposed to use some kind of cluster file system such as (Open)GFS?

    Or is the user expected to do some kind of in-house thingy, like google or (presumably) the internet archive?
    • Re:NAS or SAN or ??? (Score:3, Informative)

      by TTK Ciar (698795)

      The Petabox is shipped to a customer running Debian Linux by default (though of course you can install whatever you want), so there are a number of solutions to choose from. OpenAFS and (as you pointed out) GFS are made specifically for this kind of setup, providing fairly good abstraction of the underlying cluster and easy access to random data. Within The Archive, we have experimented with different approaches, the one currently in production using an API based on a UDP locator service and rsync.

      Anoth

  • by paulatz (744216)
    It was 3 or 4 years ago when I saw a 600 terabytes (0.6 petabytes) tape-based storage system at CERN [www.cern.ch].
  • looks like generic mini-itx, but who makes the 1u? custom built?
  • 2,500 spinning drives!!! These folks are located in San Francisco... if there's ever an earthquake the gyroscopic effects could flip the building over! Perhaps they should mount every other drive upside down to cancel out the effect to prevent serious injury ;)
  • Two points (Score:5, Interesting)

    by Salamander (33735) <`jeff' `at' `pl.atyp.us'> on Wednesday June 22, 2005 @06:52AM (#12879688) Homepage Journal

    First off, this isn't quite an example of a company suddenly deciding to donate stuff to the Archive. As can be seen on their own website [capricorn-tech.com], Capricorn was spun off from the Archive on July 1, 2004. To a large extent, Capricorn exists for the specific purpose of providing storage to the Archive, and if that same storage can be sold to others so much the better.

    Second, what about interconnects and performance? The product descriptions say nothing about SCSI or FC or other storage-oriented connectivity, so one must assume that the connection to these boxes is through a network. That would mean each node is an NFS server (or similar), serving up 1.6TB using a 1GHz C3 processor, a maximum of 1GB of memory (for caching etc.) and what appears to be a single GigE link. Can you say unbalanced? The Internet Archive might be the only system with an access pattern so sparse that the ratio between capacity and performance wouldn't be crippling. Don't try using one of these with any other kind of application if performance is a concern...and BTW they don't seem to say anything about high availability or other storage functionality (e.g. integrated backup or snapshots) either. Capricorn's big play seems to be power consumption, but there are other players that can beat them on density (e.g. Copan with 224TB per rack [copansys.com]) and multitudes who can offer better performance/functionality. I hate to sound negative, but this is a product so specialized as to be uninteresting.

    Disclaimer: I think I met some of the Copan guys once and they seemed cool enough, but there's no other relationship between me and them. That just happened to be the first name I thought of in this space.
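The capacity-versus-bandwidth imbalance raised in this comment can be put in numbers by asking how long it takes to read one full node over its network link. A rough sketch, assuming the link runs at full line rate with no protocol or disk overhead (so real times would be longer); the function name is invented for illustration, and the 100 Mbps case comes from another comment in this thread.

```python
def full_read_hours(capacity_tb: float, link_mbps: float) -> float:
    """Hours needed to stream capacity_tb over a link of link_mbps at line rate."""
    capacity_bits = capacity_tb * 1e12 * 8   # decimal terabytes to bits
    seconds = capacity_bits / (link_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for label, mbps in (("GigE", 1000), ("100 Mbps", 100)):
        hours = full_read_hours(1.6, mbps)
        print(f"1.6TB node over {label}: about {hours:.1f} hours")
```

For an archive where most data is rarely touched this may not matter much, which is the point the replies below make.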

    • Gigabit ethernet is "good enough" when all you're really doing is serving up web pages. When you have 16 racks of 1U servers, spending the extra cash to get SCSI with raid and more bandwidth for each server has little or no extra benefits when those servers aren't necessarily talking to each other. Extra space at a low price is probably many times more important than speed.

      Capricorn's big play is also probably price. Price is mentioned quite a few times in the article.

      Their product kinda sounds like googl
  • I'd like to understand more about their filesystem. They say RAID doesn't work for them, so they use JBOD.

    What kind of metastructure do they put on the disks to achieve that kind of large filesystem, and improve reliability?

  • The first version will be called Capricorn One.

The first rule of intelligent tinkering is to save all the parts. -- Paul Erlich
