Data Storage Software Hardware Linux Technology

PetaBox: Big Storage in Small Boxes

An anonymous reader writes " is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on VIA mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.

  • great usage. (Score:5, Informative)

    by Bananatree3 ( 872975 ) on Wednesday June 22, 2005 @03:09AM (#12879045)
    Seriously, I think deserves such a storage system. I have very often wanted to go back to view an archive of a website from a while ago, but the cache on Google was from yesterday. It also gives multiple archives of the website based on day, which can be quite handy, especially for news-related sites. I think they quite well deserve it.
  • Re:copyright (Score:4, Informative)

    by seifried ( 12921 ) on Wednesday June 22, 2005 @03:12AM (#12879050) Homepage

    You can exclude them from your website using the robots.txt:

    User-agent: ia_archiver
    Disallow: /

    For example if you go to and plug my site into the wayback machine:

    We're sorry, access to [] has been blocked by the site owner via robots.txt.

    and you can also request them to expunge your site from the archive.

    They go out of their way to make it easy to prevent your site being copied (more so than most search engines).
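
A minimal sketch of how a well-behaved crawler would read the rule above, using Python's standard `urllib.robotparser`. The helper name `is_allowed` and the `example.com` URL are placeholders for illustration; this says nothing about how the Archive's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# The exclusion rule quoted in the comment above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder site, not taken from the thread.
print(is_allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html"))
```

The first call reports the archiver as blocked; the second shows that a crawler with no matching entry is still allowed, so the rule only affects `ia_archiver`.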

  • Re:Downloading Kazaa (Score:3, Informative)

    by HyperChicken ( 794660 ) * on Wednesday June 22, 2005 @03:17AM (#12879062)
    Not "periodic", continuous. Own a website? Check your logs for the user-agent "ia_archiver".
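
One rough way to do that check, assuming the server writes the common Apache/Nginx "combined" log format (an assumption; the thread does not say which format is in use). The function name `archive_hits` is made up for this sketch, and the default search string is the `ia_archiver` token from the robots.txt example earlier in the thread.

```python
import re

# In the "combined" format the last two quoted fields are referer and user-agent.
UA_PATTERN = re.compile(r'"[^"]*" "(?P<agent>[^"]*)"\s*$')


def archive_hits(lines, needle="ia_archiver"):
    """Yield log lines whose user-agent field contains `needle` (case-insensitive)."""
    needle = needle.lower()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match and needle in match.group("agent").lower():
            yield line
```

Typical use would be `for hit in archive_hits(open("access.log")): print(hit, end="")`, where `access.log` stands in for wherever your server keeps its log.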
  • Slashdotted .... (Score:4, Informative)

    by theoddbot ( 520034 ) on Wednesday June 22, 2005 @03:32AM (#12879101)
  • No redundancy? WTF? (Score:3, Informative)

    by melted ( 227442 ) on Wednesday June 22, 2005 @03:34AM (#12879111) Homepage
    I've actually read TFA. They recommend JBOD configurations to their clients. One drive goes titsup and you've lost 400GB of data. Do they at least offer some kind of mirroring/redundancy solution to back the data up to another array?
  • Re:Electricity $$$ ? (Score:3, Informative)

    by TheFlyingGoat ( 161967 ) on Wednesday June 22, 2005 @03:40AM (#12879120) Homepage Journal
    50kW at 10 cents per kilowatt hour = $120/day.

    I doubt it draws at a constant 50kW, though. It's probably an average (was given in TFA).

    My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily? :)
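
For what it's worth, the arithmetic holds: energy in kilowatt hours is average power in kW times hours of use, and cost is energy times the price per kWh. A small sketch (the function name is invented here; the 50 kW and 10 cents figures are the ones quoted above):

```python
def daily_energy_cost(power_kw, price_per_kwh, hours_per_day=24.0):
    """Cost of running a load of `power_kw` (average draw) for `hours_per_day` hours."""
    energy_kwh = power_kw * hours_per_day  # kW * h = kWh
    return energy_kwh * price_per_kwh


# 50 kW around the clock at $0.10 per kWh
print(f"${daily_energy_cost(50, 0.10):.2f}/day")  # prints $120.00/day
```

If 50 kW is an average rather than a constant draw, the daily total comes out the same, since only the energy consumed over the day matters.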
  • Re:1.5 Petabytes? (Score:3, Informative)

    by TheFlyingGoat ( 161967 ) on Wednesday June 22, 2005 @03:43AM (#12879130) Homepage Journal
    No. They say 2500 drives (actually 2400 since it's 4 per system in 600 systems), which comes out to 600GB per drive for 1.5PB.
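
A quick check of the per-drive figure, assuming decimal units (1 PB = 1,000,000 GB, as drive vendors count). The function name is hypothetical. Note that 600GB per drive corresponds to the summary's 2,500 drives; with 2,400 drives the same 1.5PB works out to 625GB each.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10**6 GB


def gb_per_drive(total_pb, drive_count):
    """Average capacity per drive, in GB, needed to reach `total_pb` petabytes."""
    return total_pb * GB_PER_PB / drive_count


print(gb_per_drive(1.5, 2500))  # prints 600.0
print(gb_per_drive(1.5, 2400))  # prints 625.0
```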
  • by Depili ( 749436 ) on Wednesday June 22, 2005 @04:29AM (#12879235)
    According to the ( []) they indeed have some redundancy, but not RAID. They are operating each system as a separate node, and mirroring nodes. The above link also sheds light on other questions regarding TFA.
  • by pcgabe ( 712924 ) on Wednesday June 22, 2005 @07:29AM (#12879632) Homepage Journal
    Linky Goodness: []

    • Episode 1 teaser sheets
    • Does the world really need a 25 gig drive?
    • Patents: how do we keep software free?
    Oh, how far we've come.
  • by budgenator ( 254554 ) on Wednesday June 22, 2005 @08:33AM (#12879868) Journal
    "Although Hitachi does not offer an 'enterprise' or '24x7' SATA drive, our testing found their drives to be as reliable as anything out there, enterprise distinction or not," Saikley said.

    I read that as SATA drives. What I wonder about is
    PetaBoxes are ~$2.00/GB per the article
    Coraid, priced at $1,995.00 + (4*$314.99 hard drives) = 3918.94 + 664.00 (15U tabletop rackmount) or ~$0.41/GB per my calculations;
    looks like a price war is brewing here unless PetaBox has some serious KW in BTU out or performance advantages.
  • Re:copyright (Score:2, Informative)

    by generic-man ( 33649 ) on Wednesday June 22, 2005 @09:37AM (#12880251) Homepage Journal
    Just because you have a cache of something doesn't give you the right to redistribute it for commercial gain. The initial author still retains ownership.

    Imagine if you had a device designed to record audio and reproduce it. That doesn't mean that you can resell your recordings; the original author retains ownership.

    I'm not claiming that it is unethical to cache web pages, just that companies such as Google presume that they have the right to redistribute content to which they own no rights. The web is not like Usenet, where each server hosts others' posts; content is served by an author for as long as the author wants.
  • Re:NAS or SAN or ??? (Score:3, Informative)

    by TTK Ciar ( 698795 ) on Wednesday June 22, 2005 @01:18PM (#12882248) Homepage Journal

    The Petabox is shipped to a customer running Debian Linux by default (though of course you can install whatever you want), so there are a number of solutions to choose from. OpenAFS and (as you pointed out) GFS are made specifically for this kind of setup, providing fairly good abstraction of the underlying cluster and easy access to random data. Within The Archive, we have experimented with different approaches, the one currently in production using an API based on a UDP locator service and rsync.

    Another approach uses a /net directory under which remote filesystems are NFS-mounted on demand (I'm not sure how it works, our chief sysadmin set it up for testing, but if /net/ia105783/0/foo is not mounted, and then you type 'ls /net/ia105783/0/foo' (or any other command which opens a hypothetical file off /net), the remote filesystem is automagically NFS-mounted so that the command can complete).

    I'm not sure that we'll ever use it in production to access our distributed information, though; NFS has a very, very low error rate, but when you have thousands of NFS mounts going on at once (as we do NFS-mount users' /home directories everywhere), "very, very low" translates to "tripping over errors every few days". I've seen some really weird NFS failures and partial failures at The Archive, and I've written some software to be tolerant of them, but most of our software is not, and realistically speaking never will be. It's written to be tolerant of rsync errors instead. *shrug*, six of one, half a dozen of the other. This is one of those things where you need to just pick a solution and use it, whether it's OpenAFS, GFS, NFS, or some homespun thing. All have their pros and cons, and you'll learn to deal with their problems as you use them.

    -- TTK

  • Re:Once upon a time (Score:3, Informative)

    by NaDrew ( 561847 ) on Wednesday June 22, 2005 @01:52PM (#12882575) Journal
    About then the DJ came on and said "We're playing 'Macarena' until you vomit." Then played the song again.

    After that iteration of the song the DJ came back and played some phone calls of people begging him to change the song, but he just said that it was "Macarena" until you vomit.

    I don't know when the thing started, but by the time I got to work it was the 17th or so "Macarena" in a row.

    This is called stunting. Radio stations do it to mark a transition between formats, apparently in an attempt to drive off listeners to their previous format.
  • by Anonymous Coward on Wednesday June 22, 2005 @07:10PM (#12885581)
    Check out The PetaBox page at The Internet Archive.
    That page has been around for years, and their forum talks about many of the things they went through. They custom-built the cases, and they couple nodes together, and they are mirrors of each other. If one fails, the other copy is still there. Not to mention the copies in other geographic locations. This also isn't just "one large file system". Each drive is a separate filesystem, and they serve the files up via standard means such as FTP and HTTP. (There is a UDP-based locator protocol they wrote as well, to find data in the massive amount of storage.)
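
To make the locator idea concrete, here is a toy sketch of a UDP lookup in Python. Everything in it is an assumption made for illustration: the `WHERE`/`HERE` message format, the function names, and the item-to-path mapping are invented, and the thread does not describe the Archive's real protocol. Each node answers only for items it holds, so a mirrored pair produces two replies.

```python
import socket
import threading


def serve_locator(sock, node_name, items, stop):
    """Answer 'WHERE <item>' datagrams for items this node holds (made-up wire format)."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue  # wake up periodically to check the stop flag
        parts = data.decode("utf-8", "replace").split(maxsplit=1)
        if len(parts) == 2 and parts[0] == "WHERE" and parts[1] in items:
            reply = f"HERE {node_name} {items[parts[1]]}"
            sock.sendto(reply.encode("utf-8"), addr)


def locate(item, node_addrs, timeout=0.5):
    """Ask every known node for `item`; return a list of (node, path) replies."""
    found = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        for addr in node_addrs:
            client.sendto(f"WHERE {item}".encode("utf-8"), addr)
        for _ in node_addrs:
            try:
                data, _addr = client.recvfrom(1024)
            except socket.timeout:
                break  # nodes without the item stay silent
            _tag, node, path = data.decode("utf-8").split(maxsplit=2)
            found.append((node, path))
    return found
```

Once a client knows which node and path hold the item, it would fetch the file over an ordinary transport such as HTTP or FTP, as the comment describes. Staying silent on a miss keeps the nodes simple, at the cost of the client waiting out a timeout when nothing is found.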
