PetaBox: Big Storage in Small Boxes
An anonymous reader writes "LinuxDevices.com is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.
great usage. (Score:5, Informative)
Re:copyright (Score:4, Informative)
You can exclude them from your website using robots.txt:
User-agent: ia_archiver
Disallow: /
For example, if you go to archive.org and plug my site into the Wayback Machine:
We're sorry, access to http://www.seifried.org/ [seifried.org] has been blocked by the site owner via robots.txt.
You can also ask them to expunge your site from the archive.
They go out of their way to make it easy to prevent your site from being copied (more so than most search engines).
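For anyone who wants to check what a rule like the one above actually blocks, here is a minimal sketch using Python's standard urllib.robotparser. The ROBOTS_TXT string and the is_allowed helper are my own illustration, not anything the Archive publishes, and the sketch only shows how a standards-following parser reads the rule, not how ia_archiver itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block only the Internet Archive's crawler.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ia_archiver", "http://www.seifried.org/"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://www.seifried.org/"))
```

Note that an empty "Disallow:" line means "nothing is disallowed", so the slash has to sit on the Disallow line for the block to take effect.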
Re:Downloading Kazaa (Score:3, Informative)
Slashdotted .... (Score:4, Informative)
http://mirrordot.org/stories/83ede29a5f303f8c47d1
No redundancy? WTF? (Score:3, Informative)
Re:Electricity $$$ ? (Score:3, Informative)
I doubt it draws a constant 50 kW, though. That figure is probably an average (it was given in TFA).
My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily?
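To answer the question: yes, energy in kilowatt-hours is just power in kW multiplied by hours of use. A rough sketch follows, using the 50 kW figure from this thread. The price per kWh is a made-up placeholder, so plug in your own utility rate.

```python
def energy_kwh(power_kw, hours):
    """Energy used, in kWh, for a load drawing power_kw for the given hours."""
    return power_kw * hours


def cost(power_kw, hours, price_per_kwh):
    """Cost of running the load, in whatever currency price_per_kwh uses."""
    return energy_kwh(power_kw, hours) * price_per_kwh


if __name__ == "__main__":
    ASSUMED_PRICE = 0.10  # hypothetical rate per kWh, not from the article
    daily = energy_kwh(50, 24)  # 50 kW average draw, running around the clock
    print(f"{daily} kWh per day")
    print(f"{cost(50, 24, ASSUMED_PRICE):.2f} per day at the assumed rate")
```

If 50 kW is an average rather than a constant draw, the daily total comes out the same, since averaging is already baked into the number.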
Re:1.5 Petabytes? (Score:3, Informative)
Re:No redundancy? WTF? (Score:4, Informative)
Re:Wayback and Slashdot (Score:2, Informative)
http://web.archive.org/web/19981111190256/http://
Highlights:
Re:They don't like RAID (Score:3, Informative)
I read that as SATA drives. What I wonder about is the pricing:
PetaBoxes are ~$2.00/GB per the article,
while
Coraid, priced at $1,995.00 + (4 × $314.99 hard drives) = 3918.94 + 664.00 (15U tabletop rackmount), comes to ~$0.41/GB per my calculations.
Looks like a price war is brewing here, unless PetaBox has some serious kW-in/BTU-out or performance advantages.
Re:copyright (Score:2, Informative)
Imagine if you had a device designed to record audio and reproduce it [pocketcalculatorshow.com]. That doesn't mean that you can resell your recordings; the original author retains ownership.
I'm not claiming that it is unethical to cache web pages, just that companies such as Google presume that they have the right to redistribute content to which they own no rights. The web is not like Usenet, where each server hosts others' posts; content is served by an author for as long as the author wants.
Re:NAS or SAN or ??? (Score:3, Informative)
The Petabox is shipped to a customer running Debian Linux by default (though of course you can install whatever you want), so there are a number of solutions to choose from. OpenAFS and (as you pointed out) GFS are made specifically for this kind of setup, providing fairly good abstraction of the underlying cluster and easy access to random data. Within The Archive, we have experimented with different approaches, the one currently in production using an API based on a UDP locator service and rsync.
Another approach uses a /net directory under which remote filesystems are NFS-mounted on demand. I'm not sure how it works (our chief sysadmin set it up for testing), but if /net/ia105783/0/foo is not mounted and you type 'ls /net/ia105783/0/foo' (or any other command which opens a hypothetical file off /net), the remote filesystem is automagically NFS-mounted so that the command can complete.
I'm not sure that we'll ever use it in production to access our distributed information, though; NFS has a very, very low error rate, but when you have thousands of NFS mounts going on at once (as we do, since we NFS-mount users' /home directories everywhere), "very, very low" translates to "tripping over errors every few days". I've seen some really weird NFS failures and partial failures at The Archive, and I've written some software to be tolerant of them, but most of our software is not, and realistically speaking never will be. It's written to be tolerant of rsync errors instead. *shrug*, six of one, half a dozen of the other. This is one of those things where you need to just pick a solution and use it, whether it's OpenAFS, GFS, NFS, or some homespun thing. All have their pros and cons, and you'll learn to deal with their problems as you use them.
-- TTK
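For the curious, "tolerant of NFS failures" could be as simple as retrying a read a few times before giving up. The sketch below is only a guess at the general idea, not the Archive's actual code. The read_with_retry name, the retry count and the backoff delays are all assumptions for illustration.

```python
import time


def read_with_retry(path, attempts=3, delay=0.5):
    """Read a file, retrying on OSError (e.g. a stale handle or transient I/O error).

    Waits delay, then 2*delay, and so on between attempts, and re-raises the
    last error if every attempt fails.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay * (2 ** attempt))
    raise last_error
```

Something like read_with_retry("/net/ia105783/0/foo") would then survive a brief hiccup, though it does nothing for partial failures where a read succeeds but returns bad data.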
Re:Once upon a time (Score:3, Informative)
This is called stunting [wikipedia.org]. Radio stations do it to mark a transition between formats, apparently in an attempt to drive off listeners to their previous format.
Re:No, it can't just be JBOD. (Score:1, Informative)
That page has been around for years, and their forum talks about many of the things they went through. They custom-built the cases, and they couple nodes together, and they are mirrors of each other. If one fails, the other copy is still there. Not to mention the copies in other geographic locations. This also isn't just "one large file system". Each drive is a separate filesystem, and they serve the files up via standard means such as FTP and HTTP. (There is a UDP-based locator protocol they wrote as well, to find data in the massive amount of storage.)
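The details of their UDP locator protocol are not described here, so the following is purely a toy sketch of the general idea: a client sends an item name in a datagram, and a node that holds the item replies with where it lives. The message format, the serve_once and locate names, and the "node:path" reply are all invented for illustration. A real deployment would presumably broadcast the query to many nodes, while this sketch talks to a single address.

```python
import socket


def serve_once(sock, held_items, node_name):
    """Answer one locator query on an already-bound UDP socket.

    held_items maps item names to local paths. Nodes that do not hold
    the requested item stay silent.
    """
    data, addr = sock.recvfrom(1024)
    item = data.decode("utf-8")
    if item in held_items:
        reply = f"{node_name}:{held_items[item]}"
        sock.sendto(reply.encode("utf-8"), addr)


def locate(item, addr, timeout=1.0):
    """Ask the node at addr where item lives; return its reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(item.encode("utf-8"), addr)
        try:
            reply, _ = s.recvfrom(1024)
        except socket.timeout:
            return None
        return reply.decode("utf-8")
```

Once the client knows which node and drive hold the item, it can fetch the file over plain HTTP or FTP (or rsync), which fits the "each drive is a separate filesystem" design described above.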