PetaBox: Big Storage in Small Boxes
An anonymous reader writes "LinuxDevices.com is reporting that a Linux-based system comprising more than a petabyte of storage has been delivered to the Internet Archive, the non-profit organization that creates periodic snapshots of the Internet. The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article. Now to strap one of those puppies to my iPod!" The Internet Archive continues to astound.
Good to see. (Score:5, Funny)
Re:Good to see. (Score:5, Funny)
Re:Good to see. (Score:3, Insightful)
I had a similar experience, I was playing around on irc back when we were swapping
Re:Good to see. (Score:2)
! LaTeX Error: \include cannot be nested.
Not particularly impressive.
Re:In terms of bandwidth (Score:2)
I put my laptop with an 80GB hard drive onto my desk in a quarter second; does that mean I got 2,560 Gb/s?
Re:Good to see. (Score:5, Interesting)
5 petabytes of storage is enough for a brief five-minute DVD-quality sex scene for each person of legal age in the US (two to a scene). 100 petabytes would be five minutes of porn of every pair of people in the world.
I actually wonder about this a little; how many women have posed nude on the internet? There seem to be an awful lot; I haven't been able to see them all (though I will continue to try). Where do they mostly come from, I wonder.
Re:Good to see. (Score:2)
I've always wanted to know the answer to your second question as well, actually, hopefully someone else will be able to answer or at least give some interesting insights. Another question is "why do they do it?", and it's not something I've easily been able to work out. A friend of mine (one of those born-again Christian types) admitted in one of those email forward-things to posing nude, which I can't quite believe. Another friend's ex-girlfriend apparently has a sui
Re:Good to see. (Score:2)
Re:Good to see. (Score:2)
Re:Good to see. (Score:5, Funny)
Let me get this straight, you're trying to see all the porn in the world, and you still don't know where babies come from?
Your sig, and Amazon wishlists. (Score:2)
Learn something new every day, I suppose.
--grendel drago
Naked Americans? (Score:2)
On the other hand, you could just go grab a Livejournal account, join the communities "kaizersoze125" and "show_your_boobs", and marvel at the quantity of amateur porn folks throw out there for free.
Seriously. There's some high quality out there. Some of it's not even members-locked (earningtails [livejournal.com], for instance).
--grendel drago
Re:Good to see. (Score:2)
Re:Good to see. (Score:5, Funny)
How about a Beo.. oh damn
Re:MOD PARENT UP (Score:2)
We'd better stop it, and the sooner the better. So please, for God's sake, mod GP up!
Storage galore! (Score:3, Funny)
You hear about the Petabox? (Score:5, Funny)
R. Kelly was scrambling to find the company's phone number.
Re:You hear about the Petabox? (Score:5, Funny)
Hmmm, this seems almost familiar [slashdot.org]...
Let's analyze this situation:
GET OUT OF MY MIND!!!
Re:You hear about the Petabox? (Score:2)
archive.org (Score:5, Interesting)
They do a lot more than that! I've just been downloading some Warren Zevon [archive.org] shows from their Live Music Archive.
copyright (Score:5, Interesting)
Re:copyright (Score:4, Informative)
You can exclude them from your website using the robots.txt:
User-agent: ia_archiver
Disallow: /
For example if you go to archive.org and plug my site into the wayback machine:
We're sorry, access to http://www.seifried.org/ [seifried.org] has been blocked by the site owner via robots.txt.
and you can also request them to expunge your site from the archive.
They go out of their way to make it easy to prevent your site being copied (more so than most search engines).
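For anyone who wants to check what a robots.txt like the one above actually blocks, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are fed in as text rather than fetched, and the example.org URL and the "Googlebot" comparison agent are placeholders chosen for illustration, not anything from the post.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule set from the comment above, supplied as text
# instead of being downloaded from a live site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical URL; only the user-agent name matters for this rule set.
archive_ok = is_allowed("ia_archiver", "http://example.org/index.html")
other_ok = is_allowed("Googlebot", "http://example.org/index.html")
print("ia_archiver allowed:", archive_ok)
print("Googlebot allowed:", other_ok)
```

Because the file names only ia_archiver, other crawlers are unaffected by it.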
Re:copyright (Score:2)
I can't get older pages of a web site I operated several years ago because a robots.txt file was inadvertently added that blocks it. At the time, I didn't know about the Internet Archive, and as a result potentially years of this site's history is gone.
Oops (Score:2)
Or was it backup CDs for coasters/frisbees?
(CDs don't work well for frisbees. In my experience they break after just a few brick walls, and it costs a stroke, and makes it harder to get par.)
Re:copyright (Score:3, Interesting)
Re:copyright (Score:2)
Particularly for a robots.txt like this [whitehouse.gov].
Re:copyright (Score:2)
If they really wanted to go out of their way, they would ask permission before illegally copying and distributing copyrighted material for which they do not have permission.
Re:copyright (Score:3, Interesting)
Besides, the IA only archives HTML pages, and small images in them, nothing else. If you consider your HTML content to be unreproducible copyrighted material, might I ask why the hell it is publicly accessible on the Web in the first place?
Re:copyright (Score:3, Insightful)
If you consider your music to be copyrighted material, might I ask why the hell it's being played on the radio in the first place?
If you consider your book to be copyrighted material, might I ask why the hell it's being lent out in the library in the first place?
If you consider your movie to be copyrighted material, might I ask why the hell it's b
Re:copyright (Score:2)
Copying a book isn't standard practice just to know the story.
Copying a movie isn't required to see it on HBO.
Making a copy of html from the internet IS part of getting your computer to display it.
I hope that clears up the difference that was implied in the GP.
Re:copyright (Score:2, Informative)
Imagine if you had a device designed to record audio and reproduce it [pocketcalculatorshow.com]. That doesn't mean that you can resell your recordings; the original author retains ownership.
I'm not claiming that it is unethical to cache web pages, just that companies such as Google presume that they have the right to redistribute content to which they own no rights. The web is
Re:copyright (Score:2)
Re:copyright (Score:2)
I know that the US Copyright Office has granted a DMCA exemption for at least some of the material they archive.
Re:copyright (Score:2, Interesting)
Did you ask this question when Google introduced site cache several years ago?
Re:copyright (Score:2, Insightful)
1. FAIR USE!
2. Google is merely providing a service. If you don't like it you can opt out.
The Google Cache is not fair use, as it reproduces the entirety of a web page's text for none of the purposes for which Fair Use is defined. (Under Fair Use you are entitled to use a portion of a copyrighted work, not the whole thing.)
The second one just cracks me up. I thought the Slashdot crowd didn't like being asked to opt out.
Now, trifi
Re:copyright (Score:2)
Petabox? (Score:4, Funny)
Isn't that what naked girls climb out of to protest fur coats?
Thank you, I'll be here all week.
Re:Petabox? (Score:2, Funny)
IPod? (Score:2, Funny)
Re:IPod? (Score:2, Funny)
Once upon a time (Score:4, Funny)
"Macarena" was on the radio when I started the car. A few minutes later "Macarena" was still on, and I was thinking that the song must be longer than I thought, or something. About then the DJ came on and said "We're playing 'Macarena' until you vomit." Then played the song again.
After that iteration of the song the DJ came back and played some phone calls of people begging him to change the song, but he just said that it was "Macarena" until you vomit.
I don't know when the thing started, but by the time I got to work it was the 17th or so "Macarena" in a row.
Re:Once upon a time (Score:2)
Re:Once upon a time (Score:3, Informative)
This is called stunting [wikipedia.org]. Radio stations do it to mark a transition between formats, apparently in an attempt to
great usage. (Score:5, Informative)
'small box' (Score:5, Funny)
Puppies (Score:4, Funny)
Re:Puppies (Score:2)
hehehe
Re:Puppies (Score:3, Funny)
maybe i'll be quoted in 15 years.. (Score:4, Funny)
Re:maybe i'll be quoted in 15 years.. (Score:2, Funny)
No RAID?! (Score:2)
Re:No RAID?! (Score:3, Insightful)
So, while yes, if it really was just o
Electricity $$$ ? (Score:3, Funny)
Haha~
Re:Electricity $$$ ? (Score:3, Informative)
I doubt it draws at a constant 50kW, though. It's probably an average (was given in TFA).
My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily?
Re:Electricity $$$ ? (Score:2)
And yes, to compute energy consumption (in kWh) you merely multiply the power drawn from the grid (in kW) by the consumption timeframe (in hours).
Therefore if a unit draws 50kW, it consumes 50kWh worth of energy for every hour it runs.
Re:Electricity $$$ ? (Score:2)
My math might be completely wrong, given I don't have a clue how to calculate kilowatt hours. Is it just kW * hours_used_daily? :)
Close. It is kW * hours_used. The "daily" part is only valid if (as in your case) you are talking about the amount of energy used over the course of a day.
Electricity here is $.15/kWh, which would put this box's operation at $180/day. In some places, electricity is as low as $.04/kWh, which would put the energy cost of these boxes at only $48/day.
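The arithmetic in this sub-thread can be sketched in a few lines of Python. It assumes, as the posters do, a constant 50kW draw around the clock; the function name is made up for illustration.

```python
def daily_energy_cost(power_kw, price_per_kwh, hours=24.0):
    """Energy (kWh) is power (kW) times time (h); cost is energy times price."""
    energy_kwh = power_kw * hours
    return energy_kwh * price_per_kwh

# The two electricity prices quoted above, at a constant 50 kW draw.
cost_expensive = daily_energy_cost(50, 0.15)
cost_cheap = daily_energy_cost(50, 0.04)
print(f"at $0.15/kWh: ${cost_expensive:.2f} per day")
print(f"at $0.04/kWh: ${cost_cheap:.2f} per day")
```

If the 50kW figure is an average rather than a constant draw, the daily total comes out the same, since only the average power over the period matters.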
1.5 Petabytes? (Score:4, Interesting)
The math doesn't work when you multiply the number of systems out either: 600 systems * 1.6TB/system = 960TB. That's just under a petabyte, or am I missing something?
Also, if you've got those in a RAID5 setup, you're 'only' talking about approx 800TB of usable space. That's far less than the 1.5 petabytes claimed.
800TB is a lot of space, but there must be a cheaper/easier way than purchasing 600 systems to do it.
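A rough way to redo this capacity estimate, as a sketch only: the 600 nodes and 1.6TB per node come from the comment, while the disks-per-node count is an assumption (other comments in the thread mention four disks per board). Single-parity RAID5 gives up one disk's worth of space per array.

```python
def usable_capacity_tb(nodes, tb_per_node, raid5_disks_per_node=None):
    """Raw capacity, or RAID5-usable capacity if a per-node disk count is given."""
    raw_tb = nodes * tb_per_node
    if raid5_disks_per_node is None:
        return raw_tb  # JBOD: every byte is usable, nothing is protected
    if raid5_disks_per_node < 3:
        raise ValueError("RAID5 needs at least three disks")
    # One disk per array holds parity, so (n - 1) / n of the space is usable.
    return raw_tb * (raid5_disks_per_node - 1) / raid5_disks_per_node

raw = usable_capacity_tb(600, 1.6)
raid5_four = usable_capacity_tb(600, 1.6, raid5_disks_per_node=4)
print(f"raw: {raw:.0f} TB, RAID5 over 4 disks: {raid5_four:.0f} TB")
```

With four disks per node the usable figure lands somewhat below the "approx 800TB" quoted above; either way it is well short of 1.5 petabytes.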
They don't like RAID (Score:5, Interesting)
Also, the article says they don't like RAID, due to bad experiences with RAID5, and the system is configured as JBOD (Just a Bunch Of Disks). It doesn't say why, or what users should do to get equivalent protection. My guess is that depending on RAID within a box means you're still vulnerable if the box's CPU or disk controller decides to scribble the disks, or the power supply decides to catch fire or short out and deliver 240VAC on the +5V line or whatever. So if you want a RAID-like set of redundancy, set up your applications or file system mounting or something to calculate the protection disk in software and hand it off to another 1U box for storage.
The overhead of the motherboards here is not that high - they're about $150-200, and support 4 disks that probably cost $200-300 each, so they're only about 20% of the cost, which is not bad. The article didn't say they're using SATA, and it sounded like it's some IDE variant instead, but if you're only using 100 Mbps Ethernet to connect to the box and not the optional GigE, it's not the bottleneck anyway. If you wanted an alternative design, you could probably do something with a couple of 4-way SATA controllers per CPU, with a lot of disks stacked vertically in a 3-4U box looking like an X-serve or something. But that wouldn't necessarily have much of an advantage.
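The "calculate the protection disk in software" idea can be sketched as plain XOR parity across blocks held on different boxes. This is only an illustration of the guess in the comment above, not how the PetaBox or the Archive actually protects data; the block contents and function name are invented.

```python
from functools import reduce
from operator import xor

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple of ints per byte position.
    return bytes(reduce(xor, column) for column in zip(*blocks))

# Hypothetical data blocks, one per storage node.
node_blocks = [b"node-A-data", b"node-B-data", b"node-C-data"]

# The parity block would be handed to a separate box for storage.
parity = xor_blocks(node_blocks)

# If node B dies, XOR of the survivors and the parity rebuilds its block.
rebuilt_b = xor_blocks([node_blocks[0], node_blocks[2], parity])
print(rebuilt_b == node_blocks[1])
```

This protects against losing any one block in the group, the same guarantee RAID5 gives within a single box, but spread across boxes so a failed controller or power supply takes out only one member.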
Re:They don't like RAID (Score:3, Informative)
I read that as SATA drives. What I wonder about is
PetaBoxes are ~$2.00/GB per the article
while
Coraid, priced at $1,995.00 + (4*$314.99 hard drives) = 3918.94 + 664.00 (15U tabletop rackmount) or ~$0.41/GB per my calculations;
looks like a price war is brewing here unless PetaBox has some serious KW in BTU out or p
Re:1.5 Petabytes? (Score:3, Informative)
Slashdotted .... (Score:4, Informative)
http://mirrordot.org/stories/83ede29a5f303f8c47d1
No redundancy? WTF? (Score:3, Informative)
Re:No redundancy? WTF? (Score:4, Informative)
Re:No redundancy? WTF? (Score:2, Interesting)
The archive.org [archive.org] maintains its archives in several geographically different locations and files are mirrored between those sites. If one disk or node breaks, you still have two or more copies of that material.
If you archive serious amounts of data, redundancy within a node is not the best solution; it is better to distribute information between systems. For very important data, you can have as many copies as you have nodes; less important data may have just a single copy. If it gets lost, then ok, shit happens but
A Great Historical Tool (Score:5, Insightful)
I for one think that archive.org should turn into some UN effort, with a mission to chronicle and store daily/timely snapshots of the internet and the culture at the time, preserving it for future generations. What a tool for future historians!
The ability to look at a large representation of society at one single critical moment in time, and being able to have first hand sources for all that information is something that can truly change the way history is recorded (and not in the bad newspeak ingsoc way either). In fact, a holistic archive of what happens day-to-day, in an easily accessible format, might well help written history to be more representative of actual history (instead of, say the history Bush wants us to believe; that the Iraq war was for human rights and not wmd's). I love Foucault.
The internet archive rocks... really hope this project continues full blast.
- Peace
Re:A Great Historical Tool (Score:3, Funny)
Yes, otherwise such cultural gems as goatse.cx would be lost into the void forever...
Re:A Great Historical Tool (Score:2, Insightful)
Re:A Great Historical Tool (Score:3, Insightful)
Funny you should mention that, but this whole "Internet as history" thing has me wound up tight.
Books cannot be changed. They can be destroyed, reprinted and banned but the first edition will always exist in a collection.
The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.
The IA is the MiniTruth's dream come true.
But who cares? History has always
Re:A Great Historical Tool (Score:2)
The first edition of a website only exists in digital form and there is no way to stop the original from being edited and timestamped back to the expected date.
If you want trust, use trust tools. We already knew that digital media does not leave physical traits behind, but that doesn't mean that other checking processes can't be built.
But who cares? History has always been written by the victorious, hasn't it?
Actually yes. The originals coul
Re:A Great Historical Tool (Score:2)
Actually, it's so far been its nightmare come true. Many an effort to redact information or remove something embarrassing from corporate, government, and news websites has been foiled by the IA. For example, a page related to a plagiarism controversy local to me was conveniently pulled from where it was hosted, but remained on the IA--foiling the effort to suppress the ability to compare the infringing text.
Case and point? (Score:2)
Sorry.
The MPAA and RIAA (Score:3, Interesting)
Wayback and Slashdot (Score:5, Funny)
Slashdot has looked virtually identical since 1998!
Re:Wayback and Slashdot (Score:2, Informative)
http://web.archive.org/web/19981111190256/http://
Highlights:
Re:Wayback and Slashdot (Score:5, Funny)
Why, just last year they introduced an entirely new story into the rotation of duplicates . .
hawk
What's in a name? (Score:2)
Sony had a petabyte tape backup system they wanted to sell into North America... called the "Peta-file". Thankfully, Sony NA managed to have the name changed prior to its introduction here.
So, PetaBox is slightly better... slightly.
MadCow.
NAS or SAN or ??? (Score:2)
Or is the user expected to do some kind of in-house thingy, like google or (presumably) the internet archive?
Re:NAS or SAN or ??? (Score:3, Informative)
The Petabox is shipped to a customer running Debian Linux by default (though of course you can install whatever you want), so there are a number of solutions to choose from. OpenAFS and (as you pointed out) GFS are made specifically for this kind of setup, providing fairly good abstraction of the underlying cluster and easy access to random data. Within The Archive, we have experimented with different approaches, the one currently in production using an API based on a UDP locator service and rsync.
Anoth
Not a big improvement... (Score:2, Interesting)
hardware supplier? (Score:2)
The angular momentum must be huge! (Score:2)
Two points (Score:5, Interesting)
First off, this isn't quite an example of a company suddenly deciding to donate stuff to the Archive. As can be seen on their own website [capricorn-tech.com], Capricorn was spun off from the Archive on July 1, 2004. To a large extent, Capricorn exists for the specific purpose of providing storage to the Archive, and if that same storage can be sold to others so much the better.
Second, what about interconnects and performance? The product descriptions say nothing about SCSI or FC or other storage-oriented connectivity, so one must assume that the connection to these boxes is through a network. That would mean each node is an NFS server (or similar), serving up 1.6TB using a 1GHz C3 processor, a maximum of 1GB of memory (for caching etc.) and what appears to be a single GigE link. Can you say unbalanced? The Internet Archive might be the only system with an access pattern so sparse that the ratio between capacity and performance wouldn't be crippling. Don't try using one of these with any other kind of application if performance is a concern...and BTW they don't seem to say anything about high availability or other storage functionality (e.g. integrated backup or snapshots) either. Capricorn's big play seems to be power consumption, but there are other players that can beat them on density (e.g. Copan with 224TB per rack [copansys.com]) and multitudes who can offer better performance/functionality. I hate to sound negative, but this is a product so specialized as to be uninteresting.
Disclaimer: I think I met some of the Copan guys once and they seemed cool enough, but there's no other relationship between me and them. That just happened to be the first name I thought of in this space.
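To put a number on the capacity-versus-bandwidth imbalance described in this comment, here is a back-of-the-envelope sketch. It assumes the link runs at full line rate with no protocol or disk overhead, which is optimistic; the 1.6TB and GigE figures come from the comment, and the 100 Mbps case from an earlier one.

```python
def full_read_hours(capacity_tb, link_mbps):
    """Hours needed to stream a node's entire capacity over its network link."""
    capacity_bits = capacity_tb * 1e12 * 8      # decimal terabytes to bits
    link_bits_per_second = link_mbps * 1e6
    return capacity_bits / link_bits_per_second / 3600

gige_hours = full_read_hours(1.6, 1000)
fast_ethernet_hours = full_read_hours(1.6, 100)
print(f"1.6 TB over GigE: about {gige_hours:.1f} hours")
print(f"1.6 TB over 100 Mbps: about {fast_ethernet_hours:.1f} hours")
```

For a mostly-idle archive that is tolerable; for a workload that touches a large fraction of the data regularly, it is the bottleneck the comment warns about.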
Re:Two points (Score:2)
Capricorn's big play is also probably price. Price is mentioned quite a few times in the article.
Their product kinda sounds like googl
Petabox / Internet Archive (Score:2)
What kind of metastructure do they put on the disks to achieve that kind of large filesystem, and improve reliability?
It was obviously faked (Score:2)
Re:Downloading Kazaa (Score:3, Informative)
Re:Mega Systems (Score:2, Funny)
Re:What's wrong with hot swap and RAID 5? (Score:2)
Re:What's wrong with hot swap and RAID 5? (Score:3, Interesting)
They don't use hot swap and RAID5 for the same reason Google doesn't run on mainframes:
It's just cheaper to let a higher level logic take care of that stuff instead of strapping redundancy on every node...
Why hot swap if it isn't needed? The rest of the node will be mirrored somewhere else, so for the cost of fitting out everything with HS bays you could get 5 or 10% more nodes...
Same for RAID5: good high performance RAID5 controllers would increase the system cost b
Re:What's wrong with hot swap and RAID 5? (Score:2, Interesting)
GOK, I have 3PB of storage synchronised across two data centres here, all in 7+1 RAID5. Mostly self healing too, if a drive pops, then a spare drive in the same array builds itself into that stripe set, enabling hot replacement of the dead drive.
I would love to know what their "painful experience" was!
Using JBOD for this seems a tad courageous, to say the least.
And then, of cou
Re:Courageous? Try insane. (Score:2)
If properly managed, just using them as single discs would result in lower access times than RAID 0, as you can independently access files on the different discs instead of blocking all heads to retrieve a single file. And as they are only connected via Gbit LAN, STR doesn't matter anyway.
Re:Courageous? Try insane. (Score:2)
Re:What's wrong with hot swap and RAID 5? (Score:2)
They don't have "industry constraints", therefore don't need "industry practices"
Re:What's wrong with hot swap and RAID 5? (Score:2)
Re:article not clear (Score:2)
But it's only the raw meat. In order to really use it, you need a storage solution taking care of things like redundancy, node restore, etc.
Re:Mandatory (Score:2, Funny)
(sorry)
Re:Mandatory (Score:2)
Sometimes.
"Aren't you tired of the same old . .
Nope.
"Now I feel better
Slashdot cathartic therapy does it again!
Re:Just imagine... (Score:2)
Re:Umm.. (Score:2)
There are advantages to the VIA Eden platform.
No, it can't just be JBOD. (Score:2)