Internet Archive Gets 4.5PB Data Center Upgrade
Lucas123 writes "The Internet Archive, the non-profit organization that scrapes the Web every two months in order to archive web page images, just cut the ribbon on a new 4.5 petabyte data center housed in a metal shipping container that sits outside. The data center supports the Wayback Machine, the Web site that offers the public a view of the 151 billion Web page images collected since 1997. The new data center houses 63 Sun Fire servers, each with 48 1TB hard drives running in parallel to support both the web crawling application and the 200,000 visitors to the site each day."
Where do they store 4.5TB off site (Score:5, Interesting)
What about 1996 and earlier? (Score:5, Interesting)
Are there any resources that let us see websites from 1996, '95, '94, or '93? I would love to revisit the web as it appeared when I first discovered it (1994, at psu.edu).
Re:Story is meaningless without LOC measurement (Score:5, Interesting)
From http://www.lesk.com/mlesk/ksg97/ksg.html [lesk.com]: The 20-terabyte size of the Library of Congress is widely quoted, and as far as I know it is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.
1. Thirteen million photographs, even if compressed to a 1 MB JPG each, would be 13 terabytes.
2. The 4 million maps in the Geography Division might scan to 200 TB.
3. LC has over five hundred thousand movies; at 1 GB each they would be 500 terabytes (most are not full-length color features).
4. Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.
This makes the total size of the Library perhaps about 3 petabytes (3,000 terabytes).
so about 225 libraries by the old 20 TB standard, or 1.5 by the new 3 PB standard
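The arithmetic is easy to sanity-check; this one-liner just redoes the sums from the figures quoted above (the individual component sizes and the 4.5 PB Archive figure all come from the posts, not from me):

```shell
# Re-derive the Library of Congress comparison from the quoted figures.
awk 'BEGIN {
  total_tb = 13 + 200 + 500 + 2000          # photos + maps + movies + audio, in TB
  printf "LoC total: ~%.0f TB (~%.0f PB)\n", total_tb, total_tb / 1000
  archive_tb = 4500                          # the Archive: 4.5 PB
  printf "old 20 TB standard: %.0f libraries\n", archive_tb / 20
  printf "new 3 PB standard:  %.1f libraries\n", archive_tb / 3000
}'
```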
You can ship it over OC-192... (Score:5, Interesting)
... one would assume that something like this does regular off-site backups, which must add up to a hell of a lot...
As I recall from one of Brewster's talks: Part of the idea was that you can install redundant copies of this data center around the world and keep 'em synced.
You can ship 4.5 petabytes over a single OC-192 link (~10 Gbit/s) in roughly 42 days at full line rate.
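The back-of-the-envelope math, assuming OC-192's ~9.95 Gbit/s line rate and decimal petabytes (real-world protocol overhead would stretch this further):

```shell
# Transfer-time estimate: 4.5 PB over a single OC-192 at full line rate.
awk 'BEGIN {
  bytes = 4.5e15           # 4.5 PB, decimal
  bps   = 9.953e9          # OC-192 line rate, bits/s (~10 Gbit/s)
  days  = bytes * 8 / bps / 86400
  printf "%.0f days\n", days
}'
```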
Re:Slight problem? (Score:5, Interesting)
Here's a video tour [youtube.com] of one if you need it for reference.
Don't forget to turn off the water and unplug the ethernet cables. Just be very careful with the power cords.
Re:Where do they store 4.5TB off site (Score:3, Interesting)
one would assume that something like this does regular off-site backups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?
Create a snapshot of the ZFS file system (a zpool is roughly analogous to an LVM VG):
# zfs snapshot mydata@2009-03-24
Send snapshot to remote site:
# zfs send mydata@2009-03-24 | ssh remote "zfs recv mydata@2009-03-24"
Create a new snapshot the next day:
# zfs snapshot mydata@2009-03-25
Send only the incremental changes between the two:
# zfs send -i mydata@2009-03-24 mydata@2009-03-25 | ssh remote "zfs recv mydata@2009-03-25"
Now this looks a lot like rsync, but the difference is that rsync has to traverse the file system tree (directories and files), while ZFS only has to look at the 'birth time' (think ctime) of each block of data (not even the full file metadata) to see if it's newer than the first snapshot. If you're talking about tens (or hundreds) of thousands of directories, and an order of magnitude more files, that's a lot of overhead if nothing has changed. For 48 TB raw (what a Sun X4500 can have), ZFS can see that nothing has changed in a few minutes.
Creation of snapshots is instantaneous and there is no overhead in them (except that the space from deleted files isn't reclaimed / reused). There are people who create them every five seconds, and sync with a remote server--so at most you would lose five seconds worth of data if your disk died.
All changes are also atomic, so if you start your send/recv and the transmission dies partway through, the receiving end won't have a partial copy of the latest snapshot--it's all or nothing of the last good change.
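The daily cycle above is easy to wrap in a tiny script. Here's a sketch; `plan_incremental` is a made-up helper that only *prints* the zfs commands for one day's incremental send, so the plan can be eyeballed (or piped to sh) rather than run blind:

```shell
#!/bin/sh
# Hypothetical helper: emit (not execute) the zfs commands for one daily
# incremental send, given the dataset, yesterday's snapshot name, today's
# snapshot name, and the remote host.
plan_incremental() {
  ds=$1; prev=$2; today=$3; remote=$4
  echo "zfs snapshot ${ds}@${today}"
  echo "zfs send -i ${ds}@${prev} ${ds}@${today} | ssh ${remote} 'zfs recv ${ds}@${today}'"
}

plan_incremental mydata 2009-03-24 2009-03-25 remote
```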
Re:"Sun Fire" (Score:5, Interesting)
This seems to be an exact use case for the X4500-type system, which as far as I'm aware is pretty unique.
Indeed. Sun is on a density kick. Check out the X4600, which does for processing power what the X4500 did for storage.
In both cases, there actually are competing products that are sort of the same. The most conspicuous difference is that the Sun versions cram the whole caboodle into 4 rack units per system, about half the space required by their competitors.
More absurdly-dense Sun products:
http://www.sun.com/servers/x64/x4240/ [sun.com]
http://www.sun.com/servers/x64/x4140/ [sun.com]
The point of these systems is that they take up less of that expensive rack space than equivalent competitors. They're also "greener": if you broke all that storage and computing power down into less dense systems, you'd need a lot more electricity to run them and keep them cool. That not only saves money, it lets the owner claim they're working on their carbon footprint.
Re:Story is meaningless without LOC measurement (Score:3, Interesting)
The CDs are already in digital format, so lossily compressing them would be a cardinal sin.
The photos, movies, and maps are in analog format to start with, so we don't feel so bad using lossy compression. Image files are really big. I think the 1GB estimate per movie is pretty good, considering shorts, black and white, and the standard (or lower) definition of most of them. That would allow for a very high detail scan of the movie in something like MPEG4.
And, since they started in analog formats, there's no fair way to determine what resolution to scan them. I mean, even a million by a million pixels could not be a 'lossless' interpretation of a 1x1cm image, so you have to accept that any digital conversion will be lossy regardless of encoding.
At least that would be my rationale. Not that this question needed to be answered...