
Internet Archive Gets 4.5PB Data Center Upgrade

Lucas123 writes "The Internet Archive, the non-profit organization that scrapes the Web every two months in order to archive web page images, just cut the ribbon on a new 4.5 petabyte data center housed in a metal shipping container that sits outside. The data center supports the Wayback Machine, the Web site that offers the public a view of the 151 billion Web page images collected since 1997. The new data center houses 63 Sun Fire servers, each with 48 1TB hard drives running in parallel to support both the web crawling application and the 200,000 visitors to the site each day."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • by wjh31 ( 1372867 ) on Wednesday March 25, 2009 @06:50PM (#27336227) Homepage
    One would assume that something like this does regular off-site backups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?
  • by commodore64_love ( 1445365 ) on Wednesday March 25, 2009 @06:54PM (#27336277) Journal

    Are there any resources that let us see websites from 1996, '95, '94, or '93? I would love to revisit the web as it appeared when I first discovered it (1994 at psu.edu).

  • by Wingman 5 ( 551897 ) on Wednesday March 25, 2009 @07:02PM (#27336347)

    From http://www.lesk.com/mlesk/ksg97/ksg.html [lesk.com]: The 20-terabyte size of the Library of Congress is widely quoted and as far as I know is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.

    1. Thirteen million photographs, even if compressed to a 1 MB JPG each, would be 13 terabytes.
    2. The 4 million maps in the Geography Division might scan to 200 TB.
    3. LC has over five hundred thousand movies; at 1 GB each they would be 500 terabytes (most are not full-length color features).
    4. Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.

    This makes the total size of the Library perhaps about 3 petabytes (3,000 terabytes).

    So that's roughly 225 Libraries of Congress by the old 20 TB figure, or about 1.5 by the 3 PB figure.
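
    A quick back-of-the-envelope check of those numbers (plain shell arithmetic; figures in TB, as quoted above):

    $ echo $(( 13 + 200 + 500 + 2000 + 20 ))    # photos + maps + movies + audio + books
    2733
    $ echo $(( 4500 / 20 ))                     # 4.5 PB vs. the 20 TB figure
    225
    $ echo "scale=1; 4500 / 3000" | bc          # 4.5 PB vs. the ~3 PB figure
    1.5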

  • by Ungrounded Lightning ( 62228 ) on Wednesday March 25, 2009 @07:29PM (#27336571) Journal

    ... one would assume that something like this does regular off-site backups, which must add up to a hell of a lot ...

    As I recall from one of Brewster's talks: Part of the idea was that you can install redundant copies of this data center around the world and keep 'em synced.

    You can ship 4.5 petabytes over a single OC-192 (~10 Gbit/s) link in roughly a month and a half at line rate; real-world protocol overhead stretches that toward two months.
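
    A rough check of that figure, assuming about 9.6 Gbit/s of usable OC-192 payload and 4.5 x 10^15 bytes:

    $ echo "scale=1; (4.5 * 10^15 * 8) / (9.6 * 10^9) / 86400" | bc    # days at full payload rate
    43.4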

  • Re:Slight problem? (Score:5, Interesting)

    by rackserverdeals ( 1503561 ) on Wednesday March 25, 2009 @07:51PM (#27336767) Homepage Journal

    Here's a video tour [youtube.com] of one if you need it for reference.

    Don't forget to turn off the water and unplug the ethernet cables. Just be very careful with the power cords.

  • by Anonymous Coward on Wednesday March 25, 2009 @08:04PM (#27336905)

    One would assume that something like this does regular off-site backups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

    Create a snapshot of the ZFS dataset (a zpool is roughly an LVM VG, a dataset roughly an LV):
    # zfs snapshot mydata@2009-03-24

    Send snapshot to remote site:
    # zfs send mydata@2009-03-24 | ssh remote "zfs recv mydata@2009-03-24"

    Create a new snapshot the next day:
    # zfs snapshot mydata@2009-03-25

    Send only the incremental changes between the two:
    # zfs send -i mydata@2009-03-24 mydata@2009-03-25 | ssh remote "zfs recv mydata@2009-03-25"

    Now this looks a lot like rsync, but the difference is that rsync has to traverse the file system tree (directories and files), while ZFS only has to look at the 'birth time' (think ctime) of each block of data (not even the full file metadata) to see if it is newer than the first snapshot. If you're talking about tens (or hundreds) of thousands of directories, and an order of magnitude more files, that's a lot of overhead if nothing has changed. For 48 TB raw (what a Sun X4500 can hold), ZFS can determine that nothing has changed in a few minutes.

    Creating a snapshot is instantaneous and adds essentially no overhead (beyond the fact that space used by files deleted after the snapshot isn't reclaimed until the snapshot is destroyed). There are people who create one every five seconds and sync it to a remote server--so at most you'd lose five seconds' worth of data if your disk died.

    Receives are also transactional: if you start your send/recv and the transmission dies partway through, the receiving end won't be left with a partial copy of the latest snapshot--it's all or nothing of the last good change.
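
    If you wanted to run that send/recv pattern nightly (from cron, say), a minimal wrapper might look like the sketch below. The dataset name "mydata" and host "remote" are just the placeholders from the example above, and "date -d yesterday" is GNU date syntax, so adjust for Solaris/BSD:

    #!/bin/sh
    # Sketch: replicate yesterday -> today incrementally, falling back to a full send.
    set -e
    DATASET=mydata
    REMOTE=remote
    TODAY=$(date +%Y-%m-%d)
    YESTERDAY=$(date -d yesterday +%Y-%m-%d)    # GNU date; use gdate or adjust elsewhere

    # Take today's snapshot (instantaneous, as noted above).
    zfs snapshot "${DATASET}@${TODAY}"

    if zfs list -t snapshot "${DATASET}@${YESTERDAY}" >/dev/null 2>&1; then
        # Usual case: ship only the blocks changed since yesterday's snapshot.
        zfs send -i "${DATASET}@${YESTERDAY}" "${DATASET}@${TODAY}" | \
            ssh "${REMOTE}" "zfs recv ${DATASET}@${TODAY}"
    else
        # First run: ship the whole snapshot.
        zfs send "${DATASET}@${TODAY}" | ssh "${REMOTE}" "zfs recv ${DATASET}@${TODAY}"
    fi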

  • Re:"Sun Fire" (Score:5, Interesting)

    by fm6 ( 162816 ) on Wednesday March 25, 2009 @08:27PM (#27337081) Homepage Journal

    This seems to be an exact use case for the X4500-type system, which as far as I'm aware is pretty unique.

    Indeed. Sun is on a density kick. Check out the X4600, which does for processing power what the X4500 did for storage.

    In both cases, there actually are competing products that are sort of the same. The most conspicuous difference is that the Sun versions cram the whole caboodle into 4 rack units per system, about half the space required by their competitors.

    More absurdly-dense Sun products:

    http://www.sun.com/servers/x64/x4240/ [sun.com]
    http://www.sun.com/servers/x64/x4140/ [sun.com]

    The point of these systems is that they take up less of that expensive rack space than equivalent competitors. They're also "greener": if you broke all that storage and computing power down into less dense systems, you'd need a lot more electricity to run them and keep them cool. That not only saves money, it lets the owner claim they're working on their carbon footprint.

  • by Xtravar ( 725372 ) on Wednesday March 25, 2009 @09:15PM (#27337471) Homepage Journal

    The CDs are already in digital format, so compressing them lossily would be a cardinal sin.

    The photos, movies, and maps are in analog format to start with, so we don't feel so bad using lossy compression. Image files are really big. I think the 1GB estimate per movie is pretty good, considering shorts, black and white, and the standard (or lower) definition of most of them. That would allow for a very high detail scan of the movie in something like MPEG4.
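
    As a rough sanity check on that 1GB figure: assuming 1 GB is about 10^9 bytes and a 90-minute running time, it works out to roughly 1.5 Mbit/s, a workable standard-definition MPEG4 bitrate:

    $ echo "scale=2; (8 * 10^9) / (90 * 60) / 10^6" | bc    # average Mbit/s for 1 GB over 90 minutes
    1.48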

    And, since they started out in analog formats, there's no fair way to determine what resolution to scan them at. I mean, even a million by a million pixels could not be a 'lossless' interpretation of a 1x1 cm image, so you have to accept that any digital conversion will be lossy regardless of encoding.

    At least that would be my rationale. Not that this question needed to be answered...

  • by notthepainter ( 759494 ) <oblique&alum,mit,edu> on Wednesday March 25, 2009 @09:59PM (#27337761) Homepage
    Sadly, even modern-day archives get wrecked. See http://www.spiegel.de/international/germany/0,1518,611311,00.html [spiegel.de]
  • by Omniscient Lurker ( 1504701 ) on Thursday March 26, 2009 @12:00AM (#27338307)
    Instead of writing in binary, you could write the data in a base-36 format and then convert back to binary. The stone cutters could then store more data per glyph, increasing their write rate considerably (and decreasing their read rate) by amounts I am unwilling to calculate.
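
    (For the record: each base-36 glyph carries log2(36) bits, so relative to one-bit binary marks the cutters would fit roughly five times as much data per glyph.)

    $ echo "l(36)/l(2)" | bc -l    # log2(36), i.e. ~5.17 bits per base-36 glyph vs. 1 bit per binary mark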
