
Internet Archive Gets 4.5PB Data Center Upgrade

Lucas123 writes "The Internet Archive, the non-profit organization that scrapes the Web every two months in order to archive web page images, just cut the ribbon on a new 4.5 petabyte data center housed in a metal shipping container that sits outside. The data center supports the Wayback Machine, the Web site that offers the public a view of the 151 billion Web page images collected since 1997. The new data center houses 63 Sun Fire servers, each with 48 1TB hard drives running in parallel to support both the web crawling application and the 200,000 visitors to the site each day."
  • by wjh31 ( 1372867 ) on Wednesday March 25, 2009 @06:50PM (#27336227) Homepage
    One would assume that something like this does regular off-site back-ups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?
    • by fuzzyfuzzyfungus ( 1223518 ) on Wednesday March 25, 2009 @06:56PM (#27336293) Journal
      TFA indicates that they have a mirror at the library of Alexandria. Unless things have changed since last I read about them, the mirroring is pretty much it. The Internet Archive does very impressive work; but they don't have that much money. No Real Big Serious Enterprise tape silos here.
    • by LiquidCoooled ( 634315 ) on Wednesday March 25, 2009 @06:57PM (#27336301) Homepage Journal

      One would assume that something like this does regular off-site back-ups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

      floppy disks.
      lots of floppy disks.

      • Re: (Score:2, Funny)

        by Bearhouse ( 1034238 )

        Not reliable enough.
        I suggest that this important resource be backed up to punched cards.
        This would also enable handy comparisons in units that us oldies understand, such as ELOCs
        (Equivalent Library of Congress).
        I'd calculate it myself, but seem to have mislaid my slide rule...

        • Re: (Score:3, Funny)

          I'd suggest also using stone slabs. Water can do serious damage to paper, and don't get me started on fire hazards. Good old Stone Slabs resist both of those really well. I'm not sure what the write speed is, however, so you'll probably need to hire many stonecutters to work in parallel.

          • by jd ( 1658 )

            When you get right down to it, any hard-coded data on silicon is just data on a stone slab. Since you can compile SystemC into a hardware spec, you can write stone slabs as fast as you can generate C.

          • by zach297 ( 1426339 ) on Wednesday March 25, 2009 @11:10PM (#27338131)

            I'd suggest also using stone slabs. Water can do serious damage to paper, and don't get me started on fire hazards. Good old Stone Slabs resist both of those really well. I'm not sure what the write speed is, however, so you'll probably need to hire many stonecutters to work in parallel.

            A math problem. My favorite. I don't know much about stone cutters, but let's assume they can write one bit every 2 seconds. That's 1 byte in 16 seconds. The Internet Archive is (4.5 x 1,125,899,906,842,624) 5,066,549,580,791,808 (5 quadrillion) bytes. That works out to 81,064,793,292,668,928 (81 quadrillion) seconds or about 2,570,547,732 (2.5 billion) years. That is far too long for their stringent 2-month backup cycle. They would need 15,423,286,395 (15.4 billion) stone cutters to keep to schedule, assuming they had unlimited stone. Last time I checked there were only between 6 and 7 billion people, with only a small fraction of them being stone cutters. That leaves but one solution: force the web developers to become stone cutters. This would not only increase the workforce but also reduce the amount of data to back up, since fewer people would be left making new web pages.
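
            (For anyone who wants to check the arithmetic, a quick Python sketch using the same assumptions as above: 2 seconds per bit, 4.5 PiB of data, and a roughly two-month backup window.)

            # Back-of-the-envelope check of the stone-cutter backup estimate above.
            # Assumptions are the parent's, not measurements: 1 bit carved every
            # 2 seconds, 4.5 PiB of data, a ~2-month backup window.
            archive_bytes = 4.5 * 2**50                      # 4.5 pebibytes
            seconds_per_byte = 2 * 8                         # 2 s per bit, 8 bits per byte
            total_seconds = archive_bytes * seconds_per_byte
            years = total_seconds / (365.25 * 24 * 3600)
            window = 2 * 30 * 24 * 3600                      # roughly two months, in seconds
            cutters = total_seconds / window
            print(f"{total_seconds:.3e} s, about {years:.2e} years, {cutters:.2e} cutters")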

        • by bitrex ( 859228 )
          What you've got to do is you've got to make punch-card copies of all your data. Then you're gonna take those punch cards, and you're going to put them on a wooden table. Then you're going to take digital photographs of them, email them to your backup site in the Netherlands, where they'll get the photographs printed and stack them in a vacuum sealed airtight length of 24" PVC sewer pipe stored on the top floor of a windmill to avoid flood damage. That is what we call mission critical backup procedure, my f
      • Re: (Score:3, Funny)

        Comment removed based on user account deletion
    • by MichaelSmith ( 789609 ) on Wednesday March 25, 2009 @07:01PM (#27336335) Homepage Journal
      It's like the two USB hard disks I use for backups. Pick up the container and swap it with the container from secure storage.
    • by MrEricSir ( 398214 ) on Wednesday March 25, 2009 @07:03PM (#27336355) Homepage

      It's simple, the backups are compressed -- they simply remove all those useless zeroes from the binary data.

      • Re: (Score:2, Funny)

        It's simple, the backups are compressed -- they simply remove all those useless zeroes from the binary data.

        In music today, there is a so-called 'loudness war' and I think I've discovered what it is: they're removing the zeroes, thinking that 'all ones' will make the music even louder!

        I wonder if it's reversible? Where do the zeroes go? Can they be unzeroed? We should try to find them.

      • Followed by run length encoding of the remaining ones.

      • Re: (Score:3, Funny)

        by Samah ( 729132 )

        It's simple, the backups are compressed -- they simply remove all those useless zeroes from the binary data.

        Compressed with XML! Because XML makes everything better... right?
        Right?

    • [subject correction]
      PB, not TB... hehe.
    • It's 4.5PB, which is a whole different thing, and TFA says it's mirrored at the library of Alexandria, Egypt. I guess that counts as off-site :)
    • by CannonballHead ( 842625 ) on Wednesday March 25, 2009 @07:09PM (#27336401)

      The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.

      • by Anonymous Coward on Wednesday March 25, 2009 @07:46PM (#27336719)

        Egypt could be a good choice. The area is fairly famous for reliable persistent storage. From papyrus scrolls to stone engravings, things tend to keep there better than most places. There really aren't many other geographical areas on earth that can claim the same kind of data retention rates over the time periods they've dealt with. Though despite their impeccable track record with avoiding hardware failures, they've done significantly worse when it comes to data loss due to theft and/or hackers/pirates.

        The one curious part about that choice is that the library at Alexandria is the one notable case where mass amounts of data were irreparably lost. So it's odd that they'd choose to entrust their data to that specific institution. Perhaps they felt that since it's under new management, the previous problems will have been resolved.

        However, had the choice been mine, I would have chosen to store my offsite data in Luxor. Its data retention was quite good, and included one data store that was preserved in its entirety for over 3000 years. As an added benefit, it seems that they've opened a second location [luxor.com] that's significantly more convenient for the IA, since there's no overseas transmission to worry about.

        • by Darkk ( 1296127 )

          I oughta start a new company called:

          Off-Planet backups where I'd use the moon to store your precious data!

          Only three things I'd have to worry about would be:

          1) Aliens (if they are out there)
          2) Meteors
          3) Solar flares

          Other than that, it sounds like a pretty solid plan to me!

    • by Ungrounded Lightning ( 62228 ) on Wednesday March 25, 2009 @07:29PM (#27336571) Journal

      ... One would assume that something like this does regular off-site back-ups, which must add up to a hell of a lot ...

      As I recall from one of Brewster's talks: Part of the idea was that you can install redundant copies of this data center around the world and keep 'em synced.

      You can ship 4.5 petabytes over a single OC-192 link in about 71 days.
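
      (A rough Python check, assuming the full nominal OC-192 line rate of about 9.95 Gbit/s with no protocol overhead; real-world throughput would be lower, which is presumably where a figure like 71 days comes from.)

      # Rough time to push 4.5 PB over a single OC-192 link at nominal line rate.
      archive_bits = 4.5e15 * 8                 # 4.5 petabytes (decimal) in bits
      oc192_bps = 9.953e9                       # nominal OC-192 line rate, bits/s
      days = archive_bits / oc192_bps / 86400
      print(f"about {days:.0f} days at full line rate")   # ~42 days; overhead adds more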

    • one would assume that something like this does regular off-site back-ups

      there are BIG fat cables you connect, wait 3 seconds, then do a massively parallel 'dd if= ...'

    • Re: (Score:2, Insightful)

      by pedrop357 ( 681672 )

      4.5TB isn't that bad. Heck, we have 1TB tapes right now. 5 of them can be carried in a small bag.

      It's the 4.5PB that the Internet Archive could use that's hard to store offsite. 4500 1TB tapes can be pretty unruly.

    • Re: (Score:3, Interesting)

      by Anonymous Coward

      One would assume that something like this does regular off-site back-ups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

      Create snapshot of zpool (think LVM VG):
      # zfs snapshot mydata@2009-03-24

      Send snapshot to remote site:
      # zfs send mydata@2009-03-24 | ssh remote "zfs recv mydata@2009-03-24"

      Create a new snapshot the next day:
      # zfs snapshot mydata@2009-03-25

      Send only the incremental changes between the two:
      # zfs send -i mydata@2009-03-24 mydata@2009-03-25 | ssh remote "zfs recv mydata@2009-03-25"

      Now this looks a lot like rsync, but the difference is that rsync has to traverse the whole file system tree (directories and files), while zfs send works from the snapshot metadata and already knows which blocks changed, so it never has to scan the tree.

    • One would assume that something like this does regular off-site back-ups, which must add up to a hell of a lot. Could someone with experience in such matters shed a little insight into the logistics of backing up such a vast system?

      Dude, the Internet Archive IS the offsite backup.

      At least mine anyway. Tape drives be damned.

  • by Dr_Banzai ( 111657 ) on Wednesday March 25, 2009 @06:51PM (#27336233) Homepage
    I have no idea how much 4.5 PB is until it's given in units of Libraries of Congress.
    • by Anonymous Coward on Wednesday March 25, 2009 @07:01PM (#27336339)
      1 library of congress = 10 terabytes [wikipedia.org]
      4.5 petabytes = 4608 terabytes [google.com]
      So, that's 460.8 LOCs.
    • by Wingman 5 ( 551897 ) on Wednesday March 25, 2009 @07:02PM (#27336347)

      from http://www.lesk.com/mlesk/ksg97/ksg.html [lesk.com] The 20-terabyte size of the Library of Congress is widely quoted and as far as I know is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.

      1. Thirteen million photographs, even if compressed to a 1 MB JPG each, would be 13 terabytes.
      2. The 4 million maps in the Geography Division might scan to 200 TB.
      3. LC has over five hundred thousand movies; at 1 GB each they would be 500 terabytes (most are not full-length color features).
      4. Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.

      This makes the total size of the Library perhaps about 3 petabytes (3,000 terabytes).

      so 230 libraries by the old standard or 1.5 by the new standard
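
      (The same tally in Python; all per-item sizes are the rough figures quoted above, with sound recordings taken at about 650 MB per CD.)

      # Tallying the Library of Congress estimate quoted above (sizes in TB, all approximate).
      books  = 20_000_000 * 1e-6     # 20M books at ~1 MB each           = 20 TB
      photos = 13_000_000 * 1e-6     # 13M photos at ~1 MB JPEG each     = 13 TB
      maps   = 200.0                 # 4M maps, quoted as ~200 TB scanned
      movies = 500_000 * 1e-3        # 500k movies at ~1 GB each         = 500 TB
      audio  = 3_500_000 * 650e-6    # 3.5M recordings at ~650 MB per CD ≈ 2275 TB
      total_tb = books + photos + maps + movies + audio
      print(total_tb, "TB, i.e. about", round(total_tb / 1000), "PB")      # ~3 PB
      print("Internet Archive in these units:", round(4608 / total_tb, 1)) # ~1.5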

      • Re: (Score:2, Insightful)

        by dln385 ( 1451209 )

        from http://www.lesk.com/mlesk/ksg97/ksg.html [lesk.com] The 20-terabyte size of the Library of Congress is widely quoted and as far as I know is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.

        1. Thirteen million photographs, even if compressed to a 1 MB JPG each, would be 13 terabytes. 2. The 4 million maps in the Geography Division might scan to 200 TB. 3. LC has over five hundred thousand movies; at 1 GB each they would be 500 terabytes (most are not full-length color features). 4. Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.

        This makes the total size of the Library perhaps about 3 petabytes (3,000 terabytes).

        so 230 libraries by the old standard or 1.5 by the new standard

        Compress each audio file to a 5 MB MP3. That's 17.5 TB. Total size would be 750 terabytes.

        So the data would be 6 LOC.

      • Re: (Score:3, Insightful)

        by merreborn ( 853723 )

        Bulkiest might be the 3.5 million sound recordings, which at one audio CD each, would be almost 2,000 TB.

        You compressed the video, and the photographs, but not the audio? And why do you need a full CD for every sound recording? Surely many of them are far shorter than a full CD?

        • Re: (Score:3, Interesting)

          by Xtravar ( 725372 )

          The CDs are already in digital format, so compressing them is a cardinal sin.

          The photos, movies, and maps are in analog format to start with, so we don't feel so bad using lossy compression. Image files are really big. I think the 1GB estimate per movie is pretty good, considering shorts, black and white, and the standard (or lower) definition of most of them. That would allow for a very high detail scan of the movie in something like MPEG4.

          And, since they started in analog formats, there's no fair way t

    • by commodore64_love ( 1445365 ) on Wednesday March 25, 2009 @07:08PM (#27336399) Journal

      83 terabyte in the LOC, so 4.5 petabytes == 54 Libraries of Congress

      4.5 petabytes == 4500 terabyte hard drives, times $75 each == ~$340,000 == how much taxpayers spend, each hour, to maintain the LOC

    • Bah, LOC is outdated. 4.5PB = 1 Shipping Container
      • by v1 ( 525388 )

        a new 4.5 petabyte data center

        4.5 PB? Is that the best you can do? sheesh, amateurs....

        Though it also did surprise me they only get 200,000 hits/day. I expected the WayBack Machine [archive.org] to get a lot more traffic than that.

        • I think that's 200K unique visitors. According to alexa, archive.org is the 386th most visited site on the internet last week, which is not to be sneezed at.
          • "According to alexa, archive.org is the 386th most visited site on the internet"

            386? I would think that with all those Sun boxes they already were 64 bits at least!

    • Riiiight... because you happen to have a really really good mental image of exactly how many rooms/shelves/books/pages are stored in the Library of Congress!

      (Which incidentally doesn't happen to be static, BTW; yo momma's LoC ain't the same size as my LoC.)

    • 1 Library Of Congress = 1/200 inch wide cube approx [zyvex.com].

      Feynman estimated the LOC at about 1 petabit, which would make the Internet Archive containing roughly 36 petabits a cube on the order of 1/50 inch wide.

      So it should fit in your pocket.
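
      (Scaling that in Python, taking the quoted 1-petabit, 1/200-inch cube at face value and assuming volume scales linearly with bits; it comes out nearer 1/60 of an inch, the same ballpark.)

      # Scale the 1/200-inch, 1-petabit LoC cube quoted above to the Internet Archive.
      loc_side = 1 / 200                     # inches, per the parent's figure
      archive_petabits = 4.5 * 8             # 4.5 PB = 36 petabits
      # Volume scales with the number of bits, so side length scales with the cube root.
      side = loc_side * archive_petabits ** (1 / 3)
      print(f"about 1/{1 / side:.0f} of an inch on a side")   # roughly 1/60 inch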

  • by jacksinn ( 1136829 ) on Wednesday March 25, 2009 @06:52PM (#27336241)
    Does lusting after all their space make me a peta-phile?
  • by Anonymous Coward on Wednesday March 25, 2009 @06:52PM (#27336243)

    so all one needs to do to "own the internet" is to drive a big rig and ... lift the container off their parking lot?

  • by girlintraining ( 1395911 ) on Wednesday March 25, 2009 @06:52PM (#27336249)

    I can now theoretically steal "the internet" with a flatbed truck and a lift. There's something to be said for conventional data centers: They're rather hard to load onto a truck and drive off with.

  • Well I hope it is bolted down.
  • Yes, "thumper" refers to the rabbit. I have a Sun Managed Storage slide somewhere about how data tends to, er, multiply...

    --dave

  • by commodore64_love ( 1445365 ) on Wednesday March 25, 2009 @06:54PM (#27336277) Journal

    Are there any resources that let us see websites from 1996, 95, 94, or 93? I would love to revisit the web as it appeared when I first discovered it (1994 at psu.edu).

    • by Tumbleweed ( 3706 ) * on Wednesday March 25, 2009 @07:13PM (#27336431)

      I would love to revisit the web as it appeared when I first discovered it (1994 at psu.edu).

      No, you wouldn't.

    • Re: (Score:2, Funny)

      by Matheus ( 586080 )

      The entire internet prior to 1996 is archived on an old PC that I'm currently trying to get the 5GB disk restored on.. why I've kept all that old porn for so long completely escapes me tho. :)

      • Re: (Score:2, Informative)

        Because after 1996 women shaved all their hair off due to a mistaken belief that men prefer their women to look like little girls. We don't, we like the big bushes, and that is why you must save that porn for the good of mankind.
  • Unfortunately the Wayback Machine will still be slower than hell. :p
    • by jd ( 1658 )

      Fortunately, Hell has now been upgraded to 2 mb/s, thanks to British Telecom.

      • by kelnos ( 564113 )
        Two millibits per second? And that's an upgrade? Ouch.
        • by jd ( 1658 )

          One method of suspend-to-disk is to do a freeze/thaw. It has taken Hell over 16 billion years to do just the freeze. Two millibits per second should be able to do both in less than half the time.

  • In Other News (Score:5, Informative)

    by Erik Fish ( 106896 ) on Wednesday March 25, 2009 @07:09PM (#27336407) Journal

    Incidentally: FileFront [filefront.com] is closing in five days, taking with it any files that aren't hosted elsewhere.

    I am told that many of the Half-Life mods [filefront.com] hosted there are not available anywhere else, so get while the getting is good...

  • by Ungrounded Lightning ( 62228 ) on Wednesday March 25, 2009 @07:15PM (#27336449) Journal

    ... of a 4.5 petabyte datacenter in a shipping container in transit.

    • by giblfiz ( 125533 )

      Government economic stimulus: Treating a patient for anemia with an iron supplement made from his own extracted blood.

      I can't resist replying to your Sig...
      It's like treating a patient for anemia with iron supplements made from his own extracted blood from the future. We are taking on debt, not trying to push through a one-year balanced budget. I'm not sure it's a good idea, but it's a much better one than what you're describing.

      • It's like treating a patient for anemia with iron supplements made from his own extracted blood from the future.

        Unfortunately, when you finance with debt on an economy-wide basis you pay double - or more. There's the return payment. (Plus the interest - which is the "more".) But there's also the cost to the economy of whatever WOULD have been done with the "borrowed" resources but now is not done because the resources were diverted.

        When they talk of how many jobs were created by the stimulus, ask how man

  • 63 x 48 = 3024Tb (Score:3, Insightful)

    by eotwawki ( 1515827 ) on Wednesday March 25, 2009 @07:21PM (#27336501)
    So where does the 4.5PB come into this?
    • The article doesn't make it clear so I can only guess that the missing storage capacity is part of some SAN. Maybe the 48 1TB hard drives are only local storage (obviously) but are in addition to some existing SAN that they didn't mention in this particular article. Either that or the article is just wrong about the 4.5PB database.
    • Re:63 x 48 = 3024Tb (Score:5, Informative)

      by spinkham ( 56603 ) on Wednesday March 25, 2009 @07:55PM (#27336809)

      TFA says "...eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives." (emphasis mine)

      I would guess this means there's a x4500 with 24TB in local disks, and 48TB in attached storage per machine. (24+48)*63 does give us the quoted number

      • I don't think that's right. Sun's site has a video tour of it. Haven't finished it yet but it's here [sun.com].

        • The new datacenter is only 3PB. I guess the total storage, with the old data centers, is 4.5 PB.

          So 48x63 gives you 3PB of raw storage. I'm guessing they're using less because I can't imagine them running it in RAID 0.

    • by pwnies ( 1034518 ) *
      Actually it's a bit less than that even. The Sun Fire servers they're using, or "thumpers" as they're nicknamed, generally use ZFS to store their data. However, the default configuration of these systems is to use a raidz config for the drives (think RAID 5). Essentially, the configuration uses 8 six-disk raidz configs, all aggregated together into one giant pool. The reason why it's less than what they state here is that one disk from each of those eight raidzs is a parity disk. That drops the theoretical sto
      • Re: (Score:3, Informative)

        Sun has more information and an Interactive tour [sun.com] of the Internet Archive modular data center on their site.

        The total raw capacity of the container is 3 petabytes. In reality it's going to be less than that. First, 2 disks are likely to be set up in a mirrored pool for the system disks. I believe the root pool only supports mirrors, not raidz. Not sure if this has changed.

        That leaves you with 46 disks for data. Maybe they partitioned part of the root pool to include in the data pools, not sure, but zfs works
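
        (A rough Python tally of the capacity under the layout described in the two comments above; the 8-by-6 raidz arrangement is as described there, not a confirmed spec, and a mirrored root pool would shave a little more off.)

        # Rough capacity tally for the layout described in the comments above.
        servers, disks_per_server, disk_tb = 63, 48, 1
        raw_tb = servers * disks_per_server * disk_tb          # 3024 TB raw, about 3 PB

        # Per the comments: 8 raidz vdevs of 6 disks each, one parity disk per vdev.
        vdevs, disks_per_vdev, parity = 8, 6, 1
        data_disks = vdevs * (disks_per_vdev - parity)         # 40 data disks per server
        usable_tb = servers * data_disks * disk_tb
        print(raw_tb, "TB raw,", usable_tb, "TB usable under that raidz layout")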

  • Math (Score:3, Informative)

    by PowerKe ( 641836 ) on Wednesday March 25, 2009 @07:29PM (#27336579)
    63 servers * 48 disks of 1 TB = 3024 TB. According to the announcement [archive.org] on archive.org, 3 petabytes would be right.
  • "Sun Fire" (Score:4, Informative)

    by fm6 ( 162816 ) on Wednesday March 25, 2009 @07:37PM (#27336647) Homepage Journal

    The new data center houses 63 Sun Fire servers

    That's not very specific. "Sun Fire" is a brand that for a while got applied to all of Sun's rack-mount servers (except for NEBS-compliant servers, which were and are called "Sun Netra"). A little confusing, of course, which is why they've started calling new SPARC boxes "Sun SPARC Enterprise" to differentiate them from those mangy x64 "Sun Fire" systems. Except that there are still SPARC systems called "Sun Fire", so I guess the confusion factor didn't get any better...

    Anyway, the specific server being used here is the Sun Fire X4500 [sun.com], a system with no less than 48 1 TB disks in a 4U space. Notice that this model is EOLed; presumably iarchive got a deal on some remaindered machines.

    The shipping container is something we've seen before [slashdot.org].

    • Re: (Score:3, Informative)

      by ximenes ( 10 )

      Since they're using one of Sun's modular datacenters that is actually on the Sun campus, I would imagine that they got some financial incentives / support from Sun for all of this.

      The X4500 is EOL as you mention, although it was still sold a few months back. It lives on as the X4540, which really isn't that different; the main thing is it's moved to a newer Opteron processor type and is a fair bit cheaper. So they didn't really miss out on anything.

      It's kind of interesting to me that they went this route, a

      • Re:"Sun Fire" (Score:5, Interesting)

        by fm6 ( 162816 ) on Wednesday March 25, 2009 @08:27PM (#27337081) Homepage Journal

        This seems to be an exact use case for the X4500-type system, which as far as I'm aware is pretty unique.

        Indeed. Sun is on a density kick. Check out the X4600, which does for processing power what the X4500 did for storage.

        In both cases, there actually are competing products that are sort of the same. The most conspicuous difference is that the Sun versions cram the whole caboodle into 4 rack units per system, about half the space required by their competitors.

        More absurdly-dense Sun products:

        http://www.sun.com/servers/x64/x4240/ [sun.com]
        http://www.sun.com/servers/x64/x4140/ [sun.com]

        The point of these systems is that they take up less expensive rack space than equivalent competitors. They're also "greener": if you broke all that storage and computing power down into less dense systems, you'd need a lot more electricity to run them and keep them cool. That not only saves money, it gives the owner the ability to claim they're working on the carbon footprint.

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Anyway, the specific server being used here is the Sun Fire X4500 [sun.com], a system with no less than 48 1 TB disks in a 4U space. Notice that this model is EOLed; presumably iarchive got a deal on some remaindered machines.

      There are newer X4540s which are mostly the same, but have newer CPUs, and can hold more memory (16 -> 64 GB).

  • They cut the ribbon? How are they supposed to access that much data unless they buy a new one?
    • by jd ( 1658 )

      Easy. Ribbon's only good for short-distance parallel links. If they've got backups in Egypt, they must be using serial cables.

  • From TFA (yeah, I know):

    a Web site that gets about 200,000 visitors a day or about 500 hits per second on the 4.5 petabyte database.

    So they get all 200,000 hits in a 7-minute window? I picture a sysadmin going insane for a few moments then napping in a hammock for the rest of the day.
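
    (The article's two figures don't actually fit together; a two-line Python check:)

    # The article's own numbers: 200,000 visitors/day vs. "500 hits per second".
    visits_per_day = 200_000
    print(visits_per_day / 86400)      # ~2.3 per second if spread over a full day
    print(visits_per_day / 500 / 60)   # ~6.7 minutes if they really arrived at 500/s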

  • Inquiring minds want to know.
