Info Glut - Five Exabytes of Data Created in 2002

securitas writes "If you had any doubts that you are overwhelmed by the volume of information in your life, a new Berkeley study (PDF) shows that five exabytes of data were created in 2002, twice the 1999 total. That's five million terabytes of data, or 500,000 Libraries of Congress, which works out to about 800 MB of data for each of the 6.3 billion people on the planet. Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future. The study was conducted by University of California-Berkeley's School of Information Management and Systems professors Peter Lyman and Hal Varian. More at CNet, Infoworld, ByteAndSwitch and The Register."
  • by XNuke ( 5231 ) on Wednesday October 29, 2003 @02:39PM (#7340329)
    It looks like they are counting every tiny email about "going to lunch". Lots of DATA, little INFORMATION.
    • by uberdave ( 526529 ) on Wednesday October 29, 2003 @02:46PM (#7340427) Homepage
      I wonder how much of that was duplicate data. How many copies of the Matrix are floating around online? Did they count FTP mirror sites as separate data?

      For that matter, how much of the data is real, and how much is virtual? If two sites point to the same download, is that data counted twice, or once?
      • by Tenebrious1 ( 530949 ) on Wednesday October 29, 2003 @02:57PM (#7340544) Homepage
        I wonder how much of that was duplicate data. How many copies of the Matrix are floating around online? Did they count FTP mirror sites as separate data?

        The blurb said 92% was stored on magnetic media; curious about the rest, I glanced through the article. Surprisingly, a large part, 7%, is FILM! The reason film comprises such a large percentage is that each film reel is duplicated thousands of times to be sent to theaters around the world.

        So if they're counting duplicates in film, I'd guess they'd count duplicates in magnetic media.

      • Duplicate data is a _good thing_. It saves your ass when the unthinkable happens, anything from the dog eating your CD-Rs to a plane hitting... Oh well, you get the idea.
        Trust me, the nicest thing about stored data is its own copy safely guarded somewhere else, at least 10 km away, and so on.
      • by kfg ( 145172 ) on Wednesday October 29, 2003 @03:15PM (#7340723)
        "I wonder how much of that was duplicate data."

        3% was [AOL] Me Too! [/AOL] posts.

        1% was In Soviet Russia jokes.

        0.5% Profit!!!

        So I guess there was a fair amount of duplication.

        KFG
      • "I wonder how much of that was duplicate data. How many copies of the Matrix are floating around online? Did they count FTP mirror sites as separate data?"

        Not to mention all the websites that contain nothing but keywords aimed at gaming Google, plus maybe links to OTHER empty pages by the same author/group/company!!
      • What if you take a page with text and scan it? It can take anywhere between 30 and 1000 KB. The same text can be written in a text editor in 5-6 KB, or in MS Word in 60 KB.
        Two years back, CD-Rs were the in thing. Everyone and anyone was storing data on them. Since a disc only held 700 MB, files were generally kept small and compressed. Now that faster broadband connections and DVD recorders (along with faster processors) are becoming common, people don't care so much about file sizes.

        Regarding duplicate data- ask five people
    • by tachin ( 590622 ) on Wednesday October 29, 2003 @02:49PM (#7340460)
      Lots of DATA, little INFORMATION.
      From data you can extract "information": take a lot of those "going to lunch" mails and you can see which groups of people lunch together, and at what time....
    • Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future

      I don't really think historians and archaeologists are ever going to be able to dig through Five Exabytes of Data. Maybe the magnetic storage is a blessing then...

      • historians and archaeologists are ever going to be able to dig

        A. They'll use machines to do the heavy digging.

        B. Or, the historians and archaeologists will be machines.

        A big problem will be that those 5 EB of data describing 5 years near Y2K will be dissolved in a much larger ocean of data by that time.

    • It's actually all my fault...

      I left this script running on the unix farm which did the following on each box

      while true; do
          rm -f filename                     # -f: don't error on the first pass
          echo "Who's the Daddy" > filename
      done

      It's a big farm, and it's been running all year. The net result is about 100 KB of files on the farm total... but terabytes during the year.

      In other words, what I mean is...

      How much of this "created" information was transient?

  • by Matey-O ( 518004 ) * <michaeljohnmiller@mSPAMsSPAMnSPAM.com> on Wednesday October 29, 2003 @02:39PM (#7340331) Homepage Journal

    That's a believable number. Consider the amount of published data on Kazaa, or that 45 minutes of raw DV video is roughly 12.5 GB*. Move 100 of your CDs to MP3s and you're consuming/creating roughly 3.5 GB* (or more if you're using higher than 128 kbps MP3s). And I'm not even commenting on pr0n.

    (*I said roughly...comment on the comment, not the mathematical precision of the statement.)

    • Actually I think it probably undershoots the mark...

      By the article: The researchers relied on existing data such as ISBN numbers to count books and journals, as well as industry reports about data handled by enterprise servers for things such as supermarket sales and airline bookings. They performed surveys to estimate how much unique information exists on each type of hard drive.

      I don't think they attempted to collect information on more ephemeral data... For example, artists that go through many ver
  • Yeah... (Score:5, Funny)

    by the_mad_poster ( 640772 ) <shattoc@adelphia.com> on Wednesday October 29, 2003 @02:39PM (#7340332) Homepage Journal
    ...and most of it is still sitting in my Inbox at work right now.
    • Of that, how many PFUs [slashdot.org] of spam does it contain?
    • by twitter ( 104583 )
      I've got more than my share of data, enough to discard the 800MB or so that AOL likes to mail me. 800MB/person is not shocking when I think of all the CDs I've stumbled across in the field - literally grass fields in the middle of nowhere.

      It's a joke..

  • Damn (Score:2, Funny)

    by Judg3 ( 88435 )
    That's a lot of porn. Though I think their stats are off a bit, as I have 800 GB of porn, not MB. Oh well, better luck next year!
    • Re:Damn (Score:3, Funny)

      by Carnildo ( 712617 )
      You've got a thousand times your allotment of porn! Think of all the poor people in Africa who you are depriving of their annual allowance!
  • by BWJones ( 18351 ) on Wednesday October 29, 2003 @02:39PM (#7340345) Homepage Journal
    a new Berekley study (PDF) shows that five exabytes of data were created in 2002,

    Shoot, it felt like my doctoral dissertation was responsible for at least 2 of those 5 exabytes. :-)

  • by SirJaxalot ( 715418 ) on Wednesday October 29, 2003 @02:40PM (#7340347)
    here is the article [nwsource.com]
    • by Vaevictis666 ( 680137 ) on Wednesday October 29, 2003 @02:45PM (#7340407)
      Your article [nwsource.com] states:

      They found that new information flowing across televisions, radios, telephones, Web sites and the Internet had increased by 3 1/2 times to a total of 18 exabytes as of 2002. The amount of new but stored (non-transmitted) information in 2002 was determined to be about five exabytes.

      This jibes with the other articles: 5 exabytes of generated content, 18 exabytes of transmitted content - still one heck of a lot of bits floating around :)

  • Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future.

    Well, why don't they just print it? Sheesh...
    • Re:No problem here. (Score:5, Interesting)

      by GaelenBurns ( 716462 ) <gaelenb&assurancetechnologies,com> on Wednesday October 29, 2003 @02:58PM (#7340553) Homepage Journal
      I wonder how many pages of paper an exabyte of data would take up? We're talking about gigantic masses here. Why not figure it out? I'm guessing, based on character counts from Open Office, that you can get about 2 kB of data on a single sheet. That's 4 kB if you use both sides. And you get around 125 sheets per pound... So, based on some guesses, it looks like it would take 2,251,799,813,685 pounds of paper to print one exabyte of this data. For all 5 exabytes, we're looking at a weight 122 times that of the Great Pyramid. Not as much as I'd suspected... but still fun!
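
      To sanity-check that pounds figure: it works out exactly if you assume a binary exabyte (2^60 bytes) along with the 4 kB-per-sheet and 125-sheets-per-pound guesses above. A quick Python sketch:

      # Back-of-the-envelope check of the paper-weight guess above.
      BYTES_PER_EXABYTE = 2 ** 60      # assuming a binary exabyte
      BYTES_PER_SHEET = 4 * 1024       # ~2 kB per side, printed both sides
      SHEETS_PER_POUND = 125
      pounds = BYTES_PER_EXABYTE / BYTES_PER_SHEET / SHEETS_PER_POUND
      print(f"{pounds:,.0f}")          # 2,251,799,813,685 pounds per exabyte
      print(f"{pounds * 5:,.0f}")      # ~11.3 trillion pounds for all five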
      • Do the evolution (Score:2, Interesting)

        by FrankoBoy ( 677614 )
        So this means 1.126 gigatons of paper. According to this research paper [thebulletin.org], the world's major nuclear arsenals are equal to about 5 gigatons of TNT.

        Now, here's a little math for you :
        • Print every single bit of information the whole world produced last year.
        • Copy all of the output four times.
        • Replace all this paper by TNT...

        ...and the result, my friends, is the perfect recipe for global annihilation. Conventional weapons sold separately.

      • by indianajones428 ( 644219 ) on Wednesday October 29, 2003 @04:29PM (#7341497)

        So 122 Great Pyramids = 500,000 Libraries of Congress?

        Great, another conversion factor to remember...

  • Huzzah! (Score:4, Interesting)

    by GaelenBurns ( 716462 ) <gaelenb&assurancetechnologies,com> on Wednesday October 29, 2003 @02:42PM (#7340370) Homepage Journal
    Hooray for exponential curves! It is daunting, though. As an illustration of this, I read that the White House has already turned over 2 million pages of documents relating to 9/11 to the independent investigation panel.
    • I read that the White House has already turned over 2 million pages of documents relating to 9/11 to the independent investigation panel

      Security by obfuscation?
  • How about temporary and ephemeral data, like SSH keys and data passed through X11, used for short point-to-point transfers? It might be just me, but if this doesn't take that data into account, the total could be much higher...
  • As I just received another couple of letters asking for assistance from the war-torn regions of Africa, how much of this is spam and related garbage?

    Oddly enough, the most useful information is often the most concise. Duck!

  • Hmmmmm.... I think I might know where all that 'new data' came from. [sitefinder.com]

  • quote (Score:5, Interesting)

    by CGP314 ( 672613 ) <CGP@NOSpAM.ColinGregoryPalmer.net> on Wednesday October 29, 2003 @02:46PM (#7340425) Homepage
    All of the books in the world contain no more information than is broadcast as video in a single large American city in a single year. Not all bits have equal value. --Carl Sagan
    • Comment removed based on user account deletion
      • I wonder how many words a motion picture is worth?

        Looks like 599, assuming said motion picture is a complete rotting turd. Thanks for gems like this one, MPAA!

        Review: 'Gigli' is really, really bad

        It's better than 'Swept Away,' for what it's worth.

        By Paul Clinton

        CNN Reviewer

        Saturday, August 2, 2003 Posted: 12:13 AM EDT (0413 GMT)

        OK, so "Gigli" is not the worst film in years. That dubious title still goes to "Swept Away," or maybe "Freddy Got Fingered." But "Gigli" is still a huge waste of celluloid.

  • From the article, Varian (an economist) states:
    ``We're producing all this information, but we don't necessarily have the tools to use it most effectively,'' he said.

    What does it mean to use data "effectively", and is the "We" producing the data the same "We" using it? My first instinct on not having the tools to use this data most effectively is "that's good". My second instinct tells me that data is already being used TOO effectively. Personally, I hope that cross-reference of mass data stores containin
  • by sulli ( 195030 ) * on Wednesday October 29, 2003 @02:48PM (#7340445) Journal
    525,600 minutes per year. Impressive.

    But if these data were recorded on floppies, and stacked up to the moon n times, how many VWs would it take to carry those floppies to the stack site?

    • But if these data were recorded on floppies, and stacked up to the moon n times, how many VWs would it take to carry those floppies to the stack site?

      ...how many golf balls falling on said stack it would take to knock it over. And if you laid all the bits in the data side by side, I wonder how many times it would go around the earth?

  • So what the writeup is saying is that there's a whole lotta data, which is a problem, and that 92% of that data probably won't survive that long, which is a problem. It sounds like these two problems cancel each other out! (That is, as long as the 8% that does survive is the useful stuff.)
  • Storage (Score:4, Interesting)

    by 3Suns ( 250606 ) on Wednesday October 29, 2003 @02:50PM (#7340462) Homepage
    I work at EMC [emc.com], and this fact (along with projections for similar growth in the future) is a big marketing strategy for the company, especially toward investors. The storage market grows with the amount of information produced... it's gotta be stored somewhere!
    • > ...it's gotta be stored somewhere!

      For most of it, /dev/null is the prime choice of storage medium. This should really be an opportunity for companies producing high-speed, high-capacity null devices.

      Where are the VCs when one needs them?
  • Not long-term data (Score:3, Interesting)

    by micromoog ( 206608 ) on Wednesday October 29, 2003 @02:51PM (#7340474)
    That's a big-sounding number, but most of this is not going to be useful or stored long term. Examples:
    • Many large companies are building VERY large data warehouses, to capture and analyze every iota of information about every transaction. In a year or two, much of today's data will be largely irrelevant, and will likely be summarized and deleted.
    • People send a lot of email, and post a lot of messages, about day-to-day stuff that has no long-term value.
    • Surveillance video is used more than ever. This is not going to be stored long-term, except perhaps in the most security-sensitive areas.
    Either way, I highly commend the article's author for using both "Libraries of Congress" and "feet of books" as measurement units.
  • Now that all the (expected) porn jokes are out of the way, keep in mind that the goal is to count new data generated this year, without duplicates.

    You only get to count data you have generated yourself, anything you got from somewhere else (99% of porn, everything on P2P apps) doesn't count.

    As such, I think I'm under my one-cd-per-person (800mb) limit for the year, but I do know a few friends (artists) that would definitely be over :P

    Another interesting question is whether data conversion counts - If I copy a CD to

  • http://www.wired.com/wired/archive/11.09/full.html
  • How much of that was in kids' artwork for the refrigerator door? 'Cause that would store a lot better in a vector file format...
  • Mass replication (Score:3, Interesting)

    by binaryDigit ( 557647 ) on Wednesday October 29, 2003 @02:52PM (#7340498)
    I think the more interesting thing to study would be to determine how much unique data is being generated. I mean, who cares if two million people have the latest Britney Spears song in MP3 format? And that's not even talking about "information", but just simply raw "data". I also wonder if they took into account "data in transit" (being transmitted over the ethernet) and temporary data (caches, etc.).
    • From the article I was under the impression that they WERE talking about unique data.

      "They performed surveys to estimate how much unique information exists on each type of hard drive."

      Still, it seems like it would be a difficult thing to discern.
  • by The Jonas ( 623192 ) on Wednesday October 29, 2003 @02:53PM (#7340508)
    ...how much info is destroyed each year to offset these numbers. I mean shredded files, stuff thrown in trash, bills, deleted data files, discarded/lost storage media, etc... In the end (of each year), I wonder, what is the actual increase in stored information?
  • by mengel ( 13619 ) <mengel@noSpAM.users.sourceforge.net> on Wednesday October 29, 2003 @02:53PM (#7340509) Homepage Journal
    At Fermilab [fnal.gov] where I work, the larger experiments are expecting [fnal.gov] to generate 1PB/year of data in around 2005, up from somewhere around 300TB/year currently.
    • At Fermilab where I work, the larger experiments are expecting to generate 1PB/year of data in around 2005, up from somewhere around 300TB/year currently.

      The remarkable thing is that after analysis is complete, all that data is reduced to just two bytes: "42"

  • Tera, giga, exa... don't give it to me in those terms. Put it in terms I can understand!

    Just how much of that was porn?

    -Goran
  • by Entropy248 ( 588290 ) on Wednesday October 29, 2003 @02:56PM (#7340529) Journal
    500,000 Libraries of Congress, huh? I've always had several problems (SI questions aside) with this unit of measurement. The Library of Congress is constantly expanding & adding new material. What year's Library of Congress do they mean? I imagine they aren't working w/ up-to-the-minute data and that the library is expanding much faster now. Not to mention the fact that everyone always makes exabytes ~2.4% smaller than they really are (and with numbers this big, it actually makes a difference!)... So call me the new number nazi troll already and get it over with...
  • Why is it that every quantity of data gets related to either x Libraries of Congress or y Encyclopedia Britannicas, as if either of those were actually an approachable figure? I want to lobby for a new measure, such as x two-hour porn DVDs or y illegally downloaded songs.
  • Then think: how many bytes of that number are actually backed up, if they are irreplaceable?

    I'd bet not much. And what is backed up may only have a shelf life of about 20 months if it's on poor-quality CD-Rs or floppies.
  • Damn- that puts some stuff in perspective... 800 MB per person is really not that much... just over one CD per person on the planet.

    I personally burned over 500 CDs last year, filled a couple of hard drives, and sent God knows how much email...

    I think this goes to show what a wealthy little world we computer people live in.
    • by Anonymous Crowhead ( 577505 ) on Wednesday October 29, 2003 @03:04PM (#7340611)
      I personally burned over 500 CDs last year

      Congrats, you balanced out 1 medium-sized tribe in Africa.
    • by IM6100 ( 692796 )
      What did you burn on those 500 CDs?

      Do you run your own particular pseudo-random number generator and store the results? Do you go out with a digital camcorder and record tons and tons of images of the world? Do you write that much prose or poetry in a year?

      Or are you just talking about 500 CDs of data that you or somebody else 'ripped' from existing media and are shuffling around?
  • ... how much of it was porn? :)
  • Hey, way to add another 800k to the glut with this pdf file!!
  • My figures (Score:4, Interesting)

    by robogun ( 466062 ) on Wednesday October 29, 2003 @03:04PM (#7340604)
    I just did another backup, so the figures are right at hand.
    I'm a news photographer, shooting digital.
    In 2002 I saved 78,742 photos to disk. (Bad images were not saved.)
    That worked out to 122 gig. The output was transferred from the CF cards and archived to DVDs.
    But how much of that 122 gig is really information? The image file saved by the Canon 1D is mostly empty air, as far as I can tell. There is also EXIF data and IPTC, and who knows how much hidden BS is included, à la Microsoft Word documents?
    Simple compression was able to whittle that down to 33.2 gig. So that's my contribution.
    The main beneficiary is the DVD-R blank disc makers and Western Digital, I guess.
  • It's about 6 Exabytes.
  • Doesn't just one experiment produce 45 zillion megabytes? (Don't quote me on that.)
  • An MP3 is usually about 1 MB a minute, but a raw WAV file is several times more. The same goes for raw video versus MPEG-2 or QuickTime.

    I suppose the number could be much larger if you expand data before counting it.
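
    For a rough check of those rates, assuming CD-quality audio (44.1 kHz, 16-bit, stereo) against a 128 kbps MP3:

    # Size per minute: raw CD-quality WAV vs. 128 kbps MP3.
    wav_bytes_per_sec = 44100 * 2 * 2     # samples/sec * channels * bytes/sample
    mp3_bytes_per_sec = 128000 / 8        # 128 kilobits per second
    print(wav_bytes_per_sec * 60 / 1e6)   # ~10.6 MB/min raw
    print(mp3_bytes_per_sec * 60 / 1e6)   # ~0.96 MB/min encoded, ~11x smaller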

  • I don't understand, how many elephants does an exabyte weigh?

  • log2(5 exabytes) is a little over 62!
    (62.3 for RAM-style exabytes or 62.1 for HD-style exabytes).
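
    Both figures check out in Python:

    import math
    print(math.log2(5 * 2 ** 60))    # 62.32 -- "RAM style" (binary) exabytes
    print(math.log2(5 * 10 ** 18))   # 62.12 -- "HD style" (decimal) exabytes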
  • Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future.

    Not least for those historians who want to know what my Amazon.com session ID was on the day that my Runescape character hit mining level 33.

  • What's the big deal? That's only five 8mm tapes, isn't it?
  • by targo ( 409974 ) <[targo_t] [at] [hotmail.com]> on Wednesday October 29, 2003 @03:27PM (#7340838) Homepage
    5 billion files are created every day.
    3 billion of them will never be found again.
    Poor files...
  • OK, if we've only created that much data, it's time to get to work. Screen-saver authors, please add the following to the main() segment of your code:

    long x;
    while (1) {                  /* loop forever */
        x = rand();
        send_to_info_glut(x);    /* send the junk to Info Glut */
    }

    Please send the data created to Info Glut, and while you're at it, send it to all the spammers and to SCO. With some luck, you might DDOS them off the internet.

  • I'm 173.205 percent sure these numbers are not very accurate. I'm 314.159 percent sure that they won't affect how I sleep. And I'm 628.318 percent sure that the funding for this kind of "research" has an upper bound.
  • I just want to point out that 800 MB per person works out to 1,600 slices of 512x512 CT data (the standard size of CT slices at 16 bits per voxel) - which means that this amount of data is roughly the same thing as about a 1mm * 1mm * 1mm CT scan of every human on the planet.
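
    The arithmetic holds if you treat those 800 MB as binary megabytes:

    slice_bytes = 512 * 512 * 2           # 512x512 voxels at 16 bits (2 bytes) each
    print(slice_bytes / 2 ** 20)          # 0.5 MiB per slice
    print(800 * 2 ** 20 / slice_bytes)    # 1600.0 slices in 800 MB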
  • Statistics like this only serve to amaze and astound pointy-haired boss types. "Oh my God!" they shriek. "Do we REALLY???" Meanwhile, the world keeps turning, we all keep getting up in the morning, and I keep wishing I could get laid. Just once. I mean, REALLY!

    Seriously, though, I bet the breakdown is something like this:

    1. Most of the "information" is probably composed of music and film. We all know how much bandwidth and disk space music and film take up. Here's another thing: different sites might have dif
  • by Pedrito ( 94783 ) on Wednesday October 29, 2003 @03:46PM (#7341033)
    Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future.

    They fail to mention that also of note is that 99% of that information is in the form of pr0n! That's a lot!
  • Dangit, CowboyNeal! I told you to turn off that packet sniffer at MAE East!

    Now look what you've done.

    -Adam
  • 500,000 Libraries of Congress

    If the poster had read the report carefully, they would have noticed that the comparison is to the print collection of the Library of Congress. If you add in the audio and film collections, there are at least two orders of magnitude more data. Even the LOC doesn't seem to be sure how big its entire collection is.

  • Five exabytes of data is a meaningless figure if you consider that probably 52% of that was pr0n. The other 35% was source code (non-human readable data). And the remaining 13% was made up of spam, web logs, and e-mail to grandmaw.
  • According to the study...

    Regarding web pages:

    Porn. 2,743 sites (or 28%) appeared to contain pornographic content. To generate this statistic, we matched a list of 94 pornographic stopwords to terms in the associated URL and the index page.

    You read that right: 28% of the sites sampled appear to be porn. Anyone surprised? Read on...

    Regarding P2P networks:

    The largest file types are .AVI video files, followed by archival .ZIP files. AVI files are video files playable on a computer. The range of t

  • Of note is that 92 percent of the new information was stored on magnetic media, which may create an interesting problem for historians and archaeologists of the future.

    Many nine-track magtapes from the 1960s are still readable. For those that aren't, typically the problem is not with the magnetic coating but the substrate. By now the properties of the substrate materials are much better understood, so this should be less of a problem with modern magnetic media.

    Most optical media does not have any be

  • I've got 3.15 gigabytes of photos from 2002 on my laptop... and that's AFTER I weeded them out. So far this year, I'm at 7.13 gigabytes of photos, and it's not even Christmas season yet!

    I admit I take more pictures than most, but I haven't gotten a video camera yet... just think of the Terabytes I'll consume with that bad boy.

    --Mike--

  • Now, are you using the current Library of Congress measurement, or are you using an old one? I mean, new books must be coming in. I presume that's not just the ASCII, but scans of the pictures at a decent resolution.

    How will I ever do the proper conversions if you aren't using the up-to-date standards?

    =Brian
  • Holy crap! There's a lot of everything in the world. Why is data much more exciting?
  • by ziggy_zero ( 462010 ) on Wednesday October 29, 2003 @05:16PM (#7341912)
    That there can't be an accurate measure of the data in the Library of Congress, because THEY don't know how much stuff they have. My cousin worked there this past summer, and he said they still have a large portion of the basement filled with (unorganized, mind you) stacks of CDs that they haven't even put into their database yet. Same goes for books. It'll be a while until anybody knows how much data the LoC has.
  • by danny ( 2658 ) on Wednesday October 29, 2003 @10:55PM (#7344312) Homepage
    I used to think in 7-bit ASCII, but the digital camera changed all that... In the last year I've taken over 5,000 photos - 5 gigs of data - as well as writing my usual couple of megabytes.

    But only a fraction of that will make it onto my web site - I have maybe 60 megabytes of photos (cut-down to around 100k each) online and 10 megabytes of text on my web sites, and would be adding less than 40 megabytes a year to that.

    Maybe I'll get a video camera, though, or put up some MP3s of my gamelan group...

    Though much is taken, much abides; and though We are not now that strength which in old days Moved earth and heaven, that which we are, we are.

    Danny.
