

27 Billion Gigabytes to be Archived by 2010
Lucas123 writes "According to a Computerworld survey of IT managers, data storage projects are the No. 2 project priority for corporations in 2008, up from No. 4 in 2007. IT teams are looking into clustered architectures and centralized storage-area networks as one way to control capacity growth, shifting away from big-iron storage and custom applications. The reason for the data avalanche? Archive data. In the private sector alone electronic archives will take up 27,000 petabytes (27 billion gigabytes) by 2010. E-mail growth accounts for much of that figure."
We have the prefixes, why not use them? (Score:5, Informative)
Note to science and tech journalists: please stop stringing together "millions" and "billions" in an attempt to make the numbers seem large, impressive, and incomprehensible. Scientific notation and SI exist for a reason.
Re:We have the prefixes, why not use them? (Score:5, Funny)
Re:We have the prefixes, why not use them? (Score:5, Funny)
Re:We have the prefixes, why not use them? (Score:5, Insightful)
Re: (Score:3, Insightful)
No, standard != wrong.
In this case, there's precisely the same thing wrong here that is wrong with all of journalism: using specific language constructs to push certain emotional messages along with the information. AKA manipulation.
Re: (Score:2)
!standard = wrong
or maybe even
~standard = wrong
but
(standard != wrong) == wrong
a helpful reference page for large numbers (Score:5, Interesting)
Cow stacking is where you select the cow as the animal and "from the earth to the moon" as the place, and you'll see a graphic of cows being stacked to the moon, along with the number of cows that would be required to complete the stack.
Hamster Canyon is where you select a hamster and the Grand Canyon, and you'll see a picture of the Grand Canyon filled with hamsters and a number that indicates the total number of hamsters required to fill the canyon.
Re:We have the prefixes, why not use them? (Score:5, Insightful)
Joe Sixpack digests technobabble at a rate that is relevant to him. While few would know what an Exabyte is, most would know what a Gigabyte is, since they deal with numbers that size on their own computers. I think it's less writing for sensationalism than writing in a language your audience will understand.
Re: I need a SB of storage (Score:2, Funny)
Re:We have the prefixes, why not use them? (Score:5, Insightful)
We gotta start using the prefixes before they can become common. I'd rather see "27 Exabytes" followed by a parenthetical comment saying (27 billion Gigabytes).
Re:We have the prefixes, why not use them? (Score:5, Insightful)
People didn't become familiar with the Gigabyte because of Back to the Future anyway; they are familiar with it because that's what they now buy hard drives and iPods in. When drives are sold in Exabytes, you'll see the term used in journalism too.
Re: (Score:2)
Jiggawhats is scientific-sounding? Are you sure?
Re: (Score:2, Informative)
That's just a different way of pronouncing Gigawatts
will someone think of the kids! (Score:5, Funny)
I agree. However, I would go even further: instead of using geekish bytes and bits, we should use something like 400 billion MP3s. You know, so that the MySpace users out there can understand TFA. They clearly have an interest in this sort of news.
Re: (Score:2)
And by golly that's few words it's almost some words!
Re: (Score:2)
Scientific notation makes that goal extremely simple to achieve. Or at least it would, if journalists could trust that their audiences had the basic high-school-level understanding that they ought to have.
Concepts like "million" and "billion" are hard to visualize and even harder to distinguish, and that's without the regionalization issue over whether
So, in other words... (Score:5, Interesting)
"E-mail growth accounts for much of that figure."
We're archiving spam?
Re:So, in other words... (Score:5, Insightful)
We're archiving spam?
Which raises a question I find interesting: do we check for redundancy when archiving mail, so that we can save a hell of a lot of space on spam (and other legitimate automated messages)? Spam is, by definition, essentially the same message sent to a large number of people. Also, couldn't correlating stored mail for redundancy allow for better spam identification (though it would be no silver bullet, since legitimate automated messages are often redundant too)?
Re: (Score:2, Insightful)
~They use pictures of text, instead of text, so it takes more effort to filter based on content.
~They use random text at the bottom of their message to give the filter something to read.
~They generate random noise to superimpose over the picture. Every batch has a different noise layer.
I'm sure they do more [IANASB - spam bot - so I wouldn't know the details] but the slight differences between what WE would perceive as the same message foil
Re: (Score:2)
OK, so basically you're dismissing my entire idea (which was partly a question; I mean, why wouldn't it be done to a certain extent already?) just because an unknown (to both of us) fraction of the spam data isn't redundant.
That would be kind of like saying "Why bother implementing compressed file systems! Most people fill their disks with files that can't be significantly compressed anyway!". Sure, but you've still got millions of copies of the exact same Nigerian scams out there which are stored withou
Re: (Score:2)
I see, but my idea is more focused on solving the storage problem, and to get around the "95% redundancy" problem my idea was based on cutting messages into blocks depending on whether they're redundant or unique, as described here [slashdot.org].
Re:So, in other words... (Score:5, Interesting)
Actually, I have a partial answer to this question. As a sysadmin for a Novell GroupWise email system, I can tell you that the actual message data for duplicate incoming messages (such as spam that is sent to many people at the same time) is only stored on disk once. Some sort of "pointer" is used to reference the messages from the individual users' mailboxes. Check out the docs [novell.com] if you are interested.
That said, with about 1400 users (spread across multiple post offices), we have probably about 400GB of email data. We are able to keep it low by having a 120-day retention policy. After that point, email can be archived locally; otherwise it's deleted. Independent of that, and to comply with regulations and disaster recovery scenarios, email data is backed up and replicated offsite using disk-to-disk backup (eVault [evault.com] in case anyone is interested).
This gives us the ability to archive email for up to 27 years or something like that (with relatively low storage costs, because the disk-to-disk backup is incremental, storing changes at the per-block level). As for Microsoft Exchange, I haven't the slightest clue how data is stored.
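Single-instance storage of the kind described above can be sketched in a few lines (a toy illustration of the idea only, not GroupWise's actual on-disk format; all names here are made up):

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance message store: each distinct message body is
    written once, and mailboxes hold only references (content hashes)."""

    def __init__(self):
        self.bodies = {}     # hash -> message body, stored exactly once
        self.mailboxes = {}  # user -> list of hashes (the "pointers")

    def deliver(self, user, body):
        key = hashlib.sha256(body.encode()).hexdigest()
        self.bodies.setdefault(key, body)  # no-op if body already stored
        self.mailboxes.setdefault(user, []).append(key)

    def bytes_stored(self):
        return sum(len(b) for b in self.bodies.values())


store = SingleInstanceStore()
spam = "Dear friend, I am writing about an urgent business proposal..."
for user in ("alice", "bob", "carol"):
    store.deliver(user, spam)

# Three recipients, but the body is stored only once.
print(len(store.mailboxes), store.bytes_stored() == len(spam))  # -> 3 True
```

Delivering the same spam to a thousand mailboxes would still cost one copy of the body plus a thousand small pointers.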
Re: (Score:2)
Either that, or when the sending system sends the same message in multiple transactions (i.e. a poor mailer, or a mailer interrupted by a 452 response code) and the messages have the same Message-ID header.
That said, the original pos
Re: (Score:2)
That said, the original poster makes an assumption that identical-looking messages are likely to be indistinguishable
No, I make the assumption that identical-looking messages have most of their data in common, and that this common data, even if only a chunk of the message starting and stopping at an arbitrary point, could be stored efficiently.
That means cutting messages into blocks, if it is found that some part has something in common with another one, to store common blocks of data all in one place. Th
Re: (Score:2)
Substitute "words" for "blocks" and you will find you have invented a dictionary.
Re: (Score:2)
Substitute "words" for "blocks" and you will find you have invented a dictionary.
Duh, of course by blocks I mean blocks of a significant threshold size. You're just nitpicking ;-)
Re: (Score:3, Informative)
Ummm, no. I have a CS degree and 20 years' experience. What you are talking about is attacking the problem of redundant information [wikipedia.org] by comparing blocks; this has already been 'solved'.
Re: (Score:2)
Ummm, no. I have a CS degree and 20 years' experience.
And? You were nitpicking anyway... Yay, a Wikipedia link that's barely even relevant! Maybe that's already been 'solved', but the question is not whether this has ever been solved but whether it's ever been implemented as such for e-mail storage. But maybe you can tell me what's flawed with my idea of (large-)block redundancy detection for e-mail storage to begin with, instead of rubbing your credentials in my face.
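For what it's worth, the block-level redundancy scheme being argued about here can be sketched as follows (a toy fixed-size-block version; real dedup systems typically use rolling hashes or content-defined chunking so that shifted data still matches, and the block size and names here are invented):

```python
import hashlib

BLOCK = 64  # toy block size; real systems use larger or variable-size chunks

class DedupStore:
    """Store each message as a recipe of block hashes; identical blocks
    (e.g. the body of a scam mailed to thousands of users) are kept once."""

    def __init__(self):
        self.blocks = {}    # hash -> block bytes (unique storage)
        self.messages = []  # per-message list of block hashes

    def add(self, data):
        recipe = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)  # shared blocks stored once
            recipe.append(h)
        self.messages.append(recipe)

    def get(self, n):
        """Reassemble message n from its block recipe."""
        return b"".join(self.blocks[h] for h in self.messages[n])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


store = DedupStore()
body = b"Dear friend, I write to you about an urgent transfer..." * 10
for user in (b"alice", b"bob", b"carol"):
    # Pad the per-user header to the block size so the shared body stays
    # block-aligned; fixed-size blocks break on shifts, which is exactly
    # why real dedup uses rolling hashes.
    store.add((b"To: " + user).ljust(BLOCK) + body)

# Three unique header blocks plus one shared body, instead of three copies.
print(store.stored_bytes() < 3 * (BLOCK + len(body)))  # -> True
```

The limitation the critics raise is real: spammers' random padding and per-recipient noise defeat naive block matching, but the bulk of a bot-blasted Nigerian scam body still dedupes.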
Re: (Score:2)
we check for redundancy when archiving mails, in a way so that we can save a hell of a lot of space on spam
I could see that helping if the same spam is sent to the clients on your network, but it doesn't account for all the subsequent iterations of the spam.
YMMV, but I see a lot of spam carrying highly varied introductory garbage (to attempt to fool spam filtering software, of course). Some of my email accounts easily receive 10x as much spam as legitimate email, which would make a redundancy check difficult to apply.
But if it works for you, then more power to you.
Re: (Score:2)
you have to first hash every email on your server, then submit that hash to every other email server in the whole world
I didn't talk about hashing entire e-mails but parts of e-mails (which makes the problem more complicated), and who talked about other e-mail servers in the rest of the world? Why would you want to do that?
Re: (Score:3, Interesting)
"E-mail growth accounts for much of that figure."
We're archiving spam?
Re: (Score:2)
Archiving is the best way to deal with any unnecessary and unneeded information, spam included. Many times I have archived my work files with the thought that if I don't open that archive in 12 months, it is all junk and I can just toss it away. I believe my brain works the same way, only faster. What are we talking about again?
Email Squared (Score:2)
Ignoring even the spam issue, there's also the issue that Outlook encourages people to include the previous message in its entirety, causing an O(n^2) effect for legitimate message chains; that is, every message in a conversation tends to include all previous messages. This not only increases archival size, but also causes mailboxes to approach their seemingly arbitrary upper bound on size much more rapidly.
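The quadratic blow-up from full-quoting is easy to quantify (a back-of-the-envelope sketch; the function name and the 2 KB figure are mine, not from the thread):

```python
def thread_bytes(n, m):
    """Total bytes archived for an n-message conversation in which every
    reply quotes all previous messages in full: message i carries i new-
    and-quoted parts of m bytes each, so the total is m * n * (n + 1) / 2."""
    return m * n * (n + 1) // 2

# A 20-message thread of 2 KB replies archives ~420 KB instead of ~40 KB.
print(thread_bytes(20, 2048), 20 * 2048)  # -> 430080 40960
```

That is roughly a 10x inflation for a modest thread, before any spam is counted.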
Re: (Score:2)
The biggest problem I found with Outlook is that its performance is O(n^2) based on the number of messages
E-mail growth... (Score:5, Funny)
Distributed Storage (Score:3, Informative)
For example, Folding@home is implementing a distributed storage mechanism for its data, and we'll likely have a new @home project soon: Storage@home.
http://en.wikipedia.org/wiki/Storage@home [wikipedia.org]
http://www.stanford.edu/~beberg/Storage@home2007.pdf [stanford.edu]
http://folding.stanford.edu/English/Papers#ntoc7 [stanford.edu]
How Much do We Need to Store? (Score:5, Insightful)
Wrong. (Score:2)
Re: (Score:2)
On top of that, the sheer
Re: (Score:2)
So, to answer your question:
What if we just stored less of it?
You might get fined or jailed.
duh...users store their files in their email! (Score:5, Informative)
Users in a lot of places use their email as a document management system. This is somewhat effective on an individual basis, but in large organizations shared documents get duplicated dozens or even hundreds of times as each user keeps their own copy. In the next few years products like SharePoint will alleviate some of that, though storage is cheap enough that it may not be worth the cost to both reeducate users and build the infrastructure for it. A SAN can hold an awful lot of Word documents and PDFs, after all...
Practical Internet Groupware (Score:2)
That's exactly the message of this book [oreilly.com]. Email, although widely used, is neither practical nor effective as a means of disseminating information in a company. And duplication of information is the lesser problem.
For instance, suppose someone leaves the company, either permanently or on vacation, a
Re: (Score:2)
For enterprise storage, hard drives are not cheap. Yes, you can buy domestic IDE drives for cheap, but check the prices on SAS or "enterprise grade" storage. A large company will have potentially petabytes of data - ba
Re:duh...users store their files in their email! (Score:5, Insightful)
Storage vendors want to sell expensive solutions to gullible execs, so they pay analysts to produce credible-sounding FUD scenarios.
"monthly e-mail traffic at more than 30 million messages, vs. 17 million just one year ago."
Like, wow. In the meantime 500GB disks cost the same or less than 250GB disks did a year ago.
"The university settled on an IBM storage infrastructure that will afford the institution 350TB of capacity"
350TB? 350 disks? Half that many in a year, and a quarter as many in two? That's not really a huge amount of storage anymore. It's an amount of storage I could order from my friendly online computer store and have delivered tomorrow.
The fact is, corporate storage isn't driving the market anymore; the consumer market is. Most people I know have more storage in their home PC than the average server requires. Companies want to save video? Consumers want their PVRs to save the cable-TV stream.
Re:duh...users store their files in their email! (Score:4, Insightful)
Re: (Score:2)
In only a year the size and value of hard disk drives have increased monumentally, and tomorrow at work I see no sign
2010 (Score:5, Funny)
Re: (Score:2)
There, I fixed that typo for you.
Use standard units people understand. (Score:4, Funny)
Surprising . . . (Score:4, Insightful)
But it is mostly email they're talking about here, and I bet a HUGE part of this archiving is:
Yep! Solve problems 1-3 and you'd vastly decrease the amount of email that you have to archive! I won't complain about #4, since I actually value my job, but it would be nice if more PHBs knew more about tech...
Re: (Score:2)
Because if not, that might be an (admittedly crummy) attempt at a backup system.
For Fucks sake (Score:3, Insightful)
30 million emails? (Score:2)
I suppose if I was crazy enough, I'd post my address here on slashdot to see if we can slashdot Pitt's email servers,... maybe we can turn 30 million messages into 60 million messages. On second thought, I don't want 30 million messages,... ;-)
how much is surveillance data? (Score:3, Interesting)
And a great deal of video archive from CCTV as well I expect.
The question that arises is how would you index all this?
Re: (Score:2)
The question that arises is how would you index all this?
By time. And then you can go by difference and then by motion.
You could even have a second pass running that picks out faces and objects. These can then be compared to another database of similar faces and objects. All of these would then also be stored with references back to the original video.
It can be as simple or as complicated as you want. The technology exists today (and I'm sure is b
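The "index by difference and motion" pass described above can be sketched with a simple frame-difference threshold (a toy grayscale version with invented names and numbers; real CCTV analytics are far more involved):

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def motion_index(frames, threshold=10.0):
    """Return indices of frames that differ noticeably from the previous
    one -- a first-pass index of 'something happened here' timestamps."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) > threshold]

# Toy 4-pixel grayscale frames: two static frames, then movement.
frames = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 200, 200, 10]]
print(motion_index(frames))  # -> [2]
```

Only the flagged frames would then need the expensive second pass (face and object extraction), which keeps the index far smaller than the raw footage.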
The solution was available a decade ago (Score:2)
They called it "Napster."
Moving away from Big Iron? (Score:3, Funny)
NetApp is a great company and makes a great product aimed at a specific market segment: file services (NFS/CIFS). I don't see many customers tossing out the EMC DMX, HDS TagmaStore or IBM Shark for an FC-enabled NetApp array. I also don't see a lot of FICON shops asking NetApp to support FICON.
Now storage management is entering the 'good enough' phase. Does my organization need the current generation of "high end" arrays? Maybe not. The current generation of midrange arrays, with its better or cheaper $/GB and a feature set increasingly on par with the high-end arrays, is starting to look more attractive to many customers.
Re:Moving away from Big Iron? (Score:4, Funny)
and yes I do.
Re:Moving away from Big Iron? (Score:4, Funny)
Hook up a pair of EMC DMXs (or IBM ESSes, or HDS USPs) over a pair of OC48s for SRDF/PPRC/USR, unless you are a z/OS shop, in which case you could run XRC. Since this is a BC/DR plan, we'll run it over FCIP protected by IPsec over a DWDM leased line, which must be protected by a UPSR/BLSR; otherwise, in the event of a link failure, the R1s will split from the R2s.
Then you're SOL.
Redundant Data (Score:2, Interesting)
So that's about 20 billion gigabytes of data... (Score:2, Insightful)
The solution is data compression (Score:2, Funny)
p0rn 365/24 for everyone on the planet (Score:2)
Even Worse ... (Score:3, Funny)
genome... (Score:2)
This is starting to be Manditory (Score:3, Funny)
My pr0n collection takes at least 3 Internets* to store, archived.
*(sorry, forgot the conversion rate for Libraries of Congress)
Re:Wow, welfare for programmers... (Score:4, Interesting)
And what do data-archiving rules have to do with welfare for programmers? Maybe for disk manufacturing firms or data admins, but programmers?
Re: (Score:2)
We are not talking about storage needs here. We are talking about stuff you have to keep around because of regulations on big business, but which has no value: it adds no value to the company and, most of all, adds no value to the product or service the company sells. The point is that all of this eats into profit margins, and so it is a "drain on the economy".
Re: (Score:2)
We don't have a complete enough picture of the effects of data storage requirements. First, they may have some economic benefits. Second, it seems unlikely that the costs are so massive that they have any serious impact on bottom-line product development. Third, welfare would imply that there was no productive benefit caused by these "computer people", which we know is untrue.
Re: (Score:2)
I'm in the process of putting together backup functions for my home network right now. Almost all of my backup costs are incurred getting networking and servers (and powe
The 21st Century Dark Ages (Score:2)
I suspect he's onto something!
You probably already can. (Score:2)
Even if you start to scan every document, 500 gigabytes is going to be a lot of documents.
Most servers I bet are pretty small compared to what people are using at home. You just don't need to store video or even a lot of audio in most businesses.
Of course this doesn't apply to video production houses, print shops, or any places that actually deals with a lot of media data.
I know that my company's customer database is under one gig.