27 Billion Gigabytes to be Archived by 2010
Posted by
timothy
on Tue Jan 01, 2008 05:02 PM
from the if-not-sooner dept.
Lucas123 writes "According to a Computerworld survey of IT managers, data storage projects are the No. 2 project priority for corporations in 2008, up from No. 4 in 2007. IT teams are looking into clustered architectures and centralized storage-area networks as one way to control capacity growth, shifting away from big-iron storage and custom applications. The reason for the data avalanche? Archive data. In the private sector alone electronic archives will take up 27,000 petabytes (27 billion gigabytes) by 2010. E-mail growth accounts for much of that figure."
We have the prefixes, why not use them? (Score:5, Informative)
Note to science and tech journalists: please stop stringing together "millions" and "billions" in an attempt to make the numbers seem large, impressive, and incomprehensible. Scientific notation and SI exist for a reason.
Re:We have the prefixes, why not use them? (Score:5, Funny)
Re:We have the prefixes, why not use them? (Score:5, Funny)
Re:We have the prefixes, why not use them? (Score:5, Insightful)
Re: (Score:3, Insightful)
No, standard != wrong.
In this case, precisely the same thing is wrong as with all of journalism: using specific language constructs to push certain emotional messages along with the information. AKA manipulation.
a helpful reference page for large numbers (Score:5, Interesting)
Cow stacking is where you select cow as the animal and from earth to moon as the place and you'll see a graphic of cows being stacked to the moon and the number of cows which would be required to complete that stack.
Hamster Canyon will be where you select a hamster and the Grand Canyon and you'll see a picture of the Grand Canyon filled with hamsters and a number that indicates the total number of hamsters required to fill the canyon.
Re: (Score:3, Funny)
Re:We have the prefixes, why not use them? (Score:5, Insightful)
Joe Sixpacks digest technobabble at a rate that is relevant to them. While few would know what an Exabyte is, most would know what a Gigabyte is since they deal with numbers that size in relation to their own computing systems. I think it's less writing for sensationalism than it is writing in a language your audience will understand.
Re:We have the prefixes, why not use them? (Score:5, Insightful)
We gotta start using the prefixes before they start to become common. I'd rather see "27 Exabytes" followed by a parenthetical comment saying (27 Billion GigaBytes)
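For anyone who wants to check the prefixes, here is a minimal sketch of the conversion in Python. It assumes decimal SI prefixes (powers of ten, as drive vendors use), not binary units; the `SI_EXPONENTS` table and `convert` helper are names made up for this illustration.

```python
# Decimal SI prefixes as powers of ten (assumption: vendor-style decimal
# units, not binary GiB/PiB).
SI_EXPONENTS = {"kilo": 3, "mega": 6, "giga": 9, "tera": 12, "peta": 15, "exa": 18}


def convert(value, from_prefix, to_prefix):
    """Convert a byte count expressed in one SI prefix to another."""
    diff = SI_EXPONENTS[from_prefix] - SI_EXPONENTS[to_prefix]
    if diff >= 0:
        return value * 10 ** diff      # going to a smaller unit: multiply
    return value / 10 ** (-diff)       # going to a larger unit: divide


petabytes = 27_000  # the figure from the summary
print(convert(petabytes, "peta", "giga"))  # -> 27000000000 (27 billion GB)
print(convert(petabytes, "peta", "exa"))   # -> 27.0 (27 EB)
```

So "27,000 petabytes", "27 billion gigabytes" and "27 exabytes" are all the same quantity under decimal prefixes.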
Re:We have the prefixes, why not use them? (Score:5, Insightful)
People didn't become familiar with Gigabyte because of Back to the Future anyway; they are familiar with it because that's what they now buy hard drives and iPods in. When they are sold in Exabytes, you'll see the term used in journalism too.
will someone think of the kids! (Score:5, Funny)
I agree. However, I would go even further: instead of using geekish bytes and bits, we should use something like 400 billion MP3s. You know, so that MySpace users out there can understand TFA. They clearly have an interest in this sort of news.
So, in other words... (Score:5, Interesting)
"E-mail growth accounts for much of that figure."
We're archiving spam?
Re:So, in other words... (Score:5, Insightful)
We're archiving spam?
Which raises a question I find interesting: do we check for redundancy when archiving mail, so that we can save a hell of a lot of space on spam (and on legitimate automated messages), since spam is by definition essentially the same message sent to many people? Also, couldn't correlating stored mail for redundancy allow for better spam identification (although it would be no silver bullet, since legitimate automated messages are often redundant too)?
Re: (Score:2, Insightful)
~They use pictures of text, instead of text, so it takes more effort to filter based on content.
~They use random text at the bottom of their message to give the filter something to read.
~They generate random noise to superimpose over the picture. Every batch has a different noise layer.
I'm sure they do more [IANASB - spam bot - so I wouldn't know the details] but the slight differences between what WE would perceive as the same message foil
Re: (Score:2)
OK, so basically you're dismissing my entire idea (which was partly a question; I mean, why wouldn't it be done to a certain extent already?) just because an unknown (to you and me) proportion of the spam data isn't redundant.
That would be kind of like saying "Why bother with implementing compressed file systems! Most people fill their disks with files that can't be significantly compressed anyways!". Sure, but you've still got millions of copies of the exact same Nigerian scams out there which are stored withou
Re: (Score:2, Insightful)
Re: (Score:2)
I see, but my idea is more focused on solving the storage problem, and to get around the "95% redundancy" problem my idea was based on cutting messages into blocks depending on whether they're redundant or unique, as described here [slashdot.org].
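To make the block idea concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any real mail archive works: `BlockStore` is a hypothetical name, blocks are fixed-size, and SHA-256 digests are used as block identifiers. Fixed-size blocks only catch redundancy that lines up on block boundaries; catching common data that starts at an arbitrary offset would need something like content-defined chunking, which is not shown here.

```python
import hashlib


class BlockStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = {}    # SHA-256 hex digest -> block bytes, kept once
        self.messages = {}  # message id -> ordered list of block digests

    def put(self, msg_id, data):
        """Split data into fixed-size blocks and store each distinct block once."""
        digests = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # no-op if already stored
            digests.append(digest)
        self.messages[msg_id] = digests

    def get(self, msg_id):
        """Reassemble a message from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.messages[msg_id])

    def stored_bytes(self):
        """Bytes of block data actually kept (ignores index overhead)."""
        return sum(len(b) for b in self.blocks.values())


store = BlockStore(block_size=16)
common = b"A" * 16 + b"B" * 16           # part shared by both messages
store.put("msg-1", common + b"unique-tail-0001")
store.put("msg-2", common + b"unique-tail-0002")
print(store.stored_bytes(), "bytes stored for", 2 * 48, "bytes of mail")
```

The two messages share their first two blocks, so only the differing tails are stored twice.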
Re:So, in other words... (Score:5, Interesting)
Actually, I have a partial answer to this question. As a sysadmin for a Novell GroupWise email system, I can tell you that the actual message data for duplicate incoming messages (such as spam that is sent to many people at the same time) is only stored on disk once. Some sort of "pointer" is used to reference the message from the individual users' mailboxes. Check out the docs [novell.com] if you are interested.
That said, with about 1400 users (spread across multiple post offices), we have probably about 400 GB of email data. We are able to keep it low by having a 120-day retention policy. After that point, email can be archived locally; otherwise it's deleted. Independent of that, and to comply with regulations and disaster recovery scenarios, email data is backed up and replicated offsite using disk-to-disk backup (eVault [evault.com] in case anyone is interested).
This gives us the ability to archive email for up to 27 years or something like that (with relatively low storage costs because the disk-to-disk backup is incremental, storing changes at the per-block level). As for Microsoft Exchange, I have not the slightest clue how data is stored.
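The "store once, point many times" scheme described above can be sketched roughly like this in Python. This is only an illustration of the general idea, not GroupWise's actual on-disk format: the `SingleInstanceStore` class, its method names, and the use of a SHA-256 digest as the "pointer" are all assumptions.

```python
import hashlib
from collections import defaultdict


class SingleInstanceStore:
    """Toy single-instance message store (hypothetical, for illustration)."""

    def __init__(self):
        self.bodies = {}                    # digest -> message bytes, one copy
        self.mailboxes = defaultdict(list)  # user -> list of digests ("pointers")

    def deliver(self, recipients, message):
        """Store the message once and add a pointer to each recipient's mailbox."""
        digest = hashlib.sha256(message).hexdigest()
        self.bodies.setdefault(digest, message)  # no-op if already stored
        for user in recipients:
            self.mailboxes[user].append(digest)
        return digest

    def read(self, user):
        """Resolve a user's pointers back into message bytes."""
        return [self.bodies[d] for d in self.mailboxes[user]]


store = SingleInstanceStore()
spam = b"Subject: URGENT BUSINESS PROPOSAL\r\n\r\nDear friend..."
store.deliver(["alice", "bob", "carol"], spam)
print(len(store.bodies), "copy on disk for", len(store.mailboxes), "mailboxes")
```

Note that this only helps when the copies are byte-for-byte identical; a message that differs by a single character gets its own copy.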
Re: (Score:2)
Either that, or when the sending system sends the same message in multiple transactions (i.e. poor mailer, or a mailer interrupted by a 452 response code) and the messages have the same Message-ID header.
That said, the original pos
Re: (Score:2)
That said, the original poster makes an assumption that identical-looking messages are likely to be indistinguishable
No, I make the assumption that identical-looking messages have most of their data in common, and that this common data, even if only a chunk of the message starting and stopping at an arbitrary point, could be stored efficiently.
That means cutting messages into blocks, if it is found that some part has something in common with another one, to store common blocks of data all in one place. Th
Re: (Score:3, Informative)
Ummm, no. I have a CS degree and 20 years of experience. What you are talking about is attacking the problem of redundant information [wikipedia.org] by comparing blocks; this has already been 'solved'.
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Interesting)
"E-mail growth accounts for much of that figure."
We're archiving spam?
E-mail growth... (Score:5, Funny)
Distributed Storage (Score:3, Informative)
For example, Folding@home is implementing a distributed storage mechanism for their data, and we'll likely have a new @home project soon: Storage@home.
http://en.wikipedia.org/wiki/Storage@home [wikipedia.org]
http://www.stanford.edu/~beberg/Storage@home2007.pdf [stanford.edu]
http://folding.stanford.edu/English/Papers#ntoc7 [stanford.edu]
How Much do We Need to Store? (Score:5, Insightful)
Re: (Score:2, Insightful)
Re: (Score:2)
duh...users store their files in their email! (Score:5, Informative)
Users in a lot of places use their email as a document management system. This is somewhat effective on an individual basis, but in large organizations shared documents get duplicated dozens or even hundreds of times as each user has their own copy. In the next few years products like SharePoint will alleviate some of that, though storage is cheap enough that it may not be worth the cost to both reeducate users and build the infrastructure for it. A SAN can hold a real lot of Word documents and PDFs after all...
Practical Internet Groupware (Score:2)
That's exactly the message of this book [oreilly.com]. Email, although widely used, is neither practical nor effective as a means of sharing information in a company. And duplication of information is the lesser problem.
For instance, suppose someone leaves the company, either permanently or on vacation, a
Re: (Score:2)
Re: (Score:2)
For enterprise storage, hard drives are not cheap. Yes, you can buy domestic IDE drives for cheap, but check the prices on SAS or "enterprise grade" storage. A large company will have potentially petabytes of data - ba
Re:duh...users store their files in their email! (Score:5, Insightful)
Storage vendors want to sell expensive solutions to gullible execs, so they pay analysts to produce credible-sounding FUD scenarios.
"monthly e-mail traffic at more than 30 million messages, vs. 17 million just one year ago."
Like, wow. In the meantime 500GB disks cost the same or less than 250GB disks did a year ago.
"The university settled on an IBM storage infrastructure that will afford the institution 350TB of capacity"
350TB? 350 disks? Half that in a year and a quarter in 2? That's not really a huge amount of storage. Anymore. It's an amount of storage I could go order from my friendly online computer store and get delivered tomorrow.
The fact is, corporate storage isn't driving the market anymore; the consumer market is. Most people I know have more storage in their home PC than the average server requires. Companies want to save video? Consumers want their PVRs to save the cable TV stream.
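The 350 TB back-of-the-envelope arithmetic above can be written out as a small Python sketch. It assumes 1 TB drives today and per-drive capacity doubling every year, which is the assumption implied by "half that in a year and a quarter in 2", not a measured trend. It also ignores RAID overhead, spares and controllers, so real disk counts would be higher. The function names are made up for this illustration.

```python
import math


def disks_needed(capacity_tb, disk_tb):
    """Raw number of disks needed to reach a capacity, rounded up."""
    return math.ceil(capacity_tb / disk_tb)


def projection(capacity_tb, disk_tb_now, years, growth=2.0):
    """Disk counts for the same capacity, year by year, if drive size
    grows by the given factor each year (assumed, not measured)."""
    return [
        disks_needed(capacity_tb, disk_tb_now * growth ** year)
        for year in range(years + 1)
    ]


print(projection(350, 1.0, 2))   # -> [350, 175, 88]
print(disks_needed(350, 0.5))    # -> 700, if built from 500 GB drives instead
```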
Re:duh...users store their files in their email! (Score:4, Insightful)
2010 (Score:5, Funny)
Use standard units people understand. (Score:4, Funny)
Surprising . . . (Score:4, Insightful)
But it is mostly email they're talking about here, and I bet a HUGE part of this archiving is:
Yep! Solve problems 1-3, and you'd vastly decrease the amount of email that you have to archive! I won't complain about #4, since I actually value my job, but it would be nice if more PHBs knew more about tech,...
Re: (Score:3, Insightful)
I make several Excel files every week for reporting. They are located on a shared drive. Only new data is added every Monday, yet instead of putting a link to the files, or the directory, management wants me to send them by email every week to several people.
Utterly stupid, if you ask me.
For Fucks sake (Score:3, Insightful)
30 million emails? (Score:2)
I suppose if I was crazy enough, I'd post my address here on slashdot to see if we can slashdot Pitt's email servers,... maybe we can turn 30 million messages into 60 million messages. On second thought, I don't want 30 million messages,... ;-)
how much is surveillance data? (Score:3, Interesting)
And a great deal of video archive from CCTV as well I expect.
The question that arises is how would you index all this?
Moving away from Big Iron? (Score:3, Funny)
NetApp is a great company and makes a great product aimed at a specific market segment: file services (NFS/CIFS). I don't see many customers tossing out the EMC DMX, HDS Tagmastore or IBM Shark for an FC-enabled NetApp array. I also don't see a lot of FICON shops asking NetApp to support FICON.
Now the phase storage management is entering is the 'good enough' phase. Does my organization need the current generation of "high-end" arrays? Maybe not. The current generation of midrange arrays, with its better $/GB and a feature set increasingly on par with the high-end arrays, is starting to look more attractive to many customers.
Re: (Score:3, Insightful)
Re:Moving away from Big Iron? (Score:4, Funny)
and yes I do.
Re:Moving away from Big Iron? (Score:4, Funny)
Hooking up a pair of EMC DMX's (or IBM ESSes, or HDS USPs) over a pair of OC48s for SRDF/PPRC/USR unless you are a zOS shop, then you could run XRC. Since this is a BC/DR plan, we'll run it over FCIP protected by IPSec over a DWDM leased line, which must be protected by a UPSR/BLSR, otherwise in the event of a link failure, the R1s will split from the R2s.
Then you're SOL.
Even Worse ... (Score:3, Funny)
This is starting to be Mandatory (Score:3, Funny)
My pr0n collection takes at least 3 Internets* to store, archived.
*(sorry, forgot the conversion rate for Libraries of Congress)
Re:Wow, welfare for programmers... (Score:4, Interesting)
And what do data-archiving rules have to do with welfare for programmers? Maybe for disk manufacturing firms or data admins, but programmers?
Re: (Score:2)
We don't have a complete enough picture of the effects of data storage requirements. First, they may have some economic benefits. Second, it seems unlikely that the costs are so massive that they have any serious impact on bottom-line product development. Third, welfare would imply that there was no productive benefit caused by these "computer people", which we know is untrue.