
Too Much Data? Then 'Good Enough' Is Good Enough

Posted by timothy
from the ready-when-it-ships dept.
ChelleChelle writes "While classic systems could offer crisp answers due to the relatively small amount of data they contained, today's systems hold humongous amounts of data content — thus, the data quality and meaning is often fuzzy. In this article, Microsoft's Pat Helland examines the ways in which today's answers differ from what we used to expect, before moving on to state the criteria for a new theory and taxonomy of data."
This discussion has been archived. No new comments can be posted.

  • by Daetrin (576516) on Thursday June 02, 2011 @07:26PM (#36326392)
    The data quality and meaning of this summary are rather fuzzy. I have no clue what exactly they're talking about. No, I haven't RTFA yet, but the summary isn't making it very clear whether TFA is something I'd be interested in or not.
  • Re:GOATSE ALERT (Score:2, Insightful)

    by drb226 (1938360) on Thursday June 02, 2011 @07:30PM (#36326416)
    A tinyurl almost always means goatse. Honestly, trolls, you can do better. (Pick a more obscure, or even homemade, URL shortener.)
  • Re:Obligatory (Score:4, Insightful)

    by Fluffeh (1273756) on Thursday June 02, 2011 @07:36PM (#36326466)

    It's not that there is too much data. That's not a problem at all. From my own experience (I work as a senior analyst for a multinational retailer employing around 200,000 people) it is rather that there isn't a single plan to utilize all the data we have available. Every time we introduce a new system or change the way we do something, the project inevitably drops a new table into our data warehouse. Now, this may seem like an acceptable way to do things, but after this has happened twenty times, it is nigh impossible to run a query that will return data from all these tables in any sort of reasonable time.

    Would it cost more time, effort and money to properly introduce the new data to proper fact tables each time? Of course. However, the benefits would be that we could stop pretending that "we have too much data these days..." - because we don't. We just have too much mess with our data and it becomes unusable.

    In the example above (different descriptions for green) the base system may need these particular terms, but if the data needs to be aggregated or used in another system, then the jobs that pass it to your data repository need to adapt the data to work with the rest of your data warehouse. Having said that, if the new system is being developed in-house, then during development the question should be asked: "Can we store the color information in RGB right off the bat and adapt our own system to mask these values behind pretty descriptions?" That beats having to do it later via an ETL.
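
A minimal sketch of the ETL-side adaptation described above, assuming a hypothetical lookup table from source-system color descriptions to RGB triples. The names `COLOR_TO_RGB`, `normalize_color`, and `transform`, and the row layout, are illustrative and not taken from the post or from any particular warehouse.

```python
# Hypothetical mapping from source-system color descriptions to RGB triples.
# The descriptions and values are illustrative assumptions.
COLOR_TO_RGB = {
    "green": (0, 128, 0),
    "forest green": (34, 139, 34),
    "lime": (0, 255, 0),
}


def normalize_color(description):
    """Return the RGB triple for a description, or None if it is unknown."""
    # Collapse case and whitespace so "  Forest  Green " matches "forest green".
    key = " ".join(description.strip().lower().split())
    return COLOR_TO_RGB.get(key)


def transform(rows):
    """Add an 'rgb' field to each row; set aside rows that cannot be mapped."""
    loaded, rejected = [], []
    for row in rows:
        rgb = normalize_color(row["color"])
        if rgb is None:
            rejected.append(row)  # needs a human decision, not a silent guess
        else:
            loaded.append({**row, "rgb": rgb})
    return loaded, rejected
```

Keeping the original description next to the RGB value lets the source system go on showing its "pretty" names, while aggregate queries group on `rgb`.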

  • Re:Obligatory (Score:4, Insightful)

    by icebike (68054) on Thursday June 02, 2011 @08:07PM (#36326698)

    It's not that there is too much data. That's not a problem at all.

    Often (more often than not, I contend), there is indeed just too much data.

    Just because we have all these marvelous computerized data capture systems doesn't mean the data is necessary, useful, or worth keeping. However, someone always comes along in the project design stage and insists that the millisecond-by-millisecond weight of a bag of popcorn, weighed in real time as it is being filled, is going to provide a wealth of data for the design of future bagging systems and materials handling in general.

    The scale was only there to assure that 10 pounds were in the sack and to shut the hopper. Then some fool found out it measured every few milliseconds and recorded the data.

    So the project manager gets browbeaten into recording this trash, which never gets used by anyone for any purpose at any time, as those who lobbied for it wander off to sabotage other projects and never revisit the cesspool they created.

    This happens way way more than you might imagine in the real world these days.

    It used to be that projects had to fight for every byte of data collected; there were useful sinks identified for every field. But with falling storage costs the tendency is to simply keep shoveling it in, because it's easier than dealing with the demands of those "researchers" looking for another horse to ride.

  • Re:Obligatory (Score:5, Insightful)

    by StuartHankins (1020819) on Thursday June 02, 2011 @08:30PM (#36326850)
    +1 Insightful. I would argue that -- just like you have a lifecycle for software development -- you have a lifecycle for nontrivial amounts of data. Some data is useful in detail for a short term, but wherever possible it should be more coarsely aggregated as time progresses, and you should get sign-off from executives that it can be dumped after a period of time.

    Where I work, I estimated the cost to upgrade our SAN to continue to store a set of large tables, which helped everyone understand the cost in real terms. People tend to think that once the data is imported or created, it's a small incremental cost to house it from that point forward, but backup times and storage, along with execution plan costs, increase with size. There is a performance benefit to this trimming; partitioning and check constraints will only get you so far.

    What is difficult to gauge in advance sometimes is how the data will be used -- some things are obvious in the short-term, but as the company looks to different metrics or to shine some light on an aberration, you really need to be able to determine how quickly you can dump the detail. Get signoff then add some padding so you are conservative when you destroy. Make a backup "just in case" and delete it after a few months. The good news in my work is that changing your mind later to adapt to the new requirements means expectations are already set to change the way it works "from this point forward". There are many fields of work that do not have that luxury, because of the time or cost to gather detail again.
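
The lifecycle idea above (keep detail for a while, then aggregate more coarsely) could be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual process: the `roll_up` function, the 30-day default retention window, and the per-day summary fields are all hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def roll_up(readings, today, keep_days=30):
    """Split (timestamp, value) readings into recent detail and daily summaries.

    Readings older than keep_days are collapsed to one summary per day.
    The retention window is an assumed default; agree on it before deleting.
    """
    cutoff = today - timedelta(days=keep_days)
    detail = []
    buckets = defaultdict(list)
    for ts, value in readings:
        if ts.date() >= cutoff:
            detail.append((ts, value))  # still within the detail window
        else:
            buckets[ts.date()].append(value)  # old enough to aggregate
    summaries = {
        day: {
            "count": len(values),
            "total": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for day, values in buckets.items()
    }
    return detail, summaries
```

Storing count, total, min, and max per day keeps averages and ranges recoverable after the detail is gone. Anything that needs the individual readings, such as percentiles, is lost, which is why the sign-off matters.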
  • Nothing new (Score:4, Insightful)

    by Whuffo (1043790) on Friday June 03, 2011 @02:34AM (#36328706) Homepage Journal

    If the people who write these stories would familiarize themselves with Information Theory (Claude Shannon, in the 1940s) then they'd understand that you still can't make silk purses from sow's ears.

    Yes, it's a lot of records. Yes, the data entry people made mistakes. All this really means is that there's more noise in the data. As the signal-to-noise ratio declines, the value of the results also declines. Making decisions based on noisy data isn't science, it's only guesswork. That's fine for weather forecasting (a similar problem), but expecting the results from the described data to be more accurate than weather forecasts is foolish. Remember: garbage in, garbage out.
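
One way to put a number on the noise point above is the standard error of an average. This sketch assumes the errors are independent with zero mean and a known standard deviation, which the post does not claim, and the function names are hypothetical. Under that assumption the uncertainty of a mean is the noise standard deviation divided by the square root of the record count.

```python
import math


def standard_error(noise_sd, n_records):
    """Uncertainty of a mean over n_records, assuming independent noise."""
    return noise_sd / math.sqrt(n_records)


def records_needed(noise_sd, target_error):
    """Smallest record count whose mean reaches target_error."""
    return math.ceil((noise_sd / target_error) ** 2)
```

Under these assumptions, four times the noise takes sixteen times the records to reach the same precision. Systematic data-entry errors do not average out at all, so this is the optimistic case.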
