Data Storage Privacy Your Rights Online

Why Anonymized Data Isn't

Posted by kdawson
from the can't-keep-good-PII-down dept.
Ars has a review of recent research, and a summary of the history, in the field of reidentification — identifying people from anonymized data. Paul Ohm's recent paper is an elaboration of what Ohm terms a central reality of data collection: "Data can either be useful or perfectly anonymous but never both." "...in 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex. ... For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. ... Reidentification science disrupts the privacy policy landscape by undermining the faith that we have placed in anonymization."
This discussion has been archived. No new comments can be posted.

  • by Ethanol-fueled (1125189) * on Tuesday September 08, 2009 @03:03PM (#29355107) Homepage Journal

    For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm.

    ...And this is the first thing that the author(s) thought of regarding data-mining? Okay, but how would this happen? Why go through all the trouble to gather all that data when you could just hire a P.I. or know (or bribe) a law-enforcement official or an ISP employee? It reminds me of a conversation I had with a guy who bragged that he could get anybody's info because a very good friend of his worked at the DMV. There were a couple semi-profile firings at the State Department because some employees snooped through celebrities' records for no reason other than voyeurism..er..curiosity.

    Those types, the ones with the direct access to the info, are the weakest link. They're only human. "Hey, Bob, there's this guy I really hate. Look up his IP logs and tell me what you see!"

    It all boils down to voyeurism. People would rather bring others down than bring their own lives up. It's the nature of the beast! Pathetic.

    • Re: (Score:3, Insightful)

      by mea37 (1201159)

      Do you mean, you think you could've gotten an individual's medical records in MA for less than $20? Or maybe you can't see why someone would dig up an individual's medical records? (I can think of many... but then my employer was extorted by someone who'd stolen a bunch of medical-related data from them not that long ago.)

      I think I hear a bit of "nobody would go to all that trouble" in your message. If in the early days of WiFi networks I described to you in tedious yet vague terms how to compromise WEP e

      • by causality (777677) on Tuesday September 08, 2009 @04:12PM (#29356179)

        Do you mean, you think you could've gotten an individual's medical records in MA for less than $20? Or maybe you can't see why someone would dig up an individual's medical records? (I can think of many... but then my employer was extorted by someone who'd stolen a bunch of medical-related data from them not that long ago.)

        I think I hear a bit of "nobody would go to all that trouble" in your message. If in the early days of WiFi networks I described to you in tedious yet vague terms how to compromise WEP encryption, you probably would've thought the same thing. Today anyone who cares to can break WEP using readily available tools - it's really no bother at all if you're even slightly inclined to do it.

        I've seen companies with contractual and regulatory obligations to protect data privacy make half-gestures to make it look like they're honoring privacy while still engaging in whatever easy-money scheme or shortcut they want. Shedding light on why those half-gestures don't work is a big deal.

        That's the thing that I also think people don't understand. With good reason, I am not satisfied merely that someone probably wouldn't want to abuse my information. I am satisfied only when I know that they cannot do so.

        I think the solution is to have the concept of "intellectual property" work both ways. Obviously your private information has value, otherwise advertisers and other companies wouldn't go to such great lengths to obtain and use it. The problem is that they obtain it without your consent and without directly compensating you. For example, if I don't actively block web bugs, cookies, HTTP "ping", analytics tools, and other similar attempts, then that data will be gathered whether or not I like it.

        The reason why I actively go out of my way to prevent companies from gathering data on me is simple. No one asked me if I wanted to be data-mined. I refuse to honor agreements in which I did not participate. Why anyone else would do so is a mystery to me.

        So make each individual's private data their personal property. They can set whatever value they like, and if that value is more than a company thinks it is worth, the company is free to decline the sale. Most importantly, any attempt to just take that data will be theft, and anyone who does this can be prosecuted in a criminal court. I mean, think about it: why is it "marketing" when a company helps itself to my information against my will and "piracy" or "industrial espionage" if I helped myself to THEIR zeroes and ones against their will?

        • Re: (Score:3, Interesting)

          by andy_t_roo (912592)
          i think i found a new sig (a bit too long for /. unfortunately):
            "why is it "marketing" when a company helps itself to my information against my will and "piracy" or "industrial espionage" if I helped myself to THEIR zeroes and ones against their will?"
  • Paul Ohm? (Score:5, Funny)

    by Yvan256 (722131) on Tuesday September 08, 2009 @03:07PM (#29355161) Homepage Journal

    Paul Ohm's recent paper is an elaboration of what Ohm terms a central reality of data collection: "Data can either be useful or perfectly anonymous but never both."

    Great, another Ohm's law [wikipedia.org] to learn.

    • Re:Paul Ohm? (Score:5, Informative)

      by natehoy (1608657) on Tuesday September 08, 2009 @03:12PM (#29355231) Journal

      Nonsense, it could be an extension of the current Law:

      "In electrical circuits, Ohm's law states that the current through a conductor between two points is directly proportional to the potential difference or voltage across the two points, and inversely proportional to the resistance between them. In data anonymity, the law states that the general usefulness of any set of data that originally contained personally-identifiable information is inversely proportional to the degree of anonymity applied to said data."

      See, one simple law to memorize, and now data analysts learn just a teensy bit about electricity and EEs learn just a teensy bit about data anonymization.

      • Re: (Score:3, Funny)

        by 2names (531755)
        Could you put that in the form of a car analogy so us laymen can understand it please? :)
        • Re:Paul Ohm? (Score:5, Informative)

          by Beardo the Bearded (321478) on Tuesday September 08, 2009 @03:54PM (#29355819)

          Okay, let's take a road. The speed at which traffic can travel depends on the quality of the surface, gradient, camber, zoning, etc. Let's call this the "road conditions", with a lower number being better roads.

          The number of cars that want to get through that road is a primary unit, which we can refer to as the "volume of traffic".

          The third major criterion is the speed at which the traffic actually flows. This is the "actual flow" of traffic -- in other words, the "influence of other cars" on the traffic congestion.

          In other words:
          volume = influence of traffic * road conditions

          or:
          V = IR

  • Duh. (Score:4, Informative)

    by SatanicPuppy (611928) * <Satanicpuppy AT gmail DOT com> on Tuesday September 08, 2009 @03:10PM (#29355193) Journal

    Am I the only one who always gives their birthday as 01/01/1970 and their zip code as 20500?

    I mean, seriously. They don't need to know. Why would I give 'em the right numbers? They're lucky I even allow them to have rough demographic data.

    • Re:Duh. (Score:5, Funny)

      by ColdWetDog (752185) on Tuesday September 08, 2009 @03:17PM (#29355289) Homepage
      I just put "No" under sex. I like to tell the truth. Not sure how it helps on the ID end though.
    • Any particular reason you chose District of Columbia?
    • Re:Duh. (Score:4, Insightful)

      by garcia (6573) on Tuesday September 08, 2009 @03:25PM (#29355383) Homepage

      Am I the only one who always gives their birthday as 01/01/1970 and their zip code as 20500?

      I use 1/1/1979 (it's closer to my real age) and 90210 instead. I get a lot of crosseyed looks and many times the cashier (or whatever human I'm dealing with) will end up entering in a local zip code instead but people are no longer arguing w/me about what I choose to provide them when pressured for information (I always politely reply, "no thanks," when asked for that type of information but will give them false shit when they ask again and whine that they'll be fired).

      Why would I give 'em the right numbers? They're lucky I even allow them to have rough demographic data.

      Because the majority of people have absolutely no problems handing over any and all information they're prompted for up to and including their e-mail address, phone number or even SSN! Because most people don't even blink, those of us that don't feel like it should be anyone's business (like the scanning of IDs at liquor stores or bars to check age--there is a birthdate listed on IDs for a fucking reason people--not that they can scan my rare earth magnet swiped ID anyway) are looked at like assholes when we refuse to provide information that no one really needs anyway.

      • Re: (Score:3, Informative)

        by mmkkbb (816035)

        (like the scanning of IDs at liquor stores or bars to check age--there is a birthdate listed on IDs for a fucking reason people--not that they can scan my rare earth magnet swiped ID anyway)

        That's not to check age; that's to check for counterfeits with mismatched mag data, or mismatched 2-D barcode data, or missing UV ink prints, or missing holograms, etc. etc.

        • by steelfood (895457)

          The way to defeat this is to use an out-of-state fake ID. Or to use an ID of somebody who looks like you.

          The whole ID checking process has gotten asinine really...

          • by mmkkbb (816035)

            An out-of-state fake ID will not necessarily work. There are interstate standards for the content of mag stripes and 2-D barcodes, for example.

            • Re: (Score:3, Interesting)

              by Jah-Wren Ryel (80510)

              An out-of-state fake ID will not necessarily work. There are interstate standards for the content of mag stripes and 2-D barcodes, for example.

              But nowhere near all states follow those standards. All you gotta do is make a fake-id for one of those states. Even if the state does follow those standards, if you pick a state far enough away you can make up pretty much anything, call it an id card (rather than a driver's license) and the person using the machine will have to make the human decision to accept the id anyway or not. As someone who made such a fake-id for a girl who wanted to appear younger than she was (got tired of the bouncers at the

      • Re: (Score:3, Funny)

        by ACMENEWSLLC (940904)

        This makes me think of a probably not unique idea. Most places that ask for my phone number are the same places asking over and over again. Radio Shack, Toys-R-Us, and Sears for example. What would be great is to memorize one of their phone numbers from the phone book and always give them that. Perhaps a number from a different store. Let their telemarketers waste time calling their own stores.

        • Your "account" is indexed under your phone number - they are looking it up to know what offers they should let you in on, check to see if you have a store credit card or should have one and of course to build their profile on you.

          They don't care about your phone number other than that it is a unique identifier.

          • by causality (777677)

            Your "account" is indexed under your phone number - they are looking it up to know what offers they should let you in on, check to see if you have a store credit card or should have one and of course to build their profile on you.

            They don't care about your phone number other than that it is a unique identifier.

            I have the money, they have the goods, we make an exchange. I like it when it remains that simple. Their mistake is assuming that I want to establish an "account" without first asking me. When it comes to my personal information, everyone is on a need-to-know basis. Almost no one needs to know. If they have an entitlement mentality that prevents them from respecting that, then I have no moral qualms whatsoever about giving them false information.

      • Re:Duh. (Score:5, Funny)

        by plague3106 (71849) on Tuesday September 08, 2009 @03:50PM (#29355771)

        I once gave a GameStop employee my zip as 12345. He said, "It's OK if you don't want to give it." My reply was that no, I am from Schenectady, NY.

        • I use this exact same zip code. On web forms I usually put in 123 Fake Street.

          Oh and bob@hotmail.com? I am really, really sorry about that man.

    • You forgot phone number "867-5309"
    • Re:Duh. (Score:5, Funny)

      by interkin3tic (1469267) on Tuesday September 08, 2009 @03:31PM (#29355473)

      Yes you are. I always put 90210. Phone number 867-5309. If anyone tries to find me, they're at least going to have that song stuck in their head and recall with disgust the shows they watched in the early 90's. Hopefully that will demoralize them enough to give up.

    • Re:Duh. (Score:4, Funny)

      by compro01 (777531) on Tuesday September 08, 2009 @03:33PM (#29355487)

      I would think 90210 is a more common choice for zip code. It's probably the most densely populated area on the planet according to dataminers.

    • No, I use 90210 because I know that's a valid code.

      I've given out random birthdays so many times that I have to check my DL before I order a cake.

    • Yah, I do that too. I have AARP invitations on my wall because they mined some database that shows I'm in my 70s. I also have lots of high school age directed mail because other databases show me as a teenager. Oh, and Medicaid and insurance scams and political propaganda targeted at seniors -- I get literally dozens of those a week.

  • by A beautiful mind (821714) on Tuesday September 08, 2009 @03:11PM (#29355199)
    See!

    -- Anonymous Coward
  • by Yvan256 (722131) on Tuesday September 08, 2009 @03:11PM (#29355211) Homepage Journal

    [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.

    Holy hell forget about that anonymized data crap, I want to learn how she can compress that much data into three bits!

  • Mission Impossible (Score:5, Insightful)

    by im_thatoneguy (819432) on Tuesday September 08, 2009 @03:12PM (#29355219)

    I've pretty much given up any hope of being anonymous. It's just going to get exponentially more difficult as time goes on.

    I had my credit card stolen once. It was stolen from the CC company. How is a business supposed to entrust me with thousands of dollars in credit if they don't know who I am? How is a credit card company supposed to function without a worldwide network which authorizes transactions?

    If someone wants to find me they'll find me.

    If someone wants to use my identity to frame me for a crime then they're just going to encounter a mountain of evidence from numerous sources which contradict their fabrication.

    "My G1 was on a Starbucks Wifi at the time of the crime. I used my CC to purchase the drink. I received a text from a nearby tower. I posted a comment on a breaking news story that is written in my style of writing. I was seen on 8 security cameras walking to the Starbucks from my car. I used an automatic toll card 5 miles away from the coffee shop...." Good luck coming up with a large mountain of evidence to put me somewhere else.

    • Re: (Score:2, Insightful)

      by riqtare (264681)

      If access to the evidence you just stated was available to the framer it makes it very easy to find a likely fall guy according to their habits. Makes the alibi of overwhelming evidence evaporate into prime suspicion.
      The best lies are those that are mostly truth.

    • by ArsonSmith (13997)

      So you're saying you robbed the coffee shop?

  • [citation needed]
    I can't think of anything I've done online (even my shemale midget fetish on youpron) that could be used to blackmail me, now i get that others are more ashamed about what they do online but "almost everybody"?

    • Re: (Score:3, Insightful)

      by interkin3tic (1469267)

      I did think that was an overstatement that undermined the main point. None of my prescriptions would be embarrassing to anyone but a holistic medicine believer, I've told some tasteless jokes online. If someone were to send that information to my family along with what porn I looked at, that would be awkward at most. And that's assuming it's credible, which it wouldn't be.

      How exactly would this blackmail work? Bob, the evil co-worker threatens to tell your wife and boss you have had a sex change, a runni

      • by mbone (558574)

        Or, maybe, Bob, the evil co-worker, threatens to tell my wife and boss that I am a Nigerian prince who has obtained $ 1 billion USD in oil money that needs a US bank account to be successfully deposited...

        (Unfortunately, I know that there is a fair amount of spam sent in my name. I get the backscatter from it.)

    • by 5KVGhost (208137)

      I'm skeptical about that claim, too, but I think the author also intended it to include real-world activities. For example, you've called in sick to work, but records of your activity suggest that you were actually at a job interview / romantic liaison / midget convention over on the other side of town.

    • by mdf356 (774923)

      I can't think of anything I've done online (even my shemale midget fetish on youpron) that could be used to blackmail me

      Same here. However, the next bit of text is more relevant:

      discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm.

      There's almost certainly something that can be used to discriminate against you, harass you, or steal your identity, causing legally cognizable harm. Blackmail is just for the people ashamed of what they do; the rest affects everyone.

  • by jdgeorge (18767) on Tuesday September 08, 2009 @03:16PM (#29355273)

    Forget anonymity. I'm better off living in a glass house, so it's easier for me to know when I need to yell "Get off my lawn!"

  • by Anonymous Coward

    Even if the data is completely and irreversibly anonymized, it is still invasive. Look at the story yesterday about the marketers data-mining kids' online private conversations for consumer gadget preferences. Even if there's no way from that data to infer the preferences of any particular kid, they should still be able to talk to each other without having their conversation be part of a marketing survey.

    Think also of a cafe that sells two kinds of food: apple pie (eaten by freedom-loving patriots), and f

    • Re: (Score:3, Interesting)

      by blahplusplus (757119)

      "Private should mean no disclosure, not anonymized disclosure, not aggregate disclosure, just plain no disclosure period."

      The profit motive and privacy are at odds. Trying to make the most money and sell the most stuff means you want to know everything about everyone so that you can one-up your competitors; it's a race to the bottom. Ideals in the real world always submit to the pragmatic concerns of making money in a capitalist society.

  • by RevWaldo (1186281) * on Tuesday September 08, 2009 @03:23PM (#29355355)
    If you ever wonder why people view the privacy of your records in the hands of third parties as important, and don't just hop on the "privacy is dead" bandwagon, this is the sort of scenario they have in mind.

    http://en.wikipedia.org/wiki/Mother_Earth_(magazine) [wikipedia.org]

    Mother Earth was an anarchist journal that described itself as "A Monthly Magazine Devoted to Social Science and Literature," edited by Emma Goldman. Alexander Berkman, another well-known anarchist, was the magazine's editor from 1907 to 1915. It published longer articles on a variety of anarchist topics including the labor movement, education, literature and the arts, state and government control, and women's emancipation, sexual freedom, and was an early supporter of birth control. Its subscribers and supporters formed a virtual "who's who" of the radical left in America in the years prior to 1920.

    In 1917, Mother Earth began to openly call for opposition to American entry into World War I and specifically to disobey government laws on conscription and registration for the military draft. On June 15, 1917, Congress passed the Espionage Act. The law set punishments for acts of interference in foreign policy and espionage. The Act authorized stiff fines and prison terms of up to 20 years for anyone who obstructed the military draft or encouraged "disloyalty" against the U.S. government. After Emma Goldman and Alexander Berkman continued to advocate against conscription, Goldman's offices at Mother Earth were thoroughly searched, and volumes of files and detailed subscription lists from Mother Earth, along with Berkman's journal The Blast, were seized. As a Justice Department news release reported:

    "A wagon load of anarchist records and propaganda material was seized, and included in the lot is what is believed to be a complete registry of anarchy's friends in the United States. A splendidly kept card index was found, which the Federal agents believe will greatly simplify their task of identifying persons mentioned in the various record books and papers. The subscription lists of Mother Earth and The Blast, which contain 10,000 names, were also seized."

    Mother Earth remained in monthly circulation until August 1917.[1] Berkman and Goldman were found guilty of violating the Espionage Act, (imprisoned for two years) and were later deported.

    • anarchist topics including the labor movement

      Labor organization is not Anarchism.

      Most Anarchists aren't really Anarchists, they just oppose the current form of governance and want to replace it with something else.

  • by Applekid (993327) on Tuesday September 08, 2009 @03:28PM (#29355435)

    So, despite the Birthday Paradox [wikipedia.org], they can still identify 87% of Americans? For some reason I'm under the impression that there are a lot more zip codes with more than 366 people (heck, even 1000 to call upon 3 or 4 duplicates that should cover gender differences) than there are zip codes under that amount.

    • by Daniel_Staal (609844) <DStaal@usa.net> on Tuesday September 08, 2009 @03:38PM (#29355571)

      That Paradox ignores the year. Add that in and it starts to become harder.

    • Your birthdate includes the year. Your birthday does not (at least for this discussion).

      The party trick of finding two people with the same birthday (a good probability in any group of 30 people or more) doesn't require them to have the same year of birth (although in most gatherings there's a good chance of this as well since often it's already somewhat segregated by age).

    • Re: (Score:2, Informative)

      by OrigamiMarie (1501451)
      Perhaps they meant zip + 4. Which gets you down to very few households, but most people can't rattle off their zip + 4, so this information wouldn't actually apply to the questions posed by cashiers. On the other hand, I have heard that data mining on web-surfing habits can usually pick up your zip + 4, so yeah, it would be pretty trivial to put that together with birth date (which is asked for at various places to determine that you're of-age -- though of course you can lie) and sex, which can probably be
    • by ArsonSmith (13997)

      Date of Birth != Annual Birthday

      one being month/day/year the other being just month/day.
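The birthday-versus-birthdate distinction can be put into rough numbers. What follows is only a back-of-the-envelope sketch, not Sweeney's method: it assumes every (date, sex) combination is equally likely, that birth years span 80 years, and that a year has 365 days. None of that holds for a real ZIP code, so treat the output as an illustration of the trend. The function name `fraction_unique` is made up for this example.

```python
def fraction_unique(population: int, cells: int) -> float:
    """Expected fraction of people who share their cell with nobody else,
    assuming each person falls into one of `cells` equally likely cells."""
    if population < 1:
        return 0.0
    # A given person is unique if each of the other (population - 1)
    # people lands in a different cell.
    return (1.0 - 1.0 / cells) ** (population - 1)


# Simplifying assumptions (not from the article): uniform births,
# 365 days per year, 80 possible birth years, 2 sexes.
BIRTHDAY_CELLS = 365 * 2        # month/day + sex
BIRTHDATE_CELLS = 365 * 80 * 2  # month/day/year + sex

if __name__ == "__main__":
    for people_in_zip in (1_000, 10_000, 50_000):
        by_birthday = fraction_unique(people_in_zip, BIRTHDAY_CELLS)
        by_birthdate = fraction_unique(people_in_zip, BIRTHDATE_CELLS)
        print(f"{people_in_zip:>6} people: "
              f"unique by birthday+sex {by_birthday:.3f}, "
              f"unique by birthdate+sex {by_birthdate:.3f}")
```

Under these assumptions, adding the year multiplies the number of cells by 80, which is why a full birthdate singles people out in ZIP codes where a bare birthday would not.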

    • So, despite the Birthday Paradox, they can still identify 87% of Americans? For some reason I'm under the impression that there are a lot more zip codes with more than 366 people (heck, even 1000 to call upon 3 or 4 duplicates that should cover gender differences) than there are zip codes under that amount.

      Well, as other people have pointed out, adding the year limits the number of collisions. So factor in year and maybe you need 80x the people to get the same obscurity. And you said 366 people. That eno

  • Couple of things.. (Score:5, Insightful)

    by hansraj (458504) on Tuesday September 08, 2009 @03:31PM (#29355469)

    Potential nitpick, but here goes.

    The summary (not surprisingly for a /. summary) omits a couple of details that give the reader a rather partial picture.

    For one, Paul Ohm is an Assistant Professor of law, and although the summary makes it sound like the linked article would be from a technical perspective, (mostly) it is not.

    A quote like:

    "Data can either be useful or perfectly anonymous but never both."

    needs a bit of background about the qualification of the person making that claim. Why? Simply because it sounds like a rather technical remark. If some computer science researcher made this claim, I would tend to take it more at face value, otherwise I would take it with a grain of salt.

    Now obviously this statement was not meant to be taken quite literally because the notion of "useful" is not precise. I can get reasonably useful information like "most of the people in my country like to buy branded stuff" or "most people who rent videos of actor X regularly, also rent the videos of actor Y regularly" without needing the underlying data to contain *any* personally identifiable information. The fact that extra data is stored is a different thing.

    I personally believe that instead of claiming that some researcher has argued X, it can be more informative to actually say what kind of researcher it is who made a claim. Not because only researchers in a certain area can be trusted, but because a little bit of background puts the claims in the right perspective.

  • by Kokuyo (549451)

    English is not my first language, so I probably didn't catch the whole meaning, but...

    The idea was that everyone can be identified with only the birth date, gender and ZIP code? So... err... There is, in fact, not even one ZIP code that has two people living there of the same gender that happen to share a birthday? Sure, to have the year coincide would take a bit more than just the date itself but it's hard for me to imagine that this could be true.

    So... what did I miss?

    • You missed the 87% figure. For 13%, this data is insufficient. (6.5% will share their birthdate, zip, and gender with another 6.5%)

      365*2 = 730 children per day per zip code could be uniquely identified using this information. If I understand this correctly, it implies that 730 / 93.5% = about 781 babies born per day per zip code (on a national average).

    • English is not my first language, so I probably didn't catch the whole meaning, but...

      The idea was that everyone can be identified with only the birth date, gender and ZIP code? So... err... There is, in fact, not even one ZIP code that has two people living there of the same gender that happen to share a birthday? Sure, to have the year coincide would take a bit more than just the date itself but it's hard for me to imagine that this could be true.

      So... what did I miss?

      It takes more than just these three items. What was meant was that if you take these three items, and run them against a database of known items, you end up knowing more from the combination than from the two separately. In this case, if you have a database with redacted information, and a second, non related, database that happens to have the redacted elements from the first, by selecting a good set of common keys to run a union of the two, you can "un-redact" the missing information. Nothing new here. The
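The "un-redact by joining two databases" idea described above can be sketched in a few lines. This is a toy illustration only: both datasets, every name in them, and the `link` helper are invented for the example, and a real attack would have to cope with messy, mismatched records.

```python
from collections import defaultdict

# Fields that survive "anonymization" and also appear in public records.
QUASI_IDENTIFIERS = ("zip", "dob", "sex")

# Hypothetical released data: names stripped, quasi-identifiers kept.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1970-01-01", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1970-01-01", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list (think voter roll) that still has names.
voters = [
    {"name": "Pat Example", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "dob": "1970-01-01", "sex": "F"},
]


def link(released, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where the key is unique in BOTH datasets."""
    released_by_key = defaultdict(list)
    for row in released:
        released_by_key[tuple(row[k] for k in keys)].append(row)

    public_by_key = defaultdict(list)
    for row in public:
        public_by_key[tuple(row[k] for k in keys)].append(row)

    matches = []
    for key, rows in released_by_key.items():
        people = public_by_key.get(key, [])
        # Only an unambiguous one-to-one match re-identifies someone.
        if len(rows) == 1 and len(people) == 1:
            matches.append((people[0]["name"], rows[0]["diagnosis"]))
    return matches


if __name__ == "__main__":
    print(link(medical, voters))
```

In this made-up data only the first record is re-identified; the two records sharing a ZIP, birthdate, and sex stay ambiguous, which is the 13% case from the summary.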

    • by ArsonSmith (13997)

      perhaps the 87% part?

    • The idea was that everyone can be identified with only the birth date, gender and ZIP code? So... err... There is, in fact, not even one ZIP code that has two people living there of the same gender that happen to share a birthday? Sure, to have the year coincide would take a bit more than just the date itself but it's hard for me to imagine that this could be true.

      Well, there is some collision, 13% of the people have one. But 87% don't. You pass the 87% likelihood that there will be at least one pair (ass

      • Sorry to respond to myself... but "You pass the 87% likelihood that there will be at least one pair (assuming equal birthdates over the last 80 years and equal gender ratios) at 58,438 people in a zip code." is a mistake.

        Should be: You pass the 87% likelihood that there will be at most one pair (assuming equal birthdates over the last 80 years and equal gender ratios) at 58,438 people in a zip code.

  • by EasyTarget (43516) on Tuesday September 08, 2009 @03:37PM (#29355539) Journal

    Data can either be useful or perfectly anonymous but never both

    What a load of bolaks....

    Supposing you have a list of -just- birth dates for every citizen at the census. You have -only- been given one piece of data per person, the date, nothing more. Just a huge list of dates, sorted chronologically.
    1) The data has been totally anonymised.
    2) You can do all kinds of meaningful analysis on the age demographics of the population. And make policy decisions based on that.

    Fully anonymous data producing useful results.

    • by Abcd1234 (188840)

      Well, I think what your example demonstrates is that *application-specific* anonymization and, in your case, aggregation, can produce data that's both useful and actually anonymous. But I happen to agree with the article that, in the *general* case, it's impossible to take data and anonymize it in a way that retains its usefulness across a large domain of potential applications while simultaneously protecting the anonymity of those in the database.

      'course, when you think about it, that's common sense: To

  • This is much too extreme. There are many good examples of useful data that is for almost all intents and purposes anonymous. Consider the example of anonymous lending libraries [wayner.org] from my book, Translucent Databases.

    The simplest version just pushes the book title through a one-way function. The more complex version also hides the name in a similar way.

    Can the anonymity be stripped away? There are coincidences and connections as Sweeney's examples and the Netflix examples show, but they can be fought by addi

  • Data can either be useful or perfectly anonymous but never both

    I'm not sure I entirely agree with this statement. While it's technically correct, I believe it's misleading...

    It's perfectly possible to hash personally identifiable information into an MD5 sum, to ensure that your records are unique, and then to generate useful statistics based on the resulting aggregate data without releasing significant personal information.

    For instance:

    Key = Hash(Your name + Your Zip + Your Birthday)
    Zipcode
    Birth Decade
    Hobby

    • by Mprx (82435)
      That hash is too easily reversible. Brute force search in order of name popularity.
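
      A minimal sketch of that brute-force search, assuming the key is an MD5 of the plain concatenation of name, ZIP and an ISO-formatted birthday. The name list, the date format and the sample record are all hypothetical. The published record already gives away the ZIP code and the birth decade, so only the name and the exact day have to be guessed.

```python
import hashlib
from datetime import date, timedelta
from itertools import product


def make_key(name: str, zipcode: str, birthday: str) -> str:
    """Key = Hash(name + zip + birthday), with MD5 as the hash."""
    return hashlib.md5((name + zipcode + birthday).encode("utf-8")).hexdigest()


def birthdays_in_decade(start_year: int):
    """Yield every date in the decade as an ISO string (assumed format)."""
    day = date(start_year, 1, 1)
    end = date(start_year + 10, 1, 1)
    while day < end:
        yield day.isoformat()
        day += timedelta(days=1)


def brute_force(key, names_by_popularity, zipcode, start_year):
    """Try names, most popular first, against every birthday in the decade."""
    candidates = product(names_by_popularity, birthdays_in_decade(start_year))
    for name, birthday in candidates:
        if make_key(name, zipcode, birthday) == key:
            return name, birthday
    return None


# Hypothetical name list, ordered by popularity.
NAMES = ["John Smith", "Mary Johnson", "James Williams", "Linda Brown"]

# Hypothetical "anonymized" record as it would be released.
record = {
    "key": make_key("Mary Johnson", "12345", "1974-03-09"),
    "zipcode": "12345",
    "birth_decade": 1970,
}

if __name__ == "__main__":
    print(brute_force(record["key"], NAMES, record["zipcode"], record["birth_decade"]))
```

      A decade holds only a few thousand dates, so even a name list with millions of entries keeps the search space within reach of an ordinary machine.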
    • Right, but not on target. The issue is that if databases include enough data, and seemingly trivial data at that, by selecting good common elements and good databases to generate union result sets, I can show that it really was you at the Game Store that bought Bitchslap III, Nun Terror At The Vatican for PS3 at 7:30 PM last Tuesday, and you were not at the bar watching football like you claim.
      • by Burning1 (204959)

        At what point is it cheaper and more effective to hire a PI to follow me around and root through my garbage?

    • by Abcd1234 (188840)

      Yeah, but the whole point is, given a zip code, birth date, household income, and hobbies, I can probably figure out who you are.

      Fundamentally, the issue is very simple: Given some sort of identifier, and a series of properties about that identifier, if you have enough dimensions of detail, you end up narrowing down your sample so much that you end up with a population of one, that being the person the identifier "hides". It's just that simple.

      The only way to prevent this is to generate crosscuts of data
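
      The narrowing-down step can be sketched with a toy table. All records, field names and values below are invented; the point is only how quickly the candidate set shrinks as each known attribute is applied.

```python
# Invented "anonymized" records: an opaque identifier plus a few attributes.
RECORDS = [
    {"id": "a91f", "zip": "12345", "birth_year": 1974, "income": "50-75k", "hobby": "chess"},
    {"id": "07c2", "zip": "12345", "birth_year": 1974, "income": "50-75k", "hobby": "sailing"},
    {"id": "e5d0", "zip": "12345", "birth_year": 1974, "income": "75-100k", "hobby": "chess"},
    {"id": "3b8e", "zip": "12345", "birth_year": 1982, "income": "50-75k", "hobby": "chess"},
    {"id": "c44a", "zip": "54321", "birth_year": 1974, "income": "50-75k", "hobby": "chess"},
]


def narrow(records, known):
    """Keep only the records that match every known attribute."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]


def shrinking_counts(records, known):
    """Apply the known attributes one at a time and record how many remain."""
    counts = []
    applied = {}
    for field, value in known.items():
        applied[field] = value
        counts.append(len(narrow(records, applied)))
    return counts


# What an adversary might already know about the target (hypothetical).
KNOWN = {"zip": "12345", "birth_year": 1974, "income": "50-75k", "hobby": "chess"}

if __name__ == "__main__":
    print(shrinking_counts(RECORDS, KNOWN))
    print(narrow(RECORDS, KNOWN))
```

      Each attribute on its own matches several people; it is the combination that leaves a single record.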

      • by Burning1 (204959)

        Fundamentally, the issue is very simple: Given some sort of identifier, and a series of properties about that identifier, if you have enough dimensions of detail, you end up narrowing down your sample so much that you end up with a population of one, that being the person the identifier "hides". It's just that simple.

        We go through the same basic process to find information through a search engine -- we attempt to find ways to narrow down the data in such a way that the information we are looking for exist w

  • Ohm is overwrought (Score:3, Informative)

    by feenberg (201582) on Tuesday September 08, 2009 @04:13PM (#29356189)

    I have worked with anonymized government data extensively, and birthdate and zipcode are always considered personally identifiable information. Sometimes birth year is available, and sometimes state or (rarely) county is available, but I have never even heard of a dataset with both. Datasets with month and day of birth are never considered to be anonymized, and are not released. The author of the paper is much overwrought.

  • I have a twin brother living with me. Now try to identify me, Haha!

  • CT scans (Score:3, Insightful)

    by Cajun Hell (725246) on Tuesday September 08, 2009 @07:30PM (#29359313) Homepage Journal

    Have you ever thought about how a "cat" scan works? Forget the 3D aspects and let's just think about how the cross-sectional pictures work.

    Every given reading is just shooting a ray through the target and getting a single number out. This is analogous to aggregate summaries of personal details in data. You know the average income of people in zip code 12345, but no specifics. The trick is, later, just as that CT scan is going to shoot a ray through a certain point again from a different direction, your personal details are going to be summarized again by someone else, in a different way.

    A picture will emerge. The CT scan is going to "see" the bone as distinct from the tissue right here at this pixel, and this person's data will be un-summarized. It just takes enough rays, and eventually all ambiguity goes away.

    A long time ago (about 20 years ago, I think?) there was a neato explanation of a cat scan algorithm in Scientific American. I wish I could find it. Because I bet you could show that article to any "database guy" these days, and they'd nod and smile.
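
    A toy version of the idea, not the actual CT algorithm: four people's incomes laid out on a 2x2 grid, where only sums along rows, columns and one diagonal are ever published. The layout, the function name and the numbers are made up for illustration.

```python
def reconstruct(row_sums, col_sums, diagonal_sum):
    """Recover four hidden values laid out as

           a b
           c d

    from their row sums (a+b, c+d), column sums (a+c, b+d)
    and the diagonal sum a+d.
    """
    r1, r2 = row_sums
    c1, c2 = col_sums
    # (a+c) + (a+d) - (c+d) = 2a, so three "rays" pin down one cell.
    a = (c1 + diagonal_sum - r2) / 2
    b = r1 - a
    c = c1 - a
    d = diagonal_sum - a
    # The remaining column sum acts as a consistency check.
    if abs((b + d) - c2) > 1e-9:
        raise ValueError("the published sums are inconsistent")
    return a, b, c, d


if __name__ == "__main__":
    # Hidden incomes: a=40000, b=55000, c=72000, d=130000.
    # Only these aggregates are "released":
    rows = (95_000, 202_000)
    cols = (112_000, 185_000)
    diagonal = 170_000
    print(reconstruct(rows, cols, diagonal))
```

    No single summary gives anyone away; it is the overlap between summaries taken from different directions that does.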
