Why Anonymized Data Isn't
Ars has a review of recent research, and a summary of the history, in the field of reidentification — identifying people from anonymized data. Paul Ohm's recent paper is an elaboration of what Ohm terms a central reality of data collection: "Data can either be useful or perfectly anonymous but never both." "...in 2000, [researcher Latanya Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex. ... For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. ... Reidentification science disrupts the privacy policy landscape by undermining the faith that we have placed in anonymization."
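The "ZIP code, birthdate, and sex" claim can be made concrete with a small uniqueness count. This is only a sketch: the function name `unique_fraction`, the field names, and the toy records are made up for illustration and are not taken from Sweeney's study or Ohm's paper.

```python
from collections import Counter

def unique_fraction(records, keys=("zip", "birthdate", "sex")):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records) if records else 0.0

# Hypothetical toy records, invented for this example only.
toy_records = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"zip": "02138", "birthdate": "1962-03-14", "sex": "F"},
    {"zip": "20500", "birthdate": "1970-01-01", "sex": "M"},
]

# Two of the four toy records are unique on all three fields.
print(unique_fraction(toy_records))
# Only one of the four is unique on ZIP code alone.
print(unique_fraction(toy_records, keys=("zip",)))
```

A record that is unique on these fields can be matched against any other dataset that carries the same fields plus a name, which is the reidentification risk the summary describes.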
The Only Truly Anonymous Data (Score:1, Informative)
The only way to make sure that data remains truly anonymous is for it to start out as anonymous data. "Scrubbed" data will always be traceable, and often the non-scrubbed source data will leak into the wild.
All hail the glorious Hypno-Google.
Duh. (Score:4, Informative)
Am I the only one who always gives their birthday as 01/01/1970 and their zip code as 20500?
I mean, seriously. They don't need to know. Why would I give 'em the right numbers? They're lucky I even allow them to have rough demographic data.
Re:Paul Ohm? (Score:5, Informative)
Nonsense, it could be an extension of the current Law:
"In electrical circuits, Ohm's law states that the current through a conductor between two points is directly proportional to the potential difference or voltage across the two points, and inversely proportional to the resistance between them. In data anonymity, the law states that the general usefulness of any set of data that originally contained personally-identifiable information is inversely proportional to the degree of anonymity applied to said data."
See, one simple law to memorize, and now data analysts learn just a teensy bit about electricity and EEs learn just a teensy bit about data anonymization.
Remember "Mother Earth" and the Espionage Act (Score:3, Informative)
http://en.wikipedia.org/wiki/Mother_Earth_(magazine) [wikipedia.org]
Mother Earth was an anarchist journal that described itself as "A Monthly Magazine Devoted to Social Science and Literature," edited by Emma Goldman. Alexander Berkman, another well-known anarchist, was the magazine's editor from 1907 to 1915. It published longer articles on a variety of anarchist topics including the labor movement, education, literature and the arts, state and government control, and women's emancipation, sexual freedom, and was an early supporter of birth control. Its subscribers and supporters formed a virtual "who's who" of the radical left in America in the years prior to 1920.
In 1917, Mother Earth began to openly call for opposition to American entry into World War I and specifically to disobey government laws on conscription and registration for the military draft. On June 15, 1917, Congress passed the Espionage Act. The law set punishments for acts of interference in foreign policy and espionage. The Act authorized stiff fines and prison terms of up to 20 years for anyone who obstructed the military draft or encouraged "disloyalty" against the U.S. government. After Emma Goldman and Alexander Berkman continued to advocate against conscription, Goldman's offices at Mother Earth were thoroughly searched, and volumes of files and detailed subscription lists from Mother Earth, along with Berkman's journal The Blast, were seized. As a Justice Department news release reported:
"A wagon load of anarchist records and propaganda material was seized, and included in the lot is what is believed to be a complete registry of anarchy's friends in the United States. A splendidly kept card index was found, which the Federal agents believe will greatly simplify their task of identifying persons mentioned in the various record books and papers. The subscription lists of Mother Earth and The Blast, which contain 10,000 names, were also seized."
Mother Earth remained in monthly circulation until August 1917. Berkman and Goldman were found guilty of violating the Espionage Act, were imprisoned for two years, and were later deported.
Re:Duh. (Score:1, Informative)
Am I the only one who always gives their birthday as 01/01/1970 and their zip code as 20500?
But be careful. Using the same fake data consistently still allows someone to correlate across different records. For instance, the aggregate data from the various websites where you've filled in the same values would identify you (with reasonably high probability) as a single person. Then all it takes is one database with enough info to link back to your real identity for your anonymity to be gone again.
I'm not saying that the average company would go to that much effort. I'm just saying that if you're going to be paranoid about anonymity, you should vary the data you provide somewhat randomly.
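A rough sketch of the correlation described above, assuming two hypothetical datasets that both store the same made-up birthdate and ZIP code. The function `link_records`, the dataset names, and every field value are invented for illustration. It also assumes the fake pair is distinctive; widely used defaults would match many unrelated people.

```python
from collections import defaultdict

def link_records(*datasets, keys=("birthdate", "zip")):
    """Group rows from (name, rows) datasets by `keys`; keep groups spanning 2+ datasets."""
    groups = defaultdict(list)
    for name, rows in datasets:
        for row in rows:
            groups[tuple(row[k] for k in keys)].append((name, row))
    # A key value seen in only one dataset links nothing, so drop it.
    return {k: v for k, v in groups.items() if len({n for n, _ in v}) > 1}

# Hypothetical data: the same fake birthdate/ZIP pair reused on two sites.
forum = ("forum", [
    {"birthdate": "1970-01-01", "zip": "20500", "handle": "anon_coward_42"},
    {"birthdate": "1985-06-02", "zip": "94110", "handle": "other_user"},
])
shop = ("shop", [
    {"birthdate": "1970-01-01", "zip": "20500", "email": "real.name@example.com"},
])

linked = link_records(forum, shop)
for key, rows in linked.items():
    print(key, [name for name, _ in rows])
```

Here the shop row carries an email address, so the forum handle that shares the same fake pair is tied to it. Varying the fake values per site removes the shared key.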
Re:Three things? Really? (Score:5, Informative)
That Paradox ignores the year. Add that in and it starts to become harder.
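To put rough numbers on that, here is a sketch of the standard birthday-collision calculation, assuming birthdates are uniformly distributed (they are not, exactly) and, as an arbitrary assumption for illustration, an 80-year spread of birth years. The function name `collision_probability` is made up.

```python
def collision_probability(n, possible_dates):
    """Probability that at least two of n people share a date, assuming uniform dates."""
    if n > possible_dates:
        return 1.0  # pigeonhole: a shared date is certain
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (possible_dates - i) / possible_dates
    return 1.0 - p_all_distinct

# Day and month only: 23 people give slightly better than even odds of a match.
day_only = collision_probability(23, 365)

# Day, month and year over an assumed 80-year span: a match among 23 people is rare.
with_year = collision_probability(23, 365 * 80)

print(round(day_only, 3), round(with_year, 4))
```

The paradox is about any two people in a group sharing a date. Adding the year multiplies the number of possible values, so collisions in a small group (say, one ZIP code and one sex) become much less likely, which is why the full birthdate is so identifying.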
Re:Duh. (Score:3, Informative)
(like the scanning of IDs at liquor stores or bars to check age--there is a birthdate listed on IDs for a fucking reason people--not that they can scan my rare earth magnet swiped ID anyway)
That's not to check age; that's to check for counterfeits with mismatched mag data, or mismatched 2-D barcode data, or missing UV ink prints, or missing holograms, etc. etc.
Re:Paul Ohm? (Score:5, Informative)
Okay, let's take a road. The speed at which traffic can travel depends on the quality of the surface, gradient, camber, zoning, etc. Let's call this the "road conditions", with a lower number being better roads.
The number of cars that want to get through that road is a primary unit, which we can refer to as the "volume of traffic".
The third major criterion is the speed at which the traffic actually flows. This is the "actual flow" of traffic -- in other words, the "influence of other cars" on the traffic congestion.
In other words:
volume = influence of traffic * road conditions
or:
V = IR
Ohm is overwrought (Score:3, Informative)
I have worked with anonymized government data extensively, and birthdate and zipcode are always considered personally identifiable information. Sometimes birth year is available, and sometimes state or (rarely) county is available, but I have never even heard of a dataset with both. Datasets with month and day of birth are never considered to be anonymized, and are not released. The author of the paper is much overwrought.
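The kind of coarsening described here (birth year instead of full birthdate, no ZIP code) can be sketched as a generalization step followed by a check of the smallest group size. This is an illustration only: `generalize`, `smallest_group`, the field names, and the toy records are hypothetical and do not describe any agency's actual release procedure.

```python
from collections import Counter

def generalize(record):
    """Keep only the birth year and sex; drop ZIP code and month/day of birth."""
    return {"birth_year": record["birthdate"][:4], "sex": record["sex"]}

def smallest_group(records, keys):
    """Size of the smallest set of records sharing the same values for `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values()) if counts else 0

# Hypothetical raw records, invented for this example.
raw = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"zip": "02139", "birthdate": "1945-02-11", "sex": "M"},
    {"zip": "02138", "birthdate": "1962-03-14", "sex": "F"},
    {"zip": "02141", "birthdate": "1962-11-30", "sex": "F"},
]
coarse = [generalize(r) for r in raw]

# Raw data: every record is alone in its group.
print(smallest_group(raw, ("zip", "birthdate", "sex")))
# Coarsened data: each record shares its values with at least one other.
print(smallest_group(coarse, ("birth_year", "sex")))
```

A smallest group of 1 means at least one person is uniquely identifiable from those fields. Coarsening raises that number at the cost of detail, which is the usefulness-versus-anonymity trade-off from the summary.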
Re:Mission Impossible (Score:1, Informative)
Well, recently in the Netherlands a guy tried to do just that: he wanted to use video tapes to prove he was somewhere else. The problem was that the DA 'lost' those video tapes. I tried to find a link again but was unsuccessful. Any other Dutch news finders up to the task?