Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For the users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.
In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to their online DVD rental service. The company then offered $1 million to anyone who could improve the company's system of DVD recommendation. In order to protect its customers' privacy, Netflix anonymized the data set by removing any personal details.
The same thing happened back in 2006 when AOL released the search records of 500,000 of its users. Within days of the database's release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman's searches were traced back to her, ranging from "60 single men" to "dog that urinates on everything." ...
In the Netflix case, the anonymized data could be de-anonymized only because researchers could draw on the independent, publicly accessible ratings at IMDb, and only for those Netflix users who had posted a good number of ratings at both services. In the AOL case, researchers could draw on the users' search strings, and only when they had a good number of search strings from the same person. So in both cases, they could identify only a subset of users, not all users. (However, I agree that even this is a deplorable invasion of privacy.)
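To make the linkage idea concrete, here is a toy sketch in Python. It is not the researchers' actual method; the function name `link_records`, the `min_overlap` threshold, and all of the sample data are hypothetical, chosen only to illustrate why a match requires a good number of ratings on both services.

```python
def link_records(anonymized, public, min_overlap=3):
    """Map anonymized IDs to public names when the overlap is large and unambiguous.

    Both arguments map an identifier to a set of (title, rating) pairs.
    The threshold and exact-match rule are illustrative assumptions.
    """
    matches = {}
    for anon_id, anon_items in anonymized.items():
        # Count the (title, rating) pairs shared with every public profile.
        scores = sorted(
            ((len(anon_items & items), name) for name, items in public.items()),
            reverse=True,
        )
        if not scores:
            continue
        best_score, best_name = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        # Accept only a large overlap that beats every other candidate.
        if best_score >= min_overlap and best_score > runner_up:
            matches[anon_id] = best_name
    return matches


# Hypothetical data: "user_1" rated many of the same films publicly; "user_2" did not.
anonymized = {
    "user_1": {("Film A", 5), ("Film B", 2), ("Film C", 4), ("Film D", 1)},
    "user_2": {("Film A", 5)},
}
public = {
    "alice": {("Film A", 5), ("Film B", 2), ("Film C", 4)},
    "bob": {("Film A", 5), ("Film E", 3)},
}

print(link_records(anonymized, public))
```

In this sketch only `user_1` is linked to a public profile. `user_2` shares a single rating with two different public profiles, so there is neither enough overlap nor a unique best candidate, which mirrors the point that only a subset of users can be identified.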
Although the Netflix and AOL datasets show two different kinds of vulnerability, most datasets on human beings will show neither kind. For example, anonymized data on medical patients could be de-anonymized only if researchers could work backwards from clusters of symptoms and treatments to individuals. Neither the Netflix nor the AOL episode raises the risk of that.
I believe that Soghoian meant to limit his claim to data on users of web services, even though his headline goes well beyond that category. I understand that headlines cannot capture every nuance of the articles they describe, and I run into this problem with my own blog every day. But I want to underscore the point about the larger world of open data in case someone draws the wrong conclusion. In the research landscape at large, very few open datasets are about human beings at all, and even fewer are about users of web services. Even under the worst-case scenario in which all anonymized user data could be de-anonymized, the impact on open research data would be relatively small.
Peter Suber at 11/30/2007 02:25:00 PM.