March 3, 2009 | By Peter Eckersley

Last.fm and the Diabolical Power of Data Mining

Recently, there was a minor scandal when TechCrunch accused Last.fm of turning over information (the identities of people listening to copies of a leaked U2 album) to the RIAA. Last.fm issued a scathing denial of these allegations, and it's good to hear that the site hasn't turned into a worldwide music surveillance system. Not on purpose, that is. Last.fm's avowed innocence isn't quite the end of the story. The whole kerfuffle should remind us that websites that collect and republish seemingly innocuous facts about their users are often vulnerable to data mining. It doesn't matter whether you keep the users' names and addresses secret: the facts you publish about them may be sufficient to ensure that there is only one person on the whole wide web to whom those facts pertain. [1]

This isn't a problem that's unique to Last.fm in any way. Networked computer systems often leak secrets in unexpected ways, but Last.fm serves as a particularly clear example of why anonymity is hard to achieve.

More on this risk, and what to do about it, after the jump.

For those who haven't used it, Last.fm is a site that collects information about your music listening habits. It gets the data from an audio player plug-in which you can install on your computer; the plug-in sends the data off with a username and password, so that Last.fm has a complete record of everything that username listened to, and when. Last.fm aggregates this data to create a nifty homepage for each user, including charts of your favorite artists and tracks, links to other people who enjoy similar stuff, and recommendations for new music that you're likely to like. Here's an example.

De-anonymizing your taste in music and friends

Sites like Last.fm face a challenging privacy problem. Many of their users are happy to share their music tastes openly with everyone, and that's easy. But others may only want to share that information with a limited group of friends, or may be happy to have strangers (and record companies) see their music collections, but only with the protection of a veil of anonymity.

If your username isn't your real name, your account may seem to be anonymous, but the facts it contains probably tell the world who you are.

How is this so? Well, every profile contains a username, data on artists and tracks you've listened to at different times, a list of friends on the site, and a list of "neighbors" with similar musical taste. Each of these is a threat to anonymity:

  • Username: is your username on Last.fm the same as your username on other sites? If so, does your profile on any of the other sites give away your identity?
  • Friends: even a small amount of information about other pseudonyms you've friended on Last.fm has a high probability of allowing a data miner to match you based on the friendship graphs from other sites, like Facebook, LinkedIn, MySpace, LiveJournal, Yelp, etc. [2]
  • Music taste: do you have a MySpace (or Facebook, or LiveJournal) page that displays or discusses the kind of music you like? Are the artists and bands you mention on those pages also near the top of your charts? If so, a data miner could probably link you to your MySpace account, or to a small group of MySpace accounts, one of which is yours.

In combination, these facts mean that even if Last.fm never revealed your IP address to anyone, your public profile on the site could be de-anonymized by careful data mining.
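To make the music-taste threat concrete, here is a minimal sketch of how a data miner might rank public pages by how closely their stated tastes overlap with a pseudonymous profile's top artists. It uses Jaccard similarity, which is just one plausible measure and not something the article or any particular site prescribes. The function names, page names and artist lists are all hypothetical, invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_artists, candidates):
    """Return (page, score) pairs, most similar page first."""
    scores = [
        (page, jaccard(target_artists, artists))
        for page, artists in candidates.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: top artists from a pseudonymous listening profile...
pseudonymous_top_artists = ["Brian Eno", "U2", "Talking Heads", "Can"]

# ...and artists mentioned on three public pages elsewhere on the web.
public_pages = {
    "page_a": ["U2", "Coldplay", "Radiohead"],
    "page_b": ["Brian Eno", "Talking Heads", "Can", "Neu!"],
    "page_c": ["Miles Davis", "John Coltrane"],
}

ranking = rank_candidates(pseudonymous_top_artists, public_pages)
for page, score in ranking:
    print(f"{page}: {score:.2f}")
```

A real attack would combine a score like this with the username and friendship signals listed above; any one signal may only narrow the field to a small group of candidates.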

Defense against data mining

It isn't impossible to use Last.fm or other data sharing sites anonymously; it just requires a great deal of care. For each fact you share, you need to ensure that there aren't other published data points to match it against:

  • select a username that is different from the ones you use on other sites;
  • don't friend your real friends on Last.fm; [3]
  • don't reveal your music tastes publicly in other fora;
  • configure your plugin to send its reports through Tor. [4]

There are also some things that data aggregation sites can do to reduce the risks that data mining poses to their users' privacy. One strategy is to try to prevent third parties from obtaining datasets in the first place, while another is to try to prevent them from using data mining techniques to de-anonymize users. Neither of these is perfect. Last.fm doesn't attempt to limit access to its datasets; in fact, the site's privacy policy is clear that all data (except for email addresses) will be available not only on the web but also through an API. They do attempt to contractually restrict data mining: Section 5.1.6 of the terms of use for the API prohibits identifying users who have not chosen to identify themselves, while clause 7 in the site's Acceptable Use policy might have a similar effect.

It's good that Last.fm has these limits in place, but don't depend on them. If you want facts about yourself to remain secret, be very careful before you let them onto the net.

  • 1. There are only 7 billion people on the planet, and only about a billion on the Internet. Every fact about a person (Are they male or female? Where do they live? Do they listen to Brian Eno?) slices that number down by a significant fraction. If you have enough facts about a person (33 bits of independent facts, it turns out, because log2(7,000,000,000) ≈ 32.7), you can determine who they are.
  • 2. If perfect friendship data were available, matching accounts across sites like this would simply require a data miner to solve the graph isomorphism problem, which turns out to be very easy in practice. Against real social networks, the problem is a probabilistic variant of graph isomorphism, because many people will be friends with someone on social network A, but not on social network B. Nonetheless, Arvind Narayanan and Vitaly Shmatikov are presenting a paper called "De-Anonymizing Social Networks" at the 2009 IEEE Symposium on Security and Privacy, in which they show that matching is possible for a large percentage of users on real social networking sites.
  • 3. Caveat: if you have extremely similar musical tastes and collections to your friends (and many people do), it's possible that your real-life friends will turn out to be amongst your nearest "neighbors" on Last.fm, regardless of whether you actually friend them.
  • 4. You can use the "torify" program to do this if you know how to run your media player from a shell script or command line.
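
The arithmetic in note 1 can be sketched in a few lines of Python. The population figure comes from the note itself; the three facts and the fractions of people they apply to are hypothetical round numbers chosen for illustration, and the calculation assumes the facts are independent of one another, which real facts often are not.

```python
import math

WORLD_POPULATION = 7_000_000_000  # figure used in note 1


def bits_to_identify(population):
    """Bits of information needed to single out one person in a population."""
    return math.log2(population)


def bits_from_fact(fraction_matching):
    """Bits revealed by a fact that is true of the given fraction of people."""
    return -math.log2(fraction_matching)


needed = bits_to_identify(WORLD_POPULATION)

# Hypothetical fractions, for illustration only (assumed independent).
facts = {
    "sex": 0.5,
    "lives in a city of one million": 1_000_000 / WORLD_POPULATION,
    "listens to Brian Eno": 0.01,
}

revealed = sum(bits_from_fact(p) for p in facts.values())
remaining = max(0.0, needed - revealed)

print(f"bits needed:    {needed:.1f}")
print(f"bits revealed:  {revealed:.1f}")
print(f"bits remaining: {remaining:.1f}")
print(f"people still matching: about {2 ** remaining:,.0f}")
```

Each fact contributes -log2 of the fraction of people it matches, so rare facts (a small home town, an obscure band) give away far more than common ones.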
