Last.fm and the Diabolical Power of Data Mining
Recently, there was a minor scandal when TechCrunch accused Last.fm of turning over information — the identities of people listening to copies of a leaked U2 album — to the RIAA. Last.fm issued a scathing denial of these allegations, and it's good to hear that the site hasn't turned into a worldwide music surveillance system. Not on purpose, that is.
Last.fm's avowed innocence isn't quite the end of the story. The whole kerfuffle should remind us that websites that collect and republish seemingly innocuous facts about their users are often vulnerable to data mining. It doesn't matter whether you keep the users' names and addresses secret — the facts you publish about them may be sufficient to ensure that there is only one person on the whole wide web to whom those facts pertain.1
This isn't a problem that's unique to Last.fm in any way. Networked computer systems often leak secrets in unexpected ways, but Last.fm serves as a particularly clear example of why anonymity is hard to achieve.
More on this risk, and what to do about it, after the jump.
For those who haven't used it, Last.fm is a site that collects information about your music listening habits. It gets the data from an audio player plug-in which you can install on your computer; the plug-in sends the data off with a username and password, so that Last.fm has a complete record of everything that username listened to, and when. Last.fm aggregates this data to create a nifty homepage for each user, including charts of your favorite artists and tracks, links to other people who enjoy similar stuff, and recommendations for new music that you're likely to like. Here's an example.
De-anonymizing your taste in music and friends
Sites like Last.fm face a challenging privacy problem. Many of their users are happy to share their music tastes openly with everyone, and that's easy. But others may only want to share that information with a limited group of friends, or may be happy to have strangers (and record companies) see their music collections, but only with the protection of a veil of anonymity.
If your username isn't your real name, your Last.fm account may seem to be anonymous, but the facts it contains probably tell the world who you are.
How is this so? Well, every Last.fm profile contains a username, data on artists and tracks you've listened to at different times, a list of friends on the site, and a list of "neighbors" with similar musical taste. Each of these is a threat to anonymity:
- Username: is your username on Last.fm the same as your username on other sites? If so, does your profile on any of the other sites give away your identity?
- Friends: even a small amount of information about other pseudonyms you've friended on Last.fm has a high probability of allowing a data miner to match you based on the friendship graphs from other sites, like Facebook, LinkedIn, MySpace, LiveJournal, Yelp, etc.2
- Music taste: do you have a MySpace (or Facebook, or LiveJournal) page that displays or discusses the kind of music you like? Are the artists and bands you mention on those pages also near the top of your Last.fm charts? If so, a data miner could probably link you to your MySpace account, or to a small group of MySpace accounts, one of which is yours.
In combination, these facts mean that even if Last.fm never revealed your IP address to anyone, your public profile on the site could be de-anonymized by careful data mining.
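To make the music-taste risk concrete, here is a minimal Python sketch of how a data miner might rank candidate identities by artist overlap. Everything in it is an assumption made for illustration: the profile names, the artist lists, and the choice of Jaccard similarity as the scoring rule are invented, and nothing here describes how Last.fm or any real data miner works.

```python
def jaccard(a, b):
    """Overlap between two collections: shared items / all distinct items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(pseudonymous_artists, public_profiles):
    """Rank public profiles by how closely their artists match the target chart."""
    scores = [
        (jaccard(pseudonymous_artists, artists), name)
        for name, artists in public_profiles.items()
    ]
    return sorted(scores, reverse=True)


# Invented data: the top artists on a pseudonymous listening chart...
chart = ["Brian Eno", "U2", "Talking Heads", "Can"]

# ...and artists mentioned on public pages elsewhere (all hypothetical).
public_profiles = {
    "alice_example": ["Brian Eno", "Talking Heads", "Can", "Neu!"],
    "bob_example": ["U2", "Coldplay"],
    "carol_example": ["Miles Davis", "John Coltrane"],
}

ranking = rank_candidates(chart, public_profiles)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

With only three candidates the best match is obvious. Against millions of profiles, a score like this would more plausibly narrow the field to a small group of accounts, which the other facts (username, friends) could then narrow further.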
Defense against data mining
It isn't impossible to use Last.fm or other data sharing sites anonymously; it just requires a great deal of care. For each fact you share, you need to ensure that there aren't other published data points to match it against:
- select a username that is different to the ones you use on other sites;
- don't friend your real friends on Last.fm;3
- don't reveal your music tastes publicly in other fora;
- configure your Last.fm plugin to send its reports through Tor.4
There are also some things that data aggregation sites can do to reduce the risks that data mining poses to their users' privacy. One strategy is to try to prevent third parties from obtaining datasets in the first place, while another is to try to prevent them from using data mining techniques to de-anonymize users. Neither of these is perfect.
It's good that Last.fm has these limits in place, but don't depend on them. If you want facts about yourself to remain secret, be very careful before you let them onto the net.
- 1. There are only 7 billion people on the planet, and only about a billion on the Internet. Every fact about a person (Are they male or female? Where do they live? Do they listen to Brian Eno?) slices that number down by a significant fraction. If you have enough facts about a person (33 bits of independent facts, it turns out, because log2(7,000,000,000) ≈ 32.7), you can determine who they are.
- 2. If perfect friendship data were available, matching accounts across sites like this would simply require a data miner to solve the graph isomorphism problem, which turns out to be very easy in practice. Against real social networks, the problem is a probabilistic variant of graph isomorphism, because many people will be friends with someone on social network A, but not on social network B. Nonetheless, Arvind Narayanan and Vitaly Shmatikov are presenting a paper called De-Anonymizing Social Networks at the 2009 IEEE Symposium on Security and Privacy, in which they show that matching is possible for a large percentage of users on real social networking sites.
- 3. Caveat: if you have extremely similar musical tastes and collections to your friends (and many people do), it's possible that your real-life friends will turn out to be amongst your nearest "neighbors" on Last.fm, regardless of whether you actually friend them.
- 4. You can use the "torify" program to do this if you know how to run your media player from a shell script or command line.
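The arithmetic in footnote 1 can be checked with a short Python sketch. The 7 billion figure comes from the footnote. The three fractions are invented for illustration (they are not real statistics), and multiplying them together assumes the facts are independent of one another.

```python
import math

WORLD_POPULATION = 7_000_000_000  # figure used in footnote 1


def bits_to_identify(population):
    """Bits of information needed to single out one person in a population."""
    return math.log2(population)


def remaining_candidates(population, fractions):
    """People still matching after applying each independent fact.

    Each fraction is the share of the population for which the fact is true.
    """
    remaining = population
    for fraction in fractions:
        remaining *= fraction
    return remaining


bits_needed = bits_to_identify(WORLD_POPULATION)
print(f"bits needed: {bits_needed:.1f}")  # 32.7

# Hypothetical fractions: sex (1/2), city (1/1000), listens to a given artist (1/100).
example_fractions = [0.5, 0.001, 0.01]
bits_learned = sum(-math.log2(f) for f in example_fractions)
remaining = remaining_candidates(WORLD_POPULATION, example_fractions)
print(f"bits learned: {bits_learned:.1f}")
print(f"people still matching: {remaining:,.0f}")
```

Each fact contributes -log2(fraction) bits, so rarer facts are worth more. A coin-flip fact is worth one bit, while a fact true of one person in a thousand is worth about ten.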