Sunday, 16 June 2013

There's no such a thing as harmless data

After the leak of spying project Prism, one of the many lame excuses given to justify such an immoral thing is that the data which is being collected is not the content of messages, just the sources, destinations and times.

As I have been working with machine learning for quite some time, I feel compelled to explain that even with this apparently harmless amount of data we can discover a lot about someone.

It is common place that everywhere you buy things on the internet gives you suggestions based on previous items you have bought. The more you buy, the most accurately the suggestions seem to agree with your taste. In this case, however, you think this is reasonable given that you are actually giving them what seems to be the appropriate information to infer just that.

What you don't really know, and usually probably don't really care, is how this is done. There are many ways, in fact, but let me explain one of them. The one that will make you understand how powerful these methods can be.

In fact, I will explain the one I have worked on two years ago in collaboration with Joerg Reichardt, a colleague from Germany who is a specialist in complex networks. Networks are simply formed by objects that are connected by something. Graphically, the objects are represented by dots and, when they have a connection, we draw a line between them. The graph below is an example taken from Wikipedia:


For instance, the dots, or in the above case the circles, can be computers and the lines a physical connection between them. The interesting thing is that the concept of connection can be generalised to any kind of relationship between the objects. One example could be a graph where the dots, also called vertices or nodes, can be any kind of real objects and the lines, technically called edges, could connect any two objects that share some colour in common. 

A very interesting kind of network, from the commercial point of view, is the one relating consumers to films in sites like Lovefilm or Netflix. This is a special case of a network which is represented by a bipartite graph. By that, we mean a graph with two different kinds of nodes. In this case, we will have nodes which are films and nodes which are consumers. But to be considered bipartite, a graph like that must obey one more condition, that nodes never connect to other nodes of the same kind. For our example, that means that we always connect consumers to films, but never consumers to consumers of films to films. 

The ways connections are put are now very clear. If a consumer watches a film, a connection is established. In the end, we have a network like this:


A bit of advertisement here. This was taken from our work, The Interplay between Microscopic and Mesoscopic Structures in Complex Networks, which can also be found in my website. I will explain the work in a bit, but let me now continue with the graph above. You can clearly see now that the name "bipartite" is really appropriate. Consider that the green lozenges are consumers and the blue circles are films. The red line means that, at least once, consumer number 4 watched film g.

This is a very compact way of visualising this kind of who watched what information. Of course the website can do even better and collect other data like the rating each consumer gives to each film, but we can already do a lot without it.

The next step is now to use it to give recommendations. The cool way in which you can do that is the following. You might think that if someone watch a horror film, then we can recommend another horror film to that person, but that is too simplistic. What if the person watched 10 romantic films and only one horror film? We'd do better by suggesting another romantic film, right? Things would be so easy if people were easy to read like that...

What usually happens is that people watch dozens, sometimes hundreds, of films and most of them can watch a huge number of different categories of films. Consider that this happens to thousands or millions of consumers and you can understand why we call it a complex network.

But think about the kinds of person you know. Geeks usually watch a lot of science fiction, adventure and even horror, but much less romantic films. Non-geeks usually watch some science fiction, but would watch much more adventure films, for instance. The trick we want to perform is that of, by looking at our bipartite graph, being able to identify something like that, communities of people with similar interests. If we succeed, our recommendations will surely be more appropriate. This is called community detection or clustering in graphs. It's not very easy to see in bipartite graphs, but it's very clear when you look at graphs like the one below:


This graph was taken from the paper Mixture models and exploratory analysis in networks by Newman and Leicht and represents a network of friendships of members of a karate club which split in two because of internal disputes. You can see that the members are separated in two clusters of nodes. Connections between members are much more frequent inside the clusters than outside. Finding these clusters might be easy for relatively small networks, but is tricky for large ones. The interesting thing is that the authors of the above paper found an algorithm that, given the nodes and edges only, could find the split almost perfectly.

Now you should start to be scared. Imagine that someone constructs this kind of network by connecting two persons if they speak through their mobile phones for, let's say, more than ten minutes. Using the above algorithm, this person can classify people in groups. If this person has some information about the interests inside each group, then a better classification can be achieved. Imagine now that this classification can be geeks, businessmen, religious people or... terrorists. 

Let's come back to our films, because now you can understand how that information can allow for a finer classification than the karate club one above. By using the algorithm that I and my friends developed, for instance, you can cluster people in groups relative to their interests in films, books, products and so on. You don't need access to the text of their messages, nor their ratings and even less their reviews. The simple connection information becomes enough.

The amazing thing about the bipartite example is that people never connect directly with one another, only indirectly by the products. And the algorithm doesn't need to have any prior information about what profile of people buy which product, it can cluster both the persons and the products in groups at the same time!

If you're a professional in the area and look the papers above, and also others in the field, you will see that the results are very good. The amount of information extracted can be impressive. And that was obtained simply by the information contained in the connections of the graph! Of course the algorithms are not perfect, which is a terrible thing actually, because bureaucrats won't care

The bottom line is that every private piece of information about you that leaks from websites or mobiles can reveal much more about you than you can imagine. The paper I published appeared on 2011. These things evolve very fast and, if you have enough computer power, you can do amazing things.

So, don't believe when the agencies say that the data they are collecting is not very informative. First, it is, second, if it wasn't, either they would not be collecting it, or they would be idiots wasting a lot of money and time. And, believe me, they are not.

No comments:

Post a Comment