2012 / Trend#3 / Data is the new Black

It's no wonder Karl Lagerfeld was at LeWeb this year! Fashion breathes the air we all breathe and makes something different out of it. Our air today is heavy with all things digital and data-related, so one must be there to stay current. The data deluge has already made headlines in recent years. What is to come, however, is of a very different nature. The ease of data use is trickling down thanks to user-friendly applications; data is reaching the long tail at light speed. We should get ready for yet another flood, not of data this time but of data-based products.

In a recent publication, Mike Loukides tackles "the evolution of data products" and tries to establish a minimalist taxonomy. His reflection rests on three observations:

  1. Data is disappearing. It is becoming invisible. The user has no idea how much data is involved in even the simplest mobile applications (finding a place to park with your smartphone is a great example in that sense; it mobilizes around 5 to 10 gigantic data sets provided by 10 to 15 organizations)
  2. Data is being combined (as in the example above; and needless to say, the further we go down this road, the cleverer the combinations that will emerge; keep an eye on hackathons for that purpose)
  3. Data is personalized (back to Eli Pariser's "Filter Bubble")

Loukides makes the point that original exploitation of data along these three axes will be the differentiating factor in the years to come. Looking at the data itself, though, its origin is instrumental in determining its use:

  1. Some data products are simply based on existing data: transport data, traffic data, weather data and so on are products in themselves
  2. Other data products need user-generated data to make sense: the "Quantified Self" movement is big in this category. Here you'd delve into all the calorie-counting, sleep-optimizing, blood-pressure-measuring apps where you enter personal data in order for the app to generate some useful information
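As a toy illustration of that second category, here is a minimal sketch of a calorie-tracking "data product" that produces nothing useful until the user feeds it personal data. The function name, fields and the 2000 kcal default are illustrative assumptions, not any real app's API:

```python
# Toy "Quantified Self" sketch: the app is useless until the user
# enters personal data. All names and numbers are made up.

def daily_report(meals_kcal, workouts_kcal, target=2000):
    """Turn user-entered calories in/out into a tiny 'data product'."""
    net = sum(meals_kcal) - sum(workouts_kcal)
    return {"net_kcal": net, "within_target": net <= target}

# The user logs three meals and one workout...
print(daily_report([450, 700, 900], [300]))
# ...and the app derives something the raw entries didn't say:
# {'net_kcal': 1750, 'within_target': True}
```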

Both kinds of data products will soar, since both user-generated data and, well, non-user-generated data are skyrocketing in size. But to better understand where the innovation is happening, one enlightening example is the social networking data taxonomy Bruce Schneier came up with:

  • Service data is the data you give to a social networking site in order to use it. Such data might include your legal name, your age, and your credit-card number
  • Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on
  • Entrusted data is what you post on other people's pages. It's basically the same stuff as disclosed data, but the difference is that you don't have control over the data once you post it -- another user does
  • Incidental data is what other people post about you: a paragraph about you that someone else writes, a picture of you that someone else takes and posts. Again, it's basically the same stuff as disclosed data, but the difference is that you don't have control over it, and you didn't create it in the first place
  • Behavioral data is data the site collects about your habits by recording what you do and who you do it with. It might include games you play, topics you write about, news articles you access (and what that says about your political leanings), and so on
  • Derived data is data about you that is derived from all the other data. For example, if 80 percent of your friends self-identify as gay, you're likely gay yourself

Disclosed, entrusted, incidental and behavioral data are all user-generated. Service data is also user-generated, but it's the same kind of "cold" data a government is likely to have in its files (legal name, age, credit-card number): personal, but not too personal; it doesn't define who you are in essence. Derived data is not user-generated but rather algorithm-induced. That last kind is a data product. It is the stuff of the future.
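Schneier's derived-data example can even be sketched in a few lines. This is a deliberately naive guess-by-majority rule, my own illustration rather than any network's real algorithm; only the 80 percent threshold comes from his example:

```python
# Naive "derived data" sketch: guess an attribute a user never
# disclosed from what their friends disclosed. Data structures
# and names are illustrative assumptions.

def derive_attribute(friends, attribute, threshold=0.8):
    """Return True if at least `threshold` of friends self-identify
    with `attribute` -- the user probably shares it."""
    if not friends:
        return False
    share = sum(attribute in f["tags"] for f in friends) / len(friends)
    return share >= threshold

friends = [
    {"name": "a", "tags": {"gay"}},
    {"name": "b", "tags": {"gay"}},
    {"name": "c", "tags": {"gay"}},
    {"name": "d", "tags": {"gay"}},
    {"name": "e", "tags": {"runner"}},
]

print(derive_attribute(friends, "gay"))  # 4/5 = 80% -> True
```

Nothing in the user's own profile says a thing; the conclusion is induced entirely by the algorithm, which is exactly what makes derived data a product rather than a record.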

Being personally obsessed with the predictive power of data, and having read all that I've read, it was the only thing I wanted to see, the killer app of data products I secretly desired. And looking at all these notes in front of me, it seems like that's the real trend to come. We've all heard Eric Schmidt's famous quote:

"Google needs to move beyond the current search format of you entering a query and getting 10 results. The ideal would be us knowing what you want before you search for it..."

Only Google isn't the only organization with data. Governments, marketing companies and entertainment companies also hold substantial quantities. The companies have been tracking your clicks, and the government could know everything there is to know about you if it pooled the material from the State's different departments. This last fact is what is now being called algorithmic government, or, put more simply: organized human behavior predictability. If Google can predict what you will search for based on how much it knows about you, then given the data the government has, you can bet it has a shot at guessing what you're about to do. Especially since it now has the tools for it.

And no, I'm not talking about satellites but rather about data integration software able to handle petabytes and crunch them in seconds. But prediction will always be an art (i.e. not a science), and two practices are frequently used:

  1. Qualitative work based on nugget-finding (mainly human-dependent, because of the complexity of the conclusions that need to be drawn). Software like Recorded Future can do that now. This is an incredible company with the ability to semantically detect future-oriented sentences on the web and aggregate them in order to generate insight about what will probably make the news tomorrow. This is actually the technique fortune-tellers use: based on clues they pick up from the conversation with their client, they make nugget-based predictions. Hence the inaccuracy, given the scarcity of the data.
  2. Quantitative work based on machine learning and heavy data use (mainly machine-dependent, because of the mass of data). Here you'd find the likes of bit.ly, the short-link generator, which uses all those links simply to see what's trending; you'd also find Google Flu Trends, which spots where flu is hitting by seeing where searches for the word "flu" are the most frequent, in every possible language. It's based on word counts and localization, so the results are generated for humans to read and analyze, not to decipher as in Recorded Future's case. Machines are of course prone to mistakes, and to something data miners call over-fitting: wanting to fit data to a model they've induced from previous learning even though the sample size (and logic!) does not allow for that type of conclusion.
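The first, qualitative practice can be caricatured in code. The pattern list below is my own toy assumption of what "detecting future-oriented sentences" might look like; Recorded Future's actual semantic technology is of course far richer:

```python
import re

# Toy sketch: flag sentences whose wording points to the future --
# the raw material a Recorded-Future-style system would aggregate.
# The regex is an illustrative assumption, not their method.
FUTURE = re.compile(
    r"\b(will|is going to|plans to|expected to|next (week|month|year))\b",
    re.IGNORECASE,
)

def future_sentences(text):
    """Return the sentences in `text` that talk about upcoming events."""
    return [s.strip() for s in re.split(r"[.!?]", text)
            if FUTURE.search(s)]

news = ("The company will announce a merger next week. "
        "Its shares fell 3% yesterday. Analysts are expected to react.")
print(future_sentences(news))
# ['The company will announce a merger next week',
#  'Analysts are expected to react']
```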
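And the over-fitting trap can be reproduced in a few lines. Below, a polynomial is forced through every noisy sample (via Lagrange interpolation, my own choice for the sketch): the training error is exactly zero, yet the extrapolation is absurd:

```python
# Over-fitting sketch: five noisy points lie roughly on y = x, but the
# degree-4 polynomial through all of them "learns" the noise too.

def lagrange(points, x):
    """Evaluate at x the unique polynomial through all `points`."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Roughly y = x, plus measurement noise.
points = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.3)]

# Perfect fit on the training data...
print(round(lagrange(points, 2), 6))   # 2.2
# ...but a wild prediction where the trend says "about 10":
print(round(lagrange(points, 10), 1))  # 474.6
```

With only five samples, the "perfect" model is the worst predictor; a plain straight-line fit would have done far better, which is exactly the sample-size point above.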

Needless to say, the best tools of the coming years will be combinations of the two: tools that allow a symbiotic interaction of human and machine to analyze the data, where the machine takes the repetitive, boring work out of the analyst's way, letting him focus on what he's really good at: thinking. We won't be able to unlock the future. I know that now. But we can see the present more clearly in order to avoid the pitfalls to come. That's why we need machines not to replace, but to support, our insight-generating minds.

As Karl Lagerfeld puts it: "Somebody still needs to find the ideas, and it's not the machines who are going to do it."