I’ve recently posted here about how most major social media applications are a bit like an iceberg – and how most of us only get to see the tip of it. I’ve also looked at how the internet could be seen as a vast conspiracy to make us all machine readable.
Today, I’d like to focus on how we can start to intelligently mine that huge firehose of information and extract useful, meaningful information from it.
This is particularly exciting for us because it’s information that we’ve never been able to see before – a huge torrent of observation, comment and hard data that is being produced in a machine-readable format for the first time ever.
To put this into context, take any point in recent history and imagine you could get millions of people to come to you and volunteer opinions or interesting extracts from the things they’ve read or seen, in a way that lets you tabulate it, weigh it for value and get information from it.
On a slightly darker note, imagine that they do this in a fairly guileless way – often even putting info into our hands that we could exploit or abuse.
Imagine what the Stasi could have done with it. Imagine what benign forces of law-and-order could do with it in order to undermine crime or terrorism. Pop on your ‘Liberal’ thinking cap and spot the potential dangers here. Then try on your ‘sales and marketing’ cap and look at the opportunities, or use your ‘consumerist’ eyes to see the threats.
It’s all exciting, fascinating and worrying in equal measure.
So what can be achieved – with the right tools?
Think about Twitter, for example, as a huge database of these comments.
That’s basically what it is, after all. 200,000,000 comments a day. An utterly vast database that constantly imports a torrent of characters.
Let’s pick a random tweet for illustration purposes.
Here we can see the following things:
- A sentence that has meaning of some sort
- A collection of words that we can juxtapose (FT, airing, tax, idea, etc)
- A tweeter about whom we can draw some conclusions (how influential they are, who they associate with, sometimes, where they live and work). There are lots of clever tools that help to determine & qualify ‘influence’ and Umair’s Klout score is very impressive here
- We can sometimes aggregate their comments to draw conclusions about what they do for a living, what their interests are, etc (again, Klout tells us Umair is influential about capitalism, politics & markets, but there’s plenty more that we could find out from his Twitter profile and what he says)
- In a small percentage of cases, people even have ‘geo-tagging’ enabled on their tweets so we can nail down whereabouts they’re tweeting from – so we can run a search on people in a specific area if we like (usually at 25m radius)
- If not, there are plenty of other clues – his profile says London/West Coast and conclusions can be drawn from his subject matter and who he associates with
- They may have included links to web-pages or pictures – sometimes the web-page URL may include a particular word (eg www.guardian.co.uk/football/2011/sep/28/carlos-tevez-denies-refusing-play) – and those words will be added to the list of words that can be juxtaposed in that tweet. This also applies to URLs that have been masked by a ‘URL shortener’ such as bit.ly or Twitter’s own shortener.
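To make the ‘words we can juxtapose’ idea a little more concrete, here is a minimal Python sketch that pulls the words out of a tweet, including any words sitting in the path of a link. The function name `tweet_terms` and the sample tweet text are my own inventions for illustration. It also assumes the link is already in its full form: a shortened bit.ly link would first have to be expanded by following the redirect, which this sketch does not attempt.

```python
import re
from urllib.parse import urlparse

# Rough patterns: good enough for a sketch, not for production use.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WORD_PATTERN = re.compile(r"[a-z]+")

def tweet_terms(text):
    """Return the lowercase words in a tweet, including words found in URL paths."""
    urls = URL_PATTERN.findall(text)
    # Words from the visible sentence, with the URLs removed first.
    prose = URL_PATTERN.sub(" ", text).lower()
    terms = WORD_PATTERN.findall(prose)
    for url in urls:
        # urlparse only recognises the path when a scheme is present.
        if not url.startswith("http"):
            url = "http://" + url
        path = urlparse(url).path.lower()
        terms.extend(WORD_PATTERN.findall(path))
    return terms

# Hypothetical tweet text, using the Guardian URL mentioned above.
sample = "Tevez row www.guardian.co.uk/football/2011/sep/28/carlos-tevez-denies-refusing-play"
terms = tweet_terms(sample)
```

Only the path of the link is mined here, so ‘guardian’ itself is not added as a word. Whether the domain name should count is a design choice.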
The more slowly and painstakingly we look at a collection of tweets, using human eyes, the more we can find.
But that’s a huge job – and human eyes often bring unconscious biases, selectivity and mis-perceptions to the table. And as any marketeer or pollster will tell you, the real value is often in looking at what comes out when we sift large volumes of data rather than small snapshots.
So we may want to get cleverer about the way our machines process this information and extract data from it.
Now imagine you have a vast bank of computing power at your disposal. Imagine Twitter has let you access this huge database/torrent of information in a way that allows you to process it cleverly.
At the crudest level, we can run a sentiment analysis programme over a particular tweet. I’ll be posting on sentiment analysis and other forms of automated language processing in more detail shortly, but please bear with me for now and accept my assertion that it can machine-read large amounts of information and tell us whether references to (picking a name out of the air) Adair Turner are generally positive or not.
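For a flavour of what ‘the crudest level’ might look like, here is a toy Python sketch that scores a tweet by counting words from two short lists. The word lists and the function name `sentiment` are assumptions made up for this illustration. Real sentiment analysis tools use far larger lexicons or trained models, and this one will be fooled by sarcasm, negation and much else.

```python
import re

# Tiny hand-made word lists, purely illustrative (an assumption, not a real lexicon).
POSITIVE = {"good", "great", "impressive", "right", "sensible"}
NEGATIVE = {"bad", "wrong", "awful", "damaging", "daft"}

def sentiment(text):
    """Label a piece of text as 'positive', 'negative' or 'neutral'."""
    words = re.findall(r"[a-z']+", text.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Run over thousands of tweets that mention a name, even something this blunt starts to show which way the wind is blowing, though any single verdict should be taken with a pinch of salt.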
We can run a query to see what other words crop up in tweets that also include the word ‘Adair’. Of course, there are other people called Adair around and our sentiment analysis will never be perfect, but we can look at the context of each of these co-occurring words (i.e. words that appear in tweets with the word ‘Adair’ in them) and draw conclusions about them as well.
Do those words occur in a context that lends positive, negative or neutral sentiment towards them?
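The ‘what other words crop up’ query can be sketched in a few lines of Python. This assumes the tweets are already sitting in a plain list of strings. The function name `co_occurring_words`, the stop-word list and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

# A handful of very common words to ignore (illustrative, far from complete).
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}

def co_occurring_words(tweets, target):
    """Count the words that appear in the same tweets as the target word."""
    target = target.lower()
    counts = Counter()
    for text in tweets:
        # A set, so each word is counted at most once per tweet.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if target in words:
            counts.update(words - {target} - STOPWORDS)
    return counts

# Made-up tweets for illustration only.
tweets = [
    "Adair Turner airing a tax idea in the FT",
    "Adair on tax again",
    "Nothing to see here",
]
neighbours = co_occurring_words(tweets, "Adair")
```

Each of the words that comes back could then be fed through the same sentiment check, which is the question the paragraph above is asking.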
We can look at a graph showing when these tweets happened. Are there real-world events (TV appearances, news stories, etc) that provoked them?
Then let’s take that single-line graph showing the numbers on a timeline and give it a colour. Let’s say green = positive, red = negative and blue = neutral. If a particular real-world event results in that graph surging and changing colour, we can learn something from that.
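Here is a rough Python sketch of the data behind such a coloured graph. It assumes each tweet has already been given a timestamp and a sentiment label, buckets the tweets by hour, and colours each hour by whichever sentiment is most common in it. The function name `coloured_timeline` and the hourly bucket size are my own assumptions, and the actual drawing of the graph is left out.

```python
from collections import Counter, defaultdict
from datetime import datetime

COLOURS = {"positive": "green", "negative": "red", "neutral": "blue"}

def coloured_timeline(labelled_tweets):
    """Turn (timestamp, sentiment) pairs into (hour, tweet count, colour) rows."""
    buckets = defaultdict(Counter)
    for when, label in labelled_tweets:
        # Round each timestamp down to the start of its hour.
        hour = when.replace(minute=0, second=0, microsecond=0)
        buckets[hour][label] += 1
    timeline = []
    for hour in sorted(buckets):
        counts = buckets[hour]
        # The most common label wins; on a tie, the label seen first wins.
        dominant = counts.most_common(1)[0][0]
        timeline.append((hour, sum(counts.values()), COLOURS[dominant]))
    return timeline
```

Each row is one point on the line: the count gives the height and the colour gives the mood for that hour.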
Has our subject said something interesting? Has the hive-mind of the internet seen it as being significant? If so, it will rise to a peak and decline slowly as people keep talking about it. Or was it a claim about them that gained little credibility – one that spiked and dropped quickly?
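One simple way to tell a slow decline from a quick spike is to count how long the chatter stays above half its peak level. The Python sketch below does that for a list of tweet counts per period. The half-peak threshold and the function name `periods_above_half_peak` are arbitrary choices of mine rather than any established measure.

```python
def periods_above_half_peak(counts):
    """Count consecutive periods after the peak that stay at or above half of it."""
    if not counts:
        return 0
    peak = max(counts)
    start = counts.index(peak)  # first period at which the peak is reached
    threshold = peak / 2
    lasting = 0
    for value in counts[start + 1:]:
        if value < threshold:
            break
        lasting += 1
    return lasting

# Made-up hourly tweet counts for illustration.
quick_spike = [2, 3, 40, 5, 2, 1]
slow_burn = [2, 3, 40, 32, 25, 21, 12, 5]
```

A story with legs gives a higher number. A claim that nobody believed falls away almost at once and gives a number close to zero.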
For a politician, marketeer or a journalist, these changes provide valuable communications information. Hedge funds have found this information more valuable than other available data. Health authorities find the information more useful than other available data in tracking epidemics.
As a campaigner or marketeer, you can see earlier whether a damaging line of attack is on the way. Has something been said that needs to be clarified? Is there misunderstanding or mischief coursing around?
There’s also the opportunity that this kind of insight provides to anyone who wants to intervene and change the way the firehose is talking about something.
Can we drill down into this information and find out who the influencers are behind a surge like this? Can we contact them directly with the clarification or rebuttal? Or can we give rival influencers the ammunition they need to launch a counter-attack?
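As a very rough illustration of ‘drilling down’, the Python sketch below ranks the authors behind a surge by how many retweets their tweets picked up during a chosen time window. The field names (`author`, `time`, `retweets`), the function name `top_influencers` and the sample data are all hypothetical, and retweet counts are only one possible stand-in for influence.

```python
from collections import defaultdict
from datetime import datetime

def top_influencers(tweets, start, end, limit=3):
    """Rank authors by total retweets earned inside the window [start, end]."""
    retweets_by_author = defaultdict(int)
    for tweet in tweets:
        if start <= tweet["time"] <= end:
            retweets_by_author[tweet["author"]] += tweet["retweets"]
    # Highest total first; ties are broken alphabetically by author name.
    ranked = sorted(retweets_by_author.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

# Made-up data: three tweets inside the surge hour and one well outside it.
surge = [
    {"author": "alice", "time": datetime(2011, 9, 28, 10, 5), "retweets": 120},
    {"author": "bob", "time": datetime(2011, 9, 28, 10, 20), "retweets": 15},
    {"author": "alice", "time": datetime(2011, 9, 28, 10, 40), "retweets": 30},
    {"author": "carol", "time": datetime(2011, 9, 28, 14, 0), "retweets": 500},
]
leaders = top_influencers(surge, datetime(2011, 9, 28, 10), datetime(2011, 9, 28, 11))
```

The names at the top of that list are the people you would want to reach first with a clarification or rebuttal.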
Journalists watch this firehose. It alerts them to stories. It gives a good indicator of the credibility of a story. A rapid rebuttal can take a problem off the radar.
Over the next few weeks, I’ll start posting some graphics here with a few example illustrations. Please stay tuned!