One of Intent HQ’s key technological innovations is the Topic Graph.
This data structure encodes millions of concepts and links them through ontological, categorical, semantic and affinity-based relationships. This four-pronged approach means that when a user talks about Klondike (a Discovery Channel TV show), we can infer that they may be interested in other Discovery documentary TV shows, and that they may have a high propensity for certain fast foods and retail brands.
The data that drives these insights comes from three groups of sources. Semantic and categorical relationships are extracted from Wikipedia and WordNet, ontological relationships from Wikidata, and affinity from the social network profiles of millions of users and their friends.
The backbone of the Topic Graph is extracted from Wikipedia. This online collaborative encyclopaedia encodes a staggering amount of information: the English-language version alone has more than 5 million topics, connected by more than 120 million links. Unfortunately, Wikipedia is not easy to process. Not only is it very large (a single XML file weighing in at more than 40GB uncompressed), but the data it contains is designed to be accessible to humans, not machines.
We extract two key pieces of data from Wikipedia: topics and categories. A topic generally describes an atomic concept, and a category is a label that collects related topics together. Unfortunately, this categorisation is, like much raw Wikipedia data, noisy and difficult to work with.
The first issue we encounter is that categories are not organised in a fixed hierarchy; instead, they are linked in a directed graph. In addition, the granularity of the category structure varies tremendously across subject domains. Some categories are very broad; others are incredibly specific.
For the Wikipedia categorisation to be truly useful, this rough structure must be cleaned and normalised. The first part of this process involves merging and consolidating similar categories (those with substantially similar child topics) and removing noisy categories (e.g. lists of people born in 1909). To remove noisy categories, we apply editorial rules as well as a supervised machine learning algorithm that learns which categories are of value.
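A minimal sketch of the merging step, assuming "substantially similar child topics" can be approximated by the Jaccard overlap of each category's topic set. The function names, the 0.75 threshold and the sample categories are all hypothetical and chosen for illustration only; the actual merging rules are not described here.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar_categories(categories: dict, threshold: float = 0.75) -> list:
    """Group categories whose child-topic sets overlap by at least `threshold`.

    Returns a list of (category_names, merged_topics) pairs.
    The threshold is an illustrative assumption, not a tuned value.
    """
    # Union-find: every category starts in its own group.
    parent = {name: name for name in categories}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pairwise comparison is quadratic; fine for a toy example only.
    for a, b in combinations(categories, 2):
        if jaccard(categories[a], categories[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name, topics in categories.items():
        names, merged = groups.setdefault(find(name), (set(), set()))
        names.add(name)
        merged.update(topics)
    return list(groups.values())


# Hypothetical sample data.
example_categories = {
    "Tennis players": {"Andy Murray", "Roger Federer", "Rafael Nadal",
                       "Serena Williams", "Novak Djokovic"},
    "Professional tennis players": {"Andy Murray", "Roger Federer",
                                    "Rafael Nadal", "Serena Williams"},
    "Railway companies": {"South West Trains"},
}
groups = merge_similar_categories(example_categories)
```

Here the two tennis categories share four of five topics (an overlap of 0.8) and end up in one group, while the railway category stays on its own.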
Once this is done, the directed graph is converted into a tree structure. Broadly speaking, this means that more general categories (e.g. “physics”) are found at the top of the tree, with more specific ones found closer to the bottom. We use a variant of Tarjan’s Algorithm to do this.
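The exact variant is not specified here, so the following is only one plausible reading, sketched under stated assumptions: Tarjan's strongly connected components algorithm collapses cyclic groups of categories into a single node, and a breadth-first walk from a chosen root then keeps one parent per category (the shallowest one). The graph layout (category mapped to its subcategories), the function names and the sample data are hypothetical.

```python
from collections import deque


def tarjan_scc(graph):
    """Map each node to a representative of its strongly connected component.

    Recursive form of Tarjan's algorithm; suitable for small graphs only.
    """
    index, low, component = {}, {}, {}
    stack, on_stack = [], set()
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop its members off the stack.
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component[w] = v
                if w == v:
                    break

    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    for v in sorted(nodes):
        if v not in index:
            visit(v)
    return component


def graph_to_tree(graph, root):
    """Collapse cycles, then keep a single (shallowest) parent per category."""
    comp = tarjan_scc(graph)
    dag = {}
    for v, ws in graph.items():
        for w in ws:
            if comp[v] != comp[w]:  # drop edges inside a collapsed cycle
                dag.setdefault(comp[v], set()).add(comp[w])
    parent = {comp[root]: None}
    queue = deque([comp[root]])
    while queue:
        v = queue.popleft()
        for w in sorted(dag.get(v, ())):
            if w not in parent:  # first visit is the shallowest parent
                parent[w] = v
                queue.append(w)
    return parent


# Hypothetical category graph with a cycle between "Physics" and "Mechanics".
example_graph = {
    "Science": ["Physics", "Sport science"],
    "Physics": ["Mechanics", "Optics"],
    "Mechanics": ["Physics", "Ballistics"],
    "Sport science": ["Ballistics"],
}
comp = tarjan_scc(example_graph)
tree = graph_to_tree(example_graph, "Science")
```

In this toy graph "Physics" and "Mechanics" collapse into one node, and "Ballistics", which had two parents, keeps only one of them in the resulting tree.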
Wikidata is a massive, collaborative knowledge base containing over 15 million data items. The raw data dump is very large when uncompressed, which makes it rather challenging to work with. Indeed, it’s only in the last 12 months that the ready availability of cost-effective EC2 instances with high-capacity SSD storage and very large (>100GB) amounts of RAM has made it possible for us to work with Wikidata in real time.
Wikidata is easily machine processable, but every fact is treated as equal. This poses a problem: in our world, the importance of a given fact is highly dependent on context.
For example, in a content recommendation scenario, the fact that Andy Murray is a tennis player is likely to be far more relevant to an article on tennis than the fact that he was born in 1987.
To make full use of Wikidata, we need to know which facts are most relevant, both globally and locally, depending on context. Given the size and scope of Wikidata, manually curating these facts is not feasible.
Semantic similarity and topic affinity
To decide which facts are important given a specific context, we use two different techniques.
Semantic similarity between two topics captures how similar their individual meanings are. For example, Andy Murray is semantically very similar to Roger Federer – both are tennis players, both have won Grand Slams, and both have won many tennis tournaments. Andy Murray and Wayne Rooney are less similar but still hold some similarities; both are British sportsmen who are close to the top of their respective fields. Andy Murray and South West Trains are dissimilar – there are no meaningful connections between the two.
Topic affinity captures how likely it is that, given one topic, another is relevant. This is calculated from millions of social network profiles. The IHQ platform processes these en masse and links individual topics to Tweets, Likes and shared URLs. By examining the relative co-occurrence of these topics, and taking care to account for the fact that some topics are far more prevalent than others (“Family Guy” appears millions of times, “Black Books” only a few thousand), we are able to make connections between topics that are not encoded in Wikipedia or Wikidata.
Accounting for the relative popularity of topics is very important. Without this adjustment, almost every list of related topics is dominated by celebrities and popular TV shows. To reduce this effect, we use ideas from information theory, looking for relationships that lower the entropy of the overall selection.
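To illustrate why a popularity adjustment matters, here is a minimal sketch that scores topic pairs with pointwise mutual information (PMI). PMI is a stand-in chosen for this example; it is a standard information-theoretic measure, but it is an assumption that it resembles the entropy-based approach described above. The function name and the sample profiles are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def topic_affinity(profiles):
    """Score each co-occurring topic pair by PMI over a list of profiles.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated as the fraction of profiles mentioning the topic(s).
    """
    n = len(profiles)
    topic_counts = Counter()
    pair_counts = Counter()
    for topics in profiles:
        unique = sorted(set(topics))  # sorted so pair keys are consistent
        topic_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        scores[(a, b)] = math.log2(joint * n / (topic_counts[a] * topic_counts[b]))
    return scores


# Hypothetical profiles: "Family Guy" is popular and appears almost everywhere.
example_profiles = [
    {"Family Guy", "Klondike"},
    {"Family Guy", "Black Books"},
    {"Family Guy", "Black Books", "Spaced"},
    {"Family Guy"},
    {"Family Guy", "Tennis"},
    {"Black Books", "Spaced"},
]
scores = topic_affinity(example_profiles)
```

In this toy data "Black Books" co-occurs twice with "Spaced" and twice with "Family Guy", so raw counts cannot separate the two pairs. PMI ranks the pairing with "Spaced" higher because "Family Guy" appears in most profiles regardless. Plain PMI is known to over-reward very rare pairs, so a real system would likely need smoothing or a minimum-count cut-off.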
Putting it all together
Putting these individual components together gives us an interconnected graph of concepts and categories. Links between these topics and categories capture different relationships (semantic, affinity, ontological), and weights on individual edges capture the strength of the relationship.
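As a rough sketch of such a structure, the example below stores typed, weighted edges in an adjacency list. The class and method names, the relation labels as strings, and the weights are all hypothetical; the real Topic Graph's storage and API are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    target: str
    relation: str  # e.g. "semantic", "affinity" or "ontological"
    weight: float  # strength of the relationship


class TopicGraph:
    """Toy adjacency-list graph with typed, weighted edges."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add_edge(self, source, target, relation, weight):
        self._edges[source].append(Edge(target, relation, weight))

    def related(self, source, relation=None, limit=10):
        """Return the strongest edges from `source`, optionally of one relation type."""
        edges = [
            e for e in self._edges.get(source, [])
            if relation is None or e.relation == relation
        ]
        return sorted(edges, key=lambda e: e.weight, reverse=True)[:limit]


# Hypothetical edges and weights, for illustration only.
graph = TopicGraph()
graph.add_edge("Klondike", "Discovery Channel", "ontological", 0.9)
graph.add_edge("Klondike", "Documentary television", "semantic", 0.7)
graph.add_edge("Klondike", "Fast food", "affinity", 0.4)

strongest = graph.related("Klondike", limit=2)
affinity_only = graph.related("Klondike", relation="affinity")
```

Keeping the relation type on each edge lets one query mix all relationship kinds or filter to a single kind, depending on the use case.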
On top of this, we add links to our customers’ content and their users’ profiles.
This final version of the graph is used to drive content recommendations, segment users for analytics and power advertising.