Tuesday 22 November 2016

Exploring The Clinton Emails Through Document Similarity

In the last post we finally succeeded in improving performance enough to make calculating document similarity practical for the fairly large Clinton Email data set.

Let's explore it...



Top 800

There are 7944 documents (emails) in the Clinton data set. This turns into 7944 * 7943 / 2 = 31,549,596 doc1-doc2 combinations. Plotting all of these in one graph visualisation is unlikely to reveal any structure.
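As a quick sanity check on the scale, the number of unordered pairs of n distinct documents is n * (n - 1) / 2. Here is a minimal sketch of that arithmetic, assuming each document is paired with every other document exactly once and never with itself:

```python
from math import comb

num_documents = 7944  # number of emails in the data set, as stated above

# unordered pairs of distinct documents: n * (n - 1) / 2
num_pairs = comb(num_documents, 2)

print(f"{num_pairs:,} doc1-doc2 combinations")
```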

So let's see what happens when we plot the top 800. By top we mean the 800 pairs of documents with the highest similarity scores.
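Picking out the top pairs could look something like the sketch below. It assumes the similarity scores are held in a plain dictionary keyed by (doc1, doc2) .. that layout, the function name `top_pairs` and the toy scores are all made up for illustration, and are not necessarily how the toolkit stores things.

```python
import heapq

def top_pairs(similarity, n):
    """Return the n (doc1, doc2, score) tuples with the highest scores."""
    return heapq.nlargest(
        n,
        ((doc1, doc2, score) for (doc1, doc2), score in similarity.items()),
        key=lambda pair: pair[2],
    )

# toy scores, invented purely for illustration
similarity = {
    ("a.txt", "b.txt"): 0.91,
    ("a.txt", "c.txt"): 0.15,
    ("b.txt", "c.txt"): 0.40,
    ("c.txt", "d.txt"): 0.77,
}

top_two = top_pairs(similarity, 2)
for doc1, doc2, score in top_two:
    print(doc1, doc2, score)
```

Using `heapq.nlargest` avoids sorting all 31 million pairs when we only want a few hundred of them.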

Here's what happens:



You can see that even with only 800 out of 31 million pairs plotted, the view is quite busy! Luckily the force-directed graph settles and the documents move apart, except where strong links keep things together. This does clarify some clusters of similar documents, but quite a lot moves out of view .. something we need to fix later.

Here's what it looks like after settling a little:


What are some of these clusters? Let's look at the biggest one in that view, coloured pink above, ... and review the contents of the emails. Here are some of the document names, and what they're about:

  • C05767378.txt
  • C05767361.txt
  • C05767355.txt
  • C05773263.txt
  • C05773249.txt
  • C05773269.txt
  • C05767369.txt

Well - looking at the content of those emails .. it is encouraging that the subject of all of these emails is the same ... "Subject: Re: Bravo! Brava! Issue your statement! Sid" .. which makes sense as they are all supposed to be related. This is good news, and suggests our document similarity calculations are working!

Once we've read the documents we realise there is a silly reason that the documents are so similar - they all quote the same Guardian article, copying and pasting the text from it. That's ok .. we're interested in documents that are meaningfully related, and having read them, we realise they are in fact part of the same conversation.




Top 10

Let's limit the number of pairs that are drawn to the very top-most 10. This should show us those documents with the most similarity across the entire corpus.



There's only one cluster with more than 2 documents .. so let's take a look at it:

  • C05774734.txt
  • C05770127.txt
  • C05770137.txt

Again it's good to see that these 3 emails are all about the same subject .. with an email subject of "Subject: Re: The political ramifications of the Greek/European debt crisis". Again, there is a silly reason why these are so highly scored for similarity ... they copy and paste the same press release text. Despite this, the emails are actually related. 
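Spotting clusters with more than 2 documents doesn't have to be done by eye. One way, sketched below, is to treat each plotted pair as a link and collect the connected groups of documents. The function name `find_clusters` and the toy pairs are made up for illustration .. this isn't part of the visualisation code itself.

```python
from collections import defaultdict

def find_clusters(pairs):
    """Group documents into connected components of the pair graph."""
    neighbours = defaultdict(set)
    for doc1, doc2 in pairs:
        neighbours[doc1].add(doc2)
        neighbours[doc2].add(doc1)

    seen = set()
    clusters = []
    for start in neighbours:
        if start in seen:
            continue
        # walk outwards from this document, collecting everything reachable
        cluster = set()
        to_visit = [start]
        while to_visit:
            doc = to_visit.pop()
            if doc in cluster:
                continue
            cluster.add(doc)
            to_visit.extend(neighbours[doc] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

# toy pairs, invented purely for illustration
pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]

# keep only the clusters with more than 2 documents
big_clusters = [c for c in find_clusters(pairs) if len(c) > 2]
print(big_clusters)
```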




Top 50

Increasing the number to 50 .. we still only really get document pairs .. no real clusters yet.



Let's try more pairs.





Top 2000

Let's plot the top 2000 pairs and let it settle. Here's what we get ...



Let's look at that cluster highlighted in blue. 



It's interesting because rather than having lots of connections between most/all the documents ... it is in fact a chain with one document similar to another, which in turn is similar to another .. and so on. 

The chain is:

  • C05767510.txt
  • C05773329.txt
  • C05773316.txt
  • C05767491.txt
  • C05767495.txt
  • C05767499.txt
  • and a branch to C05773315.txt

Reading these emails, we can again say they are linked because they all have the same email subject field "Subject: Re: Mubarak Call Sheet" .. they're part of the same conversation. That is a pretty good achievement given we simply had a huge bucket of random emails to work with!

The chain also shows the email trail as it grows, because each reply includes the text of the previous emails. Again it is pretty cool that we managed to visualise this on the graph! That branch on the graph to C05773315.txt is also reflected in the text ... it is a forked email conversation.
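One rough way to tell a chain from a tightly knit cluster, without looking at the picture, is to compare the number of links present with the number possible among the same documents. The sketch below is only an illustration .. the function name `edge_density` and the toy pairs are made up, and it assumes every pair links two different documents.

```python
def edge_density(pairs):
    """Fraction of possible links that are present among the documents in pairs."""
    # frozenset so that (a, b) and (b, a) count as the same link
    edges = {frozenset(pair) for pair in pairs}
    docs = set().union(*edges)
    n = len(docs)
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    return len(edges) / possible

# toy examples, invented purely for illustration
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
clique = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

print(edge_density(chain))   # low: each document links only to its neighbours
print(edge_density(clique))  # high: every document links to every other
```

A chain or branching trail of n documents has only n - 1 links, so its density falls as the trail gets longer, while a group of emails that all quote the same text stays close to 1.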

Not bad ... all this from a simple idea of similar documents!



Topics & Themes


This is all great .. and we can spend loads of time exploring these graphs for interesting clusters, and chains .. and even key documents which link different clusters ...

But one thing we still have to do manually is read the documents to find out what each cluster's theme is .. that is something we began to explore when we looked at reducing dimensions to distil out the key topics in a corpus.

So let's crack on with developing that further ...
