Friday 4 November 2016

Grouping Similar Documents .. aka Clustering. Part 2/2

Last time we talked about how useful it would be to automatically group similar documents - and we developed quite a simple idea for doing this.

And it worked quite well ..


For a test mixed corpus, made of documents from 4 different corpora ... the method successfully separated out a cluster consisting only of recipes ... not bad at all for such a simple underlying idea - that similar documents share common words ... nothing more complicated than that.

But we need to do more experimenting to see if we can fix the big cluster which is a mix of document types ... we should ideally have 4 clusters - Clinton emails, Iraq report, Italian recipes and Macbeth.

The following reminds us of how we constructed the mixed corpus, from 13 documents from each of the source corpora, and our expectation that successful document similarity clustering will separate them again according to these types.




Experiment 1: Remove Stop Words

Perhaps the similarity scores are being skewed by those boring stop-words like a, the, in, it, on ... let's apply our filter to brutally chop them out.
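The filter itself is nothing more than a lookup against a list. Here's a minimal sketch .. the names stop_words and word_counts are illustrative rather than the toolkit's actual identifiers, and a real stop-word list would be much longer:

stop_words = {"a", "an", "the", "in", "it", "on", "is", "to", "of", "and"}

def remove_stop_words(word_counts):
    # brutally chop out any word that appears in the stop-word list
    return {word: count for word, count in word_counts.items()
            if word not in stop_words}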

The results, showing the top 150 links, are ...


Yippee!

That works fantastically! There are 4 clear clusters .. and they are pure ... they only contain documents of the same type.

The following shows the same graph plot with the clusters shown more clearly.


This tells us that we do need to deal with boring words skewing the similarity scores. We previously tried to avoid the brutal method of removing stop-words directly by using measures of interestingness instead ... we could try that again ...



Experiment 2: Normalise The Word Frequency Vectors

Let's undo our brutal stop-word removal and try a different approach.

Our current calculations use word frequency, not raw word counts, to account for any bias from long or short documents .. which have more or fewer words simply because of their length. A short document can never have a very high word count.

But we didn't normalise the document vectors when we took the dot product. This wasn't intentional .. we simply forgot!

We can use the scipy.spatial.distance.cosine() function, which returns (1 - similarity) .. so we just subtract it from 1 to get the similarity we want .. and it normalises the vectors for us.
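For example, a minimal sketch .. the frequency vectors here are made up for illustration:

import numpy as np
from scipy.spatial.distance import cosine

# illustrative word frequency vectors for two documents,
# over the same (tiny) vocabulary
doc1_frequencies = np.array([0.10, 0.05, 0.00, 0.02])
doc2_frequencies = np.array([0.08, 0.00, 0.03, 0.02])

# cosine() returns the distance (1 - similarity) and normalises
# the vectors internally .. so flip it back to get the similarity
similarity = 1 - cosine(doc1_frequencies, doc2_frequencies)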

Here's the top of the results ... it looks promising because there don't appear to be any incorrect similarities ..

doc1 doc2 similarity
0 C05739561.txt C05739554.txt 1.000000
1 C05739565.txt C05739563.txt 0.999959
2 C05739561.txt C05739546.txt 0.997317
3 C05739554.txt C05739546.txt 0.994921
4 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-32.txt 0.994283
5 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-32.txt 0.992807
6 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-31.txt 0.992191
7 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-33.txt 0.991721
8 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-12.txt 0.989763
9 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-31.txt 0.987702
10 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-31.txt 0.981875
11 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-12.txt 0.978545
12 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-12.txt 0.975892
13 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-12.txt 0.968804
14 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-11.txt 0.965843
15 the-report-of-the-iraq-inquiry_section-12.txt the-report-of-the-iraq-inquiry_section-11.txt 0.965642
16 the-report-of-the-iraq-inquiry_section-20.txt the-report-of-the-iraq-inquiry_section-12.txt 0.964386
17 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-20.txt 0.960154
18 the-report-of-the-iraq-inquiry_section-15-1.txt the-report-of-the-iraq-inquiry_section-12.txt 0.956302
19 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-15-1.txt 0.953367

So normalising the vectors seems to help ..  let's see what happens if we plot the top 50 links:
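The plots below are force-directed graphs, where each similarity pair becomes a link. A rough sketch of how one might draw such a plot, assuming a pandas DataFrame called similarities with the doc1, doc2 and similarity columns shown above .. the real visualisation may well use a different tool:

import networkx as nx
import matplotlib.pyplot as plt

# keep only the strongest 50 links
top_links = similarities.sort_values("similarity", ascending=False).head(50)

# build a graph where each similarity pair becomes an edge
G = nx.Graph()
for _, row in top_links.iterrows():
    G.add_edge(row["doc1"], row["doc2"], weight=row["similarity"])

# a force-directed layout pulls strongly linked documents together
nx.draw(G, pos=nx.spring_layout(G), node_size=50, with_labels=True, font_size=6)
plt.show()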


Looks promising .. let's plot more... the top 150:


Oh no! ... things have broken down a bit .. yes, there are 2 good clusters that are pure .. and even a pure tiny cluster .. but the fourth cluster is a big mix of documents that should have been in different clusters.

So normalisation helps mitigate the effects of lots of boring words .. but why? Let's take this step by step ..

  • Our simple idea for document similarity says that documents are similar if they share common words ... documents in a cluster about cars won't mention food words, for example.
  • But the boring stop-words like .. is, a, an, on, to, the .. appear in all kinds of documents. They don't help identify which genre or cluster a document should be in .. that's why they're boring .. they're not relevant words.
  • These boring words occur very often, and their frequencies dominate those of the more interesting words .. making documents seem similar when they aren't ... because they have lots of stop-words in common .. at frequencies higher than the useful words.
  • Normalising the document vectors scales them so they have a length of 1 .. which makes the very large stop-word frequencies much smaller. The effect on large stop-word frequencies is significant .. but the effect on the small interesting word frequencies is less so ... the following diagram shows this.

You can see the effect on high boring-word frequencies is much more dramatic (bigger change) than it is on interesting word frequencies (smaller change).
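A quick numeric illustration of the same point .. the numbers are made up:

import numpy as np

# made-up vector: one huge stop-word frequency, three small interesting ones
v = np.array([100.0, 2.0, 3.0, 1.0])
normalised = v / np.linalg.norm(v)

# the absolute shrinkage of the stop-word component dwarfs the others
print(v - normalised)   # approximately [99.0, 1.98, 2.97, 0.99]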

That explains why normalising vectors seems to mitigate the effect of overly prominent stop-words. But it isn't perfect, as we saw.



Word Interesting-ness Factor

What did we do before, when we were trying to deal with boring words? We developed a measure of interesting-ness, or relevance .. by combining three factors (a rough sketch follows the list):

  • term frequency (TF) - how often a word occurs in a document
  • inverse document frequency (IDF) - penalising words that occur across many documents in the corpus
  • .. and our own refinement penalising shorter words - because most stop-words are short
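Here's that sketch, for one word in one document .. the exact formula and weightings in the toolkit may differ, and the length penalty here is just one plausible choice:

import math

def relevance(word, doc_word_counts, doc_frequency, n_docs):
    # term frequency: how often the word occurs in this document
    tf = doc_word_counts[word] / sum(doc_word_counts.values())
    # inverse document frequency: penalise words found in many documents
    idf = math.log(n_docs / doc_frequency[word])
    # our refinement: penalise short words, since most stop-words are short
    length_penalty = 1.0 - math.exp(-len(word))
    return tf * idf * length_penalty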

Maybe we should use the relevance scores to automatically weed out the boring words?

To do this is easy: we just read the relevance matrix that we created earlier, instead of the word count index. We also comment out the calculation of word frequency, because that is already done for the relevance index.
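In code terms the change is roughly this .. the file name, format and DataFrame layout (words as rows, documents as columns) are illustrative:

import pandas as pd
from scipy.spatial.distance import cosine

# read the previously saved relevance matrix instead of the word count index
relevance_index = pd.read_hdf("relevance_index.h5", "relevance")

# no need to recalculate word frequencies .. that's already baked
# into the relevance scores .. just compare the document columns
doc1 = relevance_index["C05739561.txt"]
doc2 = relevance_index["C05739546.txt"]
similarity = 1 - cosine(doc1, doc2)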

The results look promising ... there aren't any incorrect looking similarity pairs in the top 20 list:

doc1 doc2 similarity
0 C05739561.txt C05739546.txt 1.000000
1 C05739561.txt C05739554.txt 0.997375
2 C05739565.txt C05739563.txt 0.992705
3 C05739554.txt C05739546.txt 0.991492
4 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-32.txt 0.950962
5 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-33.txt 0.926943
6 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-32.txt 0.925336
7 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-31.txt 0.916514
8 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-31.txt 0.887027
9 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-31.txt 0.858477
10 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-12.txt 0.845943
11 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-12.txt 0.788654
12 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-12.txt 0.758289
13 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-12.txt 0.742125
14 C05739567.txt C05739545.txt 0.706729
15 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-11.txt 0.700997
16 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-11.txt 0.686425
17 the-report-of-the-iraq-inquiry_section-12.txt the-report-of-the-iraq-inquiry_section-11.txt 0.665006
18 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-11.txt 0.651868
19 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-20.txt 0.647377

In fact, extending the list to the top 100 also looks good ... very promising! This gives us hope that we can automate the mitigation of stop-words .. and not use the brutal method of filtering out predetermined stop-words.

Let's see the visualisation of the clusters. Here are the top 50 links:


Looks good .. clear, distinct and pure clusters ... let's look at the top 100:


So far so good, 4 clear clusters as we'd expect .. and they're all pure. The recipes cluster is a little small with only 2 documents in it ... that must mean other kinds of document ... like the Iraq report documents .. are more similar to each other than the recipes are .. which kind of makes sense because the recipes are for different dishes, and so have different ingredients and cooking methods.

In fact, if we look at the two recipes shown, 10.txt and 11.txt, they are for white sauce and yellow sauce, which share much of their preparation .. so no surprise that they should have a higher similarity score with each other than with the other recipes.

Let's look at the top 200 now ..


Look at the above .. I've colour-coded the documents to make it easier to see which are in each cluster.

Whoah! ... what happened? Five clusters?

The recipes seem to have been split into 2 clusters ... but if we examine the smaller cluster of 2 we see that they are both recipes for rice cakes, and are similar to each other .. but they must be sufficiently different from the other recipes for links not to appear in the top 200.

If you look closely at the Iraq report cluster you'll see one document hanging onto the rest by a tenuous link .. it's the annex-4. That's just an appendix listing maps ... and doesn't contain much else that's topical .. so the graph visualisation really does show which documents, even in a cluster, are only there by the most minimal of connections.

Great!

Let's keep pushing the number of links to plot .. this is the top 800:


Now that's less useful .. everything is starting to connect to everything, because even the weak similarity links are now shown. Even so, you can still see the clusters .. the Italian recipes and Clinton emails trying to stay together, away from the rest of the mass.

The Macbeth play and the Iraq report merge together most easily ... maybe there's a message in there somewhere about both tragedies?

Let's stop there and sum up...



Conclusion:

We've made fantastic progress:

  • Removing stop-words is effective, and results in clean clusters.
  • But if we don't want to use a predetermined stop-word list .. 
  • Normalisation of the document vectors helps mitigate overly high counts for boring words.
  • And relevance scores (TF/IDF and similar) work great to automatically and adaptively down-weight boring, commonly occurring words.
  • The graph visualisation is really powerful for insights into the text data .. surfacing the natural number of clusters, tenuous links, and even clusters within clusters. 


The code for document similarity is at GitHub:



Great job!
