Tuesday 21 February 2017

Dimension Reduction with SVD ... Let's Try It!

In the last two posts, here and here, we wanted to see how we could:


  • Visualise lots of documents .. each with many dimensions (words) .. on a 2-dimensional plot .. so that we can see them clearly and usefully on a computer or on a book page.
  • Extract the underlying topics (e.g. vehicle, cooking) ... especially when the documents don't actually mention these more general topic names explicitly.


We found a tool called Singular Value Decomposition which should help. Let's try it on real data.


The Mixed Corpus

We previously created the mixed corpus, made of 13 documents from each of the Recipes, Iraq, Macbeth and Clinton data sets, and used it to see if our method of grouping similar documents worked.


We'll keep our code in a notebook on GitHub at:




We'll apply our stop-word filter, and create the relevance index as usual. We then create the SVD decomposition of the word-document matrix $\mathbf{A}$, which saves an HDF5 file with the three matrices $\mathbf{U}$, $\mathbf{\Sigma}$, and $\mathbf{V}^T$.

tmt.svd.calculate_singular_value_decomposition(cr.content_directory)
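
For the curious, the core of this step is just a standard SVD plus saving the factors. Here's a minimal sketch of the idea .. not the actual tmt code .. with hypothetical file names:

import numpy
import pandas

# hypothetical stand-in for the relevance index built earlier ..
# rows are words, columns are documents
A = pandas.read_hdf('relevance_index.hdf5', 'index')

# full SVD of the word-document matrix: A = U . Sigma . V^T
# s holds the singular values, the diagonal of Sigma
U, s, Vt = numpy.linalg.svd(A.values, full_matrices=False)

# save the three factors for the later steps
store = pandas.HDFStore('svd.hdf5')
store['U'] = pandas.DataFrame(U, index=A.index)
store['s'] = pandas.Series(s)
store['Vt'] = pandas.DataFrame(Vt, columns=A.columns)
store.close()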

We now want to get a feel for how many useful dimensions there are. One way of doing this is to look at the eigenvalues .. the diagonal elements of the $\mathbf{\Sigma}$ matrix (strictly speaking these are singular values, but we'll stick with the name the code uses). We can see them by plotting them as a bar chart.

# get SVD eigenvalues
eigenvalues = tmt.svd.get_svd_eigenvalues(cr.content_directory)

# visualise the SVD eigenvalues as a bar chart
tmt.visualisation.plot_bar_chart(eigenvalues)

The eigenvalues look like the following:


We can see there are two eigenvalues that stick out well above the others. That means the general nature of the mixed data set can be broadly reconstructed with just these two eigenvalues. If we wanted to be more detailed, we could note that the next 3 eigenvalues also stick out as a bit larger than the rest, which fall away to very small values.

We can explore these top 5 eigenvalues when we look at extracting topics below.

First let's see what the transformed document-view $\mathbf{\Sigma} \cdot \mathbf{V}^T$ looks like .. using only the top 2 eigenvalues in $\mathbf{\Sigma}$ to take a 2-dimensional slice. Hopefully similar documents are placed near each other.

# get document-view projection onto 2 dimensions
document_view = tmt.svd.get_document_view(cr.content_directory)

# plot documents in reduced dimension space with a 2-d scatter 
tmt.visualisation.plot_scatter_chart(document_view)
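
Roughly, what get_document_view does internally is the following .. a sketch, not the toolkit's actual code, reusing the hypothetical svd.hdf5 from the sketch above:

import numpy
import pandas

# load the saved factors
store = pandas.HDFStore('svd.hdf5')
s, Vt = store['s'], store['Vt']
store.close()

# keep only the top 2 singular values and form Sigma . V^T,
# giving every document a coordinate in 2-d topic space
k = 2
document_view = pandas.DataFrame(
    numpy.diag(s.values[:k]).dot(Vt.values[:k, :]),
    columns=Vt.columns)

# document_view.T has one row per document, with columns 0 and 1
# as its x and y coordinates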

Here's what the 2-d plot of the document-view looks like:


Well, we can see some clusters .. the dots aren't scattered everywhere in a random manner. But can we see 4 clusters, as we hoped? We'll have to see where the original documents from the Recipes, Iraq, Macbeth and Clinton data sets ended up on that chart. We can do this because we kept the document-view matrix ordered, with document names associated with the pandas data frame columns.
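
One hedged way to do that separation .. the real notebook keeps track of this differently .. is to group the document names by their data set. Here I assume a purely illustrative naming convention where each file name starts with its data set name:

import matplotlib.pyplot as plt

coords = document_view.T   # rows are documents, columns 0 and 1 are x, y

# assumed (illustrative) naming convention: recipes_01.txt, iraq_01.txt, ...
for dataset in ['recipes', 'iraq', 'macbeth', 'clinton']:
    subset = coords[coords.index.str.startswith(dataset)]
    plt.scatter(subset[0], subset[1], label=dataset)

plt.legend()
plt.show()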

Here's that plot separated out into the documents that we know they are:


That's better! We can clearly see that there is indeed a Recipes cluster and a Clinton cluster. You can also see the documents are spread along lines which are perpendicular to each other .. in other words, these two sets of documents are as different as could be, albeit squished into a 2-d space! Cool!

The Macbeth and Iraq clusters are much closer to zero. They are distinct clusters, but they aren't as distant from each other as the Recipes and Clinton documents are. We saw before that the Iraq and Macbeth documents merge most easily in terms of similarity, so this shouldn't surprise us.


Maybe there's a lesson there .. Macbeth and Iraq .. both about corruption, death and intrigue?

Now, let's try to extract some topics.

For topics, we're looking at the word-view $\mathbf{U} \cdot \mathbf{\Sigma}$. The columns of this matrix are the topics, made up of linear combinations of the words from the original vocabulary. Because we've truncated the singular matrix $\mathbf{\Sigma}$, this word-view will have zero-value columns except for the left-most n, where n is the number of retained eigenvalues in $\mathbf{\Sigma}$.

The way we code this is to take each of the left-most n columns that aren't zero. For each one, we re-order it so that the values are in descending order .. this will give us the most significant word at the top of that column. We can then truncate these columns to retain only the most contributing words. We also remove the +/- sign of the values, because a value of -0.8 contributes more than a value of +0.001. It's the magnitude of the elements that matters .. it shows how much each word contributes to the topic.
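
As a rough sketch of that procedure .. again using the hypothetical svd.hdf5, and not necessarily the exact code behind tmt.svd.get_topics:

import numpy
import pandas

# load the saved factors
store = pandas.HDFStore('svd.hdf5')
U, s = store['U'], store['s']
store.close()

number_of_topics = 4
topic_length = 10

# word-view U . Sigma; with Sigma truncated to n singular values,
# only the left-most n columns are non-zero
word_view = U.values[:, :number_of_topics] * s.values[:number_of_topics]

topics_list = []
for i in range(number_of_topics):
    topic = pandas.Series(word_view[:, i], index=U.index, name=i)
    # sort by magnitude .. the sign doesn't matter, only how much
    # each word contributes to the topic
    topic = topic.abs().sort_values(ascending=False)
    topics_list.append(topic.head(topic_length))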

Let's try it:

# get top n topics, n is usually the same as the number of key dimensions identified by the eigenvalue bar chart above
number_of_topics = 4
# how many words in each topic (the most significant)
topic_length = 10

topics_list = tmt.svd.get_topics(cr.content_directory, number_of_topics, topic_length)

That means we're picking out only the top 4 topics .. and each of these topics will be limited to the 10 most significant words. Printing the returned list of topics gives:

 topic # 0
sauce       0.036254
butter      0.031721
broth       0.029299
boiled      0.028786
flour       0.028568
little      0.027809
water       0.027283
rice        0.025682
quantity    0.024787
salt        0.018478
Name: 0, dtype: float64 

 topic # 1
subject         0.042459
benghazi        0.033981
state           0.032291
f201504841      0.032254
redactions      0.031469
unclassified    0.031251
sensitive       0.028776
waiver          0.027892
05132015        0.027635
department      0.025765
Name: 1, dtype: float64 

 topic # 2
vegetables    0.036591
sauce         0.030400
rice          0.023624
soup          0.022900
fish          0.019670
cabbage       0.017470
boiled        0.016740
greens        0.015799
flour         0.015604
cooking       0.014433
Name: 2, dtype: float64 

 topic # 3
rice          0.041190
vegetables    0.037766
saffron       0.018390
cabbage       0.017364
marrow        0.017055
greens        0.016380
cooked        0.012974
beef          0.012837
soup          0.012059
them          0.011880
Name: 3, dtype: float64 

We can see quite clearly the top 2 topics are distinct .. and about two very different themes. One is clearly about cooking, and the other about the Clinton emails, specifically about Libya.

That's great! We've extracted two topics.

What about the next two topics? These are also about cooking, but they are less influential topics. How do we see this? If we plot the sum of the absolute values of the topic columns, we can see the following:


The top 2 columns (x-axis) have the largest sums .. the next two drop significantly.
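
To see that drop-off numerically, here's a hedged sketch reusing the word_view array from the topic sketch above (computed with however many topics we kept):

import numpy
import pandas

# total contribution of each topic column, ignoring signs
topic_strengths = numpy.abs(word_view).sum(axis=0)

# visualise as a bar chart, reusing the toolkit's plotting helper
tmt.visualisation.plot_bar_chart(pandas.Series(topic_strengths))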

Great! That seems to work .. but let's try another, less contrived dataset.



Recipes Corpus

Let's try the Recipes corpus. We know this isn't a mixed corpus containing very different themes. Let's see what happens anyway .. it'll be useful to see how a mono-themed corpus behaves under SVD.

The bar chart of eigenvalues shows only one value standing out above the rest:


That probably suggests that there is only one main topic or theme of the data set. Indeed we know this to be true.

Let's see what the document-view plotted in 2-dimensions looks like:


There's a cluster ... not two or more distinct and distant clusters. Again, we expected this.

The list of topics isn't that distinct, and the sum of the elements of the topic columns shows just one dominant theme:


Well - even if that didn't reveal great insights, we can see what a mono-themed dataset looks like when SVD is applied.

Let's try another, bigger corpus.


Iraq Inquiry Report

The Iraq Inquiry Report is, at one level, all about one topic - the Iraq war and the circumstances that led to it, but it is a big enough data set that should contain several sub-topics. Let's see what happens...


So there's clearly one eigenvalue much bigger than the rest, but there are maybe four others that also stick out as significant. This could mean there are 4 or 5 topics. Let's continue ...


The document-view plotted in 2-dimensions shows one cluster that is very clearly distant and distinct from the rest. What are these? Let's see:

document_view.T[document_view.T[0] > 0.01]

                                                            0         1
the-report-of-the-iraq-inquiry_section_annex-4.txt  0.039891  0.001821

If we look closer at this document .. we see it is an annex, and in fact a set of maps. So that is indeed different from the rest of the documents, which are more narrative.

Let's zoom into that central grouping a little more:


That's not quite as enlightening .. there still seems to be one cluster with a couple of off-shoots. Let's look at these. The documents with large x-coordinates are...

document_view.T[document_view.T[0] > 0.001]

                                                            0         1
the-report-of-the-iraq-inquiry_section_annex-2.txt  0.001225 -0.003456
the-report-of-the-iraq-inquiry_section_annex-3.txt  0.001312 -0.003445
the-report-of-the-iraq-inquiry_section_annex-4.txt  0.039891  0.001821

If we look closer at these annexes (we already saw annex-4 above), we find that they are indeed different from the main data set .. they are a glossary of terms (annex-2), a list of names and posts (annex-3) and the set of maps (annex-4) we saw above.

What about those docs dangling downwards ...

document_view.T[document_view.T[1] < -0.006]

                                                        0         1
the-report-of-the-iraq-inquiry_section-111.txt   0.000499 -0.007369
the-report-of-the-iraq-inquiry_section-112.txt   0.000415 -0.010617

Again, looking at these two documents, sections 111 and 112, we find that they are both about de‑Ba’athification, which the plot tells us must be sufficiently different and unique compared to the rest of the documents.

Let's look at some of the topics, taking the top 10 topics and the ten most significant words in each:
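
Using the same get_topics call as before, just with bigger parameters:

# top 10 topics, with the 10 most significant words in each
topics_list = tmt.svd.get_topics(cr.content_directory, 10, 10)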

 topic # 0
multinational    0.019513
map              0.018873
mndn             0.009999
maps             0.009649
division         0.007975
dissolved        0.005887
provinces        0.005885
southeast        0.005879
mnfnw            0.005734
mndnc            0.005734
Name: 0, dtype: float64 

 topic # 1
debaathification    0.010073
resolution          0.003564
basra               0.002889
weapons             0.002380
had                 0.002360
inspectors          0.002251
destruction         0.002238
cpa                 0.002217
wmd                 0.002115
biological          0.002111
Name: 1, dtype: float64 

 topic # 2
debaathification    0.016325
baath               0.002761
resolution          0.002180
postinvasion        0.002080
no1                 0.002078
bremer              0.002075
weapons             0.001960
destruction         0.001869
baathists           0.001798
biological          0.001708
Name: 2, dtype: float64 

 topic # 3
telegrams      0.008248
quotes         0.004428
documents      0.003594
quote          0.003523
bold           0.002878
spellings      0.002869
redactions     0.002845
egram          0.002822
transcripts    0.002773
navigate       0.002693
Name: 3, dtype: float64 

 topic # 4
inquests          0.004623
families          0.004157
bereaved          0.003888
coroners          0.002850
mental            0.002770
boi               0.002714
reservists        0.002711
investigations    0.002630
telic             0.002620
inquest           0.002611
Name: 4, dtype: float64 

 topic # 5
basra               0.003605
inquests            0.003233
bereaved            0.002699
debaathification    0.002648
families            0.002642
butler              0.002636
destruction         0.002585
weapons             0.002520
dfid                0.002233
witness             0.002220
Name: 5, dtype: float64 

 topic # 6
witness           0.009471
director          0.001858
20012003          0.001808
representative    0.001532
lieutenant        0.001475
deputy            0.001458
resolution        0.001426
20002003          0.001356
commander         0.001354
butler            0.001294
Name: 6, dtype: float64 

 topic # 7
resolution     0.003268
snatch         0.003229
vehicle        0.002454
istar          0.002397
vehicles       0.002338
hutton         0.002313
butler         0.002294
requirement    0.001761
mobility       0.001725
destruction    0.001704
Name: 7, dtype: float64 

 topic # 8
treaty         0.004037
britain        0.003420
witness        0.003097
feisal         0.002870
ottoman        0.002433
nuri           0.002356
angloiraqi     0.002282
mesopotamia    0.002100
rashid         0.001794
resolution     0.001672
Name: 8, dtype: float64 

 topic # 9
veterans      0.004128
mental        0.003520
medical       0.002977
inquests      0.002574
health        0.002349
inquest       0.002167
reservists    0.001946
coroners      0.001898
resolution    0.001734
witness       0.001710
Name: 9, dtype: float64

This is much better than I would have expected .. each of these topics is meaningful .. the SVD seems to have pulled them out despite the lack of clear clusters on the 2-d plot:

  • topic 0 is about maps, regions and provinces
  • topics 1 and 2 are about de‑Ba’athification and post-invasion policy and events
  • topic 3 is clearly about messages, telegrams, spellings, transcripts and quotes
  • topic 4 is about inquests, families, the bereaved and coroners
  • topic 5 is also about inquests but seems to be focussed on Basra and the Butler report
  • it's not fully clear what topic 6 is about
  • topic 7 is about the Snatch vehicles and mobility, which received much reporting about their adequacy
  • topic 8 is about the more historical topics of treaties and the Ottoman and British empires
  • topic 9 is about veterans' mental and physical health, and inquests about this

Given how well this is going .. why stop at 10 topics? An experiment with 20 topics still yields great results, with topics such as:

  • topic 11 about treasury, funding and costs
  • topic 18 about policing, Basra, Iraqiisation and reform
  • etc..


This is really really impressive! I certainly didn't expect such a treasure trove of topic insights to emerge!


Conclusion

  • SVD is great at extracting topics - we've seen the amazing results popping out of the Iraq Report corpus.
  • SVD is ok at visualising a much higher dimensional dataset in 2-d .. but it isn't amazing, as only the projection onto 2 axes (topics) can be visualised. This loses too much information if other topics are significant.
  • The bar chart plot of eigenvalues from the $\mathbf{\Sigma}$ matrix is a really good way of determining how many significant topics there are in a data set.

Next time we'll try the much bigger Clinton emails .. and also have a look at applying the similarity graph plotting to the reconstructed SVD matrix (with reduced singular values).
