Tuesday 21 February 2017

Dimension Reduction with SVD ... Let's Try It!

In the last two posts, here and here, we wanted to see how we could:


  • Visualise lots of documents .. each with many dimensions (words) .. on a 2 dimensional plot .. so that we can see them clearly and usefully on a computer or on a book page.
  • Extract out the underlying topics (e.g. vehicle, cooking) ... especially when the documents don't actually mention these more general topic names specifically.


We found a tool called Singular Value Decomposition which should help. Let's try it on real data.


The Mixed Corpus

We previously created the mixed corpus, made of 13 documents from each of the Recipes, Iraq, Macbeth and Clinton data sets. We previously used it to see if our method of grouping similar documents worked.


We'll keep our code in a notebook on GitHub at:




We'll apply our stop-word filter, and create the relevance index as usual. We then create the SVD decomposition of the word-document matrix $\mathbf{A}$, which saves an HDF5 file containing the three matrices $\mathbf{U}$, $\mathbf{\Sigma}$ and $\mathbf{V}^T$.

tmt.svd.calculate_singular_value_decomposition(cr.content_directory)

We now want to get a feel for how many useful dimensions there are. One way of doing this is to look at the eigenvalues (strictly speaking, the singular values) .. the diagonal elements of the $\mathbf{\Sigma}$ matrix. We can see them by plotting them as a bar chart.

# get SVD eigenvalues
eigenvalues = tmt.svd.get_svd_eigenvalues(cr.content_directory)

# visualise the SVD eigenvalues as a bar chart
tmt.visualisation.plot_bar_chart(eigenvalues)

The eigenvalues look like the following:


We can see there are two eigenvalues that stick out well above the others. That means the general nature of the mixed data set can be broadly reconstructed with just these two eigenvalues. If we wanted more detail, we can see the next 3 eigenvalues also stick out as a bit larger than the rest, which fall away to very small values.

We can explore these top 5 eigenvalues when we look at extracting topics below.

First let's see what the transformed document-view $\mathbf{\Sigma} \cdot \mathbf{V}^T$ looks like .. using only the top 2 eigenvalues in $\mathbf{\Sigma}$ to take a 2-dimensional slice. Hopefully similar documents are placed near each other.

# get document-view projection onto 2 dimensions
document_view = tmt.svd.get_document_view(cr.content_directory)

# plot documents in reduced dimension space with a 2-d scatter 
tmt.visualisation.plot_scatter_chart(document_view)

Here's what the 2-d plot of the document-view looks like:


Well, we can see some clusters .. the dots aren't scattered everywhere in a random manner. But can we see 4 clusters, as we hoped? We'll have to see where the original documents from the Recipes, Iraq, Macbeth and Clinton data sets ended up on that chart. We can do this because we kept the document-view matrix ordered, with the document names associated with the pandas data frame columns.
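As a rough illustration of how that separation can be done (this is a sketch, not the notebook code .. it assumes document_view is a pandas DataFrame whose columns are the document filenames, whose first two rows hold the 2-d coordinates, and whose filenames contain the data set name):

import matplotlib.pyplot as plt

# colour each document's point by the data set its filename suggests
groups = ['recipes', 'iraq', 'macbeth', 'clinton']

for group in groups:
    columns = [c for c in document_view.columns if group in c.lower()]
    plt.scatter(document_view[columns].iloc[0], document_view[columns].iloc[1], label=group)

plt.legend()
plt.show()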

Here's that plot separated out into the documents that we know they are:


That's better! We can clearly see that there is indeed a Recipes cluster and a Clinton cluster. You can also see the documents are spread along lines which are perpendicular to each other .. in other words, these two sets of documents are as different as they could be, albeit squished into a 2-d space! Cool!

The Macbeth and Iraq clusters are much closer to zero. They are distinct clusters, but they aren't as distant from each other as the Recipes and Clinton documents are. We saw before that the Iraq and Macbeth documents merge most easily in terms of similarity, so this shouldn't surprise us.


Maybe there's a lesson there .. Macbeth and Iraq .. both about corruption, death and intrigue?

Now, let's try to extract some topics.

For topics, we're looking at the word-view $\mathbf{U} \cdot \mathbf{\Sigma}$. The columns of this matrix are the topics, made up of linear combinations of the words from the original vocabulary. Because we've truncated the singular matrix $\mathbf{\Sigma}$, this word-view will have zero-value columns except for the left-most n, where n is the number of retained eigenvalues in $\mathbf{\Sigma}$.

The way we code this is to take each of the left-most n columns that aren't zero. For each one, we re-order it so that the values are in descending order .. this gives us the most significant word at the top of that column. We can then truncate these columns to retain only the most contributing words. We also remove the +/- sign of the values, because a value of -0.8 contributes more than a value of +0.001. It's the magnitude of the elements that matters .. it shows how much each word contributes to the topic.
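Here's a minimal sketch of that idea (not the actual tmt implementation), assuming word_view is a pandas DataFrame holding $\mathbf{U} \cdot \mathbf{\Sigma}$ with the vocabulary words as its index:

def get_topics_sketch(word_view, number_of_topics, topic_length):
    topics = []
    for column in range(number_of_topics):
        # take one topic column, rank words by absolute contribution (sign doesn't matter)
        topic = word_view.iloc[:, column].abs().sort_values(ascending=False)
        # keep only the most contributing words
        topics.append(topic.head(topic_length))
    return topics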

Let's try it:

# get top n topics, n is usually the same as key dimensions identified by the eigenvalue bar chart above
number_of_topics = 4
# how many words in each topic (the most significant)
topic_length = 10

topics_list = tmt.svd.get_topics(cr.content_directory, number_of_topics, topic_length)

That means we're picking out only the top 4 topics .. and each of these topics will be limited to the 10 most significant words. Printing the returned list of topics gives:

 topic # 0
sauce       0.036254
butter      0.031721
broth       0.029299
boiled      0.028786
flour       0.028568
little      0.027809
water       0.027283
rice        0.025682
quantity    0.024787
salt        0.018478
Name: 0, dtype: float64 

 topic # 1
subject         0.042459
benghazi        0.033981
state           0.032291
f201504841      0.032254
redactions      0.031469
unclassified    0.031251
sensitive       0.028776
waiver          0.027892
05132015        0.027635
department      0.025765
Name: 1, dtype: float64 

 topic # 2
vegetables    0.036591
sauce         0.030400
rice          0.023624
soup          0.022900
fish          0.019670
cabbage       0.017470
boiled        0.016740
greens        0.015799
flour         0.015604
cooking       0.014433
Name: 2, dtype: float64 

 topic # 3
rice          0.041190
vegetables    0.037766
saffron       0.018390
cabbage       0.017364
marrow        0.017055
greens        0.016380
cooked        0.012974
beef          0.012837
soup          0.012059
them          0.011880
Name: 3, dtype: float64 

We can see quite clearly the top 2 topics are distinct .. and about two very different themes. One is clearly about cooking, and the other about the Clinton emails, specifically about Libya.

That's great! We've extracted two topics.

What about the next two topics? These are also about cooking but these are less influential topics. How do we see this? If we plot the sum of the absolute values of the topic columns we can see the following:


The top 2 columns (x-axis) have the largest sums .. the next two drop significantly.
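As a rough sketch of that check (word_view is again a hypothetical name for a DataFrame holding $\mathbf{U} \cdot \mathbf{\Sigma}$, and we assume the same bar chart helper accepts a pandas Series):

# sum of absolute word contributions in each topic column of U . Sigma
topic_strengths = word_view.abs().sum(axis=0)

# plot with the same bar chart helper used for the eigenvalues above
tmt.visualisation.plot_bar_chart(topic_strengths)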

Great! That seems to work .. but let's try another, less contrived dataset.



Recipes Corpus

Let's try the Recipes corpus. We know this isn't a mixed corpus containing very different themes. Let's see what happens anyway .. it'll be useful to see how a mono-themed corpus yields to SVD.

The bar chart of eigenvalues shows only one value leading above the rest:


That probably suggests that there is only one main topic or theme of the data set. Indeed we know this to be true.

Let's see what the document-view plotted in 2-dimensions looks like:


There's one cluster ... not two or more distinct and distant clusters. Again, we expected this.

The list of topics isn't that distinct, and the sum of the elements of the topic columns shows just one dominant theme:


Well - even if that didn't reveal great insights, we can see what a mono-themed dataset looks like when SVD is applied.

Let's try another, bigger corpus.


Iraq Inquiry Report

The Iraq Inquiry Report is, at one level, all about one topic - the Iraq war and the circumstances that led to it, but it is a big enough data set that should contain several sub-topics. Let's see what happens...


So there's clearly one eigenvalue much bigger than the rest, but there are maybe four others that also stick out as significant. This could mean there are 4 or 5 topics. Let's continue ...


The document-view plotted in 2-dimensions shows one small cluster that is very clearly distant and distinct from the rest. Which documents are these? Let's see:

document_view.T[document_view.T[0] > 0.01]

                                                            0         1
the-report-of-the-iraq-inquiry_section_annex-4.txt  0.039891  0.001821

If we look closer at this document .. we see it is an annex, and in fact a set of maps. So that is indeed different from the rest of the documents which are more narrative.

Let's zoom into that central grouping a little more:


That's not quite as enlightening .. there still seems to be one cluster with a couple of off-shoots. Let's look at these. The documents with large x-coordinates are...

document_view.T[document_view.T[0] > 0.001]

                                                            0         1
the-report-of-the-iraq-inquiry_section_annex-2.txt  0.001225 -0.003456
the-report-of-the-iraq-inquiry_section_annex-3.txt  0.001312 -0.003445
the-report-of-the-iraq-inquiry_section_annex-4.txt  0.039891  0.001821

If we look closer at these annexes .. (we saw 2 above) we find that they are indeed different from the main data set .. they are a glossary of terms (annex-2), a list of names and posts (annex-3) and the set of maps (annex-4) we saw above.

What about those docs dangling downwards ...

document_view.T[document_view.T[1] < -0.006]

                                                        0         1
the-report-of-the-iraq-inquiry_section-111.txt  0.000499 -0.007369
the-report-of-the-iraq-inquiry_section-112.txt  0.000415 -0.010617

Again, looking at these two documents, sections 111 and 112, we find that they are both about de‑Ba’athification, which the plot tells us must be sufficiently different and unique compared to the rest of the documents.

Let's look at some of the topics, taking the top 10, and ten words in each:
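Presumably that's the same get_topics call as before, just asking for more topics (shown here as a sketch):

# same call as for the mixed corpus, now asking for the top 10 topics of 10 words each
topics_list = tmt.svd.get_topics(cr.content_directory, 10, 10)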

 topic # 0
multinational    0.019513
map              0.018873
mndn             0.009999
maps             0.009649
division         0.007975
dissolved        0.005887
provinces        0.005885
southeast        0.005879
mnfnw            0.005734
mndnc            0.005734
Name: 0, dtype: float64 

 topic # 1
debaathification    0.010073
resolution          0.003564
basra               0.002889
weapons             0.002380
had                 0.002360
inspectors          0.002251
destruction         0.002238
cpa                 0.002217
wmd                 0.002115
biological          0.002111
Name: 1, dtype: float64 

 topic # 2
debaathification    0.016325
baath               0.002761
resolution          0.002180
postinvasion        0.002080
no1                 0.002078
bremer              0.002075
weapons             0.001960
destruction         0.001869
baathists           0.001798
biological          0.001708
Name: 2, dtype: float64 

 topic # 3
telegrams      0.008248
quotes         0.004428
documents      0.003594
quote          0.003523
bold           0.002878
spellings      0.002869
redactions     0.002845
egram          0.002822
transcripts    0.002773
navigate       0.002693
Name: 3, dtype: float64 

 topic # 4
inquests          0.004623
families          0.004157
bereaved          0.003888
coroners          0.002850
mental            0.002770
boi               0.002714
reservists        0.002711
investigations    0.002630
telic             0.002620
inquest           0.002611
Name: 4, dtype: float64 

 topic # 5
basra               0.003605
inquests            0.003233
bereaved            0.002699
debaathification    0.002648
families            0.002642
butler              0.002636
destruction         0.002585
weapons             0.002520
dfid                0.002233
witness             0.002220
Name: 5, dtype: float64 

 topic # 6
witness           0.009471
director          0.001858
20012003          0.001808
representative    0.001532
lieutenant        0.001475
deputy            0.001458
resolution        0.001426
20002003          0.001356
commander         0.001354
butler            0.001294
Name: 6, dtype: float64 

 topic # 7
resolution     0.003268
snatch         0.003229
vehicle        0.002454
istar          0.002397
vehicles       0.002338
hutton         0.002313
butler         0.002294
requirement    0.001761
mobility       0.001725
destruction    0.001704
Name: 7, dtype: float64 

 topic # 8
treaty         0.004037
britain        0.003420
witness        0.003097
feisal         0.002870
ottoman        0.002433
nuri           0.002356
angloiraqi     0.002282
mesopotamia    0.002100
rashid         0.001794
resolution     0.001672
Name: 8, dtype: float64 

 topic # 9
veterans      0.004128
mental        0.003520
medical       0.002977
inquests      0.002574
health        0.002349
inquest       0.002167
reservists    0.001946
coroners      0.001898
resolution    0.001734
witness       0.001710
Name: 9, dtype: float64

This is much better than I would have expected .. each of these topics is meaningful .. the SVD seems to have pulled them out despite the lack of clear clusters on the 2-d plot:

  • topic 0 is about maps, regions and provinces
  • topics 1 and 2 are about de-Ba'athification and post-invasion policy and events
  • topic 3 is clearly about messages, telegrams, spellings, transcripts and quotes
  • topic 4 is about inquests, families, the bereaved and coroners
  • topic 5 is also about inquests but seems to be focussed on Basra and the Butler report
  • it's not fully clear what topic 6 is about
  • topic 7 is about the Snatch vehicles and mobility, which received much reporting about their adequacy
  • topic 8 is about the more historical topics of treaties and the Ottoman and British empires
  • topic 9 is about veterans' mental and physical health, and the related inquests

Given how well this is going .. why stop at 10 topics? An experiment with 20 topics still yields great results, with topics such as:

  • topic 11 about treasury, funding and costs
  • topic 18 about policing, Basra, iraqiisation and reform
  • etc..


This is really really impressive! I certainly didn't expect such a treasure trove of topic insights to emerge!


Conclusion

  • SVD is great at extracting topics - we've seen the amazing results popping out of the Iraq Report corpus.
  • SVD is ok at visualising in 2-d a much higher dimensional dataset .. but it isn't amazing as only the projection onto 2 axes (topics) can be visualised. This loses too much information if other topics are significant.
  • The bar chart plot of eigenvalues from the $\mathbf{\Sigma}$ matrix is a really good way of determining how many significant topics there are in a data set.

Next time we'll try the much bigger Clinton emails .. and also have a look at applying the similarity graph plotting to the reconstructed SVD matrix (with reduced singular values).

Thursday 9 February 2017

So Many Dimensions! .. And How To Reduce Them - Part 2/2 (SVD Demystified)

Previously we asked ourselves


  • How do we visualise lots of documents .. each with many dimensions (words) .. on a 2 dimensional plot .. so that I can see them clearly and usefully on my computer or on a book page?
  • How do we extract out the underlying topics (e.g. vehicle, cooking) ... especially when the documents don't actually mention these more general topic names specifically?


With a deliberately constructed example, we saw how reducing the dimensions can achieve both of these tasks. What we didn't see is how we could reduce dimensions automatically.

A popular way to do dimension reduction (ugh, what a yucky phrase) .. is by using a mathematical process called Singular Value Decomposition (another horrible phrase).

SVD is not well explained in my opinion, so here I'll try to gently introduce the idea by slowly building up from basics that we're all familiar with. Let's see if I can make it a bit clearer ...


What Are We Trying To Do .. the View From Above

Before we dive into details .. we should really make clear and keep in mind what we're trying to do.

We're trying to take a matrix and turn it into a smaller one. That's it.

Let's explain that ... we have matrices, just a grid of numbers, which represent the word counts of words in documents. Here's one we saw earlier, which is a 6 by 6 matrix. The word wheel appears 4 times in doc1 and doesn't appear in doc6.


We want to turn it into a smaller matrix which also somehow keeps the most essential information. We saw this too, when we reduced that 6 by 6 matrix into a smaller 2 by 6 matrix.


So mathematically we need to do something like the following diagram shows:


As we go through some of the maths and algebra .. we should keep the above image in our minds .. that's what we're ultimately trying to do .. reduce dimensions.


Don't Throw Away The Important Bits

Looking at that diagram above we might naively suggest throwing away all the rightmost and bottommost elements, and just keeping the top left. Well, it would certainly reduce the size of the matrix.

But this is the important bit - somehow we have to throw away only the least useful bits of information. 

That's really key .. and the reason we have to go through some mathematical jiggery-pokery. If we didn't care about keeping the most useful information in the smaller reduced matrix, we could just throw away bits of the bigger matrix without too much thought.

So how do we identify the most useful bits? If we can do this, we can chuck away the least useful bits with ease. This is the reason we need a bit of maths.


What's Important?

How do we work out which bits are important? Who says they're important? Perhaps we might disagree on what's important and what's not?

All valid questions .. so let's agree on a definition otherwise we'll get nowhere:

The important bits of a set of data are the ones that can best reconstruct the whole data set. 

That means if we threw away the important bits, and looked at the remaining bits, we would not be able to imagine or reconstruct anything like the original set of data. It won't be perfect, because we have lost information by throwing something away, but hopefully the bit we chucked away wasn't crucial to the shape or nature of the original data set.

You may realise that this is a bit like lossy compression - when we squeeze images or music down into a smaller set of data .. a smaller file .. we're throwing away bits which we hope don't distort the reconstructed image or music too much.


Identifying The Important Bits

How do we actually identify the important bits? In mathematics, sometimes it's easier to work backwards from a desired result, rather than starting from nothing and hoping to arrive at it.

So imagine we already have a mathematical operation that transforms our smaller matrix into the big one. When we apply a transformation, we usually do this by multiplying by a matrix. Why not addition or some other arithmetic operation? Simply because multiplying matrices gives us the possibility of stretching and rotating ... whereas adding/subtracting only shifts data along the axes, which is not so useful if we have data that is normalised or centred around an origin.

The following shows this supposed transformation by multiplying matrices visually. That yellow matrix is the magic transformation. It is applied to the smaller dimension-reduced matrix to give us the original bigger matrix.


Could this work? Well, one rule that must apply is the one we know from school maths - that the rows and columns of the multiplied matrices must be compatible. You'll remember that when we multiply matrices:

  • the answer has the same number of rows as the first matrix, and the same number of columns as the second. 
  • and that the number of columns of the first matrix must equal the number of rows of the second.

The following illustrates these rules. If these rules aren't met, the matrices can't actually be multiplied.


Oh dear .. that means we can't have a matrix multiplied by our smaller dimension-reduced one that results in the original bigger matrix. Why? Because, from the multiplication rules above, one of the dimensions of the reduced matrix would have to match the original matrix .. which means it wouldn't really be a reduction.

Ok - so we actually do need the right hand side of that equation to comply with the matrix multiplication rules. There are lots of ways you could do that, but a simple one - simple is always good - just adds another matrix on the right. Have a look at the following to see how and why you can multiply three matrices (rightmost first) and end up with an answer that is the size of the big original.


You can see that to recreate a matrix of size A columns and B rows, the yellow matrix needs to have B=D rows and the green matrix needs to have A=G columns. It doesn't matter what the size of the middle matrix is .. and we want it to be smaller!

This is good progress - we now have a mathematical way to relate a bigger matrix to a smaller one.

Even better - you can see how that middle pink matrix could be as large as the original one ... and could shrink ... and we'd still have an answer of the right size. That's what we want to do .. shrink the matrix .. and still reconstruct the original.

Phew! What a lot of maths effort!

Now we need to know what those matrices on the right need to contain ...


Singular Value Decomposition

It turns out that it is indeed possible to break a matrix down into three multiplied components just as we wanted to do above. Those three components are the yellow, pink and green matrices on the right hand side of the earlier diagram.

This breaking down into multiplied components is just like factorisation that we might have seen at school. For example, 24 can be factorised into 2, 3 and 4. That is $24 = 2 \cdot 3 \cdot 4$.

There may be many ways to factorise a matrix .. just like there may be many ways to factorise normal numbers ... but the special factorisation of a matrix A that we're interested in is called the Singular Value Decomposition, or just SVD. It is often written as follows:

$$ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $$

Let's go through that expression step-by-step:

  • The $\mathbf{A}$ is the matrix we want to break down into 3 multiplied pieces.
  • Those 3 pieces are $\mathbf{U}$, $\mathbf{\Sigma}$, and $\mathbf{V}^T$
  • The middle piece, $\mathbf{\Sigma}$, is the piece that we can change the size of, and still have a matrix $\mathbf{A}$ of the right size.  This is the one we're interested in for our overall aim of reducing dimensions. Lots of people also call this $\mathbf{S}$ instead of $\mathbf{\Sigma}$.
  • The $\mathbf{U}$ is an interesting matrix but we can delay talking about it until the next post. We will see it used later in this post too.
  • Similarly, the $\mathbf{V}^T$ is also an interesting matrix, which we'll delay talking about until the next post. That $T$ superscript means the matrix is transposed, that is, flipped along the diagonal that runs from top left to bottom right (so the top right element becomes the bottom left element). Why? It's just a convention that came from mathematicians finding it easier to work with.

To keep this discussion clear, we won't look at why and how the SVD works ... we'll do that in the next post. For now, we'll assume it works.

In fact, SVD is so popular and powerful that it is provided by many software toolkits, including python's numpy library as numpy.linalg.svd.


Reducing Dimensions

That middle element $\mathbf{\Sigma}$ is the one we'll use to reduce the dimensions of $\mathbf{A}$. We've already seen above that if we change the size of $\mathbf{\Sigma}$ we don't affect the size of $\mathbf{A}$.

But how do we know we're not going to throw away important information? We already said that we only want to throw away the bits of $\mathbf{\Sigma}$ that contribute least to $\mathbf{A}$.

Again - a nice feature of SVD is that the middle element, called the matrix of singular values, is diagonal. That is, all the elements are zero except those on the diagonal. That makes throwing bits away easy, because each diagonal element only interacts with one column of $\mathbf{U}$ and one row of $\mathbf{V}^T$.

Even better - the elements of that singular value matrix are ordered by size. The biggest values are at the top left, and the smallest at the bottom right. That is really really helpful. The largest values are the ones that contribute most to the content of $\mathbf{A}$. So if we want to reduce dimensions, we want to keep these large components.

Let's say all that last bit again, but in a different way .. because it is important enough to repeat! Have a look at the following. It's just a simple statement showing how two matrices are multiplied together to give another one.

$$
\begin{pmatrix}
18 & 3 \\
\end{pmatrix}

=

\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
9 & 0 \\
0 & 1 \\
\end{pmatrix}

$$

You can see that the 9 in the diagonal matrix contributes to the left hand side more than the 1 does. If we wanted to chop off that 1, we'd have the following:

$$
\begin{pmatrix}
18 & 0 \\
\end{pmatrix}

=

\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
9 & 0 \\
0 & 0 \\
\end{pmatrix}

$$

The original matrix is changed a bit. But not as much as if we had removed the 9 and kept the 1:

$$
\begin{pmatrix}
0 & 3 \\
\end{pmatrix}

=

\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
0 & 0 \\
0 & 1 \\
\end{pmatrix}

$$

It might be easier to see what's happening with pictures instead of numbers ... the following shows the effect on an image of a face as the singular values of the diagonal matrix are turned to zero progressively. You can see again, that the largest (top left) value maintains the broadest structure, whilst the smaller ones provide less significant information. Incidentally, this is not a bad way to compress images .. and there's a further good example here, and another here.


That shows how a diagonal matrix, with its diagonal elements ordered by size, can be used to remove dimensions in a way that removes the least information first and keeps the most important information. (There are proper rigorous proofs of this, but here we just want to give an intuition.)

Enough theory .. let's try it ...


SVD Practical Example 1 - Check Factors Rebuild Original Matrix

Let's see the SVD in action on a simple matrix first.

This python notebook on github shows how the following simple matrix is decomposed into 3 elements. The original matrix is $\mathbf{A}$:

A = 
 [[-1  1]
 [ 1  1]]

Python numpy's function for SVD, numpy.linalg.svd(), uses slightly different terminology. It factors $\mathbf{A}$ into u, s and v, where s is a 1-dimensional array of singular values. Compared to the usual notation, $\mathbf{U}$ = u, $\mathbf{S}$ = numpy.diag(s), and $\mathbf{V}^T$ = v.
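For reference, here's a minimal sketch of what the notebook does (not the notebook verbatim):

import numpy as np

# the simple 2x2 matrix we want to decompose
A = np.array([[-1, 1],
              [ 1, 1]])

# numpy returns u, a 1-d array of singular values s, and v (which plays the role of V^T)
u, s, v = np.linalg.svd(A)

# rebuild the diagonal matrix S from the 1-d array s
S = np.diag(s)

print("U =\n", u)
print("S =\n", S)
print("V^T =\n", v)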

Applying this python SVD function to $\mathbf{A}$ gives us ...

U =
 [[-0.70710678  0.70710678]
 [ 0.70710678  0.70710678]] 

S =
 [[ 1.41421356  0.        ]
 [ 0.          1.41421356]] 

V^T =
 [[ 1.  0.]
 [ 0.  1.]] 

We can see a few things already:

  • The $\mathbf{U}$ has rows that are unit-length vectors. 
  • The $\mathbf{V}^T$ has columns that are unit-length vectors.
  • The $\mathbf{S}$ is indeed diagonal. 

Now that we have broken $\mathbf{A}$ into $\mathbf{U} \cdot \mathbf{S} \cdot \mathbf{V}^T$, let's see if we can reconstruct $\mathbf{A}$ from these 3 factors.

Look again at the python notebook, and you'll see that multiplying $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}^T$ gives us back the original $\mathbf{A}$.
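In numpy terms, continuing the sketch above, that check is just:

# multiply the three factors back together .. we should recover the original A
A2 = np.dot(u, np.dot(S, v))
print("A2 =\n", A2)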

A2 = 
 [[-1.  1.]
 [ 1.  1.]]

That's great! It is reassuring to see that actually confirmed.

Now let's look at SVD applied to the simple document set about cooking and vehicles we used before.


SVD Practical Example 2 - Reduce Simple Document Set

Another python notebook on github illustrates SVD applied to that simple set of documents we talked about above.

The word-document matrix is

         doc1    doc2  doc3    doc4    doc5  doc6
wheel    0.50  0.3333  0.25  0.0000  0.0000   0.0
seat     0.25  0.3333  0.00  0.0000  0.0000   0.0
engine   0.25  0.3333  0.75  0.0000  0.0000   0.0
slice    0.00  0.0000  0.00  0.5000  0.5000   0.6
oven     0.00  0.0000  0.00  0.3333  0.1667   0.0
boil     0.00  0.0000  0.00  0.1667  0.3333   0.4

This is the original set of data $\mathbf{A}$ that we'll try to decompose into 3 pieces with SVD.

Applying the python numpy SVD function we get those 3 pieces:

U =
 [[ 0.   -0.57 -0.52  0.   -0.64  0.  ]
 [ 0.   -0.29 -0.6   0.    0.75  0.  ]
 [ 0.   -0.77  0.61  0.    0.2   0.  ]
 [-0.84  0.    0.    0.    0.   -0.54]
 [-0.25  0.    0.   -0.89  0.    0.38]
 [-0.48  0.    0.    0.45  0.    0.75]] 

S =
 [[ 1.1   0.    0.    0.    0.    0.  ]
 [ 0.    1.06  0.    0.    0.    0.  ]
 [ 0.    0.    0.45  0.    0.    0.  ]
 [ 0.    0.    0.    0.29  0.    0.  ]
 [ 0.    0.    0.    0.    0.13  0.  ]
 [ 0.    0.    0.    0.    0.    0.05]] 

V^T =
 [[ 0.    0.    0.   -0.53 -0.56 -0.63]
 [-0.52 -0.51 -0.68  0.    0.    0.  ]
 [-0.57 -0.38  0.72  0.    0.    0.  ]
 [ 0.    0.    0.   -0.77  0.01  0.64]
 [-0.63  0.77 -0.1   0.    0.    0.  ]
 [-0.   -0.   -0.   -0.35  0.83 -0.44]] 

We make the same observations about these 3 matrices:
  • The $\mathbf{U}$ has rows that are unit-length vectors. 
  • The $\mathbf{V}^T$ has columns that are unit-length vectors.
  • The $\mathbf{S}$ is indeed diagonal, and this time we can see the values ordered by size, the largest being 1.1 and the smallest 0.05. 

Now let's do the important bit - reducing dimensions. We previously wanted to reduce this 6 dimensional data into a more easily viewable 2 dimensions. We do this by only keeping 2 dimensions of $\mathbf{S}$ and calling it $\mathbf{S}_{reduced}$.

S2 =
 [[ 1.1   0.    0.    0.    0.    0.  ]
 [ 0.    1.06  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]]

You can see that we keep only the values 1.1 and 1.06 and zero everything else.

What do we do with this new singular matrix? Well, we use it to create new views of the documents and the words. How? Look at the following diagram showing what happens when we multiply $\mathbf{S}$ with one of the other two matrices.

You can see that:

  • Calculating $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$ is combining a simple scaling factor (the diagonal elements of $\mathbf{S}$) with the column vectors of $\mathbf{V}^T$. That means we still have a vertical vector, which in the original $\mathbf{A}$ is the document view. So this calculated vector is a new document view.
  • Calculating $\mathbf{U} \cdot \mathbf{S}_{reduced}$ is combining a simple scaling factor (the diagonal elements of $\mathbf{S}$) with the row vectors of $\mathbf{U}$. Again, this means we still have a row vector, which in the original $\mathbf{A}$ is the word view. So this calculated vector is a new word view.

For a more mathematical explanation see the update at the bottom of this blog.
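Continuing the earlier numpy sketch (here A is assumed to be the 6 by 6 word-document matrix above), the reduction and the two new views look roughly like this:

# SVD of the 6x6 word-document matrix
u, s, v = np.linalg.svd(A)
S = np.diag(s)

# zero all but the top 2 singular values
S_reduced = np.zeros_like(S)
S_reduced[:2, :2] = S[:2, :2]

# new document view .. each column (document) now has only 2 meaningful features
document_view = np.dot(S_reduced, v)

# new word view .. each row (word) now has only 2 meaningful features
word_view = np.dot(u, S_reduced)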

Let's look at the new view of the documents, which is $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$. This turns out to be:

S_reduced_VT = 
 [[ 0.    0.    0.   -0.58 -0.62 -0.7 ]
 [-0.55 -0.54 -0.72  0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]]

You can see this is really a 2-dimensional set of data ... each column, which represents a document, has only 2 features now. That's a direct result of zeroing all but the top 2 elements of the diagonal $\mathbf{S}$.

Let's plot this to see what it looks like:


We see those 6 documents now squished into a 2-dimensional plane (the above graph), and we can see 2 clusters of documents. Those are the two natural groups of documents about vehicles and documents about cooking.

Nice!

Let's now look at the reduced view of the words, which is $\mathbf{U} \cdot  \mathbf{S}_{reduced}$. This turns out to be:

             0      1    2    3    4    5
wheel     0.00  -0.60  0.0  0.0  0.0  0.0
seat      0.00  -0.30  0.0  0.0  0.0  0.0
engine    0.00  -0.81  0.0  0.0  0.0  0.0
slice    -0.93   0.00  0.0  0.0  0.0  0.0
oven     -0.27   0.00  0.0  0.0  0.0  0.0
boil     -0.53   0.00  0.0  0.0  0.0  0.0

Remember you can follow the code at the notebook on github.

Anyway - what does that view tell us? Well, we can see there are 2 topics:

  • The first topic is mostly influenced by the word slice, and then a bit by boil, and not much by oven. That topic is ... cooking. You can see that slice has a contributing factor of -0.93, compared to say oven which has a smaller -0.27.
  • The other topic is mostly made of up the word engine, then wheel and some seat. That topic is ... vehicles.

What we have just done is extract those topics without having to know what they were beforehand.

Nice!


Try SVD on a Slightly Bigger (3 Topic) Document Set

Let's see if this SVD method works for a slightly bigger set of documents. We'll take the above small set of documents and add 3 more representing a new topic, home, made of words like door, kitchen and roof.

You can follow the code in a notebook on github.

Here's the word document matrix:

         doc1    doc2  doc3    doc4    doc5  doc6  doc7  doc8  doc9
wheel    0.50  0.3333  0.25  0.0000  0.0000   0.0   0.0  0.00  0.00
seat     0.25  0.3333  0.00  0.0000  0.0000   0.0   0.0  0.25  0.00
engine   0.25  0.3333  0.75  0.0000  0.0000   0.0   0.0  0.00  0.00
slice    0.00  0.0000  0.00  0.5000  0.5000   0.6   0.0  0.00  0.00
oven     0.00  0.0000  0.00  0.3333  0.1667   0.0   0.5  0.00  0.00
boil     0.00  0.0000  0.00  0.1667  0.3333   0.4   0.0  0.00  0.00
door     0.00  0.0000  0.00  0.0000  0.0000   0.0   0.0  0.25  0.25
kitchen  0.00  0.0000  0.00  0.0000  0.0000   0.0   0.5  0.25  0.25
roof     0.00  0.0000  0.00  0.0000  0.0000   0.0   0.0  0.25  0.50

We take the SVD but use the top 3 elements of the singular matrix, not 2. Why? Well, we could use 2 if all we wanted to do was plot the new document view on a 2d plot. But we want to spot 3 topics, so we need 3 elements of the singular matrix to transform the words into topics. In real life, with unknown data, we would explore the data and views with different cuts of $\mathbf{S}$ to see whether meaningful topics emerged.
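As a sketch of that step (A9 is a hypothetical name standing in for the 9 by 9 word-document matrix above):

# SVD of the 9x9 word-document matrix
u, s, v = np.linalg.svd(A9)

# keep the top 3 singular values this time, zero the rest
S_reduced = np.zeros((9, 9))
S_reduced[:3, :3] = np.diag(s[:3])

document_view = np.dot(S_reduced, v)   # used for the document plot below
word_view = np.dot(u, S_reduced)       # used to read off the 3 topics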


The 2d plot of documents clearly shows 3 clusters .. which is what we expected. Great!

And the topic view is as follows:

             0      1      2  3  4  5  6  7  8
wheel    -0.01   0.60  -0.04  0  0  0  0  0  0
seat     -0.02   0.33   0.11  0  0  0  0  0  0
engine   -0.02   0.81  -0.09  0  0  0  0  0  0
slice    -0.91  -0.03  -0.15  0  0  0  0  0  0
oven     -0.37   0.00   0.29  0  0  0  0  0  0
boil     -0.52  -0.02  -0.10  0  0  0  0  0  0
door     -0.02   0.03   0.27  0  0  0  0  0  0
kitchen  -0.12   0.04   0.57  0  0  0  0  0  0
roof     -0.03   0.04   0.41  0  0  0  0  0  0

The above shows that the new axes are linear combinations of existing words:

  • topic 1 is mostly slice and some boil
  • topic 2 is mostly engine and some wheel
  • topic 3 is mostly kitchen and some roof


Conclusion

We've seen the broad intuition behind reducing dimensions by using a special method of factorising matrices, which allows us to separate out the key contributing elements and zero out the ones we think are less informative.

We've seen practically how to:

  • visualise high dimensional data in much lower dimensions .. 6-dimensional documents on to a 2-d plot.
  • extract out the key topics in terms of the words they contain, without knowing what they are beforehand.

In this post we've said that the SVD can do its magic but not really explained why or how .. that's a more mathematical discussion which we'll do in the next / future post.




Update

I wasn't 100% convinced by the many books and guides that say that $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$ is a document view in the new, dimension-reduced space. Similarly, the explanations for why $\mathbf{U} \cdot \mathbf{S}_{reduced}$ is a new word view in the dimension-reduced space weren't quite believable.

So I kept looking for a good explanation. The best I could find is the following from one of my textbooks:


At the bottom of that section it says:

"we can see that ... $A'^TA' = VD'^TD'V^T$ .. and thus the representation
of documents in the low-dimensional LSI space is given by the rows of the $VD^T$ matrix"

How is this?

Let's start with what they call $A'^TA'$. The $\mathbf{A'}$ there is our original matrix $\mathbf{A}$, but the version that's reconstructed from the reduced $\mathbf{\Sigma}$. So we can say

$$ \mathbf{A'} = \mathbf{U} \cdot \mathbf{\Sigma'} \cdot \mathbf{V}^T $$

Notice the symbol ' used to remind us we're talking about the reduced dimension versions of these.

Fine .. but what's the obsession with $A'^TA'$? Well, let's go back to our original $\mathbf{A}$. What is $\mathbf{A^T} \cdot \mathbf{A}$? It's a matrix where each element is arrived at by combining two vectors, one from $\mathbf{A^T}$ and one from $\mathbf{A}$. The way matrices are multiplied, this is effectively combining document vectors ... the vectors of word-counts for each document. The highest values will be for those pairs of documents whose word-counts are both high for the same words ... in other words, $\mathbf{A^T} \cdot \mathbf{A}$ is a document-document correlation matrix. It shows which documents are most similar to each other.

To see this even more clearly, that first part of the multiplication is $\mathbf{A^T}$ .. that transpose makes sure that when we do the matrix multiplication, it is the document vector (vertical in $\mathbf{A}$) that is used.

So,

$$\mathbf{A^T} \cdot \mathbf{A}$$

is a document-document correlation matrix.

You can see that if we swapped rows for columns when multiplying those matrices, we'd get a word-word correlation matrix:

$$\mathbf{A} \cdot \mathbf{A^T}$$

where the order is reversed.

Phew ... what's that got to do with the question? Well, if we expand out the algebra for what $\mathbf{A}$ is .. $ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $ .. we can say ..

$$ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $$

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma}^T \cdot \mathbf{U}^T) \cdot
(\mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T) $$

.. simplifying because $\mathbf{U}^T \cdot \mathbf{U} = \mathbf{I}$ (the identity) by definition ... and also $\mathbf{\Sigma}^T = \mathbf{\Sigma}$ because it's diagonal ...

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma})  \cdot (\mathbf{\Sigma} \cdot \mathbf{V}^T) $$

Now from this we can see that the right hand side is a document-document correlation matrix, with two similar bits, one which is the transpose of the other. So we can relate $\mathbf{A}$ with $\mathbf{\Sigma} \cdot \mathbf{V}^T $ in some way.
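To make that explicit: because $\mathbf{\Sigma}$ is diagonal (and so symmetric), $(\mathbf{\Sigma} \cdot \mathbf{V}^T)^T = \mathbf{V} \cdot \mathbf{\Sigma}$, which means we can write

$$ \mathbf{A}^T \cdot \mathbf{A} = (\mathbf{\Sigma} \cdot \mathbf{V}^T)^T \cdot (\mathbf{\Sigma} \cdot \mathbf{V}^T) $$

That's exactly the document-document correlation we would get if we treated $\mathbf{\Sigma} \cdot \mathbf{V}^T$ itself as the matrix of document vectors.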

Or in the new low-dimensional space, $\mathbf{A'}$ with $\mathbf{\Sigma'} \cdot \mathbf{V}^T $.

That's why we can say that

$$\mathbf{\Sigma'} \cdot \mathbf{V}^T $$

is a document-view like matrix. I think ...  I'm convinced ... are you?




Update 2

I am convinced now .. I've found many more papers that explain the justification for why $\mathbf{\Sigma'} \cdot \mathbf{V}^T $ is a document-view like matrix.




Notice how that top paper makes clear that $\mathbf{\Sigma'} \cdot \mathbf{V}^T $ is not the same as $\mathbf{A}$ or $\mathbf{A}^T$ .. just that the correlations work out the same:

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma})  \cdot (\mathbf{\Sigma} \cdot \mathbf{V}^T) $$