Thursday 9 February 2017

So Many Dimensions! .. And How To Reduce Them - Part 2/2 (SVD Demystified)

Previously we asked ourselves


  • How do we visualise lots of documents .. each with many dimensions (words) .. on a 2 dimensional plot .. so that I can see them clearly and usefully on my computer or on a book page?
  • How do we extract out the underlying topics (e.g. vehicle, cooking) ... especially when the documents don't actually mention these more general topic names specifically?


With a deliberately constructed example, we saw how reducing the dimensions can achieve both of these tasks. What we didn't see is how we could reduce dimensions automatically.

A popular way to do dimension reduction (ugh, what a yucky phrase) .. is by using a mathematical process called Singular Value Decomposition (another horrible phrase).

SVD is not well explained, in my opinion, so here I'll try to gently introduce the idea by slowly building up from basics that we're all familiar with. Let's see if I can make it a bit clearer ...


What Are We Trying To Do .. the View From Above

Before we dive into details .. we should really make clear and keep in mind what we're trying to do.

We're trying to take a matrix and turn it into a smaller one. That's it.

Let's explain that ... we have matrices, just grids of numbers, which represent the counts of words in documents. Here's one we saw earlier, which is a 6 by 6 matrix. The word wheel appears 4 times in doc1 and doesn't appear at all in doc6.


We want to turn it into a smaller matrix which also somehow keeps the most essential information. We saw this too, when we reduced that 6 by 6 matrix into a smaller 2 by 6 matrix.


So mathematically we need to do something like the following diagram shows:


As we go through some of the maths and algebra .. we should keep that above image in our minds .. that's what we're ultimately trying to do .. reduce dimensions.


Don't Throw Away The Important Bits

Looking at that diagram above we might naively suggest throwing away all the rightmost and bottommost elements, and just keeping the top left. Well, it would certainly reduce the size of the matrix.

But this is the important bit - somehow we have to throw away only the least useful bits of information. 

That's really key .. and the reason we have to go through some mathematical jiggery-pokery. If we didn't care about keeping the most useful information in the smaller reduced matrix, we could just throw away bits of the bigger matrix without too much thought.

So how do we identify the most useful bits? If we can do this, we can chuck away the least useful bits with ease. This is the reason we need a bit of maths.


What's Important?

How do we work out which bits are important? Who says they're important? Perhaps we might disagree on what's important and what's not?

All valid questions .. so let's agree on a definition otherwise we'll get nowhere:

The important bits of a set of data are the ones that can best reconstruct the whole data set. 

That means that if we threw away the important bits and looked only at what was left, we wouldn't be able to imagine or reconstruct anything like the original data set. Keeping the important bits and throwing away the rest won't give a perfect reconstruction either, because we have still lost information, but hopefully the bits we chucked away weren't crucial to the shape or nature of the original data set.

You may realise that this is a bit like lossy compression - when we squeeze images or music down into a smaller set of data .. a smaller file .. we're throwing away bits which we hope don't distort the reconstructed image or music too much.


Identifying The Important Bits

How do we actually identify the important bits? In mathematics, it's sometimes easier to work backwards from a desired result, rather than starting from nothing and hoping to arrive at that result.

So imagine we already have a mathematical operation that transforms our smaller matrix into the big one. When we apply a transformation, we usually do this by multiplying by a matrix. Why not addition or some other arithmetic operation? Simply because multiplying matrices gives us the possibility of stretching and rotating ... whereas adding/subtracting only shifts data along the axes, which is not so useful if we have data that is normalised or centred around an origin.

The following shows this supposed transformation by multiplying matrices visually. That yellow matrix is the magic transformation. It applies to the smaller dimensions-reduced matrix to give us the original bigger matrix.


Could this work? Well, one rule that must apply is the one we know from school maths - that the rows and columns of the multiplied matrices must be compatible. You'll remember that when we multiply matrices:

  • the answer has the same number of rows as the first matrix, and the same number of columns as the second. 
  • and that the number of columns of the first matrix must equal the number of rows of the second.

The following illustrates these rules. If the rules weren't true the matrices can't actually be multiplied.
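Here's a quick way to check those rules in numpy, if you like to see things in code (a minimal sketch; the matrices and their shapes are made up purely for illustration):

import numpy as np

first = np.ones((2, 3))     # 2 rows, 3 columns
second = np.ones((3, 4))    # 3 rows, 4 columns .. 3 matches the columns of the first

answer = np.dot(first, second)
print(answer.shape)         # (2, 4) .. rows of the first, columns of the second

# np.dot(second, first) would raise an error .. 4 columns can't meet 2 rows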


Oh dear .. that means we can't have a single matrix multiplied by our smaller dimension-reduced one that results in the original bigger matrix. Why? Because the multiplication rules above force one dimension of the reduced matrix to match the original matrix .. which means it wouldn't really be a reduction.

Ok - so we actually do need the right hand side of that equation to comply with the matrix multiplication rules. There are lots of ways you could do that, but a simple one - simple is always good - just adds another matrix on the right. Have a look at the following to see how and why you can multiply three matrices (rightmost first) and end up with an answer that is the size of the big original.


You can see that to recreate a matrix of size A columns and B rows, the yellow matrix needs to have B=D rows and the green matrix needs to have A=G columns. It doesn't matter what the size of the middle matrix is .. and we want it to be smaller!

This is good progress - we now have a mathematical way to relate a bigger matrix to a smaller one.

Even better - you can see how that middle pink matrix could be as large as the original one ... or could shrink ... and we'd still have an answer of the right size. That's exactly what we want .. to shrink that middle matrix .. and still be able to reconstruct the original.
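If it helps, here's the same shape argument as a tiny numpy sketch (the names and contents of the matrices are placeholders .. only the shapes matter here):

import numpy as np

yellow = np.ones((6, 2))             # B rows
pink = np.ones((2, 2))               # the middle matrix .. the one we're allowed to shrink
green = np.ones((2, 6))              # A columns

answer = yellow.dot(pink).dot(green)
print(answer.shape)                  # (6, 6) .. same size as the big original matrix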

Phew! What a lot of maths effort!

Now we need to know what those matrices on the right need to contain ...


Singular Value Decomposition

It turns out that it is indeed possible to break a matrix down into three multiplied components just as we wanted to do above. Those three components are the yellow, pink and green matrices on the right hand side of the earlier diagram.

This breaking down into multiplied components is just like factorisation that we might have seen at school. For example, 24 can be factorised into 2, 3 and 4. That is $24 = 2 \cdot 3 \cdot 4$.

There may be many ways to factorise a matrix .. just like there may be many ways to factorise normal numbers ... but the special factorisation of a matrix A that we're interested in is called the Singular Value Decomposition, or just SVD. It is often written as follows:

$$ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $$

Let's go through that expression step-by-step:

  • The $\mathbf{A}$ is the matrix we want to break down into 3 multiplied pieces.
  • Those 3 pieces are $\mathbf{U}$, $\mathbf{\Sigma}$, and $\mathbf{V}^T$
  • The middle piece, $\mathbf{\Sigma}$, is the piece that we can change the size of, and still have a matrix $\mathbf{A}$ of the right size.  This is the one we're interested in for our overall aim of reducing dimensions. Lots of people also call this $\mathbf{S}$ instead of $\mathbf{\Sigma}$.
  • The $\mathbf{U}$ is an interesting matrix but we can delay talking about it until the next post. We will see it used later in this post too.
  • Similarly, the $\mathbf{V}^T$ is also an interesting matrix, which we'll delay talking about until the next post. That $T$ superscript means the matrix is transposed, that is, flipped along the diagonal that runs from top left to bottom right (so the top right element becomes the bottom left element). Why? It's just convention that came from mathematicians finding it easier to work with.
To keep this discussion clear, we won't look at why and how the SVD works ... we'll do that in the next post. For now, we'll assume it works.

In fact, SVD is so popular and powerful that it is provided by many software toolkits, including Python's numpy library as numpy.linalg.svd.


Reducing Dimensions

That middle element $\mathbf{\Sigma}$ is the one we'll use to reduce the dimensions of $\mathbf{A}$. We've already seen above that if we change the size of $\mathbf{\Sigma}$ we don't affect the size of $\mathbf{A}$.

But how do we know we're not going to throw away important information? We already said that we only want to throw away the bits of $\mathbf{\Sigma}$ that contribute least to $\mathbf{A}$.

Again - a nice feature of SVD is that the middle element, called the matrix of singular values, is diagonal. That is, all the elements are zero except those on the diagonal. That makes throwing bits away easy, because each diagonal element only affects one row or column of the other matrices, $\mathbf{U}$ and $\mathbf{V}^T$.

Even better - the elements of that singular value matrix are ordered by size. The biggest values are at the top left, and the smallest at the bottom right. That is really really helpful. The largest values are the ones that contribute most to the content of $\mathbf{A}$. So if we want to reduce dimensions, we want to keep these large components.

Let's say all that last bit again, but in a different way .. because it is important enough to repeat! Have a look at the following. It's just a simple statement showing how two matrices are multiplied together to give another one.

$$
\begin{pmatrix}
18 & 3 \\
\end{pmatrix}
=
\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
9 & 0 \\
0 & 1 \\
\end{pmatrix}
$$

You can see that the 9 in the diagonal matrix contributes to the left hand side more than the 1 does. If we wanted to chop off that 1, we'd have the following:

$$
\begin{pmatrix}
18 & 0 \\
\end{pmatrix}
=
\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
9 & 0 \\
0 & 0 \\
\end{pmatrix}
$$

The original matrix is changed a bit. But not as much as if we had removed the 9 and kept the 1:

$$
\begin{pmatrix}
0 & 3 \\
\end{pmatrix}
=
\begin{pmatrix}
2 & 3 \\
\end{pmatrix}
\cdot
\begin{pmatrix}
0 & 0 \\
0 & 1 \\
\end{pmatrix}
$$

It might be easier to see what's happening with pictures instead of numbers ... the following shows the effect on an image of a face as the singular values of the diagonal matrix are turned to zero progressively. You can see again, that the largest (top left) value maintains the broadest structure, whilst the smaller ones provide less significant information. Incidentally, this is not a bad way to compress images .. and there's a further good example here, and another here.


That shows how a diagonal matrix, with its diagonal elements ordered by size, can be used to remove dimensions in a way that throws away the least important information first and keeps the most important. (There are proper rigorous proofs of this, but here we just want to give an intuition.)
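If you'd like to replay that little numeric experiment yourself, here's a minimal numpy sketch of it (not from the notebooks .. just the three calculations above):

import numpy as np

row = np.array([[2, 3]])

keep_both = np.array([[9, 0], [0, 1]])
drop_small = np.array([[9, 0], [0, 0]])   # chop off the 1
drop_big = np.array([[0, 0], [0, 1]])     # chop off the 9

print(row.dot(keep_both))    # [[18  3]]
print(row.dot(drop_small))   # [[18  0]] .. only changed a bit
print(row.dot(drop_big))     # [[ 0  3]] .. much more damage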

Enough theory .. let's try it ...


SVD Practical Example 1 - Check Factors Rebuild Original Matrix

Let's see the SVD in action on a simple matrix first.

This python notebook on github shows how the following simple matrix is decomposed into 3 elements. The original matrix is $\mathbf{A}$:

A = 
 [[-1  1]
 [ 1  1]]

Python numpy's function for SVD, numpy.linalg.svd(), uses slightly different terminology. It factors $\mathbf{A}$ into $\mathbf{u}$, numpy.diag(s), and $\mathbf{v}$. Compared to the usual notation, $\mathbf{U} = \mathbf{u}$, $\mathbf{S} =$ numpy.diag(s), and $\mathbf{V}^T = \mathbf{v}$.
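If you're not following along with the notebook, the decomposition step looks roughly like this (a minimal sketch of what the notebook does, not a copy of it):

import numpy as np

A = np.array([[-1, 1],
              [ 1, 1]])

u, s, vt = np.linalg.svd(A)   # s comes back as a 1-d array of singular values
S = np.diag(s)                # turn it into the diagonal matrix we've been calling S

print(u)
print(S)
print(vt)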

Applying this python SVD function to $\mathbf{A}$ gives us ...

U =
 [[-0.70710678  0.70710678]
 [ 0.70710678  0.70710678]] 

S =
 [[ 1.41421356  0.        ]
 [ 0.          1.41421356]] 

V^T =
 [[ 1.  0.]
 [ 0.  1.]] 

We can see a few things already:

  • The $\mathbf{U}$ has rows that are unit-length vectors. 
  • The $\mathbf{V}^T$ has columns that are unit-length vectors.
  • The $\mathbf{S}$ is indeed diagonal. 
Now that we have broken $\mathbf{A}$ into $\mathbf{U} \cdot \mathbf{S} \cdot \mathbf{V}^T $, let's see if we can reconstruct $\mathbf{A}$ from these 3 factors.

Look again at the python notebook, and you'll see that multiplying $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}^T$ gives us back the original $\mathbf{A}$.

A2 = 
 [[-1.  1.]
 [ 1.  1.]]
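In code, that check is only a couple of lines (a sketch, reusing numpy as np and the u, S and vt from the snippet above):

A2 = u.dot(S).dot(vt)
print(A2)
print(np.allclose(A, A2))   # True .. the 3 factors rebuild the original matrix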

That's great! It is reassuring to see that actually confirmed.

Now let's look at SVD applied to the simple document set about cooking and vehicles we used before.


SVD Practical Example 2 - Reduce Simple Document Set

Another python notebook on github illustrates SVD applied to that simple set of documents we talked about above.

The word-document matrix is

         doc1   doc2    doc3   doc4    doc5    doc6
wheel    0.50   0.3333  0.25   0.0000  0.0000  0.0
seat     0.25   0.3333  0.00   0.0000  0.0000  0.0
engine   0.25   0.3333  0.75   0.0000  0.0000  0.0
slice    0.00   0.0000  0.00   0.5000  0.5000  0.6
oven     0.00   0.0000  0.00   0.3333  0.1667  0.0
boil     0.00   0.0000  0.00   0.1667  0.3333  0.4

This is the original set of data $\mathbf{A}$ that we'll try to decompose into 3 pieces with SVD.

Applying the python numpy SVD function we get those 3 pieces:

U =
 [[ 0.   -0.57 -0.52  0.   -0.64  0.  ]
 [ 0.   -0.29 -0.6   0.    0.75  0.  ]
 [ 0.   -0.77  0.61  0.    0.2   0.  ]
 [-0.84  0.    0.    0.    0.   -0.54]
 [-0.25  0.    0.   -0.89  0.    0.38]
 [-0.48  0.    0.    0.45  0.    0.75]] 

S =
 [[ 1.1   0.    0.    0.    0.    0.  ]
 [ 0.    1.06  0.    0.    0.    0.  ]
 [ 0.    0.    0.45  0.    0.    0.  ]
 [ 0.    0.    0.    0.29  0.    0.  ]
 [ 0.    0.    0.    0.    0.13  0.  ]
 [ 0.    0.    0.    0.    0.    0.05]] 

V^T =
 [[ 0.    0.    0.   -0.53 -0.56 -0.63]
 [-0.52 -0.51 -0.68  0.    0.    0.  ]
 [-0.57 -0.38  0.72  0.    0.    0.  ]
 [ 0.    0.    0.   -0.77  0.01  0.64]
 [-0.63  0.77 -0.1   0.    0.    0.  ]
 [-0.   -0.   -0.   -0.35  0.83 -0.44]] 

We make the same observations about these 3 matrices:
  • The $\mathbf{U}$ has rows that are unit-length vectors. 
  • The $\mathbf{V}^T$ has columns that are unit-length vectors.
  • The $\mathbf{S}$ is indeed diagonal, and this time we can see the values ordered by size, the largest being 1.1 and the smallest 0.05. 

Now let's do the important bit - reducing dimensions. We previously wanted to reduce this 6 dimensional data into a more easily viewable 2 dimensions. We do this by only keeping 2 dimensions of $\mathbf{S}$ and calling it $\mathbf{S}_{reduced}$.

S2 =
 [[ 1.1   0.    0.    0.    0.    0.  ]
 [ 0.    1.06  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]]

You can see that we keep only the values 1.1 and 1.06 and zero everything else.
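One way to build that reduced matrix in numpy is to zero everything beyond the top 2 singular values (a sketch, assuming u, s and vt came from applying numpy.linalg.svd to this word-document matrix, and S = numpy.diag(s) as before):

S2 = np.zeros_like(S)
S2[:2, :2] = S[:2, :2]   # keep only the top 2 singular values, zero the rest
print(S2)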

What do we do with this new singular value matrix? Well, we use it to create new views of the documents and the words. How? Look at the following diagram showing what happens when we multiply $\mathbf{S}$ with one of the other two matrices.

You can see that:

  • Calculating $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$ is combining a simple scaling factor (the diagonal elements of $\mathbf{S}$) with the column vectors of $\mathbf{V}^T$. That means we still have a vertical vector, which in the original $\mathbf{A}$ is the document view. So this calculated vector is a new document view.
  • Calculating $\mathbf{U} \cdot \mathbf{S}_{reduced}$ is combining a simple scaling factor (the diagonal elements of $\mathbf{S}$) with the row vectors of $\mathbf{U}$. Again, this means we still have a row vector, which in the original $\mathbf{A}$ is the word view. So this calculated vector is a new word view.
For a more mathematical explanation see the update at the bottom of this blog.
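In code, both of those new views are just matrix products (a sketch, using the S2, u and vt from above):

S_reduced_VT = S2.dot(vt)   # new document view .. each column is a document in the reduced space
U_S_reduced = u.dot(S2)     # new word view .. each row is a word in the reduced space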

Let's look at the new view of the documents, which is $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$. This turns out to be:

S_reduced_VT = 
 [[ 0.    0.    0.   -0.58 -0.62 -0.7 ]
 [-0.55 -0.54 -0.72  0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.  ]]

You can see this is really a 2-dimensional set of data ... each column, which represents a document, has only 2 features now. That's a direct result of zeroing all but the top 2 elements of the diagonal $\mathbf{S}$.

Let's plot this to see what it looks like:


We see those 6 documents now squished into a 2-dimensional plane (the above graph), and we can see 2 clusters of documents. Those are the two natural groups of documents about vehicles and documents about cooking.
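For reference, here's a rough matplotlib sketch of that kind of plot (not necessarily the exact code in the notebook .. it assumes the S_reduced_VT from the snippet above and the document names doc1 to doc6):

import matplotlib.pyplot as plt

x = S_reduced_VT[0, :]   # first reduced dimension for each document
y = S_reduced_VT[1, :]   # second reduced dimension for each document

plt.scatter(x, y)
for i, name in enumerate(['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6']):
    plt.annotate(name, (x[i], y[i]))
plt.show()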

Nice!

Let's now look at the reduced view of the words, which is $\mathbf{U} \cdot  \mathbf{S}_{reduced}$. This turns out to be:

          0      1      2    3    4    5
wheel    0.00  -0.60   0.0  0.0  0.0  0.0
seat     0.00  -0.30   0.0  0.0  0.0  0.0
engine   0.00  -0.81   0.0  0.0  0.0  0.0
slice   -0.93   0.00   0.0  0.0  0.0  0.0
oven    -0.27   0.00   0.0  0.0  0.0  0.0
boil    -0.53   0.00   0.0  0.0  0.0  0.0

Remember you can follow the code at the notebook on github.

Anyway - what does that view tell us? Well, we can see there are 2 topics:

  • The first topic is mostly influenced by the word slice, then a bit by boil, and not much by oven. That topic is ... cooking. You can see that slice has a contributing factor of -0.93, compared to say oven which has a smaller -0.27.
  • The other topic is mostly made up of the word engine, then wheel and some seat. That topic is ... vehicles.
What we have just done is extract those topics without having to know what they were beforehand.

Nice!


Try SVD on a Slightly Bigger (3 Topic) Document Set

Let's see if this SVD method works for a slightly bigger set of documents. We'll take the above small set of documents and add 3 more representing a new topic, home, made up of words like door, kitchen and roof.

You can follow the code in a notebook on github.

Here's the word document matrix:

          doc1   doc2    doc3   doc4    doc5    doc6   doc7   doc8   doc9
wheel     0.50   0.3333  0.25   0.0000  0.0000  0.0    0.0    0.00   0.00
seat      0.25   0.3333  0.00   0.0000  0.0000  0.0    0.0    0.25   0.00
engine    0.25   0.3333  0.75   0.0000  0.0000  0.0    0.0    0.00   0.00
slice     0.00   0.0000  0.00   0.5000  0.5000  0.6    0.0    0.00   0.00
oven      0.00   0.0000  0.00   0.3333  0.1667  0.0    0.5    0.00   0.00
boil      0.00   0.0000  0.00   0.1667  0.3333  0.4    0.0    0.00   0.00
door      0.00   0.0000  0.00   0.0000  0.0000  0.0    0.0    0.25   0.25
kitchen   0.00   0.0000  0.00   0.0000  0.0000  0.0    0.5    0.25   0.25
roof      0.00   0.0000  0.00   0.0000  0.0000  0.0    0.0    0.25   0.50

We take the SVD but use the top 3 elements of the singular value matrix, not 2. Why? Well, we could use 2 if all we wanted to do was plot the new document view on a 2d plot. But we want to spot 3 topics, so we need 3 elements of the singular value matrix to transform the words into topics. In real life, with unknown data, we would explore the data and views with different cuts of S to see whether meaningful topics emerged.
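The only real change from the 2-topic example is how many singular values we keep. A rough sketch (assuming numpy imported as np, and A here standing for the 9-document word-document matrix above):

u, s, vt = np.linalg.svd(A)

S3 = np.zeros((len(s), len(s)))
S3[:3, :3] = np.diag(s[:3])   # keep the top 3 singular values this time

S_reduced_VT = S3.dot(vt)     # document view .. plot 2 of its rows to get the 2d plot
U_S_reduced = u.dot(S3)       # word (topic) view .. this gives the table below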


The 2d plot of documents clearly shows 3 clusters .. which is what we expected. Great!

And the topic view is as follows:

            0      1      2     3  4  5  6  7  8
wheel     -0.01   0.60  -0.04   0  0  0  0  0  0
seat      -0.02   0.33   0.11   0  0  0  0  0  0
engine    -0.02   0.81  -0.09   0  0  0  0  0  0
slice     -0.91  -0.03  -0.15   0  0  0  0  0  0
oven      -0.37   0.00   0.29   0  0  0  0  0  0
boil      -0.52  -0.02  -0.10   0  0  0  0  0  0
door      -0.02   0.03   0.27   0  0  0  0  0  0
kitchen   -0.12   0.04   0.57   0  0  0  0  0  0
roof      -0.03   0.04   0.41   0  0  0  0  0  0

The above shows that the 3 new axes are linear combinations of existing words:

  • topic 1 is mostly slice and some boil
  • topic 2 is mostly engine and some wheel
  • topic 3 is mostly kitchen and some roof


Conclusion

We've seen the broad intuition behind reducing dimensions by using a special method of factorising matrices, which allows us to separate out the key contributing elements and zero the ones we think are less informative.

We've seen practically how to:

  • visualise high dimensional data in much lower dimensions .. 6-dimensional documents on to a 2-d plot.
  • extract out the key topics in terms of the words they contain, without knowing what they are beforehand.
In this post we've said that the SVD can do its magic but not really explained why or how .. that's a more mathematical discussion which we'll do in the next / future post.




Update

I wasn't 100% convinced by the many books and guides that say that $\mathbf{S}_{reduced} \cdot \mathbf{V}^T$ is a document view in the new, dimension-reduced space. Similarly, the explanations for why $\mathbf{U} \cdot \mathbf{S}_{reduced}$ is a new word view in the dimension-reduced space weren't quite believable.

So I kept looking for a good explanation. The best I could find is the following from one of my textbooks:


At the bottom of that section it says:

"we can see that ... $A'^TA' = VD'^TD'V^T$ .. and thus the representation
of documents in the low-dimensional LSI space is given by the rows of the $VD^T$ matrix"

How so?

Let's start with the $\mathbf{A'}$ in what they call $A'^TA'$. That's our original matrix $\mathbf{A}$, but the version that's reconstructed from the reduced $\mathbf{S}$. So we can say

$$ \mathbf{A'} = \mathbf{U} \cdot \mathbf{\Sigma'} \cdot \mathbf{V}^T $$

Notice the ' symbol, used to remind us that we're talking about the reduced-dimension versions of these.

Fine .. but what's the obsession with $A'^TA'$? Well, let's go back to our original $\mathbf{A}$. What is $\mathbf{A}^T \cdot \mathbf{A}$? It's a matrix where each element is arrived at by combining two vectors, one from $\mathbf{A}^T$ and one from $\mathbf{A}$. The way matrices are multiplied, this effectively combines document vectors ... the vector of word-counts for each document. The highest values will be for those pairs of documents whose word-count vectors are large in the same places ... in other words, $\mathbf{A}^T \cdot \mathbf{A}$ is a document-document correlation matrix. It shows which documents are most similar to each other.

To see this even more clearly, that first part of the multiplication is $\mathbf{A^T}$ .. that transpose makes sure that when we do the matrix multiplication, it is the document vector (vertical in $\mathbf{A}$) that is used.

So,

$$\mathbf{A^T} \cdot \mathbf{A}$$

is a document-document correlation matrix.

You can see that if we swapped rows for columns when multiplying those matrices, we'd get a word-word correlation matrix:

$$\mathbf{A} \cdot \mathbf{A^T}$$

where the order is reversed.

Phew ... what's that got to do with the question? Well, if we expand out the algebra for what $\mathbf{A}$ is .. $ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $ .. we can say ..

$$ \mathbf{A} = \mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T $$

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma}^T \cdot \mathbf{U}^T) \cdot
(\mathbf{U} \cdot \mathbf{\Sigma} \cdot \mathbf{V}^T) $$

.. simplifying because $\mathbf{U}^T \cdot \mathbf{U} = \mathbf{I}$ (the identity matrix) by definition ... and also $\mathbf{\Sigma}^T = \mathbf{\Sigma}$ because it's diagonal ...

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma})  \cdot (\mathbf{\Sigma} \cdot \mathbf{V}^T) $$

Now from this we can see that the right hand side is a document-document correlation matrix, with two similar bits, one which is the transpose of the other. So we can relate $\mathbf{A}$ with $\mathbf{\Sigma} \cdot \mathbf{V}^T $ in some way.

Or in the new low-dimensional space, $\mathbf{A'}$ with $\mathbf{\Sigma'} \cdot \mathbf{V}^T $.

That's why we can say that

$$\mathbf{\Sigma'} \cdot \mathbf{V}^T $$

is a document-view like matrix. I think ...  I'm convinced ... are you?
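If you'd rather check the algebra numerically than take it on trust, here's a tiny numpy sketch using the 2 by 2 matrix from Practical Example 1 (with the full, un-reduced $\mathbf{\Sigma}$):

import numpy as np

A = np.array([[-1, 1],
              [ 1, 1]])

u, s, vt = np.linalg.svd(A)
S = np.diag(s)
V = vt.T

left = A.T.dot(A)                   # document-document correlation matrix
right = V.dot(S).dot(S.dot(vt))     # (V . Sigma) . (Sigma . V^T)
print(np.allclose(left, right))     # True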




Update 2

I am convinced now .. I've found many more papers that explain the justification for why $\mathbf{\Sigma'} \cdot \mathbf{V}^T $ is a document-view like matrix.




Notice how that top paper makes clear that $\mathbf{\Sigma'} \cdot \mathbf{V}^T $ is not the same as $\mathbf{A}$ or $\mathbf{A}^T$ .. just that the correlations work out the same:

$$ \mathbf{A^T} \cdot \mathbf{A} = (\mathbf{V} \cdot \mathbf{\Sigma})  \cdot (\mathbf{\Sigma} \cdot \mathbf{V}^T) $$
