Thursday 29 September 2016

Similar Words ... an Easy Intro to Vector Spaces

In the last post we did some fun stuff, based on the idea that word co-occurrence indicates some kind of similarity or connectedness in meaning.

What we found was that word co-occurrence worked well for generating sentences - great fun - but was plagued by the boring word (stop word) problem. Lots of words co-occur with the boring word "and" .. but that doesn't mean those words are all somehow related to "and" in meaning.

So let's try another approach that's a teeny weeny bit more sophisticated. And exploring this will lead us to another great tool used by professional text data scientists.



Similar/Related Words

What do we mean when we say two words are similar? We weren't that precise before. Here are some options:

  • One word causes another  .. e.g. virus causes disease
  • One word is a kind of another .. e.g. a horse is a kind of a mammal
  • One word is just another way of saying the other .. e.g. lazy is almost interchangeable with idle


There are other ways words can be related or similar .. but for now let's focus on the third kind above ... where words have the same, or almost the same, meaning .. and could be swapped out for each other without too much trouble.

Of course real language is subtle and complex .. so the words umpire and referee are almost the same but one is actually used in preference to the other depending on the sport ... a football referee and a cricket umpire. This particular complexity is one we'll embrace rather than ignore...



Meaning Comes From Context

How do we know whether a word is similar to another? Well, we could look them up in a dictionary and compare definitions. That's not a stupid idea at all! But dictionary definitions are actually not that comparable .. and we'd still have to automatically interpret those definitions ... which takes us back to the original problem of getting computers to understand human language.

Let's try again...

Have a look at the following diagram which shows three words car, van, and banana.


The words are linked to other words that they co-occur with. So the word car co-occurs with the words seats, drive and take. That's what you'd expect .. nothing controversial here.

Now if you look at the word van you can see it shares a lot of the co-occurring words with car .. again, you would entirely expect this. That's because the two things, car and van, are very similar ... they act similarly, they are used similarly, they're built similarly, and can often be used interchangeably.

But there are ways in which a van is different from a car. The above shows that the word height is sometimes used with van but rarely or not at all with car.

Now the word banana .. well, that word doesn't share any co-occurring words with car or van .. except maybe the word take, which it shares with car.

So this diagram allows us to see which words are similar in meaning or use, without even having to use a dictionary to actually understand what the words really mean! We do it simply by spotting the shared context of the related words.

That's pretty powerful!

There's more! ... 

The words car and van don't co-occur much ... yet the above diagram shows that they are similar through their shared neighbours!

... That's really powerful!
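Here's the diagram's idea of shared context as a tiny Python sketch. The neighbour sets are made up to match the diagram, not taken from real data:

```python
# made-up co-occurrence neighbours, matching the diagram
neighbours = {
    "car":    {"seats", "drive", "take", "wheels"},
    "van":    {"seats", "drive", "wheels", "height"},
    "banana": {"take", "eat", "yellow"},
}

# shared context .. the bigger the overlap, the more similar the words
print(neighbours["car"] & neighbours["van"])     # lots shared
print(neighbours["car"] & neighbours["banana"])  # just 'take'
```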



Uses of Finding Similar Words

The uses for finding similar words are huge! Off the top of my head, we could:

  • find words that are similar to ones we have
  • find documents that are related to others (because they contain enough similar words)
  • visually cluster similar words to quickly understand what an unknown text is about, without having to read it all
  • find new words, or uses of words, in a corpus .. and understand how they are used
  • ...

Exciting!



How Would We Do It (By Hand)?

Let's now think about how we would find similar words, following the above idea of shared co-occurring words.

We should certainly use the matrix of co-occurrences that we built last time. That gives us the co-occurrence of one word following another.

When we talked above about co-occurrence we didn't care about word order .. so we'd need to change that matrix a little bit. We'd have to sum the values for word1 -> word2 and word2 -> word1 because that gives us the co-occurrence irrespective of which word comes first. Easy enough!
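If we have that matrix as a 2D array, summing the two directions is a one-liner. Here's a minimal sketch using numpy .. the little cooc matrix and its values are just made-up illustrations, not real counts:

```python
import numpy as np

# a tiny made-up co-occurrence matrix .. cooc[i, j] counts how often
# word i is followed by word j, as built in the last post
cooc = np.array([[0, 3, 1],
                 [2, 0, 0],
                 [1, 0, 0]])

# sum word1 -> word2 and word2 -> word1 so word order no longer matters
cooc_sym = cooc + cooc.T
print(cooc_sym)
```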

Now, imagine we have a word car ... how do we use that matrix to find similar words? Remember, we mean words that are similar because they share lots of co-occurring words.

Have a look at the following diagram of a co-occurrence matrix:


What's the word that's the most similar to car? Let's take this slowly and step by step.

Let's see if the word fish is the answer. Remember - we're not looking at simple co-occurrence of fish and car. We're looking for other shared co-occurrent words.

  • If you look across from car you can see which words it is most co-occurrent with - wheels, paint.
  • If you look across from fish, it is most co-occurrent with nothing much else.
  • So car and fish don't share many co-occurrent words.


Let's see if the word van is the answer.

  • If you look across from van, we can see it is mostly co-occurrent with wheels, paint.
  • That's the same set of words that are most co-occurrent with car, which we just did above.
  • That means car and van share lots of co-occurring words. 


Yay! That means car and van are similar!

Ok - that's great, how do we turn this into some kind of calculation that a computer can do?



How Would We Do It (By Computer)?

Let's look back at what we did above by hand. We looked across from each candidate word, and considered which words were most co-occurrent with that candidate word ...

... And if those co-occurrent words matched ... then the two words were similar. Similar according to our definition above.

How do we calculate such commonality? Well, if you look again at that matrix above, the darker matrix cells have higher values .. and it's those we want to match. The following diagram shows this more clearly:


Between car and fish, there aren't dark cells that line up. Between car and van there are, as shown by the green arrows.

Numerically, we could multiply the values. This is simple .. which we like ... and seems to have the right outcome. If there are low values in the cells we want to match .. the product is low. If both are high, the product is high .. which is what we want. And, crucially, if we try to match a high value with a low value .. we get a lowish value too .. again, just what we want.

Let's see this in action, with calculations done in a spreadsheet:


Yup - seems to work as we want. The high valued products only happen if both multiplied values are high .. that is, a match of co-occurring words.

Then we simply sum up the products, representing the matches ... to get a score for similarity. Here we have a similarity score of 0.01 for car and fish ... makes sense. And we have a score of 1.76 for car and van ... also makes sense!
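Here's that multiply-and-sum score as a minimal Python sketch. The car row matches the vector shown later in this post; the van and fish rows are made-up illustrative values, not the actual spreadsheet numbers:

```python
# illustrative co-occurrence rows .. car matches the vector shown
# below, van and fish are made-up values for demonstration
car  = [1.00, 0.90, 0.00, 0.80, 0.20, 0.00]
van  = [0.90, 0.80, 0.10, 0.60, 0.00, 0.10]
fish = [0.00, 0.00, 0.10, 0.00, 0.10, 0.90]

def similarity(word1, word2):
    # multiply corresponding values, then sum .. the total is high
    # only when both rows are high in the same places
    return sum(a * b for a, b in zip(word1, word2))

print(similarity(car, van))   # 2.1  .. high, car and van are similar
print(similarity(car, fish))  # 0.02 .. low, car and fish are not
```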

So we now have a computer method for finding similar words! .. that wasn't such hard work!



Getting Fancy - Word Vectors And The Spaces They Live In

You've probably heard loads of people talking about using vector spaces to do fancy stuff with words. They certainly give the impression that this mysterious stuff is kind of magic and powerful!

Let's demystify it a bit!

First of all - there are many different ways vectors and vector calculations are used .. there isn't only a single way. We'll start with just one.

If you look at that calculation we just did .. we multiplied corresponding cells from two rows of the matrix. All of the numbers in a row are associated with the word we're interested in.

So you can see that each row of numbers is a vector for the word in question. Here's the vector for the word car:

$$ \mathbf{car} =
\begin{pmatrix}
1.00 \\
0.90 \\
0.00 \\
0.80 \\
0.20 \\
0.00 \\
\end{pmatrix}$$

So the similarity score is the dot product of these vectors ... which is a nice coincidence, because the dot product sums the products of corresponding vector elements - precisely what we did above to calculate the similarity score!

$$ \mathbf{similarity} = \mathbf{word_1} \cdot \mathbf{word_2}$$

This is also great because many computer programming languages or their supporting libraries make dot products of vectors easy and efficient. That includes Python too!
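In Python, for example, numpy does the multiply-and-sum in a single call .. here with the same illustrative vectors as above:

```python
import numpy as np

car = np.array([1.00, 0.90, 0.00, 0.80, 0.20, 0.00])
van = np.array([0.90, 0.80, 0.10, 0.60, 0.00, 0.10])

# numpy's dot product does the element-wise multiply and sum for us
print(np.dot(car, van))  # 2.1
```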

Let's take a different look at what this dot product does. First let's see what that vector really is, and where it lives:


Each element in those vectors is a magnitude along an axis. An axis that represents the co-occurrence word. So the space that these vectors live in is the space of co-occurrence words. It's nice to think of this space as a space of themes, or topics, or ideas, or meanings...

Taking the dot product of two vectors in this space is closely related to measuring the angle between them (the maths is easy).

$$\mathbf{a \cdot b} = |a| \  |b| \ \mathbf{cos}(\theta)$$

If the vectors are normalised to length 1, this simplifies nicely to

$$\mathbf{a \cdot b} = \mathbf{cos}(\theta)$$

You can think of that dot product as a measure of similarity ... a smaller angle means the vectors are more aligned, and $\mathbf{cos}(\theta) \to 1$ ... a bigger angle means they point in very different directions and are not similar, and $\mathbf{cos}(\theta)$ falls towards $0$, or even $-1$.
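As a quick sketch, here's that normalised dot product (cosine similarity) in Python, again using the illustrative vectors from above:

```python
import numpy as np

def cosine_similarity(a, b):
    # divide the dot product by both vector lengths .. the result
    # is cos(theta) between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

car  = np.array([1.00, 0.90, 0.00, 0.80, 0.20, 0.00])
van  = np.array([0.90, 0.80, 0.10, 0.60, 0.00, 0.10])
fish = np.array([0.00, 0.00, 0.10, 0.00, 0.10, 0.90])

print(cosine_similarity(car, van))   # close to 1 .. similar
print(cosine_similarity(car, fish))  # close to 0 .. not similar
```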




Next Time

Next time we'll implement these ideas and see how well they work. We can look forward to seeing if we can:

  • recreate the force-directed graphs based on this new measure of similarity (based on commonality of co-occurrent words)
  • cluster similar words based on this similarity
  • cluster similar documents
  • build a search engine based not on exact matches, but on similar words, where similarity is this new definition




Fancy Terminology

Here's some fancy terminology you might hear others using about some of the stuff we talked about here:

  • 2nd order co-occurrence ... that's just the definition we developed above .. where we consider how much of word1's co-occurrent words are the same as those for word2. We saw how van and car don't co-occur much themselves, but they are similar because they share many other co-occurring words. Second order just means "once removed" here .. as shown in the diagram earlier.
  • vector space - a space in which vectors exist ... the focus is on how we define the axes. There are all kinds of vector spaces, and above we made one which had axes of co-occurrent words. 

2 comments:

  1. Interesting read! I really like the vector approach to similarity.

     Reply: Thanks Wietse - at this stage I'm just exploring and researching in preparation for my next book .. which aims to make the topic of text mining and analytics accessible to as many people as possible.