Saturday, 15 October 2016

Profiling Indexing, Relevance and Co-occurrence Part I

In the last post we looked at three tools for profiling python code to find performance hotspots .. code that is slow, or that is hit so many times that it adds up to a long time.

Let's apply these tools to our toolkit, because we know that some things are slow ... like indexing, and calculating co-occurrence.



Profiling Indexing

Let's try the function-level cProfile first. We'll wrap the indexing code inside a function do_processing() to make it easier to call with cProfile .. as follows ..

import cProfile

# assumes cr (a corpus reader) and tmt (the textminingtoolkit package)
# are already set up, as in previous posts
def do_processing():
    # for all documents in corpus
    for document_name in cr.get_documents():
        #print("processing ", document_name)

        # get document text
        document_text = cr.get_text_by_document(document_name)

        # simplify whitespace (remove newlines)
        b = tmt.text_processing.simplify_whitespace(document_text)

        # only keep alphanumeric characters, removes punctuation
        c = tmt.text_processing.keep_alphanumeric(b)

        # make lowercase
        d = tmt.text_processing.to_lowercase(c)

        # split into words list
        dl = tmt.text_processing.split_text_into_words(d)

        # build n-grams
        #gl = tmt.word_processing.build_ngrams_from_words(dl,2)

        # remove stop words
        #el = tmt.word_processing.remove_stop_words(dl, "./stopwords/minimal-stop.txt")

        # update index
        tmt.index_search.create_wordcount_index_for_document(cr.content_directory, document_name, dl)
        pass
    pass

# profile code
cProfile.run('do_processing()')

The output is long .. here's the top of it ..

        211209192 function calls (206665033 primitive calls) in 565.070 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2674860    1.814    0.000    5.263    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
        1    0.109    0.109  565.070  565.070 <ipython-input-11-9069811af5cf>:4(do_processing)
        1    0.000    0.000  565.070  565.070 <string>:1(<module>)
   534856    0.673    0.000    1.040    0.000 __init__.py:168(iteritems)
       58    0.000    0.000    0.000    0.000 __init__.py:224(u_safe)
       58    0.001    0.000    0.458    0.008 __init__.py:512(__init__)
       58    0.002    0.000    0.458    0.008 __init__.py:581(update)
     2088    0.002    0.000    0.029    0.000 _methods.py:34(_prod)
   534450    0.500    0.000    5.571    0.000 _methods.py:37(_any)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:16(__init__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:20(__enter__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:26(__exit__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
       58    0.000    0.000    0.001    0.000 abc.py:178(__instancecheck__)
   267109    0.683    0.000    0.992    0.000 algorithms.py:807(_get_take_nd_function)
   267109    4.719    0.000   28.988    0.000 algorithms.py:840(take_nd)
      232    0.001    0.000    0.002    0.000 array.py:113(_getrowsize)
      232    0.004    0.000    0.154    0.001 array.py:131(__init__)
      232    0.003    0.000    0.060    0.000 array.py:190(_g_create)
...
...

Looking through the output, the rows with large times are associated with built-in functions which we can do little about .. we're not going to rewrite the core python functions or the pandas library. Here are some examples:

       58    0.000    0.000    0.001    0.000 generic.py:1904(_update_inplace)
   267167    2.743    0.000  193.508    0.001 generic.py:2307(reindex_axis)
   267167    4.327    0.000   78.374    0.000 generic.py:2320(_reindex_with_indexers)
   267167    0.348    0.000    0.400    0.000 generic.py:2641(__finalize__)
...
       58    0.000    0.000    0.032    0.001 group.py:917(_f_close)
       58    1.595    0.028  554.948    9.568 index_search.py:70(create_wordcount_index_for_document)
   267167    1.235    0.000   47.746    0.000 indexing.py:101(_get_setitem_indexer)
   534334    4.355    0.000   43.300    0.000 indexing.py:1102(_convert_to_indexer)
   267167    3.294    0.000  551.357    0.002 indexing.py:126(__setitem__)
   801501    0.897    0.000    1.291    0.000 indexing.py:128(<genexpr>)
   267167    1.671    0.000   45.093    0.000 indexing.py:163(_convert_tuple)
       58    0.000    0.000    0.001    0.000 indexing.py:1753(convert_to_index_sliceable)
...
      928    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
  1068900    1.336    0.000    1.336    0.000 {method 'fill' of 'numpy.ndarray' objects}
  2939011    0.946    0.000    0.946    0.000 {method 'get' of 'dict' objects}
   267167   90.638    0.000   92.661    0.000 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
  1603408  128.358    0.000  129.298    0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
      638    0.001    0.000    0.001    0.000 {method 'groups' of '_sre.SRE_Match' objects}
..
      232    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
   267109   12.336    0.000   12.336    0.000 {pandas.algos.take_2d_axis1_float64_float64}
   535204   30.105    0.000   31.133    0.000 {pandas.lib.infer_dtype}
   267225    0.081    0.000    0.081    0.000 {pandas.lib.is_bool}

Apart from telling us that our create_wordcount_index_for_document() takes most of the time .. which isn't useful as we already know it's the indexing function ... the rest is not very insightful, because these are internal functions and we can't see where they're called from.

So let's try pprofile, the line-by-line profiler. Here's how we invoke it:

import pprofile

profiler = pprofile.Profile()
with profiler:
    do_processing()
profiler.dump_stats(filename="pprofile_index_1.txt")

The output is even bigger! But it is organised into sections, one for each python source code file, ordered by the time spent in each file.


The most time (28%) is spent in the pandas/indexes/base.py library. We shouldn't be surprised, as we're using pandas to do much of the hard work. More interesting is which bits of our own code end up calling that code. Next is 22% in pandas/core/internals.py, then 12% in pandas/core/indexing.py, 11% in pandas/core/generic.py, 9% in pandas/core/common.py ...

... so that's already 82% of the total time in pandas!

We then get to our own code:

File: /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py
File duration: 231.942s (6.50%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         0|            0|            0|  0.00%|# module apply processing to text (not word lists)
     2|         0|            0|            0|  0.00%|
     3|         0|            0|            0|  0.00%|# import text string stuff
     4|         0|            0|            0|  0.00%|import string
     5|         0|            0|            0|  0.00%|
     6|         0|            0|            0|  0.00%|
     7|         0|            0|            0|  0.00%|# split into words
     8|        58|   0.00025773|  4.44363e-06|  0.00%|def split_text_into_words(input_text):
     9|        58|     0.190883|   0.00329108|  0.01%|    output_words_list = input_text.split(" ")
    10|        58|   0.00185609|  3.20015e-05|  0.00%|    return output_words_list
    11|         0|            0|            0|  0.00%|
    12|         0|            0|            0|  0.00%|
    13|         0|            0|            0|  0.00%|# make lowercase
    14|        58|  0.000458717|  7.90892e-06|  0.00%|def to_lowercase(input_text):
    15|        58|    0.0161779|  0.000278929|  0.00%|    output_text = input_text.lower()
    16|        58|   0.00069499|  1.19826e-05|  0.00%|    return output_text
    17|         0|            0|            0|  0.00%|
    18|         0|            0|            0|  0.00%|
    19|         0|            0|            0|  0.00%|# remove whitespace, convert to single space
    20|        58|  0.000150442|  2.59383e-06|  0.00%|def simplify_whitespace(input_text):
    21|         0|            0|            0|  0.00%|    #output_text = input_text.translate(str.maketrans(string.whitespace, ' ' * len(string.whitespace)))
    22|        58|     0.446623|    0.0077004|  0.01%|    output_text = " ".join(input_text.split())
    23|        58|   0.00181532|  3.12986e-05|  0.00%|    return output_text
    24|         0|            0|            0|  0.00%|
    25|         0|            0|            0|  0.00%|
    26|         0|            0|            0|  0.00%|# remove punctuation
    27|         0|            0|            0|  0.00%|def remove_punctuation(input_text):
    28|         0|            0|            0|  0.00%|    output_text = input_text.translate(str.maketrans({a:None for a in string.punctuation}))
    29|         0|            0|            0|  0.00%|    return output_text
    30|         0|            0|            0|  0.00%|
    31|         0|            0|            0|  0.00%|
    32|         0|            0|            0|  0.00%|# keep only alphanumeric characters
    33|        58|  0.000307798|  5.30687e-06|  0.00%|def keep_alphanumeric(input_text):
    34|  32473800|      231.281|  7.12207e-06|  6.49%|    output_text = ''.join(c for c in input_text if c in (" " + string.ascii_letters + string.digits))
(call)|  15895288|      111.163|  6.99348e-06|  3.12%|# /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py:34 <genexpr>
    35|        58|   0.00230646|  3.97666e-05|  0.00%|    return output_text

Our own text_processing.py file only takes 6.50% of the total time, but inside it we can see that the ''.join() line in keep_alphanumeric(), which filters the text character by character, accounts for almost all of it .. 6.49% of the total time, which is actually very significant. The other string operations are also a little slow, each taking about 0.01% of the time.

String operations do seem slow!

So that's a good result already .. if we can improve the string join operations, or even better, the function which filters out non-alphanumeric characters ... then we've addressed 6% of the performance problem!



Improving The Slow AlphaNumeric Filter

Before we make the change, we should time how long it takes to run the index creation for each document in the big Iraq Report corpus.

# profile code
%time do_processing()

CPU times: user 6min 30s, sys: 9.24 s, total: 6min 39s
Wall time: 6min 44s

And the merging of the index ...

# profile code
%time tmt.index_search.merge_wordcount_indices_for_corpus(cr.content_directory)

CPU times: user 5.82 s, sys: 662 ms, total: 6.48 s
Wall time: 7.54 s

We'll replace the code which strips out all non-alphanumeric characters with one based on the regular expression engine. Simple benchmarks comparing the relative performance of the two approaches showed the regex version to be dramatically quicker.
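Here's a minimal sketch of the kind of change involved .. the exact regex and function names here are assumptions, not necessarily what's in the toolkit:

import re
import string

# original approach .. build the output character by character
def keep_alphanumeric_join(input_text):
    allowed = " " + string.ascii_letters + string.digits
    return ''.join(c for c in input_text if c in allowed)

# regex approach .. a compiled pattern strips everything else in one call
not_alphanumeric = re.compile(r'[^a-zA-Z0-9 ]+')

def keep_alphanumeric_regex(input_text):
    return not_alphanumeric.sub('', input_text)

You can compare the two in a notebook with %timeit on some sample text.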
That change from a string.join() method to a regular expression method sped up this example by over 11 times! .... That is a massive speed up!

Let's see how it helps our indexing. The following shows how long the indexing takes after this change has been implemented.

# profile code
%time do_processing()

CPU times: user 5min 6s, sys: 8.47 s, total: 5min 15s
Wall time: 5min 16s

So we have a speedup from 6m 39s to 5m 15s which is 399 seconds down to 315 seconds ... so we've shaved off 21% of the time!

That's not a bad result for a simple 2-line code change .. one of which is an import statement!

21% reduction in time
...
with just 2 lines of changed code!

That's the power of profiling .. it points you to where the problems actually are.

We're not finished yet ... more in the next post!

Wednesday, 12 October 2016

Profiling Python Performance

We've created a good bit of code already .. both as code in Python notebooks .. as well as code in the textminingtoolkit package of libraries we're developing.

But some of it is slooow:

  • Creating word-document indices is slooow.
  • Calculating the relevance matrix is slooow.
  • Calculating the co-occurrence matrix is also slooow.

How can we fix it? Well .. we can't fix it unless we can measure it!

We need to find out which bits of code are slow. There are several ways of doing this... let's look at a few, starting super simple.



Super Simple - Time It!

The simplest approach to measuring the performance of bits of code is to run them and record the time they took with a stopwatch.

Too often however, python code runs so fast, we'd never have a chance to get the timing right with a stopwatch! Stuff happens in microseconds .. not seconds or minutes.

So Python itself, the Jupyter notebook, and also 3rd party libraries provide ways to measure how long bits of code take in a much more accurate manner.

Imagine we have a super simple function, called work() which takes a parameter n:

def work(n):
    print("hello")
    x = n * n
    print(x)
    pass

You can see it is super simple, it just says hello and calculates the square of n, before printing it out. Simples!

If we call it, here's what happens:

work(3)
hello
9

In a python notebook, we can use a simple command to time how long some code takes. This command is only available in the notebook, not python in general .. and commands like this begin with a % symbol, and are sometimes called python magics! Don't ask me why ... 

Here's the %time magic:

%time work(4)
hello
16
CPU times: user 48 µs, sys: 0 ns, total: 48 µs
Wall time: 52.9 µs

That is sooooo much easier than the old fashioned method of calling a python command to get the time, running a function or code, then getting the time again, and finding the difference in times ... painful!
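For comparison, here's a sketch of that old fashioned approach:

import time

start = time.time()       # stopwatch start
work(4)
end = time.time()         # stopwatch stop
print("elapsed time:", end - start, "seconds")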

The work() function is not taxing at all .. you can see it took 48 microseconds! That wall time is a bit bigger because that's the stopwatch time .. which includes time your computer was doing other stuff that's nothing to do with running work() .. like updating the pointer on your screen or seeing if a button was pressed .. etc 

Now let's rewrite our work() function with a fast bit and a slow bit:

def quick(n):
    return n*n

def slow(n):
    counter = 0
    for i in range(n):
        for j in range(n):
            counter +=1
            pass
        pass
    return counter

def work(n):
    print("hello")
    a = quick(n)
    b = slow(n)
    print(a,b)
    pass

You can see that work() now calls two other functions, quick() and slow(), both of which calculate the square of n. quick() does it more efficiently than slow().

Let's try again with the magic timer:

%time work(5000)
hello
25000000 25000000
CPU times: user 1.51 s, sys: 22.2 ms, total: 1.54 s
Wall time: 1.57 s

You can see that this now takes 1.54 seconds ... noticeably slower.

But this doesn't tell us why work() is slow. It doesn't tell us whether the quick() function really is quick .. nor does it tell us whether the slow() function really is slow.

That's what we need .. to be able to find which bits of our functions are slow .. so we can zoom in on the hotspots and try to improve them.

That's where more sophisticated profilers are the right tool for the job.



Python Profiler: cProfile

There are a few options for profiling python code. Python itself has built-in profilers.

cProfile is the most commonly used built-in profiler. Here's how we'd use it ...

import cProfile

cProfile.run('work(5000)')

The results are:

hello
25000000 25000000
         56 function calls in 1.637 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <ipython-input-2-152b7fd5fd62>:1(quick)
        1    0.000    0.000    1.637    1.637 <ipython-input-2-152b7fd5fd62>:13(work)
        1    1.636    1.636    1.636    1.636 <ipython-input-2-152b7fd5fd62>:4(slow)
        1    0.000    0.000    1.637    1.637 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        6    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        6    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
        6    0.000    0.000    0.000    0.000 iostream.py:309(write)
        1    0.000    0.000    0.000    0.000 posix.py:53(wake)
        2    0.000    0.000    0.000    0.000 stack_context.py:253(wrap)
        2    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    1.637    1.637 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        6    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of '_io.FileIO' objects}
        6    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}

Whoaahh! What's all that!?

Don't worry .. it's just a very detailed listing of what the code is doing and for how long. Let's break it apart ..
  • 56 function calls were made in total .. that includes functions that are called inside python by our own code
  • total time was 1.637 seconds
Now the interesting bit ...
  • work() took 1.637 seconds .. we know that's the total time .. makes sense because work() is the main function we're calling
  • slow() takes 1.636 seconds ... which is almost all the time the whole work() function took
  • quick() is so quick it doesn't even register on the stopwatch!

By the way, tottime is the time spent in a function excluding any functions called from it .. and cumtime is the time including any functions called from it ... that's why work() has a high cumtime and a low tottime, as all it does is call 2 other functions.

That's what we wanted - to see which bits of code inside a bigger bunch of code were the slow hotspots.



Python Profiler: pprofile

As great as cProfile is .. it only profiles at the level of function calls ... which in some cases is not granular enough. Sometimes we want to know how each line of python performs. And sometimes, not everything that is a performance hotspot is in fact a function.

So let's look at pprofile which aims to do profiling per line of python.

To use it, we import it and set up a profiler, and call the function we're interested in:

import pprofile

profiler = pprofile.Profile()
with profiler:
    work(5000)
profiler.dump_stats(filename="abc.txt")

Normally we'd only need to print the output statistics with profiler.print_stats(), but right now there is a bug which means it doesn't work properly in a notebook - here's the github issue 19 I raised. The workaround for now is to dump the stats to a file, "abc.txt" here.

The output file we get is huge .. because this profiler also profiles the code called by our own python .. deep into the core python libraries and functions. Luckily the top of the file is most relevant to our own code:

Total duration: 274.336s
File: <ipython-input-2-152b7fd5fd62>
File duration: 274.334s (100.00%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         1|  4.05312e-06|  4.05312e-06|  0.00%|def quick(n):
     2|         1|  5.96046e-06|  5.96046e-06|  0.00%|    return n*n
     3|         0|            0|            0|  0.00%|
(call)|         1|      274.336|      274.336|100.00%|# <ipython-input-2-152b7fd5fd62>:13 work
     4|         1|   3.8147e-06|   3.8147e-06|  0.00%|def slow(n):
     5|         1|  8.10623e-06|  8.10623e-06|  0.00%|    counter = 0
     6|      5001|      0.02175|  4.34912e-06|  0.01%|    for i in range(n):
     7|  25005000|      91.4991|  3.65923e-06| 33.35%|        for j in range(n):
     8|  25000000|      92.4224|   3.6969e-06| 33.69%|            counter +=1
     9|  25000000|      90.3722|  3.61489e-06| 32.94%|            pass
    10|      5000|     0.018151|  3.63021e-06|  0.01%|        pass
    11|         1|  4.05312e-06|  4.05312e-06|  0.00%|    return counter
    12|         0|            0|            0|  0.00%|
    13|         1|  3.69549e-05|  3.69549e-05|  0.00%|def work(n):
    14|         1|  0.000101089|  0.000101089|  0.00%|    print("hello")
(call)|         2|   0.00199699|  0.000998497|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    15|         1|   2.5034e-05|   2.5034e-05|  0.00%|    a = quick(n)
(call)|         1|  1.00136e-05|  1.00136e-05|  0.00%|# <ipython-input-2-152b7fd5fd62>:1 quick
    16|         1|  3.38554e-05|  3.38554e-05|  0.00%|    b = slow(n)
(call)|         1|      274.334|      274.334|100.00%|# <ipython-input-2-152b7fd5fd62>:4 slow
    17|         1|  5.00679e-05|  5.00679e-05|  0.00%|    print(a,b)
(call)|         4|  0.000454903|  0.000113726|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    18|         1|  5.96046e-06|  5.96046e-06|  0.00%|    pass

This is much more insightful ... giving us information line-by-line. Let's break it down:

  • The total time was 274 seconds .. that's because this method of profiling adds huge overhead. If you're interested, pprofile can also work out performance hotspots through statistical sampling, which has less overhead but is less accurate ... though often accurate enough!
  • The hits column shows which lines get executed many times .. we can see the lines inside the slow() function being called many many times .. some 5000 times, some 25000000 times! Obviously a hotspot if they're not super-quick.
  • The time and time per hit columns show which lines take most time. We can see those loops consume the most time .. but each individual line doesn't take especially long .. it's just that they're called 5000 or 25000000 times!




Next Post

Excellent stuff! We've got the tools now to see if we can identify which parts of our textminingtoolkit code are slow .. that's the next post!

Tuesday, 11 October 2016

Co-occurrence .. Refined ... Part 2/2

In the previous post, we improved how we decide whether two words are co-occurrent or not.

Instead of a black-n-white yes-or-no .. we now have a spectrum of co-occurrence for words inside a window. This allows related words that do occur frequently together, but may be placed a couple or more words apart, to be recognised as such.


The results looked good .. apart from the problem of boring stop-words, which also happen to be very co-occurrent.

We've previously developed a way to deal with such boring words .. a measure of word relevance .. which balances word frequency, commonality across documents, and word length. Such a method won't be perfect, but it'll be better than a brutal application of a word stop-list, which can't adapt to each corpus.

Let's see how it works.



Combining Word Relevance & Co-Occurrence

How do we combine these two indicators .. co-occurrence and word relevance? Well, there are a few ways we could do it:

Option 1: Remove low-relevance stop words first, then calculate co-occurrence.


Option 2: Calculate co-occurrence then remove pairs not in most relevant list.


Option 3: Multiply co-occurrence scores with relevance scores to get a new measure.


How do we choose?

Option 2 seems inefficient as we'd end up building word-pairs and calculating co-occurrence for words which we would later discard. For large datasets that would be a significant waste of time and effort.

Option 1 is the simplest. Option 3 could work, but as always, we prefer to try simpler things first.
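Here's a minimal sketch of option 1, assuming a hypothetical relevance_scores dictionary from the earlier relevance work .. the helper names are illustrations, not the toolkit's actual functions:

from collections import Counter

# keep only the most relevant words, then count co-occurrence over what survives
def cooccurrence_of_relevant(words, relevance_scores, top_n=100, window=2):
    most_relevant = set(sorted(relevance_scores, key=relevance_scores.get,
                               reverse=True)[:top_n])
    filtered = [w for w in words if w in most_relevant]
    pairs = Counter()
    for i, word1 in enumerate(filtered):
        for word2 in filtered[i + 1 : i + 1 + window]:
            pairs[(word1, word2)] += 1
    return pairs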



Results

We've tweaked the javascript d3 code that creates the network graph plots:

var simulation = d3.forceSimulation(graph.nodes)
    .force("link", d3.forceLink(graph.links).distance(50))
    .force("charge", d3.forceManyBody().strength(-20))
    //.force("radius", d3.forceCollide(15))
    .force("center", d3.forceCenter(width / 2.0, height / 2.0));

The length of the links has been increased to make crowded networks clearer. The repulsion between nodes has been reduced to prevent them flying off the edge of the display! We tried but didn't use the enforcement of a collision radius, which effectively ensures nodes don't get too close .. it ended up creating a regular and unnatural pattern.

The following shows the network of nodes connecting co-occurring words (with window 2) which were in the top 100 most relevant for this (small) Italian Recipes dataset. It's quite crowded .. and does still contain some stop-words. If you play with it, you can explore some meaningful connections.


The next plot shows only those nodes which had a weight of more than 3.5 (we tried a few other values too).


It really brings out the key meaningful word pairings ..

  • grated cheese
  • tomato sauce
  • pour over
  • olive oil
  • bread crumbs
  • brown stock

That's a lot of success for now! Great job!



Next Steps - Performance Again

We applied the above ideas to a small dataset ... we really want to apply them to the larger datasets, because co-occurrence with a stretched window doesn't really become useful with small datasets.

However, the performance of indexing and calculating co-occurrence is slow .. so next we need to investigate this and find a fix!

Saturday, 8 October 2016

Co-occurrence .. Refined ... Part 1/2

Much of our recent experiments have been based on the idea of co-occurrence .. two words being next to each other.

We assumed that if two words co-occurred they must be related.

That's a powerful theory .... but it isn't perfect. Two key problems are:

  • Interesting words are often sat next to boring words. Apple in .. Slice the ... John and ...
  • Related words are often not right next to each other, but instead, 2 to 3 words apart .. Apple and banana ...  ... Hillary spoke to Obama ... 


What do we do?



Longer Reach Co-Occurrence

We don't need to throw away our notion of co-occurrence. All we need to do is make it a tiny bit more sophisticated to take into account the above observation that related words are sometimes a short distance apart.

There are loads of ways we can do this .. but as always, let's start simple, and only make things more complex if we need to.

Option 1: One or Two Words Apart
The simplest idea is to extend the reach of co-occurrence from words being right next to each other, to having a word between them:


What we're saying here is that if two words are right next to each other, or they have one word between them, then we consider them to be co-occurrent.

That would help with the problem of some boring words, and also interesting related words being once removed.

Why stop at one word between them? ...

Option 2: One, Two, Three, Four ... N Words Apart
Why not have 2 words between co-occurrent words? Or 3? Or 4 .. maybe N words?


That would deal with related words that were further apart... but... that also begins to strain the original theory that related words are close together.

So we need to somehow give more credibility to the idea that words that are closer together are related, and less credibility to the idea that words that are further apart are related.

That's not to say, words that are far apart can't be related .. it just means that, without any other clues, and given a large corpus of text, we're more likely to believe that more closely co-occurrent words are related.

Option 3: Weights N-Words Apart
We can easily translate this idea into a calculation:

  • Set a window, of length N, which limits how far apart words can be to be considered co-occurrent.
  • Use a maths function, like $\exp(-x^2)$, which can give more numerical weight to closer words than to more distant words.


That looks like a good way forward ... let's try it ...
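Before we look at the results, here's a rough sketch of how this windowed, weighted co-occurrence might look in code .. the weighting function and window handling here are assumptions, and the toolkit's details may differ:

import math
from collections import Counter

# weight for two words that are 'distance' positions apart ..
# adjacent words (distance 1) get the most weight, decaying like exp(-x^2)
def cooccurrence_weight(distance):
    return math.exp(-(distance - 1) ** 2)

# accumulate weighted co-occurrence for every word pair within the window
def weighted_cooccurrence(words, window=3):
    pairs = Counter()
    for i, word1 in enumerate(words):
        for distance in range(1, window + 1):
            if i + distance >= len(words):
                break
            pairs[(word1, words[i + distance])] += cooccurrence_weight(distance)
    return pairs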



Results

The following shows the resulting force-directed graph for our Italian Recipes mini-corpus, using the above extended-reach co-occurrence with a window of 3 (words could be 3 apart).

Here's a plot of the top 500 words by co-occurrence ..

It's a bit busy .. but we can see some nodes which are connected to many others .. but sadly they're the boring stop-words.

Here's the same data but only showing a smaller set of the words with higher co-occurrence value .. more than 4.0 here.

Yup - that makes it even clearer that we need to deal with those pesky stop-words.

As a bit of a short-cut we'll use the rather brutal method of removing stop-words we developed earlier, tmt.word_processing.remove_stop_words(). Here's the result:


This is really much better now that we've cleared out the stop words. We can see words that should be related, for example:

  • bread -- crumbs
  • salt -- pepper
  • tomato -- sauce, tomato -- paste, white -- sauce
  • cut -- pieces
  • cold -- water
  • olive -- oil
  • egg -- yolk

That validates our idea .. and gives us the clearest insights yet ... we just need to deal with those stop-words!


Thursday, 29 September 2016

Similar Words ... an Easy Intro to Vector Spaces

In the last post we did some fun stuff, based on the idea of word co-occurrence to indicate some kind of similarity or connectedness in meaning.

What we found was that word co-occurrence worked well for generating sentences - great fun - but was plagued by the boring word (stop word) problem. Lots of words co-occur with the boring word "and" .. but that doesn't mean those words are all somehow related to "and" in meaning.

So let's try another approach that's a teeny weeny bit more sophisticated. And exploring this will lead us to another great tool used by professional text data scientists.



Similar/Related Words

What do we mean when we say two words are similar? We weren't that precise before. Here are some options:

  • One word causes another .. e.g. virus causes disease
  • One word is a kind of another .. e.g. a horse is a kind of a mammal
  • One word is just another way of saying the other .. e.g. lazy is almost interchangeable with idle

There are other kinds of ways words can be related or similar .. but for now let's focus on the third kind above ... where words have the same, or almost the same, meaning .. and could be swapped out for each other without too much trouble.

Of course real language is subtle and complex .. so the words umpire and referee are almost the same, but one is actually used in preference to the other depending on the sport ... a football referee and a cricket umpire. This particular complexity is one we'll embrace rather than ignore...



Meaning Comes From Context

How do we know whether a word is similar to another? Well, we could look them up in a dictionary and compare definitions. That's not a stupid idea at all! But dictionary definitions are actually not that comparable .. and we'd still have to automatically interpret those definitions ... which takes us back to the original problem of getting computers to understand human language.

Let's try again...

Have a look at the following diagram which shows three words car, van, and banana.

The words are linked to other words that they co-occur with. So the word car co-occurs with the words seats, drive and take. That's what you'd expect .. nothing controversial here.

Now if you look at the word van you can see it shares a lot of the co-occurring words with car .. again, you would entirely expect this. That's because the two things, car and van, are very similar ... they act similarly, they are used similarly, are built similarly, and can often be used interchangeably.

But there are ways in which a van is different from a car. The above shows that the word height is sometimes used with van but rarely or not at all with car.

Now the word banana .. well, that word doesn't share any co-occurring words with car or van .. except maybe the word take, which it shares with car.

So this diagram allows us to see which words are similar in meaning or use, without even having to use a dictionary to actually understand what the words really mean! We do it simply by spotting the shared context of the related words.

That's pretty powerful!

There's more! ...

The words car and van don't co-occur much ... yet the above diagram shows that they are similar through their shared neighbours!

... That's really powerful!



Uses of Finding Similar Words

The uses for finding similar words are huge! Off the top of my head, we could:

  • find words that were similar to ones we had
  • find documents that were related to others (because they contained enough similar words)
  • visually cluster similar words to understand quickly what an unknown text was about, without having to read all of it
  • find new words, or uses of words, in a corpus .. and understand how they were used
  • ...

Exciting!



How Would We Do It (By Hand)?

Let's now think about how we would find similar words, following the above idea of shared co-occurring words.

We should certainly use the matrix of co-occurrences that we built last time. That gives us the co-occurrence of one word following another.

When we talked above about co-occurrence we didn't care about word order .. so we'd need to change that matrix a little bit. We'd have to sum the values for word1 -> word2 and word2 -> word1, because that gives us the co-occurrence irrespective of which word comes first. Easy enough!
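With the matrix held as a 2-dimensional array, that summing is a one-liner. Here's a small sketch with made-up numbers:

import numpy as np

# hypothetical directed co-occurrence counts .. m[i][j] counts word_i followed by word_j
m = np.array([[0, 3, 1],
              [1, 0, 2],
              [0, 4, 0]])

# add the transpose to sum both directions, so word order no longer matters
symmetric = m + m.T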

Now, imagine we have a word car ... how do we use that matrix to find similar words? Remember, we mean words that are similar because they share lots of co-occurring words.

Have a look at the following diagram of a co-occurrence matrix:

What's the word that's the most similar to car? Let's take this slowly and step by step.

Let's see if the word fish is the answer. Remember - we're not looking at simple co-occurrence of fish and car. We're looking for other shared co-occurrent words.

  • If you look across from car you can see which words it is most co-occurrent with - wheels, paint.
  • If you look across from fish, it is most co-occurrent with nothing much else.
  • So car and fish don't share many co-occurrent words.

Let's see if the word van is the answer.

  • If you look across from van, we can see it is mostly co-occurrent with wheels, paint.
  • That's the same set of words that are most co-occurrent with car, which we just worked out above.
  • That means car and van share lots of co-occurring words.

Yay! That means car and van are similar!

Ok - that's great, how do we turn this into some kind of calculation that a computer can do?



How Would We Do It (By Computer)?

Let's look back at what we did above by hand. We looked across from each candidate word, and considered which words were most co-occurrent with that candidate word ...

... And if those co-occurrent words matched ... then the two words were similar. Similar according to our definition above.

How do we calculate such commonality? Well, if you look again at that matrix above, the darker matrix cells have higher values .. and it's those we want to match. The following diagram shows this more clearly:

Between car and fish, there aren't dark cells that line up. Between car and van there are, as shown by the green arrows.

Numerically, we could multiply the values. This is simple .. which we like ... and seems to have the right outcomes. If the cells we're matching have low values .. the multiple is low. If both are high, the multiple is high .. which is what we want. And, crucially, if we match a high value with a low value .. we get a lowish value .. which is also what we want.

Let's see this in action, with calculations done in a spreadsheet:

Yup - seems to work as we want. The high valued multiples only happen if both multiplied values are high .. that is, a match of co-occurring words.

Then we simply sum up the multiples, representing the matches ... to get a score for similarity. Here we have a similarity score of 0.01 for car and fish ... makes sense. And we have a score of 1.76 for car and van ... also makes sense!

So we now have a computer method for finding similar words! .. wasn't such hard work!
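Here's that calculation as a few lines of Python. Only the car row comes from the matrix above .. the fish and van rows are made-up values, chosen to reproduce the spreadsheet's scores:

# rows of the co-occurrence matrix for each word .. car's values are from
# the worked example above, fish and van are illustrative guesses
car  = [1.00, 0.90, 0.00, 0.80, 0.20, 0.00]
fish = [0.00, 0.00, 0.90, 0.00, 0.05, 0.80]
van  = [0.90, 0.50, 0.00, 0.50, 0.05, 0.10]

# multiply matching cells and sum up the multiples
def similarity(word1, word2):
    return sum(a * b for a, b in zip(word1, word2))

print(round(similarity(car, fish), 2))   # 0.01 .. car and fish are not similar
print(round(similarity(car, van), 2))    # 1.76 .. car and van are similar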



Getting Fancy - Word Vectors And The Spaces They Live In

You've probably heard loads of people talking about using vector spaces to do fancy stuff with words. They certainly give the impression that this mysterious stuff is kind of magic and powerful!

Let's demystify it a bit!

First of all - there are many different ways vectors and vector calculations are used .. there isn't only a single way. We'll start with just one.

If you look at that calculation we just did .. we multiplied the corresponding cells of two rows of numbers. All of the numbers in a row are associated with the word we're interested in.

So you can see that the row of numbers is a vector (of numbers) for the word in question. Here's the vector for the word car:

$$ \mathbf{car} =
\begin{pmatrix}
1.00 \\
0.90 \\
0.00 \\
0.80 \\
0.20 \\
0.00 \\
\end{pmatrix}$$

So the similarity score is the dot product of these vectors ... which is a nice coincidence, because the dot product sums the products of the corresponding vector elements - precisely what we've done above to calculate the similarity score!

$$ \mathbf{similarity} = \mathbf{word_1} \cdot \mathbf{word_2}$$

This is also great because many computer programming languages, or their supporting libraries, make dot products of vectors easy and efficient. That includes Python too!
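For example, with numpy:

import numpy as np

car = np.array([1.00, 0.90, 0.00, 0.80, 0.20, 0.00])
van = np.array([0.90, 0.50, 0.00, 0.50, 0.05, 0.10])

# the same similarity score as before, now as a single dot product
print(round(np.dot(car, van), 2))   # 1.76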

Let's take a different look at what this dot product does. First let's see what that vector really is, and where it lives:

Each element in those vectors is a magnitude along an axis .. an axis that represents a co-occurrence word. So the space that these vectors live in is the space of co-occurrence words. It's nice to think of this space as a space of themes, or topics, or ideas, or meanings...

Taking the dot product of two vectors in this space is the same as measuring the angle between them (the maths is easy).

$$\mathbf{a} \cdot \mathbf{b} = |a| \ |b| \ \cos(\theta)$$

If the vectors are normalised to 1, this simplifies super nicely to

$$\mathbf{a} \cdot \mathbf{b} = \cos(\theta)$$

You can think of that dot product as a measure of similarity ... a smaller angle means the vectors are more aligned, and $\cos(\theta) \to 1$ ... a bigger angle means they point in very different directions and are not similar, and $\cos(\theta) \to 0$, or even $-1$.
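Continuing the car and van example, normalising first:

# normalise so that only the angle between the vectors matters
car_unit = car / np.linalg.norm(car)
van_unit = van / np.linalg.norm(van)

print(np.dot(car_unit, van_unit))   # close to 1 .. a small angle .. similar words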




Next Time

Next time we'll implement these ideas and see how well they work. We can look forward to seeing if we can:

  • recreate the force-directed graphs based on this new measure of similarity (based on commonality of co-occurrent words)
  • cluster similar words based on this similarity
  • cluster similar documents
  • build a search engine based not on exact matches, but on similar words, where similarity is this new definition




Fancy Terminology

Here's some fancy terminology you might hear others using about some of the stuff we talked about here:

  • 2nd order co-occurrence ... that's just the definition we developed above .. where we consider how much of word1's co-occurrent words are the same as those for word2. We saw how van and car don't themselves co-occur, but they are similar because they share many other co-occurring words. Second order just means "once removed" here .. as shown in the diagram earlier.
  • vector space - a space in which vectors live ... the focus is on how we define the axes. There are all kinds of vector spaces, and above we made one which had axes of co-occurrent words.