Sunday, 25 September 2016

Co-Occurrence .. and Generating Sentences


Most of our work has looked at words on their own. We know that meaning is often in the context of a word, so we must look around any single word to try to capture this contextual meaning.

There are several ways of doing that - some of them fairly sophisticated. But for now, we'll start super simple, as usual, and grow from there, ready for the more complex stuff later.



Co-Occurrence: Words That Occur After A Word

Let's start by considering which words occur after a given word. So for example, if I thought about the word "apple", maybe the word "pie" is most likely to occur after this. If I think about the word "Iraq" maybe the word "war" is most likely to appear next.

We kinda looked at this when we broke a corpus into 2-grams and counted how often they occurred. We're going to do the same sort of counting again but a bit differently because we want to later build on the approach we develop here. It will become clear why we're doing it differently, I promise!

Let's jump into the (not very deep) deep end! Have a look at the following table.


Let's decode what this is. You can see down the left words like "apple", "pie" and "fish". These have been labelled word1. That was easy enough.

You can also see a similar thing across the top - words labelled word2. Fair enough.

What are the numbers inside the table? Those are the likelihoods that word1 is followed by word2. The illustrated example is "apple" followed by "pie" ... as in "apple pie" ... which has a likelihood of 0.99 out of a maximum of 1.0. That's very high - which means that apple is very likely to be followed by pie. Makes sense again ...

Often such matrices are symmetric - but that's not the case here. The word "pie" is not likely to be followed by the word "apple", and it shows in the table as a low likelihood of 0.05. It's not zero because our text cleaning might strip punctuation, and so the word apple might actually be the beginning of a new sentence or sub-phrase. Again this all makes sense .. apple pie .. not pie apple.

The actual numbers used here are probabilities between 0 and 1. We could have chosen another kind of measure, like word count, which is simpler. Remember from earlier though, word count is easily biased by bigger documents, so maybe a more normalised measure like word frequency is better.
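To make this concrete, here's a minimal sketch of how such a table could be built with pandas - assuming the corpus has already been cleaned and split into a list of lowercase words. The real toolkit code is in the notebook linked below; this is just the idea.

import collections
import pandas

def cooccurrence_matrix(words):
    # count how often each word is immediately followed by each other word
    counts = collections.Counter(zip(words, words[1:]))

    # rows are word1, columns are word2
    matrix = pandas.DataFrame(0.0, index=sorted(set(words)), columns=sorted(set(words)))
    for (w1, w2), n in counts.items():
        matrix.at[w1, w2] = n
    return matrix

# tiny example corpus
words = "boil the rice then add the grated cheese and the butter".split()
matrix = cooccurrence_matrix(words)

# optionally turn the counts into probabilities by normalising each row to sum to 1
probabilities = matrix.div(matrix.sum(axis=1), axis=0).fillna(0)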

We could even use the measures of interesting-ness (TF-IDF) we developed earlier to moderate these probabilities ... that way we de-emphasise words like "the" and "a" which will dominate co-occurrence. But for now let's keep it simple.

So what can we do with such tables of co-occurrence values?



1. Identify Insightful 2-Word Phrases

We could take the word pairs with the highest co-occurrence value and see if they give us any insight into the corpus. Here are the top 20 for the mini Italian Recipes corpus we've been experimenting with:

      word1    word2     co-occurrence
 0    of       the       19.0
 1    with     a         17.0
 2    in       the       16.0
 3    a        little    16.0
 4    grated   cheese    16.0
 5    salt     and       15.0
 6    it       is        14.0
 7    in       a         13.0
 8    them     in        13.0
 9    the      fire      12.0
10    and      pepper    10.0
11    and      put       10.0
12    on       the       10.0
13    they     are       10.0
14    oil      and        9.0
15    to       be         9.0
16    over     the        9.0
17    tomato   sauce      8.0
18    with     the        8.0
19    butter   and        8.0

This is just like taking the top 20 n-grams by word count that we did before. That's ok - remember we're doing it this way because this way can be extended to new ideas .. hang in there!

We can see that only a few of the top 20 word pairs are actually that insightful ... most of the words are what we previously called stop words - boring and not unique. We'll try applying the uniqueness factor later.

If we didn't know what a corpus was about, bringing out the top 20 2-grams like this helps us understand that it might be about food or recipes.
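By the way, if the co-occurrence values live in a pandas dataframe like the sketch above, pulling out the top pairs is just a couple of lines - flatten the matrix into (word1, word2) pairs, sort, and take the biggest:

# continuing the earlier sketch - flatten the word1 x word2 matrix into
# (word1, word2) -> value pairs, sort them, and take the 20 largest
top_pairs = matrix.stack().sort_values(ascending=False).head(20)
print(top_pairs)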



2. Generate sentences!

This is going to be fun.

Imagine we pick a word at random .. we can use that table to look up the most likely next word. We're going from word1 to word2. Fine, nothing special here.

But if we then take that second word, and use it as the first word, we can find the most likely next word .. a third word. If we repeat this, we can get a whole sequence of words.
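Here's a rough sketch of that chaining, assuming a co-occurrence dataframe like the one sketched earlier (rows are word1, columns are word2) - the toolkit notebook does it slightly differently, but the idea is the same:

def generate_sequence(matrix, start_word, length=7):
    # start with the chosen word, then repeatedly append the most likely next word
    sequence = [start_word]
    while len(sequence) < length:
        sequence.append(matrix.loc[sequence[-1]].idxmax())
    return " ".join(sequence)

# for example, starting the chain at "the"
print(generate_sequence(matrix, "the"))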

Maybe we'll get meaningful sentences? Let's try a few ... the following shows sequences of 7 words (we could do more, or less) with deliberately chosen first words.


      the fire with a little more and pepper
      once into the fire with a little more
      then cut them in the fire with a
      olive oil and pepper and pepper and pepper
      tomato sauce salsa bianca this sauce salsa bianca
      cut them in the fire with a little
      cook in the fire with a little more
      little more and pepper and pepper and pepper


So for a first go ... that's kinda spookily good! Some of the sentences seem to be fairly well formed, and actually very much on topic!

Because we're not doing anything more sophisticated (we could) we do fall into traps like word sequences that keep repeating like "olive oil and pepper and pepper and pepper".

Phrases like "cook in the fire with a little more" are pretty impressively natural!

We should be really pleased with this - a super simple technique - and we get spookily good results!



3. Graphs of Word Nodes Linked by Co-occurrence

This is also kinda cool .. and opens up lots of possibilities that we'll explore later.

If we take the word pairs with the highest co-occurrence values, and plot them as connected nodes, the resulting graphs can be interesting, beautiful, and hopefully insightful too ... because we can see which words are connected to which other words, and whether some words are the nexus of many others .. and in some way central to the meaning of the corpus.

Let's try it with the small Italian Recipes corpus. We developed force directed graphs in the last post and they've been improved to have word labels, colours, and also wider links for larger co-occurrence values ... click on the following to enlarge it.


Well .. it's busy! It's the top 500 word pairs by co-occurrence value. You can also see that there is some structure .. with some words clustered around what must be important words. Either drag a node, or hover over it, to see what the word is.

As we half-expected, many of the central words are actually those stop-words .. those boring words that aren't that insightful. The words "in", "of", "the" are at the centre of quite a few clusters. That's ok - we're just starting to explore using graphs .. we can find ways of removing or reducing them, by using stop-word exclusion lists or the TF-IDF measure of interesting-ness we developed earlier.

Before we do that - have a look at the following graph, which only shows pairs with a co-occurrence above 3 .. an attempt to remove noise and concentrate on the pairs for which there is lots of evidence.


That's much clearer if we're busy and don't have time to explore a graph of loads of nodes.
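For the curious, here's a rough sketch of how the graph data might be prepared for d3 - a list of nodes and a list of links, with the co-occurrence value as the link weight. The names and the threshold here are illustrative; the real plumbing is in the library code linked below.

def graph_data(matrix, minimum=3):
    # keep only the word pairs with enough evidence
    pairs = matrix.stack()
    pairs = pairs[pairs > minimum]

    # build d3-style nodes and links
    words = sorted(set(pairs.index.get_level_values(0)) | set(pairs.index.get_level_values(1)))
    nodes = [{"id": w} for w in words]
    links = [{"source": w1, "target": w2, "value": float(v)}
             for (w1, w2), v in zip(pairs.index, pairs.values)]
    return {"nodes": nodes, "links": links}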



Code

As usual, the code for the data processing pipeline is in a notebook on github:



And the code for the visualisation is also on github, the following shows the library function and the d3 javascript:





Next Ideas

This is a great start ... and we've developed a powerful (and pretty) visualisation too! Here are some of the ideas we can try next:

  • Use stop-word removal or TF-IDF measures to reduce importance of boring words. 
  • Extend co-occurrence beyond just two words .. maybe 3, 4 or even much longer sequences ... suitably adjusted so that closer words are still given a higher score. This will deal with language which often places important words around a boring word like "apple and pear" ... "fish in the oven" ... (a rough sketch of this idea follows below)
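Here's a very rough sketch of that second idea - counting words that appear within a small window of each other, with closer pairs scoring more. This isn't the toolkit's code, just a taste of where we might go:

import collections

def windowed_cooccurrence(words, window=3):
    # count pairs of words within 'window' words of each other,
    # scoring closer pairs more highly (distance 1 adds 1.0, distance 2 adds 0.5, ...)
    scores = collections.Counter()
    for i, w1 in enumerate(words):
        for distance in range(1, window + 1):
            if i + distance < len(words):
                scores[(w1, words[i + distance])] += 1.0 / distance
    return scores

# "fish" and "oven" now get linked, just with a smaller score than adjacent pairs
print(windowed_cooccurrence("cook the fish in the oven".split()).most_common(5))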

Tuesday, 20 September 2016

Interactive D3.v4.js in a Jupyter Notebook

One of the things we want to develop is a visualisation for word co-occurrence .. that is, a measure of how often words appear one after another.

A good way of doing this is with a graph of nodes, kinda like the following:


You can see how each node can represent a word, and the links between nodes (words) can be coloured or widened to indicate how often one word is followed by the next.

It would also be great to be able to play with the elements, to shift them around or zoom into a big graph.



D3.js

There is a really popular javascript library for creating rich and interactive visualisations called d3. Check out their gallery to get a feel for the really cool, and interactive, visualisations it can make. There's even more here ... I can't get enough of them!

Here's an example of a live interactive graph at bl.ocks.org .. have a go at dragging the nodes around.

You can see it constantly reshapes to maximise the visibility of each node, whilst trying to maintain the links between them.



Problem: D3.v4.js in a NoteBook (from a Library)

Plotting images and charts in a notebook is easy and has been done for a long time. You may already know about matplotlib, the de-facto library for plotting in a notebook.

Working with d3.js is slightly different because it is a javascript library - that means it needs to work in the browser. Normally tools would work behind the scenes in the engine (notebook kernel) and once the results were created, they would be pushed to the browser to be displayed. Javascript libraries don't do that - they do the work in the browser.

So d3.js plots are created in the browser. How do we do that with a jupyter notebook?

Luckily, there is a fairly mature method for executing HTML and Javascript in a notebook cell. Here's a super simple example:


You can see how an object is created from the raw HTML, and then this object is "displayed" .. or rather executed. That's cool!

You can find more examples of executing rich media in a notebook cell here ... including examples showing how javascript can be executed in a browser notebook.

If we're developing a toolkit visualisation library, we don't want the user to type all that stuff into a cell and then execute it. That would defeat the point of a library that takes all the details away.

Can we run the above python in a library function, for the results to be displayed in a notebook? Yes, we can!

Here's a basic library function which does the HTML mangling so the notebook user never sees it:

import IPython.display

def test_html(s):
    # create html object
    h = IPython.display.HTML(s)
    IPython.display.display_html(h)
    pass

And here is what it looks like ... nice and simple!


That's how we want to use d3 ... by getting a library to do all the HTML and Javascripty stuff. Let's try it:

import IPython.display

def test_html(s):
    # create html object
    h = IPython.display.HTML(s)
    IPython.display.display_html(h)
    pass


def test_d3():
    html_string = """
    <svg width="400" height="200"></svg>

    <script src="https://d3js.org/d3.v3.min.js"></script>
    """

    js_string = """
    console.log(d3);

    var svg = d3.select("svg");

    var circle = svg.append("circle")
        .attr("cx", 100)
        .attr("cy", 100)
        .attr("r", 40)
        .attr("fill", "blue")
        .attr("stroke", "black");
    """

    h = IPython.display.HTML(html_string)
    IPython.display.display_html(h)

    j = IPython.display.Javascript(js_string)
    IPython.display.display_javascript(j)
    pass

You can see how we use d3 to select the SVG HTML element and then create a circle inside it.

Here's what happens:


Ok .. that didn't work! Seems like d3 isn't visible from the python library when used in a notebook. Hmmm ... even weirder, if you just run it again, it works:


That's not inspiring confidence .. and anyway we want to work with d3.v4 not d3.v3 ... let's see what happens if we change the source URL to point to https://d3js.org/d3.v4.min.js:


A similar error, and this one doesn't go away like the other one did by repeatedly calling the function.

So we have a problem to fix. It might be tempting to live with the weird d3.v3 behaviour - but we won't. We'll press ahead and look to the future with the new d3.v4 ... we don't want to be locked into an ancient version of a library. And anyway, we want to fix the problem properly if we can, not rely on some wobbly workaround.



Diagnosis

I'm not an expert but after a week of wrestling with this ... and terrible documentation ....

The problems seem to be due to a different way that the d3.v4.js library presents itself to whatever it is loaded into.

The v3 version seemed to create a global reference, so anyone could see and call it. That doesn't appear to be the case for v4 .. Again, I'm no expert, but I think it is trying to be clever and present itself based on what it thinks the host environment is ... and it gets it wrong when the host is a Jupyter notebook.

The various tickets on this seem unanswered - e.g. http://stackoverflow.com/questions/39335992/d3-4-0-does-not-create-a-global-d3-variable-when-imported-into-jupyter-notebook

To make things worse, browsers and/or the notebook seem to cache previous javascript libraries .. making you think things are working when they're not .. or are broken when they're not! To properly test this I needed to close down a browser properly and kill all history/cookies/cache.



My Solution

Here's how I finally got it to work.

The key is to use the require.js method to load the d3 library, and to remove the <script> method of loading it. The Jupyter notebook already makes use of require.js so we don't need to load that. Note the lack of a ".js" at the end of the URL.

require.config({
    paths: {
        d3: "https://d3js.org/d3.v4.min"
    }
});

require(["d3"], function(d3) {
    // do stuff here
});

So let's try it ...


That works! .. and with version 4 of the D3 library too!

Here's a preview of work in progress developing an interactive force-directed graph for word co-occurrence .. you can see the node "pear" highlighted. You can't see it here, but we can drag the nodes about and let the graph rebalance.




Reference Code

Here's sample reference code for getting d3.v4.js working from a Python library invoked by a Jupyter notebook:

import IPython.display

def test_html(s):
    # create html object
    h = IPython.display.HTML(s)
    IPython.display.display_html(h)
    pass


def test_d3():
    html_string = """
    <svg width="400" height="200"></svg>
    """

    js_string = """
    require.config({
        paths: {
            d3: "https://d3js.org/d3.v4.min"
        }
    });

    require(["d3"], function(d3) {

        console.log(d3);

        var svg = d3.select("svg");

        var circle = svg.append("circle")
            .attr("cx", 100)
            .attr("cy", 100)
            .attr("r", 40)
            .attr("fill", "green")
            .attr("stroke", "black");

    });
    """

    h = IPython.display.HTML(html_string)
    IPython.display.display_html(h)

    j = IPython.display.Javascript(js_string)
    IPython.display.display_javascript(j)
    pass
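And using it from a notebook cell is then just an import and a function call - assuming the functions above are saved in a module called, say, d3_plot.py (the name is just illustrative):

# in a notebook cell
import d3_plot

# should render a green circle drawn by d3 v4 inside the notebook
d3_plot.test_d3()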

Monday, 12 September 2016

Multi Word Search Queries

We've already developed a simple way of searching an index to see which documents contain a word.

We even refined the way we rank the search results, first by prioritising the documents with the highest word count, and then by using our more sophisticated relevance measure.

The challenge now is .. how do we handle queries with more than one word?



More Than One Query Word

Let's have a think about this. Imagine we have 3 documents and a small index of only 3 words, shown here:


You can see, for example, that the word rice doesn't appear in doc1. You can see that it appears in doc2 with a high relevance score. Doc1 seems to feature cake a lot!

What do we want to happen when we give our computer a search query with more than one word - say "rice cake"? That's not a simple question. So let's take it slow.

We don't want our toolkit to ignore one word or the other. That's why we provided both. Good - that's a start, at least.

Do we want our toolkit to consider each word on its own, and return the two sets of results? So we could see the scores for the word rice, and note that doc2 came top, and similarly see that doc1 came top for the word cake. We could then .. maybe order doc1 above doc2 because the score for cake was 0.5, higher than the 0.4 for rice.

Well ..  that could work, and it's not a terrible start ... but why did we provide two words in the first place? Because we wanted to find results which were related to both words together.

That's our first clue as to how to handle the query. We want to prioritise documents which are related to both search query words - both "rice" and "cake" in our example.

How do we do this? Well, we could look at an index based on 2-grams and try to find "rice cake" in there. That's actually a very good answer - but it fails if the two words are not always right next to each other. And anyway, we often find ourselves having to work with an index of 1-grams, not 2-grams or 3-grams, etc ...

Let's have another go using the 1-gram index above. Could we combine these scores in some way that reflects our intention when we issue the search?



Combining Scores

As always, let's start with the simplest thing we can do, and only make things more complex if we really really need to.

A very simple thing we can do is simply add the scores that each query word has for a document.

Have a look again at the index:


Using the words rice and cake:

  • doc1 has a combined score of 0.0 + 0.5 = 0.5
  • doc2 has a combined score of 0.4 + 0.1 = 0.5
  • doc3 has a combined score of 0.3 + 0.3 = 0.6


So this means doc3 wins! ... not doc2 which had the highest score for rice alone, and not doc1 which had the highest score for cake.

You can see how doc1 might have been about cakes, but maybe not rice cakes. And doc2 might have been a rice dish but not rice cakes.
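Here's that little calculation as a pandas sketch, using the illustrative scores from the index above:

import pandas

# toy relevance index - rows are words, columns are documents
index = pandas.DataFrame([[0.0, 0.4, 0.3],
                          [0.5, 0.1, 0.3]],
                         index=["rice", "cake"], columns=["doc1", "doc2", "doc3"])

# combined score for the query "rice cake" is the sum down each document column
print(index.loc[["rice", "cake"]].sum())
# doc1 0.5, doc2 0.5, doc3 0.6 ... doc3 wins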

So let's try this idea of adding relevance scores for each search word in a query ... often called search terms.



Code

The code to implement this is really simple .. which shows you the power and ease of Python and pandas.

# query string to list of search terms
search_query_list = search_query.split()

# do query
documents = relevance_index.loc[search_query_list]

# sum the scores
results = documents.sum()

# filter out those with score of zero
results = results[results > 0]
return results.sort_values(ascending=False)

The query is still a simple lookup of the index data frame, which returns a row for each search term. The sum adds the scores down each document column, and is also really simple. The results can contain zeros (word not present in a document) so we get rid of those, again very simple. We finally sort the results in descending order, so the highest scores come first.



Results

Let's try it and see what happens:

  • A search for "saffron rice" places 05.txt top for both indices. That's good because it is in fact the recipe for "rice with saffron".
  • A search for "bread soaked in milk" results in both indices returning 17.txt and 03.txt which both contain bread (crumbs) soaked in milk. But the relevance index returns 00.txt as the top result. This is not ideal, especially as the document doesn't contain the word milk but is boosted because the document contains a high concentration of the word bread

That second observation suggests we need to refine the combination of scores. We'll come back to this later, as we've made a good start for now.

Sunday, 11 September 2016

Results from Improved Word Relevance Measure

In the last post we looked at improving the early measure of relevance which was based on word count.

Word count works well, but suffers from bias due to long documents (more words), and also from rewarding lots of boring words (the, it, a, is, to ...).

Let's see how well it all worked.



Finding Interesting / Boring Words Automatically

If we apply the measure to the recipes data set, we should be able to identify the boring (stop) words automatically. These will be the ones with the lowest relevance scores.

Previously, the list of most frequent words for the recipes corpus was full of boring words. The first interesting word "oil" appeared at position 13, after 12 boring words ranked higher! Here they are again for easy reference:

   the       273
   and       203
   a         133
   with       79
   for        77
   in         72
   1          58
   to         56
   of         49
   2          44
   then       43
   until      41
   oil        41

Relevance scores for a word are per-document, but we can sum them up across all documents. Here's what the top 20 words look like (shown with their summed relevance scores):

[('sauce', 0.072479554515290895),
 ('them', 0.070833510080924769),
 ('little', 0.062832341904529035),
 ('rice', 0.058276255134080412),
 ('butter', 0.057278834991103734),
 ('bread', 0.055933326383546908),
 ('they', 0.054793552703163897),
 ('quantity', 0.051914014707442335),
 ('together', 0.050062707525060902),
 ('grated', 0.048860949578921925),
 ('broth', 0.048693197257887351),
 ('tomato', 0.048444379129175236),
 ('boiled', 0.047588072543957513),
 ('flour', 0.047404394383271695),
 ('water', 0.046392876975336547),
 ('then', 0.046343005577901039),
 ('pepper', 0.044368328433798621),
 ('some', 0.044303584017951085),
 ('very', 0.043970962688707774),
 ('that', 0.043831469829911401)]

Much much better! That top 20 contains lots of interesting words with only a few we would consider not so interesting. There are no short boring stop-words like "a", "in", "to" .. so our measure which penalises shorter words works well!
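For reference, the summing itself is tiny if the relevance index is a pandas dataframe with words as rows and documents as columns - a sketch, not the toolkit's exact code:

# sum each word's relevance over all the documents, biggest first
totals = relevance_index.sum(axis=1).sort_values(ascending=False)

print(totals.head(20))   # the most interesting words
print(totals.tail(20))   # the boring words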

Let's see what the bottom 20 look like ... these should be the boring words:

[('and', 0.0),
 ('in', 0.0),
 ('the', 0.0),
 ('up', 0.0012725563329122366),
 ('i', 0.0012816579235383377),
 ('bit', 0.0017987306071704534),
 ('had', 0.0017987306071704534),
 ('now', 0.0017987306071704534),
 ('1', 0.0020042948378737836),
 ('3', 0.0020042948378737836),
 ('4', 0.0020042948378737836),
 ('2', 0.0020042948378737836),
 ('32', 0.0020148808604443747),
 ('leaf', 0.0022240465989832267),
 ('your', 0.0022240465989832267),
 ('wet', 0.0022988261123030452),
 ('top', 0.0022988261123030452),
 ('pie', 0.0022988261123030452),
 ('than', 0.0024192873309550364),
 ('line', 0.0024192873309550364)]

Yup - these are all boring .. except maybe leaf.

For a method that didn't rely on manually providing a list of stop words, this approach works really well. And it'll adapt to different corpora, as each will have different word usage.

Here's a word cloud based on these relevance scores.


Compare that with our previous best word cloud:


What we're seeing is two different pictures - the first tries to show words that are interesting per-document, the second shows words that are prominent across the corpus. Both of these approaches are useful when exploring text.



Notebook on Github

All the code is on github. The notebook for this pipeline is at:




Challenging Texts

Some text corpora are more challenging than others. The word clouds for the Iraq War Chilcot Report and the Clinton Emails data sets have not been that illuminating ... because they're filled up with words that appear often but aren't that interesting. Here's a reminder of what they looked like before:



Let's see how our attempt at reducing the impact of boring words works. Remember, this means reducing the impact of words that appear often but in all documents, making them less "unique".


You can see that the Iraq Report word cloud has new and interesting insights - multinational, resolution, basra, baghdad, de-ba'athification, map, inspectors, intelligence, did, witness, weapons, ... much more insightful than the more expected words we had before. Interestingly, the words iraq, report and inquiry are in the bottom 20 for relevance (interesting-ness).

Let's try the Clinton emails, which were really very challenging because the emails are fairly informal, lack continuity or structure, and are full of unpolished prose.


Well, we can see that we no longer have boring words like state and department, unclassified and date, subject and sent. You'll recognise these as words you find in the raw text of all emails - they're not that illuminating .. so our relevance measure really works to penalise them. Great!

What else does that word cloud show - well it brings forward key dates, 2009, 08-31-2015 and 06-30-2015. If you were investigating the emails, these would be worth looking into. As it happens, these are the dates the emails were released, so not that interesting for us .. but they could have been.

Identities pop out too. Abedin is a close advisor and aide to Clinton - you can read about her here: http://www.vanityfair.com/news/2016/01/huma-abedin-hillary-clinton-adviser. Cheryl and Mills are the same person, Cheryl Mills, another aide: http://www.politico.com/story/2015/09/cheryl-mills-hillary-clinton-aide-213242. The same goes for Jacob Sullivan: https://en.wikipedia.org/wiki/Jake_Sullivan

A few key email addresses also pop out .. including hrod17@clintonemail.com and hdr22@clintonemail.com ... apparently secret email accounts she is alleged to have used inappropriately for official business: http://www.factcheck.org/2015/05/clintons-secret-email-accounts/



Search Relevance

Let's see how search result ranking changes with this relevance measure. The following shows the results of searching for the word "king" in Shakespeare's Macbeth.

Searching the basic word count index gives:

macbeth_act_01_scene_02.txt    14.0
macbeth_act_01_scene_04.txt     7.0
macbeth_act_01_scene_03.txt     6.0
macbeth_act_01_scene_06.txt     5.0
macbeth_act_04_scene_03.txt     4.0
macbeth_act_05_scene_08.txt     3.0
macbeth_act_03_scene_01.txt     3.0
macbeth_act_01_scene_05.txt     3.0
macbeth_act_04_scene_01.txt     2.0
macbeth_act_03_scene_06.txt     2.0
macbeth_act_02_scene_03.txt     2.0
macbeth_act_03_scene_02.txt     1.0

Searching the relevance index gives:

macbeth_act_01_scene_02.txt    0.010196
macbeth_act_01_scene_06.txt    0.006776
macbeth_act_01_scene_04.txt    0.005312
macbeth_act_01_scene_03.txt    0.001888
macbeth_act_01_scene_05.txt    0.001810
macbeth_act_03_scene_06.txt    0.001803
macbeth_act_05_scene_08.txt    0.001636
macbeth_act_03_scene_01.txt    0.000935
macbeth_act_03_scene_02.txt    0.000807
macbeth_act_04_scene_03.txt    0.000759
macbeth_act_04_scene_01.txt    0.000627
macbeth_act_02_scene_03.txt    0.000605

Both ways of sorting the search results give us Act 1 Scene 2 as the first result. If we have a look at the actual text, we can see why .. it is a key scene with King Duncan.

But the second place result is different. The relevance index places Act 1 Scene 6 second, whereas the basic word count gives us Act 1 Scene 4. Both involve the king, but scene 6 is shorter and is mostly about the king's arrival speech, so is less "diluted".

The fact that very similar documents appear at the top, albeit in a different order, suggests we're not wildly off with the new relevance measure.



Summary

Basic word count is a good way of working out what a text is about. Given its extreme simplicity, its power is pretty amazing.

But where it falls down, we can use a refined measure - we call it relevance - which counters the bias of longer documents, penalises short words, and promotes the uniqueness of a word too.

We should use both measures as part of our toolkit, as this relevance measure can diminish words which really are relevant but happen to be prevalent in the text corpus.


Saturday, 10 September 2016

Pandas DataFrame HDFStore Bug

Took me a whole week to narrow down and find this bug!


Some Context

We are using pandas dataframes as indices. The row labels are the word, and the columns are the documents the words occur in. The cell content is the wordcount, or relevance, depending on which index we're working with.

We wanted to save the index by simply pickling it using pandas.to_pickle(). That failed for large dataframes.

So we chose the more mature, and designed for larger datasets, HDF format. That's available officially in pandas, too.

That seemed to work ... until ...


The Bug

Saving dataframes into a HDF store is easy. Let's create a small dataframe representing one of our indices:

import pandas
df1 = pandas.DataFrame()
df1.ix['apple', '001'] = 0.1
df1.ix['banana', '001'] = 0.2
df1.ix['apple', '002'] = 0.3
df1.ix['banana', '002'] = 0.7
df1.ix['nan', '001'] = 0.5

df1
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN


So we've created a super simple dataframe. It refers to the words "apple", "banana" and "nan".

Let's save it to a HDF store:

s = pandas.HDFStore('test')
s['index'] = df1
s.close()

That's nice and easy. The HDF file is called test, and the object inside the file is called index .. you can have many objects in a single HDF store, if you wanted to.

Let's exit out of python, and restart it, to make sure we're truly bringing back the dataframe from the HDF file, and not accidentally bringing it back from memory via a variable that's still hanging around.

Let's now reopen the store:

import pandas
s = pandas.HDFStore('test')
s
<class 'pandas.io.pytables.HDFStore'>
File path: test
/index            frame        (shape->[3,2])



Here we've opened the HDF file called test, and listed what's inside it. You can see it contains an object called index. Let's bring that into python as a dataframe.

df2 = s['index']
s.close()
del s
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
NaN     0.5  NaN


You can see the problem! The words "apple" and "banana" are fine, but the word "nan" has been turned into a NaN (not a number) .. it should be a string.

That's the bug!

And it leads to all kinds of problems .. like not being able to find the word "nan", and not being able to do our relevance calculations. Even storing the index gives a warning, as Python says the index labels aren't orderable, and sometimes Python tries to cast the NaN to a float.


Workaround

A temporary workaround is to force the index values to be recast as strings every time you retrieve an index back from an HDF5 store:

df2.index = df2.index.astype(str)
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN



Fix?

I'm hoping the to_pickle() method gets fixed too ... not just this HDFStore error.

This bug has been reported on the github pandas issues tracker.

Wednesday, 7 September 2016

Speeding Up Indexing Performance

In implementing the new relevance indexing ideas from the previous post, I ran into trouble!



Sloooooow ...

The indexing for the small Italian Recipes data set was quick as a flash. The bigger Iraq Inquiry report took a few minutes ...

... but the Clinton emails took ages .. 12 hours and still less than half the documents had been indexed.


This is not good, especially if we want to experiment and change things quickly.



Optimising Indexing

On this journey we want to develop ideas, algorithms and code which prioritise clarity and simplicity. We don't want to get into overly-complex, deeply sophisticated stuff .. that's not the idea here. The idea is to understand the basics.

So for this reason, we started with an approach to indexing that was super simple:

  for every document in a corpus ...
      load previously saved index (if any)
      go through all the words in a document ...
      add words to index

      save updated index


That was conceptually simple - we can all understand picking up an index and updating it with new content from new documents.

But you can also see that as the corpus gets larger, we're loading and saving an ever bigger index. In much of computing, doing stuff with files is slow and expensive.

Can we make this more efficient? With less file loading and saving?

Well, we can't avoid going through all the text content because we do need to index it.

But we can avoid loading and saving ever bigger index files to update them. Instead we index each content text file separately, and afterwards we merge the index files.

  for every document in a corpus ...
      go through all the words in a document ...
      save document index


  start with an empty corpus index
  for every document index in corpus ...
      load document index and join with corpus index
  save corpus index


There will still be loads of file loads and saves ... but they won't be getting bigger and bigger.

There is a downside to this - we lose the ability to update an index. We now have to think about regenerating an index for an entire corpus. Given the speed up for large corpora, this is worth it.
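Here's a rough sketch of that two-stage shape, just to make it concrete - the file names and details here are illustrative, and the real code (linked below) computes relevance as well as word counts:

import glob
import pandas

def index_document(text_file):
    # count the words in one document and save its own small index
    with open(text_file) as f:
        words = f.read().lower().split()
    counts = pandas.Series(words).value_counts()
    pandas.DataFrame({text_file: counts}).to_hdf(text_file + ".index.hdf", key="index")
    pass

def merge_indices(index_files):
    # load every per-document index and join them side by side into one corpus index
    frames = [pandas.read_hdf(f, "index") for f in index_files]
    return pandas.concat(frames, axis=1).fillna(0)

# for example (illustrative paths):
# for f in glob.glob("corpus/*.txt"): index_document(f)
# corpus_index = merge_indices(glob.glob("corpus/*.index.hdf"))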



Results - Zooooom!

This results in way faster indexing! Instead of hours and hours to get through only about half the Clinton emails .. they were all indexed in minutes.


The merging took longer, but not hours. This is to be expected as the corpus index does grow larger and larger, but we're no longer saving and loading it as we progress through every document.



The Code

The updated code for indexing in this new way is always at github:



Pickling vs HDF5

During the coding it turned out that the simplest way of storing pandas dataframes in a file - known as pickling - didn't work for larger dataframes. This is a known bug, and seems to affect Python 3.5 on Mac OS X.

This was a perfect prompt to look for potentially better formats or ways of storing index dataframes. For now I've settled on the HDF5 format - a mature format for data files. It might be more complex than we need, because it allows things like querying and concurrent access, but it is simple enough to use and works for larger dataframes.