Tuesday, 26 July 2016

Word Counts, Word Clouds .. and Stopwords

One of the simplest ways of understanding a bunch of text is to see which words occur most often.

Yes, there are many, many more sophisticated techniques, but we're at the beginning of our journey and we want to start super super simple.


Counting Words

What do we mean by counting words? It's simply going through the text, a word at a time, and keeping a tally of how often we see each word.

Let's look at a short piece of text:


She sells sea shells on the sea shore.
She also sells ice-cream by the sea shore.

The tally of each word looks like this:


    Word        Frequency  
    she            2       
    sells          2       
    sea            3       
    shells         1       
    on             1       
    the            2       
    shore          2       
    also           1       
    ice-cream      1       
    by             1       

You can see how the word "shells" only appears once. Similarly the word "on" only appears once. More interestingly, the word "sea" appears most often - 3 times. Perhaps it represents best what the text is about? Maybe it is an important theme of the text?

What about words that appear a medium number of times? The word "she" appears twice, and we can see she is an important part of the story. So are the words "sells" and "shore", both of which also appear twice.

So word frequency seems to be a fairly good way of working out the most important themes in some text.
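The tally above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming we lowercase the text first (so "She" and "she" count together, as in the table) and treat a hyphenated word like "ice-cream" as one word. The names text, words and tally are just illustrative.

```python
import re
from collections import Counter

text = """She sells sea shells on the sea shore.
She also sells ice-cream by the sea shore."""

# Lowercase the text, then pull out runs of letters, keeping
# hyphenated words such as "ice-cream" together as one word.
words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

# Counter keeps the tally of how often each word is seen.
tally = Counter(words)

# Print the words, most frequent first.
for word, frequency in tally.most_common():
    print(f"{word:<12}{frequency}")
```

The regular expression is a design choice, not the only way to split text into words. A different rule would give slightly different counts for text with numbers or apostrophes.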

Let's try it on a couple of bigger buckets of text.

The most frequent words for the food recipes are shown here - only the most frequent, not all of them:

   the        273
   and        203
   a          133
   with        79
   for         77
   in          72
   1           58
   to          56
   of          49
   2           44
   then        43
   until       41
   oil         41

What happened?! Almost all the words are totally boring and uninformative. The words "the", "and", "a" .. "then" and "until" are not very enlightening. It's the thirteenth most frequent word that is in any way useful, the word "oil".

So oil is a major feature of the recipes. But there were 12 more frequent but useless words above it. If we look at the words more closely, we can see why. Those words are just the words we use in English to connect other words together to make proper sentences.

Maybe we should ignore them by filtering them out?

In fact that is exactly what many people do .. and such useless words are called stop words. You can read much more about stop words here, but for now we'll keep it simple with a quite minimal but effective list of stop words:


If we filter out the stop words, the most frequent words now look like the following:

   1          58
   2          44
   then       43
   oil        41
   tsp        38
   chopped    37

The word "oil" is now number four on the list. That's better. And we now have useful words like "tsp" (teaspoon) and "chopped" in the top six words.

So stop words have improved things .. and we did it using only a very simple idea.

We still have some words that aren't that useful (we think) .. and we can refine the stop words list again later if we want to, to remove the word "then" for example.
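Filtering out stop words can be sketched like this. The stop word list and the sample text here are made up purely for illustration. They are not the list or the recipes used for the results above.

```python
from collections import Counter

# Hypothetical mini stop word list for illustration only;
# it is not the list used for the results in this post.
stop_words = {"the", "and", "a", "with", "for", "in", "to", "of", "until"}

# Made-up sample text standing in for a recipe.
text = "heat the oil in a pan and add the chopped onion to the oil"

words = text.split()

# Keep only the words that are not in the stop word list.
filtered = [w for w in words if w not in stop_words]

# Tally what is left.
tally = Counter(filtered)
print(tally.most_common(3))
```

Using a set for the stop words keeps each lookup fast, which matters once the text gets big. Refining the list later, for example to remove "then", is just a matter of adding a word to the set.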


Word Clouds

We could look at top-10 lists for word frequency ... and that would be fine. Sometimes a more visual approach helps readers understand the most important themes quickly, and with a lot less mental effort.

One visual way is to take these words, and plot them on a diagram, sized bigger if they occur more frequently. Many will know these as word clouds or tag clouds:

Python has a nice module called wordcloud which does the job. Here's the kind of code you need ..  simple enough.

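import matplotlib.pyplot as plt
import wordcloud

# word cloud
# c is assumed to be the collections.Counter of word frequencies built earlier
wc = wordcloud.WordCloud(max_words=100, width=1600, height=800, background_color="white", margin=10, prefer_horizontal=1.0)
# recent versions of wordcloud expect a dict of word -> frequency
wc.generate_from_frequencies(dict(c.most_common()))

# plot wordcloud
plt.figure(dpi=300, figsize=(16, 8))
plt.imshow(wc)
plt.axis("off")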

Here's the word cloud for the recipes:


You can see that "oil", "chopped" and "olive" are the prominent themes. The words "1" and "2" are the most prominent of all and could be filtered out. Looking at the recipes, they come from the common use of 1 or 2 as quantities.

And here's the word cloud for the Hillary Clinton emails:


We're starting to get the sense that the words "State" and "Department" are key, as is the word "US". You might say .. well, we'd expect the Clinton emails to be about these words .. what's new? Well, we can see "F-2014-20439" is prominent, and it is probably a document or case referred to often .. worth checking out ;) There are prominent dates too, like "06/30/2015" and "2009", which again are probably related to key events. And who are "Cheryl", "Abedin" and "Jacob"? They seem to be referred to often enough.

And finally the word cloud for the Iraq Inquiry report:


This one is not so informative. Almost all the words should be stop words. We'll develop other methods later on our journey to help us get insight into the Iraq Report.


Word Frequency - Simple but Effective

What we've done here is very simple .. but actually quite powerful. The idea of using word frequency to imply importance can be applied to huge volumes of text ... without us having to read them manually .. and we can produce a nice visual representation of the most important themes.

Yes, there are imperfections ... and we can use stop words to make the results much better. Again this is a very simple idea. We've not needed to do anything very advanced at all .. all these ideas could be understood by a school student.


Minimum Word Length 5

Before we finish, let's try a rather brutal but effective method to improve the word clouds ... ignore any word that is less than 5 letters long. The idea is that most important words are longer than 4 letters, and most stop words will be short.
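The length filter can be sketched in a couple of lines. The sample words are made up for illustration, and min_length is just an illustrative name.

```python
from collections import Counter

# Made-up sample words standing in for the real text.
words = ["heat", "the", "olive", "oil", "with", "chopped", "garlic", "then", "olive"]

min_length = 5  # ignore any word shorter than this

# Keep only words with at least min_length letters.
long_words = [w for w in words if len(w) >= min_length]

tally = Counter(long_words)
print(tally.most_common())
```

Notice that a short but useful word like "oil" is dropped along with "the" and "then". That's the brutal part of this method.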

Here are the much more interesting results.... enjoy!






Update

Lowercase All Text

After I published the post I started to think about cleaning up the words a bit more. It seemed to me that the following words:

Wordcount  WordCount  wordCount  WOrdcount

would be considered as separate words by our code, because .. well .. they are different. We humans might consider them to be the same, and we'd want any word count to treat them as the same word too.

One easy way to do that is to force all letters to be lowercase .. which will have some downsides (human names, code-names or case file identifiers might be case sensitive) .. but overall it will help improve our aim of distilling out the most important themes in some text.

Here's the code to do it in Python .. again, super easy:

# lowercase words
words[:] = [w.lower() for w in words]
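To see the effect on the counts, here is a small sketch using the four example spellings from above. The names before and after are only illustrative.

```python
from collections import Counter

words = ["Wordcount", "WordCount", "wordCount", "WOrdcount"]

# Without lowercasing, each spelling is tallied separately.
before = Counter(words)

# The same in-place lowercasing as above.
words[:] = [w.lower() for w in words]

# Now all four spellings fall into a single tally.
after = Counter(words)

print(len(before), "different words before,", len(after), "after")
```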


And here are the resultant word clouds ... you can see some changes have happened. For example "olive" is much more prominent now, perhaps because before its count was split across differently capitalised versions of the word. For the Iraq Inquiry, the words "military" and "security" are very much more prominent too, which you would expect.







Monday, 11 July 2016

Clinton Emails and Chilcot Iraq Inquiry Report in Plain Text

It's always much more interesting to explore data sets that are interesting themselves. 

So I've converted two "hot topic" data sets into plain text:

  • The Chilcot Iraq Inquiry Report into whether it was right to go to war, and whether the war and its aftermath could have been better planned for.
  • Hillary Clinton's use of a personal email server for official business led to controversy. A redacted set of emails was released, and a version is at Kaggle.

The Iraq Inquiry report is in PDF form which is not ideal for text analytics. I've extracted the text using the open source "pdftotext" utility, with an attempt to preserve the text flow layout.

The Clinton emails are provided as an sqlite database or as a CSV file. I've extracted the "RawText" because the provided ExtractedBodyText hasn't worked in some cases. The plain text files are named with the DocumentNumber.
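The extraction from the CSV file could look something like the sketch below. It is not the exact script used for the conversion. The column names DocumentNumber and RawText are the ones mentioned above, but the function name export_raw_text, the file name "Emails.csv" and the output folder name are assumptions for illustration.

```python
import csv
from pathlib import Path

def export_raw_text(csv_path, out_dir):
    """Write the RawText of each email to <DocumentNumber>.txt in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader lets us pick columns by their header names.
        for row in csv.DictReader(f):
            name = row["DocumentNumber"].strip()
            if not name:
                continue  # skip rows without a document number
            (out_dir / f"{name}.txt").write_text(row["RawText"], encoding="utf-8")
            count += 1
    return count

# Example call; "Emails.csv" and "clinton_txt" are assumed names.
# export_raw_text("Emails.csv", "clinton_txt")
```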

Here are the links on github:

I may update the Iraq Inquiry Report to also include the additional evidence documents.

Have fun!

Thursday, 23 June 2016

Letters, Words or Documents?

The Natural Language Challenge

Trying to extract meaning from unstructured natural language text is different from, if not more challenging than, working with structured numbers or labels.

There are several reasons for this.
  • There is no strict clear structure in natural language that gives meaning - no columns or rows, lists or arrays, column headers or field labels.
  • Even the small amount of structure (grammar) that exists in natural language is often broken. Human languages have exceptions, ambiguity, multiple spellings, idiomatic phrases, regional expressions, ... and even more difficult things like sarcasm and irony.


    That "I ain't done nuffin'" could mean the blue bird has not done nothing, that is, he's done everything!

    Natural language wasn't designed to be an efficient, unambiguous and precise language that we could efficiently compute with. 

    It grew organically over time, with more chaos than people sometimes expect. It was only about 400 years ago that the spelling of common English words started to settle down.


    Messy Ambiguity

    Have a look at the following piece of natural language:

    Fruit flies like a banana.


    What does it mean? Fruit flies, a kind of insect, like a banana as a meal? Fruit, when thrown, has the aerodynamics of a banana ... and lands with a splat!? The meaning is ambiguous.

    Do we give up? No! We accept the reality of natural language, and try to develop algorithms that are useful, good enough, even if occasionally the messiness of human language breaks them.


    What Do We Compute With?

    Ok, if we are going to forge ahead and try to compute with natural language, what are the things we manipulate and do calculations with?

    As we said at the top, for structured data this is much easier. Numbers are easy to calculate with. We can sum them, find averages, cluster them, etc. Structured text is easy enough too - the names of candidates people vote for, the class names of flowers in the Iris dataset, the names of regions ... and so on. We can count them, create sets of them, and sometimes order them or group similar ones together. In all these cases, each item of structured data is a number or a label, and each has a precise meaning.

    It is worth asking what the unit of computation for natural language text should be. Is it a letter? A word? A sentence? A paragraph? A document? A collection?

    We could answer this by looking at what computers actually do themselves. Computers are nothing more than a bunch of electrical switches. The electricity flow is either switched on or off - which naturally represents 0 and 1 in binary numbers. You'll remember these from school, binary 001 is one, 010 is two, 101 is five.

    The letters you see on your screen as you read this, are represented inside your computer or smartphone using a character numbering code that has been in place since the 1960s, called ASCII. In this scheme A is 65, B is 66 and lowercase z is 122. You can see all of the characters at ascii-code.com. If you watch traffic on your network, you'll see these codes flying past as they travel to and from the internet. (Don't do this without permission on a network that isn't yours!)
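You can check these binary numbers and character codes yourself in Python. This is just a small sketch using the built-in int(), ord() and chr() functions. Strictly, ord() gives Unicode code points, but for these plain English letters they are the same as the ASCII codes.

```python
# Binary strings converted to ordinary numbers.
for bits in ["001", "010", "101"]:
    print(bits, "=", int(bits, 2))

# Character codes for the letters mentioned above.
for letter in ["A", "B", "z"]:
    print(letter, "=", ord(letter))

# The codes behind a whole word, and back again.
codes = [ord(ch) for ch in "Word"]
word = "".join(chr(code) for code in codes)
print(codes, word)
```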

    If ASCII characters are how computers store and transmit text, then maybe we should use letters as the basic unit for computing with natural language?


    Let's remind ourselves of what we're trying to do. We're trying to extract meaning from natural language text. When we humans read text, we don't understand a word until we've seen all the letters of it. In fact, our minds can correct the spelling of a word because internally we refer back to an existing notion of what the whole word should be. This suggests it is whole words, not letters, that should be the unit of computation.

    Doing simple tasks like changing all the letters to lower case can be done letter by letter. But that's because that task doesn't care what the words mean.

    So we're arriving at the conclusion that words are the smallest unit of computation for text mining.


    We said smallest unit quite carefully there. In some cases, we humans can't understand the meaning of a word without looking at the words around it. Have a look at the following, which illustrates homonyms - words which are written the same but mean different things.


    We can only tell the meaning of the word saw, based on the words around it. They tell us that the first saw is a cutting tool for wood, and the second saw is the past tense of the verb to see. Similarly, the first branch is a part of a tree, and the second branch is a local establishment of a larger bank organisation.

    So ... does all this mean we were wrong to think that the unit of computation is a word on its own?

    No. Like many things in natural language text mining, the theories are good up to a point. So for us, using a word as the basic unit is good for many cases, but will break down when words need context. In that case, we need to take into account, somehow, the words around the word we're interested in.

    In some cases, even a phrase, or even a whole sentence won't make sense on its own, and we'll need to reach further and look at the whole paragraph or even the whole document!
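One simple way to look at the words around a word is a small window either side of it. The following is only a sketch of that idea, not a method we've settled on. The function name context, the window size of 2 and the sample sentence are all made up for illustration.

```python
def context(words, index, size=2):
    """Return up to `size` words on each side of words[index]."""
    start = max(0, index - size)
    before = words[start:index]
    after = words[index + 1:index + 1 + size]
    return before, after

# Made-up sentence using "saw" in two different senses.
sentence = "he picked up the saw and then saw the bird"
words = sentence.split()

# Show the neighbours of every occurrence of "saw".
for i, w in enumerate(words):
    if w == "saw":
        print(context(words, i))
```

The two occurrences of "saw" are the same word, but they come back with different neighbours. It is those neighbours that a later method could use to tell the meanings apart.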


    Conclusion


    • The smallest unit of computation for text mining is a word, not a letter.
    • Sometimes we'll have to extend beyond a word to establish meaning, looking at the words around it.

    Friday, 17 June 2016

    Theory, Model and Method

    This is my third book, and the experience so far makes it clear to me that text mining does not yet have as neatly coherent a conceptual framework as other fields.

    Loads of Tools, But No Shed

    A survey of the many guides will give you lots of methods for processing text to give you some insight into its meaning. There will be methods like

    • word frequency
    • document clustering
    • co-occurrence matrices
    • synonym searching
    • ... etc ... etc ..

    It really feels like there are lots of methods and tools on offer, but there doesn't seem to be an overall idea or theory which ties them all together. To put this another way, there is no larger conceptual framework to help us place each of these tools within it - so we can see which is appropriate to use, and when.

    Natural Language is Messy

    This isn't surprising - because the data, the natural language text, itself isn't a mathematically precise and consistent thing. Human language, and the way we use it, is a messy, organic, incomplete and inefficient scheme ... never designed for crisp complete perfectly precise computation.

    This leaves us with conceptual frameworks which fall into two camps:

    • probabilistic - ignoring any underlying structure and simply working with the likelihood of an answer based on "counting" how often it has previously happened
    • structural - trying to make use of underlying structure - either known true, or suspected true - to find answers

    Theory -> Model -> Method

    With this book, which is focused firmly on being accessible to those new to the subject, we won't present methods like the above in an ad-hoc fashion. That would be unsettling, and would not really give comfort that we know what we are doing.

    Instead we'll try really hard to follow a pattern for each idea:

    • Start with a Theory - an idea that we think is true, or even know to be true.
    • Use this theory to define a Model which is useful in a computational sense.
    • Explain the Method, or algorithm, that we use to do calculations with that model.


    This should be much better than simply throwing a load of methods at readers. Even if we fail to completely fit every tool into a nice perfect complete theory of natural language - we can at least explain what model a method is supposed to work with, and which theory (truth or assumption) that model is trying to reflect.

    This transparency also allows you to improve a method for a model, and come up with a better one. Or have more than one method for a model (eg gradient descent vs random search). It even allows you to disagree with the model for a theory, so you can come up with your own (eg frequent words vs rare words). It is even possible to have more than one theory, each one humbly not trying to be all-encompassing but targeting a specific part of natural language.

    The following shows these choices:


    Example

    Here's a simple illustration of the above idea.
    • Theory: The key concepts are mentioned a lot in a document.
    • Model: Frequency of words indicates the key concepts.
    • Method 1: Count the occurrences of words; the most frequent words are the most meaningful concepts.
    • Alternative Method 2: Count the words, but count their synonyms together with them.

    The second method is a refinement of the first method.
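The two methods can be sketched side by side. The sample words and the synonym map are made up for illustration. A real synonym list would need to come from somewhere else, such as a thesaurus.

```python
from collections import Counter

# Made-up sample words.
words = ["big", "dog", "large", "dog", "huge", "cat", "enormous"]

# Hypothetical synonym map: each word points to a chosen representative.
synonyms = {"large": "big", "huge": "big", "enormous": "big"}

# Method 1: count each word as it appears.
method_1 = Counter(words)

# Method 2: map synonyms onto one representative word before counting.
method_2 = Counter(synonyms.get(w, w) for w in words)

print(method_1.most_common(1))
print(method_2.most_common(1))
```

With Method 1 the top word is "dog". With Method 2 the idea of bigness, spread thinly over four different words, adds up and overtakes it.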

    You may agree with the theory but disagree with the model, and say that word frequency is not the best way to model it. You may say that words like "the", "and" and "or" will occur the most frequently, and these words don't indicate any concept at all. This may lead you to an alternative model, such as words which appear rarely in a paragraph, but do appear in many paragraphs - this counteracts the negative effect of boring words like "and".

    You may even disagree with the theory ... :)


    Conclusion

    Instead of throwing a large number of methods willy-nilly at the reader, we'll try to be more disciplined and transparent about what model, and which theory a particular algorithm is intended for.

    Wednesday, 8 June 2016

    Hello World!

    This is the blog that will follow the progress of Make Your Own Data Mining Toolkit.

    Just like Make Your Own Neural Network and Make Your Own Mandelbrot, the aim is to take a very gentle journey through the ideas and mathematics, to make this very cool topic as accessible as possible.

    We'll cover simple ideas like indexing for search, then maybe onto things like sentiment analysis and clustering, ... and we'll get to really powerful ideas such as searching for related information and even learning from text.

    We're going to have fun! ...