Saturday 15 September 2018

Stop Word Lists - Use With Care

Stop word lists are lists of words which are removed from text before further analysis because they are considered uninformative, adding only noise and volume to text data.

Many practitioners simply download a list and use it without further thought.

Several open source tools come with stop word lists built into the software - popular tools such as nltk, scikit-learn, spacy and gensim.

The following shows a family tree of stop word lists and how we think they are related:



Problem 1

I came across a paper which explores both the quality issues of stop word lists in open source tools, and the unsuitability of some lists for the further processing you might be doing.




The paper explores how:

  • Some lists are much shorter or longer than others - with no documentation about how the stop word lists were created. Various methods exist, including manual construction, word frequency, or word tf-idf measures.
  • Some lists have spelling errors - eg fify / fifty - probably as a result of manual intervention or refinement.
  • Some lists have very obvious words missing - again without explanation of how the list was intended to be created. Scikit-learn contains has but not does, very but not really, and get but not make.
  • Some lists have words that are surprisingly included - the scikit-learn list contains the words system and cry, for example, without explanation.


We should ideally only use stop word lists if we understand the rationale and method by which they were constructed. Too often this documentation is missing.


Problem 2

Another deep problem is that stop words need to match our method for tokenising text data. Consider the following:

  • If our tokeniser breaks isn't into isn and t, then the words isn and t should be in the stop word list. 
  • If our tokeniser converts isn't into isnt then our stop word list should contain isnt.


You can see that if there is a mismatch between the stop words and how the tokeniser works, then our stop word filtering won't work.
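As a minimal sketch of this mismatch - the naive regex tokeniser and the tiny stop word set below are purely illustrative, not taken from any particular library:

```python
import re

# a naive tokeniser that splits on anything that isn't a letter,
# so "isn't" becomes "isn" and "t"
def tokenise(text):
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

# a stop word list written with the apostrophe kept in
stop_words = {"the", "a", "isn't"}

tokens = tokenise("The answer isn't here")
filtered = [t for t in tokens if t not in stop_words]

print(tokens)    # ['the', 'answer', 'isn', 't', 'here']
print(filtered)  # ['answer', 'isn', 't', 'here'] - the fragments slip straight through
```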

Again, many open source stop word lists don't include this description as part of their documentation.


Conclusions

There are two conclusions:

  • We need to understand the rationale and method by which stop word lists are constructed - to explain why some words appear and some don't. This documentation is often missing for open source lists.
  • We need to match the tokenisation method with the stop words. Again, this documentation is often missing.



Tuesday 31 July 2018

Mind Map of Intuitive Text Mining

I've spent some time over the last month thinking about what content should be in the new book Intuitive Text Mining.

This blog post is a way for me to organise my own thinking - and for readers to provide suggestions for improving it.

Get in touch by email intuitivetextmining at gmail dot com or @intuitivetext.

Click the following mind map to enlarge:



New and Old Ideas

I want to keep things as simple and easy as possible, but still introduce the reader to some of the cool modern ideas - like the application of neural networks, comparative vector embeddings and dimension reducing visualisation - that have fuelled the resurgence of text mining.

Many of the older traditional texts don't cover these cool ideas, so there is a gap in the market.

Dan Jurafsky and James Martin's upcoming book Speech and Language Processing 3rd Edition is looking like an excellent book - you can get a preview here: https://web.stanford.edu/~jurafsky/slp3/


Part 0 - Introduction and Challenge


Big Data, Bigger Text Data
I'll introduce the idea of a world overflowing with data, and that historically a lot of the algorithms developed are for structured numerical data, not the unstructured natural language text data which - although messy, imprecise and ambiguous - reflects the breadth of our human cultural experience - from grand literary heights to informal chit-chat, from important evidence in legal cases to collections of scientific knowledge, and even writing going back thousands of years into history.

There are huge rewards for anyone able to extract the meaning, patterns and insights from this messy ocean of text data.


Measuring Similarity and Meaning
I'll use an example to illustrate directly the challenge of analysing text by asking:

  • which of these three things are closest to each other - 1.1, 1.5 and 9.3?
  • which of these three things are closest to each other - apple, orange, cake?

We humans can easily find the answer, but working out how to get a computer to do the same pinpoints the challenge of working with natural language text. What's the distance between apple and cake? Why is it bigger than the distance between apple and orange?

There are deeper challenges. What does "we saw her duck" mean - does it mean we saw her lower herself, or that we saw her water-loving bird? Humour and sarcasm add even more difficulty.


Unsupervised vs Supervised vs Expert Systems
Because natural language is so difficult, many algorithms for working with it are:

  • supervised - a human who understands language laboriously labels data to train a machine learning algorithm
  • expert systems - make use of human knowledge of language to create rules that go a long way to deconstructing natural language - parsing sentences into grammatical elements, or stemming words to a common root, for example.

I don't like these approaches. They are often a practical necessity, but ultimately they involve humans, which means the process of analysing text is not as automatic as it could be. Humans make errors, and people have subjective and inconsistent interpretations of text, making them an unreliable oracle. Also, many of the "rules" developed as expert systems are not perfect, sometimes far from perfect. Stemming words is a good example of what I don't like. Sentiment analysis is the cliched application of text analytics but much of it relies on rules created by humans or human-labelled training data. So in the book I'll acknowledge this and focus on unsupervised methods as much as possible.

A huge benefit of unsupervised methods is that they can be used in new domains, or even different human languages, because they don't make the assumptions or require as much human preparatory work as supervised and expert methods do. Try using the English Porter stemmer on French. Also, for very large data sets, any human involvement becomes impractical.


Theory > Model > Method
Because natural language isn't precise, consistent and unambiguous, any method of working with it is necessarily going to be based on assumptions and simplifications.

Rather than pretend we didn't make assumptions and simplifications, we should be very clear about them. This way we can revisit these assumptions and perhaps improve them.

More than that, we can separate out the theory (assumptions, simplifications, suppositions) from the method we use to implement that theory. This way we can improve or try different methods and keep the same theory. Actually we can separate that out even further, by thinking about the model (data structures) which best represents the theory, and then detailed method (code) that works with that model. This means someone else could come up with a better model, or better code for the same model.

Everyone talks about vector word embeddings .... what are the assumptions, simplifications and suppositions behind why we think they work?


Approach
The book isn't about learning tensorflow or nltk or any other library or toolkit. The firm focus of the book is to understand how and why we do text mining. At the very least readers will develop an intuition for what the popular algorithms are doing.

Writing our own small code to implement the ideas is a good way for them to sink in, and also for us to understand their limits. In effect we'll develop our own text mining toolkit, but that will be a purely educational process, not something that replaces the much more robust and popular toolkits you might use for real projects.

The focus of the book won't be to teach a programming language, unlike my previous books which have devoted lots of space and time to that.

We'll use Python because it is easy, popular and pretty much the de-facto toolset for data science. Python code is really easy to read, unlike many other languages, and this is important for anyone trying to understand what's going on.

We'll need an extension to Python to work with arrays of data, and numpy is the mature standard. Pandas wraps more convenience around numpy.

Something I still need to investigate is how to process large amounts of data. Do we scale up and use a big machine in google's compute cloud? Or do we scale out and use lots of nodes, which only works for algorithms that can be parallelised? Dask seems to be the standard in the python data science toolset for doing this.


Part 1 - First Ideas


Word Frequency & Important Words
We'll start with the simplest possible form of text mining. We'll take a passage of text and ask ourselves what it is about - without reading the whole passage. Even better - how would we get a computer to tell us what the text passage is about, without displaying the whole text for us to read?

A really simple idea is that the most frequent word, or words, are going to be the most important words. Instead of reading 100, 1000, or even 10,000 words, we can just read the top 3 most frequent words, and this should give us an idea of what the text was about.

Although the idea is not perfect, it is very powerful given its simplicity - a feature worth celebrating about algorithms.

In developing the idea of word frequency, we'll need to maintain a count of how often words occur .. which leads us to the idea of an index.

We'll tackle the issue of word count vs frequency which is normalised for document length. A simple idea but the notion is applicable to more sophisticated ideas later.
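A minimal sketch of this counting and normalising step, using Python's collections.Counter on a made-up snippet of text:

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
words = text.split()

# raw word counts - our first, very simple index
counts = Counter(words)

# frequency normalised for document length, so short and long
# documents can be compared fairly
total = len(words)
frequencies = {word: count / total for word, count in counts.items()}

print(counts.most_common(2))   # [('the', 3), ('cat', 2)]
print(frequencies['the'])      # 0.3
```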

We'll start to explore visualisation as a way of understanding the results of a text mining process. A simple visualisation of frequent words is a word cloud.

This is also the point at which we can start to think about multiple text passages (documents) in a collection (corpus) and applying the algorithm to the whole corpus and visualising the results.


Stop Words - Yuk!
We'll look at an obvious weakness of this simple algorithm - words that are frequent but not informative. We'll explore excluding these stop-words, but also discuss why this isn't a purely unsupervised approach - and how we'll come back to this later with the idea of word relevance.


Search
We'll look at the idea of finding things in our text - search - and think about how we might make our own simple search engine. Thinking about how we might do it ourselves using a pencil and paper, we'll introduce the idea of a reverse index - just like the index we might find at the back of many books.
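A minimal sketch of such a reverse index, mapping each word to the set of documents it appears in - the tiny corpus is made up for illustration:

```python
from collections import defaultdict

# a tiny made-up corpus of documents
corpus = {
    0: "apple pie recipe",
    1: "apple and orange juice",
    2: "carrot cake recipe",
}

# reverse index: word -> set of document ids containing that word
index = defaultdict(set)
for doc_id, text in corpus.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# a simple search is then just a lookup
print(index["recipe"])   # {0, 2}
print(index["apple"])    # {0, 1}
```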

We'll experiment with simple searches, and think about how to order the results, because some clearly are better than others. We can use the idea of word frequency as a very simple way of ordering (ranking) search results.


Co-occurrence
After the very simple ideas so far, we'll increase the sophistication just a little bit and look at words that occur together - co-occurrence. The idea is that instead of looking at individual word frequency, we look at word pairs, which might be more informative. Worth a try...
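A minimal sketch of counting adjacent word pairs - the text is made up, and a real corpus would be far larger:

```python
from collections import Counter

text = "apple pie and apple juice and apple pie"
words = text.split()

# count pairs of words that sit directly next to each other
pairs = Counter(zip(words, words[1:]))

print(pairs.most_common(2))
# [(('apple', 'pie'), 2), (('and', 'apple'), 2)]
```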

Co-occurrence is an ideal point to introduce a new form of visualisation - a graph of connected nodes. Graphs are great for quickly understanding data where each item is linked in some way to many others.

We'll also look at extending co-occurrence beyond words that are right next to each other to words that co-occur within a longer window. The words apple and orange co-occur, but often with another word between them.

We'll also look at co-occurrence as a different way of doing search. If we search for the word apple, should we also find documents with the word orange (but not apple)? That's a sophisticated idea!


Exploring Language Use
This sophisticated idea of co-occurrence allows us to understand how language is used. It can help us discover new terms for themes we're interested in. For example, we might discover the term arachnid which co-occurs with words that also co-occur with spider. Visualising co-occurrence with graphs makes this exploration much easier than looking at lists of numbers - and demonstrates the power of graphs.


Generating Sentences
We'll have some fun using co-occurrence to automatically generate sentences. We'll see how we do need those boring/stopwords to make sentences more realistic.


Part 2 - More Interesting Ideas


Relevance
In Part 1 we used the idea of word frequency as a very simple and surprisingly effective way of identifying the important themes in some text. But there was a major weakness - that many frequent words are boring, not interesting, and don't convey any insights. Words like the, and, is. We previously hacked a solution by simply excluding them - and that wasn't an unsupervised method.

We'll introduce the idea of balancing words that are frequent in some documents with words that appear in all documents and so are likely not to be that interesting - term frequency vs inverse document frequency, TF-IDF. This isn't perfect but an improvement.

We'll start with our own simple ratio of frequency TF / DF, and then see why some people use logs to take into account probability and information content (Shannon).
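A minimal sketch of both versions on a made-up corpus - the plain TF/DF ratio first, then the more usual logarithmic variant (this is just one of several common tf-idf formulations):

```python
import math

corpus = [
    "the cat sat on a mat".split(),
    "the dog sat on the log".split(),
    "my cat likes the dog".split(),
]

def relevance(word, document, corpus):
    # term frequency: how often the word appears in this document
    tf = document.count(word) / len(document)
    # document frequency: how many documents contain the word
    df = sum(1 for doc in corpus if word in doc)
    # our simple ratio, and the common logarithmic version
    simple = tf / df
    with_log = tf * math.log(len(corpus) / df)
    return round(simple, 3), round(with_log, 3)

print(relevance("the", corpus[0], corpus))  # appears everywhere - (0.056, 0.0)
print(relevance("cat", corpus[0], corpus))  # rarer across documents - (0.083, 0.068)
```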

We'll also try to apply the idea to words frequent in a smaller window vs appearing in the entire document - something I haven't seen in other text books.


N-Grams
We'll reconsider the basic unit of computation, which we assumed was the word. Perhaps pairs of words, 2-grams, are better? Or 3-grams .. or n-grams captured using a sliding window. Intuitively, pairs of words would better capture meaning than isolated words. We'll experiment to see if this works - and regenerate things like word clouds and relevance rankings.
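A minimal sketch of capturing n-grams with a sliding window - the ngrams helper here is our own, not from any library:

```python
def ngrams(words, n):
    # slide a window of size n across the list of words
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the quick brown fox jumps".split()

print(ngrams(words, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
print(ngrams(words, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```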

We'll discuss what's too small an atom - a letter? And what's too big - a whole sentence? We'll approach the idea of the smallest units that can be compared.


Co-Occurrence Sliding Window
Following the idea of n-grams to look beyond one word, we'll revisit the powerful idea of co-occurrence and recognise that words which appear close together are often related even if they aren't directly next to each other. We can use a sliding window to capture co-occurrence that happens beyond the immediate neighbour.

We'll also discuss how a hard cut-off window would exclude words that should really be considered. We'll try the idea of a sliding window which decays the relative importance of words the further apart they are.
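A minimal sketch of windowed co-occurrence with a decaying weight - the 1/distance decay is just one illustrative choice, not a claim about the best scheme:

```python
from collections import defaultdict

def cooccurrence(words, window=3):
    # weight each co-occurring pair by 1/distance, so words further
    # apart contribute less than immediate neighbours
    weights = defaultdict(float)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            distance = j - i
            pair = tuple(sorted((word, words[j])))
            weights[pair] += 1.0 / distance
    return weights

words = "we like apple and orange juice".split()
for pair, weight in sorted(cooccurrence(words).items(), key=lambda x: -x[1]):
    print(pair, round(weight, 2))
```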


Similarity
We'll ask an apparently innocent question - what is similarity? What makes two words similar? Our discussion up to this point suggests that words which share commonly co-occurring words are probably similar. "You shall know a word by the company it keeps" is the famous quote.

After some calculations by hand, we'll see that we can use vector calculations to work out the similarity between two words using co-occurrence matrices. This will be the introduction to words as vectors, embeddings.
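A minimal sketch of that vector calculation - cosine similarity between co-occurrence vectors, using numpy, with little co-occurrence counts made up for illustration:

```python
import numpy as np

# made-up co-occurrence counts against four context words: juice, pie, fur, bark
apple  = np.array([4, 3, 0, 0])
orange = np.array([5, 2, 0, 0])
dog    = np.array([0, 0, 3, 4])

def cosine_similarity(a, b):
    # cosine of the angle between the vectors: 1.0 means the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(apple, orange))  # high - they keep similar company
print(cosine_similarity(apple, dog))     # 0.0 - no shared context words
```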

We can create nice visualisations using force-directed graphs to link similar words, and use stronger/closer links for very similar words.


Clustering
Grouping similar things is a common technique in data mining. Our use of force-directed graphs means we don't necessarily need to perform clustering as it happens as part of the graph visualisation. However, the process of exploring a graph is manual - and a more unsupervised approach is to generate a number of clusters.

We'll look at hierarchical and k-means clustering. The k-means method isn't perfect and has weaknesses - it doesn't know the right number of clusters, and struggles with non-discrete non-circular clusters, which we'll solve later using methods which reduce dimensions and try to encourage discrete grouping.
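A minimal sketch of the k-means idea in plain numpy - repeatedly assign points to their nearest centre, then move each centre to the mean of its points (a serious implementation would handle initialisation and convergence more carefully):

```python
import numpy as np

def kmeans(points, k, iterations=10, seed=42):
    rng = np.random.default_rng(seed)
    # start with k randomly chosen points as the cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # assign every point to its nearest centre
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        for i in range(k):
            if np.any(labels == i):
                centres[i] = points[labels == i].mean(axis=0)
    return labels, centres

# two made-up blobs of 2-dimensional points
points = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centres = kmeans(points, k=2)
print(labels)
print(centres)
```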


Part 3 - Even More Interesting Ideas


Topic Extraction and Summarisation
In Part 3 we'll look at slightly more advanced topics. The ability to extract out key - but distinct - themes from a body of text is clearly useful. We'd be able to quickly see what the main topics in a large set of text were without having to read it all.

We'll look at the idea of dimension reduction which forces the high dimensional text data onto a smaller dimensional space, and in the process loses lots of information .. but hopefully keeps the most important features. There are several ways of doing this, and we'll talk about Singular Value Decomposition (SVD) as a way of doing topic extraction fairly easily. We'll compare with similar ideas mentioned often in other guides, ideas like Latent Semantic Indexing (LSI) and principal component analysis (PCA).
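A minimal sketch of SVD on a tiny made-up word-document count matrix using numpy - real topic extraction would work on much larger, usually tf-idf weighted, matrices:

```python
import numpy as np

# rows are words, columns are documents, values are made-up counts
words = ["apple", "orange", "juice", "dog", "cat", "fur"]
counts = np.array([
    [3, 2, 0],   # apple
    [2, 3, 0],   # orange
    [4, 1, 0],   # juice
    [0, 0, 3],   # dog
    [0, 1, 4],   # cat
    [0, 0, 2],   # fur
])

# SVD factorises the matrix into word-topic (U), topic strength (S)
# and topic-document (Vt) parts
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

print(S)  # singular values - how strong each "topic" is
for topic in range(2):
    # words that load most heavily onto this topic
    top = np.argsort(-np.abs(U[:, topic]))[:3]
    print("topic", topic, [words[i] for i in top])
```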


Dimension Reduction and Visualisation
Text data, if we think about it in terms of co-occurrence or word frequency, is naturally high dimensional because we have lots of possible terms (words) to consider relationships between.

The task of reducing the massive dimensionality of that data is important, not just for extracting topics, but more generally to explore patterns in that data. Visualisation is key - we can't easily visualise very high dimensional data, which means we can't easily understand it.

We'll look at the popular t-SNE (and successor) methods for squishing data into 2 dimensions, losing the correctness of location in the process, but preserving the notion of nearness and distance between data points.
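A minimal sketch of the call shape using scikit-learn's t-SNE implementation - the data here is just random noise standing in for word vectors, and perplexity must be smaller than the number of points:

```python
import numpy as np
from sklearn.manifold import TSNE

# pretend these are 100 word vectors of 50 dimensions each
vectors = np.random.rand(100, 50)

# squash down to 2 dimensions, trying to keep neighbours near each other
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
points_2d = tsne.fit_transform(vectors)

print(points_2d.shape)  # (100, 2) - ready to plot as a scatter chart
```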

We'll explore the use of self-organising maps (SOM), a form of neural network training, to see how they can effectively reduce the dimension of data. We'll see that for smaller data sets the results are similar to k-means clustering, but for larger more complex data, the method preserves topological relations.


Topological Data Analysis and Logical Structure
Continuing the theme of analysing high dimensional data, we'll look at topological data analysis (TDA).

This is a fairly new field, or at least a new use of the mathematics, and there are notable examples of its application to text data. I'll have to learn about it first!

The promise of discovering logical structure in data is intriguing - we'll see if it emerges. An example of a logical hierarchy is this - a dog is a kind of animal, and a cat is a kind of animal. Being able to discover such structures in text would be really useful, especially if we could do it without the very-supervised sentence parsing and entity extraction methods.


Which Features? Which Relations?
Having worked through several methods for analysing data, we'll return to the basic question of which features to use as basic atoms.

Why do we use words as the basic unit of text analysis? What is so special about using a document as the first partition of a body of text? Are n-grams the ideal unit, or skip-grams perhaps?

Which relationships on those atoms are most useful? Frequency and co-occurrence are by far the most popular, but are there others? Perhaps a different approach would help us better separate the interesting atoms from the boring ones?


Part 4 - Fun Projects


Text Generation
We'll have some fun using the machine learned models we've come across to generate text. We'll look at everything from very simple Markov chains, which simply model the probability of the next word in a sequence, to more sophisticated neural networks which learn sequences.
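A minimal sketch of a first-order Markov chain generator - learn which words follow which, then sample a chain (the training text is tiny and made up, so the output will be suitably silly):

```python
import random
from collections import defaultdict

text = "the cat sat on the mat and the dog sat on the log"
words = text.split()

# model: for each word, the list of words seen immediately after it
follows = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    follows[current].append(nxt)

# generate by repeatedly sampling a next word from the model
word = "the"
generated = [word]
for _ in range(8):
    if word not in follows:
        break
    word = random.choice(follows[word])
    generated.append(word)

print(" ".join(generated))  # e.g. "the cat sat on the dog sat on the"
```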

If we can, we'll also look at the generative adversarial network (GAN) approach which pits one generative network against a classifier trying to outwit it. The results from other domains, such as image synthesis, are particularly impressive (but pixel/picture errors are more easily forgiven in that domain).


Interesting Data Sets
We'll apply some of the techniques we've developed to interesting data sets, such as the Hillsborough Independent Panel report, or the Iraq War Inquiry's Chilcot Report. We'll uncover interesting links within the text, as well as extract key topics - hoping some of them are valid but perhaps surprising.


Other Languages
As a test of our unsupervised approach, we'll apply our methods to another language. Spanish is a good one (because I know a little bit).

To test our ideas even further, we'll apply them to old Spanish, perhaps Don Quixote, again to see if they work, and to see if we can discover anything interesting.


Part 5 - Appendices

We'll have appendices gently explaining the mathematics of simple and more involved ideas we've used in the book, including:

  • logarithms for probabilistic scaling (tf/idf)
  • SVD matrix decomposition
  • k-means clustering and hierarchical clustering
  • SOM map generation
  • neural network training
  • t-SNE method