Saturday 15 September 2018

Stop Word Lists - Use With Care

Stop word lists are lists of words that are removed from text before further analysis because they are considered uninformative, adding only noise and volume to text data.

Many practitioners will simply download a list and use that list without further thought.

Several popular open source tools, such as nltk, scikit-learn, spacy and gensim, come with stop word lists built into the software.

The following shows a family tree of stop word lists and how we think they are related:



The Problem 1

I came across a paper which explores both the quality issues of stop word lists in open source tools and the unsuitability of some lists for the further processing you might be doing.




The paper explores how:

  • Some lists are much shorter or longer than others - with no documentation about how the stop words are created. Various methods exist including manual construction, word frequency, or word tf-idf measures.
  • Some lists have spelling errors - e.g. fify instead of fifty - probably as a result of manual intervention or refinement.
  • Some lists have very obvious words missing - again without explanation of how the list was intended to be created. Scikit-learn contains has but not does, very but not really, and get but not make.
  • Some lists have words that are surprisingly included - the scikit-learn list contained the words system and cry, for example, without explanation.


We should ideally only use stop word lists if we understand the rationale and method by which they were constructed. Too often this documentation is missing.
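
One way to get a feel for a list before trusting it is to run a few simple checks against it. The following is only a minimal sketch: sample_stop_words is a tiny made-up list (not the real list from any tool), and the word pairs and "surprising" words are hypothetical choices based on the examples above. In practice you would load the list your own tool actually uses.

```python
# A tiny, made-up stop word list for illustration only.
# It is NOT the real list from scikit-learn or any other tool.
sample_stop_words = {"has", "very", "get", "system", "cry", "fify", "the", "and"}

# Hypothetical pairs of words we would expect a list to treat alike.
expected_pairs = [
    ("has", "does"),
    ("very", "really"),
    ("get", "make"),
    ("the", "and"),
]


def find_inconsistent_pairs(stop_words, pairs):
    """Return the pairs where exactly one of the two words is in the list."""
    return [(a, b) for a, b in pairs if (a in stop_words) != (b in stop_words)]


def find_unexpected(stop_words, surprising_words):
    """Return the surprising words that the list would silently remove."""
    return sorted(stop_words & set(surprising_words))


inconsistent = find_inconsistent_pairs(sample_stop_words, expected_pairs)
unexpected = find_unexpected(sample_stop_words, ["system", "cry", "model"])

print("Inconsistent pairs:", inconsistent)
print("Unexpected stop words:", unexpected)
```

Checks like these don't tell us why a list looks the way it does, but they do flag the places where we would want documentation before relying on it.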


The Problem 2

Another deep problem is that stop words need to match our method for tokenising text data. Consider the following:

  • If our tokeniser breaks isn't into isn and t, then the words isn and t should be in the stop word list. 
  • If our tokeniser converts isn't into isnt then our stop word list should contain isnt.


You can see that if there is a mismatch between the stop words and how the tokeniser works, then our stop word filtering won't work.

Again, many open source stop word lists don't say which tokenisation method they assume as part of their documentation.
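
The mismatch is easy to reproduce. The sketch below assumes two deliberately simple, hypothetical tokenisers written with regular expressions (they are not the tokenisers of any particular tool) and two small made-up stop word lists, one suited to each tokeniser.

```python
import re


def tokenise_split(text):
    """Lowercase, then keep runs of letters, so isn't becomes isn and t."""
    return re.findall(r"[a-z]+", text.lower())


def tokenise_strip(text):
    """Lowercase, delete apostrophes, then keep runs of letters, so isn't becomes isnt."""
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]


# Made-up stop word lists, each written with one tokeniser in mind.
stop_for_split = {"this", "a", "isn", "t"}
stop_for_strip = {"this", "a", "isnt"}

text = "This isn't a test"

# Tokeniser and stop word list agree: only the informative word survives.
matched = remove_stop_words(tokenise_split(text), stop_for_split)

# Tokeniser and stop word list disagree: the fragments isn and t leak through.
mismatched = remove_stop_words(tokenise_split(text), stop_for_strip)

print(matched)
print(mismatched)
```

Nothing fails loudly in the mismatched case. The leftover fragments simply stay in the data as noise, which is why the mismatch is easy to miss.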


Conclusions

There are two conclusions:

  • We need to understand the rationale and method by which stop word lists are constructed - to explain why some words appear and some don't. This documentation is often missing for open source lists.
  • We need to match the tokenisation method with the stop words. Again, this documentation is often missing.