Saturday, 10 September 2016

Pandas DataFrame HDFStore Bug

Took me a whole week to narrow down and find this bug!


Some Context

We are using pandas dataframes as indices. The row labels are the words, and the columns are the documents the words occur in. The cell content is the word count, or relevance score, depending on which index we're working with.

We wanted to save the index by simply pickling it using pandas.to_pickle(). That failed for large dataframes.

So we chose the HDF format instead - it's more mature, designed for larger datasets, and officially supported in pandas.

That seemed to work ... until ...


The Bug

Saving dataframes into a HDF store is easy. Let's create a small dataframe representing one of our indices:

import pandas

# build a tiny index dataframe, one cell at a time
df1 = pandas.DataFrame()
df1.loc['apple', '001'] = 0.1
df1.loc['banana', '001'] = 0.2
df1.loc['apple', '002'] = 0.3
df1.loc['banana', '002'] = 0.7
df1.loc['nan', '001'] = 0.5

df1
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN


So we've created a super simple dataframe. It refers to the words "apple", "banana" and "nan".

Let's save it to a HDF store:

s = pandas.HDFStore('test')
s['index'] = df1
s.close()

That's nice and easy. The HDF file is called test, and the object inside the file is called index. You can have many objects in a single HDF store if you want to.

Let's exit python and restart it, to make sure we're truly bringing the dataframe back from the HDF file, and not accidentally picking it up from a variable still hanging around in memory.

Let's now reopen the store:

import pandas
s = pandas.HDFStore('test')
s
<class 'pandas.io.pytables.HDFStore'>
File path: test
/index            frame        (shape->[3,2])



Here we've opened the HDF file called test, and listed what's inside it. You can see it contains an object called index. Let's bring that into python as a dataframe.

df2 = s['index']
s.close()
del s
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
NaN     0.5  NaN


You can see the problem! The words "apple" and "banana" are fine, but the word "nan" has been turned into a NaN (not a number) .. it should be a string.

That's the bug!

And it leads to all kinds of problems .. like not being able to find the word "nan", and not being able to complete our relevance calculations. Even storing the index gives a warning, because Python complains that the index labels aren't orderable, and sometimes it tries to cast the NaN to a float.


Workaround

A temporary workaround is to force the index values to be recast as strings every time you retrieve an index back from an HDF5 store:

df2.index = df2.index.astype(str)
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN
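
To avoid repeating that line everywhere, the retrieval can be wrapped in a small helper. This is just a sketch of the workaround above - the helper name is my own, not part of our toolkit:

import pandas

def load_index_from_hdf(store_path, key='index'):
    # open the HDF store, pull out the dataframe, then close the store again
    with pandas.HDFStore(store_path) as store:
        df = store[key]
    # the workaround: force the row labels back to strings so "nan" stays a word
    df.index = df.index.astype(str)
    return df

# df = load_index_from_hdf('test')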



Fix?

I'm hoping the to_pickle() bug gets fixed too ... not just this error.

This bug has been reported on the github pandas issues tracker.

Wednesday, 7 September 2016

Speeding Up Indexing Performance

In implementing the new relevance indexing ideas from the previous post, I ran into trouble!



Sloooooow ...

The indexing for the small Italian Recipes data set was quick as a flash. The bigger Iraq Inquiry report took a few minutes ...

... but the Clinton emails took ages .. 12 hours and still less than half the documents had been indexed.


This is not good, especially if we want to experiment and change things quickly.



Optimising Indexing

On this journey we want to develop ideas, algorithms and code which prioritise clarity and simplicity. We don't want to get into overly-complex, deeply sophisticated stuff .. that's not the idea here. The idea is to understand the basics.

So for this reason, we started with an approach to indexing that was super simple:

  for every document in a corpus ...
      load previously saved index (if any)
      go through all the words in a document ...
      add words to index
      save updated index


That's conceptually simple: we can all understand picking up an index and updating it with content from new documents.

But you can also see that as the corpus gets larger, we're loading and saving an ever bigger index. In much of computing, doing stuff with files is slow and expensive.

Can we make this more efficient? With less file loading and saving?

Well, we can't avoid going through all the text content because we do need to index it.

But we can avoid loading and saving ever bigger index files to update them. Instead we index each content text file separately, and afterwards we merge the index files.

  for every document in a corpus ...
      go through all the words in a document ...
      save document index


  start with an empty corpus index
  for every document index in corpus ...
      load document index and join with corpus index
  save corpus index
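
Here's a rough sketch of that two-phase approach, using one pandas dataframe per document and HDF5 files for the intermediate indices. The file layout and details are made up for illustration - the real code is in the toolkit on github:

import glob
import collections
import pandas

# phase 1 - index each document separately, saving a small per-document index file
for document_path in glob.glob('corpus/*.txt'):
    with open(document_path) as f:
        words = f.read().lower().split()
    # one-column dataframe: rows are words, the column holds this document's word counts
    document_index = pandas.DataFrame({document_path: collections.Counter(words)})
    document_index.to_hdf(document_path + '.index.hdf', key='index')

# phase 2 - merge the per-document indices into a single corpus index
corpus_index = pandas.DataFrame()
for index_path in glob.glob('corpus/*.index.hdf'):
    document_index = pandas.read_hdf(index_path, key='index')
    corpus_index = corpus_index.join(document_index, how='outer')
corpus_index.to_hdf('corpus_index.hdf', key='index')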


There will still be loads of file loads and saves ... but the files won't be getting bigger and bigger.

There is a downside to this - we lose the ability to update an index. We now have to think about regenerating an index for an entire corpus. Given the speed up for large corpora, this is worth it.



Results - Zooooom!

This results in way faster indexing! Instead of hours and hours for only about half the Clinton emails .. they were all indexed in minutes.


The merging took longer, but not hours. This is to be expected as the corpus index does grow larger and larger, but we're no longer saving and loading it as we progress through every document.



The Code

The updated code for indexing in this new way is, as always, on github.



Pickling vs HDF5

During the coding it turned out that the simplest way of storing pandas dataframes in a file - known as pickling - didn't work for larger dataframes. This is a known bug, and seems to affect Python 3.5 on Mac OS X.

This was a perfect prompt to look for potentially better formats or ways of storing index dataframes. For now I've settled on HDF5 data files - a mature format. It might be more capable than we need, because it allows things like querying and concurrent access, but it's simple enough to use and works for larger dataframes.
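
For reference, the two approaches look like this in pandas - just a sketch, with made-up file names:

import pandas

df = pandas.DataFrame({'001': {'apple': 0.1, 'banana': 0.2}})

# pickling - the simplest option, but it failed for our larger dataframes
df.to_pickle('index.pickle')
df = pandas.read_pickle('index.pickle')

# HDF5 - more mature, and works for our larger dataframes
df.to_hdf('index.hdf', key='index')
df = pandas.read_hdf('index.hdf', key='index')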

    Wednesday, 31 August 2016

    Improving on Simple Word Counts (TF-IDF)

    Up to now we've used word counts - how often a word occurs in a document or an entire corpus - as a measure of how important or relevant that word is.

    We did it early on when we looked at word clouds to visualise the most important themes. And we did it more recently, when we tried to organise search results, trying to put the most relevant ones first.

    The assumption that a word that appears more often in a document (or corpus) is relevant or important makes intuitive sense ...

    .. but it also has some down-sides .. so let's look at a few, and see if we can make simple improvements. I say "simple" because we want to avoid overcomplicating things as much as possible if a simpler approach works well enough.



    Problem 1: Boring Words

    If you think about it, the most frequent words are boring words like "the", "and", "it", "is" .. and so on.

    On their own, they're not informative, they don't tell us what a document is about. If I gave you a mystery document to analyse, and you could only see the top five words: "is", "the", "it", "is", "to" ... you wouldn't be able to say what the document was about.

    If you remember the recipes example, the first 12 most frequent words were boring, uninformative words that had nothing to do with recipes at all.

    We attempted to fix this by removing such stop words. But that approach is a little unsatisfactory, because we have to manually create such a list, and we humans might disagree on which words should be included and which shouldn't.

    Also, we might spend ages crafting a perfect stop-word list for one set of documents, then find it was completely wasted effort for another set of documents, because we over-fitted it to the first set. So manual stop word lists are not adaptive - they don't adjust themselves to different sets of documents.

    So let's think again about how we might automate identifying such boring words. Let's start by thinking about how they behave, their characteristics:

    1. they occur often, and in all documents
    2. they're often short words, long words tend to have significant meaning


    That's all I can think of for now... but that's a powerful start already. Let's see why...

    The first observation that boring words occur often, and across all documents, is pretty powerful. It allows us to automatically identify them, and that identification can be different for different sets of documents. The set of boring words for Shakespeare's plays might be different to those for medical reports.

    But aren't we back to square one if a set of documents about sport all contain the word sport, frequently and in every document .. as you'd expect they might? Well, that second observation can help, because we know that many boring words are short.

    Anyway ... how would we encode this idea as a practical algorithm? Well, let's think out loud about what we're trying to express ...

    1. boring words occur often, and in all documents ... so this suggests we note all the documents the word occurs in, and the fraction $\frac{documents\ with\ word}{total\ number\ of\ documents}$ is a measure of boring-ness. A low score means the word is interesting because it is not liberally peppered everywhere in all documents.
    2. boring words are often short words ... suggests we apply a moderating factor to the above measure which penalises shorter words but rewards longer words. Maybe something simple like dividing by word length is enough? Normalising by the longest word length could suppress the effect for normal words if there was an outlier in the text.
    So a measure of boring-ness could be:

    $$\frac{documents\ with\ word}{total\ number\ of\ documents} \cdot \frac{1}{(word\ length)}$$

    That's boring-ness. If we want the opposite, interesting-ness, we subtract the document fraction from 1, and multiply by the word length rather than dividing by it:

    $$\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {(word\ length)}$$

    ... so that the first fraction ranges between 1 (interesting) and 0 (boring).

    By the way, subtracting from 1 avoids the problem of division by zero if we, instead, inverted the fraction, which many people like to do.

    Ok, ok!  ... all this isn't a precise mathematical derivation from axioms .. but that's the way many methods in text analytics have grown, because natural language itself isn't a mathematically precise algebra. We can fix these early ideas later if we find they don't work so well.



    Problem 2: Longer Documents Cheat Word Counts

    This problem is easy to see.

    Imagine a document about books. It will contain the word books, and we expect it will have a higher count of the word books than other documents that aren't focussed on the fascinating topic of books.

    But imagine, someone wanted to cheat our system of using word frequency counts ... and took our document and simply doubled it. That is, copied and pasted the text at the end of the same document, to make a new document, double the length, but with the original repeated twice.


    You can see in the diagram above that document 2 actually has the word book in every sentence, but the cheating document 1 doesn't. Should we reward this cheat? Should we really give document 1 double the word count for the word book?

    You can start to see that actually, document length can have an unfair bias on relevance.

    What to do about it? That's easy too! When trying to account for biasing factors, a very common approach is to normalise what we're interested in (frequency) with the factor that might be causing bias (document length).

    Here's how we would counter the biasing (cheating) effects of longer documents:

    $$normalised\ frequency = \frac{word\ count}{total\ words\ in\ document}$$

    For the above illustrated example, we'd have an updated measure for the word book:

    • document 1 normalised frequency = 4/44 = 0.091
    • document 2 normalised frequency = 4/17 = 0.235

    Now the second document comes out on top, with well over double the normalised frequency .. we fixed the bias!

    You can think of this normalised frequency as a way of measuring word density. Actually, if we're being pedantic, frequency is the right word for this normalised value .. we should have been saying count, not frequency, before.



    Combined Measure of Relevance

    We can combine the above two measures which aim to reflect word interestingness, and also counter the biasing effect of document length.

    The combined measure for the relevance of a word could be:

    $$
    \left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {(word\ length)}
    \cdot
    \left\{
    \frac{word\ count}{total\ words\ in\ document} \right\}$$



    A Small Refinement

    We should stop fiddling with that expression .. but there is just one more change that I want to make. Two of the three parts of that expression have values between 0 and 1. The only one that doesn't is $(word\ length)$.

    A common "squishing" function for squeezing in a function's range between 0 and 1 is the $tanh(x)$ function, shown below:


    You can see that $tanh(word\ length)$ stays 0 if the word length is 0. But as word length grows, $tanh(word\ length)$ grows towards 1, but never really reaches it.

    We do need to think about the scale, because we don't want all words of length 1 to 8 squashed down to some tiny value like 0.000001, or all pushed up close to 1. We want a normal word, of say 5 letters, to be mapped to about 0.8.

    After playing about with different scaling factors, we find dividing word length by 5 seems to be a good setting, giving words of length 5 a decent score, and penalising words of length 2.


    Here's a plot of $tanh(word\ length / 5)$ making clear how words of length 0 to 9 are mapped to the range 0 to 1.


    So our expression for relevance, which we'll stop fiddling with now, is ...

     $$
    \left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {tanh(\frac{word\ length}{5})}
    \cdot
    \left\{
    \frac{word\ count}{total\ words\ in\ document} \right\}$$

    Looks complicated, but it's only three factors multiplied together, representing:

    • inverse document frequency (the interesting-ness fraction)
    • a penalty for short words (the tanh factor)
    • term frequency (the normalised word count).
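
    To check the pieces fit together, here's a small sketch of this measure in plain Python - a toy corpus and function names of my own, not the toolkit's actual implementation:

    import math

    def relevance(word, document, corpus):
        # corpus is a dict of {document_name: list_of_words}
        docs_with_word = sum(1 for words in corpus.values() if word in words)
        interestingness = 1 - docs_with_word / len(corpus)       # 0 (boring) to 1 (interesting)
        length_factor = math.tanh(len(word) / 5)                 # penalise very short words
        frequency = corpus[document].count(word) / len(corpus[document])   # normalised frequency
        return interestingness * length_factor * frequency

    corpus = {
        'doc1': "the big banana cake needs a ripe banana".split(),
        'doc2': "the cat sat on the mat".split(),
    }
    print(relevance('banana', 'doc1', corpus))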



    Using A Word Relevance Measure

    How do we use a word relevance measure, like the one we've developed above?

    There are at least two ways we can use it:

    • Use it to identify the most important words in a document or corpus .. like we previously used word count to drive how big words were plotted in word clouds.
    • Use it to rank search results .... like we used simple word count before. If the search query contains more than one word, we simply add the measures for each query word found in the matching document.

    And there will be more, I'm sure ...



    Everyone Talks About TF-IDF

    What we've arrived at by ourselves is very similar to the popular and widely used measure of interesting-ness called Term Frequency - Inverse Document Frequency (TF-IDF).

    If you refer to most textbooks, or wikipedia, you'll see that most variants of TF-IDF contain 2 of our 3 factors. Many implementations also use $log(\frac{total\ number\ of\ documents}{documents\ with\ word})$ for word interesting-ness - that's fine, because it grows as the number of documents containing the word decreases, and is 0 when a word appears in every document. But it doesn't do what our measure does, which is stay in the range 0 to 1.



    Next Time - Results and New Code

    We'll have to try out these ideas to see how they work. The next post will cover this .. and also talk about a revised code implementation to do all these new things that the current very simple index based on word counts doesn't.

    Wednesday, 24 August 2016

    Indexing, Search and Relevance - From Scratch!

    A very common thing that people want to do with text is to try to find things in it. That's called search, and we're going to make our own search engine!

    Of course, we'll start with a very basic one, but the core ideas are the same across complex search engines.

    We deliberately won't be using a 3rd party library to do this - we'll do this from scratch ourselves using basic Python because that's the best way to learn.


    The Index at the Back of the Book

    Let's start with a familiar example. Almost everyone has used a book which has an index at the back. Here's an example:


    You can recall how easy it is to use. If I wanted to find the page(s) mentioning the word "ant" I'd follow the Ant entry along to find that it occurs on pages 4 and 5. Easy!

    Some things like "boat" only appear once in the book, at page 2. Other things like "banana" appear on several pages, 4, 5 and 8.

    So using the index is easy .. but how do we make an index in the first place?

    Let's stay with our book example ... we go through each word, in each sentence, on each page, and for every interesting word, we make a note of which page it appeared on. As we scan through the pages, we'll be building up the index. In the above example, we'll encounter the word "ant" on page 4, ... then again on page 5, so we'll update the index to add that second occurrence. By the time we've reached the end of the last page, we'll have a full index.

    We do have to examine every word on every page... which is a bit laborious and boring. Luckily, we have computers to do laborious work for us.


    Indexing Text

    So let's apply the same ideas we use for indexing a real book to our computer world.

    We will again scan through the text, word by word, and keep a note of where we found each word. The following illustrates this:


    You can see the word "saw" being scanned and entered into the index. You can also see the word "tree" being entered into the index too.

    But have you noticed a difference with the real book example? Instead of noting down which page we found a word on, we note down which document it came from. What's happening?

    We actually have a choice about which one we do. It depends on how useful noting down the page, or the document, will be. If we only have one document, a single book, then noting down the page is enough. Some documents don't have pages, and so we can't do that anyway. For now, let's keep things simple and just note which document in a collection a word was found in.


    Searching

    Once we've done the hard work to build an index, searching is super easy. We just look it up in the index!

    Using the example above, if we wanted to search for the word "tree", we'd look it up in the index .. follow the dots along .. and see that it was in "document 1". Easy peasy!

    Now some readers will say that search isn't that easy, and is in fact very complicated and sophisticated. Well, yes, it can be complicated and sophisticated ... but what we've just done is very simple, it works, and it is at the core of the more advanced methods.


    Relevance

    One problem that we will try to solve, even at this very early stage, is the problem of search result relevance.

    Imagine, doing a search for the word "banana", and the results coming back telling us that it is to be found in 258 documents. That's cool, and very thorough ... but it doesn't help us decide which documents might be relevant to us, which ones to start looking at first.

    This is actually quite a hard problem which nobody has solved perfectly .. but, we can apply a very simple idea that works quite well in sorting these 258 results in order of relevance.

    Let's walk through the thinking ourselves ... imagine documents that are actually about bananas. Then imagine documents that mention a banana, but only do so in passing, and aren't actually about bananas. It's a good bet to say that documents actually about bananas will mention the word several times .. many times.

    Maybe we can have a theory that says, the more a word is mentioned in a document, the more likely that document is about that word. Sounds reasonable enough!

    If we believe this theory, then we can sort our search results with the documents that had the word many times at the top. Even if we don't believe this theory is perfect, we can still imagine that sorting in this way will give good results on many occasions.

    So what do we need to change to our simple indexing and search steps above?

    Well, when we're indexing, we should keep track of each and every occurrence of a word, even if it was in the same document. This way we keep a count of how often a word appears in a document, not just the fact that it did. Look at the following, which illustrates this:


    You can see the word "owl" is noted in the index as appearing three times in document 1, and once in document 2. The word "tree" only appears once in both documents.

    If we did a search for the word "owl" you can see that putting document 1 above document 2 would be helpful.

    So we have a very simple, but fairly effective, way of ranking (sorting) the results by relevance.

    Some Simple Code

    Let's look at some simple code, and talk about it.

    Building the Index
    The following shows the Python code that creates the index. It's very simple .. it really is just 2 lines of code!!

    import collections

    # start with empty index
    index = collections.defaultdict(list)
    # update index
    # (word -> [document_name]) dictionary - there can be many document names in each list
    [index[word].append(document_name) for word in doc_words_list]

    So what's going on here?
    Let's start with the idea of a Python dictionary which allows us to store values and associate them with keys. Here's an example:

    d = {"John": "London", "Mary": "New York"}

    which associates John with London, and Mary with New York.  We can query the dictionary like this:

    d["John"]
    "London"

    You can see how a dictionary could work as an index .. instead of John we have the word, and instead of London we have the document where it exists. In fact we need a list because there may be many documents that a word exists in.

    So that second line, works through every word in the list of words supplied, and adds it to the index, using the word as the key, and the document name as the value. Just like a normal Python dictionary. So what's the append? Well, a normal Python dictionary would complain with an error if we tried to append a value if the key didn't exist. This is where the defaultdict comes in handy. It allows us to create a brand new entry with a blank list as a value, ready for us to append to. You can read more about Python's defaultdict here.

    That first line simply creates the empty index, and tells Python that we want missing entries to default to empty lists, rather than, say, numbers.

    All that in just 2 lines of elegant code!
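
    To make that concrete, here's a tiny worked example with made-up document names and words:

    import collections

    index = collections.defaultdict(list)

    # pretend we've scanned two small documents
    for document_name, doc_words_list in [('doc1.txt', ['owl', 'saw', 'tree', 'owl']),
                                          ('doc2.txt', ['owl', 'tree'])]:
        [index[word].append(document_name) for word in doc_words_list]

    print(index['owl'])
    # ['doc1.txt', 'doc1.txt', 'doc2.txt'] - owl twice in doc1, once in doc2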

    Querying the Index
    The following shows the code for querying the index.

    # do query
    matching_documents = index[search_query]
    # count occurrences in matching documents
    matching_documents_counter = collections.Counter(matching_documents)
    # return list of matching documents, ordered by those with the most occurrences
    return list(matching_documents_counter.most_common())

    The first line is easy .. we simply query our index using the search query string. Remember the index is basically a dictionary, so we provide the key, and get back the value ... which is the list of matching documents. Remember the reason why it is a list - a word could be found in many documents.
    The next line uses another cool tool from the Python collections toolset. The Counter() simply counts the number of times items appear in the list. In this case the matching documents. So if document 2 appears 5 times, we'd get a count of 5. Then we return that tally, ordered by the most common first. Simple!

    Here's an example output:

    tmt.index_search.search_index(cr.content_directory, "rice")
    [('06.txt', 5), ('05.txt', 4), ('04.txt', 1)]

    Here we've searched for the word "rice". The results show that it occurs in documents 06.txt, 05.txt and 04.txt. You can also see that it occurs 5 times in 06.txt, 4 times in 05.txt and only once in 04.txt.
    So the first result 06.txt is most likely to actually be about rice. Yummy!


    Github

    The full code is on github, as well as a notebook illustrating how these functions can be used.



    Enjoy!

    Monday, 22 August 2016

    N-Gram Word Clouds

    We've just developed code to count n-grams, and plot their frequency as word clouds.
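
    The counting idea is simple enough to sketch in a few lines - this is only an illustration, not the toolkit code itself:

    import collections

    words = "pour the tomato sauce over the pasta and add the grated cheese".split()

    # build 2-grams by joining each word with the one that follows it
    two_grams = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
    print(collections.Counter(two_grams).most_common(3))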

    So let's explore some text data sets to see what happens.


    Italian Recipes

    We know this small corpus fairly well so these experiments are to see how well word clouds of n-grams work .. or don't ... we have to be open to when these algorithms don't work, otherwise we'll fall into the trap of blindly believing their results.

    Here's the word cloud of 2-grams using a min word length of 4. Compared to previous 1-gram word clouds, this is really rather informative. We can see phrases which actually have meaning, or are things .. like bread crumbs, tomato sauce, grated cheese. You can also see that salt and pepper is prominent. The previous 1-grams wouldn't have captured these phrases, or in the case of salt and pepper, the fact that 2 things are closely related. Clearly these things are prominent in Italian cooking!


    What about 3-grams? You can see below, that some additional insight is to be had, but not as much as the leap from 1-grams to 2-grams.


    With 4-grams, there isn't much that is interesting or informative. It's as if the most interesting language snippets are 1 or 2 words long, sometimes 3.



    Chilcot's Iraq War Report

    The following shows the 2-gram word cloud for the Chilcot Iraq War Report.


    Although the phrases are very relevant to the text corpus, they're not that informative because we know what the report was about and the main elements like the dates and people.

    This is actually a useful prompt for an idea we'll explore later - that the most common phrases or words aren't the most informative .. :)


    Mystery Corpus!

    In real life, we may be trying to work out what a set of text is about, without having seen it before. We'll actually be using some "mystery" text corpora as we journey through text mining .. but here's the first sample:
    You could take a peek to see what it is .. but try not to. Instead, plot the word cloud to see what the main themes or elements are.


    Can you tell what the text was about? Yes - it's the story of Little Red Riding Hood. There are quite a few big clues in the cloud to work this out - wolf, grannie, child, basket ...

    We're text analytics detectives now!

    First Data Pipeline - from Corpus to Word Cloud

    Following on from the previous post on the need for a text processing pipeline framework ... I've just implemented a simple one.

    It's simple but powerfully illustrates the ideas discussed last time. It also started to flesh out the framework, which will be provided as a Python package for easy reuse.


    Simple Pipeline

    To recap, the simple pipeline for creating word clouds is:
      1. get text (from data_sets/recipes/txt/??.txt)
      2. simplify whitespace (remove multiple spaces, change line feeds to whitespace)
      3. filter out any non-alphanumeric characters
      4. lowercase the text
      5. split text into words
      6. remove stop words (from stopwords/minimal-stop.txt)
      7. keep only words of a minimum length (5)
      8. count word frequency
      9. plot word cloud
        These are slightly more steps than previously, to deal with unruly source text. An example is source text which contains multiple spaces, tabs, and new lines - all of which need to be simplified down to a single white space.
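
        Strung together, the middle steps look roughly like this - a simplified sketch in plain Python, not the actual toolkit functions:

        import re
        import collections

        def text_to_word_counts(text, stop_words, min_length=5):
            # simplify whitespace - tabs, newlines and runs of spaces become a single space
            text = re.sub(r'\s+', ' ', text)
            # filter out any non-alphanumeric characters (keeping spaces)
            text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
            # lowercase the text and split it into words
            words = text.lower().split()
            # remove stop words and keep only words of a minimum length
            words = [w for w in words if w not in stop_words and len(w) >= min_length]
            # count word frequency, ready for plotting as a word cloud
            return collections.Counter(words)

        print(text_to_word_counts("Chop the  tomatoes,\nthen add the grated cheese.", {"grated"}))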


        Python Package

        In starting to learn by doing, it became clear that a Python package was the right way to package up and provide the reusable framework we're developing. A fuller guide to doing this is here - but we'll start minimally.

        It's available on github under the text_mining_toolkit directory.


        Recipes Corpus to WordCloud

        The following diagram shows the above pipeline steps to take the recipes text corpus and emerge with a frequency word cloud - click it to enlarge it.

        Word Cloud Text Processing Pipeline

        You can see the python text_mining_toolkit package modules and functions being used. Feel free to explore the code at the above github link.

        The python notebook is also on github, and shows you the commands implementing this pipeline, making use of the package we're making, and word cloud graphic itself - all very simple, as intended!


        Organising the Package

        The process of experimenting, and doing, helps us learn and raise questions which we might have missed otherwise.

        As I implemented this simple pipeline, it became clear I needed to think about the structure of the text_mining_toolkit package. Here is a summary of these thoughts:
        • Make as much of the package as possible from functional functions - i.e. functions that take an input and produce an output, with no reference to, or dependency on, any existing state elsewhere (see the small sketch after this list)
        • The exception to this is the CorpusReader which is an object containing the corpus, and able to present it on request as individual documents, an aggregation of all documents, or just the names of the documents. This exception should be fine as it is the start of any data pipeline.
        • There are processing steps which make sense applied to the entire text, and others which make sense applied to a sequence/set of words. Therefore two modules are used: text_processing_steps and word_processing_steps to keep things clearer. It may be that some operations are implemented in both modules because they can be applied to both text and words (such as lowercase).
        • Visualisation steps are put into a separate visualisation module.
        • Function names should make absolutely clear what's going on. I dislike working with other frameworks where the function or object names don't make clear what's going to happen, or what is available. I have made a point of using long descriptive function names, and also verbs at the start to really help readers or coders understand the packages. For example, it is really obvious what the function split_text_into_words() does.
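
        Here's the flavour of those last points - a tiny example of a functional, descriptively named processing step (illustrative, not necessarily the exact code in the package):

        def split_text_into_words(text):
            # purely functional: text in, list of words out, no dependency on other state
            return text.split()

        words = split_text_into_words("Chop the tomatoes")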


        Italian Recipes Word Cloud (from the pipeline)

        Here's the output image from this pipeline, just for fun again.


        Now that the pipeline has proven itself, it gives us a really clear and simple way to experiment and tweak things, without getting lost in code spaghetti, if you'll excuse the Italian food pun ;)

        Tuesday, 16 August 2016

        Data Pipelines, Networks, & Functional Programming

        I've started to write out some initial code, to "learn by doing".

        One thing this brought up is the question of how best to design a software toolkit which:

        • provides a simple conceptual model for thinking about the data as it goes through various stages of processing and analytics, from data source to answer.
        • enables easy flexibility for creating our own recipes for data analytics, simple ones and complex ones with many processing steps.


        Data Processing Pipeline

        We know we will always have a data source - we've called these the text corpora. We also know we want an answer, an output, perhaps in the form of a chart but sometimes just a list or table.


        We also know we need to do something to the data between the data source and the answer. Thinking ahead a little, we know we will want to try all kinds of ideas between the source and the answer, and it would be good not to have to reinvent the wheel every time. That suggests having some kind of framework into which we can easily plug our ideas for processing steps. And we very likely will want to apply more than one of these steps - we saw earlier the application of "lowercase" and "minimum length" steps to our recipes data.

        The following shows such a framework - a data pipeline, into which we can easily plug as many processing steps as we like.


        This is the framework we want to make for our text mining toolkit.

        There are alternative designs to think about too. We might have considered having the data sit in a Python object, and repeatedly mutating it by applying methods which change the data. That could work, but it has disadvantages because you're destroying the data with each mutation.


        Pipeline Networks

        It may seem that having a pipeline is less memory efficient, because we're retaining the data that results from each processing step, and also passing it to the next step, but a significant advantage is that we can create more complex networks of pipelines. We'll have to see if the overhead defeats this ambition.



        Functional

        There is also another benefit, which is that the concept of processing steps taking data input(s) and creating data output(s) is simple, and reflects the functional approach to coding. This has two strong advantages:

        • It is possible to parallelise the processing (eg using GPUs or clusters), because each flow is independent of another.
        • The output of a processing step (function) is only dependent on the input, and no other state, making it much easier to debug pipelines, and more easily make claims about correctness.
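
        To make the pipeline idea concrete, here's a minimal sketch - each processing step is a pure function, and a pipeline is just those steps applied one after another. The function names are illustrative, not the toolkit's:

        def simplify_whitespace(text):
            return ' '.join(text.split())

        def lowercase(text):
            return text.lower()

        def split_into_words(text):
            return text.split()

        def run_pipeline(data, steps):
            # apply each processing step in turn - no step mutates shared state
            for step in steps:
                data = step(data)
            return data

        words = run_pipeline("Two  Roast   Tomatoes\n",
                             [simplify_whitespace, lowercase, split_into_words])
        print(words)   # ['two', 'roast', 'tomatoes']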