Wednesday 31 August 2016

Improving on Simple Word Counts (TF-IDF)

Up to now we've used word counts - how often a word occurs in a document or an entire corpus - as a measure of how important or relevant that word is.

We did it early on when we looked at word clouds to visualise the most important themes. And we did it more recently, when we tried to organise search results, putting the most relevant ones first.

The assumption that a word that appears more often in a document (or corpus) is relevant or important makes intuitive sense ...

.. but it also has some down-sides .. so let's look at a few, and see if we can make simple improvements. I say "simple" because we want to avoid overcomplicating things as much as possible if a simpler approach works well enough.



Problem 1: Boring Words

If you think about it, the most frequent words are boring words like "the", "and", "it", "is" .. and so on.

On their own, they're not informative, they don't tell us what a document is about. If I gave you a mystery document to analyse, and you could only see the top five words: "is", "the", "it", "and", "to" ... you wouldn't be able to say what the document was about.

If you remember the recipes example, the first 12 most frequent words were boring, uninformative words that had nothing to do with recipes at all.

We attempted to fix this by removing such stop words. But that approach is a little unsatisfactory, because we have to manually create such a list, and we humans might disagree on which words should be included and which shouldn't.

Also, we might spend ages crafting a perfect stop-word list for one set of documents, then find it was completely wasted effort for another set of documents, because we over-fitted the list to the first set. So manual stop word lists are not adaptive - they don't adjust themselves to different sets of documents.

So let's think again about how we might automate identifying such boring words. Let's start by thinking about how they behave, their characteristics:

  1. they occur often, and in all documents
  2. they're often short words - long words tend to carry more significant meaning


That's all I can think of for now... but that's a powerful start already. Let's see why...

The first observation - that boring words occur often, and across all documents - is pretty powerful. It allows us to automatically identify them, and that identification can be different for different sets of documents. The set of boring words for Shakespeare's plays might be different to those for medical reports.

But aren't we back to square one if a set of documents about sport all contain the word sport, frequently and in every document .. as you'd expect they might? Well, that second observation can help, because we know that many boring words are short.

Anyway ... how would we encode this idea as a practical algorithm? Well, let's think out loud about what we're trying to express ...

  1. boring words occur often, and in all documents ... so this suggests we note all the documents the word occurs in, and the fraction $\frac{documents\ with\ word}{total\ number\ of\ documents}$ is a measure of boring-ness. A low score means the word is interesting because it is not liberally peppered everywhere in all documents.
  2. boring words are often short words ... which suggests we apply a moderating factor to the above measure, one which penalises shorter words but rewards longer words. Maybe something simple like dividing by word length is enough? Normalising by word length could let a single outlier long word suppress the scores of normal words, so we'll keep it simple.
So a measure of boring-ness could be:

$$\frac{documents\ with\ word}{total\ number\ of\ documents} \cdot \frac{1}{(word\ length)}$$

That's boring-ness. If we want the opposite, interesting-ness, we subtract the document fraction from 1, and multiply by word length instead of dividing, to get:

$$\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {(word\ length)}$$

... so that the first fraction ranges between 1 (interesting) and 0 (boring).

By the way, subtracting from 1 avoids the problem of division by zero that we'd get if we instead inverted the fraction, which many people like to do.
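To make this concrete, here's a minimal sketch (not the toolkit code) of the document-fraction part of that interesting-ness score, with each document represented as a list of words. The word-length factor would simply be multiplied on afterwards:

def interestingness(word, documents):
    # documents is a list of word-lists
    documents_with_word = sum(1 for doc in documents if word in doc)
    return 1 - (documents_with_word / len(documents))

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
print(interestingness("the", docs))   # 0.0 .. in every document, so boring
print(interestingness("cat", docs))   # 1 - 1/3 = 0.67 .. in only one document, so more interesting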

Ok, ok!  ... all this isn't precise mathematical derivation from axioms .. but that's the way many methods in text analytics have grown because natural language itself isn't a mathematically correct algebra. We can fix these early ideas later if we find they don't work so well.



Problem 2: Longer Documents Cheat Word Counts

This problem is easy to see.

Imagine a document about books. It will contain the word books, and we expect it will have a higher count of the word books than other documents that aren't focussed on the fascinating topic of books.

But imagine someone wanted to cheat our system of word frequency counts ... and took our document and simply doubled it. That is, copied and pasted the text at the end of the same document, to make a new document, double the length, with the original repeated twice.


You can see in the diagram above that document 2 actually has the word book in every sentence, but the cheating document 1 doesn't. Should we reward this cheat? Should we really give the cheating document double the word count for the word book?

You can start to see that actually, document length can have an unfair bias on relevance.

What to do about it? That's easy too! When trying to account for biasing factors, a very common approach is to normalise what we're interested in (frequency) with the factor that might be causing bias (document length).

Here's how we would counter the biasing (cheating) effects of longer documents:

$$normalised\ frequency = \frac{word\ count}{total\ words\ in\ document}$$

For the above illustrated example, we'd have an updated measure for the word book:

  • document 1 normalised frequency = 4/44 = 0.091
  • document 2 normalised frequency = 3/17 = 0.176

Now the second document comes up top, with almost double the relevance score .. we fixed the bias!

You can think of this normalised frequency as a way of measuring word density. Actually, if we're being pedantic, frequency is the right word for this normalised measure .. what we were using before was really a count, not a frequency.
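As a very small sketch, using the counts assumed in the illustration above (4 occurrences in the 44-word cheating document, 3 occurrences in the 17-word honest one):

def normalised_frequency(word_count, total_words_in_document):
    return word_count / total_words_in_document

print(normalised_frequency(4, 44))   # 0.0909.. for the long cheating document
print(normalised_frequency(3, 17))   # 0.1764.. for the short honest document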



Combined Measure of Relevance

We can combine the above two measures which aim to reflect word interestingness, and also counter the biasing effect of document length.

The combined measure for the relevance of a word could be:

$$
\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {(word\ length)}
\cdot
\left\{
\frac{word\ count}{total\ words\ in\ document} \right\}$$



A Small Refinement

We should stop fiddling with that expression .. but there is just one more refinement that I want to make. Two of the three parts of that expression have values between 0 and 1. The only one that doesn't is $(word\ length)$.

A common "squishing" function for squeezing in a function's range between 0 and 1 is the $tanh(x)$ function, shown below:


You can see that $tanh(word\ length)$ is 0 when the word length is 0. But as word length grows, $tanh(word\ length)$ grows towards 1, but never quite reaches it.

We do need to think about scale because we don't want all words of length 1 to 8 to be mapped to some tiny 0.000001. We want a normal word, of say length 5 letters, to be mapped to about 0.8.

After playing about with different scaling factors, we find dividing word length by 5 seems to be a good setting, giving words of length 5 a decent score, and penalising words of length 2.


Here's a plot of $tanh(word\ length / 5)$ making clear how words of length 0 to 9 are mapped to the range 0 to 1.
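Just to check that scaling choice, here's a quick sketch using Python's math.tanh:

import math

# tanh(word_length / 5) squashes word length into the range 0 to 1
for length in [1, 2, 5, 9]:
    print(length, round(math.tanh(length / 5), 2))

# prints:
# 1 0.2
# 2 0.38
# 5 0.76   .. a normal 5-letter word scores roughly 0.8
# 9 0.95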


So our expression for relevance, which we'll stop fiddling with now, is ...

$$
\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot \tanh \left( \frac{word\ length}{5} \right)
\cdot
\left\{
\frac{word\ count}{total\ words\ in\ document} \right\}$$

Looks complicated, but it's only three factors multiplied together, representing:

  • term frequency
  • inverse document frequency
  • penalising short words.
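Here's a minimal sketch of that relevance measure as a Python function. This isn't the toolkit code - the document and corpus are just assumed to be a list of words and a list of such lists:

import math

def relevance(word, document, corpus):
    # document is a list of words, corpus is a list of such documents
    documents_with_word = sum(1 for doc in corpus if word in doc)
    interestingness = 1 - (documents_with_word / len(corpus))   # the inverse document frequency part
    length_factor = math.tanh(len(word) / 5)                    # penalise short words
    term_frequency = document.count(word) / len(document)       # word count / total words in document
    return interestingness * length_factor * term_frequency

corpus = [["the", "banana", "is", "a", "fruit"], ["the", "cat", "sat"], ["a", "dog", "ran"]]
print(relevance("banana", corpus[0], corpus))   # about 0.11 .. rare, longish, frequent in its document
print(relevance("the", corpus[0], corpus))      # about 0.04 .. short, and found in most documents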



Using A Word Relevance Measure

How do we use a word relevance measure, like the one we've developed above?

There are at least two ways we can use it:

  • Use it to identify the most important words in a document or corpus .. like we previously used word count to drive how big words were plotted in word clouds.
  • Use it to rank search results .... like we used simple word count before. If the search query contains more than one word, we simply add the measures for each query word found in the matching document.

And there will be more, I'm sure ...
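For the second use, here's a rough sketch of how that ranking might work, reusing the hypothetical relevance() function sketched above, and assuming the corpus is held as a dictionary of document name to word list:

def rank_search_results(query, corpus):
    # corpus is assumed to be a dict of {document_name: list_of_words}
    all_documents = list(corpus.values())
    query_words = query.lower().split()
    scores = {}
    for name, words in corpus.items():
        # add up the relevance measure for every query word found in this document
        score = sum(relevance(w, words, all_documents) for w in query_words if w in words)
        if score > 0:
            scores[name] = score
    # highest scoring documents first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)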



Everyone Talks About TF-IDF

What we've arrived at by ourselves is very similar to the popular and widely used measure of interesting-ness called Term-Frequency Inverse-Document-Frequency (TF-IDF).

If you refer to most textbooks, or wikipedia, you'll see that most variants of TF-IDF contain 2 of our 3 factors. Many implementations also use $log(\frac{total\ number\ of\ documents}{documents\ with\ word})$ for word interesting-ness - that's fine, because it grows from 0 (for a word appearing in every document) as the number of documents containing the word falls. But unlike our measure, it isn't confined to the range 0 to 1.
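For comparison, here's a minimal sketch of that textbook-style variant - term frequency multiplied by the logarithm of the inverse document frequency:

import math

def tf_idf(word, document, corpus):
    # assumes the word occurs in at least one document of the corpus
    term_frequency = document.count(word) / len(document)
    documents_with_word = sum(1 for doc in corpus if word in doc)
    inverse_document_frequency = math.log(len(corpus) / documents_with_word)
    return term_frequency * inverse_document_frequency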



Next Time - Results and New Code

We'll have to try out these ideas to see how they work. The next post will cover this .. and also talk about a revised code implementation to do all these new things that the current very simple index based on word counts doesn't.

Wednesday 24 August 2016

Indexing, Search and Relevance - From Scratch!

A very common thing that people want to do with text is to try to find things in it. That's called search, and we're going to make our own search engine!

Of course, we'll start with a very basic one, but the core ideas are the same across complex search engines.

We deliberately won't be using a 3rd party library to do this - we'll do this from scratch ourselves using basic Python because that's the best way to learn.


The Index at the Back of the Book

Let's start with a familiar example. Almost everyone has used a book which has an index at the back. Here's an example:


You can recall how easy it is to use. If I wanted to find the page(s) mentioning the word "ant" I'd follow the Ant entry along to find that it occurs on pages 4 and 5. Easy!

Some things like "boat" only appear once in the book, at page 2. Other things like "banana" appear on several pages, 4, 5 and 8.

So using the index is easy .. but how do we make an index in the first place?

Let's stay with our book example ... we go through each word, in each sentence, on each page, and for every interesting word, we make a note of which page it appeared on. As we scan through the pages, we'll be building up the index. In the above example, we'll encounter the word "ant" on page 4, ... then again on page 5, so we'll update the index to add that second occurrence. By the time we've reached the end of the last page, we'll have a full index.

We do have to examine every word on every page... which is a bit laborious and boring. Luckily, we have computers to do laborious work for us.


Indexing Text

So let's apply the same ideas we use for indexing a real book to our computer world.

We will again scan through the text, word by word, and keep a note of where we found each word. The following illustrates this:


You can see the word "saw" being scanned and entered into the index. You can also see the word "tree" being entered into the index too.

But have you noticed a difference from the real book example? Instead of noting down which page we found a word on, we note down which document it came from. What's happening?

We actually have a choice about which one we do. It depends on how useful noting down the page, or the document, will be. If we only have one document, a single book, then noting down the page is enough. Some documents don't have pages, so we can't do that anyway. For now, let's keep things simple and just note which document in a collection a word was found in.


Searching

Once we've done the hard work of building an index, searching is super easy. We just look words up in the index!

Using the example above, if we wanted to search for the word "tree", we'd look it up in the index .. follow the dots along .. and see that it was in "document 1". Easy peasy!

Now some readers will say that search isn't that easy, and is in fact very complicated and sophisticated. Well, yes, it can be complicated and sophisticated ... but what we've just done is very simple, works, and is at the core of the more advanced methods.


Relevance

One problem that we will try to solve, even at this very early stage, is the problem of search result relevance.

Imagine, doing a search for the word "banana", and the results coming back telling us that it is to be found in 258 documents. That's cool, and very thorough ... but it doesn't help us decide which documents might be relevant to us, which ones to start looking at first.

This is actually quite a hard problem which nobody has solved perfectly .. but, we can apply a very simple idea that works quite well in sorting these 258 results in order of relevance.

Let's walk through the thinking ourselves ... imagine documents that are actually about bananas. Then imagine documents that mention a banana, but only do so in passing, and aren't actually about bananas. It's a good bet that documents actually about bananas will mention the word several times .. many times.

Maybe we can have a theory that says, the more a word is mentioned in a document, the more likely that document is about that word. Sounds reasonable enough!

If we believe this theory, then we can sort our search results with the documents that had the word many times at the top. Even if we don't believe this theory is perfect, we can still imagine that sorting in this way will give good results on many occasions.

So what do we need to change to our simple indexing and search steps above?

Well, when we're indexing, we should keep track of each and every occurrence of a word, even if it was in the same document. This way we keep a count of how often a word appears in a document, not just the fact that it did. Look at the following, which illustrates this:


You can see the word "owl" is noted in the index as appearing three times in document 1, and once in document 2. The word "tree" appears just once in each document.

If we did a search for the word "owl" you can see that putting document 1 above document 2 would be helpful.

So we have a very simple, but fairly effective, way of ranking (sorting) the results by relevance.

Some Simple Code

Let's look at some simple code, and talk about it.

Building the Index
The following shows the Python code that creates the index. It's very simple .. it really is just 2 lines of code!!

# import the collections module for defaultdict
import collections

# start with empty index
index = collections.defaultdict(list)
# update index
# (word, [document_name]) dictionary, there can be many document_names in the list
[index[word].append(document_name) for word in doc_words_list]

So what's going on here?
Let's start with the idea of a Python dictionary which allows us to store values and associate them with keys. Here's an example:

d = {"John": "London", "Mary": "New York"}

which associates John with London, and Mary with New York.  We can query the dictionary like this:

d["John"]
"London"

You can see how a dictionary could work as an index .. instead of John we have the word, and instead of London we have the document where it exists. In fact we need a list because there may be many documents that a word exists in.

So that second line works through every word in the list of words supplied, and adds it to the index, using the word as the key and the document name as the value. Just like a normal Python dictionary. So what's the append? Well, a normal Python dictionary would complain with an error if we tried to append to a value for a key that didn't exist yet. This is where the defaultdict comes in handy. It allows us to create a brand new entry with a blank list as a value, ready for us to append to. You can read more about Python's defaultdict here.

That first line simply sets the type for the index and tells Python that we want the values to be lists, rather than, say, numbers.

All that in just 2 lines of elegant code!
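As a quick illustration of those two lines in action, with made-up document names and words:

import collections

index = collections.defaultdict(list)

# pretend we've scanned two tiny documents
for document_name, doc_words_list in [("doc1.txt", ["owl", "saw", "owl"]),
                                      ("doc2.txt", ["owl", "tree"])]:
    [index[word].append(document_name) for word in doc_words_list]

print(index["owl"])    # ['doc1.txt', 'doc1.txt', 'doc2.txt'] .. every occurrence is recorded
print(index["tree"])   # ['doc2.txt']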

Querying the Index
The following shows the code for querying the index.

# do query
matching_documents = index[search_query]
# count occurrences in matching documents
matching_documents_counter = collections.Counter(matching_documents)
# return list of matching documents ordered by those with most occurrences
return list(matching_documents_counter.most_common())

The first line is easy .. we simply query our index using the search query string. Remember the index is basically a dictionary, so we provide the key, and get back the value ... which is the list of matching documents. Remember the reason why it is a list - a word could be found in many documents.
The next line uses another cool tool from the Python collections toolset. The Counter() simply counts the number of times items appear in the list. In this case the matching documents. So if document 2 appears 5 times, we'd get a count of 5. Then we return that tally, ordered by the most common first. Simple!
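Pulled together, those lines would sit inside a function .. something like the following sketch. The real tmt.index_search.search_index() on github takes a content directory rather than an index object, so treat this as illustrative only:

import collections

def search_index(index, search_query):
    # look up the query word, count how often each matching document occurs,
    # and return the documents ordered by the most occurrences first
    matching_documents = index[search_query]
    matching_documents_counter = collections.Counter(matching_documents)
    return list(matching_documents_counter.most_common())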

Here's an example output:

tmt.index_search.search_index(cr.content_directory, "rice")
[('06.txt', 5), ('05.txt', 4), ('04.txt', 1)]

Here we've searched for the word "rice". The results show that it occurs in documents 06.txt, 05.txt and 04.txt. You can also see that it occurs 5 times in 06.txt, 4 times in 05.txt and only once in 04.txt.
So the first result 06.txt is most likely to actually be about rice. Yummy!


Github

The full code is on github, as well as a notebook illustrating how these functions can be used:



Enjoy!

Monday 22 August 2016

N-Gram Word Clouds

We've just developed code to count n-grams, and plot their frequency as word clouds:

So let's explore some text data sets to see what happens.
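By the way, here's a rough sketch of the kind of n-gram counting involved. The real code lives in the text_mining_toolkit package on github and may differ in detail - here, short words are dropped before the n-grams are formed:

import collections

def count_ngrams(words, n, min_word_length=4):
    # keep only words of at least the minimum length
    words = [w for w in words if len(w) >= min_word_length]
    # an n-gram is simply n consecutive words joined together
    ngrams = [" ".join(words[i:i+n]) for i in range(len(words) - n + 1)]
    return collections.Counter(ngrams)

words = ["grated", "cheese", "and", "bread", "crumbs", "with", "grated", "cheese"]
print(count_ngrams(words, 2).most_common(3))   # [('grated cheese', 2), ...]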


Italian Recipes

We know this small corpus fairly well so these experiments are to see how well word clouds of n-grams work .. or don't ... we have to be open to when these algorithms don't work, otherwise we'll fall into the trap of blindly believing their results.

Here's the word cloud of 2-grams using a min word length of 4. Compared to previous 1-gram word clouds, this is really rather informative. We can see phrases which actually have meaning, or are things .. like bread crumbs, tomato sauce, grated cheese. You can also see that salt and pepper is prominent. The previous 1-grams wouldn't have captured these phrases, or in the case of salt and pepper, the fact that 2 things are closely related. Clearly these things are prominent in Italian cooking!


What about 3-grams? You can see below, that some additional insight is to be had, but not as much as the leap from 1-grams to 2-grams.


With 4-grams, there isn't much that is interesting or informative. It's as if the most interesting language snippets are 1 or 2 words long, sometimes 3.



Chilcott's Iraq War Report

The following shows the 2-gram word cloud for the Chilcott Iraq War Report.


Although the phrases are very relevant to the text corpus, they're not that informative because we know what the report was about and the main elements like the dates and people.

This is actually a useful prompt for an idea we'll explore later - that the most common phrases or words aren't the most informative .. :)


Mystery Corpus!

In real life, we may be trying to work out what a set of text is about, without having seen it before. We'll actually be using some "mystery" text corpora as we journey through text mining .. but here's the first sample:
You could take a peek to see what it is .. but try not to. Instead plot the word cloud to see what the main themes or elements are.


Can you tell what the text was about? Yes - it's the story of Little Red Ridinghood. There are quite a few big clues in the cloud to work this out - wolf, grannie, child, basket ...

We're text analytics detectives now!

First Data Pipeline - from Corpus to Word Cloud

Following on from the previous post on the need for a text processing pipeline framework ... I've just implemented a simple one.

It's simple but powerfully illustrates the ideas discussed last time. It also started to flesh out the framework, which will be provided as a Python package for easy reuse.


Simple Pipeline

To recap, the simple pipeline for creating word clouds is:
    1. get text (from data_sets/recipes/txt/??.txt)
    2. simplify whitespace (remove multiple spaces, change line feeds to whitespace)
    3. filter out any non-alphanumeric characters
    4. lowercase the text
    5. split text into words
    6. remove stop words (from stopwords/minimal-stop.txt)
    7. keep only words of a minimum length (5)
    8. count word frequency
    9. plot word cloud
This is slightly more steps than previously, to deal with unruly source text. An example of this is source text which contains multiple spaces, tabs, and new lines - all of which need to be simplified down to a single white space.
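Here's a sketch of those steps in plain Python. The toolkit wraps each step as a function in the text_mining_toolkit package, so the real code looks a little different, but the flow is the same:

import re
import collections

# 1. get text (06.txt is one of the recipe files)
text = open("data_sets/recipes/txt/06.txt").read()

# 2. simplify whitespace - tabs, new lines and multiple spaces become a single space
text = re.sub(r'\s+', ' ', text)

# 3. filter out any non-alphanumeric characters
text = re.sub(r'[^a-zA-Z0-9 ]', '', text)

# 4. and 5. lowercase the text and split it into words
words = text.lower().split()

# 6. remove stop words
stop_words = set(open("stopwords/minimal-stop.txt").read().split())
words = [w for w in words if w not in stop_words]

# 7. keep only words of a minimum length (5)
words = [w for w in words if len(w) >= 5]

# 8. count word frequency - the word cloud plot (step 9) is driven by these counts
word_counts = collections.Counter(words)
print(word_counts.most_common(10))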


Python Package

In starting to learn by doing, it became clear that a Python package was the right way to package up and provide the reusable framework we're developing. A fuller guide to doing this is here - but we'll start minimally.

It's available on github under the text_mining_toolkit directory:


Recipes Corpus to WordCloud

The following diagram shows the above pipeline steps taking the recipes text corpus and emerging with a frequency word cloud - click it to enlarge it.

Word Cloud Text Processing Pipeline

You can see the python text_mining_toolkit package modules and functions being used. Feel free to explore the code at the above github link.

The python notebook is also on github, and shows you the commands implementing this pipeline, making use of the package we're making, and the word cloud graphic itself - all very simple, as intended!


Organising the Package

The process of experimenting, and doing, helps us learn and raise questions which we might have missed otherwise.

As I implemented this simple pipeline, it became clear I needed to think about the structure of the text_mining_toolkit package. Here is a summary of these thoughts:
  • Make as much of the package as possible from functional functions - i.e. functions that take an input and produce an output, with no reference or dependency on any existing state elsewhere.
  • The exception to this is the CorpusReader, which is an object containing the corpus, and able to present it on request as individual documents, an aggregation of all documents, or just the names of the documents. This exception should be fine as it is the start of any data pipeline.
  • There are processing steps which make sense applied to the entire text, and others which make sense applied to a sequence/set of words. Therefore two modules are used, text_processing_steps and word_processing_steps, to keep things clearer. It may be that some operations are implemented in both modules because they can be applied to both text and words (such as lowercase).
  • Visualisation steps are put into a separate visualisation module.
  • Function names should make absolutely clear what's going on. I dislike working with frameworks where the function or object names don't make clear what's going to happen, or what is available. I have made a point of using long descriptive function names, starting with verbs, to really help readers and coders understand the package. For example, it is really obvious what the function split_text_into_words() does.


Italian Recipes Word Cloud (from the pipeline)

Here's the output image from this pipeline, just for fun again.


Now that the pipeline has proven itself, we have a really clear and simple way to experiment and tweak things, without getting lost in code spaghetti, if you'll excuse the Italian food pun ;)

Tuesday 16 August 2016

Data Pipelines, Networks, & Functional Programming

I've started to write out some initial code, to "learn by doing".

One thing this brought up is the question of how best to design a software toolkit which:

  • provides a simple conceptual model for thinking about the data as it goes through various stages of processing and analytics, from data source to answer.
  • enables easy flexibility for creating our own recipes for data analytics, simple ones and complex ones with many processing steps.


Data Processing Pipeline

We know we will always have a data source - we've called these the text corpora. We also know we want an answer, an output, perhaps in the form of a chart but sometimes just a list or table.


We also know we need to do something to the data between the data source and the answer. Thinking ahead a little, we know we will want to try all kinds of ideas between the source and the answer, and it would be good not to have to reinvent the wheel every time. That suggests having some kind of framework into which we can easily plug our ideas for processing steps. And we very likely will want to apply more than one of these steps - we saw earlier the application of "lowercase" and "minimum length" steps to our recipes data.

The following shows such a framework - a data pipeline, into which we can easily plug as many processing steps as we like.


This is the framework we want to make for our text mining toolkit.

There are alternative designs to think about too. We might have considered having the data sit in a Python object, repeatedly mutated by applying methods which change the data. That could work, but it has disadvantages because you're destroying the data with each mutation.


Pipeline Networks

It may seem that having a pipeline is less memory efficient, because we're retaining the data that results from each processing step, and also passing it to the next step, but a significant advantage is that we can create more complex networks of pipelines. We'll have to see if the overhead defeats this ambition.



Functional

There is also another benefit, which is that the concept of processing steps taking data input(s) and creating data output(s) is simple, and reflects the functional approach to coding. This has two strong advantages:

  • It is possible to parallelise the processing (eg using GPUs or clusters), because each flow is independent of another.
  • The output of a processing step (function) is only dependent on the input, and no other state, making it much easier to debug pipelines, and more easily make claims about correctness.
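To illustrate the idea, here's a tiny sketch - the function names are made up for this example, not taken from the toolkit:

# each step is a pure function: its output depends only on its input, no hidden state
def lowercase(words):
    return [w.lower() for w in words]

def keep_minimum_length(words, min_length=5):
    return [w for w in words if len(w) >= min_length]

def apply_pipeline(data, steps):
    # pass the output of each step to the next .. the original data is never mutated
    for step in steps:
        data = step(data)
    return data

result = apply_pipeline(["Chopped", "Olives", "and", "Tomato", "Sauce"],
                        [lowercase, keep_minimum_length])
print(result)   # ['chopped', 'olives', 'tomato', 'sauce']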

Sunday 7 August 2016

Italian Recipes Revisited

After much trying, the BBC still won't let me use the recipes on their website for this book. That is a shame because the BBC is publicly funded and content should be public wherever possible.


The Italian Cookbook - The Art of Eating Well

Project Gutenberg, as we saw earlier, hosts books which are freely available and usable, mostly because they are out of copyright. So I used "The Italian Cookbook - The Art of Eating Well" (1919).


Our Own Small Recipes Corpus

I sampled some of the recipes .. 22 of them .. to make our own small corpus of recipes. A small corpus will be useful to experiment with, and this one is specialised to a domain - Italian cooking.

I included a range of dishes, except desserts, which would have competed with the savoury dishes in terms of ingredients and processes.

Here are the plain text files on github: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/recipes


Italian Recipes Word Cloud

Following the same approach as before - we obtain the following word cloud (stop words, lower case, min word length 5):

What does this tell us? There's a lot of chopping and olives in Italian cooking ...

Wednesday 3 August 2016

Using the Humongous British National Corpus (BNC)

Many models for text mining need an example set of natural language text to learn from .. that set of text is the "example" for machine learning methods.

The word used for such examples of text is corpus. I know .. sounds very grand!

You can see that a small set of text would provide very limited learning opportunities .. because no machine or human mind can learn from a paucity of examples. So a large corpus is a good thing ... it provides lots of examples of language use, including the odd variations that we humans like to put into our language.

Sometimes it is useful to have a corpus that is narrowly focused on a specific domain - like medical research, or horticulture, or Shakespeare's plays. That way the learning is also focused on that specific domain .. and adding additional text from another domain would dilute the examples.

But there are cases where we actually do want a wide range of domains represented in a corpus .. to give us as general an understanding of how language is used as possible.

Finding Corpora

So given how useful large, and sometimes specialised, corpora are .. where do we find them? We don't want to make them ourselves as that would take huge amounts of effort.

Sadly, many of the best corpora are proprietary. They are not freely available, and even when they are available for personal use, you have to agree to a scary looking set of terms. Almost always, you are prohibited from sharing the corpus onwards. This is a shame, because many of these corpora are publicly funded, or derived from publicly funded sources.

There are some out-of-date corpora if you look hard enough - 20 nntp news groups here (scikit-learn) and here (Apache Mahout), ... seriously?! And there is a tendency for too many researchers to use the same set of corpora which happen to be freely available.

There are some notable good examples of freely available and usable text. Project Gutenberg publishes out of copyright texts in very accessible forms. It's a great treasure trove .. have a look: https://www.gutenberg.org

Another good source is public data releases, such as the Clinton emails we used in previous posts. Similarly, public reports such as the Iraq Inquiry report are great sets of text, especially if you're interested in exploring a particular domain.


The British National Corpus

The British National Corpus (BNC) is a truly massive corpus of English language. It is a really impressive effort to collate a very wide range of domains and usage, including spoken and regional variations.

You can find out more about the BNC at http://www.natcorp.ox.ac.uk/corpus/index.xml but here are the key features:
  • 100 million words .. yes, one hundred million words!
  • 90% from written text including newspapers, books, journals, letters, fiction, ...
  • 10% from spoken text including informal chat, formal meetings, phone-ins, radio shows .. and from a range of social and regional contexts.

Sadly the BNC corpus is proprietary - you can't take it and do what you want with it. You can apply for a copy for personal use from http://ota.ox.ac.uk/desc/2554.

There is a much smaller free sample, called the BNC Baby, at http://ota.ox.ac.uk/desc/2553 - around 3.5 million words, as we'll count below - which we will use to test our algorithms on first, as it is quicker and less resource intensive than working with the humongous full BNC.


Extracting the Text with Python

The BNC is apparently not available in plain text form. It is instead published in a rich XML format, which includes lots of annotation about the words, such as parts of speech (verb, noun, etc).

We want to make our own text mining toolkit - so we want to start with the plain text. The following is the simple Python code for accessing and extracting the plain text, in the form of sentences and words. You can see below how we can switch between the full BNC and the BNC Baby.

# code to convert the BNC XML to plain text words

# import NLTK BNC corpus reader
import nltk.corpus.reader.bnc


# full BNC text corpus
#a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2554/2554/download/Texts', fileids=r'[A-K]/\w*/\w*\.xml')


# smaller sample BNC Baby corpus

a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2553/2553/download/Texts', fileids=r'[a-z]{3}/\w*\.xml')


# how many sentences
len(a.sents())

280851

# how many words
len(a.words())

3540423

# print out first 50 words
a.words()[:50]
['BEING',
 'DRAWN',
 'TO',
 'AN',
 'IMAGE',
 'Guy',
 'Brett',
 'Why',
 'do',
 'certain',
 'images',
 'matter',
 ...


The snippet of code to write out a new plain text file is easy too:

# extract sentences and add to a new file
with open("data_sets/bnc/txt/bnc_baby.txt", 'w') as nf:
    for s in a.sents():
        #print(' '.join(s))
        # write each sentence on its own line so sentences don't run into each other
        nf.write(' '.join(s) + '\n')


The BNC Baby sample turns into a 17Mb plain text file!


Damned License

Sadly I can't put this plain text file on github to share it, because of the damned restrictive license http://www.natcorp.ox.ac.uk/docs/licence.html so you'll have to recreate it yourself using the above code.

Annoying, I know .. feel free to petition the University of Oxford ota@it.ox.ac.uk to change the license and make it #openscience