The word used for such examples of text is corpus. I know .. sounds very grand!
You can see that a small set of text would provide very limited learning opportunities .. because no machine or human mind can learn from a paucity of examples. So a large corpus is a good thing ... it provides lots of examples of language use, including the odd variations that we humans like to put into our language.
Sometimes it is useful to have a corpus that is narrowly focused on a specific domain - like medical research, or horticulture, or Shakespeare's plays. That way the learning is also focused on that specific domain .. and adding additional text from another domain would dilute the examples.
But there are cases where we actually do want a wide range of domains represented in a corpus .. to give us as general an understanding as possible of how language is used.
Finding Corpora
So given how useful large, and sometimes specialised, corpora are .. where do we find them? We don't want to make them ourselves, as that would take huge amounts of effort. Sadly, many of the best corpora are proprietary. They are not freely available, and even when they are available for personal use, you have to agree to a scary looking set of terms. Almost always, you are prohibited from sharing the corpus onwards. This is a shame, because many of these corpora are publicly funded, or derived from publicly funded sources.
There are some out-of-date corpora if you look hard enough - the 20 NNTP newsgroups here (scikit-learn) and here (Apache Mahout), ... seriously?! And there is a tendency for too many researchers to use the same small set of corpora which happen to be freely available.
There are some notable good examples of freely available and usable text. Project Gutenberg publishes out of copyright texts in very accessible forms. It's a great treasure trove .. have a look: https://www.gutenberg.org
Another good source is public data releases, such as the Clinton emails we used in previous posts. Similarly, public reports such as the Iraq Inquiry report are great sets of text, especially if you're interested in exploring a particular domain.
The British National Corpus
The British National Corpus (BNC) is a truly massive corpus of English language. It is a really impressive effort to collate a very wide range of domains and usage, including spoken and regional variations. You can find out more about the BNC at http://www.natcorp.ox.ac.uk/corpus/index.xml but here are the key features:
- 100 million words .. yes one hundred million words!
- 90% from written text including newspapers, books, journals, letters, fiction, ...
- 10% from spoken text including informal chat, formal meetings, phone-ins, radio shows .. and from a range of social and regional contexts.
Sadly the BNC corpus is proprietary - you can't take it and do what you want with it. You can apply for a copy for personal use from http://ota.ox.ac.uk/desc/2554.
There is a smaller four million word free sample, called the BNC Baby, at http://ota.ox.ac.uk/desc/2553, which we'll use first to test our algorithms, as it is quicker and less resource intensive than working with the humongous full BNC.
Extracting the Text with Python
The BNC is apparently not available in plain text form. It is instead published in a rich XML format, which includes lots of annotation about the words, such as parts of speech (verb, noun, etc). We want to make our own text mining toolkit - so we want to start with the plain text. The following is the simple Python code for accessing and extracting the plain text, in the form of sentences and words. You can see below how we can switch between the full BNC and the BNC Baby.
# code to convert the BNC XML to plain text words
# import NLTK BNC corpus reader
import nltk.corpus.reader.bnc
# full BNC text corpus
#a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2554/2554/download/Texts', fileids=r'[A-K]/\w*/\w*\.xml')
# smaller sample BNC Baby corpus
a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2553/2553/download/Texts', fileids=r'[a-z]{3}/\w*\.xml')
# how many sentences
len(a.sents())
280851
# how many words
len(a.words())
3540423
# print out first 50 words
a.words()[:50]
['BEING',
'DRAWN',
'TO',
'AN',
'IMAGE',
'Guy',
'Brett',
'Why',
'do',
'certain',
'images',
'matter',
...
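If you're curious what the NLTK reader is doing under the hood, here is a minimal sketch of pulling words directly out of a BNC-style XML fragment with Python's standard xml.etree.ElementTree. The `<s>` and `<w>` element names and the `c5`/`hw`/`pos` attributes are assumptions based on the published BNC XML format - check them against your own copy of the files:

```python
# sketch: extract plain words from a BNC-style XML snippet
# the <s> (sentence) and <w> (word) elements, and the c5/hw/pos
# annotation attributes, are assumed from the BNC XML format
import xml.etree.ElementTree as ET

sample = '''
<s n="1">
  <w c5="VVG" hw="be" pos="VERB">Being </w>
  <w c5="VVN" hw="draw" pos="VERB">drawn </w>
  <w c5="PRP" hw="to" pos="PREP">to </w>
  <w c5="AT0" hw="an" pos="ART">an </w>
  <w c5="NN1" hw="image" pos="SUBST">image </w>
</s>
'''

root = ET.fromstring(sample)

# the element text is the word itself; the attributes carry the
# part-of-speech annotation we're deliberately throwing away
words = [w.text.strip() for w in root.iter('w')]
print(words)
# → ['Being', 'drawn', 'to', 'an', 'image']
```

The NLTK BNCCorpusReader does all this (and much more) for us, which is why we use it above rather than parsing the XML by hand.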
The snippet of code to write out a new plain text file is easy too:
# extract sentences and write them to a new plain text file,
# one sentence per line
with open("data_sets/bnc/txt/bnc_baby.txt", 'w') as nf:
    for s in a.sents():
        #print(' '.join(s))
        nf.write(' '.join(s) + '\n')
The BNC Baby sample turns into a 17MB plain text file!
Damned Licence
Sadly I can't put this plain text file on github to share it, because of the damned restrictive licence http://www.natcorp.ox.ac.uk/docs/licence.html so you'll have to recreate it yourself using the above code. Annoying, I know .. feel free to petition the University of Oxford (ota@it.ox.ac.uk) to change the licence and make it #openscience
Thanks for sharing this. Are you aware of how to import XML text files (or the BNC/BNC Baby specifically) into R? I have found some functions for this in the tm and xml2 packages but they are not very user-friendly.
Sadly I don't Tom, I'm not an expert in R, but I hope you do find the answer. Let us know if you do and I'll be happy to share.
I did manage to write an R script that imports the BNC Baby corpus and converts it into CSV text files (one per original XML file) in which each row represents a token and the columns contain other metadata such as part-of-speech tag, lemma, whether there should be white space after the token, etc.
It is a pretty ugly-looking piece of code but it does the job. I'd be happy to share it with you.
Hi Tom, how did you manage to convert the BNC Baby corpus in the way you described above? Is it possible for you to share it with me?
Dear 方南,
Sorry for the slow response. I would be happy to share the R script with you. I don't know how best to do this. Could you maybe share your email address?