In the last post we talked about extracting the raw text data from the PDFs made public by the
Hillsborough Independent Panel as part of its collation and review of evidence about the
disaster from 89 organisations.
The Data
That data is fairly chunky:
- 19,217 text files
- a total of 874,408 KB in size .. or 853 MB
- an average file size of 45.5 KB
Here's a breakdown of the documents from each organisation. You can see that most of the documents came from the South Yorkshire Police (SYP), the Home Office (HOM) and the Department for Culture, Media and Sport (CMS).
Organisation Count
SYP 10078
HOM 3816
CMS 1073
FFA 413
CPS 409
SYC 345
SPP 267
YAS 259
LCS 229
COO 200
AGO 191
... ...
PCC 1
LHC 1
Grand Total 19216
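For reference, a breakdown like this can be produced simply by counting filename prefixes. Here's a minimal sketch - the directory path and the 3-letter prefix assumption are illustrative, not the toolkit's own code:

import glob
import os
import collections

# assumed location of the extracted text files
files = glob.glob("data_sets/hillsborough/txt/*.txt")

# count files by the 3-letter organisation prefix of their filename
counts = collections.Counter(os.path.basename(f)[:3] for f in files)

for org, count in counts.most_common():
    print(org, count)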
The following chart makes it easier to understand these numbers - click to enlarge.
Here we'll explore it with some of the tools we've developed.
Word Cloud - Simple Word Counts
A simple, and early, tool we developed was the word cloud, showing the most frequently occurring words, to get an initial feel for what the text data set is about.
Before we create a word cloud chart, we need to clean the data. Here are the steps we're already familiar with:
- simplify whitespace
- only keep alphanumeric characters
- lowercase
- split into words
- remove stop-words from manual list
- only keep words of minimum length 5 (assuming longer words will be more interesting) - see the sketch below
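As a rough sketch, here's how that cleaning pipeline might look in plain Python - the stop-word list and function name are illustrative, not the toolkit's own code:

import re

# illustrative manual stop-word list
stop_words = {"the", "and", "that", "with"}

def clean_text(text):
    # simplify whitespace
    text = re.sub(r'\s+', ' ', text)
    # only keep alphanumeric characters (and spaces)
    text = re.sub(r'[^a-zA-Z0-9 ]+', '', text)
    # lowercase
    text = text.lower()
    # split into words
    words = text.split()
    # remove stop-words from the manual list
    words = [w for w in words if w not in stop_words]
    # only keep words of minimum length 5
    words = [w for w in words if len(w) >= 5]
    return words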
The top 10 most occurring words are:
police 358160
there 251553
ground 243109
which 220052
other 188236
would 180488
should 136455
football 135171
people 134446
number 133143

There are words in there that aren't that informative. Remember why we moved instead to a measure of interesting-ness (TF-IDF), to ensure boring words aren't so prominent. Despite this, the list does give us a feel for the text corpus - police, ground, football, number .. all relevant themes. Here's the word cloud.
Again - there are informative words in there, which we know are relevant to the history and events of the disaster - police, stand, south, crowd, action, turnstiles, evidence, witness, supporters, pitch.
The word cloud is often derided - but it is very simple and very effective.
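For what it's worth, a word cloud like this can be rendered in a few lines with the third-party wordcloud and matplotlib packages - an illustrative sketch, not the toolkit's own plotting code, with a tiny example dictionary taken from the counts above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# word_counts is assumed to be a dict of {word: count} built by the cleaning pipeline
word_counts = {"police": 358160, "ground": 243109, "football": 135171}

wc = WordCloud(width=800, height=600, background_color="white")
wc.generate_from_frequencies(word_counts)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()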
Word Cloud - Relevance (TF-IDF)
Working out the relevance to reduce the effect of boring words blows up the memory of my computer - the dataset is too large - I'll need to improve the code in future, or shift to an out-of-core system like Python's Dask.
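Just to sketch what that out-of-core shift might look like (illustrative only, not part of the toolkit), dask.bag can stream a word count over the files without holding everything in memory at once:

import dask.bag as db

# read all the text files lazily, line by line
bag = db.read_text("data_sets/hillsborough/txt/*.txt")

# lowercase, split into words, flatten to one big bag of words, then count
word_counts = (bag.map(str.lower)
                  .map(str.split)
                  .flatten()
                  .frequencies())

# compute the 10 most frequent words, out of core
print(word_counts.topk(10, key=lambda pair: pair[1]).compute())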
So for now, we'll chop the dataset up by selecting only documents that come from a specific organisation. This is easy because the text files have a prefix which identifies their origin. The Home Office files are prefixed with HOM .. like HOM000049500001.txt, for example.
The selection is done using the corpus reader as follows:
cr = tmt.corpus_reader.CorpusReader(content_directory="data_sets/hillsborough/txt/", text_filename_pattern="HOM*.txt")

Here's the top 20 list of most relevant words:
police 13.652282
inquiry 12.031492
secretary 11.218956
justice 10.757810
football 9.901103
letter 9.803244
yorkshire 9.501367
evidence 9.079484
hillsborough 8.938162
london 8.917985
taylor 8.695437
would 8.640892
disaster 8.519620
scrutiny 8.362206
reference 8.135450
there 7.996140
authority 7.765933
should 7.601597
south 7.530626
report 7.525849

That's a much better list of most relevant terms.
Let's visualise the word cloud of relevant terms - click to enlarge.
The words included here are much more relevant and we can judge this because it is a subject we're fairly familiar with. The police are a major theme, for instance, and for good reason.
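For readers who want to reproduce something similar without the toolkit, here's a hedged sketch of the same relevance idea using scikit-learn's TfidfVectorizer (a swapped-in technique, not the toolkit's own code) - summing each word's TF-IDF score over all documents gives an overall relevance:

import glob
from sklearn.feature_extraction.text import TfidfVectorizer

# read only the Home Office documents, as above
docs = [open(f, encoding="utf-8", errors="ignore").read()
        for f in glob.glob("data_sets/hillsborough/txt/HOM*.txt")]

# lowercase and keep only words of 5 or more letters, echoing the earlier pipeline
vectoriser = TfidfVectorizer(lowercase=True, token_pattern=r"[a-z]{5,}")
tfidf = vectoriser.fit_transform(docs)

# sum each word's TF-IDF score over all documents to get an overall relevance
scores = tfidf.sum(axis=0).A1
words = vectoriser.get_feature_names_out()   # get_feature_names() on older scikit-learn

top = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)[:20]
for word, score in top:
    print(word, score)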
Co-occurrence
Again focussing on the HOM subset (because the entire set breaks my computer's memory) we can apply our co-occurrence tool.
Here are the top 20 most co-occurring words:
word1 word2 weight
0 there there 1.000000
1 would would 0.974656
2 police police 0.866106
3 south yorkshire 0.681754
4 should should 0.671855
5 football football 0.666573
6 which which 0.566681
7 justice taylor 0.520375
8 police officers 0.503252
9 there which 0.494395
10 there people 0.492280
11 there would 0.490765
12 ground ground 0.486991
13 which would 0.468009
14 would there 0.465085
15 yorkshire police 0.445687
16 people people 0.439034
17 justice stuartsmith 0.432644
18 people there 0.425958
19 which there 0.418774

That contains word pairs where both words are the same .. I need to fix that! But also highlighted are pairs which are very informative about the data set.
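As a rough illustration of the idea (the toolkit's own weighting scheme will differ), co-occurrence can be counted by sliding a small window over the word sequence, counting the pairs that fall inside it, and normalising so the top pair scores 1.0:

import collections
import itertools

def cooccurrence(words, window=2):
    # count pairs of words that appear together within the window
    counts = collections.Counter()
    for i in range(len(words) - window + 1):
        for w1, w2 in itertools.combinations(words[i:i + window], 2):
            counts[(w1, w2)] += 1
    # normalise so the most frequent pair scores 1.0
    top = max(counts.values()) if counts else 1
    return {pair: count / top for pair, count in counts.items()}

print(cooccurrence("police officers spoke to police officers".split()))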
It is interesting that there is a lot of use of the conditional future ... there would .. should ... which would. This suggests the material is discussing what should have happened, after the fact, in an almost apologetic way.
The graph of linked nodes representing co-occurring words should be interesting:
So what can we see here? We can see that:
- The word police is at the centre of many relationships - so the police are a very pertinent and relevant theme of the evidence. This is in fact true of the disaster, where many of the inquiries have been into the role of the police. That's a powerful revelation by the chart, if we didn't know this before.
- locations are important too .. Liverpool, Midlands, Sheffield, Yorkshire.
- Again the words would and should are central, reflecting the regretful view of hindsight.
Let's take only the most co-occurring words, with normalised scores of over 0.2.
Colours have been added to the groupings to make them clearer. We can see some themes already:
- Ground sports safety
- Lord Justice Stuart-Smith and Justice Taylor inquiries and reports
- Chief constable
- Police control, authority and evidence.
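Going back to the 0.2 threshold above, here is an illustrative sketch of how such a filtered graph could be built and drawn with networkx - the example pairs are taken (rounded) from the co-occurrence table above, and the plotting details are not the toolkit's own:

import networkx as nx
import matplotlib.pyplot as plt

# co_occurrence is assumed to be a dict of {(word1, word2): normalised_score}
co_occurrence = {("south", "yorkshire"): 0.68, ("justice", "taylor"): 0.52,
                 ("police", "officers"): 0.50, ("yorkshire", "police"): 0.45}

graph = nx.Graph()
for (w1, w2), weight in co_occurrence.items():
    if weight > 0.2 and w1 != w2:    # keep only strong links, drop same-word pairs
        graph.add_edge(w1, w2, weight=weight)

nx.draw_networkx(graph, with_labels=True, node_color="lightblue")
plt.axis("off")
plt.show()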
We should really apply this to other organisations of the data set .. but first let's crack on with the other analyses and come back later.
Document Similarity
We'll skip over the document similarity for now because we're only looking at the Home Office documents. If we were doing a broader analysis across different organisations that would be much more interesting for a document similarity map.
Topic Extraction
The latest tool we developed was the extraction of topics using singular value decomposition (SVD). It worked rather well for the Iraq Report. Let's see how it does for the Home Office Hillsborough documents.
The first thing to check when extracting topics is the distribution of SVD eigenvalues:
Ok, there are a lot of eigenvalues here! Luckily the first few seem to be much more significant than the long tail. Let's zoom in on the first few:
That's better. The first two eigenvalues are much larger than the rest. The next 2 are also significant. The next dozen or so are worth looking at, but beyond that we're into the long tail.
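To make the decomposition concrete, here's a rough sketch of topic extraction with scipy's sparse SVD - a stand-in for the toolkit's own code, assuming tfidf is the (documents x words) matrix and words the matching feature names from the earlier TF-IDF sketch:

import numpy as np
import scipy.sparse.linalg

# decompose the word-document matrix into k=10 components
u, s, vt = scipy.sparse.linalg.svds(tfidf.asfptype(), k=10)

# svds returns singular values in ascending order, so sort descending
order = np.argsort(s)[::-1]

for topic_number, i in enumerate(order):
    # the largest entries of each right singular vector give that topic's top words
    top_words = np.argsort(np.abs(vt[i]))[::-1][:10]
    print("topic #", topic_number, [words[j] for j in top_words])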
Let's see what the top 10 topics actually are:
topic # 0
inquiry 0.238646, police 0.233339, secretary 0.228070, justice 0.213491, letter 0.200483, scrutiny 0.198494, yorkshire 0.195255, london 0.186311, reference 0.180697, hillsborough 0.175345

topic # 1
rfctr6fltyj 9.754874e-01, statement 9.668067e-17, stand 9.368396e-17, report 7.629219e-17, people 6.985798e-17, private 6.167880e-17, ground 5.815512e-17, recommendations 5.815209e-17, recommendation 5.489136e-17, yorkshire 5.228198e-17

topic # 2
chevf 7.613946e-01, superintendent 1.157154e-16, rover 1.024354e-16, leppings 8.931737e-17, aoorv 7.558747e-17, dspys 7.558747e-17, cuxjo 7.558747e-17, chapman 6.959515e-17, football 6.090088e-17, trapped 5.805032e-17

topic # 3
cecic 7.613946e-01, submission 1.025629e-16, psl4310 8.894763e-17, heard 8.755130e-17, tickets 8.643075e-17, dated 7.646144e-17, april 7.373294e-17, tragedy 7.347410e-17, early 7.005352e-17, thank 6.976673e-17

topic # 4
reference 0.268700, scrutiny 0.239448, midlands 0.193232, authority 0.158826, police 0.158758, costs 0.146980, yorkshire 0.136030, london 0.127953, stuartsmith 0.124732, south 0.113730

topic # 5
reference 0.215533, midlands 0.204680, yorkshire 0.185533, costs 0.145715, south 0.136164, football 0.130660, authority 0.125776, scrutiny 0.106040, ground 0.101162, safety 0.084994

topic # 6
ikwmmiwr 4.607135e-01, jmnmiir 4.425598e-01, liverpool 1.023169e-16, semifinal 7.846477e-17, manchester 7.286901e-17, paragraph 7.279151e-17, meeting 7.272110e-17, taylor 7.171484e-17, authority 7.121209e-17, submission 6.293662e-17

topic # 7
reference 0.212675, inquiry 0.158190, secretary 0.134060, whalley 0.127025, scrutiny 0.116281, private 0.100031, london 0.094764, ground 0.092146, people 0.088364, letter 0.086487

topic # 8
18aug1989 4.732789e-01, ifcrl 3.806973e-01, provide 3.862433e-17, stuartsmiths 3.757920e-17, states 3.699572e-17, bodies 3.634809e-17, central 3.534114e-17, community 3.329393e-17, lloyd 3.059613e-17, constabulary 3.041726e-17

topic # 9
reference 0.252258, private 0.118237, football 0.117194, evidence 0.111125, extension 0.102815, scrutiny 0.091190, safety 0.087857, stuartsmith 0.085739, telephone 0.084538, memorandum 0.075057

Let's look at these topics:
- topic 0 seems to be about the inquiry and scrutiny into the police, involving the secretary of state, seeking justice as a theme. That's a good topic to extract!
- topic 1 seems to be related to safety recommendations about the stands and grounds after the disaster
- topic 2 seems to be about the role of the chief superintendent in people being trapped, in relation to the Leppings Lane stand.
- ..
These topics are somewhat concrete, but some seem similar, varying only by a relatively small amount. This is likely because the Home Office documents are probably all about a similar set of themes - and as such it is difficult to extract very different topics .. because they aren't there!
A cross-organisation analysis would more likely extract different topics, just like we saw with the Iraq Report.
Note also that the topic words are polluted by non-English words which are there because of the process of optical character recognition (OCR) that tries to convert, often badly formed, scanned images into text.
Reduced Corpus To Ease Memory Pressure
We were forced to take only the Home Office documents because the entire set, and indeed just the South Yorkshire Police documents, broke the memory limits of my laptop with 16GB RAM!
Let's try a broader exploration by including all the documents except the HOM and SYP sets. The easiest way to do this is to move all SYP*.txt and HOM*.txt files to a subdirectory, because Python's glob() doesn't support patterns that exclude files.
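As an aside, an alternative (illustrative) sketch is to filter the glob results in Python rather than moving files - though the corpus reader above takes a single filename pattern, which is why moving the files is the simpler route here:

import glob
import os

all_files = glob.glob("data_sets/hillsborough/txt/*.txt")
excluded_prefixes = ("HOM", "SYP")

# keep only files whose name doesn't start with an excluded organisation prefix
kept_files = [f for f in all_files
              if not os.path.basename(f).startswith(excluded_prefixes)]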
Trying that, the memory explodes again, so we exclude the CMS* files too.
The top 20 relevant words make sense:
police 11.081260
football 10.203385
sheffield 9.313793
hillsborough 9.238009
would 9.190118
liverpool 8.361010
there 8.161893
letter 8.041335
meeting 7.611749
report 7.499404
which 7.458836
ground 7.424201
evidence 7.025812
should 6.927943
telephone 6.765532
authority 6.662008
coroner 6.653248
disaster 6.570584
secretary 6.516341
committee 6.341970

Aside from the expected words, there is an interesting word in there: telephone. Maybe telephony was an important aspect of the events?
A word cloud of the relevant words is interesting too - click to enlarge:
We see another aspect coming through - safety, coroner, director, street...
Trying to extract topics again blows up the memory so we exclude more subsets, this time the FFA* and CPS* documents. But that didn't work either .. memory still blew up!
So let's have another look at the data. We notice there are lots and lots of junk words with repeated characters, like AAAAA, 00000 and zzzzzz. A good filter would be to remove words which have n or more consecutively repeated characters. So here it is, added to the word_processing module:
import re

# remove words with n (or more) consecutively repeated characters
def remove_words_with_n_repeated_chars(input_words_list, n):
    # match repeated chars anywhere in the string (re.match only matches from the start)
    # the expression needs (n-1) because the group itself matches the first character
    regex = re.compile(r'(.)\1{' + str(n - 1) + r',}')
    output_text = [word for word in input_words_list if not regex.search(word)]
    return output_text

There aren't many (if any) English words with more than 3 consecutively repeated characters, so let's add this to the filter at the top of the pipeline and see if that helps. The filter is applied as follows:
# remove words which have a character consecutively repeated n=3 times or more
gl = tmt.word_processing.remove_words_with_n_repeated_chars(fl, 3)

That works to an extent, but not by much! The memory used by the word count index is reduced from 13.8 GB to 13.4 GB .. so not a huge change.
Looking again at the data we see lots of numeric-only words .. so let's create a filter that removes numeric characters (to be used only intentionally, as numbers can sometimes be useful). Here it is, very similar to the keep_only_alphanumeric() function we've used before.
# keep only alpha (not numeric) characters
def keep_only_alpha(input_text):
    regex = re.compile('[^a-zA-Z ]+')
    output_text = regex.sub('', input_text)
    return output_text

That seems to work a bit better. The memory consumption of the word index is now reduced from 13.8 GB to 11.9 GB. Still not a massive drop. Let's combine this with the removal of repeating characters. That reduces it to 11.6 GB.
Time to get drastic! Looking again at the data .. we still see nonsense words .. like:
aabac 0.0 0.0
aabalanceaan 0.0 0.0
aabalanceaanhanan 0.0 0.0
aabaw 0.0 0.0
aability 0.0 0.0
aabiscf 0.0 0.0
aabit 0.0 0.0
aabjjivujtzuy 0.0 0.0
aablt 0.0 0.0
aabout 0.0 0.0
aabove 0.0 0.0
aabroad 0.0 0.0
aabrook 0.0 0.0
aabtt 0.0 0.0
aabulance 0.0 0.0
aabulancenanwoman 0.0 0.0
aabulances 0.0 0.0
aabulanoe 0.0 0.0
aabularce 0.0 0.0
aabulonco 0.0 0.0
aaburo 0.0 0.0
aacabd 0.0 0.0
aacacxoiza 0.0 0.0
aaccmpaappll 0.0 0.0
aaccommodation 0.0 0.0
aaccord 0.0 0.0
aacdi 0.0 0.0
aacede 0.0 0.0
aacent 0.0 0.0

Let's get brutal and remove all words with 2 consecutive letters that are the same. This will remove valid English words .. but right now we can't process such huge data.
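Presumably that just means calling the same filter as before with n=2 - a one-line sketch, reusing the variable names from earlier:

# remove words containing any character repeated 2 or more times consecutively
# (this also removes valid words like "football" and "letter")
gl = tmt.word_processing.remove_words_with_n_repeated_chars(fl, 2)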
Removing these words leads to a reduction to 8.2 GB. The SVD calculation takes a while but succeeds without crashing .. that's a testament to the quality of Python and its open source libraries that they can crunch through an 8 GB data frame to do a matrix decomposition.
Here are the resultant topics:
topic # 0
aunder 1.173536, newcastle 0.000598, august 0.000438, laboratory 0.000366, research 0.000347, services 0.000306, stephenson 0.000297, reference 0.000295, building 0.000281, telephone 0.000278

topic # 1
police 0.201300, authority 0.151340, coroner 0.150925, report 0.149464, would 0.146609, disaster 0.143472, secretary 0.134563, there 0.131142, evidence 0.129285, which 0.129257

topic # 2
visitors 0.970682, seiolisia 0.116385, seiojisia 0.116385, coroner 0.002195, police 0.002041, inquest 0.001848, would 0.001843, report 0.001822, evidence 0.001820, authority 0.001810

topic # 3
sleisielie 9.637464e-01, stage 1.463056e-16, arvypftea 1.444467e-16, csbcjfcjl 1.444467e-16, jfdacjci 1.406117e-16, fictnizt 1.406117e-16, ctfyjzy 1.350711e-16, dstaqef 1.350711e-16, tsrigjt 1.350711e-16, sctenjs 1.350711e-16

topic # 4
lcvtuow 8.850934e-01, acoruanw 2.947940e-16, jfythli 3.623346e-17, sjhtzl 3.411774e-17, joflpw 2.758480e-17, uvztl 2.520039e-17, lavlfrvm 1.748524e-17, while 1.747938e-17, confidence 1.738987e-17, ywjorim 1.679626e-17

topic # 5
uircrmn 8.850934e-01, barclays 2.745948e-16, salmon 1.013610e-16, government 9.558924e-17, received 8.447307e-17, hours 8.078021e-17, court 7.419536e-17, travel 7.372786e-17, league 7.275761e-17, merseyside 7.198510e-17

topic # 6
skflfr 8.334114e-01, lavlfrvm 4.561791e-16, ywjorim 4.382041e-16, cvwfmrj 4.382041e-16, krktkt 4.126167e-16, utaujt 4.126167e-16, lcftj 3.769504e-16, taxyi 3.769504e-16, cyujw 3.769504e-16, lsksy 3.769504e-16

topic # 7
yjicyv 8.334114e-01, brighton 8.086643e-17, meyskens 5.765278e-17, white 4.939777e-17, emphasise 4.678263e-17, anxious 3.861623e-17, belgium 3.771096e-17, important 3.679099e-17, tickets 3.405656e-17, misunderstand 3.161114e-17

topic # 8
sflcpg 8.334114e-01, budget 4.750198e-17, domes 4.653915e-17, crush 4.478878e-17, event 4.470186e-17, compensation 4.360769e-17, detailed 4.334802e-17, brain 4.158248e-17, which 4.130543e-17, arguments 4.114549e-17

topic # 9
swrvja 8.334114e-01, vhtwefva 8.298114e-17, tfifiq 7.505693e-17, emfaj 6.856907e-17, cikiv 6.856907e-17, judgement 5.835493e-17, schemes 5.201091e-17, behalf 5.043767e-17, roger 4.794902e-17, defence 4.680745e-17

topic # 10
kaovi 7.613720e-01, contd 3.137082e-17, ctwvi 2.709474e-17, raymond 2.668085e-17, carter 2.554545e-17, otherham 2.488737e-17, indemnity 2.377327e-17, children 2.289848e-17, direction 2.255600e-17, traynor 2.220917e-17

topic # 11
ajudo 7.613720e-01, opdwd 2.482052e-16, persons 1.818546e-16, authority 1.079457e-16, royal 1.074050e-16, index 1.062596e-16, photcgraph 1.046602e-16, qiaslr 1.045700e-16, jaijo 9.553127e-17, luodt 8.268289e-17

topic # 12
downing 0.579697, ytufa 0.287647, inister 0.134422, tvihvsv 0.078043, vcoas 0.067134, oatcr 0.067134, stmotf 0.064715, arocl 0.059121, secretary 0.034207, ambulance 0.032477

topic # 13
ambulance 0.219331, control 0.177408, hospital 0.149227, ground 0.143579, incident 0.112497, station 0.108420, vehicle 0.094678, patients 0.084698, there 0.083653, downing 0.081564

topic # 14
coroner 0.234570, inquest 0.161868, resolved 0.126806, services 0.109531, working 0.099156, digitised 0.097361, evidence 0.092663, council 0.086339, party 0.085621, sincerely 0.083311

We can see some topics that make sense, for example:
- topic 0 - laboratory, research, services...
- topic 1 - police, authority, disaster, evidence, ..
- topic 2 - visitors, inquest, evidence, ..
- topic 8 - budget, domes, crush, event, compensation, brain, ..
- topic 13 - ambulance, control, hospital, ground, incident, vehicle, patients
- topic 14 - coroner, inquest, .. digitised, evidence, council ...
The good news is these topics are more varied now that we're looking across a wider more varied set of documents.
The bad news is that data quality is still causing problems with the analysis .. with words like sflcpg, ctwvi, cyujw and so on dominating the data.
So the lesson here is that we need to spend much more time cleaning the data. We'll do that next time.
Lesson - Data Quality
We've done a good job managing a huge data set, and seen how our text mining toolkit works well.
The main lesson here is that data quality matters. The poor quality of the original documents and the limited effectiveness of OCR combine to produce a dataset which is dominated by mostly meaningless text.
We need to deal with that next time.