Intuitive Text Mining: Exploring the Hillsborough Disclosure

In the last post we talked about extracting the raw text data from the PDFs made public by the Hillsborough Independent Panel as part of its collation and review of evidence about the disaster from 89 organisations.

The Data

That data is fairly chunky:

19,217 text files
a total of 874,408K in size .. or about 853MB
an average size of 45.5K

Here's a breakdown of the documents from each organisation. You can see that most of the documents came from the South Yorkshire Police (SYP), the Home Office (HOM) and the Department for Culture, Media and Sport (CMS).

Organisation Count
SYP 10078
HOM 3816
CMS 1073
FFA 413
CPS 409
SYC 345
SPP 267
YAS 259
LCS 229
COO 200
AGO 191
...
...
PCC 1
LHC 1
Grand Total 19216
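
A breakdown like this can be produced straight from the file names. Here is a minimal sketch, assuming the first three characters of each file name identify the organisation (as in HOM000049500001.txt). The function name count_files_by_prefix is made up for illustration and is not part of the toolkit.

```python
from collections import Counter
from pathlib import Path


def count_files_by_prefix(directory, prefix_length=3):
    """Count .txt files grouped by the first few characters of their name."""
    counts = Counter()
    for path in Path(directory).glob("*.txt"):
        counts[path.name[:prefix_length]] += 1
    return counts
```

Calling count_files_by_prefix("data_sets/hillsborough/txt/").most_common() would then list the organisations from largest to smallest.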

The following chart makes it easier to understand these numbers.

Here we'll explore it with some of the tools we've developed.

Word Cloud - Simple Word Counts

A simple, and early, tool we developed was the word cloud, which shows the most frequently occurring words to give an initial feel for what the text data set is about.

Before we create a word cloud chart, we need to clean the data. Here are the steps we're already familiar with:

simplify whitespace
only keep alphanumeric characters
lowercase
split into words
remove stop-words from manual list
only keep words of minimum length 5 (assumes longer words will be more interesting), and reduce
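
As a reminder of what those steps look like in code, here is a minimal standalone sketch. It is not the toolkit's own implementation: the function name clean_text and the tiny stop-word list are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"about", "their", "these"}


def clean_text(text, stop_words=STOP_WORDS, min_length=5):
    """Apply the cleaning steps in order and return the surviving words."""
    text = re.sub(r"\s+", " ", text)            # simplify whitespace
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text)  # only keep alphanumeric characters
    text = text.lower()                         # lowercase
    words = text.split()                        # split into words
    words = [w for w in words if w not in stop_words]   # remove stop-words
    words = [w for w in words if len(w) >= min_length]  # keep longer words only
    return words


sample = "The  Police were at the\nground; about 1,000 supporters!"
word_counts = Counter(clean_text(sample))
```

The simple word counts are then just word_counts.most_common(10).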

The top 10 most occurring words are:

police 358160
there 251553
ground 243109
which 220052
other 188236
would 180488
should 136455
football 135171
people 134446
number 133143

There are words in there that aren't that informative. Remember, that's why we moved to a measure of interesting-ness (TF-IDF), to ensure boring words aren't so prominent. Despite this, the list does give us a feel for the text corpus - police, ground, football, number .. all relevant themes. Here's the word cloud.

Again - there are informative words in there, which we know are relevant to the history and events of the disaster - police, stand, south, crowd, action, turnstiles, evidence, witness, supporters, pitch.

The word cloud is often derided - but is very simple and very effective.

Word Cloud - Relevance (TF-IDF)

Working out the relevance to reduce the effect of boring words blows up the memory of my computer - the dataset is too large. I'll need to improve the code in future, or shift to a system that doesn't hold everything in memory at once, like Dask for Python.

So for now, we'll chop the dataset up by selecting only documents that come from a specific organisation. This is easy because the text files have a prefix which identifies their origin. The Home Office files are prefixed with a HOM .. like HOM000049500001.txt, for example.

The selection is done using the corpus reader as follows:

cr = tmt.corpus_reader.CorpusReader(content_directory="data_sets/hillsborough/txt/", text_filename_pattern="HOM*.txt")

Here's the top 20 list of most relevant words:

police 13.652282
inquiry 12.031492
secretary   11.218956
justice   10.757810
football 9.901103
letter 9.803244
yorkshire   9.501367
evidence 9.079484
hillsborough 8.938162
london 8.917985
taylor 8.695437
would   8.640892
disaster 8.519620
scrutiny 8.362206
reference   8.135450
there   7.996140
authority   7.765933
should 7.601597
south   7.530626
report 7.525849

That's a much much better list of most relevant terms.
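
For anyone who wants a refresher on how a relevance score pushes down words that appear everywhere, here is a minimal sketch of one common textbook form of TF-IDF. The toolkit's exact weighting may well differ (its list above still contains words like would and there), so treat the formula and the name tf_idf_scores as assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return a corpus-wide relevance score per word.

    documents: list of word lists, one list per document.
    """
    n_docs = len(documents)

    # document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))

    scores = Counter()
    for words in documents:
        if not words:
            continue
        for word, count in Counter(words).items():
            tf = count / len(words)                    # how common in this document
            idf = math.log(n_docs / doc_freq[word])    # how rare across documents
            scores[word] += tf * idf
    return scores
```

In this form a word that appears in every document gets a score of zero, however often it is used.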

Let's visualise the word cloud of relevant terms.

The words included here are much more relevant and we can judge this because it is a subject we're fairly familiar with. The police are a major theme, for instance, and for good reason.

Co-occurrence

Again focussing on the HOM subset (because the entire set breaks my computer's memory) we can apply our co-occurrence tool.

Here are the top 20 most co-occurring words:

word1 word2 weight
0 there there 1.000000
1 would would 0.974656
2 police police 0.866106
3 south yorkshire 0.681754
4 should should 0.671855
5 football football 0.666573
6 which which 0.566681
7 justice taylor 0.520375
8 police officers 0.503252
9 there which 0.494395
10 there people 0.492280
11 there would 0.490765
12 ground ground 0.486991
13 which would 0.468009
14 would there 0.465085
15 yorkshire police 0.445687
16 people people 0.439034
17 justice stuartsmith 0.432644
18 people there 0.425958
19 which there 0.418774

That contains word pairs where words are both the same .. I need to fix that! But also highlighted are pairs which are very informative about the data set.
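
One possible way to drop the same-word pairs is to skip them while counting. This is a minimal sketch of windowed co-occurrence counting, not the toolkit's actual code: the function names, the window size and the skip_self_pairs flag are all assumptions for illustration.

```python
from collections import Counter


def cooccurrence_counts(words, window=2, skip_self_pairs=True):
    """Count ordered (word1, word2) pairs where word2 follows word1 within `window` words."""
    counts = Counter()
    for i, word1 in enumerate(words):
        for word2 in words[i + 1 : i + 1 + window]:
            if skip_self_pairs and word1 == word2:
                continue  # ignore pairs like (police, police)
            counts[(word1, word2)] += 1
    return counts


def normalise(counts):
    """Scale counts so the largest weight is 1.0."""
    if not counts:
        return {}
    top = max(counts.values())
    return {pair: count / top for pair, count in counts.items()}
```

Keeping only the strongest links is then a one-line filter over the normalised weights, for example keeping pairs with a weight over 0.2.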

It is interesting that there is a lot of use of the conditional future ... there would .. should ... which would. This suggests the material is discussing what should have happened, after the fact, in an almost apologetic way.

The graph of linked nodes representing co-occurring words should be interesting:

So what can we see here? We can see that:

The word police is at the centre of many relationships - so the police are a very pertinent and relevant theme of the evidence. This is in fact true of the disaster, where many of the inquiries have been into the role of the police. That's a powerful revelation by the chart, if we didn't know this before.
Locations are important too .. Liverpool, Midlands, Sheffield, Yorkshire.
Again the words would and should are central, reflecting the regretful view of hindsight.

Let's take only the most co-occurring words, with normalised scores of over 0.2.

Colours have been added to the groupings to make them clearer. We can see some themes already:

Ground sports safety
Lord Justice Stuart-Smith and Justice Taylor inquiries and reports
Chief constable
Police control, authority and evidence.

We should really apply this to other organisations of the data set .. but first let's crack on with the other analyses and come back later.

Document Similarity

We'll skip over the document similarity for now because we're only looking at the Home Office documents. If we were doing a broader analysis across different organisations that would be much more interesting for a document similarity map.

Topic Extraction

The latest tool we developed was the extraction of topics using singular value decomposition (SVD). It worked rather well for the Iraq Report. Let's see how it does for the Home Office Hillsborough documents.

The first thing to check when extracting topics is the distribution of SVD eigenvalues:

Ok, there are a lot of eigenvalues here! Luckily the first few seem to be much more significant than the long tail. Let's zoom into the first few:

That's better. The first two eigenvalues are much larger than the rest. The next 2 are also significant. The next dozen or so are worth looking at, but beyond that we're into the long tail.
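
To make the idea of inspecting these values concrete, here is a small sketch using NumPy on a made-up word-by-document matrix. The numbers are invented for illustration and this is not the toolkit's own code. Strictly speaking the SVD returns singular values, and their squares are the eigenvalues of the matrix multiplied by its own transpose, which is why the two terms get used loosely.

```python
import numpy as np

# Hypothetical word-by-document relevance matrix (rows = words, columns = documents).
matrix = np.array([
    [2.0, 1.5, 0.1],
    [1.0, 1.2, 0.0],
    [0.5, 0.8, 0.2],
    [0.0, 0.1, 1.0],
])

# full_matrices=False gives the compact decomposition: matrix = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(matrix, full_matrices=False)

# NumPy returns the singular values already sorted, largest first.
# Their relative size suggests how many topics are worth looking at.
share = S / S.sum()
for index, (value, fraction) in enumerate(zip(S, share)):
    print(f"topic {index}: singular value {value:.3f} ({fraction:.1%} of the total)")
```

Each column of U then gives the word weights for one topic, in the same order as the singular values.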

Let's see what the top 10 topics actually are:

topic # 0
inquiry 0.238646
police 0.233339
secretary 0.228070
justice 0.213491
letter 0.200483
scrutiny 0.198494
yorkshire 0.195255
london 0.186311
reference 0.180697
hillsborough 0.175345
Name: 0, dtype: float64

topic # 1
rfctr6fltyj 9.754874e-01
statement 9.668067e-17
stand 9.368396e-17
report 7.629219e-17
people 6.985798e-17
private 6.167880e-17
ground 5.815512e-17
recommendations 5.815209e-17
recommendation 5.489136e-17
yorkshire 5.228198e-17
Name: 1, dtype: float64

topic # 2
chevf 7.613946e-01
superintendent 1.157154e-16
rover 1.024354e-16
leppings 8.931737e-17
aoorv 7.558747e-17
dspys 7.558747e-17
cuxjo 7.558747e-17
chapman 6.959515e-17
football 6.090088e-17
trapped 5.805032e-17
Name: 2, dtype: float64

topic # 3
cecic 7.613946e-01
submission 1.025629e-16
psl4310 8.894763e-17
heard 8.755130e-17
tickets 8.643075e-17
dated 7.646144e-17
april 7.373294e-17
tragedy 7.347410e-17
early 7.005352e-17
thank 6.976673e-17
Name: 3, dtype: float64

topic # 4
reference 0.268700
scrutiny 0.239448
midlands 0.193232
authority 0.158826
police 0.158758
costs 0.146980
yorkshire 0.136030
london 0.127953
stuartsmith 0.124732
south 0.113730
Name: 4, dtype: float64

topic # 5
reference 0.215533
midlands 0.204680
yorkshire 0.185533
costs 0.145715
south 0.136164
football 0.130660
authority 0.125776
scrutiny 0.106040
ground 0.101162
safety 0.084994
Name: 5, dtype: float64

topic # 6
ikwmmiwr 4.607135e-01
jmnmiir 4.425598e-01
liverpool 1.023169e-16
semifinal 7.846477e-17
manchester 7.286901e-17
paragraph 7.279151e-17
meeting 7.272110e-17
taylor 7.171484e-17
authority 7.121209e-17
submission 6.293662e-17
Name: 6, dtype: float64

topic # 7
reference 0.212675
inquiry 0.158190
secretary 0.134060
whalley 0.127025
scrutiny 0.116281
private 0.100031
london 0.094764
ground 0.092146
people 0.088364
letter 0.086487
Name: 7, dtype: float64

topic # 8
18aug1989 4.732789e-01
ifcrl 3.806973e-01
provide 3.862433e-17
stuartsmiths 3.757920e-17
states 3.699572e-17
bodies 3.634809e-17
central 3.534114e-17
community 3.329393e-17
lloyd 3.059613e-17
constabulary 3.041726e-17
Name: 8, dtype: float64

topic # 9
reference 0.252258
private 0.118237
football 0.117194
evidence 0.111125
extension 0.102815
scrutiny 0.091190
safety 0.087857
stuartsmith 0.085739
telephone 0.084538
memorandum 0.075057
Name: 9, dtype: float64

Let's look at these topics:

topic 0 seems to be about the inquiry and scrutiny into the police, involving the secretary of state, seeking justice as a theme. That's a good topic to extract!
topic 1 seems to be related to safety recommendations about the stands and grounds after the disaster
topic 2 seems to be about the role of the chief superintendent, and that role in people being trapped, in relation to the Leppings Lane stand.
..

These topics are somewhat concrete, but some seem to be similar varying by a relatively small factor. This is likely because the Home Office documents are probably all about a similar set of themes - and as such it is difficult to extract very different topics .. because they aren't there!

A cross-organisation analysis would more likely extract different topics, just like we saw with the Iraq Report.

Note also that the topic words are polluted by non-English words, which are there because of the optical character recognition (OCR) process that tries to convert scanned images, often badly formed, into text.

Reduced Corpus To Ease Memory Pressure

We were forced to take only the Home Office documents because the entire set, and indeed just the South Yorkshire Police documents, broke the memory limits of my laptop with 16GB RAM!

Let's try a broader exploration by including all the documents except the HOM and SYP sets. The easiest way to do this is to move all SYP*.txt and HOM*.txt files to a subdirectory, because the Python glob() doesn't support patterns that exclude files.
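
An alternative to moving files around is to glob everything and filter the results in Python. Here is a minimal sketch, assuming the organisation prefix is the start of the file name. The function list_files_excluding is hypothetical, and the corpus reader as shown earlier takes a directory and a pattern rather than a file list, so using this would need a change to the reader that isn't covered here.

```python
from pathlib import Path


def list_files_excluding(directory, excluded_prefixes, pattern="*.txt"):
    """Return sorted paths matching `pattern` whose names do not start with an excluded prefix."""
    excluded = tuple(excluded_prefixes)  # str.startswith accepts a tuple of prefixes
    return sorted(
        path for path in Path(directory).glob(pattern)
        if not path.name.startswith(excluded)
    )
```

For example, list_files_excluding("data_sets/hillsborough/txt/", ["HOM", "SYP"]) would return every text file except the Home Office and South Yorkshire Police ones.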

Trying that, the memory explodes again, so we exclude the CMS* files too.

The top 20 relevant words make sense:

police 11.081260
football 10.203385
sheffield 9.313793
hillsborough 9.238009
would 9.190118
liverpool 8.361010
there 8.161893
letter 8.041335
meeting 7.611749
report 7.499404
which 7.458836
ground 7.424201
evidence 7.025812
should 6.927943
telephone 6.765532
authority 6.662008
coroner 6.653248
disaster 6.570584
secretary 6.516341
committee 6.341970

Aside from the expected words, there is an interesting word in there: telephone. Maybe telephony was an important aspect of the events?

A word cloud of the relevant words is interesting too:

We see another aspect coming through - safety, coroner, director, street...

Trying to extract topics again blows up the memory so we exclude more subsets, this time the FFA* and CPS* documents. But that didn't work either .. memory still blew up!

So let's have another look at the data. We notice there are lots and lots of words which are junk with repeated characters like AAAAA and 00000 and zzzzzz. A good filter would be to remove words which have n or more repeated characters. So here it is, added to the word_processing module:

# remove word with n repeated characters
import re

def remove_words_with_n_repeated_chars(input_words_list, n):
    # match words with a character repeated n or more times in a row,
    # anywhere in the string (re.match only matches from the start, so use search)
    # the back-reference \1 repeats the captured character, hence (n - 1)
    regex = re.compile(r'(.)\1{' + str(n - 1) + r',}')
    output_text = [word for word in input_words_list if not regex.search(word)]
    return output_text

There aren't many (if any) English words with 3 or more consecutively repeated characters, so let's add this filter at the top of the pipeline and see if that helps. The filter is applied as follows:

# remove words which have a character consecutively repeated n=3 times or more
gl = tmt.word_processing.remove_words_with_n_repeated_chars(fl, 3)

That works to an extent but not by much! The memory used by the word count index is reduced from 13.8GB to 13.4GB .. so not a huge change.

Looking again at the data we see lots of numeric-only words .. so let's create a filter that removes numeric characters (which should be used only intentionally as numbers can be useful). Here it is, very similar to the keep_only_alphanumeric() function we've used before.

# keep only alpha (not numeric) characters
import re

def keep_only_alpha(input_text):
    # anything that isn't a letter or a space is removed
    regex = re.compile('[^a-zA-Z ]+')
    output_text = regex.sub('', input_text)
    return output_text

That seems to work a bit better. The memory consumption of the word index is now reduced from 13.8GB to 11.9GB. Still not a massive drop. Let's combine this with the removal of repeating characters. That reduces it to 11.6GB.

Time to get drastic!

Looking again at the data .. we still see nonsense words .. like:

aabac 0.0 0.0
aabalanceaan 0.0 0.0
aabalanceaanhanan 0.0 0.0
aabaw 0.0 0.0
aability 0.0 0.0
aabiscf 0.0 0.0
aabit 0.0 0.0
aabjjivujtzuy 0.0 0.0
aablt 0.0 0.0
aabout 0.0 0.0
aabove 0.0 0.0
aabroad 0.0 0.0
aabrook 0.0 0.0
aabtt 0.0 0.0
aabulance 0.0 0.0
aabulancenanwoman 0.0 0.0
aabulances 0.0 0.0
aabulanoe 0.0 0.0
aabularce 0.0 0.0
aabulonco 0.0 0.0
aaburo 0.0 0.0
aacabd 0.0 0.0
aacacxoiza 0.0 0.0
aaccmpaappll 0.0 0.0
aaccommodation 0.0 0.0
aaccord 0.0 0.0
aacdi 0.0 0.0
aacede 0.0 0.0
aacent 0.0 0.0

Let's get brutal and remove all words with 2 consecutive letters that are the same. This will remove valid English words .. but right now we can't process such huge data.
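
To get a feel for what this stricter filter costs, here is a small standalone sketch using the same regular expression idea as above. The helper name has_repeated_chars and the sample word list are made up for illustration.

```python
import re


def has_repeated_chars(word, n):
    """True if any character appears n or more times in a row."""
    return re.search(r"(.)\1{" + str(n - 1) + r",}", word) is not None


sample = ["football", "hillsborough", "committee", "police", "aabulance", "coroner"]

# with n=2 any doubled letter is enough to reject a word
kept = [word for word in sample if not has_repeated_chars(word, 2)]
dropped = [word for word in sample if has_repeated_chars(word, 2)]

print("kept:", kept)
print("dropped:", dropped)
```

The OCR junk like aabulance goes, but so do perfectly good words like football and hillsborough, which is worth keeping in mind when reading the topics that follow.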

Removing these words leads to a reduction to 8.2GB. The SVD calculation takes a while but succeeds without crashing .. it's a testament to the quality of Python and its open source libraries that they can crunch through an 8GB data frame to do a matrix decomposition.

Here are the resultant top 15 topics:

topic # 0
aunder 1.173536
newcastle 0.000598
august 0.000438
laboratory 0.000366
research 0.000347
services 0.000306
stephenson 0.000297
reference 0.000295
building 0.000281
telephone 0.000278
Name: 0, dtype: float64

topic # 1
police 0.201300
authority 0.151340
coroner 0.150925
report 0.149464
would 0.146609
disaster 0.143472
secretary 0.134563
there 0.131142
evidence 0.129285
which 0.129257
Name: 1, dtype: float64

topic # 2
visitors 0.970682
seiolisia 0.116385
seiojisia 0.116385
coroner 0.002195
police 0.002041
inquest 0.001848
would 0.001843
report 0.001822
evidence 0.001820
authority 0.001810
Name: 2, dtype: float64

topic # 3
sleisielie 9.637464e-01
stage 1.463056e-16
arvypftea 1.444467e-16
csbcjfcjl 1.444467e-16
jfdacjci 1.406117e-16
fictnizt 1.406117e-16
ctfyjzy 1.350711e-16
dstaqef 1.350711e-16
tsrigjt 1.350711e-16
sctenjs 1.350711e-16
Name: 3, dtype: float64

topic # 4
lcvtuow 8.850934e-01
acoruanw 2.947940e-16
jfythli 3.623346e-17
sjhtzl 3.411774e-17
joflpw 2.758480e-17
uvztl 2.520039e-17
lavlfrvm 1.748524e-17
while 1.747938e-17
confidence 1.738987e-17
ywjorim 1.679626e-17
Name: 4, dtype: float64

topic # 5
uircrmn 8.850934e-01
barclays 2.745948e-16
salmon 1.013610e-16
government 9.558924e-17
received 8.447307e-17
hours 8.078021e-17
court 7.419536e-17
travel 7.372786e-17
league 7.275761e-17
merseyside 7.198510e-17
Name: 5, dtype: float64

topic # 6
skflfr 8.334114e-01
lavlfrvm 4.561791e-16
ywjorim 4.382041e-16
cvwfmrj 4.382041e-16
krktkt 4.126167e-16
utaujt 4.126167e-16
lcftj 3.769504e-16
taxyi 3.769504e-16
cyujw 3.769504e-16
lsksy 3.769504e-16
Name: 6, dtype: float64

topic # 7
yjicyv 8.334114e-01
brighton 8.086643e-17
meyskens 5.765278e-17
white 4.939777e-17
emphasise 4.678263e-17
anxious 3.861623e-17
belgium 3.771096e-17
important 3.679099e-17
tickets 3.405656e-17
misunderstand 3.161114e-17
Name: 7, dtype: float64

topic # 8
sflcpg 8.334114e-01
budget 4.750198e-17
domes 4.653915e-17
crush 4.478878e-17
event 4.470186e-17
compensation 4.360769e-17
detailed 4.334802e-17
brain 4.158248e-17
which 4.130543e-17
arguments 4.114549e-17
Name: 8, dtype: float64

topic # 9
swrvja 8.334114e-01
vhtwefva 8.298114e-17
tfifiq 7.505693e-17
emfaj 6.856907e-17
cikiv 6.856907e-17
judgement 5.835493e-17
schemes 5.201091e-17
behalf 5.043767e-17
roger 4.794902e-17
defence 4.680745e-17
Name: 9, dtype: float64

topic # 10
kaovi 7.613720e-01
contd 3.137082e-17
ctwvi 2.709474e-17
raymond 2.668085e-17
carter 2.554545e-17
otherham 2.488737e-17
indemnity 2.377327e-17
children 2.289848e-17
direction 2.255600e-17
traynor 2.220917e-17
Name: 10, dtype: float64

topic # 11
ajudo 7.613720e-01
opdwd 2.482052e-16
persons 1.818546e-16
authority 1.079457e-16
royal 1.074050e-16
index 1.062596e-16
photcgraph 1.046602e-16
qiaslr 1.045700e-16
jaijo 9.553127e-17
luodt 8.268289e-17
Name: 11, dtype: float64

topic # 12
downing 0.579697
ytufa 0.287647
inister 0.134422
tvihvsv 0.078043
vcoas 0.067134
oatcr 0.067134
stmotf 0.064715
arocl 0.059121
secretary 0.034207
ambulance 0.032477
Name: 12, dtype: float64

topic # 13
ambulance 0.219331
control 0.177408
hospital 0.149227
ground 0.143579
incident 0.112497
station 0.108420
vehicle 0.094678
patients 0.084698
there 0.083653
downing 0.081564
Name: 13, dtype: float64

topic # 14
coroner 0.234570
inquest 0.161868
resolved 0.126806
services 0.109531
working 0.099156
digitised 0.097361
evidence 0.092663
council 0.086339
party 0.085621
sincerely 0.083311
Name: 14, dtype: float64

We can see some topics that make sense, for example:

topic 0 - laboratory, research, services...
topic 1 - police, authority, disaster, evidence, ..
topic 2 - visitors, inquest, evidence, ..
topic 8 - budget, domes, crush, event, compensation, brain, ..
topic 13 - ambulance, control, hospital, ground, incident, vehicle, patients
topic 14 - coroner, inquest, .. digitised, evidence, council ...

The good news is these topics are more varied now that we're looking across a wider, more varied set of documents.

The bad news is that data quality is still causing problems with the analysis .. with words like sflcpg, ctwvi, cyujw and so on dominating the data.

So the lesson here is that we need to spend much more time cleaning the data. We'll do that next time.

Lesson - Data Quality

We've done a good job managing a huge data set, and seen how our text mining toolkit works well.

The main lesson here is that data quality matters. The poor quality of the original documents and the limited effectiveness of OCR combine to give a dataset which is dominated by mostly meaningless text.

We need to deal with that next time.

Intuitive Text Mining

Monday, 17 April 2017

Exploring the Hillsborough Disclosure - Part 1/2