Intuitive Text Mining: Exploring the Hillsborough Disclosure

In the last post we started to explore the Hillsborough Independent Panel's Disclosure of evidence, which we had prepared from the not-so-open data.

Poor Data Noise Overwhelms Our Algorithms

We found that our text mining toolkit ran into severe problems because many of the original documents were of poor readability (handwritten notes, degraded copies of copies, etc). The OCR process that turns scans of these documents into plain text is also far from perfect. That means the text we get from the published PDFs contains lots and lots of noise. That is bad for text mining - at best, it adds low-value data to the set, and at worst, it overwhelms our algorithms which try to separate out meaningful words. The sheer volume of noise also overwhelmed the memory limits of my laptop.

We tried a few drastic things like chopping out major contributors to the data set, such as the Home Office (HO*) or South Yorkshire Police (SYP*), in an attempt to reduce the memory load. That's bad, because these two make up much of the data set - about 3/4 of the files by number. It didn't help enough, so we ended up removing even more contributors...

We also tried brutal things like removing words which have 3 or more consecutive letters that are the same, in an attempt to remove the noise. That wasn't enough. We then tried removing words with 2 or more such consecutive letters .. which removes valid English words too, and yet junk words like sflcpg, ctwvi, cyujw still remained.
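As a rough illustration of that repeated-letter filter, here is a minimal sketch using regular expression backreferences. The function name and the sample words are made up for this example, and it is not necessarily how the toolkit itself implements the filter.

```python
import re

# Three or more identical consecutive letters, e.g. "aaaa" in "aaaargh".
TRIPLE = re.compile(r"([a-z])\1{2,}")
# Two or more identical consecutive letters, e.g. "ll" in "hill".
DOUBLE = re.compile(r"([a-z])\1")

def drop_repeats(words, pattern):
    """Keep only the words that do not contain the repeated-letter pattern."""
    return [w for w in words if not pattern.search(w)]

# Hypothetical sample mixing real words and OCR junk.
sample = ["police", "hill", "football", "aaaargh", "sflcpg", "ctwvi"]

kept_triple = drop_repeats(sample, TRIPLE)
kept_double = drop_repeats(sample, DOUBLE)
```

With this sample, the stricter two-letter rule throws away hill and football while sflcpg and ctwvi survive both filters, which is exactly the weakness described above.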

A New Approach: An English Word List

A new approach is to directly address the problem of noise and junk from the poor initial dataset. This approach is to only include words that are actually English words, and filter everything else out.

This feels a little like defeat as we didn't want to include manual steps like "stop words" in our pipeline, but in the face of such overwhelming noise, it seems a reasonable thing to do.
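The idea can be sketched in a few lines. This is only an illustration, assuming the documents have already been split into tokens and the word list has been loaded into a Python set; the names below are invented for the example.

```python
def keep_dictionary_words(tokens, dictionary):
    """Lowercase each token and keep it only if it is in the dictionary set."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t in dictionary]

# Hypothetical tiny dictionary and token stream.
english = {"police", "ground", "statement"}
tokens = ["Police", "sflcpg", "ground", "ctwvi", "statement"]

cleaned = keep_dictionary_words(tokens, english)
```

Using a set for the dictionary keeps each lookup cheap, which matters when the same check is applied to every token in a large corpus.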

There are a few sources of lists of English words, and in fact most unix/linux and macOS systems have a system word list at /usr/share/dict/words. That is a good start, and serves most purposes. These days, some of the open source spell check technologies include quite sophisticated dictionaries which include variants of words too, and even proper names (like Stephen) and popular place names (like Sheffield).

Our approach uses the aspell spell checker to create a word list at http://app.aspell.net/create. The words are processed with additional steps:

remove all 's from the end of any words that have it
lowercase everything
remove duplicates
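The three steps above can be sketched as follows. This is a minimal illustration with a hypothetical function name and sample input, not the exact script used to build the published list.

```python
def normalise_word_list(raw_words):
    """Strip a trailing 's, lowercase, and remove duplicates."""
    cleaned = set()
    for word in raw_words:
        word = word.strip()
        if word.endswith("'s"):
            word = word[:-2]      # remove the possessive ending
        word = word.lower()
        if word:                  # skip blank lines
            cleaned.add(word)     # the set removes duplicates
    return sorted(cleaned)

# Hypothetical extract of a raw spell-checker word list.
raw = ["Sheffield", "Sheffield's", "police", "Police", "Stephen's"]
word_list = normalise_word_list(raw)
```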

You can find this word list on the project's github page at

https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/dictionaries

The resultant word relevance data frame, without any organisations excluded, is 6.8 GB. That's a massive reduction on the 13+ GB we had previously, even with major chunks cut out.

Let's see what this cleaned up data looks like through the previous tools.

Occurrence - Simple Word Counts

Taking the top 10 most occurring words (without normalisation) gives us:

police 358160
there 251553
ground 243109
which 220052
other 188236
would 180488
should 136455
football 135171
people 134446
number 133143

That's the same as before when we didn't filter by dictionary words, because every word in the previous top 10 is a dictionary word. Again the main themes come through - police, ground, football, number .. and a sense of regretful would, should regarding hindsight.
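For readers who want to reproduce this kind of count, here is a minimal sketch using the standard library's collections.Counter. The splitting on whitespace and the sample documents are simplifying assumptions for this example.

```python
from collections import Counter

def count_words(documents):
    """Count how often each lowercased word occurs across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts

# Hypothetical mini corpus.
docs = ["police at the ground", "police statement", "police ground"]
top = count_words(docs).most_common(2)
```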

The word cloud is as follows. It's very similar to our previous one.

Word Relevance (TF-IDF)

Calculating the relevance to reduce the prominence of meaningless words, we get the following top 20 most relevant words:

police 83.451991
raised 79.235280
ground 66.290401
actions 65.596721
document 63.932343
action 61.090630
material 57.589417
number 51.046630
indexer 50.493850
instructions 48.442706
statement 47.873629
other 46.347452
football 46.155356
indicated 44.902818
people 43.851017
there 42.416433
would 41.164988
sheffield 40.631458
supporters 40.336179
stand 40.262084

This is different to the one we arrived at before. This should be a more representative one as we have calculated relevance over all the documents, not just a subset like we did before to try to fit within memory.
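As a reminder of what a relevance score of this kind measures, here is a minimal sketch of one common TF-IDF variant, summed over all documents. The exact formula is an assumption made for illustration; the toolkit's own weighting may well differ (for example, this variant gives zero to any word that appears in every document).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return each word's TF-IDF relevance summed over tokenised documents."""
    n_docs = len(documents)

    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    relevance = Counter()
    for tokens in documents:
        if not tokens:
            continue  # nothing to score in an empty document
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)                 # length-normalised term frequency
            idf = math.log(n_docs / doc_freq[word])  # zero if the word is in every document
            relevance[word] += tf * idf
    return relevance

# Hypothetical tokenised mini corpus.
docs = [["the", "police", "ground"], ["the", "police", "statement"], ["the", "stand"]]
scores = tf_idf(docs)
```

In this toy corpus the word "the" appears everywhere and scores zero, while rarer words such as "stand" score higher than the more widespread "police".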

We see very similar themes - but also some new ones. The second most relevant word is raised - which suggests a key action or verb in the document set, perhaps raised a complaint or concern. Again we see the, perhaps regretful in hindsight, would - what should and would have happened. The puzzling word is indexer - we need to look at the context in which indexer appears.

Let's look at the word cloud made from word relevance.

Co-Occurrence

We can now look again at word co-occurrence, and this time the entire data set is in scope, not just a subset. Here are the top 20 pairs.

word1 word2 weight
0 police police 1.000000
1 there there 0.673316
2 ground ground 0.650661
3 south yorkshire 0.595835
4 would would 0.568905
5 should should 0.442588
6 material police 0.422234
7 police officers 0.418765
8 action action 0.403173
9 which which 0.402548
10 south police 0.400237
11 stand stand 0.397440
12 ground police 0.394455
13 number number 0.372084
14 there people 0.372017
15 people people 0.369749
16 yorkshire police 0.360681
17 police officer 0.352889
18 material material 0.352867
19 ground there 0.352135

This isn't as informative as the visualisation as a graph of connected words... but again we see the themes of should/would and a new theme around the word material, which suggests we should explore its context in the corpus.
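To make the idea of a co-occurrence weight concrete, here is a minimal sketch that counts how often one word is followed by another within a small window, then scales the counts so the strongest pair has weight 1. The window size, the scaling and the threshold are assumptions for this example and are not taken from the toolkit.

```python
from collections import Counter

def cooccurrence(documents, window=2):
    """Weight ordered pairs (word1, word2) where word2 follows word1 within `window` words."""
    pairs = Counter()
    for tokens in documents:
        for i, word1 in enumerate(tokens):
            for word2 in tokens[i + 1 : i + 1 + window]:
                pairs[(word1, word2)] += 1
    top = max(pairs.values(), default=1)
    # Normalise so the most frequent pair has weight 1.0.
    return {pair: count / top for pair, count in pairs.items()}

# Hypothetical tokenised mini corpus.
docs = [
    ["south", "yorkshire", "police"],
    ["south", "yorkshire", "police", "officers"],
]
weights = cooccurrence(docs)

# Keep only the stronger links, as one might before drawing a graph.
strong = {pair: w for pair, w in weights.items() if w >= 0.75}
```

Each remaining pair can then be treated as an edge between two word nodes, with the weight controlling how prominently the link is drawn.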

Here's the co-occurrence graph.

We can see that:

the police are very central to much of the discussion
the expected themes are prominent - grounds, ambulance, injury, witness, Sheffield, ...
unexpected themes emerge - material, telephone, indexer, signature - suggesting avenues for further investigation

Filtering out lower co-occurrence scores gives us the most prominent themes:

We're looking at these visualisations and perhaps not being that impressed. But if we didn't know what Hillsborough was about, and didn't know the themes of the documents already, these kinds of analysis are really helpful.

Topic Extraction

We can try topic extraction again, this time over the entire dataset. Previously we were limited to a small subset (Home Office documents only), which limits the usefulness of topic extraction because of the lesser diversity of topics in a narrower dataset.

Before we dive in, it is useful to check the distribution of eigenvalues from the SVD decomposition to see if any significant topics did emerge. The following shows the overall view - and we can see a strong peak as well as a very long tail.

Zooming in shows four really strong topics, but the above shows that the next few are still significant compared to the full set.
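A check like this can be sketched with NumPy. What the post calls eigenvalues are, strictly, the singular values returned by the SVD. The small matrix below is a made-up stand-in for the real word-by-document relevance matrix.

```python
import numpy as np

# Hypothetical word-by-document relevance matrix (rows: words, columns: documents).
relevance = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.7, 1.0],
])

# compute_uv=False returns only the singular values, sorted largest first.
singular_values = np.linalg.svd(relevance, compute_uv=False)

# Share of the total squared weight carried by each singular value.
share = singular_values**2 / np.sum(singular_values**2)
```

Plotting singular_values against their index gives the kind of peak-and-long-tail picture described above, and share makes it easy to see how much the first few topics account for.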

Here are the top 15 topics:

topic # 0
raised 1.646954
actions 1.344083
document 1.159934
indexer 0.970770
instructions 0.953516
action 0.946094
indicated 0.901837
number 0.591718
statement 0.580177
receivers 0.508224
Name: 0, dtype: float64

topic # 1
reference 3.179260
extension 0.137096
telephone 0.114481
memorandum 0.113648
raised 0.110221
london 0.102715
scrutiny 0.097901
actions 0.089925
secretary 0.084932
queen 0.071101
Name: 1, dtype: float64

topic # 2
police 0.677746
ground 0.550280
material 0.545654
raised 0.396858
stand 0.350034
people 0.342326
south 0.333551
actions 0.311439
supporters 0.310529
there 0.298958
Name: 2, dtype: float64

topic # 3
secretary 0.572476
london 0.459885
justice 0.365538
material 0.355462
taylor 0.308684
ground 0.307226
inquiry 0.292526
disaster 0.246981
stand 0.236582
queen 0.227525
Name: 3, dtype: float64

topic # 4
chambers 1.377323
castle 0.226434
tract 0.062330
secretary 0.051054
attorney 0.030952
justice 0.028493
general 0.024988
taylor 0.024985
inquiry 0.024941
street 0.021848
Name: 4, dtype: float64

topic # 5
material 0.363232
inquiry 0.338313
property 0.252636
number 0.237683
officers 0.207103
people 0.204529
briefly 0.195419
message 0.194922
secretary 0.182719
justice 0.182356
Name: 5, dtype: float64

topic # 6
statement 0.307106
property 0.281948
signed 0.280987
visitors 0.274388
people 0.255564
witness 0.253254
signature 0.227993
midlands 0.203407
continuation 0.202329
court 0.199706
Name: 6, dtype: float64

topic # 7
visitors 1.186298
business 0.127914
property 0.100223
statement 0.078688
number 0.069357
witness 0.062938
signed 0.062578
exhibit 0.054888
court 0.053854
photo 0.052688
Name: 7, dtype: float64

topic # 8
property 0.295102
telephone 0.265546
yorkshire 0.262619
police 0.249554
subject 0.246568
south 0.221909
number 0.219346
inquiry 0.195477
halley 0.181322
constable 0.176241
Name: 8, dtype: float64

topic # 9
halley 1.168691
property 0.046086
telephone 0.041354
yorkshire 0.041154
subject 0.038612
police 0.038588
south 0.034380
number 0.034135
inquiry 0.031168
constable 0.027442
Name: 9, dtype: float64

topic # 10
secretary 0.380099
private 0.255705
message 0.247981
telephone 0.213645
action 0.209681
london 0.197113
material 0.163540
number 0.153276
premises 0.145124
street 0.144161
Name: 10, dtype: float64

topic # 11
secretary 0.462739
private 0.388733
telephone 0.303815
message 0.264320
sheffield 0.207366
inquiry 0.181807
signed 0.165700
street 0.162892
costs 0.143109
midlands 0.134161
Name: 11, dtype: float64

topic # 12
hammond 1.048282
commences 0.217295
secretary 0.062752
private 0.038406
london 0.031946
evidence 0.030058
midlands 0.027553
support 0.027198
signed 0.024466
inquiry 0.023950
Name: 12, dtype: float64

topic # 13
message 0.250191
subject 0.236295
street 0.224257
london 0.219849
people 0.205558
downing 0.199686
action 0.195125
signed 0.181644
telephone 0.164518
report 0.133955
Name: 13, dtype: float64

topic # 14
arrive 1.033862
secretary 0.092545
private 0.087563
subject 0.053544
sheffield 0.050590
justice 0.042844
london 0.042612
message 0.038389
general 0.037748
attorney 0.037600
Name: 14, dtype: float64
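Listings like these can be produced by ranking the words in each topic vector from the SVD. The sketch below is only an illustration on a made-up matrix: scaling each topic's word vector by its singular value and ranking by magnitude are assumptions, not a confirmed description of the toolkit.

```python
import numpy as np

# Hypothetical vocabulary and word-by-document relevance matrix.
words = ["police", "ground", "secretary", "london"]
relevance = np.array([
    [2.0, 1.8, 0.0, 0.0],   # police
    [1.6, 2.0, 0.0, 0.0],   # ground
    [0.0, 0.0, 1.0, 0.9],   # secretary
    [0.0, 0.0, 0.7, 1.0],   # london
])

# Columns of U are the topics expressed as weights over words.
U, S, Vt = np.linalg.svd(relevance, full_matrices=False)

def top_words(topic, n=2):
    """Return the n words with the largest magnitude in the given topic."""
    weights = np.abs(U[:, topic]) * S[topic]
    order = np.argsort(weights)[::-1][:n]
    return [(words[i], float(weights[i])) for i in order]

topic_0 = top_words(0)
topic_1 = top_words(1)
```

In this toy example the first topic is dominated by police and ground, and the second by secretary and london, mirroring the way distinct groups of documents surface as separate topics.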

Let's look at these topics:

topic 0 - seems to be about instructions issued to the police, their receivers, the actions, and statements about those actions
topic 1 - seems to be about subsequent scrutiny or narrative from a more London-centric perspective
topic 2 - is more about the supporters and the stands and their relation to the police
topic 3 - is much more about the subsequent inquiry by Justice Taylor into the ground and stands
topic 4 - is very much about the legal aspects - attorney, chambers, inquiry, tract
topic 6 - is more about witnesses, signed and signatures, visitors
...

Overall these topics don't appear to be as distinct as those extracted from the Iraq Report or the test mixed set. This is because the data is poorer in quality, and because the dominating themes of the Hillsborough dataset are very similar.

Further Investigation - Material, Indexer

The above has prompted us to look further at the context of the following words, identified above as significant:

material
indexer
signature

We can even use our own search engine to find the most matching documents.

Looking at the top few results we see that the word material is used on witness statements, which explains why it is prominent. The top document shows an example of material (CCTV video) being submitted:

Searching for indexer, and looking at the results tells us the word is just part of a common form. The same applies to the word signature .. as might now be expected!
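A search of this kind can be approximated by ranking documents on the relevance scores of the query words. The sketch below assumes a simple index mapping each word to per-document relevance; the index contents and document names are invented, and the toolkit's own search may rank differently.

```python
def search(query, index):
    """Rank documents by the summed relevance of the query words.

    `index` maps word -> {document name: relevance score}.
    """
    scores = {}
    for word in query.lower().split():
        for doc, relevance in index.get(word, {}).items():
            scores[doc] = scores.get(doc, 0.0) + relevance
    # Highest total relevance first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical relevance index.
index = {
    "material": {"doc_a.txt": 0.9, "doc_b.txt": 0.2},
    "indexer": {"doc_b.txt": 0.5},
}
results = search("material", index)
```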

Iterative Text Mining Process

A good text mining process would be iterative: having learned what we have, we would re-run the analysis with these words added to the stop list, and also re-introduce the word Hillsborough, as it is not in the English dictionary we used.
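One way to picture that next iteration is as a small adjustment to the allowed vocabulary before re-running the pipeline. The helper name and the tiny dictionary below are hypothetical, used only to illustrate the step.

```python
def refine_vocabulary(dictionary, stop_words, extra_words):
    """Add domain words to the dictionary, then remove the stop words."""
    return (set(dictionary) | set(extra_words)) - set(stop_words)

# Hypothetical starting dictionary.
dictionary = {"police", "ground", "material", "indexer", "signature"}

vocabulary = refine_vocabulary(
    dictionary,
    stop_words={"material", "indexer", "signature"},  # form boilerplate found above
    extra_words={"hillsborough"},                     # missing from the English word list
)
```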

Intuitive Text Mining

Thursday, 20 April 2017

Exploring the Hillsborough Disclosure - Part 2/2