Monday, 11 July 2016

Clinton Emails and Chilcot Iraq Inquiry Report in Plain Text

Exploring data is always much more fun when the data set is interesting in its own right.

So I've converted two "hot topic" data sets into plain text:

  • The Chilcot Iraq Inquiry Report, which examines whether it was right to go to war, and whether the war and its aftermath could have been better planned for.
  • Hillary Clinton's emails. Her use of a personal email server for official business led to controversy. A redacted set of emails was released, and a version is available on Kaggle.

The Iraq Inquiry report is in PDF form, which is not ideal for text analytics. I've extracted the text using the open source "pdftotext" utility, with an attempt to preserve the text flow layout.
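For anyone who wants to repeat the conversion, here is a minimal sketch of how it could be scripted in Python. It assumes "pdftotext" is installed and on the PATH, and that the "-layout" option is a reasonable way to try to keep the text flow; the exact options I used are not recorded here, and the function names are just illustrative.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for one pdftotext call."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Assumption: -layout asks pdftotext to keep the physical page layout.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_pdf(pdf_path, out_dir):
    """Convert one PDF into a .txt file of the same name inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    txt_path = out_dir / (Path(pdf_path).stem + ".txt")
    # check=True raises CalledProcessError if pdftotext reports a failure.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling convert_pdf once per downloaded report PDF would give one plain text file per volume.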

The Clinton emails are provided as an SQLite database or as a CSV file. I've extracted the "RawText" field because the extraction behind the provided "ExtractedBodyText" field hasn't worked in some cases. Each plain text file is named after the email's DocumentNumber.
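The extraction from the CSV file can be sketched roughly as follows. The column names "DocumentNumber" and "RawText" come from the data set itself; the function name, the ".txt" extension and the output folder are assumptions made for illustration.

```python
import csv
from pathlib import Path


def export_raw_text(csv_path, out_dir):
    """Write the RawText of every email to out_dir/<DocumentNumber>.txt."""
    # Raw emails can be longer than the csv module's default field limit.
    csv.field_size_limit(10**9)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doc_number = row["DocumentNumber"].strip()
            if not doc_number:
                continue  # skip rows that have no usable file name
            target = out_dir / f"{doc_number}.txt"
            target.write_text(row["RawText"], encoding="utf-8")
            written += 1
    return written
```

The function returns the number of files written, which is a quick sanity check against the number of rows in the CSV.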

Here are the links on GitHub:

I may update the Iraq Inquiry Report to also include the additional evidence documents.

Have fun!
