Exploring data is always more fun when the data sets are interesting in themselves.
So I've converted two "hot topic" data sets into plain text:
- The Chilcot Iraq Inquiry Report, which examines whether it was right to go to war, and whether the war and its aftermath could have been better planned for.
- The Hillary Clinton emails. Her use of a personal email server for official business led to controversy, and a redacted set of emails was released. A version is available on Kaggle.
The Iraq Inquiry report is in PDF form, which is not ideal for text analytics. I've extracted the text using the open source "pdftotext" utility, with an attempt to preserve the text flow layout.
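For anyone who wants to repeat the PDF conversion, here is a minimal sketch in Python. It assumes "pdftotext" is installed and on your path, and it uses the `-layout` option, which asks the tool to keep the physical layout of each page. The exact options I used aren't recorded here, so treat the flag, the function names and the folder names as illustrative assumptions rather than the definitive recipe.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the command line for converting one PDF to plain text.

    Assumption: -layout is the option used to preserve the text layout.
    """
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir, txt_dir):
    """Convert every PDF in pdf_dir to a .txt file of the same name in txt_dir."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        # check=True raises an error if pdftotext fails on a file
        subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
```

Calling `convert_folder("iraq_inquiry_pdf", "iraq_inquiry_txt")` would then produce one text file per PDF. Both folder names are hypothetical.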
The Clinton emails are provided as an SQLite database or as a CSV file. I've extracted the "RawText" because the provided "ExtractedBodyText" hasn't been extracted correctly in some cases. The plain text files are named after the DocumentNumber.
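The extraction from the SQLite database can be sketched roughly as follows. The table name "Emails" and the column name "DocNumber" are my assumptions about the Kaggle schema, so check them against your copy of the database and pass different names if needed. The function name and its parameters are just for illustration.

```python
import sqlite3
from pathlib import Path


def export_raw_text(db_path, out_dir, table="Emails",
                    id_column="DocNumber", text_column="RawText"):
    """Write one plain text file per email, named after its document number.

    Assumption: the table and column names match the Kaggle release.
    Returns the number of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    query = f'SELECT "{id_column}", "{text_column}" FROM "{table}"'
    written = 0
    connection = sqlite3.connect(str(db_path))
    try:
        for doc_number, raw_text in connection.execute(query):
            # skip rows with no document number or no raw text
            if not doc_number or raw_text is None:
                continue
            (out_dir / f"{doc_number}.txt").write_text(raw_text, encoding="utf-8")
            written += 1
    finally:
        connection.close()
    return written
```

For example, `export_raw_text("database.sqlite", "clinton_emails_txt")` would fill the output folder with files such as one per DocumentNumber. Both the database file name and the folder name are hypothetical.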
Here are the links on GitHub:
- The Iraq Inquiry Report as plain text files: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/iraq_inquiry/
- The Clinton Emails as plain text files (zipped): https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/clinton_emails
I may update the Iraq Inquiry Report to also include the additional evidence documents.
Have fun!