Sloooooow ...
The indexing for the small Italian Recipes data set was quick as a flash. The bigger Iraq Inquiry report took a few minutes ... but the Clinton emails took ages. After 12 hours, still less than half the documents had been indexed.
This is not good, especially if we want to experiment and change things quickly.
Optimising Indexing
On this journey we want to develop ideas, algorithms and code which prioritise clarity and simplicity. We don't want to get into overly-complex, deeply sophisticated stuff ... that's not the idea here. The idea is to understand the basics. So for this reason, we started with an approach to indexing that was super simple:
for every document in a corpus ...
    load previously saved index (if any)
    go through all the words in the document ...
        add words to index
    save updated index
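To make that concrete, here is a minimal Python sketch of the idea, assuming plain-text documents in a corpus folder and a pandas dataframe index of word counts (words as rows, documents as columns) stored with pickle. The file names and layout are illustrative, not the actual project code.

import os
import glob
import collections
import pandas

index_file = "index.pickle"

for document_file in glob.glob("corpus/*.txt"):
    # load previously saved index (if any)
    index = pandas.read_pickle(index_file) if os.path.exists(index_file) else None

    # go through all the words in the document, counting occurrences
    with open(document_file) as f:
        counts = collections.Counter(f.read().lower().split())

    # add the words to the index as a new column for this document
    new_column = pandas.Series(counts, name=document_file)
    index = new_column.to_frame() if index is None else index.join(new_column, how="outer")

    # save the updated, ever-growing index
    index.to_pickle(index_file)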
That was conceptually simple; we can all understand picking up an index and updating it with new content from new documents.
But you can also see that as the corpus gets larger, we're loading and saving an ever bigger index. In much of computing, doing stuff with files is slow and expensive.
Can we make this more efficient? With less file loading and saving?
Well, we can't avoid going through all the text content because we do need to index it.
But we can avoid loading and saving ever bigger index files to update them. Instead we index each content text file separately, and afterwards we merge the index files.
for every document in a corpus ...
    go through all the words in the document ...
        add words to document index
    save document index

start with an empty corpus index
for every document index in corpus ...
    load document index and join with corpus index
save corpus index
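Again as a rough sketch, under the same illustrative assumptions as before, the two-stage version might look something like this:

import glob
import collections
import pandas

# stage 1 - index every document separately, saving one small index file per document
for document_file in glob.glob("corpus/*.txt"):
    with open(document_file) as f:
        counts = collections.Counter(f.read().lower().split())
    document_index = pandas.Series(counts, name=document_file).to_frame()
    document_index.to_pickle(document_file + ".index.pickle")

# stage 2 - start with an empty corpus index and merge each document index into it
corpus_index = None
for index_file in glob.glob("corpus/*.index.pickle"):
    document_index = pandas.read_pickle(index_file)
    corpus_index = document_index if corpus_index is None else corpus_index.join(document_index, how="outer")

corpus_index.to_pickle("corpus_index.pickle")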
There will still be loads of file loads and saves ... but the files won't be getting bigger and bigger.
There is a downside to this - we lose the ability to incrementally update an index. We now have to regenerate the index for an entire corpus. Given the speed-up for large corpora, this is worth it.
Results - Zooooom!
This results in way faster indexing! Instead of hours and hours with only about half the Clinton emails done ... they were all indexed in minutes. The merging took longer, but not hours. This is to be expected, as the corpus index does grow larger and larger, but we're now doing that merge without saving and loading the index as we progress through every document.
The Code
The updated code for indexing in this new way is always at github.
Pickling vs HDF5
During the coding it turned out that the simplest way of storing pandas dataframes in a file - known as pickling - didn't work for larger dataframes. This is a known bug, and seems to affect Python 3.5 on Mac OS X. This was a perfect prompt to look for potentially better formats or ways of storing index dataframes. For now I've settled on HDF5 format data files - a mature format. It might be more complex than we need, because the format allows things like querying and concurrent access, but it is also simple enough and works for larger dataframes.
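As a minimal sketch, assuming pandas with the PyTables package installed, swapping pickle for HDF5 only changes the save and load calls (the file and key names here are illustrative):

import pandas

index = pandas.DataFrame({"doc1.txt": {"spaghetti": 3, "tomato": 1}})

# write the dataframe into an HDF5 file under a named key
index.to_hdf("index.hdf5", key="index", mode="w")

# read it back later
index = pandas.read_hdf("index.hdf5", key="index")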