Monday 24 October 2016

Fixed - Faster Indexing with Pandas

After all the hard work over the recent posts investigating the slow performance of indexing with pandas .. we now have a settled fix.



Biggest Indexing Bottleneck

The biggest performance bottleneck was the indexing code:

# create index
# (word, [document_name]) dictionary, there can be many [document_names] in the list
words_ctr = collections.Counter(doc_words_list)
for w, c in words_ctr.items():
    #print("=== ", w, c)
    wordcount_index.ix[w, document_name] = c
    pass

# replace NaN with zeros
wordcount_index.fillna(0, inplace=True)

Pandas is not good at cell-level operations .. it is much better at whole-array operations. The above was not just a cell-level operation repeated for every word in the documents ... the way it was implemented in pandas effectively created a new copy of the dataframe and caused a reindex every time it was called.

The lesson here is to understand what pandas (or any other library) is good at ... and bad at ... and use performance profiling tools like %time, cProfile and pprofile to identify the actual hotspots.
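To make that concrete, here's a minimal sketch of timing the two patterns against each other and then profiling the slow one with cProfile. The word list and the 'doc1' column name are made up for illustration, and the slow loop uses .loc rather than the old .ix accessor, which has since been removed from pandas.

import collections
import cProfile
import timeit

import pandas

# made-up word list: 500 distinct words, 5000 words in total
doc_words_list = ['word%d' % (i % 500) for i in range(5000)]

def index_cell_by_cell():
    # one dataframe cell written per distinct word - the slow pattern
    df = pandas.DataFrame()
    for w, c in collections.Counter(doc_words_list).items():
        df.loc[w, 'doc1'] = c
    return df

def index_in_one_go():
    # whole column built from the Counter in a single operation
    return pandas.Series(collections.Counter(doc_words_list), name='doc1').to_frame()

print("cell by cell:", timeit.timeit(index_cell_by_cell, number=3))
print("in one go:   ", timeit.timeit(index_in_one_go, number=3))

# where is the time actually going in the slow version?
cProfile.run('index_cell_by_cell()', sort='cumtime')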



Pandas v Numpy? ... Clarity Wins!

Given everything we have found about pandas being slow in the last few posts .. it is very tempting to throw it away and just work with the lower-level numpy, on which pandas sits.

That would certainly give us much faster code .. but the cost would be the readability and simplicity of that code.

Let's remind ourselves of why we're doing this - we're not developing the most optimised text mining toolkit here .. we're developing code that illustrates the ideas of natural language processing with the minimum of barriers to understanding ... so simpler code, even if a little slower, is what we'll aim for .. fixing only the worst performance hotspots.



New Code

The following replaces the above indexing code:

import collections
import numpy
import pandas

words_ctr = collections.Counter(doc_words_list)

# convert to numpy structured array, ready for hdf5 storage
names = ['word', document_name]
formats = ['S20', 'i4']
wordcount_index_np = numpy.fromiter(words_ctr.items(), dtype=dict(names=names, formats=formats))

# convert to pandas
wordcount_index = pandas.DataFrame(wordcount_index_np[document_name], index=wordcount_index_np['word'], columns=[document_name])
# convert byte string index to normal pandas string index
wordcount_index.index = wordcount_index.index.astype(str)

What we do here is build a numpy array out of the collections Counter object of wordcounts ... this is done "in one go" rather than word by word. This single change is what makes the difference to performance and efficiency.

We then turn that into a pandas dataframe with the correct row index and column name .. again done in one go.
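Here's a toy end-to-end run of the above, with a made-up document name and word list. The only change is that the byte strings are decoded explicitly, which does the same job as the astype(str) line but avoids the b'...' artefacts you can get on recent Python 3 / pandas versions.

import collections

import numpy
import pandas

document_name = 'doc1.txt'                                 # made-up name
doc_words_list = ['war', 'peace', 'war', 'report', 'war']  # made-up words

words_ctr = collections.Counter(doc_words_list)

# structured array built from the Counter in one go
names = ['word', document_name]
formats = ['S20', 'i4']
wordcount_index_np = numpy.fromiter(words_ctr.items(), dtype=dict(names=names, formats=formats))

# one-column dataframe indexed by word
wordcount_index = pandas.DataFrame(wordcount_index_np[document_name], index=wordcount_index_np['word'], columns=[document_name])

# decode the b'word' byte strings into normal strings
wordcount_index.index = wordcount_index.index.str.decode('utf-8')

print(wordcount_index)
#         doc1.txt
# war            3
# peace          1
# report         1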



Performance Improvement

Here we've retained the best of both worlds. We have much faster indexing, and we also retain the merging of document indices into a corpus index, which was faster with a pandas merge.
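The merge code itself isn't shown in this post, but the idea can be sketched. Assuming each document produces a one-column dataframe like the one above, an outer join on the word index builds the corpus index, with the NaN gaps filled by zeros as before. The document names and counts here are invented.

import pandas

# two invented per-document word count indices
doc1 = pandas.DataFrame({'doc1.txt': [3, 1]}, index=['war', 'peace'])
doc2 = pandas.DataFrame({'doc2.txt': [2, 5]}, index=['war', 'report'])

# outer join aligns the word indices; words missing from a document become NaN
corpus_index = doc1.join(doc2, how='outer')

# replace NaN with zeros, as in the original indexing code
corpus_index.fillna(0, inplace=True)

print(corpus_index)
#         doc1.txt  doc2.txt
# peace        1.0       0.0
# report       0.0       5.0
# war          3.0       2.0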

The following shows the performance of the indexing and merging for the larger Iraq Report corpus using (i) the original pandas code, (ii) pure numpy code, and (iii) our improved pandas code.

         Pandas    Numpy    Better Pandas
Index    474       2.22     2.29
Merge    7.32      23.9     4.34
Total    481.32    26.12    6.63

We get really fast indexing .. only a tiny bit slower than pure numpy ... and we get to take advantage of the good pandas merge ... for which my own numpy code wasn't as fast.


Overall that's 73x faster over both index and merge!
That's impressive.

And just for indexing the improvement is 200x!
