Monday 24 October 2016

Fixed - Faster Indexing with Pandas

After all the hard work over the recent posts investigating the slow performance of indexing with pandas .. we now have a settled fix.



Biggest Indexing Bottleneck

The biggest performance bottleneck was the indexing code:

# create index
# (word, [document_name]) dictionary, there can be many [document_names] in list
words_ctr = collections.Counter(doc_words_list)
for w, c in words_ctr.items():
    #print("=== ", w, c)
    wordcount_index.ix[w, document_name] = c
    pass

# replace NaN with zeros
wordcount_index.fillna(0, inplace=True)

Pandas is not good at cell-level operations .. it is much better at whole-array operations. The above was not just a cell-level operation repeated for every word in the documents ... the way it was implemented in pandas effectively created a new copy of the dataframe and caused a reindex every time it was called.

The lesson here is to understand what pandas (or any other library) is good at ... and bad at ... and use performance profiling tools like %time, cProfile and pprofile to identify the actual hotspots.
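As a reminder, that profiling looks something like this in the notebook .. %time for coarse timings, cProfile for a function-level breakdown (do_processing() is the notebook wrapper around the indexing, as it appears in the profiles later in these posts):

import cProfile

# coarse timing of a single call in the notebook:
#   %time tmt.index_search.calculate_relevance_index(cr.content_directory)

# function-level profile of the indexing wrapper, sorted so that the
# most expensive functions (including their callees) appear first
cProfile.run('do_processing()', sort='cumtime')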



Pandas v Numpy? ... Clarity Wins!

Given everything we have found about pandas being slow in the last few posts .. it is very tempting to throw it away and just work with the lower-level numpy, on which pandas sits.

That would certainly give us much faster code .. but at the cost of the readability and simplicity of that code.

Let's remind ourselves of why we're doing this - we're not developing the most optimised text mining toolkit here .. we're developing code that illustrates the ideas of natural language processing with the minimum of barriers to understanding ... so simpler code, even if a little slower, is what we'll write .. and we'll only fix the worst performance hotspots.



New Code

The following replaces the above indexing code:

words_ctr = collections.Counter(doc_words_list)

# convert to numpy structured array, ready for hdf5 storage
names = ['word', document_name]
formats = ['S20', 'i4']
wordcount_index_np = numpy.fromiter(words_ctr.items(), dtype=dict(names=names, formats=formats))

# convert to pandas
wordcount_index = pandas.DataFrame(wordcount_index_np[document_name], index=wordcount_index_np['word'], columns=[document_name])
# convert byte string index to normal strings
wordcount_index.index = wordcount_index.index.astype(str)

What we do here is build a numpy structured array out of the collections Counter object of wordcounts ... and it is done "in one go" rather than word by word. That is the crucial change, and it is what makes the difference to performance and efficiency.

We then turn that into a pandas data frame with the correct row index and column names .. again done in one go.



Performance Improvement

Here we've retained the best of both worlds. We have much faster indexing, and we retain the merging of document indices into a corpus index, which was faster with a pandas merge.
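The merge code itself isn't shown in this post, but the pandas side of it is essentially an outer merge of the per-document wordcount frames on their word index. A rough sketch of that approach (illustrative only, not the toolkit's actual merge function):

import glob
import pandas

def merge_wordcount_indices(content_directory):
    # outer-merge each per-document wordcount frame on its word index
    # to build a single corpus-wide wordcount index
    corpus_index = pandas.DataFrame()
    for index_file in glob.glob(content_directory + "*_index.wordcount"):
        hd5_store = pandas.HDFStore(index_file, mode='r')
        doc_index = hd5_store['doc_index']
        hd5_store.close()
        if corpus_index.empty:
            corpus_index = doc_index
        else:
            corpus_index = pandas.merge(corpus_index, doc_index,
                                        left_index=True, right_index=True,
                                        how='outer')
    # a word missing from a document simply means a count of zero
    corpus_index.fillna(0, inplace=True)
    return corpus_index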

The following shows the performance of the indexing and merging for the larger Iraq Report corpus using (i) the original pandas code, (ii) pure numpy code, and (iii) our improved pandas code (times in seconds).

         Pandas    Numpy    Better Pandas
Index    474       2.22     2.29
Merge    7.32      23.9     4.34
Total    481.32    26.12    6.63

We get really fast indexing .. only a tiny bit slower than pure numpy ... and we get to take advantage of the good pandas merge ... for which my own numpy code wasn't as fast.


Overall that's 73x faster over both index and merge!
That's impressive.

And just for indexing the improvement is 200x!

Wednesday 19 October 2016

Pandas vs Numpy Performance

So we found previously that pandas is slow.

Let's see how promising coding with numpy is. Code in development is on GitHub.



Indexing and Merging

The Italian recipes corpus we created is small, and is processed quickly. The Iraq Inquiry Report corpus is bigger and takes longer.

Let's compare the times taken (in seconds) for indexing individual documents in the corpus, and the merging of these into a corpus word count index.

         Pandas    Numpy
Index    474       2.22
Merge    7.32      23.9
Total    481.32    26.12

Some interesting results:

  • The indexing with numpy is 200x faster than pandas!
  • Merging with numpy is actually slower than merging with pandas, about 3x slower.
  • Overall, indexing and merging with numpy is about 20x faster than pandas!


Pretty impressive!




cProfile Merging Indices with Numpy

But the apparent anomaly of merging being slower in numpy is worth looking at .. here's what cProfile has to say:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 9513/568    0.009    0.000    0.011    0.000 core.py:1282(_recursive_make_descr)
      568    0.004    0.000    0.015    0.000 core.py:1303(make_mask_descr)
      513    0.000    0.000    0.000    0.000 core.py:1339(getmask)
      171    0.000    0.000    0.002    0.000 core.py:1403(getmaskarray)
...
...
       59    0.001    0.000    0.001    0.000 group.py:33(__init__)
        1    0.000    0.000    0.010    0.010 group.py:52(create_dataset)
        1    0.130    0.130   22.307   22.307 index_search.py:130(merge_wordcount_indices_for_corpus2)
        1    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        2    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
...
...
       57    0.001    0.000    0.001    0.000 recfunctions.py:315(_fix_defaults)
      114    0.031    0.000    0.033    0.000 recfunctions.py:34(recursive_fill_fields)
      114    0.001    0.000    0.037    0.000 recfunctions.py:466(drop_fields)
      114    0.000    0.000    0.000    0.000 recfunctions.py:506(_drop_descr)
       57    0.106    0.002   22.030    0.386 recfunctions.py:823(join_by)
       57    0.000    0.000    0.000    0.000 recfunctions.py:906(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:907(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:937(<listcomp>)

We can see that the join_by() function for merging numpy structured arrays is slow. If you look at the source code, you can see it is Python code, not C, and it looks not particularly well maintained and is hardly documented.
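For context, join_by() lives in numpy.lib.recfunctions and merges structured arrays on a shared key. A minimal illustrative example (the field names are made up):

import numpy
from numpy.lib import recfunctions

# two per-document word counts as structured arrays, keyed by 'word'
a = numpy.array([(b'iraq', 3), (b'report', 2)],
                dtype=[('word', 'S20'), ('doc1', 'i4')])
b = numpy.array([(b'iraq', 5), (b'inquiry', 1)],
                dtype=[('word', 'S20'), ('doc2', 'i4')])

# outer join on the shared 'word' key .. counts missing from one
# document come back as masked values
merged = recfunctions.join_by('word', a, b, jointype='outer')
print(merged)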

If we wanted to optimise further .. we could look again at this, but for now .. we can be happy with a 20x performance boost!

Sunday 16 October 2016

Profiling Indexing, Relevance and Co-occurrence Part II

In the last post we applied the Python profiling tools to our textminingtoolkit as it has evolved up to this point.

The line-by-line pprofile tool allowed us to identify that the string.join() method for filtering out non-alphanumeric characters was slow ... and our replacement using the regex engine shaved a huge 21% off the indexing time for the large-ish Iraq Report corpus.
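For reference, the change was of this general shape .. a character-by-character join() replaced with a single regex substitution (a sketch .. the exact pattern used in the toolkit may differ):

import re

def simplify(text):
    # old approach: rebuild the string character by character
    #   return "".join(c if c.isalnum() else " " for c in text)
    # new approach: one pass through the regex engine
    return re.sub('[^a-zA-Z0-9]', ' ', text)

print(simplify("The Iraq Inquiry (Chilcot) Report, 2016."))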

That was a fun little exercise ... and now that we've implemented that change .. let's look again at profiling the code.



cProfile Indexing Again

Let's look at the function-level cProfile again ... this time we've sorted the stats by the cumtime column, which tells us which functions take the longest .. cumtime includes the time taken by the functions they themselves call, which is useful for seeing which of our own code is ultimately responsible.
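As an aside, if the stats are saved to a file, the same cumtime ordering can be reproduced with pstats .. a small sketch, with a made-up profile filename:

import pstats

# load a previously saved profile and show the 15 most expensive functions
# by cumulative time (their own time plus the time of their callees)
stats = pstats.Stats('indexing.prof')
stats.strip_dirs().sort_stats('cumtime').print_stats(15)

Back to our run .. here's the output: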

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.111 0.111 782.002 782.002 <ipython-input-6-89b4c4dd17f2>:4(do_processing)
38/1 0.001 0 782.002 782.002 {built-in
58 2.187 0.038 780.38 13.455 index_search.py:70(create_wordcount_index_for_document)
267167 4.529 0 775.462 0.003 indexing.py:126(__setitem__)
267167 3.446 0 275.737 0.001 frame.py:2743(reindex_axis)
267167 3.793 0 272.291 0.001 generic.py:2307(reindex_axis)
1068900 5.425 0 191.576 0 base.py:1915(get_loc)
1603408 180.826 0 182.049 0 {method
267167 3.078 0 155.164 0.001 base.py:2295(reindex)
267167 4.515 0 144.563 0.001 base.py:2028(get_indexer)
267167 131.558 0 134.735 0.001 {method
267225 3.967 0 121.646 0 base.py:3011(insert)
267167 6.672 0 106.987 0 generic.py:2320(_reindex_with_indexers)
267167 3.829 0 90.711 0 internals.py:3560(reindex_indexer)
267167 1.621 0 68.623 0 indexing.py:101(_get_setitem_indexer)

The do_processing() function which we've wrapped around the indexing takes 782 seconds .. that's long but also has the overhead of profiling too.

More interesting is that inside do_processing() it is the call to create_wordcount_index_for_document() which takes the 780 seconds of that 782. We're not surprised, it's the function that does the indexing.

But what inside that function is taking most of the time? Well, cProfile isn't a line-by-line profiler .. so we won't know precisely, but the next few lines give us a clue. They tell us that indexing.py (inside pandas?) takes 775s of that 780 ... doing something called __setitem__, whatever that is.

Similarly we can see reindex_axis inside frame.py and generic.py is taking time too ... we're assuming they're inside pandas.

Doing some googling ... suggests that adding values to a pandas dataframe and growing it dynamically is ... sloooowww .... and that it is better to collect the values in a list first, and once all the values have been added, create a pandas dataframe from that list. This way, there is no need for pandas to be constantly trying to insert values into a 2-d structure, and for each insert, update the indices, etc ..
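In other words, accumulate the counts in plain Python structures and only build the dataframe at the end. A tiny sketch of that idea (the column name is made up .. the fix we eventually settle on in a later post goes via a numpy structured array):

import collections
import pandas

doc_words_list = ["the", "inquiry", "report", "the", "report", "the"]

# count the words first in a plain python Counter ...
words_ctr = collections.Counter(doc_words_list)

# ... then build the dataframe in one go, not cell by cell
wordcount_index = pandas.Series(dict(words_ctr)).to_frame('doc1.txt')
print(wordcount_index)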

Ok.. let's use the line-by-line pprofile tool .. to see if that gives any more clues.
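In case it's useful, pprofile can be driven from code as well as from the command line .. something like this (deterministic mode, which is slow but exact):

import pprofile

profiler = pprofile.Profile()
with profiler:
    do_processing()    # the indexing wrapper from the notebook

# line-by-line annotated output, like the extract below
profiler.print_stats()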



pprofile Indexing Again

The pprofile output is huge again but here's the section with our own code relevant to creating the index.

    69|         0|            0|            0|  0.00%|# create word count index just for one document
    70|        58|  0.000509501|  8.78449e-06|  0.00%|def create_wordcount_index_for_document(content_directory, document_name, doc_words_list):
    71|         0|            0|            0|  0.00%|    # start with empty index
    72|        58|   0.00220084|  3.79456e-05|  0.00%|    wordcount_index = pandas.DataFrame()
(call)|        58|     0.306131|   0.00527811|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
    73|         0|            0|            0|  0.00%|
    74|         0|            0|            0|  0.00%|    # create index
    75|         0|            0|            0|  0.00%|    # (word, [document_name]) dictionary, there can be many [document_names] in list
    76|        58|   0.00183034|  3.15576e-05|  0.00%|    words_ctr = collections.Counter(doc_words_list)
(call)|        58|      0.44324|   0.00764207|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
    77|    267225|      1.69075|  6.32705e-06|  0.05%|    for w, c in words_ctr.items():
    78|         0|            0|            0|  0.00%|        #print("=== ", w, c)
    79|    267167|      10.7117|  4.00938e-05|  0.29%|        wordcount_index.ix[w, document_name] = c
(call)|    267167|      3692.81|    0.0138221| 99.16%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
(call)|    267167|      5.31215|  1.98833e-05|  0.14%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
    80|    267167|      2.18377|  8.17381e-06|  0.06%|        pass
    81|         0|            0|            0|  0.00%|
    82|         0|            0|            0|  0.00%|    # replace NaN wirh zeros
    83|        58|   0.00133228|  2.29704e-05|  0.00%|    wordcount_index.fillna(0, inplace=True)
(call)|        58|     0.167373|   0.00288574|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
    84|         0|            0|            0|  0.00%|
    85|         0|            0|            0|  0.00%|    # finally save index
    86|        58|  0.000444651|  7.66639e-06|  0.00%|    wordcount_index_file = content_directory + document_name + "_index.wordcount"
    87|        58|   0.00152659|  2.63206e-05|  0.00%|    hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
(call)|        58|     0.253571|   0.00437191|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
    88|        58|   0.00135565|  2.33732e-05|  0.00%|    hd5_store['doc_index'] = wordcount_index
(call)|        58|      8.71204|     0.150208|  0.23%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
    89|        58|   0.00121045|  2.08698e-05|  0.00%|    hd5_store.close()
(call)|        58|     0.221662|   0.00382176|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
    90|        58|  0.000453472|  7.81849e-06|  0.00%|    pass

The bit that takes the time is the creation and growth of the dataframe by extension, cell by cell .. we already had clues that this was inefficient ... and that we should really build the data with lists first before turning it into a frame.

In the next post, we'll redo the construction of the index dataframe from lists, as that's what these clues point to.



cProfile Merging Index Again

Here's the cProfile output, ordered by cumtime for the merging of the document indices into a single corpus index.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 6.598 6.598 <string>:1(<module>)
1 0 0 6.598 6.598 {built-in
1 0.082 0.082 6.595 6.595 index_search.py:94(merge_wordcount_indices_for_corpus)
58 0.005 0 5.209 0.09 merge.py:30(merge)
58 0.002 0 5.202 0.09 merge.py:212(get_result)
58 0.001 0 4.119 0.071 merge.py:333(_get_join_info)
58 0.004 0 4.118 0.071 base.py:2414(join)
57 0.007 0 2.391 0.042 base.py:1691(union)
57 2.104 0.037 2.104 0.037 {method
229 0.005 0 1.059 0.005 base.py:2028(get_indexer)
58 0.001 0 1.053 0.018 internals.py:4526(concatenate_block_managers)
58 0.012 0 1.011 0.017 internals.py:4542(<listcomp>)
229 0.993 0.004 0.996 0.004 {method
1711 0.011 0 0.97 0.001 internals.py:4630(concatenate_join_units)
58 0 0 0.908 0.016 pytables.py:416(__getitem__)
58 0 0 0.908 0.016 pytables.py:619(get)

Inside merge_wordcount_indices_for_corpus(), which takes 6.595 seconds .. we can see various of pandas' own merge.py methods taking most of that time.

That's not surprising .. but do we want to do anything about it? If it is only 6 seconds of the bigger 780 then it may not be worth the effort or additional code complexity. Remember, we value simplicity and clarity.

So let's not right now.



Relevance Calculations

We've not looked at the relevance calculations yet. First let's time them ..

%time tmt.index_search.calculate_relevance_index(cr.content_directory)

CPU times: user 1min 14s, sys: 943 ms, total: 1min 15s
Wall time: 1min 17s

That takes 1m 15s for the Iraq Corpus.

Let's look deeper now with the function-level cProfile.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 134.64 134.64 <string>:1(<module>)
1 0 0 134.64 134.64 {built-in
1 2.787 2.787 134.635 134.635 index_search.py:129(calculate_relevance_index)
45692 0.64 0 40.787 0.001 series.py:580(__getitem__)
137076 0.597 0 33.479 0 indexing.py:1286(__getitem__)
411241 3.939 0 33.196 0 series.py:120(__init__)
137076 0.944 0 32.719 0 indexing.py:1431(_getitem_axis)
91384 0.582 0 31.138 0 indexing.py:126(__setitem__)
45692 0.556 0 29.27 0.001 base.py:1957(get_value)
45692 0.817 0 27.335 0.001 {method
91384 2.381 0 27.317 0 indexing.py:224(_setitem_with_indexer)
137076/45692 0.215 0 26.517 0.001 numeric.py:1835(array_str)
137076/45692 0.647 0 26.404 0.001 arrayprint.py:340(array2string)
45692 1.115 0 25.903 0.001 arrayprint.py:237(_array2string)
137076 0.395 0 25.767 0 indexing.py:80(_get_label)
137076 2.427 0 25.337 0 generic.py:1643(xs)
137076 0.5 0 15.583 0 arrayprint.py:529(__init__)
137076 6.769 0 15.083 0 arrayprint.py:543(fillFormat)
274165 2.157 0 14.33 0 series.py:2787(_sanitize_array)
91384 0.34 0 14.327 0 internals.py:2916(setitem)
91385 1.338 0 13.994 0 internals.py:2811(apply)
91384 1.145 0 13.702 0 ops.py:603(wrapper)

The clues above suggest the same issues with growing pandas dataframes. Let's see what the line-by-line pprofile suggests:

0.00% # calculate relevance index from wordcount index
0.00% def calculate_relevance_index(content_directory):
0.00%    # load wordcount index
0.00%    wordcount_index_file = content_directory + "index.wordcount"
0.00%    hd5_store = pandas.HDFStore(wordcount_index_file, mode='r')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    wordcount_index = hd5_store['corpus_index']
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:416 __getitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%
0.00%    # following is a workaround for a pandas bug
0.00%    wordcount_index.index = wordcount_index.index.astype(str)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2674 __setattr__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:757 astype
0.00%
0.00%    # word frequency (per document) from wordcount, aka TF
0.00%    frequency_index = wordcount_index/wordcount_index.sum()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:5296 stat_func
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:1057 f
0.00%
0.00%    # penalise short word length
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.26%        frequency_index.loc[word] = frequency_index.loc[word] * numpy.tanh(len(word)/5.0)
10.93% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
5.26% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
8.52% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.02%        pass
0.00%
0.00%    # inverse document frequency
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.11%        documents = frequency_index.loc[word]
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.34% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.15%        matching_documents = documents[documents > 0]
27.69% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:580 __getitem__
9.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:727 wrapper
0.09%        inverse_document_frequency = 1.0 - (len(matching_documents) / len(documents))
0.35% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:443 __len__
0.00%        # print("word =", word, " idf = ", inverse_document_frequency)
0.27%        frequency_index.loc[word] = frequency_index.loc[word] * inverse_document_frequency
5.83% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
12.14% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.45% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.02%        pass
0.00%
0.00%    # save relevance index
0.00%    relevance_index_file = content_directory + "index.relevance"
0.00%    print("saving corpus relevance index ... ", relevance_index_file)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
0.00%    hd5_store = pandas.HDFStore(relevance_index_file, mode='w')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['corpus_index'] = frequency_index
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

This gives us some insight into the expensive bits of python .. and again it's the pandas operations. Something as innocent as selecting those parts of a dataframe which are > 0 seems to take about 37% of the time in this relevance calculating function!

There's more horror ... multiplying a row by a constant (which should be super fast) takes about 27% of the time. These are the operations we'd expect to be fast.

Bonkers!

But do we want to do anything about it here? Relevance calculations take 75 seconds compared to the bigger 780 for indexing .. so right now we won't spend any effort on optimising this.
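If we did ever come back to it, the medicine would be the same as before .. replace the per-word loops with whole-array operations. A rough sketch of what that might look like (illustrative only, not toolkit code):

import numpy
import pandas

def calculate_relevance_index_vectorised(wordcount_index):
    # word frequency (per document) from wordcount, aka TF
    frequency_index = wordcount_index / wordcount_index.sum()

    # penalise short words, computed for every row at once
    length_penalty = pandas.Series(
        numpy.tanh(frequency_index.index.str.len() / 5.0),
        index=frequency_index.index)

    # inverse document frequency .. the fraction of documents not containing the word
    matching_documents = (frequency_index > 0).sum(axis=1)
    inverse_document_frequency = 1.0 - (matching_documents / len(frequency_index.columns))

    # scale every row by its penalty and idf in one whole-array multiply
    return frequency_index.multiply(length_penalty * inverse_document_frequency, axis=0)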



Co-occurrence Calculations

By now we've seen most of the evidence pointing to issues with growing pandas frames dynamically. Let's see if that's the main factor in co-occurrence calculations... we may find a completely different bottleneck.

As usual .. let's time it first ...

CPU times: user 1min 48s, sys: 2.04 s, total: 1min 50s
Wall time: 2min 1s

That's using only the top 100 relevant words for co-occurrence .. otherwise it would take aaages!

Let's see what cProfile points us to. Here's a snippet of the output ordered by cumtime:

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.093 0.093 161.219 161.219 <ipython-input-18-62f4de2c7dbf>:4(do_processing)
1 0 0 161.219 161.219 <string>:1(<module>)
1 0 0 161.219 161.219 {built-in
58 2.662 0.046 153.919 2.654 cooccurrence.py:41(create_cooccurrence_matrix_for_document)
255165 1.474 0 72.565 0 indexing.py:126(__setitem__)
510330 5.616 0 63.487 0 internals.py:2811(apply)
258801/255165 7.271 0 57.243 0 indexing.py:224(_setitem_with_indexer)
255165 1.25 0 50.183 0 frame.py:2756(fillna)
255165 3.256 0 48.933 0 generic.py:3096(fillna)
255165 0.817 0 36.011 0 internals.py:2931(fillna)
255165 0.702 0 28.995 0 internals.py:2916(setitem)
514024 2.141 0 28.372 0 internals.py:2578(__init__)
257495/255165 1.293 0 26.723 0 indexing.py:65(__getitem__)
255165 0.839 0 22.868 0 frame.py:1817(get_value)
517602 8.075 0 21.28 0 internals.py:2674(_rebuild_blknos_and_blklocs)
255165 1.206 0 19.891 0 generic.py:1345(_get_item_cache)
255165 1.469 0 17.511 0 internals.py:331(fillna)
255165 0.926 0 12.972 0 indexing.py:101(_get_setitem_indexer)
255165 1.074 0 10.968 0 indexing.py:163(_convert_tuple)
255165 2.689 0 9.782 0 internals.py:628(setitem)
510330 1.61 0 9.764 0 indexing.py:1102(_convert_to_indexer)

Yup - it's those pesky pandas functions again.

Now it's pprofile's turn. The output confirms most of the time is actually spent in pandas library functions. The following shows which of our own code is the cause:

0.00% # create word cooccurrence matrix just for one document, updated to extend beyond immediate neighbour
0.00% def create_cooccurrence_matrix_for_document(content_directory, document_name, doc_words_list, window):
0.00%
0.00%    # start with empty matrix
0.00%    cooccurrence_matrix = pandas.DataFrame()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
0.00%
0.00%    # work along window
0.00%    for ci in range(1, window + 1):
0.00%        # first create word-pair list
0.00%        word_pair_list = zip(doc_words_list[:-ci], doc_words_list[ci:])
0.00%
0.00%        # counts for each pair
0.00%        word_pair_ctr = collections.Counter(word_pair_list)
0.02% # /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
0.00%
0.08%        for wp, c in word_pair_ctr.items():
0.11%            neighbour_factor = math.exp(- math.pow(ci / window,2))
0.00%            # this try-except is ugly, needed because pandas doesn't yet have df[wp] += ...
0.06%            try:
0.52%                cooccurrence_matrix.ix[wp] += (c * neighbour_factor)
0.19% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
18.90% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:65 __getitem__
41.38% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00%            except KeyError:
0.01%                cooccurrence_matrix.ix[wp] = (c * neighbour_factor)
3.32% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.00%                pass
0.00%            # replaces any created NaNs with zeros
0.26%            cooccurrence_matrix.fillna(0, inplace=True)
28.65% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
0.00%        pass
0.00%
0.00%    pass
0.00%
0.00%    # finally save matrix
0.00%    cooccurrence_matrix_file = content_directory + document_name + "_matrix.cooccurrence"
0.00%    hd5_store = pandas.HDFStore(cooccurrence_matrix_file, mode='w')
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['doc_matrix'] = cooccurrence_matrix
0.17% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

You can see the main culprits:

  • incrementing the co-occurrence matrix contents "in place" .. that seems to take a whopping 60% of the time!
  • the operation to replace NaNs with zeros, fillna(), is also expensive .. taking about 29% of the time .. which is puzzling!
  • even creating a new entry in a pandas dataframe costs about 3% of the time ..



Conclusion


  • Pandas is slow for cell-level operations .. terribly slow. Its sweet spot is whole-array or column operations.




Pandas -> Dicts / Lists

The next post will report on how well a re-implementation of the text processing functions using lists and dictionaries for item-level operations works.
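To give a flavour of the direction, the co-occurrence calculation might accumulate counts in a plain dictionary and only touch pandas once at the very end .. something like this sketch (illustrative only, not necessarily what the re-implementation will look like):

import collections
import math
import pandas

def cooccurrence_matrix_for_document(doc_words_list, window):
    # accumulate weighted co-occurrence counts in a plain dictionary
    cooccurrence = collections.defaultdict(float)
    for ci in range(1, window + 1):
        neighbour_factor = math.exp(-math.pow(ci / window, 2))
        word_pair_ctr = collections.Counter(zip(doc_words_list[:-ci], doc_words_list[ci:]))
        for wp, c in word_pair_ctr.items():
            cooccurrence[wp] += c * neighbour_factor

    # build the pandas dataframe once, at the end, in one go
    return pandas.Series(cooccurrence).unstack(fill_value=0)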