In the last post we applied the Python profiling tools to our textminingtoolkit as it has evolved up to this point.
The line-by-line pprofile tool allowed us to identify that the string.join() method we used for filtering out non-alphanumeric characters was slow ... and our replacement using the regex engine shaved a huge 21% off the indexing time for the large-ish Iraq Report corpus.
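As a reminder, the change was roughly of the following shape .. this is just an illustration of the idea, not the toolkit's exact code from that post:

import re

raw_text = "some raw text, with punctuation!"

# before .. filter characters one at a time and stitch them back together with str.join()
cleaned = "".join(c if (c.isalnum() or c.isspace()) else " " for c in raw_text)

# after .. one pass through the regex engine does the same job much faster
cleaned = re.sub('[^A-Za-z0-9 ]', ' ', raw_text)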
That was a fun little exercise ... and now that we've implemented that change .. let's look again at profiling the code.
cProfile Indexing Again
Let's look at the function-level cProfile again ... this time we've sorted the stats by the cumtime column, which tells us which functions take the longest. cumtime includes the time taken by functions called by those functions, which is useful for seeing which of our own code is responsible.
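For reference, a cumtime-sorted profile like the one below can be produced with the standard library's cProfile and pstats .. a minimal sketch, with do_processing() standing in for whatever we're profiling (in a notebook, %prun -s cumulative do_processing() does much the same job):

import cProfile
import pstats

# run the code under the profiler and save the raw stats to a file
cProfile.run('do_processing()', 'indexing.prof')

# load the stats and print the top entries, sorted by cumulative time
stats = pstats.Stats('indexing.prof')
stats.sort_stats('cumtime').print_stats(20)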
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.111 0.111 782.002 782.002 <ipython-input-6-89b4c4dd17f2>:4(do_processing)
38/1 0.001 0 782.002 782.002 {built-in
58 2.187 0.038 780.38 13.455 index_search.py:70(create_wordcount_index_for_document)
267167 4.529 0 775.462 0.003 indexing.py:126(__setitem__)
267167 3.446 0 275.737 0.001 frame.py:2743(reindex_axis)
267167 3.793 0 272.291 0.001 generic.py:2307(reindex_axis)
1068900 5.425 0 191.576 0 base.py:1915(get_loc)
1603408 180.826 0 182.049 0 {method
267167 3.078 0 155.164 0.001 base.py:2295(reindex)
267167 4.515 0 144.563 0.001 base.py:2028(get_indexer)
267167 131.558 0 134.735 0.001 {method
267225 3.967 0 121.646 0 base.py:3011(insert)
267167 6.672 0 106.987 0 generic.py:2320(_reindex_with_indexers)
267167 3.829 0 90.711 0 internals.py:3560(reindex_indexer)
267167 1.621 0 68.623 0 indexing.py:101(_get_setitem_indexer)
The do_processing() function, which we've wrapped around the indexing, takes 782 seconds .. that's long, but it also includes the overhead of profiling. More interesting is that inside do_processing() it is the call to create_wordcount_index_for_document() which takes 780 of those 782 seconds. We're not surprised, it's the function that does the indexing.
But what inside that function is taking most time? Well, cProfile isn't a line-by-line profiler .. so we won't know precisely, but the next few lines might give us a clue. They tell us that indexing.py (inside pandas?) takes 775s of that 780 ... doing something called __setitem__, whatever that is.
Similarly we can see reindex_axis inside frame.py and generic.py taking time too ... we're assuming they're inside pandas.
Doing some googling ... suggests that adding values to a pandas dataframe, growing it dynamically, is ... sloooowww .... and it is better to collect the values in a list or dict first, and only once all the values have been added, create a pandas dataframe from that. This way, there is no need for pandas to be constantly inserting values into a 2-d structure, updating the indices with each insert, etc ..
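To make that concrete, here's a tiny illustration of the two patterns .. a toy example, not the toolkit's code:

import pandas

words = ["apple", "banana", "cherry"]

# slow .. growing a dataframe one cell at a time, forcing pandas to
# extend and re-check its indices on every single assignment
df = pandas.DataFrame()
for i, word in enumerate(words):
    df.loc[word, "doc1"] = i

# faster .. collect the values in a plain python dict first ..
counts = {word: i for i, word in enumerate(words)}
# .. and only then build the dataframe, once, at the end
df = pandas.Series(counts).to_frame("doc1")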
Ok .. let's use the line-by-line pprofile tool .. to see if that gives any more clues.
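If you've not seen it before, a line-by-line profile like the ones below can be captured with pprofile's programmatic API .. roughly like this, with do_processing() again standing in for the code being profiled (pprofile can also be run from the command line as pprofile my_script.py):

import pprofile

prof = pprofile.Profile()
with prof():
    do_processing()

# print the annotated source, line by line, with hit counts and timings
prof.print_stats()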
pprofile Indexing Again
The pprofile output is huge again, but here's the section with our own code relevant to creating the index.
69| 0| 0| 0| 0.00%|# create word count index just for one document
70| 58| 0.000509501| 8.78449e-06| 0.00%|def create_wordcount_index_for_document(content_directory, document_name, doc_words_list):
71| 0| 0| 0| 0.00%| # start with empty index
72| 58| 0.00220084| 3.79456e-05| 0.00%| wordcount_index = pandas.DataFrame()
(call)| 58| 0.306131| 0.00527811| 0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
73| 0| 0| 0| 0.00%|
74| 0| 0| 0| 0.00%| # create index
75| 0| 0| 0| 0.00%| # (word, [document_name]) dictionary, there can be many [document_names] in list
76| 58| 0.00183034| 3.15576e-05| 0.00%| words_ctr = collections.Counter(doc_words_list)
(call)| 58| 0.44324| 0.00764207| 0.01%|# /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
77| 267225| 1.69075| 6.32705e-06| 0.05%| for w, c in words_ctr.items():
78| 0| 0| 0| 0.00%| #print("=== ", w, c)
79| 267167| 10.7117| 4.00938e-05| 0.29%| wordcount_index.ix[w, document_name] = c
(call)| 267167| 3692.81| 0.0138221| 99.16%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
(call)| 267167| 5.31215| 1.98833e-05| 0.14%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
80| 267167| 2.18377| 8.17381e-06| 0.06%| pass
81| 0| 0| 0| 0.00%|
82| 0| 0| 0| 0.00%| # replace NaN wirh zeros
83| 58| 0.00133228| 2.29704e-05| 0.00%| wordcount_index.fillna(0, inplace=True)
(call)| 58| 0.167373| 0.00288574| 0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
84| 0| 0| 0| 0.00%|
85| 0| 0| 0| 0.00%| # finally save index
86| 58| 0.000444651| 7.66639e-06| 0.00%| wordcount_index_file = content_directory + document_name + "_index.wordcount"
87| 58| 0.00152659| 2.63206e-05| 0.00%| hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
(call)| 58| 0.253571| 0.00437191| 0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
88| 58| 0.00135565| 2.33732e-05| 0.00%| hd5_store['doc_index'] = wordcount_index
(call)| 58| 8.71204| 0.150208| 0.23%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
89| 58| 0.00121045| 2.08698e-05| 0.00%| hd5_store.close()
(call)| 58| 0.221662| 0.00382176| 0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
90| 58| 0.000453472| 7.81849e-06| 0.00%| pass
The bit that takes almost all the time (99.16% of it) is line 79, where new dataframe cells are created by growing the frame one entry at a time .. we already had clues that this was inefficient ... and that we should really collect the data in lists first before turning it into a frame.
In the next post, we'll redo the construction of the index dataframe from lists, as that's what these clues point to.
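As a taster, the rewrite might look something like the following .. a sketch only, not necessarily what we'll end up with, but it shows the idea of building the dataframe once at the end rather than growing it cell by cell:

import collections
import pandas

def create_wordcount_index_for_document(content_directory, document_name, doc_words_list):
    # count the words with a plain Counter first .. no pandas involved yet
    words_ctr = collections.Counter(doc_words_list)

    # build the dataframe in one go .. a single column named after the document
    # (no fillna() needed .. one column can't have gaps from misaligned inserts)
    wordcount_index = pandas.Series(words_ctr).to_frame(document_name)

    # finally save the index, as before
    wordcount_index_file = content_directory + document_name + "_index.wordcount"
    hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
    hd5_store['doc_index'] = wordcount_index
    hd5_store.close()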
cProfile Merging Index Again
Here's the cProfile output, ordered by cumtime, for the merging of the document indices into a single corpus index.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 6.598 6.598 <string>:1(<module>)
1 0 0 6.598 6.598 {built-in
1 0.082 0.082 6.595 6.595 index_search.py:94(merge_wordcount_indices_for_corpus)
58 0.005 0 5.209 0.09 merge.py:30(merge)
58 0.002 0 5.202 0.09 merge.py:212(get_result)
58 0.001 0 4.119 0.071 merge.py:333(_get_join_info)
58 0.004 0 4.118 0.071 base.py:2414(join)
57 0.007 0 2.391 0.042 base.py:1691(union)
57 2.104 0.037 2.104 0.037 {method
229 0.005 0 1.059 0.005 base.py:2028(get_indexer)
58 0.001 0 1.053 0.018 internals.py:4526(concatenate_block_managers)
58 0.012 0 1.011 0.017 internals.py:4542(<listcomp>)
229 0.993 0.004 0.996 0.004 {method
1711 0.011 0 0.97 0.001 internals.py:4630(concatenate_join_units)
58 0 0 0.908 0.016 pytables.py:416(__getitem__)
58 0 0 0.908 0.016 pytables.py:619(get)
Inside merge_wordcount_indices_for_corpus(), which takes 6.595 seconds, we can see various of pandas' own merge.py methods taking most of that time.
That's not surprising .. but do we want to do anything about it? If it is only 6 seconds of the bigger 780 then it may not be worth the effort or additional code complexity. Remember, we value simplicity and clarity.
So let's not right now.
Relevance Calculations
We've not looked at the relevance calculations yet. First let's time them ..
%time tmt.index_search.calculate_relevance_index(cr.content_directory)
CPU times: user 1min 14s, sys: 943 ms, total: 1min 15s
Wall time: 1min 17s
That takes 1m 15s for the Iraq Corpus.
Let's look deeper now with the function-level cProfile.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 134.64 134.64 <string>:1(<module>)
1 0 0 134.64 134.64 {built-in
1 2.787 2.787 134.635 134.635 index_search.py:129(calculate_relevance_index)
45692 0.64 0 40.787 0.001 series.py:580(__getitem__)
137076 0.597 0 33.479 0 indexing.py:1286(__getitem__)
411241 3.939 0 33.196 0 series.py:120(__init__)
137076 0.944 0 32.719 0 indexing.py:1431(_getitem_axis)
91384 0.582 0 31.138 0 indexing.py:126(__setitem__)
45692 0.556 0 29.27 0.001 base.py:1957(get_value)
45692 0.817 0 27.335 0.001 {method
91384 2.381 0 27.317 0 indexing.py:224(_setitem_with_indexer)
137076/45692 0.215 0 26.517 0.001 numeric.py:1835(array_str)
137076/45692 0.647 0 26.404 0.001 arrayprint.py:340(array2string)
45692 1.115 0 25.903 0.001 arrayprint.py:237(_array2string)
137076 0.395 0 25.767 0 indexing.py:80(_get_label)
137076 2.427 0 25.337 0 generic.py:1643(xs)
137076 0.5 0 15.583 0 arrayprint.py:529(__init__)
137076 6.769 0 15.083 0 arrayprint.py:543(fillFormat)
274165 2.157 0 14.33 0 series.py:2787(_sanitize_array)
91384 0.34 0 14.327 0 internals.py:2916(setitem)
91385 1.338 0 13.994 0 internals.py:2811(apply)
91384 1.145 0 13.702 0 ops.py:603(wrapper)
The clues above suggest the same issues with growing pandas dataframes. Let's see what the line-by-line pprofile suggests:
0.00% # calculate relevance index from wordcount index
0.00% def calculate_relevance_index(content_directory):
0.00% # load wordcount index
0.00% wordcount_index_file = content_directory + "index.wordcount"
0.00% hd5_store = pandas.HDFStore(wordcount_index_file, mode='r')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00% wordcount_index = hd5_store['corpus_index']
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:416 __getitem__
0.00% hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%
0.00% # following is a workaround for a pandas bug
0.00% wordcount_index.index = wordcount_index.index.astype(str)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2674 __setattr__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:757 astype
0.00%
0.00% # word frequency (per document) from wordcount, aka TF
0.00% frequency_index = wordcount_index/wordcount_index.sum()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:5296 stat_func
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:1057 f
0.00%
0.00% # penalise short word length
0.02% for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.26% frequency_index.loc[word] = frequency_index.loc[word] * numpy.tanh(len(word)/5.0)
10.93% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
5.26% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
8.52% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.02% pass
0.00%
0.00% # inverse document frequency
0.02% for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.11% documents = frequency_index.loc[word]
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.34% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.15% matching_documents = documents[documents > 0]
27.69% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:580 __getitem__
9.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:727 wrapper
0.09% inverse_document_frequency = 1.0 - (len(matching_documents) / len(documents))
0.35% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:443 __len__
0.00% # print("word =", word, " idf = ", inverse_document_frequency)
0.27% frequency_index.loc[word] = frequency_index.loc[word] * inverse_document_frequency
5.83% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
12.14% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.45% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.02% pass
0.00%
0.00% # save relevance index
0.00% relevance_index_file = content_directory + "index.relevance"
0.00% print("saving corpus relevance index ... ", relevance_index_file)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
0.00% hd5_store = pandas.HDFStore(relevance_index_file, mode='w')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00% hd5_store['corpus_index'] = frequency_index
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00% hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00% pass
This gives us some insight into the expensive bits of Python .. and again it's the pandas operations. Something as innocent as selecting those parts of a dataframe which are > 0 seems to take about 37% of the time in this relevance-calculating function!
There's more horror ... multiplying a row by a constant (which should be super fast) takes about 27% of the time. These are operations we'd expect to be fast.
Bonkers!
But do we want to do anything about it here? Relevance calculations take 75 seconds compared to the bigger 780 for indexing .. so right now we won't spend any effort on optimising this.
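For the record though, if we did come back to this, the per-word loops could probably become whole-frame operations .. a sketch, assuming frequency_index keeps words as the row index and documents as the columns, as in the code above:

import numpy

def apply_relevance_weights(frequency_index):
    # penalise short words .. one multiplication over the whole frame instead of a per-word loop
    word_length_penalty = numpy.tanh(frequency_index.index.str.len() / 5.0)
    frequency_index = frequency_index.mul(word_length_penalty, axis=0)

    # inverse document frequency .. the fraction of documents not containing each word
    inverse_document_frequency = 1.0 - (frequency_index > 0).sum(axis=1) / frequency_index.shape[1]
    return frequency_index.mul(inverse_document_frequency, axis=0)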
Co-occurrence Calculations
By now we've seen most of the evidence pointing to issues with growing pandas frames dynamically. Let's see if that's the main factor in co-occurrence calculations... we may find a completely different bottleneck.
As usual .. let's time it first ...
CPU times: user 1min 48s, sys: 2.04 s, total: 1min 50s
Wall time: 2min 1s
That's using only the top 100 relevant words for co-occurrence .. otherwise it would take aaages!
Let's see what cProfile points us to. Here's a snippet of the output, ordered by cumtime:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.093 0.093 161.219 161.219 <ipython-input-18-62f4de2c7dbf>:4(do_processing)
1 0 0 161.219 161.219 <string>:1(<module>)
1 0 0 161.219 161.219 {built-in
58 2.662 0.046 153.919 2.654 cooccurrence.py:41(create_cooccurrence_matrix_for_document)
255165 1.474 0 72.565 0 indexing.py:126(__setitem__)
510330 5.616 0 63.487 0 internals.py:2811(apply)
258801/255165 7.271 0 57.243 0 indexing.py:224(_setitem_with_indexer)
255165 1.25 0 50.183 0 frame.py:2756(fillna)
255165 3.256 0 48.933 0 generic.py:3096(fillna)
255165 0.817 0 36.011 0 internals.py:2931(fillna)
255165 0.702 0 28.995 0 internals.py:2916(setitem)
514024 2.141 0 28.372 0 internals.py:2578(__init__)
257495/255165 1.293 0 26.723 0 indexing.py:65(__getitem__)
255165 0.839 0 22.868 0 frame.py:1817(get_value)
517602 8.075 0 21.28 0 internals.py:2674(_rebuild_blknos_and_blklocs)
255165 1.206 0 19.891 0 generic.py:1345(_get_item_cache)
255165 1.469 0 17.511 0 internals.py:331(fillna)
255165 0.926 0 12.972 0 indexing.py:101(_get_setitem_indexer)
255165 1.074 0 10.968 0 indexing.py:163(_convert_tuple)
255165 2.689 0 9.782 0 internals.py:628(setitem)
510330 1.61 0 9.764 0 indexing.py:1102(_convert_to_indexer)
Yup - it's those pesky pandas functions again.
Now it's pprofile's turn. The output confirms that most of the time is actually spent in pandas library functions. The following shows which parts of our own code are the cause:
0.00% # create word cooccurrence matrix just for one document, updated to extend beyond immediate neighbour
0.00% def create_cooccurrence_matrix_for_document(content_directory, document_name, doc_words_list, window):
0.00%
0.00% # start with empty matrix
0.00% cooccurrence_matrix = pandas.DataFrame()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
0.00%
0.00% # work along window
0.00% for ci in range(1, window + 1):
0.00% # first create word-pair list
0.00% word_pair_list = zip(doc_words_list[:-ci], doc_words_list[ci:])
0.00%
0.00% # counts for each pair
0.00% word_pair_ctr = collections.Counter(word_pair_list)
0.02% # /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
0.00%
0.08% for wp, c in word_pair_ctr.items():
0.11% neighbour_factor = math.exp(- math.pow(ci / window,2))
0.00% # this try-except is ugly, needed because pandas doesn't yet have df[wp] += ...
0.06% try:
0.52% cooccurrence_matrix.ix[wp] += (c * neighbour_factor)
0.19% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
18.90% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:65 __getitem__
41.38% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00% except KeyError:
0.01% cooccurrence_matrix.ix[wp] = (c * neighbour_factor)
3.32% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.00% pass
0.00% # replaces any created NaNs with zeros
0.26% cooccurrence_matrix.fillna(0, inplace=True)
28.65% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
0.00% pass
0.00%
0.00% pass
0.00%
0.00% # finally save matrix
0.00% cooccurrence_matrix_file = content_directory + document_name + "_matrix.cooccurrence"
0.00% hd5_store = pandas.HDFStore(cooccurrence_matrix_file, mode='w')
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00% hd5_store['doc_matrix'] = cooccurrence_matrix
0.17% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00% hd5_store.close()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00% pass
You can see the main culprits:
- incrementing the co-occurrence matrix contents "in place" .. that seems to take a whopping 60% of the time!
- the operation to replace NaNs with zero, fillna(), is also expensive .. taking about 29% of the time .. which is puzzling!
- even creating a new entry in a pandas dataframe costs about 3% of the time ..
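All of which points at the same fix as before .. accumulate the counts in ordinary Python structures and only build the dataframe at the very end. A sketch of what that might look like .. not the toolkit's actual code yet:

import collections
import math
import pandas

def create_cooccurrence_matrix_sketch(doc_words_list, window):
    # accumulate weighted co-occurrence counts in a nested dict: {word1: {word2: weight}}
    counts = collections.defaultdict(lambda: collections.defaultdict(float))
    for ci in range(1, window + 1):
        neighbour_factor = math.exp(- math.pow(ci / window, 2))
        word_pair_ctr = collections.Counter(zip(doc_words_list[:-ci], doc_words_list[ci:]))
        for (w1, w2), c in word_pair_ctr.items():
            counts[w1][w2] += c * neighbour_factor

    # build the pandas dataframe once, at the end .. a single fillna() for the gaps
    return pandas.DataFrame.from_dict(counts, orient='index').fillna(0)

The counting itself stays exactly the same .. only the point at which pandas gets involved changes.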
Conclusion
- Pandas is slow for cell-level operations .. terribly slow. Its sweet spot is whole-array or whole-column operations.
Pandas -> Dicts / Lists
The next post will report on how well a re-implementation of the text processing functions using lists and dictionaries for item-level operations works.