Monday 24 October 2016

Fixed - Faster Indexing with Pandas

After all the hard work over the recent posts investigating the slow performance of indexing with pandas .. we now have a settled fix.



Biggest Indexing Bottleneck

The biggest performance bottleneck was the indexing code:

# create index
# (word, [document_name]) dictionary, there can be many [document_names] in list
words_ctr = collections.Counter(doc_words_list)
for w, c in words_ctr.items():
    #print("=== ", w, c)
    wordcount_index.ix[w, document_name] = c
    pass

# replace NaN with zeros
wordcount_index.fillna(0, inplace=True)

Pandas is not good at cell-level operations .. it is much better at whole-array operations. The above was not just a cell-level operation repeated for every word in the documents ... the way it was implemented in pandas effectively created a new copy of the dataframe and caused a reindex every time it was called.

The lesson here is to understand what pandas (or any other library) is good at ... and bad at ... and use performance profiling tools like %time, cProfile and pprofile to identify the actual hotspots.
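As a reminder, that profiling looks something like this in the notebook .. %time for coarse timings, cProfile for a function-level breakdown (do_processing() is the notebook wrapper around the indexing, as it appears in the profiles later in these posts):

import cProfile

# coarse timing of a single call in the notebook:
#   %time tmt.index_search.calculate_relevance_index(cr.content_directory)

# function-level profile of the indexing wrapper, sorted so that the
# most expensive functions (including their callees) appear first
cProfile.run('do_processing()', sort='cumtime')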



Pandas v Numpy? ... Clarity Wins!

Given everything we have found about pandas being slow in the last few posts .. it is very tempting to throw it away and just work with the lower-level numpy, on which pandas sits.

That would certainly give us much faster code .. but at the cost of the readability and simplicity of that code.

Let's remind ourselves of why we're doing this - we're not developing the most optimised text mining toolkit here .. we're developing code that illustrates the ideas of natural language processing with the minimum of barriers to understanding ... so simpler code, even if a little slower, is what we'll write .. and we'll only fix the worst performance hotspots.



New Code

The following replaces the above indexing code:

words_ctr = collections.Counter(doc_words_list)

# convert to numpy structured array, ready for hdf5 storage
names = ['word', document_name]
formats = ['S20', 'i4']
wordcount_index_np = numpy.fromiter(words_ctr.items(), dtype=dict(names=names, formats=formats))

# convert to pandas
wordcount_index = pandas.DataFrame(wordcount_index_np[document_name], index=wordcount_index_np['word'], columns=[document_name])
# convert byte string index to normal strings
wordcount_index.index = wordcount_index.index.astype(str)

What we do here is build a numpy structured array out of the collections Counter object of wordcounts ... and it is done "in one go" rather than word by word. That is the crucial change, and it is what makes the difference to performance and efficiency.

We then turn that into a pandas data frame with the correct row index and column names .. again done in one go.



Performance Improvement

Here we've retained the best of both worlds. We have much faster indexing, and we retain the merging of document indices into a corpus index, which was faster with a pandas merge.
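The merge code itself isn't shown in this post, but the pandas side of it is essentially an outer merge of the per-document wordcount frames on their word index. A rough sketch of that approach (illustrative only, not the toolkit's actual merge function):

import glob
import pandas

def merge_wordcount_indices(content_directory):
    # outer-merge each per-document wordcount frame on its word index
    # to build a single corpus-wide wordcount index
    corpus_index = pandas.DataFrame()
    for index_file in glob.glob(content_directory + "*_index.wordcount"):
        hd5_store = pandas.HDFStore(index_file, mode='r')
        doc_index = hd5_store['doc_index']
        hd5_store.close()
        if corpus_index.empty:
            corpus_index = doc_index
        else:
            corpus_index = pandas.merge(corpus_index, doc_index,
                                        left_index=True, right_index=True,
                                        how='outer')
    # a word missing from a document simply means a count of zero
    corpus_index.fillna(0, inplace=True)
    return corpus_index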

The following shows the performance of the indexing and merging for the larger Iraq Report corpus using (i) the original pandas code, (ii) pure numpy code, and (iii) our improved pandas code (times in seconds).

         Pandas    Numpy    Better Pandas
Index    474       2.22     2.29
Merge    7.32      23.9     4.34
Total    481.32    26.12    6.63

We get really fast indexing .. only a tiny bit slower than pure numpy ... and we get to take advantage of the good pandas merge ... for which my own numpy code wasn't as fast.


Overall that's 73x faster over both index and merge!
That's impressive.

And just for indexing the improvement is 200x!

Wednesday 19 October 2016

Pandas vs Numpy Performance

So we found previously that pandas is slow.

Let's see how promising coding with numpy is. Code in development is on GitHub.



Indexing and Merging

The Italian recipes corpus we created is small, and is processed quickly. The Iraq Inquiry Report corpus is bigger and takes longer.

Let's compare the times taken (in seconds) for indexing individual documents in the corpus, and the merging of these into a corpus word count index.

         Pandas    Numpy
Index    474       2.22
Merge    7.32      23.9
Total    481.32    26.12

Some interesting results:

  • The indexing with numpy is 200x faster than pandas!
  • Merging with numpy is actually slower than merging with pandas, about 3x slower.
  • Overall, indexing and merging with numpy is about 20x faster than pandas!


Pretty impressive!




cProfile Merging Indices with Numpy

But the apparent anomaly of merging being slower in numpy is worth looking at .. here's what cProfile has to say:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 9513/568    0.009    0.000    0.011    0.000 core.py:1282(_recursive_make_descr)
      568    0.004    0.000    0.015    0.000 core.py:1303(make_mask_descr)
      513    0.000    0.000    0.000    0.000 core.py:1339(getmask)
      171    0.000    0.000    0.002    0.000 core.py:1403(getmaskarray)
...
...
       59    0.001    0.000    0.001    0.000 group.py:33(__init__)
        1    0.000    0.000    0.010    0.010 group.py:52(create_dataset)
        1    0.130    0.130   22.307   22.307 index_search.py:130(merge_wordcount_indices_for_corpus2)
        1    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        2    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
...
...
       57    0.001    0.000    0.001    0.000 recfunctions.py:315(_fix_defaults)
      114    0.031    0.000    0.033    0.000 recfunctions.py:34(recursive_fill_fields)
      114    0.001    0.000    0.037    0.000 recfunctions.py:466(drop_fields)
      114    0.000    0.000    0.000    0.000 recfunctions.py:506(_drop_descr)
       57    0.106    0.002   22.030    0.386 recfunctions.py:823(join_by)
       57    0.000    0.000    0.000    0.000 recfunctions.py:906(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:907(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:937(<listcomp>)

We can see that the join_by() function for merging numpy structured arrays is slow. If you look at the source code, you can see it is Python code, not C, and it looks not particularly well maintained and is hardly documented.
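For context, join_by() lives in numpy.lib.recfunctions and merges structured arrays on a shared key. A minimal illustrative example (the field names are made up):

import numpy
from numpy.lib import recfunctions

# two per-document word counts as structured arrays, keyed by 'word'
a = numpy.array([(b'iraq', 3), (b'report', 2)],
                dtype=[('word', 'S20'), ('doc1', 'i4')])
b = numpy.array([(b'iraq', 5), (b'inquiry', 1)],
                dtype=[('word', 'S20'), ('doc2', 'i4')])

# outer join on the shared 'word' key .. counts missing from one
# document come back as masked values
merged = recfunctions.join_by('word', a, b, jointype='outer')
print(merged)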

If we wanted to optimise further .. we could look again at this, but for now .. we can be happy with a 20x performance boost!

Sunday 16 October 2016

Profiling Indexing, Relevance and Co-occurrence Part II

In the last post we applied the Python profiling tools to our textminingtoolkit as it has evolved up to this point.

The line-by-line pprofile tool allowed us to identify that the string.join() method for filtering out non-alphanumeric characters was slow ... and our replacement using the regex engine shaved a huge 21% off the indexing time for the large-ish Iraq Report corpus.
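For reference, the change was of this general shape .. a character-by-character join() replaced with a single regex substitution (a sketch .. the exact pattern used in the toolkit may differ):

import re

def simplify(text):
    # old approach: rebuild the string character by character
    #   return "".join(c if c.isalnum() else " " for c in text)
    # new approach: one pass through the regex engine
    return re.sub('[^a-zA-Z0-9]', ' ', text)

print(simplify("The Iraq Inquiry (Chilcot) Report, 2016."))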

That was a fun little exercise ... and now that we've implemented that change .. let's look again at profiling the code.



cProfile Indexing Again

Let's look at the function-level cProfile again ... this time we've sorted the stats by the cumtime column, which tells us which functions take the longest .. cumtime includes the time taken by the functions they themselves call, which is useful for seeing which of our own code is ultimately responsible.
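As an aside, if the stats are saved to a file, the same cumtime ordering can be reproduced with pstats .. a small sketch, with a made-up profile filename:

import pstats

# load a previously saved profile and show the 15 most expensive functions
# by cumulative time (their own time plus the time of their callees)
stats = pstats.Stats('indexing.prof')
stats.strip_dirs().sort_stats('cumtime').print_stats(15)

Back to our run .. here's the output: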

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.111 0.111 782.002 782.002 <ipython-input-6-89b4c4dd17f2>:4(do_processing)
38/1 0.001 0 782.002 782.002 {built-in
58 2.187 0.038 780.38 13.455 index_search.py:70(create_wordcount_index_for_document)
267167 4.529 0 775.462 0.003 indexing.py:126(__setitem__)
267167 3.446 0 275.737 0.001 frame.py:2743(reindex_axis)
267167 3.793 0 272.291 0.001 generic.py:2307(reindex_axis)
1068900 5.425 0 191.576 0 base.py:1915(get_loc)
1603408 180.826 0 182.049 0 {method
267167 3.078 0 155.164 0.001 base.py:2295(reindex)
267167 4.515 0 144.563 0.001 base.py:2028(get_indexer)
267167 131.558 0 134.735 0.001 {method
267225 3.967 0 121.646 0 base.py:3011(insert)
267167 6.672 0 106.987 0 generic.py:2320(_reindex_with_indexers)
267167 3.829 0 90.711 0 internals.py:3560(reindex_indexer)
267167 1.621 0 68.623 0 indexing.py:101(_get_setitem_indexer)

The do_processing() function which we've wrapped around the indexing takes 782 seconds .. that's long but also has the overhead of profiling too.

More interesting is that inside do_processing() it is the call to create_wordcount_index_for_document() which takes the 780 seconds of that 782. We're not surprised, it's the function that does the indexing.

But what inside that function is taking most of the time? Well, cProfile isn't a line-by-line profiler .. so we won't know precisely, but the next few lines give us a clue. They tell us that indexing.py (inside pandas?) takes 775s of that 780 ... doing something called __setitem__, whatever that is.

Similarly we can see reindex_axis inside frame.py and generic.py is taking time too ... we're assuming they're inside pandas.

Doing some googling ... suggests that adding values to a pandas dataframe and growing it dynamically is ... sloooowww .... and that it is better to collect the values in a list first, and once all the values have been added, create a pandas dataframe from that list. This way, there is no need for pandas to be constantly trying to insert values into a 2-d structure, and for each insert, update the indices, etc ..
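In other words, accumulate the counts in plain Python structures and only build the dataframe at the end. A tiny sketch of that idea (the column name is made up .. the fix we eventually settle on in a later post goes via a numpy structured array):

import collections
import pandas

doc_words_list = ["the", "inquiry", "report", "the", "report", "the"]

# count the words first in a plain python Counter ...
words_ctr = collections.Counter(doc_words_list)

# ... then build the dataframe in one go, not cell by cell
wordcount_index = pandas.Series(dict(words_ctr)).to_frame('doc1.txt')
print(wordcount_index)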

Ok.. let's use the line-by-line pprofile tool .. to see if that gives any more clues.
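In case it's useful, pprofile can be driven from code as well as from the command line .. something like this (deterministic mode, which is slow but exact):

import pprofile

profiler = pprofile.Profile()
with profiler:
    do_processing()    # the indexing wrapper from the notebook

# line-by-line annotated output, like the extract below
profiler.print_stats()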



pprofile Indexing Again

The pprofile output is huge again but here's the section with our own code relevant to creating the index.

    69|         0|            0|            0|  0.00%|# create word count index just for one document
    70|        58|  0.000509501|  8.78449e-06|  0.00%|def create_wordcount_index_for_document(content_directory, document_name, doc_words_list):
    71|         0|            0|            0|  0.00%|    # start with empty index
    72|        58|   0.00220084|  3.79456e-05|  0.00%|    wordcount_index = pandas.DataFrame()
(call)|        58|     0.306131|   0.00527811|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
    73|         0|            0|            0|  0.00%|
    74|         0|            0|            0|  0.00%|    # create index
    75|         0|            0|            0|  0.00%|    # (word, [document_name]) dictionary, there can be many [document_names] in list
    76|        58|   0.00183034|  3.15576e-05|  0.00%|    words_ctr = collections.Counter(doc_words_list)
(call)|        58|      0.44324|   0.00764207|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
    77|    267225|      1.69075|  6.32705e-06|  0.05%|    for w, c in words_ctr.items():
    78|         0|            0|            0|  0.00%|        #print("=== ", w, c)
    79|    267167|      10.7117|  4.00938e-05|  0.29%|        wordcount_index.ix[w, document_name] = c
(call)|    267167|      3692.81|    0.0138221| 99.16%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
(call)|    267167|      5.31215|  1.98833e-05|  0.14%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
    80|    267167|      2.18377|  8.17381e-06|  0.06%|        pass
    81|         0|            0|            0|  0.00%|
    82|         0|            0|            0|  0.00%|    # replace NaN wirh zeros
    83|        58|   0.00133228|  2.29704e-05|  0.00%|    wordcount_index.fillna(0, inplace=True)
(call)|        58|     0.167373|   0.00288574|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
    84|         0|            0|            0|  0.00%|
    85|         0|            0|            0|  0.00%|    # finally save index
    86|        58|  0.000444651|  7.66639e-06|  0.00%|    wordcount_index_file = content_directory + document_name + "_index.wordcount"
    87|        58|   0.00152659|  2.63206e-05|  0.00%|    hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
(call)|        58|     0.253571|   0.00437191|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
    88|        58|   0.00135565|  2.33732e-05|  0.00%|    hd5_store['doc_index'] = wordcount_index
(call)|        58|      8.71204|     0.150208|  0.23%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
    89|        58|   0.00121045|  2.08698e-05|  0.00%|    hd5_store.close()
(call)|        58|     0.221662|   0.00382176|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
    90|        58|  0.000453472|  7.81849e-06|  0.00%|    pass

The bit that takes the time is the creation and growth of the dataframe by extension, cell by cell .. we already had clues that this was inefficient ... and that we should really build the data with lists first before turning it into a frame.

In the next post, we'll redo the construction of the index dataframe from lists, as that's what these clues point to.



cProfile Merging Index Again

Here's the cProfile output, ordered by cumtime for the merging of the document indices into a single corpus index.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 6.598 6.598 <string>:1(<module>)
1 0 0 6.598 6.598 {built-in
1 0.082 0.082 6.595 6.595 index_search.py:94(merge_wordcount_indices_for_corpus)
58 0.005 0 5.209 0.09 merge.py:30(merge)
58 0.002 0 5.202 0.09 merge.py:212(get_result)
58 0.001 0 4.119 0.071 merge.py:333(_get_join_info)
58 0.004 0 4.118 0.071 base.py:2414(join)
57 0.007 0 2.391 0.042 base.py:1691(union)
57 2.104 0.037 2.104 0.037 {method
229 0.005 0 1.059 0.005 base.py:2028(get_indexer)
58 0.001 0 1.053 0.018 internals.py:4526(concatenate_block_managers)
58 0.012 0 1.011 0.017 internals.py:4542(<listcomp>)
229 0.993 0.004 0.996 0.004 {method
1711 0.011 0 0.97 0.001 internals.py:4630(concatenate_join_units)
58 0 0 0.908 0.016 pytables.py:416(__getitem__)
58 0 0 0.908 0.016 pytables.py:619(get)

Inside merge_wordcount_indices_for_corpus(), which takes 6.595 seconds .. we can see various of pandas' own merge.py methods taking most of that time.

That's not surprising .. but do we want to do anything about it? If it is only 6 seconds of the bigger 780 then it may not be worth the effort or additional code complexity. Remember, we value simplicity and clarity.

So let's not right now.



Relevance Calculations

We've not looked at the relevance calculations yet. First let's time them ..

%time tmt.index_search.calculate_relevance_index(cr.content_directory)

CPU times: user 1min 14s, sys: 943 ms, total: 1min 15s
Wall time: 1min 17s

That takes 1m 15s for the Iraq Corpus.

Let's look deeper now with the function-level cProfile.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 134.64 134.64 <string>:1(<module>)
1 0 0 134.64 134.64 {built-in
1 2.787 2.787 134.635 134.635 index_search.py:129(calculate_relevance_index)
45692 0.64 0 40.787 0.001 series.py:580(__getitem__)
137076 0.597 0 33.479 0 indexing.py:1286(__getitem__)
411241 3.939 0 33.196 0 series.py:120(__init__)
137076 0.944 0 32.719 0 indexing.py:1431(_getitem_axis)
91384 0.582 0 31.138 0 indexing.py:126(__setitem__)
45692 0.556 0 29.27 0.001 base.py:1957(get_value)
45692 0.817 0 27.335 0.001 {method
91384 2.381 0 27.317 0 indexing.py:224(_setitem_with_indexer)
137076/45692 0.215 0 26.517 0.001 numeric.py:1835(array_str)
137076/45692 0.647 0 26.404 0.001 arrayprint.py:340(array2string)
45692 1.115 0 25.903 0.001 arrayprint.py:237(_array2string)
137076 0.395 0 25.767 0 indexing.py:80(_get_label)
137076 2.427 0 25.337 0 generic.py:1643(xs)
137076 0.5 0 15.583 0 arrayprint.py:529(__init__)
137076 6.769 0 15.083 0 arrayprint.py:543(fillFormat)
274165 2.157 0 14.33 0 series.py:2787(_sanitize_array)
91384 0.34 0 14.327 0 internals.py:2916(setitem)
91385 1.338 0 13.994 0 internals.py:2811(apply)
91384 1.145 0 13.702 0 ops.py:603(wrapper)

The clues above suggest the same issues with growing pandas dataframes. Let's see what the line-by-line pprofile suggests:

0.00% # calculate relevance index from wordcount index
0.00% def calculate_relevance_index(content_directory):
0.00%    # load wordcount index
0.00%    wordcount_index_file = content_directory + "index.wordcount"
0.00%    hd5_store = pandas.HDFStore(wordcount_index_file, mode='r')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    wordcount_index = hd5_store['corpus_index']
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:416 __getitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%
0.00%    # following is a workaround for a pandas bug
0.00%    wordcount_index.index = wordcount_index.index.astype(str)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2674 __setattr__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:757 astype
0.00%
0.00%    # word frequency (per document) from wordcount, aka TF
0.00%    frequency_index = wordcount_index/wordcount_index.sum()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:5296 stat_func
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:1057 f
0.00%
0.00%    # penalise short word length
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.26%        frequency_index.loc[word] = frequency_index.loc[word] * numpy.tanh(len(word)/5.0)
10.93% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
5.26% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
8.52% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.02%        pass
0.00%
0.00%    # inverse document frequency
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.11%        documents = frequency_index.loc[word]
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.34% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.15%        matching_documents = documents[documents > 0]
27.69% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:580 __getitem__
9.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:727 wrapper
0.09%        inverse_document_frequency = 1.0 - (len(matching_documents) / len(documents))
0.35% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:443 __len__
0.00%        # print("word =", word, " idf = ", inverse_document_frequency)
0.27%        frequency_index.loc[word] = frequency_index.loc[word] * inverse_document_frequency
5.83% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
12.14% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.45% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.02%        pass
0.00%
0.00%    # save relevance index
0.00%    relevance_index_file = content_directory + "index.relevance"
0.00%    print("saving corpus relevance index ... ", relevance_index_file)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
0.00%    hd5_store = pandas.HDFStore(relevance_index_file, mode='w')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['corpus_index'] = frequency_index
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

This gives us some insight into the expensive bits of python .. and again it's the pandas operations. Something as innocent as selecting those parts of a dataframe which are > 0 seems to take about 37% of the time in this relevance calculating function!

There's more horror ... multiplying a row by a constant (which should be super fast) takes about 27% of the time. These are the operations we'd expect to be fast.

Bonkers!

But do we want to do anything about it here? Relevance calculations take 75 seconds compared to the bigger 780 for indexing .. so right now we won't spend any effort on optimising this.
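If we did ever come back to it, the medicine would be the same as before .. replace the per-word loops with whole-array operations. A rough sketch of what that might look like (illustrative only, not toolkit code):

import numpy
import pandas

def calculate_relevance_index_vectorised(wordcount_index):
    # word frequency (per document) from wordcount, aka TF
    frequency_index = wordcount_index / wordcount_index.sum()

    # penalise short words, computed for every row at once
    length_penalty = pandas.Series(
        numpy.tanh(frequency_index.index.str.len() / 5.0),
        index=frequency_index.index)

    # inverse document frequency .. the fraction of documents not containing the word
    matching_documents = (frequency_index > 0).sum(axis=1)
    inverse_document_frequency = 1.0 - (matching_documents / len(frequency_index.columns))

    # scale every row by its penalty and idf in one whole-array multiply
    return frequency_index.multiply(length_penalty * inverse_document_frequency, axis=0)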



Co-occurrence Calculations

By now we've seen most of the evidence pointing to issues with growing pandas frames dynamically. Let's see if that's the main factor in co-occurrence calculations... we may find a completely different bottleneck.

As usual .. let's time it first ...

CPU times: user 1min 48s, sys: 2.04 s, total: 1min 50s
Wall time: 2min 1s

That's using only the top 100 relevant words for co-occurrence .. otherwise it would take aaages!

Let's see what cProfile points us to. Here's a snippet of the output ordered by cumtime:

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.093 0.093 161.219 161.219 <ipython-input-18-62f4de2c7dbf>:4(do_processing)
1 0 0 161.219 161.219 <string>:1(<module>)
1 0 0 161.219 161.219 {built-in
58 2.662 0.046 153.919 2.654 cooccurrence.py:41(create_cooccurrence_matrix_for_document)
255165 1.474 0 72.565 0 indexing.py:126(__setitem__)
510330 5.616 0 63.487 0 internals.py:2811(apply)
258801/255165 7.271 0 57.243 0 indexing.py:224(_setitem_with_indexer)
255165 1.25 0 50.183 0 frame.py:2756(fillna)
255165 3.256 0 48.933 0 generic.py:3096(fillna)
255165 0.817 0 36.011 0 internals.py:2931(fillna)
255165 0.702 0 28.995 0 internals.py:2916(setitem)
514024 2.141 0 28.372 0 internals.py:2578(__init__)
257495/255165 1.293 0 26.723 0 indexing.py:65(__getitem__)
255165 0.839 0 22.868 0 frame.py:1817(get_value)
517602 8.075 0 21.28 0 internals.py:2674(_rebuild_blknos_and_blklocs)
255165 1.206 0 19.891 0 generic.py:1345(_get_item_cache)
255165 1.469 0 17.511 0 internals.py:331(fillna)
255165 0.926 0 12.972 0 indexing.py:101(_get_setitem_indexer)
255165 1.074 0 10.968 0 indexing.py:163(_convert_tuple)
255165 2.689 0 9.782 0 internals.py:628(setitem)
510330 1.61 0 9.764 0 indexing.py:1102(_convert_to_indexer)

Yup - it's those pesky pandas functions again.

Now it's pprofile's turn. The output confirms most of the time is actually spent in pandas library functions. The following shows which of our own code is the cause:

0.00% # create word cooccurrence matrix just for one document, updated to extend beyond immediate neighbour
0.00% def create_cooccurrence_matrix_for_document(content_directory, document_name, doc_words_list, window):
0.00%
0.00%    # start with empty matrix
0.00%    cooccurrence_matrix = pandas.DataFrame()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
0.00%
0.00%    # work along window
0.00%    for ci in range(1, window + 1):
0.00%        # first create word-pair list
0.00%        word_pair_list = zip(doc_words_list[:-ci], doc_words_list[ci:])
0.00%
0.00%        # counts for each pair
0.00%        word_pair_ctr = collections.Counter(word_pair_list)
0.02% # /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
0.00%
0.08%        for wp, c in word_pair_ctr.items():
0.11%            neighbour_factor = math.exp(- math.pow(ci / window,2))
0.00%            # this try-except is ugly, needed because pandas doesn't yet have df[wp] += ...
0.06%            try:
0.52%                cooccurrence_matrix.ix[wp] += (c * neighbour_factor)
0.19% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
18.90% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:65 __getitem__
41.38% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00%            except KeyError:
0.01%                cooccurrence_matrix.ix[wp] = (c * neighbour_factor)
3.32% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.00%                pass
0.00%            # replaces any created NaNs with zeros
0.26%            cooccurrence_matrix.fillna(0, inplace=True)
28.65% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
0.00%        pass
0.00%
0.00%    pass
0.00%
0.00%    # finally save matrix
0.00%    cooccurrence_matrix_file = content_directory + document_name + "_matrix.cooccurrence"
0.00%    hd5_store = pandas.HDFStore(cooccurrence_matrix_file, mode='w')
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['doc_matrix'] = cooccurrence_matrix
0.17% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

You can see the main culprits:

  • incrementing the co-occurrence matrix contents "in place" .. that seems to take a whopping 60% of the time!
  • the operation to replace NaNs with zeros, fillna(), is also expensive .. taking about 29% of the time .. which is puzzling!
  • even creating a new entry in a pandas dataframe costs about 3% of the time ..



Conclusion


  • Pandas is slow for cell-level operations .. terribly slow. Its sweet spot is whole-array or column operations.




Pandas -> Dicts / Lists

The next post will report on how well a re-implementation of the text processing functions using lists and dictionaries for item-level operations works.
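To give a flavour of the direction, the co-occurrence calculation might accumulate counts in a plain dictionary and only touch pandas once at the very end .. something like this sketch (illustrative only, not necessarily what the re-implementation will look like):

import collections
import math
import pandas

def cooccurrence_matrix_for_document(doc_words_list, window):
    # accumulate weighted co-occurrence counts in a plain dictionary
    cooccurrence = collections.defaultdict(float)
    for ci in range(1, window + 1):
        neighbour_factor = math.exp(-math.pow(ci / window, 2))
        word_pair_ctr = collections.Counter(zip(doc_words_list[:-ci], doc_words_list[ci:]))
        for wp, c in word_pair_ctr.items():
            cooccurrence[wp] += c * neighbour_factor

    # build the pandas dataframe once, at the end, in one go
    return pandas.Series(cooccurrence).unstack(fill_value=0)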