Sunday, 16 October 2016

Profiling Indexing, Relevance and Co-occurrence Part II

In the last post we applied the Python profiling tools to our textminingtoolkit as it has evolved up to this point.

The line-by-line pprofile tool allowed us to identify that the string.join() method for filtering out non-alphanumeric characters was slow ... and our replacement using the regex engine shaved a huge 21% off the indexing time for the large-ish Iraq Report corpus.

That was a fun little exercise ... and now that we've implemented that change .. let's look again at profiling the code.



cProfile Indexing Again

Let's look at the function-level cProfile again ... this time we've sorted the stats by the cumtime column which tells us which functions take the longest .. cumtime includes the time taken by functions called by those functions, which is useful for seeing which parts of our own code are responsible.
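One way to get this ordering, assuming the profile run has been saved to a file first (the filename here is just illustrative), is with the standard pstats module:

import cProfile
import pstats

# profile the indexing wrapper and save the raw stats to a file
cProfile.run('do_processing()', 'indexing.prof')

# load the stats, sort by cumulative time and show the top 20 entries
stats = pstats.Stats('indexing.prof')
stats.sort_stats('cumulative').print_stats(20)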

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.111 0.111 782.002 782.002 <ipython-input-6-89b4c4dd17f2>:4(do_processing)
38/1 0.001 0 782.002 782.002 {built-in
58 2.187 0.038 780.38 13.455 index_search.py:70(create_wordcount_index_for_document)
267167 4.529 0 775.462 0.003 indexing.py:126(__setitem__)
267167 3.446 0 275.737 0.001 frame.py:2743(reindex_axis)
267167 3.793 0 272.291 0.001 generic.py:2307(reindex_axis)
1068900 5.425 0 191.576 0 base.py:1915(get_loc)
1603408 180.826 0 182.049 0 {method
267167 3.078 0 155.164 0.001 base.py:2295(reindex)
267167 4.515 0 144.563 0.001 base.py:2028(get_indexer)
267167 131.558 0 134.735 0.001 {method
267225 3.967 0 121.646 0 base.py:3011(insert)
267167 6.672 0 106.987 0 generic.py:2320(_reindex_with_indexers)
267167 3.829 0 90.711 0 internals.py:3560(reindex_indexer)
267167 1.621 0 68.623 0 indexing.py:101(_get_setitem_indexer)

The do_processing() function which we've wrapped around the indexing takes 782 seconds .. that's long, but it also includes the overhead of profiling.

More interesting is that inside do_processing() it is the call to create_wordcount_index_for_document() which takes 780 of those 782 seconds. We're not surprised, as it's the function that does the indexing.

But what inside that function is taking most time? Well, cProfile isn't a line-by-line profiler .. so we won't know precisely, but the next few lines might give us a clue. They tell us that indexing.py (inside pandas?) takes 775s of that 780 ... doing something called __setitem__, whatever that is.

Similarly we can see reindex_axis inside frame.py and generic.py is taking time too ... we're assuming they're inside pandas.

Doing some googling ... suggests that adding values to a pandas dataframe and growing it dynamically is ... sloooowww ....  and that it is better to collect the values in a plain list or dictionary first, and only once all the values have been added, create a pandas data frame from it. This way, there is no need for pandas to be constantly inserting values into a 2-d structure, and for each insert, updating the indices, etc ..
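As a rough illustration of the difference, here's a tiny sketch with made-up data .. not toolkit code:

import pandas

some_counts = [("apple", 3), ("banana", 2), ("cherry", 5)]

# slow pattern: grow a dataframe one cell at a time, forcing pandas
# to reindex on every single insert
df_slow = pandas.DataFrame()
for word, count in some_counts:
    df_slow.loc[word, "doc1"] = count

# faster pattern: keep the data in plain Python structures first,
# then build the dataframe once at the end
df_fast = pandas.DataFrame(some_counts, columns=["word", "doc1"]).set_index("word")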

Ok.. let's use the line-by-line pprofile tool .. to see if that gives any more clues.



pprofile Indexing Again

The pprofile output is huge again but here's the section with our own code relevant to creating the index.

    69|         0|            0|            0|  0.00%|# create word count index just for one document
    70|        58|  0.000509501|  8.78449e-06|  0.00%|def create_wordcount_index_for_document(content_directory, document_name, doc_words_list):
    71|         0|            0|            0|  0.00%|    # start with empty index
    72|        58|   0.00220084|  3.79456e-05|  0.00%|    wordcount_index = pandas.DataFrame()
(call)|        58|     0.306131|   0.00527811|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
    73|         0|            0|            0|  0.00%|
    74|         0|            0|            0|  0.00%|    # create index
    75|         0|            0|            0|  0.00%|    # (word, [document_name]) dictionary, there can be many [document_names] in list
    76|        58|   0.00183034|  3.15576e-05|  0.00%|    words_ctr = collections.Counter(doc_words_list)
(call)|        58|      0.44324|   0.00764207|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
    77|    267225|      1.69075|  6.32705e-06|  0.05%|    for w, c in words_ctr.items():
    78|         0|            0|            0|  0.00%|        #print("=== ", w, c)
    79|    267167|      10.7117|  4.00938e-05|  0.29%|        wordcount_index.ix[w, document_name] = c
(call)|    267167|      3692.81|    0.0138221| 99.16%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
(call)|    267167|      5.31215|  1.98833e-05|  0.14%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
    80|    267167|      2.18377|  8.17381e-06|  0.06%|        pass
    81|         0|            0|            0|  0.00%|
    82|         0|            0|            0|  0.00%|    # replace NaN wirh zeros
    83|        58|   0.00133228|  2.29704e-05|  0.00%|    wordcount_index.fillna(0, inplace=True)
(call)|        58|     0.167373|   0.00288574|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
    84|         0|            0|            0|  0.00%|
    85|         0|            0|            0|  0.00%|    # finally save index
    86|        58|  0.000444651|  7.66639e-06|  0.00%|    wordcount_index_file = content_directory + document_name + "_index.wordcount"
    87|        58|   0.00152659|  2.63206e-05|  0.00%|    hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
(call)|        58|     0.253571|   0.00437191|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
    88|        58|   0.00135565|  2.33732e-05|  0.00%|    hd5_store['doc_index'] = wordcount_index
(call)|        58|      8.71204|     0.150208|  0.23%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
    89|        58|   0.00121045|  2.08698e-05|  0.00%|    hd5_store.close()
(call)|        58|     0.221662|   0.00382176|  0.01%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
    90|        58|  0.000453472|  7.81849e-06|  0.00%|    pass

The bit that takes the time is the creation of new dataframe cells by growing the frame one entry at a time .. we already had clues that this is inefficient ... and that we should really collect the data in lists first before turning it into a frame.

In the next post, we'll redo the construction of the index dataframe from lists, as that's what these clues point to.
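To give a flavour of what that might look like .. here's a sketch, not the final toolkit code, which builds the per-document index straight from the collections.Counter we already have, rather than cell by cell:

import collections
import pandas

def create_wordcount_index_for_document_v2(content_directory, document_name, doc_words_list):
    # count the words with a plain Python Counter, as the current code already does
    words_ctr = collections.Counter(doc_words_list)

    # build the one-column index in a single step instead of growing it cell by cell
    # (no NaNs are created this way, so no fillna() is needed)
    wordcount_index = pandas.Series(words_ctr, name=document_name).to_frame()

    # save the index as before
    wordcount_index_file = content_directory + document_name + "_index.wordcount"
    hd5_store = pandas.HDFStore(wordcount_index_file, mode='w')
    hd5_store['doc_index'] = wordcount_index
    hd5_store.close()
    pass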



cProfile Merging Index Again

Here's the cProfile output, ordered by cumtime for the merging of the document indices into a single corpus index.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.003 0.003 6.598 6.598 <string>:1(<module>)
1 0 0 6.598 6.598 {built-in
1 0.082 0.082 6.595 6.595 index_search.py:94(merge_wordcount_indices_for_corpus)
58 0.005 0 5.209 0.09 merge.py:30(merge)
58 0.002 0 5.202 0.09 merge.py:212(get_result)
58 0.001 0 4.119 0.071 merge.py:333(_get_join_info)
58 0.004 0 4.118 0.071 base.py:2414(join)
57 0.007 0 2.391 0.042 base.py:1691(union)
57 2.104 0.037 2.104 0.037 {method
229 0.005 0 1.059 0.005 base.py:2028(get_indexer)
58 0.001 0 1.053 0.018 internals.py:4526(concatenate_block_managers)
58 0.012 0 1.011 0.017 internals.py:4542(<listcomp>)
229 0.993 0.004 0.996 0.004 {method
1711 0.011 0 0.97 0.001 internals.py:4630(concatenate_join_units)
58 0 0 0.908 0.016 pytables.py:416(__getitem__)
58 0 0 0.908 0.016 pytables.py:619(get)

Inside merge_wordcount_indices_for_corpus(), which takes 6.595 seconds .. we can see pandas' own merge.py methods taking most of that time.

That's not surprising .. but do we want to do anything about it? If it is only 6 seconds of the bigger 780 then it may not be worth the effort or additional code complexity. Remember, we value simplicity and clarity.

So let's not right now.



Relevance Calculations

We've not looked at the relevance calculations yet. First let's time them ..

%time tmt.index_search.calculate_relevance_index(cr.content_directory)

CPU times: user 1min 14s, sys: 943 ms, total: 1min 15s
Wall time: 1min 17s

That takes 1m 15s for the Iraq Corpus.

Let's look deeper now with the function-level cProfile.

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 134.64 134.64 <string>:1(<module>)
1 0 0 134.64 134.64 {built-in
1 2.787 2.787 134.635 134.635 index_search.py:129(calculate_relevance_index)
45692 0.64 0 40.787 0.001 series.py:580(__getitem__)
137076 0.597 0 33.479 0 indexing.py:1286(__getitem__)
411241 3.939 0 33.196 0 series.py:120(__init__)
137076 0.944 0 32.719 0 indexing.py:1431(_getitem_axis)
91384 0.582 0 31.138 0 indexing.py:126(__setitem__)
45692 0.556 0 29.27 0.001 base.py:1957(get_value)
45692 0.817 0 27.335 0.001 {method
91384 2.381 0 27.317 0 indexing.py:224(_setitem_with_indexer)
137076/45692 0.215 0 26.517 0.001 numeric.py:1835(array_str)
137076/45692 0.647 0 26.404 0.001 arrayprint.py:340(array2string)
45692 1.115 0 25.903 0.001 arrayprint.py:237(_array2string)
137076 0.395 0 25.767 0 indexing.py:80(_get_label)
137076 2.427 0 25.337 0 generic.py:1643(xs)
137076 0.5 0 15.583 0 arrayprint.py:529(__init__)
137076 6.769 0 15.083 0 arrayprint.py:543(fillFormat)
274165 2.157 0 14.33 0 series.py:2787(_sanitize_array)
91384 0.34 0 14.327 0 internals.py:2916(setitem)
91385 1.338 0 13.994 0 internals.py:2811(apply)
91384 1.145 0 13.702 0 ops.py:603(wrapper)

The clues above suggest the same issues with growing pandas dataframes. Let's see what the line-by-line pprofile suggests:

0.00% # calculate relevance index from wordcount index
0.00% def calculate_relevance_index(content_directory):
0.00%    # load wordcount index
0.00%    wordcount_index_file = content_directory + "index.wordcount"
0.00%    hd5_store = pandas.HDFStore(wordcount_index_file, mode='r')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    wordcount_index = hd5_store['corpus_index']
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:416 __getitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%
0.00%    # following is a workaround for a pandas bug
0.00%    wordcount_index.index = wordcount_index.index.astype(str)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2674 __setattr__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:757 astype
0.00%
0.00%    # word frequency (per document) from wordcount, aka TF
0.00%    frequency_index = wordcount_index/wordcount_index.sum()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:5296 stat_func
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:1057 f
0.00%
0.00%    # penalise short word length
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.26%        frequency_index.loc[word] = frequency_index.loc[word] * numpy.tanh(len(word)/5.0)
10.93% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
5.26% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
8.52% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.10% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.02%        pass
0.00%
0.00%    # inverse document frequency
0.02%    for word in frequency_index.index.values:
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py:440 values
0.11%        documents = frequency_index.loc[word]
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.34% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.15%        matching_documents = documents[documents > 0]
27.69% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:580 __getitem__
9.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:727 wrapper
0.09%        inverse_document_frequency = 1.0 - (len(matching_documents) / len(documents))
0.35% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/series.py:443 __len__
0.00%        # print("word =", word, " idf = ", inverse_document_frequency)
0.27%        frequency_index.loc[word] = frequency_index.loc[word] * inverse_document_frequency
5.83% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/ops.py:603 wrapper
12.14% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.11% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
9.45% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:1286 __getitem__
0.02%        pass
0.00%
0.00%    # save relevance index
0.00%    relevance_index_file = content_directory + "index.relevance"
0.00%    print("saving corpus relevance index ... ", relevance_index_file)
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
0.00%    hd5_store = pandas.HDFStore(relevance_index_file, mode='w')
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['corpus_index'] = frequency_index
0.05% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

This gives us some insight into the expensive bits of python .. and again it's the pandas operations. Something as innocent as selecting those parts of a dataframe which are > 0 seems to take about 37% of the time in this relevance calculating function!

There's more horror ... multiplying a row by a constant (which should be super fast) takes about 27% of the time. These are operations we'd expect to be fast.

Bonkers!

But do we want to do anything about it here ... Relevance calculations take 75 seconds compared to the 780 seconds for indexing .. so right now we won't spend any effort on optimising this.
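For future reference though, the per-word loops could in principle be replaced with whole-frame operations, which is where pandas is quick. Here's a sketch of the idea, using the same tanh length penalty and inverse document frequency as above .. not the current toolkit code:

import numpy
import pandas

def calculate_relevance_vectorised(wordcount_index):
    # word frequency (per document) from wordcount, aka TF
    frequency_index = wordcount_index / wordcount_index.sum()

    # penalise short word length: one tanh weight per word, applied to whole rows at once
    length_penalty = numpy.tanh(frequency_index.index.str.len() / 5.0)
    frequency_index = frequency_index.mul(length_penalty, axis=0)

    # inverse document frequency: based on the fraction of documents each word appears in
    document_fraction = (wordcount_index > 0).sum(axis=1) / wordcount_index.shape[1]
    frequency_index = frequency_index.mul(1.0 - document_fraction, axis=0)

    return frequency_index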



Co-occurrence Calculations

By now we've seen most of the evidence pointing to issues with growing pandas frames dynamically. Let's see if that's the main factor in co-occurrence calculations... we may find a completely different bottleneck.

As usual .. let's time it first ...

CPU times: user 1min 48s, sys: 2.04 s, total: 1min 50s
Wall time: 2min 1s

That's using only the top 100 relevant words for co-occurrence .. otherwise it would take aaages!

Let's see what cProfile points us to. Here's a snippet of the output ordered by cumtime:

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.093 0.093 161.219 161.219 <ipython-input-18-62f4de2c7dbf>:4(do_processing)
1 0 0 161.219 161.219 <string>:1(<module>)
1 0 0 161.219 161.219 {built-in
58 2.662 0.046 153.919 2.654 cooccurrence.py:41(create_cooccurrence_matrix_for_document)
255165 1.474 0 72.565 0 indexing.py:126(__setitem__)
510330 5.616 0 63.487 0 internals.py:2811(apply)
258801/255165 7.271 0 57.243 0 indexing.py:224(_setitem_with_indexer)
255165 1.25 0 50.183 0 frame.py:2756(fillna)
255165 3.256 0 48.933 0 generic.py:3096(fillna)
255165 0.817 0 36.011 0 internals.py:2931(fillna)
255165 0.702 0 28.995 0 internals.py:2916(setitem)
514024 2.141 0 28.372 0 internals.py:2578(__init__)
257495/255165 1.293 0 26.723 0 indexing.py:65(__getitem__)
255165 0.839 0 22.868 0 frame.py:1817(get_value)
517602 8.075 0 21.28 0 internals.py:2674(_rebuild_blknos_and_blklocs)
255165 1.206 0 19.891 0 generic.py:1345(_get_item_cache)
255165 1.469 0 17.511 0 internals.py:331(fillna)
255165 0.926 0 12.972 0 indexing.py:101(_get_setitem_indexer)
255165 1.074 0 10.968 0 indexing.py:163(_convert_tuple)
255165 2.689 0 9.782 0 internals.py:628(setitem)
510330 1.61 0 9.764 0 indexing.py:1102(_convert_to_indexer)

Yup - it's those pesky pandas functions again.

Now it's pprofile's turn. The output confirms that most of the time is actually spent in pandas library functions. The following shows which parts of our own code are the cause:

0.00% # create word cooccurrence matrix just for one document, updated to extend beyond immediate neighbour
0.00% def create_cooccurrence_matrix_for_document(content_directory, document_name, doc_words_list, window):
0.00%
0.00%    # start with empty matrix
0.00%    cooccurrence_matrix = pandas.DataFrame()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:210 __init__
0.00%
0.00%    # work along window
0.00%    for ci in range(1, window + 1):
0.00%        # first create word-pair list
0.00%        word_pair_list = zip(doc_words_list[:-ci], doc_words_list[ci:])
0.00%
0.00%        # counts for each pair
0.00%        word_pair_ctr = collections.Counter(word_pair_list)
0.02% # /Users/tariq/anaconda3/lib/python3.5/collections/__init__.py:512 __init__
0.00%
0.08%        for wp, c in word_pair_ctr.items():
0.11%            neighbour_factor = math.exp(- math.pow(ci / window,2))
0.00%            # this try-exceptis ugly, needed because pandas doesn't yet have df[wp] += ...
0.06%            try:
0.52%                cooccurrence_matrix.ix[wp] += (c * neighbour_factor)
0.19% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
18.90% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:65 __getitem__
41.38% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00%            except KeyError:
0.01%                cooccurrence_matrix.ix[wp] = (c * neighbour_factor)
3.32% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:126 __setitem__
0.00% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:1312 _indexer
0.00%                pass
0.00%            # replaces any created NaNs with zeros
0.26%            cooccurrence_matrix.fillna(0, inplace=True)
28.65% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py:2756 fillna
0.00%        pass
0.00%
0.00%    pass
0.00%
0.00%    # finally save matrix
0.00%    cooccurrence_matrix_file = content_directory + document_name + "_matrix.cooccurrence"
0.00%    hd5_store = pandas.HDFStore(cooccurrence_matrix_file, mode='w')
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:383 __init__
0.00%    hd5_store['doc_matrix'] = cooccurrence_matrix
0.17% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:419 __setitem__
0.00%    hd5_store.close()
0.01% # /Users/tariq/anaconda3/lib/python3.5/site-packages/pandas/io/pytables.py:578 close
0.00%    pass

You can see the main culprits:

  • incrementing the co-occurrence matrix contents "in place" .. that seems to take a whopping 60% of the time!
  • the operation to replace NaNs with zero, fillna(), is also expensive .. taking about 29% of the time .. partly because it is called inside the inner loop, once for every word pair!
  • even creating a new entry in a pandas dataframe costs about 3% of the time .. 



Conclusion


  • Pandas is slow for cell-level operations .. terribly slow. Its sweet spot is whole-array or column operations.




Pandas -> Dicts / Lists

The next post will report on how well a re-implementation of the text processing functions using lists and dictionaries for item-level operations works.
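To give a taste of that re-implementation, here's a sketch .. with hypothetical names, not the final code .. which accumulates the weighted co-occurrence counts in plain Python dictionaries and only builds the dataframe once at the very end:

import collections
import math
import pandas

def create_cooccurrence_matrix_from_dicts(doc_words_list, window):
    # nested dict: cooccurrence[word1][word2] -> weighted count
    cooccurrence = collections.defaultdict(lambda: collections.defaultdict(float))

    # work along the window, weighting more distant pairs less
    for ci in range(1, window + 1):
        neighbour_factor = math.exp(- math.pow(ci / window, 2))
        word_pair_ctr = collections.Counter(zip(doc_words_list[:-ci], doc_words_list[ci:]))
        for (w1, w2), c in word_pair_ctr.items():
            cooccurrence[w1][w2] += c * neighbour_factor

    # build the dataframe once at the end, filling missing pairs with zeros
    cooccurrence_matrix = pandas.DataFrame(cooccurrence).fillna(0)
    return cooccurrence_matrix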

Saturday, 15 October 2016

Profiling Indexing, Relevance and Co-occurrence Part I

In the last post we looked at three tools for profiling python code to find performance hotspots .. code that is slow, or is hit lots of times adding up to a long time.

Let's apply these tools to our toolkit, because we know that some things are slow ... like indexing, and calculating co-occurrence.



Profiling Indexing

Let's try the function-level cProfile first. We'll wrap the indexing code inside a function do_processing() to make it easier to call with cProfile .. as follows ..

def do_processing():
    # for all documents in corpus
    for document_name in cr.get_documents():
        #print("processing ", document_name)

        # get document text
        document_text = cr.get_text_by_document(document_name)

        # simplify whitespace (remove newlines)
        b = tmt.text_processing.simplify_whitespace(document_text)

        # only keep alphanumeric characters, removes punctuation
        c = tmt.text_processing.keep_alphanumeric(b)

        # make lowercase
        d = tmt.text_processing.to_lowercase(c)

        # split into words list
        dl = tmt.text_processing.split_text_into_words(d)

        # build n-grams
        #gl = tmt.word_processing.build_ngrams_from_words(dl,2)

        # remove stop words
        #el = tmt.word_processing.remove_stop_words(dl, "./stopwords/minimal-stop.txt")

        # update index
        tmt.index_search.create_wordcount_index_for_document(cr.content_directory, document_name, dl)
        pass
    pass

# profile code
cProfile.run('do_processing()')

The output is long .. here's the top of it ..

        211209192 function calls (206665033 primitive calls) in 565.070 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2674860    1.814    0.000    5.263    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
        1    0.109    0.109  565.070  565.070 <ipython-input-11-9069811af5cf>:4(do_processing)
        1    0.000    0.000  565.070  565.070 <string>:1(<module>)
   534856    0.673    0.000    1.040    0.000 __init__.py:168(iteritems)
       58    0.000    0.000    0.000    0.000 __init__.py:224(u_safe)
       58    0.001    0.000    0.458    0.008 __init__.py:512(__init__)
       58    0.002    0.000    0.458    0.008 __init__.py:581(update)
     2088    0.002    0.000    0.029    0.000 _methods.py:34(_prod)
   534450    0.500    0.000    5.571    0.000 _methods.py:37(_any)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:16(__init__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:20(__enter__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:26(__exit__)
      116    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
       58    0.000    0.000    0.001    0.000 abc.py:178(__instancecheck__)
   267109    0.683    0.000    0.992    0.000 algorithms.py:807(_get_take_nd_function)
   267109    4.719    0.000   28.988    0.000 algorithms.py:840(take_nd)
      232    0.001    0.000    0.002    0.000 array.py:113(_getrowsize)
      232    0.004    0.000    0.154    0.001 array.py:131(__init__)
      232    0.003    0.000    0.060    0.000 array.py:190(_g_create)
...
...

Looking through the output, the rows with large times are associated with built-in functions which we can do little about .. we're not going to rewrite the core python functions or the pandas library. Here are some examples:

       58    0.000    0.000    0.001    0.000 generic.py:1904(_update_inplace)
   267167    2.743    0.000  193.508    0.001 generic.py:2307(reindex_axis)
   267167    4.327    0.000   78.374    0.000 generic.py:2320(_reindex_with_indexers)
   267167    0.348    0.000    0.400    0.000 generic.py:2641(__finalize__)
...
       58    0.000    0.000    0.032    0.001 group.py:917(_f_close)
       58    1.595    0.028  554.948    9.568 index_search.py:70(create_wordcount_index_for_document)
   267167    1.235    0.000   47.746    0.000 indexing.py:101(_get_setitem_indexer)
   534334    4.355    0.000   43.300    0.000 indexing.py:1102(_convert_to_indexer)
   267167    3.294    0.000  551.357    0.002 indexing.py:126(__setitem__)
   801501    0.897    0.000    1.291    0.000 indexing.py:128(<genexpr>)
   267167    1.671    0.000   45.093    0.000 indexing.py:163(_convert_tuple)
       58    0.000    0.000    0.001    0.000 indexing.py:1753(convert_to_index_sliceable)
...
      928    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
  1068900    1.336    0.000    1.336    0.000 {method 'fill' of 'numpy.ndarray' objects}
  2939011    0.946    0.000    0.946    0.000 {method 'get' of 'dict' objects}
   267167   90.638    0.000   92.661    0.000 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
  1603408  128.358    0.000  129.298    0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
      638    0.001    0.000    0.001    0.000 {method 'groups' of '_sre.SRE_Match' objects}
..
      232    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}
   267109   12.336    0.000   12.336    0.000 {pandas.algos.take_2d_axis1_float64_float64}
   535204   30.105    0.000   31.133    0.000 {pandas.lib.infer_dtype}

   267225    0.081    0.000    0.081    0.000 {pandas.lib.is_bool}

Apart from telling us that our create_wordcount_index_for_document() takes most of the time .. which isn't useful as we know that's the indexing function ... the rest is not very insightful as they're internal functions, and we don't know where they're called from.

So let's try pprofile, the line-by-line profiler. Here's how we invoke it:

profiler = pprofile.Profile()
with profiler:
    do_processing()
profiler.dump_stats(filename="pprofile_index_1.txt")

The output is even bigger! But it is organised into sections for each python source code file, ordered by the time spent in each file. Here's a sample:

/Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py
File duration: 231.942s (6.50%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         0|            0|            0|  0.00%|# module apply processing to text (not word lists)
     2|         0|            0|            0|  0.00%|
     3|         0|            0|            0|  0.00%|# import text string stuff
     4|         0|            0|            0|  0.00%|import string
     5|         0|            0|            0|  0.00%|
     6|         0|            0|            0|  0.00%|
     7|         0|            0|            0|  0.00%|# split into words
     8|        58|   0.00025773|  4.44363e-06|  0.00%|def split_text_into_words(input_text):
     9|        58|     0.190883|   0.00329108|  0.01%|    output_words_list = input_text.split(" ")
    10|        58|   0.00185609|  3.20015e-05|  0.00%|    return output_words_list

The most time (28%) is spent in the pandas/indexes/base.py library. We shouldn't be surprised as we're using pandas to do much of the hard work. More interesting is which bits of our code ended up calling that code. Next is 22% in pandas/core/internals.py, then 12% in pandas.core/indexing.py, 11% in pandas/core/generic.py, 9% in pandas/core/common.py ...

... so that's already 82% of the total time in pandas!

We then get to our own code:

File: /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py
File duration: 231.942s (6.50%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         0|            0|            0|  0.00%|# module apply processing to text (not word lists)
     2|         0|            0|            0|  0.00%|
     3|         0|            0|            0|  0.00%|# import text string stuff
     4|         0|            0|            0|  0.00%|import string
     5|         0|            0|            0|  0.00%|
     6|         0|            0|            0|  0.00%|
     7|         0|            0|            0|  0.00%|# split into words
     8|        58|   0.00025773|  4.44363e-06|  0.00%|def split_text_into_words(input_text):
     9|        58|     0.190883|   0.00329108|  0.01%|    output_words_list = input_text.split(" ")
    10|        58|   0.00185609|  3.20015e-05|  0.00%|    return output_words_list
    11|         0|            0|            0|  0.00%|
    12|         0|            0|            0|  0.00%|
    13|         0|            0|            0|  0.00%|# make lowercase
    14|        58|  0.000458717|  7.90892e-06|  0.00%|def to_lowercase(input_text):
    15|        58|    0.0161779|  0.000278929|  0.00%|    output_text = input_text.lower()
    16|        58|   0.00069499|  1.19826e-05|  0.00%|    return output_text
    17|         0|            0|            0|  0.00%|
    18|         0|            0|            0|  0.00%|
    19|         0|            0|            0|  0.00%|# remove whitespace, convert to single space
    20|        58|  0.000150442|  2.59383e-06|  0.00%|def simplify_whitespace(input_text):
    21|         0|            0|            0|  0.00%|    #output_text = input_text.translate(str.maketrans(string.whitespace, ' ' * len(string.whitespace)))
    22|        58|     0.446623|    0.0077004|  0.01%|    output_text = " ".join(input_text.split())
    23|        58|   0.00181532|  3.12986e-05|  0.00%|    return output_text
    24|         0|            0|            0|  0.00%|
    25|         0|            0|            0|  0.00%|
    26|         0|            0|            0|  0.00%|# remove punctuation
    27|         0|            0|            0|  0.00%|def remove_punctuation(input_text):
    28|         0|            0|            0|  0.00%|    output_text = input_text.translate(str.maketrans({a:None for a in string.punctuation}))
    29|         0|            0|            0|  0.00%|    return output_text
    30|         0|            0|            0|  0.00%|
    31|         0|            0|            0|  0.00%|
    32|         0|            0|            0|  0.00%|# keep only alphanumeric characters
    33|        58|  0.000307798|  5.30687e-06|  0.00%|def keep_alphanumeric(input_text):
    34|  32473800|      231.281|  7.12207e-06|  6.49%|    output_text = ''.join(c for c in input_text if c in (" " + string.ascii_letters + string.digits))
(call)|  15895288|      111.163|  6.99348e-06|  3.12%|# /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py:34 <genexpr>
    35|        58|   0.00230646|  3.97666e-05|  0.00%|    return output_text

That bit only takes 6.50% of the total time, but inside it we can see the string join command takes a long time .. 6.49% of the total time, which is actually very significant. The other string join, in simplify_whitespace(), is also a bit slow, taking 0.01% of the time.

String operations do seem slow!

So that's a good result already .. if we can improve the string join operations, or even better, the function which filters out non-alphanumeric characters ... then we've addressed 6% of the performance problem!



Improving The Slow AlphaNumeric Filter

Before we make the change, we should time how long it takes to run the index creation for each document in the big Iraq Report corpus.

# profile code
%time do_processing()

CPU times: user 6min 30s, sys: 9.24 s, total: 6min 39s

Wall time: 6min 44s

And the merging of the index ...

# profile code
%time tmt.index_search.merge_wordcount_indices_for_corpus(cr.content_directory)

CPU times: user 5.82 s, sys: 662 ms, total: 6.48 s
Wall time: 7.54 s

We'll replace the code which strips out all non-alphanumeric characters with one based on the regular expression engine. The following shows simple benchmarks comparing relative performance:


That change from a string.join() method to a regular expression method sped up this example by over 11 times! .... That is a massive speed up!
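The replacement looks roughly like this .. a sketch of the approach, the exact pattern in the toolkit may differ slightly:

import re

# keep only spaces, letters and digits .. the regex engine does the filtering
def keep_alphanumeric(input_text):
    output_text = re.sub('[^ a-zA-Z0-9]+', '', input_text)
    return output_text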

Let's see how it helps our indexing. The following shows how long the indexing takes after this change has been implemented.

# profile code
%time do_processing()

CPU times: user 5min 6s, sys: 8.47 s, total: 5min 15s
Wall time: 5min 16s

So we have a speedup from 6m 39s to 5m 15s which is 399 seconds down to 315 seconds ... so we've shaved off 21% of the time!

That's not a bad result for a simple 2-line code change .. one of which is an import statement!

21% reduction in time
...
with just 2 lines of changed code!

That's the power of profiling .. it points you to where the problems actually are.

We're not finished yet ... more in the next post!

Wednesday, 12 October 2016

Profiling Python Performance

We've created a good bit of code already .. both as code in Python notebooks .. as well as code in the textminingtoolkit package of libraries we're developing.

But some of it is slooow:

  • Creating word-document indices is slooow.
  • Calculating the relevance matrix is slooow.
  • Calculating the co-occurrence matrix is also slooow.

How can we fix it? Well .. we can't fix it unless we can measure it.

We need to find out which bits of code are slow. There are several ways of doing this... let's look at a few, starting super simple.



Super Simple - Time It!

The simplest approach to measuring the performance of bits of code is to run them and record the time they took with a stopwatch.

Too often however, python code runs so fast, we'd never have a chance to get the timing right with a stopwatch! Stuff happens in microseconds .. not seconds or minutes.

So Python itself, the Jupyter notebook, and also 3rd party libraries provide ways to measure how long bits of code take in a much more accurate manner.

Imagine we have a super simple function, called work() which takes a parameter n:

def work(n):
    print("hello")
    x = n * n
    print(x)
    pass

You can see it is super simple, it just says hello and calculates the square of n, before printing it out. Simples!

If we call it, here's what happens:

work(3)
hello
9

In a python notebook, we can use a simple command to time how long some code takes. This command is only available in the notebook, not python in general .. and commands like this begin with a % symbol, and are sometimes called python magics! Don't ask me why ... 

Here's the %time magic:

%time work(4)
hello
16
CPU times: user 48 µs, sys: 0 ns, total: 48 µs
Wall time: 52.9 µs

That is sooooo much easier than the old fashioned method of calling a python command to get the time, running a function or code, then getting the time again, and finding the difference in times ... painful!

The work() function is not taxing at all .. you can see it took 48 microseconds! That wall time is a bit bigger because that's the stopwatch time .. which includes time your computer was doing other stuff that's nothing to do with running work() .. like updating the pointer on your screen or seeing if a button was pressed .. etc 
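For comparison, the old fashioned manual approach mentioned above looks something like this .. a small sketch using the standard time module:

import time

# take the time before and after, and work out the difference ourselves
start = time.time()
work(4)
end = time.time()
print("took", end - start, "seconds")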

Now let's rewrite our work() function with a fast bit and a slow bit:

def quick(n):
    return n*n

def slow(n):
    counter = 0
    for i in range(n):
        for j in range(n):
            counter +=1
            pass
        pass
    return counter

def work(n):
    print("hello")
    a = quick(n)
    b = slow(n)
    print(a,b)
    pass

You can see that work() now calls two other functions, quick() and slow(), both of which calculate the square of n. quick() does it more efficiently than slow().

Let's try again with the magic timer:

%time work(5000)
hello
25000000 25000000
CPU times: user 1.51 s, sys: 22.2 ms, total: 1.54 s
Wall time: 1.57 s

You can see that this now takes 1.54 seconds ... noticeably slower.

But this doesn't tell us why work() is slow. It doesn't tell us whether the quick() function really is quick .. nor does it tell us that the slow() function really is slow.

That's what we need .. to be able to find which bits of our functions are slow .. so we can zoom in on the hotspots and try to improve them.

That's where more sophisticated profilers are the right tool for the job.



Python Profiler: cProfile

There are a few options for profiling python code. Python itself has built in profilers.

cProfile is the most commonly used built in profiler. Here's how we'd use it ...

import cProfile

cProfile.run('work(5000)')

The results are:

hello
25000000 25000000
         56 function calls in 1.637 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <ipython-input-2-152b7fd5fd62>:1(quick)
        1    0.000    0.000    1.637    1.637 <ipython-input-2-152b7fd5fd62>:13(work)
        1    1.636    1.636    1.636    1.636 <ipython-input-2-152b7fd5fd62>:4(slow)
        1    0.000    0.000    1.637    1.637 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        6    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        6    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
        6    0.000    0.000    0.000    0.000 iostream.py:309(write)
        1    0.000    0.000    0.000    0.000 posix.py:53(wake)
        2    0.000    0.000    0.000    0.000 stack_context.py:253(wrap)
        2    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    1.637    1.637 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        6    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of '_io.FileIO' objects}
        6    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}

Whoaahh! What's all that!?

Don't worry .. it's just a very detailed listing of what the code is doing and for how long. Let's break it apart..
  • 56 function calls were made in total .. that includes functions that are called inside python by our own code
  • total time was 1.637 seconds
Now the interesting bit ...
  • work() took 1.637 seconds .. we know that's the total time .. makes sense because work() is the main function we're calling
  • slow() takes 1.636 seconds ... which is almost all the time the whole work() function took
  • quick() is so quick it doesn't even register on the stopwatch!

By the way, tottime is the time in a function excluding any functions called from there .. and cumtime is the time in a function including any functions called from there. ... that's why work() has a high cumtime and a low tottime as all it does is call 2 other functions.

That's what we wanted - to see which bits of code inside a bigger bunch of code was the slow hotspot.



Python Profiler: pprofile

As great as cProfile is .. it only profiles called functions ... which in some cases is not granular enough. Sometimes we want to know how each line of python performs. And sometimes, not everything that is a performance hotspot is in fact a function.

So let's look at pprofile which aims to do profiling per line of python.

To use it, we import it and set up a profiler, and call the function we're interested in:

import pprofile

profiler = pprofile.Profile()
with profiler:
    work(5000)
profiler.dump_stats(filename="abc.txt")

Normally we'd only need to print the output statistics with a profiler.print_stats() but right now there is a bug which means it doesn't work properly in a notebook - here's the github issue 19 I raised. The workaround for now is to dump the stats to a file, "abc.txt" here.

The output file we get is huge .. because this profiler also profiles the code called by our own python .. deep into the core python libraries and functions. Luckily the top of the file is most relevant to our own code:

Total duration: 274.336s
File: <ipython-input-2-152b7fd5fd62>
File duration: 274.334s (100.00%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         1|  4.05312e-06|  4.05312e-06|  0.00%|def quick(n):
     2|         1|  5.96046e-06|  5.96046e-06|  0.00%|    return n*n
     3|         0|            0|            0|  0.00%|
(call)|         1|      274.336|      274.336|100.00%|# <ipython-input-2-152b7fd5fd62>:13 work
     4|         1|   3.8147e-06|   3.8147e-06|  0.00%|def slow(n):
     5|         1|  8.10623e-06|  8.10623e-06|  0.00%|    counter = 0
     6|      5001|      0.02175|  4.34912e-06|  0.01%|    for i in range(n):
     7|  25005000|      91.4991|  3.65923e-06| 33.35%|        for j in range(n):
     8|  25000000|      92.4224|   3.6969e-06| 33.69%|            counter +=1
     9|  25000000|      90.3722|  3.61489e-06| 32.94%|            pass
    10|      5000|     0.018151|  3.63021e-06|  0.01%|        pass
    11|         1|  4.05312e-06|  4.05312e-06|  0.00%|    return counter
    12|         0|            0|            0|  0.00%|
    13|         1|  3.69549e-05|  3.69549e-05|  0.00%|def work(n):
    14|         1|  0.000101089|  0.000101089|  0.00%|    print("hello")
(call)|         2|   0.00199699|  0.000998497|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    15|         1|   2.5034e-05|   2.5034e-05|  0.00%|    a = quick(n)
(call)|         1|  1.00136e-05|  1.00136e-05|  0.00%|# <ipython-input-2-152b7fd5fd62>:1 quick
    16|         1|  3.38554e-05|  3.38554e-05|  0.00%|    b = slow(n)
(call)|         1|      274.334|      274.334|100.00%|# <ipython-input-2-152b7fd5fd62>:4 slow
    17|         1|  5.00679e-05|  5.00679e-05|  0.00%|    print(a,b)
(call)|         4|  0.000454903|  0.000113726|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    18|         1|  5.96046e-06|  5.96046e-06|  0.00%|    pass

This is much more insightful ... giving us information line-by-line. Let's break it down:

  • The total time was 274 seconds .. that's because this method of profiling adds huge overhead. If you're interested, pprofile can also work out performance hotspots through statistical sampling, which has less overhead but is less accurate ... though often accurate enough!
  • The hits column shows which lines get executed many times .. we can see the lines inside the slow() function being called many many times .. some 5000 times, some 25000000 times! Obviously a hotspot if they're not super-quick.
  • The time and time per hit column show which lines take most time. We can see those loops consume the most time .. but each individual line doesn't take especially long .. it's just the they're called 5000 or 25000000 times!




Next Post

Excellent stuff! We've got the tools now to see if we can identify which parts of our textminingtoolkit code are slow .. that's the next post!

Tuesday, 11 October 2016

Co-occurrence .. Refined ... Part 2/2

In the previous post, we improved how we decide whether two words are co-occurrent or not.

Instead of a black-n-white yes-or-no .. we now have a spectrum of co-occurrence for words inside a window. This allows related words that do occur frequently together, but may be placed a couple or more words apart, to be recognised as such.


The results looked good .. apart from the problem of boring stop-words which also happen to be very co-occurrent.

We've previously developed a way to deal with such boring words: a measure of word relevance .. which balances word frequency, commonality across documents, and word length. Such a method won't be perfect, but it'll be better than a brutal application of a word stop-list which can't adapt to each corpus.

Let's see how it works.



Combining Word Relevance & Co-Occurrence

How do we combine these two indicators .. co-occurrence and word relevance? Well, there are a few ways we could do it:

Option 1: Remove low-relevance stop words first, then calculate co-occurrence.


Option 2: Calculate co-occurrence then remove pairs not in most relevant list.


Option 3: Multiply co-occurrence scores with relevance scores to get a new measure.


How do we choose?

Option 2 seems inefficient as we'd end up building word-pairs and calculating co-occurrence for words which we would later discard. For large datasets that would turn into significant wasted time.

Option 1 is the simplest. Option 3 could work, but as always, we prefer to try simpler things first.
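In code, Option 1 amounts to something like this .. a sketch with a hypothetical top_relevant_words set taken from the relevance index:

# top_relevant_words is a set of the most relevant words, computed beforehand
# from the relevance index (hypothetical name, shown here for illustration)
filtered_words_list = [w for w in doc_words_list if w in top_relevant_words]

# word pairs, and hence co-occurrence, are then built from the filtered list as before
word_pair_list = zip(filtered_words_list[:-1], filtered_words_list[1:])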



Results

We've tweaked the javascript d3 code that creates the network graph plots:

var simulation = d3.forceSimulation(graph.nodes)
.force("link", d3.forceLink(graph.links).distance(50))
.force("charge", d3.forceManyBody().strength(-20))
//.force("radius", d3.forceCollide(15)) .force("center", d3.forceCenter(width / 2.0, height / 2.0));

The length of the links has been increased to make crowded networks clearer. The repulsion between nodes has been reduced to prevent them flying off the edge of the display! We tried but didn't use the enforcement of a collision radius, which effectively ensures nodes don't get too close .. it ended up creating a regular and unnatural pattern.

The following shows the network of nodes connecting co-occurring words (with window 2) which were in the top 100 most relevant for this (small) Italian Recipes dataset. It's quite crowded .. and does still contain some stop-words. If you play with it, you can explore some meaningful connections.


The next plot shows only those nodes which had a weight of more than 3.5 (we tried a few other values too).


It really brings out the key meaningful word pairings ..

  • grated cheese
  • tomato sauce
  • pour over
  • olive oil
  • bread crumbs
  • brown stock

That's a lot of success for now! Great job!



Next Steps - Performance Again

We applied the above ideas to a small dataset ... we really want to apply them to the larger data sets because co-occurrence with a stretched window doesn't really become useful with small datasets.

However, the performance of indexing and calculating co-occurrence is slow .. so next we need to investigate this and find a fix!

Saturday, 8 October 2016

Co-occurrence .. Refined ... Part 1/2

Much of our recent experiments have been based on the idea of co-occurrence .. two words being next to each other.

We assumed that if two words co-occurred they must be related.

That's a powerful theory. .... but it isn't perfect. Two key problems are:

  • Interesting words are often sat next to boring words. Apple in .. Slice the ... John and ...
  • Related words are often not right next to each other, but instead, 2 to 3 words apart .. Apple and banana ...  ... Hillary spoke to Obama ... 


What do we do?



Longer Reach Co-Occurrence

We don't need to throw away our notion of co-occurrence. All we need to do is make it a tiny bit more sophisticated to take into account the above observation that related words are sometimes a short distance apart.

There are loads of ways we can do this .. but as always, let's start simple, and only make things more complex if we need to.

Option 1: One or Two Words Apart
The simplest idea is to extend the reach of co-occurrence from words being right next to each other to having a word between them:


What we're saying here is that if two words are right next to each other, or they have one word between them, then we consider them to be co-occurrent.

That would help with the problem of some boring words, and also interesting related words being once removed.

Why stop at one word between them? ...

Option 2: One, Two, Three, Four ... N Words Apart
Why not have 2 words between co-occurrent words? Or 3? Or 4 .. maybe N words?


That would deal with related words that were further apart... but... that also begins to strain the original theory that related words are close together.

So we need to somehow give more credibility to the idea that words that are closer together are related, and less credibility to the idea that words that are further apart are related.

That's not to say, words that are far apart can't be related .. it just means that, without any other clues, and given a large corpus of text, we're more likely to believe that more closely co-occurrent words are related.

Option 3: Weights N-Words Apart
We can easily translate this idea into a calculation:

  • Set a window, of length N, which limits how far apart words can be to be considered co-occurrent.
  • Use a maths function, like exp(-x²), which gives more numerical weight to closer words than to more distant words .. see the short sketch below.
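For example, with a window of N = 3 the weights fall away like this .. a small sketch of the idea, where ci is the separation between the two words:

import math

window = 3

# weight for each separation ci = 1 .. window, using exp(-(ci/window)^2)
for ci in range(1, window + 1):
    neighbour_factor = math.exp(- math.pow(ci / window, 2))
    print(ci, neighbour_factor)

# prints approximately:
# 1 0.8948
# 2 0.6412
# 3 0.3679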


That looks like a good way forward ... let's try it ...



Results

The following shows the resulting force-directed graph for our Italian Recipes mini-corpus, using the above extended-reach co-occurrence with a window of 3 (words could be up to 3 apart).

Here's a plot of the top 500 words by co-occurrence .. click to enlarge ..


It's a bit busy .. but we can see some nodes which are connected to many others .. but sadly they're the boring stop-words.

Here's the same data but only showing a smaller set of the words with higher co-occurrence value .. more than 4.0 here.


Yup - that makes it even clearer that we need to deal with those pesky stop-words.

As a bit of a short-cut we'll use the rather brutal method of removing stop-words we developed earlier, tmt.word_processing.remove_stop_words(). Here's the result:


This is really much better now that we've cleared out the stop words. We can see words that should be related, for example:

  • bread -- crumbs
  • salt -- pepper
  • tomato -- sauce, tomato -- paste, white -- sauce
  • cut -- pieces
  • cold -- water
  • olive -- oil
  • egg -- yolk

That validates our idea .. and gives us the clearest insights yet ... we just need to deal with those stop-words!