Let's apply these tools to our toolkit, because we know that some things are slow ... like indexing, and calculating co-occurrence.
Profiling Indexing
Let's try the function-level cProfile first. We'll wrap the indexing code inside a function do_processing() to make it easier to call with cProfile .. as follows:

def do_processing():
    # for all documents in corpus
    for document_name in cr.get_documents():
        #print("processing ", document_name)
        # get document text
        document_text = cr.get_text_by_document(document_name)
        # simplify whitespace (remove newlines)
        b = tmt.text_processing.simplify_whitespace(document_text)
        # only keep alphanumeric characters, removes punctuation
        c = tmt.text_processing.keep_alphanumeric(b)
        # make lowercase
        d = tmt.text_processing.to_lowercase(c)
        # split into words list
        dl = tmt.text_processing.split_text_into_words(d)
        # build n-grams
        #gl = tmt.word_processing.build_ngrams_from_words(dl,2)
        # remove stop words
        #el = tmt.word_processing.remove_stop_words(dl, "./stopwords/minimal-stop.txt")
        # update index
        tmt.index_search.create_wordcount_index_for_document(cr.content_directory, document_name, dl)
        pass
    pass

# profile code
cProfile.run('do_processing()')
The output is long .. here's the top of it ..
211209192 function calls (206665033 primitive calls) in 565.070 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
2674860 1.814 0.000 5.263 0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
1 0.109 0.109 565.070 565.070 <ipython-input-11-9069811af5cf>:4(do_processing)
1 0.000 0.000 565.070 565.070 <string>:1(<module>)
534856 0.673 0.000 1.040 0.000 __init__.py:168(iteritems)
58 0.000 0.000 0.000 0.000 __init__.py:224(u_safe)
58 0.001 0.000 0.458 0.008 __init__.py:512(__init__)
58 0.002 0.000 0.458 0.008 __init__.py:581(update)
2088 0.002 0.000 0.029 0.000 _methods.py:34(_prod)
534450 0.500 0.000 5.571 0.000 _methods.py:37(_any)
116 0.000 0.000 0.000 0.000 _weakrefset.py:16(__init__)
116 0.000 0.000 0.000 0.000 _weakrefset.py:20(__enter__)
116 0.000 0.000 0.000 0.000 _weakrefset.py:26(__exit__)
116 0.000 0.000 0.000 0.000 _weakrefset.py:70(__contains__)
58 0.000 0.000 0.001 0.000 abc.py:178(__instancecheck__)
267109 0.683 0.000 0.992 0.000 algorithms.py:807(_get_take_nd_function)
267109 4.719 0.000 28.988 0.000 algorithms.py:840(take_nd)
232 0.001 0.000 0.002 0.000 array.py:113(_getrowsize)
232 0.004 0.000 0.154 0.001 array.py:131(__init__)
232 0.003 0.000 0.060 0.000 array.py:190(_g_create)
...
...
Looking through the output, the rows with large times are mostly associated with built-in or library functions which we can do little about .. we're not going to rewrite core Python or the pandas library. Here are some examples:
58 0.000 0.000 0.001 0.000 generic.py:1904(_update_inplace)
267167 2.743 0.000 193.508 0.001 generic.py:2307(reindex_axis)
267167 4.327 0.000 78.374 0.000 generic.py:2320(_reindex_with_indexers)
267167 0.348 0.000 0.400 0.000 generic.py:2641(__finalize__)
...
58 0.000 0.000 0.032 0.001 group.py:917(_f_close)
58 1.595 0.028 554.948 9.568 index_search.py:70(create_wordcount_index_for_document)
267167 1.235 0.000 47.746 0.000 indexing.py:101(_get_setitem_indexer)
534334 4.355 0.000 43.300 0.000 indexing.py:1102(_convert_to_indexer)
267167 3.294 0.000 551.357 0.002 indexing.py:126(__setitem__)
801501 0.897 0.000 1.291 0.000 indexing.py:128(<genexpr>)
267167 1.671 0.000 45.093 0.000 indexing.py:163(_convert_tuple)
58 0.000 0.000 0.001 0.000 indexing.py:1753(convert_to_index_sliceable)
...
928 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
1068900 1.336 0.000 1.336 0.000 {method 'fill' of 'numpy.ndarray' objects}
2939011 0.946 0.000 0.946 0.000 {method 'get' of 'dict' objects}
267167 90.638 0.000 92.661 0.000 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
1603408 128.358 0.000 129.298 0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
638 0.001 0.000 0.001 0.000 {method 'groups' of '_sre.SRE_Match' objects}
..
232 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
267109 12.336 0.000 12.336 0.000 {pandas.algos.take_2d_axis1_float64_float64}
535204 30.105 0.000 31.133 0.000 {pandas.lib.infer_dtype}
267225 0.081 0.000 0.081 0.000 {pandas.lib.is_bool}
Apart from telling us that our create_wordcount_index_for_document() takes most of the time .. which isn't that useful, as we already know it's the indexing function ... the rest is not very insightful, because the entries are internal functions and we don't know where they're called from.
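One small refinement, not used in the run above but worth knowing, is to sort the cProfile report by cumulative time, or to save the raw statistics and dig into them with the standard pstats module, which can also show the callers of a given function. This is just a sketch of the standard library options .. the filename do_processing.prof is an arbitrary example:

# sort the report by cumulative time instead of by function name
cProfile.run('do_processing()', sort='cumtime')

# or save the raw stats and explore them afterwards with pstats
import pstats
cProfile.run('do_processing()', 'do_processing.prof')
p = pstats.Stats('do_processing.prof')
p.sort_stats('cumulative').print_stats(20)   # show only the top 20 entries
p.print_callers('get_loc')                   # which code calls the slow pandas get_loc?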
So let's try pprofile, the line-by-line profiler. Here's how we invoke it:
profiler = pprofile.Profile()
with profiler:
    do_processing()
profiler.dump_stats(filename="pprofile_index_1.txt")
The output is even bigger! But it is organised into sections for each Python source code file, ordered by the time spent in each file.
The most time (28%) is spent in the pandas/indexes/base.py library. We shouldn't be surprised, as we're using pandas to do much of the hard work. What will be more interesting is which bits of our own code end up calling that pandas code. Next is 22% in pandas/core/internals.py, then 12% in pandas/core/indexing.py, 11% in pandas/core/generic.py, 9% in pandas/core/common.py ...
... so that's already 82% of the total time in pandas!
We then get to our own code:
File: /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py
File duration: 231.942s (6.50%)
Line #| Hits| Time| Time per hit| %|Source code
------+----------+-------------+-------------+-------+-----------
1| 0| 0| 0| 0.00%|# module apply processing to text (not word lists)
2| 0| 0| 0| 0.00%|
3| 0| 0| 0| 0.00%|# import text string stuff
4| 0| 0| 0| 0.00%|import string
5| 0| 0| 0| 0.00%|
6| 0| 0| 0| 0.00%|
7| 0| 0| 0| 0.00%|# split into words
8| 58| 0.00025773| 4.44363e-06| 0.00%|def split_text_into_words(input_text):
9| 58| 0.190883| 0.00329108| 0.01%| output_words_list = input_text.split(" ")
10| 58| 0.00185609| 3.20015e-05| 0.00%| return output_words_list
11| 0| 0| 0| 0.00%|
12| 0| 0| 0| 0.00%|
13| 0| 0| 0| 0.00%|# make lowercase
14| 58| 0.000458717| 7.90892e-06| 0.00%|def to_lowercase(input_text):
15| 58| 0.0161779| 0.000278929| 0.00%| output_text = input_text.lower()
16| 58| 0.00069499| 1.19826e-05| 0.00%| return output_text
17| 0| 0| 0| 0.00%|
18| 0| 0| 0| 0.00%|
19| 0| 0| 0| 0.00%|# remove whitespace, convert to single space
20| 58| 0.000150442| 2.59383e-06| 0.00%|def simplify_whitespace(input_text):
21| 0| 0| 0| 0.00%| #output_text = input_text.translate(str.maketrans(string.whitespace, ' ' * len(string.whitespace)))
22| 58| 0.446623| 0.0077004| 0.01%| output_text = " ".join(input_text.split())
23| 58| 0.00181532| 3.12986e-05| 0.00%| return output_text
24| 0| 0| 0| 0.00%|
25| 0| 0| 0| 0.00%|
26| 0| 0| 0| 0.00%|# remove punctuation
27| 0| 0| 0| 0.00%|def remove_punctuation(input_text):
28| 0| 0| 0| 0.00%| output_text = input_text.translate(str.maketrans({a:None for a in string.punctuation}))
29| 0| 0| 0| 0.00%| return output_text
30| 0| 0| 0| 0.00%|
31| 0| 0| 0| 0.00%|
32| 0| 0| 0| 0.00%|# keep only alphanumeric characters
33| 58| 0.000307798| 5.30687e-06| 0.00%|def keep_alphanumeric(input_text):
34| 32473800| 231.281| 7.12207e-06| 6.49%| output_text = ''.join(c for c in input_text if c in (" " + string.ascii_letters + string.digits))
(call)| 15895288| 111.163| 6.99348e-06| 3.12%|# /Users/tariq/Desktop/work/python_notebooks/makeyourowntextminingtoolkit/text_mining_toolkit/text_processing.py:34 <genexpr>
35| 58| 0.00230646| 3.97666e-05| 0.00%| return output_text
That file only accounts for 6.50% of the total time, but inside it we can see that almost all of it goes to one line .. the string join inside keep_alphanumeric() takes 6.49% of the total time, which is actually very significant. The other string join, in simplify_whitespace(), is also a bit slow, taking 0.01% of the time.
String operations do seem slow!
So that's a good result already .. if we can improve the string join operations, or even better, the function which filters out non-alphanumeric characters ... then we've addressed 6% of the performance problem!
Improving The Slow AlphaNumeric Filter
Before we make the change, we should time how long it takes to run the index creation for each document in the big Iraq Report corpus.

# profile code
%time do_processing()
CPU times: user 6min 30s, sys: 9.24 s, total: 6min 39s
Wall time: 6min 44s
And the merging of the index ...
# profile code
%time tmt.index_search.merge_wordcount_indices_for_corpus(cr.content_directory)
CPU times: user 5.82 s, sys: 662 ms, total: 6.48 s
Wall time: 7.54 s
We'll replace the code which strips out all non-alphanumeric characters with one based on the regular expression engine.
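A simple benchmark comparing the relative performance of the two approaches gives a feel for the difference. The sketch below uses %timeit, and the sample text, the pattern and the function names are illustrative assumptions rather than the toolkit's actual code:

import re
import string

sample_text = "Some sample text, with punctuation ... UPPER and lower case, and numbers like 42!" * 1000

# original style: filter character by character with a generator expression and join
allowed_characters = " " + string.ascii_letters + string.digits
def keep_alphanumeric_join(input_text):
    return ''.join(c for c in input_text if c in allowed_characters)

# replacement style: let the regular expression engine strip everything else out
pattern = re.compile('[^a-zA-Z0-9 ]+')
def keep_alphanumeric_regex(input_text):
    return pattern.sub('', input_text)

%timeit keep_alphanumeric_join(sample_text)
%timeit keep_alphanumeric_regex(sample_text)

In the toolkit itself the change amounts to importing re and swapping the join-based filter inside keep_alphanumeric() for a regular expression substitution .. which is the 2-line change mentioned below.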
That change from the ''.join() method to a regular expression method sped up this example by over 11 times .. that is a massive speed up!
Let's see how it helps our indexing. The following shows how long the indexing takes after this change has been implemented.
# profile code
%time do_processing()
CPU times: user 5min 6s, sys: 8.47 s, total: 5min 15s
Wall time: 5min 16s
So we have a speedup from 6m 39s to 5m 15s which is 399 seconds down to 315 seconds ... so we've shaved off 21% of the time!
That's not a bad result for a simple 2-line code change .. one of which is an import statement!
21% reduction in time
...
with just 2 lines of changed code!
That's the power of profiling .. it points you to where the problems actually are.
We're not finished yet ... more in the next post!