Wednesday, 12 October 2016

Profiling Python Performance

We've created a good bit of code already .. both in Python notebooks, and in the textminingtoolkit package of libraries we're developing.

But some of it is slooow:

  • Creating word-document indices is slooow.
  • Calculating the relevance matrix is slooow.
  • Calculating the co-occurrence matrix is also slooow.

How can we fix it? Well .. we can't fix it unless we can measure it!

We need to find out which bits of code are slow. There are several ways of doing this... let's look at a few, starting super simple.



Super Simple - Time It!

The simplest approach to measuring the performance of bits of code is to run them and record the time they took with a stopwatch.

Very often, however, python code runs so fast we'd never have a chance to get the timing right with a stopwatch! Stuff happens in microseconds .. not seconds or minutes.

So Python itself, the Jupyter notebook, and also 3rd party libraries provide ways to measure how long bits of code take, much more accurately.

Imagine we have a super simple function, called work() which takes a parameter n:

def work(n):
    print("hello")
    x = n * n
    print(x)
    pass

You can see it is super simple, it just says hello and calculates the square of n, before printing it out. Simples!

If we call it, here's what happens:

work(3)
hello
9

In a python notebook, we can use a simple command to time how long some code takes. This command is only available in the notebook, not in python generally .. commands like this begin with a % symbol, and are sometimes called python magics! Don't ask me why ...

Here's the %time magic:

%time work(4)
hello
16
CPU times: user 48 µs, sys: 0 ns, total: 48 µs
Wall time: 52.9 µs

That is sooooo much easier than the old-fashioned method of grabbing the time with a python call, running the function or code, grabbing the time again, and working out the difference ... painful!

The work() function is not taxing at all .. you can see it took 48 microseconds! That wall time is a bit bigger because that's the stopwatch time .. which includes time your computer was doing other stuff that's nothing to do with running work() .. like updating the pointer on your screen or seeing if a button was pressed .. etc 
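
By the way, just for comparison, here's a rough sketch of that old-fashioned manual stopwatch approach, using Python's standard time module .. nothing notebook-specific here:

import time

start = time.perf_counter()    # stopwatch start
work(4)
end = time.perf_counter()      # stopwatch stop

print("work(4) took", end - start, "seconds")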

Now let's rewrite our work() function with a fast bit and a slow bit:

def quick(n):
    # the quick way .. just multiply
    return n*n

def slow(n):
    # the slow way .. count up to n squared one step at a time
    counter = 0
    for i in range(n):
        for j in range(n):
            counter +=1
            pass
        pass
    return counter

def work(n):
    print("hello")
    a = quick(n)
    b = slow(n)
    print(a,b)
    pass

You can see that work() now calls two other functions, quick() and slow(), both of which calculate the square of n. quick() does it more efficiently than slow().

Let's try again with the magic timer:

%time work(5000)
hello
25000000 25000000
CPU times: user 1.51 s, sys: 22.2 ms, total: 1.54 s
Wall time: 1.57 s

You can see that this now takes 1.54 seconds ... noticeably slower.
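
As an aside, the notebook also has a %timeit magic, which runs a statement several times and reports an average .. we could point it at the two helper functions directly, skipping work() so we don't get flooded with hello printouts (the slow one will take a while, because it gets run repeatedly):

%timeit quick(5000)
%timeit slow(5000)

Either way, these magics only tell us how long the overall call took.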

But this doesn't tell us why work() is slow. It doesn't tell us whether the quick() function really is quick .. nor does it tell us whether the slow() function really is slow.

That's what we need .. to be able to find which bits of our functions are slow .. so we can zoom in on the hotspots and try to improve them.

That's where more sophisticated profilers are the right tool for the job.



Python Profiler: cProfile

There are a few options for profiling python code. Python itself has built-in profilers.

cProfile is the most commonly used built-in profiler. Here's how we'd use it ...

import cProfile

# profile a single call to work() and print a report of where the time went
cProfile.run('work(5000)')

The results are:

hello
25000000 25000000
         56 function calls in 1.637 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <ipython-input-2-152b7fd5fd62>:1(quick)
        1    0.000    0.000    1.637    1.637 <ipython-input-2-152b7fd5fd62>:13(work)
        1    1.636    1.636    1.636    1.636 <ipython-input-2-152b7fd5fd62>:4(slow)
        1    0.000    0.000    1.637    1.637 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        6    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        6    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
        6    0.000    0.000    0.000    0.000 iostream.py:309(write)
        1    0.000    0.000    0.000    0.000 posix.py:53(wake)
        2    0.000    0.000    0.000    0.000 stack_context.py:253(wrap)
        2    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000    1.637    1.637 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        6    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of '_io.FileIO' objects}
        6    0.000    0.000    0.000    0.000 {method 'write' of '_io.StringIO' objects}

Whoaahh! What's all that!?

Don't worry .. it's just a very detailed listing of what the code is doing and for how long. Let's break it apart ..
  • 56 function calls were made in total .. that includes functions that are called inside python by our own code
  • total time was 1.637 seconds
Now the interesting bit ...
  • work() took 1.637 seconds .. we know that's the total time .. makes sense because work() is the main function we're calling
  • slow() takes 1.636 seconds ... which is almost all the time the whole work() function took
  • quick() is so quick it doesn't even register on the stopwatch!

By the way, tottime is the time spent in a function excluding any functions called from it .. and cumtime is the time including any functions called from it .. that's why work() has a high cumtime and a low tottime, as all it does is call 2 other functions.
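
For bigger programs that listing can get very long, so it helps to sort it. cProfile.run() takes a sort option, and the standard pstats module can load saved stats and print just the top few entries .. something along these lines, where 'tottime' puts the biggest self-time hotspots first and work.stats is just a file name I've made up:

import cProfile, pstats

# print the report sorted so the functions with the most self-time come first
cProfile.run('work(5000)', sort='tottime')

# or save the stats to a file, then print only the top 5 entries
cProfile.run('work(5000)', 'work.stats')
pstats.Stats('work.stats').sort_stats('tottime').print_stats(5)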

That's what we wanted - to see which bits of code inside a bigger bunch of code were the slow hotspots.



Python Profiler: pprofile

As great as cProfile is .. it only profiles called functions ... which in some cases is not granular enough. Sometimes we want to know how each line of python performs. And sometimes, not everything that is a performance hotspot is in fact a function.

So let's look at pprofile which aims to do profiling per line of python.

To use it, we import it, set up a profiler, and call the function we're interested in inside a with block:

import pprofile

profiler = pprofile.Profile()
with profiler:
    # run work() with the line-by-line profiler switched on
    work(5000)
# write the per-line statistics out to a text file
profiler.dump_stats(filename="abc.txt")

Normally we'd only need to print the output statistics with a profiler.print_stats() but right now there is a bug which means it doesn't work properly in a notebook - here's the github issue 19 I raised. The workaround for now is to dump the stats to a file, "abc.txt" here.
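
That dumped file is just plain text, so we can peek at the start of it from inside the notebook with ordinary file reading .. nothing pprofile-specific here:

# print the first 25 lines of the dumped statistics file
with open("abc.txt") as f:
    for line in f.readlines()[:25]:
        print(line, end="")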

The output file we get is huge .. because this profiler also profiles the code called by our own python .. deep into the core python libraries and functions. Luckily the top of the file is most relevant to our own code:

Total duration: 274.336s
File: <ipython-input-2-152b7fd5fd62>
File duration: 274.334s (100.00%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         1|  4.05312e-06|  4.05312e-06|  0.00%|def quick(n):
     2|         1|  5.96046e-06|  5.96046e-06|  0.00%|    return n*n
     3|         0|            0|            0|  0.00%|
(call)|         1|      274.336|      274.336|100.00%|# <ipython-input-2-152b7fd5fd62>:13 work
     4|         1|   3.8147e-06|   3.8147e-06|  0.00%|def slow(n):
     5|         1|  8.10623e-06|  8.10623e-06|  0.00%|    counter = 0
     6|      5001|      0.02175|  4.34912e-06|  0.01%|    for i in range(n):
     7|  25005000|      91.4991|  3.65923e-06| 33.35%|        for j in range(n):
     8|  25000000|      92.4224|   3.6969e-06| 33.69%|            counter +=1
     9|  25000000|      90.3722|  3.61489e-06| 32.94%|            pass
    10|      5000|     0.018151|  3.63021e-06|  0.01%|        pass
    11|         1|  4.05312e-06|  4.05312e-06|  0.00%|    return counter
    12|         0|            0|            0|  0.00%|
    13|         1|  3.69549e-05|  3.69549e-05|  0.00%|def work(n):
    14|         1|  0.000101089|  0.000101089|  0.00%|    print("hello")
(call)|         2|   0.00199699|  0.000998497|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    15|         1|   2.5034e-05|   2.5034e-05|  0.00%|    a = quick(n)
(call)|         1|  1.00136e-05|  1.00136e-05|  0.00%|# <ipython-input-2-152b7fd5fd62>:1 quick
    16|         1|  3.38554e-05|  3.38554e-05|  0.00%|    b = slow(n)
(call)|         1|      274.334|      274.334|100.00%|# <ipython-input-2-152b7fd5fd62>:4 slow
    17|         1|  5.00679e-05|  5.00679e-05|  0.00%|    print(a,b)
(call)|         4|  0.000454903|  0.000113726|  0.00%|# /Users/tariq/anaconda3/lib/python3.5/site-packages/ipykernel/iostream.py:309 write
    18|         1|  5.96046e-06|  5.96046e-06|  0.00%|    pass

This is much more insightful ... giving us information line-by-line. Let's break it down:

  • The total time was 274 seconds .. that's because this method of profiling adds huge overhead. If you're interested, pprofile can also work out performance hotspots through statistical sampling, which has less overhead but is less accurate ... often accurate enough though! There's a small sketch of that just after this list.
  • The hits column shows which lines get executed many times .. we can see the lines inside the slow() function being called many many times .. some 5000 times, some 25000000 times! Obviously a hotspot if they're not super-quick.
  • The time and time per hit columns show which lines take most time. We can see those loops consume the most time .. but each individual line doesn't take especially long .. it's just that they're called 5000 or 25000000 times!
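
Here's roughly what that statistical sampling mode looks like .. I'm going from the pprofile documentation here, so treat the class name, the period parameter and the dump-to-file workaround as things to double-check against the docs rather than gospel:

import pprofile

# statistical profiler .. samples the running code at intervals instead of
# tracing every single line, so the overhead is much lower
profiler = pprofile.StatisticalProfile()
with profiler(period=0.001):    # sample roughly every millisecond
    work(5000)
# dump the stats to a file, as before
profiler.dump_stats(filename="abc_statistical.txt")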




Next Post

Excellent stuff! We've got the tools now to see if we can identify which parts of our textminingtoolkit code are slow .. that's the next post!
