Wednesday 19 October 2016

Pandas vs Numpy Performance

So we found previously that pandas is slow.

Let's see how promising coding with numpy is. The code in development is on GitHub.



Indexing and Merging

The Italian recipes corpus we created is small, and is processed quickly. The Iraq Inquiry Report corpus is bigger and takes longer.

Let's compare the times taken for indexing individual documents in the corpus, and the merging of these into a corpus word count index.

        Pandas   Numpy
Index   474      2.22
Merge   7.32     23.9
Total   481.32   26.12

Some interesting results:

  • Indexing with numpy is over 200x faster than with pandas!
  • Merging is the exception: with numpy it is about 3x slower than with pandas.
  • Overall, indexing and merging with numpy is nearly 20x faster than with pandas (481.32 / 26.12 ≈ 18)!
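For anyone who wants to check the arithmetic behind these ratios, here is a small sketch that recomputes them from the table. The variable names are just for illustration, and the numbers are copied straight from the table.

```python
# Timings copied from the table above.
pandas_times = {"index": 474.0, "merge": 7.32}
numpy_times = {"index": 2.22, "merge": 23.9}

pandas_total = sum(pandas_times.values())
numpy_total = sum(numpy_times.values())

# Ratios above 1 mean the first library named in each print is slower.
index_speedup = pandas_times["index"] / numpy_times["index"]
merge_slowdown = numpy_times["merge"] / pandas_times["merge"]
total_speedup = pandas_total / numpy_total

print(f"index: numpy is {index_speedup:.0f}x faster")
print(f"merge: numpy is {merge_slowdown:.1f}x slower")
print(f"total: numpy is {total_speedup:.1f}x faster")
```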


Pretty impressive!
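To give a feel for what indexing a single document with numpy can look like, here is a minimal sketch. It is not the code from the project: the function name index_document, the 'word' and 'count' field names, and the 32-character word limit are all assumptions made for illustration. It uses np.unique to count the words and stores the result in a structured array.

```python
import numpy as np

def index_document(text):
    """Return a structured array of (word, count) rows for one document.

    Hypothetical helper for illustration only, not the project's own code.
    """
    # Words longer than 32 characters are truncated by the fixed-width dtype.
    words = np.array(text.lower().split(), dtype="U32")
    # np.unique returns the sorted unique words and how often each occurs.
    unique_words, counts = np.unique(words, return_counts=True)
    index = np.empty(len(unique_words), dtype=[("word", "U32"), ("count", "i8")])
    index["word"] = unique_words
    index["count"] = counts
    return index

doc_index = index_document("pasta with tomato and basil and more tomato")
print(doc_index)
```

The counting happens inside np.unique rather than in a Python loop, which is the kind of thing that tends to make numpy code quick.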




cProfile Merging Indices with Numpy

But the apparent anomaly of merging being slower in numpy is worth looking at. Here's what cProfile has to say:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 9513/568    0.009    0.000    0.011    0.000 core.py:1282(_recursive_make_descr)
      568    0.004    0.000    0.015    0.000 core.py:1303(make_mask_descr)
      513    0.000    0.000    0.000    0.000 core.py:1339(getmask)
      171    0.000    0.000    0.002    0.000 core.py:1403(getmaskarray)
...
...
       59    0.001    0.000    0.001    0.000 group.py:33(__init__)
        1    0.000    0.000    0.010    0.010 group.py:52(create_dataset)
        1    0.130    0.130   22.307   22.307 index_search.py:130(merge_wordcount_indices_for_corpus2)
        1    0.000    0.000    0.000    0.000 ioloop.py:932(add_callback)
        2    0.000    0.000    0.000    0.000 iostream.py:228(_is_master_process)
        2    0.000    0.000    0.000    0.000 iostream.py:241(_schedule_flush)
...
...
       57    0.001    0.000    0.001    0.000 recfunctions.py:315(_fix_defaults)
      114    0.031    0.000    0.033    0.000 recfunctions.py:34(recursive_fill_fields)
      114    0.001    0.000    0.037    0.000 recfunctions.py:466(drop_fields)
      114    0.000    0.000    0.000    0.000 recfunctions.py:506(_drop_descr)
       57    0.106    0.002   22.030    0.386 recfunctions.py:823(join_by)
       57    0.000    0.000    0.000    0.000 recfunctions.py:906(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:907(<listcomp>)
       57    0.000    0.000    0.000    0.000 recfunctions.py:937(<listcomp>)
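A listing like the one above can be produced with Python's built-in cProfile and pstats modules. The sketch below profiles a stand-in function, slow_merge, which is purely hypothetical and takes the place of the real merge function. It also sorts by cumulative time, whereas the listing above looks to be sorted by file name.

```python
import cProfile
import io
import pstats

def slow_merge(n):
    # Hypothetical stand-in for the real merge function being profiled.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Collect profiling data only while the function of interest runs.
profiler = cProfile.Profile()
profiler.enable()
result = slow_merge(100_000)
profiler.disable()

# Write the report into a string, sorted by cumulative time, top 10 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```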

We can see that the join_by() function for merging numpy structured arrays is slow: its 57 calls account for 22.030 of the 22.307 seconds of cumulative time spent in the merge function. If you look at the source code, you can see it is Python code, not C, and it looks neither well maintained nor well documented.
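For reference, here is a minimal sketch of how join_by() can combine two word count arrays. The 'word' and 'count' field names and the sample data are assumptions for illustration, and this is not necessarily how merge_wordcount_indices_for_corpus2 uses it. An outer join keeps words that appear in only one of the two arrays, and the missing counts come back masked.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical per-document word count indices (field names are assumptions).
dtype = [("word", "U32"), ("count", "i8")]
doc1 = np.array([("basil", 1), ("pasta", 2), ("tomato", 3)], dtype=dtype)
doc2 = np.array([("garlic", 1), ("pasta", 1), ("tomato", 2)], dtype=dtype)

# Outer join on 'word'; the clashing 'count' fields get the given postfixes.
merged = rfn.join_by("word", doc1, doc2, jointype="outer",
                     r1postfix="_1", r2postfix="_2", usemask=True)

# Masked entries mean "word absent from that document", so treat them as 0.
words = merged["word"].data.tolist()
totals = (merged["count_1"].filled(0) + merged["count_2"].filled(0)).tolist()

for word, total in zip(words, totals):
    print(word, total)
```

Note that join_by() expects the key values within each input array to be unique, which holds for a per-document word index.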

If we wanted to optimise further, we could look at this again, but for now we can be happy with a nearly 20x performance boost!
