Let's see how promising coding with numpy is. The code in development is on GitHub.
Indexing and Merging
The Italian recipes corpus we created is small and is processed quickly. The Iraq Inquiry Report corpus is bigger and takes longer. Let's compare the time taken to index the individual documents in the corpus, and to merge these into a corpus word count index.
         Pandas    Numpy
Index    474       2.22
Merge    7.32      23.9
Total    481.32    26.12
Some interesting results:
- Indexing with numpy is about 200x faster than with pandas!
- Merging with numpy is slower than with pandas, by about 3x.
- Overall, indexing and merging with numpy is about 20x faster than with pandas!
Pretty impressive!
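To make the arithmetic behind those ratios explicit, here is a small sketch that recomputes them from the table. The dictionary layout and variable names are my own for illustration, only the numbers come from the measurements above.

```python
# Timings copied from the table above (same units as reported there).
timings = {
    "index": {"pandas": 474.0, "numpy": 2.22},
    "merge": {"pandas": 7.32, "numpy": 23.9},
}

# Total per library is simply index time plus merge time.
totals = {
    lib: sum(stage[lib] for stage in timings.values())
    for lib in ("pandas", "numpy")
}

# How much faster numpy is at indexing and overall,
# and how much slower it is at merging.
index_speedup = timings["index"]["pandas"] / timings["index"]["numpy"]
merge_slowdown = timings["merge"]["numpy"] / timings["merge"]["pandas"]
total_speedup = totals["pandas"] / totals["numpy"]

print(f"indexing: numpy is {index_speedup:.0f}x faster")
print(f"merging:  numpy is {merge_slowdown:.1f}x slower")
print(f"overall:  numpy is {total_speedup:.1f}x faster")
```

The overall figure comes out a little over 18x, which the list above rounds to about 20x.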
cProfile Merging Indices with Numpy
But the apparent anomaly of merging being slower in numpy is worth looking at .. here's what cProfile has to say:
ncalls tottime percall cumtime percall filename:lineno(function)
9513/568 0.009 0.000 0.011 0.000 core.py:1282(_recursive_make_descr)
568 0.004 0.000 0.015 0.000 core.py:1303(make_mask_descr)
513 0.000 0.000 0.000 0.000 core.py:1339(getmask)
171 0.000 0.000 0.002 0.000 core.py:1403(getmaskarray)
...
...
59 0.001 0.000 0.001 0.000 group.py:33(__init__)
1 0.000 0.000 0.010 0.010 group.py:52(create_dataset)
1 0.130 0.130 22.307 22.307 index_search.py:130(merge_wordcount_indices_for_corpus2)
1 0.000 0.000 0.000 0.000 ioloop.py:932(add_callback)
2 0.000 0.000 0.000 0.000 iostream.py:228(_is_master_process)
2 0.000 0.000 0.000 0.000 iostream.py:241(_schedule_flush)
...
...
57 0.001 0.000 0.001 0.000 recfunctions.py:315(_fix_defaults)
114 0.031 0.000 0.033 0.000 recfunctions.py:34(recursive_fill_fields)
114 0.001 0.000 0.037 0.000 recfunctions.py:466(drop_fields)
114 0.000 0.000 0.000 0.000 recfunctions.py:506(_drop_descr)
57 0.106 0.002 22.030 0.386 recfunctions.py:823(join_by)
57 0.000 0.000 0.000 0.000 recfunctions.py:906(<listcomp>)
57 0.000 0.000 0.000 0.000 recfunctions.py:907(<listcomp>)
57 0.000 0.000 0.000 0.000 recfunctions.py:937(<listcomp>)
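For reference, a listing in this ncalls / tottime / cumtime layout can be produced with the standard-library cProfile and pstats modules. This is only a minimal sketch: merge_counts() and the toy word counts are hypothetical stand-ins, not the real merge function from the project.

```python
import cProfile
import io
import pstats


def merge_counts(indices):
    """Hypothetical stand-in for the merge step: sum word counts across documents."""
    merged = {}
    for index in indices:
        for word, count in index.items():
            merged[word] = merged.get(word, 0) + count
    return merged


# Toy per-document word counts (made up for illustration).
doc_indices = [{"rice": 2, "oil": 1}, {"rice": 1, "salt": 3}]

# Profile a single call to the function under test.
profiler = cProfile.Profile()
result = profiler.runcall(merge_counts, doc_indices)

# Sort by cumulative time so the most expensive call chains come first,
# and keep only the top ten entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time is a choice made here to surface hot spots quickly; other sort keys work just as well.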
We can see that the join_by() function for merging numpy structured arrays is slow: its 57 calls account for about 22 of the 22.3 seconds spent in the merge. If you look at the source code, you can see it is Python code, not C, and it looks neither well maintained nor well documented.
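To show what join_by() does, here is a minimal sketch of an outer join of two tiny word count arrays. The field names, the integer word_id key and the data are all assumptions made up to keep the example small, the real index in the project will look different.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy word counts for two documents as structured arrays.
# word_id is a hypothetical integer key standing in for a word.
doc_a = np.array([(1, 1), (2, 2)], dtype=[("word_id", "i8"), ("doc_a", "i8")])
doc_b = np.array([(2, 1), (3, 3)], dtype=[("word_id", "i8"), ("doc_b", "i8")])

# Outer join on the key: every word_id from either side is kept.
# usemask=False returns a plain ndarray instead of a masked array,
# and defaults supplies the count for words missing from one document.
merged = rfn.join_by(
    "word_id",
    doc_a,
    doc_b,
    jointype="outer",
    usemask=False,
    defaults={"doc_a": 0, "doc_b": 0},
)

print(merged.dtype.names)
print(merged)
```

Merging a whole corpus means repeating a join like this once per document, which matches the 57 calls seen in the profile.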
If we wanted to optimise further .. we could look again at this, but for now .. we can be happy with a 20x performance boost!