Tuesday 15 November 2016

More Performance (Memory) Stuff to Fix

We've already done loads of work identifying and fixing performance issues, which reached a finale in a previous post here.

Most of those issues were related to CPU inefficiency ... doing lots of stuff in a way that wasn't the best in terms of actual steps to be done, or instructions to be executed on the processor (CPU).

In trying to index the bigger Clinton emails data set ... we hit a problem .. on MacOS the Python kernel crashes .. and a quick check on Linux seemed to show huge memory use leading to memory swapping ... yuk!

So let's look into this a bit more.



Watching Memory Use

The Clinton email data set has 7945 documents, totalling about 40Mb.

That's not big data but it's not trivially small either .. and can highlight code that isn't efficient.

The indexing of each document, through tmt.index_wordcount.create_wordcount_index_for_document(), is now very fast .. 200x faster than our early code.

We can use the MacOS visual tool called Activity Monitor to monitor CPU and memory consumption .. here's how it looked while the Clinton emails were being indexed. You can see that the python process is only consuming about 128M .. less than the web browser on this laptop!


A more insightful tool is the command line vm_stat (vmstat on Linux). You can see that there is no swapping happening (the rightmost two columns). What's swapping? When normal memory runs out, many computers swap out content from memory to disk/storage and swap it back in when needed. Swapping is slow, because storage is way slower than normal super fast memory .. so we should avoid it as much as possible.
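If you want to keep an eye on memory from inside Python itself, a quick sketch like the following works too .. this assumes the psutil library is installed, and isn't part of the toolkit code:

    import os
    import psutil

    # report the resident set size (RSS) of the current python process, in megabytes
    process = psutil.Process(os.getpid())
    print("python is using %.1f Mb" % (process.memory_info().rss / (1024 * 1024)))

Calling that before and after a step gives a rough feel for how much memory that step consumes.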



The merging of those per-document indices into a bigger corpus index is done by growing a data frame in memory using tmt.index_wordcount.merge_wordcount_indices_for_corpus(). We can see the memory consumption of Python grow beyond 300M .. towards 400M ... 600M ... 900M ... and it'll keep growing...

The following snapshot shows python's memory climbing past 1.21G ... and it'll keep climbing!


Here's the vmstat from Linux showing memory being swapped out once merging gets going .. baad!


Here's the Mac OS vm_stat output showing the memory overfilling into swapping ...


And here's the crash! Oh dear.




What Can We Do?

What can we do .. that's a good question.

One thing we can't do is reduce the size of the data. Well, we could by stripping out content like stop-words .. but right now we want to work with the full text.

That text needs to contribute to a data frame which is the word-document matrix. So the size of that matrix won't go down.
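As a quick reminder of that data structure, the word-document matrix has one row per word and one column per document .. here's a tiny made-up example (not real data from the corpus):

    import pandas

    # tiny illustrative word-document matrix .. rows are words, columns are documents
    word_doc_matrix = pandas.DataFrame({
        'doc1.txt': {'war': 3, 'peace': 1, 'iraq': 0},
        'doc2.txt': {'war': 0, 'peace': 2, 'iraq': 5},
    })
    print(word_doc_matrix)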

So we have to think not about the full size of the data, but the efficiency with which it is created. Let's remind ourselves how merging indices works...
  • start with an empty corpus index pandas dataframe
  • for each document index ...
  • merge() it into the corpus data frame

This hits one of the weak spots of pandas.. the repeated extension of data frames .. rather than filling in a pre-allocated dataframe. Under the hood, merge() creates a new data frame in addition to the two it is merging ... you can see how memory consumption explodes!
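In rough code terms, the current approach looks something like the following .. this is only an illustrative sketch of the pattern, not the actual toolkit code, and the content_directory path and the inline word counting are simplified stand-ins for what create_wordcount_index_for_document() really does:

    import glob
    import collections
    import pandas

    # hypothetical location of the plain text documents .. for illustration only
    content_directory = "data_sets/clinton_emails/"

    # sketch of the current approach .. repeatedly merging per-document indices
    # into one ever-growing corpus dataframe
    corpus_index = pandas.DataFrame()

    for document_file in glob.glob(content_directory + "*.txt"):
        # crude stand-in for create_wordcount_index_for_document() ..
        # a one-column dataframe of word counts for this document
        with open(document_file) as f:
            word_counts = collections.Counter(f.read().split())
        document_index = pandas.DataFrame({document_file: word_counts})

        # merge() builds a brand new dataframe every single time ..
        # this is where memory consumption balloons
        if corpus_index.empty:
            corpus_index = document_index
        else:
            corpus_index = pandas.merge(corpus_index, document_index,
                                        left_index=True, right_index=True, how='outer')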

Instead of merging heavy pandas dataframes as we work through each document index .. maybe we should defer creating the pandas dataframe .. and for each document index, merge plain dictionaries instead. Let's say that again more clearly (there's a code sketch after the list):

  • start with an empty dictionary
  • for each document index ...
  • ... turn the data frame into a dictionary, and merge it into the (initially empty) corpus dictionary
  • finally turn the dictionary into the corpus dataframe
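Here's a rough sketch of that idea, following on from the sketch above .. again illustrative only, not the exact toolkit code:

    import glob
    import collections
    import pandas

    content_directory = "data_sets/clinton_emails/"   # hypothetical path, as before

    # accumulate word counts into a plain dictionary of dictionaries first ..
    # {document_name: {word: count}} .. dictionary updates are cheap
    corpus_dictionary = {}

    for document_file in glob.glob(content_directory + "*.txt"):
        with open(document_file) as f:
            word_counts = collections.Counter(f.read().split())
        # a per-document index as a one-column dataframe, as before
        document_index = pandas.DataFrame({document_file: word_counts})
        # to_dict() gives {document_name: {word: count}} for this one-column dataframe
        corpus_dictionary.update(document_index.to_dict())

    # only at the very end do we pay the cost of building the big corpus dataframe
    corpus_index = pandas.DataFrame.from_dict(corpus_dictionary).fillna(0)

The key difference is that the expensive dataframe construction happens exactly once, instead of once per document.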


Let's try it ...



Performance Improved Again

Here are the previous performance improvements we achieved (times in seconds):

            Pandas    Numpy    Better Pandas
    Index   474       2.22     2.29
    Merge   7.32      23.9     4.34
    Total   481.32    26.12    6.63

The merging of the Iraq corpus took 7.32 seconds with the original pandas code. Our new code takes ... 1.74 seconds! That's a speed up of over 4x !!

            Pandas    Numpy    Better Pandas
    Index   474       2.22     2.29
    Merge   7.32      23.9     1.74
    Total   481.32    26.12    4.03

Overall .. from our initial code taking 481.32 seconds to our new code taking 4.03 seconds .. we have an overall speedup of about 120x !!


This validates our idea about deferring the use of pandas dataframes, avoiding their repeated updating.



Tough Test - Clinton Emails

Let's see how it works with the bigger Clinton emails corpus, which previously crashed as we ran out of memory.

Crunching through the emails causes a warning ...


It took me a while to work out what was causing this ... the division by zero shouldn't happen .. because each document vector should have a non-zero length (all the documents should contain at least 1 word!) ...

It turned out the problem was an empty document, C05762481.txt, which was removed from the corpus.
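To see why an empty document breaks things, consider how cosine similarity between two document vectors is calculated .. the denominator is the product of the two vector lengths, so an all-zero vector means division by zero. This is just an illustrative sketch, not the toolkit's actual similarity code:

    import numpy

    # cosine similarity between two word-count vectors ..
    # an empty document gives an all-zero vector, so the denominator is zero
    def cosine_similarity(doc1_vector, doc2_vector):
        denominator = numpy.linalg.norm(doc1_vector) * numpy.linalg.norm(doc2_vector)
        if denominator == 0:
            # one of the documents is empty .. no meaningful similarity
            return 0.0
        return numpy.dot(doc1_vector, doc2_vector) / denominator

Guarding against the zero denominator like this is one option .. simply removing empty documents from the corpus, as done here, is another.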

With that fixed .. we start to get results ..

Actually we don't .. the calculations seem to take forever! Remember that to do document similarity calculations we have to compare each document with every other document .. the combinations quickly explode! The following shows how long it takes to calculate a number of combinations, from 10 up to 100,000. Remember it only takes 1000 documents to create half a million combinations!


The relationship seems linear .. 1200 seconds to crunch through 100,000 combinations .. at that rate, it would take about 5 days to crunch through all the Clinton emails (31,549,596 combinations).
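The arithmetic behind that estimate is easy to check .. n documents means n * (n - 1) / 2 unique pairs:

    # 7944 documents remain after removing the one empty email
    n = 7944
    combinations = n * (n - 1) // 2
    print(combinations)                                # 31549596

    # at roughly 1200 seconds per 100,000 combinations ..
    seconds = combinations / 100000 * 1200
    print("%.1f days" % (seconds / (60 * 60 * 24)))    # about 4.4 days .. call it 5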

Ok ... even if we tweak the code, the fact that the algorithmic complexity is combinatorial will be difficult to mitigate. That is, shaving off a bit of time through quicker code will be negated by growing the dataset size by only a small amount.

So another approach is needed. We could just randomly sample the documents to create a smaller subset to work with .. it's not perfect as we might miss some key relationships .. but it's better than no insights at all. The new create_doc_similarity_matrix(content_directory, sample_fraction) function now takes a sample_fraction which determines what fraction of the documents to sample .. a value between 0 and 1.
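The heart of that sampling idea is small .. here's a minimal sketch, assuming the corpus index is a pandas dataframe with one column per document (the sample_documents() name is made up, and the real create_doc_similarity_matrix() code may well differ):

    import pandas

    def sample_documents(corpus_index, sample_fraction):
        # randomly keep only a fraction of the document columns ..
        # sample_fraction is between 0 and 1
        return corpus_index.sample(frac=sample_fraction, axis=1)

The similarity calculations then only see the sampled columns, which shrinks the number of pairwise combinations dramatically .. a sample_fraction of 0.01 leaves roughly 79 documents, and only about 3,000 combinations.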

Let's try a small fraction of 0.01 .... that took 50 seconds.
Let's try a fraction of 0.02 .. that took 169 seconds .... and the results look like:


We have a result! We can see clusters of documents ... and if we run the process again, so that a different subset of documents is analysed for similarity, we get ..


We could look at the documents in each cluster to determine what they're about .. but that's the subject of our other work .. automatically working out what the key topics are ...
