The core calculation is the cosine distance between two document vectors (in word relevance space).
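Written out, that distance is 1 minus the dot product of the two vectors divided by the product of their lengths. As a reminder, here's what it looks like with plain scipy (a minimal sketch with made-up vectors) .. which is what the previous version of the code relied on:

import numpy
from scipy.spatial import distance

# two made-up document vectors in word relevance space
x = numpy.array([0.0, 1.0, 2.0, 0.5])
y = numpy.array([0.1, 0.9, 1.8, 0.0])

# cosine distance = 1 - (x . y) / (|x| * |y|) .. 0.0 means same direction, 1.0 means unrelated
d = distance.cosine(x, y)
print(d)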
After we've tried to avoid the usual pandas cell-level performance problems .. we're not left with many options. One of the few remaining options is compiling the code.
Compiling Code?
Compiling code? What's that? When computers run your programs, they ultimately have to execute instructions on the metal-and-wire CPU. Those instructions are very low-level .. they used to be called assembly instructions but no-one seems to call them that anymore. Back in the old days, with slower, smaller computers, people used to write some of their code in assembly language .. so that they squeezed every last bit of power out of their computers. The problem is .. assembly language programming is hard, slow and arduous. And prone to error.
These days, computers are fast enough for us to write in nicer, more readable and understandable languages .. like Python! These languages are nicer for us .. but they need to be translated into machine instructions so that a CPU can execute them.
Python is interpreted .. that means when you run a program, Python takes the nice, easy to understand code and translates it into lower level code (called byte code) .. which itself eventually gets turned into machine instructions run on the metal-and-wire CPU.
This is fine in the vast majority of cases .. but you can see it might not be efficient if we need tight code to manipulate an array of numbers ... instead of machine instructions that do just that .. what really happens includes the overhead of all these translation steps ...
.. and if we're in a loop .. as we often are ... then this translation gets done again and again .. repeating the same work for no good reason, because the code gets translated to the same thing every time.
Some languages like C are deliberately compiled before you run the translated code. It's not as fun, or easy, or interactive ... but it does generate efficient code. And because the translation to machine instructions - called compilation - is done once and looks at the whole code .. it can make judgements about optimising it ... like leaving out code that never gets executed .. or moving things outside a loop if they only need to be calculated once ...
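As a toy illustration of that last kind of optimisation (just a sketch, nothing Numba-specific) .. a compiler can spot that part of a loop body never changes and compute it just once:

import numpy

x = numpy.random.rand(1000)
y = numpy.zeros(1000)

# the naive loop recomputes sqrt(2.0) on every pass ..
for i in range(len(x)):
    y[i] = x[i] * numpy.sqrt(2.0)

# .. a compiler can hoist that loop-invariant work out, doing it once
c = numpy.sqrt(2.0)
for i in range(len(x)):
    y[i] = x[i] * c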
Numba is JIT :)
There are several ways of compiling Python .. but an easier to use approach is what is called Just In Time (JIT) compilation. That means Python will recognise functions that you have marked as needing compilation .. and it will compile them the first time they are executed ... and that efficient compiled set of instructions is kept for the many more times that function is called in a program.
Numba is the name of the Python extension that does this ..
Applying Numba
Numba is really easy to use. Often all you have to do is signal a function to be JIT compiled by adding a decorator @jit just above the function declaration .. like ...

from numba import jit

@jit
def myfunction(x, y):
    # do something with x and y
    z = x + y
    return z
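To see the "just in time" part in action, here's a small sketch (with a made-up function) .. the first call pays the compilation cost, later calls reuse the compiled machine code:

import time
import numpy
from numba import njit

@njit
def sum_of_squares(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total

a = numpy.random.rand(1000000)

start = time.time()
sum_of_squares(a)                      # first call .. includes compilation
print('first call ', time.time() - start)

start = time.time()
sum_of_squares(a)                      # second call .. uses the cached compiled code
print('second call', time.time() - start)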
For our own code, we've pulled the core calculation out into the following function:
import numpy
import numba

# create document similarity matrix, the relevance matrix needs to already exist
# sample_fraction is useful to select a subset of documents as the combinations get large quickly
# this version is jitted with numba
@numba.njit((numba.float64[:], numba.float64[:]))
def jit_cosine_distance(x, y):
    # z = 1.0 - numpy.dot(x,y) / (numpy.linalg.norm(x) * numpy.linalg.norm(y))
    z = numpy.dot(x, y)
    xn = numpy.sqrt(numpy.sum(numpy.square(x)))
    yn = numpy.sqrt(numpy.sum(numpy.square(y)))
    z = 1.0 - (z / (xn * yn))
    return z
That's the core calculation of the cosine distance, using a dot product and then normalising. The @njit enforces strict (nopython) compilation, whereas a normal @jit is lenient and falls back to slower object mode for code it can't compile efficiently. This way we either fail compilation .. or generate genuinely efficient code.
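As a quick sanity check (a sketch with random vectors), the jitted function should agree with the scipy version it replaces:

import numpy
from scipy.spatial import distance

x = numpy.random.rand(100)
y = numpy.random.rand(100)

print(jit_cosine_distance(x, y))
print(distance.cosine(x, y))      # the two should agree to floating point precision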
We call this function as we loop through all the document pair combinations:
# combinations (not permutations) of length 2, also avoid same-same combinations
docs_combinations = itertools.combinations(docs_sample, 2)

for doc1, doc2 in docs_combinations:
    # like the scipy function, jit_cosine_distance normalises the vectors and returns a distance (1.0 minus the similarity)
    doc_similarity_dict[doc2].update({doc1: jit_cosine_distance(relevance_index[doc1].values, relevance_index[doc2].values)})
    pass
So .. how does it perform?
Results
The following shows how long it takes to make the document similarity matrix .. with the following variables:
- varying document corpus size (fraction)
- previous function using scipy.spatial.distance.cosine()
- new function with calculation separated out and JIT turned on
- new function with calculation separated out and JIT turned off
Oh dear.
It looks like ...
- The previous simple function is the fastest.
- Separating out the calculation into new code is slower.
- Applying JIT makes it slower still!
You might look at that code and think the new version is simply more lines of instructions ... a dot product, a length for each of the two vectors, a division, a subtraction from 1.0 ... but the reason it's written out like that is that numba can't JIT code like numpy.linalg.norm or scipy.spatial.distance.cosine.
In fact .. it turns out the actual code that is executed when you call scipy.spatial.distance.cosine is already very optimised .. so we've only added more layers ...
Try Again
I asked about the performance of this similarity calculation on stackoverflow .. and had great suggestions. One suggestion was that I shouldn't just extract the cosine distance calculation into a JIT-able function .. instead I should extract the whole loop that works over the relevance matrix. The following shows the bigger chunk that has been pulled out into a JIT-able function.
That does need a bit of rework to the code .. we can't use itertools anymore but have to find a new way to generate the right combinations of documents for the similarity calculation. It's not that hard .. it's every document in the list combined with every document beyond it. We also can't pass a pandas data frame to the JIT function, as numba can't yet work with pandas data frames .. instead we pass the underlying numpy array .. which is easily accessed as df.values. We also have to re-add the columns and index when the JIT function returns a plain numpy array .. there's a sketch of that below, after the JIT function.
Here's the code for the JIT function:
@numba.njit
def similarity_jit(m):
    l = m.shape[1]
    s = numpy.zeros((l-1, l-1))
    for i in range(l):
        for j in range(i+1, l):
            s[i, j-1] = numpy.dot(m[:, i], m[:, j])
            pass
        pass
    return s
Not that complicated at all ... and the funny business around s[i, j-1] just places the results into the right place in the returned similarity matrix.
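To show how this plugs back into the pandas side, here's a minimal sketch (the helper name make_similarity_frame is made up, and relevance_index is assumed to be the relevance matrix with one column per document) .. pass the raw numpy array in, then re-attach the document names to the returned array:

import numpy
import pandas

def make_similarity_frame(relevance_index):
    # numba can't take a DataFrame, so hand it the underlying numpy array
    raw = similarity_jit(relevance_index.values)
    docs = relevance_index.columns
    # rows are documents 0 .. n-2, columns are documents 1 .. n-1 (the j-1 shift)
    return pandas.DataFrame(raw, index=docs[:-1], columns=docs[1:])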
Does this work better?
Results Take 2
The following shows how long a normal python calculation for similarity takes, compared with the new JIT function. The vertical scale is plotted logarithmically (1, 10, 100, 1000, 10000 etc). You can see now that the numba JIT code is much faster than the plain python code. It looks like the suggestion to include the loop in the JIT extract was right!
The following shows the speedup factor:
That looks like a rough speedup of 2.5 times ... not bad at all.
But the 5 days to process the entire Clinton email corpus would still take about 2 days .. still impractical!
One More Idea
John Zwinck replied to my StackOverflow question with a brilliant idea. He noticed that the core calculation is essentially the dot product of two vectors .. vectors which are columns in the relevance matrix. Each column represents a document .. with values being the relevance scores (or word counts if we're being simple).
He suggested that the numpy array passed to the JIT function should be arranged so that picking out columns was more efficient .. because the computer memory holding that column was contiguous (and not broken over different locations).
A numpy array can be transformed into this column-oriented layout as follows. order='F' is the column-major way Fortran stores arrays, as opposed to the default row-major order='C'.
new_array = numpy.array(old_array, order='F')
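A quick way to check the re-ordering has actually happened (a small sketch):

import numpy

a = numpy.arange(6).reshape(2, 3)       # default C (row-major) order
f = numpy.array(a, order='F')           # copy in Fortran (column-major) order

print(a.flags['F_CONTIGUOUS'])          # False
print(f.flags['F_CONTIGUOUS'])          # True .. columns are now contiguous in memory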
Let's see the effect of this:
WOW!
The effect of re-ordering the array is dramatic! Look at how bad the plain python gets ... 20 minutes to calculate the similarities for just 0.05 of the Clinton emails! And we saw the performance gets rapidly worse as the corpus grows ... with 5 days projected to complete the corpus.
The following graph shows the dramatic difference this makes.
Look at the speedup factor .. it keeps climbing, from 1.3 to 18.5 .. showing how much better this approach is .. and it'll keep climbing with larger data sets.
This now makes analysing larger datasets practical .. when they weren't before.
On a Google cloud virtual machine (which I could run overnight), calculating the document similarities for the entire Clinton email dataset took only 4 hours, down from a projected 5 days!
In the next post we'll work through the Clinton email data set, fully compared for document similarity.
You should also normalize all document vectors beforehand in O(n), and only do the dot product for all combinations.
Good point - if we did the normalisation once, before all this querying, it makes the similarity calculation simpler and quicker.
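As a minimal sketch of that idea (assuming relevance_index is the relevance matrix with one column per document) .. normalise each column once, and then a plain dot product between columns is already the cosine similarity:

import numpy

# normalise every document column once, up front
m = relevance_index.values.astype(numpy.float64)
norms = numpy.sqrt(numpy.sum(numpy.square(m), axis=0))
m_normalised = numpy.asfortranarray(m / norms)

# now numpy.dot(m_normalised[:, i], m_normalised[:, j]) is already the cosine
# similarity of documents i and j .. no per-pair normalisation needed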