Saturday 10 September 2016

Pandas DataFrame HDFStore Bug

Took me a whole week to narrow down and find this bug!


Some Context

We are using pandas dataframes as indices. The row labels are the words, and the columns are the documents the words occur in. The cell content is the word count, or relevance, depending on which index we're working with.

We wanted to save the index by simply pickling it using pandas.to_pickle(). That failed for large dataframes.

So we chose the HDF format, which is more mature and designed for larger datasets. It's officially supported in pandas, too.

That seemed to work ... until ...


The Bug

Saving dataframes into an HDF store is easy. Let's create a small dataframe representing one of our indices:

import pandas
# Build the frame in one call (the .ix indexer was removed in pandas 1.0).
df1 = pandas.DataFrame(
    {'001': [0.1, 0.2, 0.5], '002': [0.3, 0.7, float('nan')]},
    index=['apple', 'banana', 'nan'],
)

df1
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN


So we've created a super simple dataframe. It refers to the words "apple", "banana" and "nan".

Let's save it to an HDF store:

s = pandas.HDFStore('test')
s['index'] = df1
s.close()

That's nice and easy. The HDF file is called test, and the object inside the file is called index. You can have many objects in a single HDF store if you want to.

Let's exit Python and restart it, to make sure we're truly bringing the dataframe back from the HDF file, and not accidentally from a variable still hanging around in memory.

Let's now reopen the store:

import pandas
s = pandas.HDFStore('test')
s
<class 'pandas.io.pytables.HDFStore'>
File path: test
/index            frame        (shape->[3,2])



Here we've opened the HDF file called test, and listed what's inside it. You can see it contains an object called index. Let's bring that into python as a dataframe.

df2 = s['index']
s.close()
del s
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
NaN     0.5  NaN


You can see the problem! The words "apple" and "banana" are fine, but the word "nan" has been turned into a NaN (not a number) .. it should be a string.

That's the bug!

And it leads to all kinds of problems .. like not being able to find the word "nan", and not being able to do our relevance calculations. Even storing the index gives a warning, as Python says the index label isn't orderable, and sometimes Python tries to cast NaN to a float.
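A change like this is easy to miss by eye, so it can help to compare the row labels before and after a round trip. The sketch below is only an illustration: changed_labels is a hypothetical helper name, and the two small frames are hand-built stand-ins for the frame that was saved and the frame that came back, so that it runs without an HDF file.

```python
import numpy as np
import pandas as pd


def changed_labels(before, after):
    """Return (position, original, restored) for each row label that differs."""
    changes = []
    for pos, (old, new) in enumerate(zip(before.index, after.index)):
        # The string 'nan' and a float NaN differ in type, and NaN != anything.
        if type(old) is not type(new) or old != new:
            changes.append((pos, old, new))
    return changes


# Stand-ins (assumed data) for the frame before saving and the frame read back.
saved = pd.DataFrame({"001": [0.1, 0.2, 0.5]},
                     index=pd.Index(["apple", "banana", "nan"], dtype=object))
loaded = pd.DataFrame({"001": [0.1, 0.2, 0.5]},
                      index=pd.Index(["apple", "banana", np.nan], dtype=object))

print(changed_labels(saved, loaded))
```

In real use, the two arguments would be the dataframe you stored and the one you read back from the store. An empty list means every label survived with its type intact.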


Workaround

A temporary workaround is to force the index values to be recast as strings every time you retrieve an index back from an HDF store:

df2.index = df2.index.astype(str)
df2
        001  002
apple   0.1  0.3
banana  0.2  0.7
nan     0.5  NaN
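If the recast is needed in many places, it could be wrapped in a small helper. This is a minimal sketch, not part of pandas: restore_string_labels is a hypothetical name, and it spells out the conversion label by label instead of relying on astype(str), on the assumption that every row label is meant to be a word.

```python
import numpy as np
import pandas as pd


def restore_string_labels(df):
    """Return a copy of df whose row labels are all plain Python strings."""
    fixed = df.copy()
    fixed.index = pd.Index(
        # A missing label is assumed to have been the word "nan" originally.
        ["nan" if pd.isna(label) else str(label) for label in df.index],
        dtype=object,
    )
    return fixed


# Stand-in (assumed data) for a frame read back with a damaged label.
damaged = pd.DataFrame(
    {"001": [0.1, 0.2, 0.5], "002": [0.3, 0.7, np.nan]},
    index=pd.Index(["apple", "banana", np.nan], dtype=object),
)
repaired = restore_string_labels(damaged)
print(repaired)
```

The catch is that every missing label becomes the word "nan", so this only suits indices like ours, where all row labels are supposed to be words and at most one of them can be "nan".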



Fix?

I'm hoping the to_pickle() method gets fixed too ... not just this error.

This bug has been reported on the pandas issue tracker on GitHub.
