Wednesday, February 20, 2013

Anatomy of a coding error

A few days ago, one of the students I collaborate with found a very serious mistake in some code that I had written.  The code (which is openly available through my GitHub repo) performed a classification analysis using data from a number of studies in the openfmri project, and the results are included in a paper that is currently under review.  None of us likes to admit mistakes, but it's clear that they happen often, and the only way to learn from them is to talk about them. This is why I strongly encourage my students to tell me about their mistakes and to discuss them in our lab meeting.  This particular mistake highlights several important points:
  1. Sharing code is good, but only if someone else actually looks at it very closely.
  2. You can't rely on tools to fail when you make a mistake.
  3. Classifiers are very good at finding information, even if it's not the information you had in mind.

The code in question reads in the processed data (saved in a numpy file) and classifies each dataset (with about 183,000 features and 400 observations) into one of 23 different classes (representing different tasks). The code was made publicly available before the paper was submitted; while I have no way of knowing whether the reviewers examined it, it's fair to say that even if they did, they most likely would not have caught this particular bug unless they were very eagle-eyed.  As it happens, a student here was trying to reproduce my analyses independently, and was finding much lower classification accuracies than the ones I had reported.  As he dug into my code, it became clear that the difference was driven by a (lazy, in hindsight) coding mistake on my part.

The original code can be viewed here - the snippet in question (cleaned up a bit) is:


if trainsvm:
    for train, test in skf:  # skf: sklearn StratifiedKFold over the 400 observations
        clf.fit(data[train], labels[train])
        predicted[test] = clf.predict(data[test])
Pretty simple - it creates a cross-validation object using sklearn, then loops through the folds, fitting on the training fold and computing predicted classes for the test fold.  Running this, I got about 93% test accuracy on the multiclass problem; had I gotten 100% accuracy I would have been sure that there was a problem, but given that we had previously gotten around 80% for similar problems, I was not terribly shocked by the high accuracy. Here is the problem:

In [9]: data.shape
Out[9]: (182609, 400)

When I put the data into the numpy array, I had voxels as the first dimension, whereas for classification analysis one usually puts observations in rows and features in columns.  numpy is perfectly happy to take the train list as an array index, applying it to the first dimension.  Because of the transposed layout of the data, the effect was to classify voxels rather than subjects:

In [10]: data[train].shape
Out[10]: (350, 400)

In [11]: data[test].shape
Out[11]: (50, 400)

When I fix this by using the proper data reference (as in the current revision of the code on the repo), then it looks as it should (i.e. all voxels included for the subjects in the train or test folds):

In [12]: data[:,train].T.shape
Out[12]: (350, 182609)

In [14]: data[:,test].T.shape
Out[14]: (50, 182609)
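The two indexing behaviors can be reproduced on a toy array (the sizes here are made up for illustration):

```python
import numpy as np

# Toy stand-in for the real matrix: 6 "voxels" by 4 "observations"
data = np.arange(24).reshape(6, 4)
train = np.array([0, 1, 2])  # indices intended to pick observations

# A bare integer-array index applies to axis 0, so this silently
# selects rows (voxels), reproducing the bug:
print(data[train].shape)       # (3, 4)

# Indexing the second axis and transposing yields the intended
# observations-by-voxels layout:
print(data[:, train].T.shape)  # (3, 6)
```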

When I run this with the fixed code I get about 53% accuracy; still well above chance (which is about 4% for a 23-class problem), but much less than the 93% we had gotten previously.

It's worth noting that randomization tests with the flawed code still showed the expected null distribution, so shuffling the labels would not have caught this bug on its own.  The source of the information being used by the classifier is a bit of a mystery, but it likely reflects the fact that a voxel's position in the matrix is related to its spatial position in the brain, and the labels were grouped together sequentially in the label file; the labels were thus correlated with physical location in the brain, which provided information that could drive the classification.

This is clearly a worst-case scenario for anyone who codes up their own analyses: the paper has already been submitted, and then you find an error that greatly changes the results. Fortunately, the exact level of classification accuracy is not central to the paper in question, but it's worrisome nonetheless.

What are the lessons to be learned here?  Most concretely, it's important to check the size of data structures whenever you are slicing arrays.  I was lazy in my coding of the cross-validation loop, and I should have checked that the size of the dataset being fed into the classifier was what I expected it to be (the difference between 400 and 182609 would have been pretty obvious).  It might have added an extra 30 seconds to my initial coding time, but it would have saved me a huge headache and the hours needed to rerun all of the analyses.
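For example, a one-line assertion before the fit call (the variable names and toy sizes here are hypothetical) turns a silent transposition into an immediate failure:

```python
import numpy as np

n_voxels, n_obs = 1000, 40         # toy sizes standing in for 182609 x 400
data = np.zeros((n_voxels, n_obs)) # stored voxels-by-observations
train = np.arange(35)              # training-fold observation indices

X_train = data[:, train].T         # correct slice: observations in rows
# Guard: rows must be observations, columns must be voxels.
# With the buggy slice data[train], this assertion fails at once.
assert X_train.shape == (len(train), n_voxels), X_train.shape
```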

Second, sharing code is necessary but not sufficient for finding problems.  Someone could have grabbed my code and gotten exactly the same results that I got; only if they looked at the shape of the sliced arrays would they have noticed a problem.  I am becoming increasingly convinced that if you really want to believe a computational result, the strongest approach is to have an independent person try to replicate it without using your shared code.  Failing that, you really want a validation dataset that you can feed into the program where you know exactly what the output should be.  Randomizing the labels is one way of doing this (the outcome should be at chance), but you also want a check with real signal, where the expected accuracy is known.  Unfortunately this is not trivial for the kinds of analyses that we do, but better data simulators might help make it easier.
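As a sketch of what such a check could look like (the classifier, data, and sizes below are invented for illustration, not the pipeline from the paper), one can pair a synthetic positive control, where accuracy is known to be high, with a label permutation, where accuracy should fall to chance:

```python
import numpy as np

rng = np.random.RandomState(0)

def nearest_mean_accuracy(X, y):
    """Leave-one-out accuracy of a simple nearest-class-mean classifier."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        classes = np.unique(y[mask])
        # Class means estimated from all samples except the held-out one
        means = np.array([X[mask][y[mask] == c].mean(axis=0) for c in classes])
        pred = classes[np.argmin(((means - X[i]) ** 2).sum(axis=1))]
        correct += pred == y[i]
    return correct / len(y)

# Positive control: synthetic data with a known class-dependent signal
y = np.repeat([0, 1, 2], 20)
X = rng.randn(60, 50) + y[:, None]   # each class shifted along every feature
acc_signal = nearest_mean_accuracy(X, y)

# Null check: shuffling the labels should drive accuracy to chance (1/3)
acc_null = nearest_mean_accuracy(X, rng.permutation(y))
```

Note that the permutation check alone would not have caught this particular bug (the flawed code passed it); the positive control, where the expected accuracy is pinned down in advance, is what could have flagged the problem.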

Finally, there is a meta-point about talking openly about these kinds of errors. We know that they happen all the time, yet few people ever discuss their errors openly.  I hope that others will follow my lead in talking openly about the errors they have made, so that people can learn from them and be more motivated to spend the extra time needed to write robust code.