tag:blogger.com,1999:blog-8374109812192174281.post8159048865876491344..comments2018-05-21T01:05:19.125-07:00Comments on russpoldrack.org: The perils of leave-one-out crossvalidation for individual difference analysesRuss Poldrackhttp://www.blogger.com/profile/03305657400941743430noreply@blogger.comBlogger14125tag:blogger.com,1999:blog-8374109812192174281.post-75046225940637737592017-04-20T08:55:15.973-07:002017-04-20T08:55:15.973-07:00This is an interesting post! I observed this as w...This is an interesting post! I observed this as well, and puzzled over it, when doing analyses for our 2011 J Neuro paper predicting placebo responses. I arrived at the conclusion that the correlation metric is not a good outcome metric for optimizing models, for two reasons. <br />- Assessing the correlation involves fitting another model on the cross-validated predictive estimates, with one slope parameter and two variance parameters<br />- The null hypothesis is not exactly zero, for reasons you and other commentators have raised. (Yes, I think the same fundamental principle applies to cases with unbalanced LOO cross-validation with two groups and LOO with regression, as Niko pointed out).<br /><br />For that reason, we optimized the models minimizing absolute prediction error, and reported error. We've been using permutation tests to check/validate what the null distribution is, and for prediction error (not correlation) these are zero with a valid LOO cross-validation on independent observations. 
But correlations are intuitive and people generally understand what the values mean (though point taken that they can be somewhat misleading if the null is not zero!), so we reported those as well and continue to do so for ease of communication when we feel that it's an adequate representation of prediction performance.<br /><br />A few other reflections and beliefs, appreciated after reading the Hastie book and some other literature:<br />- All cross-validated estimates are biased towards zero relative to the performance of the full sample, because you're excluding data points<br />- Among cross-val strategies, LOO is minimum bias but max variance, because the training samples are so similar (there is dependency among training samples across folds). The "minimum bias" here is independent of any bias induced by estimating correlations post hoc on cross-validated estimates, which does indeed induce a bias from a different source.<br />- k-fold with fewer folds biases accuracy estimates more strongly towards zero, but reduces variance. So there is a bias-variance tradeoff.<br />- The optimal cross-validation holdout set depends on sample size, and probably other things. With larger samples, you can get away with fewer folds without damaging model development (and thus performance) as much.<br />- Correlations are unbiased if only a single measure is tested, without any fishing.<br />- I generally like the idea of 5-fold or so, but for small samples, LOO seems OK to me, with appropriate checks and caveats -- i.e., observations must be truly independent, permutation test-based checks of the null distribution, and more. These days I figure that unless one tests performance prospectively in truly independent datasets, trust in how well the model performs is limited anyway. 
So we've been trying to focus more on that.<br />Tor Wagerhttps://www.blogger.com/profile/01690528654742947755noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-28597604341892567962013-07-27T14:23:54.584-07:002013-07-27T14:23:54.584-07:00Trevor - many thanks for digging into this!Trevor - many thanks for digging into this!Russ Poldrackhttps://www.blogger.com/profile/03305657400941743430noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-37582269370953224902013-07-27T14:16:16.866-07:002013-07-27T14:16:16.866-07:00My student Will Fithian and I have responded to th...My student Will Fithian and I have responded to this interesting phenomenon. Since I could not figure out how to use Mathjax in blogger, I ended up entering the material in tumblr. Here is the url: http://not2hastie.tumblr.com/<br /><br />Sorry for the inconvenience, but please have a look<br /><br />Trevor Hastie<br />Stanford Statisticsnot2Hastiehttps://www.blogger.com/profile/15524949813403202887noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-56047743308785889722013-01-12T03:44:14.399-08:002013-01-12T03:44:14.399-08:00Hi, Russ,
I've also run into a similar issue w...Hi, Russ,<br />I've also run into a similar issue with the LOO procedure, after I was alerted to the potential bias by Lucas Parra. In our case we are using LOO with matched filtering (essentially Fisher's linear discriminant, but without taking the noise covariance into account). We are *not* doing classification (as in LDA), but just projecting onto the (normalized) discriminant vector and then looking at the continuous valued result. In this case the mean is unbiased (as best I can tell after lots of testing), but the distribution is skewed to the left, so the median is always > 0. As a result a sign test or signed rank test is highly significant even for gaussian random data. This is especially pronounced when there are very few dimensions (like < 10), but for about 20 - 100 dimensions the skew becomes negligible (things might be different when you have even more dimensions - I haven't tried). In any event, because the mean seems to remain unbiased (in this case at least - not for correlation as you've found), then analyses across subjects should be OK. Within subject, you would want to do a resampling test (which I guess you should always do anyway). Again, thanks to Lucas Parra for pointing this out to us.Aaron Schurgerhttps://www.blogger.com/profile/13592674149775288836noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-9644210227689230362013-01-12T03:38:55.656-08:002013-01-12T03:38:55.656-08:00This comment has been removed by the author.Cotton Candyhttps://www.blogger.com/profile/13592674149775288836noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-47474088294773990932012-12-20T07:31:51.647-08:002012-12-20T07:31:51.647-08:00Niko - thanks for your comments. The exploration i...Niko - thanks for your comments. 
The exploration issue is completely separate from what I am talking about here - in fact this problem initially arose for us in the context of whole-brain analyses where we were trying to predict some behavioral measure from whole-brain activation. Some further explorations (based on suggestions by Sanmi Koyejo, mentioned in Yarik's comment and implemented I think in the latest code on the repo, but not really discussed explicitly) show that the problem is due to the intercept term in the linear regression. I was not suggesting that the correlation estimates are biased across separate samples from a population; rather, I was highlighting the negative dependency that you mention, which is seen in a bias in the correlation between predicted and actual outcomes across folds. <br /><br />I'm not sure that "crossvalidated correlation" is the right term for what I computed. I basically perform a linear regression on the training set and then use that regression solution to predict the values of the test set. I then compute the correlation between the predicted and actual values - it's this correlation that I find to be biased, but whose bias appears to largely go away when the intercept is fixed to zero in the regression. You may be right that a Bayesian approach is a better way to address this - if only I had time to give it a try!Russ Poldrackhttps://www.blogger.com/profile/03305657400941743430noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-5045002019699779812012-12-20T07:10:19.578-08:002012-12-20T07:10:19.578-08:00this is thought-provoking. but i’m not sure i unde...this is thought-provoking. but i’m not sure i understand why the sample correlation is misleading in the absence of any exploration.<br /><br />crossvalidation serves to estimate out-of-sample performance of a predictive model. crossvalidation is needed to account for the exploration process, e.g. 
when searching for a predictive variable (such as a brain location) or fitting parameters.<br /><br />i've seen the bias of leave-one-out crossvalidation in this context. the cross-subject correlation of brain activity with some subject covariate will obviously be highly biased when selecting the peaks of a brain map. the selection is the exploration (or ROI fitting) process and we need some form of crossvalidation. consistent with the discussion here, leaving one subject out is not a good way of getting an estimate of the out of sample correlation.<br /><br />i'm also familiar with the negative accuracy bias incurred by leaving one data point out in classification -- which can be understood as reflecting the fact that the frequencies of different classes are slightly biased against the left-out point, because that point is missing.<br /><br />if you split a set of numbers into two halves, one half is likely to be higher and the other lower than the average. similarly, if you split a set of points from an isotropic bivariate gaussian (i.e. no correlation), one half is likely to have a higher sample correlation than the correlation for the entire sample and the other half a lower one. so splitting creates a negative dependency between the samples.<br /><br />however, i'm puzzled by this post, because you seem to be suggesting that even without any exploration or fitting, the sample correlation is a biased estimate of the correlation expected for another sample of the same size from the same population. <br /><br />correlation can be understood as fitting a line. if the line fitted to the sample were used to predict out of sample, it would do worse in terms of the sum of squared errors. 
however, if we construe correlation simply as a measure of association, and use the correlation in one sample as an estimate of the correlation in another sample of the same size, it should be a good estimate.<br /><br />imagine i take two samples from a population, measure two variables (height and weight, say). now i choose one of the samples at random, give it to you and ask you to give me your best estimate of the correlation in the other sample. what should you do?<br /><br />the samples are exchangeable, so the correlation in your sample is not expected to be either higher or lower than the correlation in the other sample. this justifies its use as an estimate in my mind.<br /><br />am i missing something?<br /><br />of course, even under the null hypothesis of no correlation we will see deviations from 0 in the sample correlation. so the absolute value of any sample correlation is necessarily biased. a bayesian analysis aiming to estimate the expected or the maximum a posteriori population correlation will lead you to shrink toward zero. perhaps the methods you suggest serve a similar purpose? but i don't understand the full motivation for the crossvalidation approach here – e.g. as opposed to the bayesian approach. (i also don’t understand exactly how you compute the crossvalidated correlation. i couldn't see this quickly in the code link, perhaps you can explain briefly?)<br /><br />from a quick look, the cited paper seemed to use a prior hypothesis about the brain region and an anatomical ROI. so if we assume there was no exploration at all, then i don’t see what’s wrong here.<br /><br />i do expect the correlation estimate to decline in replications. but only because of the file drawer problem, i.e. 
similar attempts may have been made before, but when they didn’t yield significant results they were forgotten about – that is, because of the exploration process at the level of the whole field followed by selective publication.<br /><br />Niko Kriegeskorte<br /><br /><br /><br /><br />futureofscipubhttp://futureofscipub.wordpress.com/noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-53499299631080330322012-12-18T13:03:42.908-08:002012-12-18T13:03:42.908-08:00Thanks Yarik - what do you think the takeaway is...Thanks Yarik - what do you think the takeaway is? Perhaps that correlation is the wrong measure to use for assessing predictive accuracy of regression analyses? Russ Poldrackhttps://www.blogger.com/profile/03305657400941743430noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-71354918498673815592012-12-18T12:59:00.917-08:002012-12-18T12:59:00.917-08:00ok -- played with it more and now negative bias (w...ok -- played with it more and now negative bias (when you fit with<br />an intercept) makes perfect sense to me if you compute<br />correlation between full original time series and target one,<br />predicted in cross-validated fashion.<br /><br />Reason is actually quite simple: in every cross-validation fold, by<br />taking "testing" set out, mean of the training data changes from the<br />'grand mean' in the opposite direction to the mean of the taken out<br />(for testing) data [e.g. for split-half, grand mean was = 0, mean of<br />testing data 0.1, mean of training data becomes -0.1]<br /><br />By training you are fitting the line to the training data, which is<br />"offset" from the grand mean so it is likely to obtain a line offset<br />in the same direction [in our example it is likely to be a line BELOW<br />"grand" line, and having negative intercept]. 
And per "construction"<br />it would be in the opposite direction from the grand mean than the<br />left-out testing samples.<br /><br />For another split of the data, you are likely to get offset in the<br />opposite direction [e.g. in our example it would be testing gets -0.1,<br />while training 0.1], and result would be as before -- predicted data<br />has an opposing offset from grand mean than testing data.<br /><br />Therefore if you are computing correlation later on the whole series<br />of predicted/original values (not like I suggested -- mean of<br />estimates in each split) -- you are likely to obtain negative<br />correlation due to the tendency of predicted data being in opposite<br />direction from the original one merely due to difference in the means.<br /><br />Without intercept linear model loses this "flexibility" of choosing<br />"the other side", so it becomes less prominent (but imho it is still<br />present one way (mean) or another (variance)).<br /><br />Really crude example is here (disregard initial scatter plots --<br />absent shuffle for cross-validation somehow plays an interesting<br />role, and I had to disable shuffling for my example below)<br /><br />http://nbviewer.ipython.org/url/www.onerussian.com/tmp/regression_cv_demos_normal_noisy100_nofit_intercept.ipynb<br /><br />It remains to figure out the strong bimodal distribution of the<br />means. They are somewhat surprising to me since I haven't observed<br />them before when in a searchlight using cross-validation on<br />correlations [e.g. figure 5 in<br />http://onerussian.com/tmp/transfusion-20120411.pdf , disregard the<br />footnotes -- it wasn't submitted]. 
But probably that was because the data<br />was not actually random and did have a strong positive bias ;)yarikoptichttps://www.blogger.com/profile/06648933072290144986noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-86327345658130090802012-12-17T11:20:55.509-08:002012-12-17T11:20:55.509-08:00Agreed, there are many different senses of the ter...Agreed, there are many different senses of the term "predict". My main point here is that correlation (even if one variable precedes the other in time) does not imply predictive accuracy in a statistical sense. I think a lot of people in our field don't appreciate the "out-of-sample" problem (discussed very nicely, BTW, in Nate Silver's book).Russ Poldrackhttps://www.blogger.com/profile/03305657400941743430noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-52083997119353689002012-12-17T09:58:26.131-08:002012-12-17T09:58:26.131-08:00I think they were using "prediction" in ...I think they were using "prediction" in the temporal sense...<br />(i.e., the sampling occurred before the behavior). It would <br />be great though to explicitly specify what one means by "prediction"<br />(for instance, within or out of sample).Brian Knutsonhttps://www.blogger.com/profile/08897736583154567827noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-2069623510635661872012-12-17T06:44:14.829-08:002012-12-17T06:44:14.829-08:00Thanks Yarik - nbviewer seems a bit flaky. I'...Thanks Yarik - nbviewer seems a bit flaky. I've added the ipynb file to the git repo.Russ Poldrackhttps://www.blogger.com/profile/03305657400941743430noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-88513220415846613332012-12-17T06:11:35.277-08:002012-12-17T06:11:35.277-08:00nbviewer link seems to be 404
why not add the ....nbviewer link seems to be 404<br />why not add the .ipynb to the git repository?yarikoptichttps://www.blogger.com/profile/06648933072290144986noreply@blogger.comtag:blogger.com,1999:blog-8374109812192174281.post-66006894282741470052012-12-17T02:52:49.280-08:002012-12-17T02:52:49.280-08:00Really interesting post, which I think goes beyond...Really interesting post, which I think goes beyond problems with leave-one-out cross-validation and highlights the general pitfalls of relying exclusively on statistical methods to infer predictive power from a set of exploratory data. What is needed is replication with an independent sample. (As geneticists have been requiring for quite a while now). Kevin Mitchellhttps://www.blogger.com/profile/07172255754953214162noreply@blogger.com
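
A minimal simulation sketch of the effect discussed in this thread, assuming pure-noise data (x and y drawn as independent standard normals) and a one-predictor least-squares fit. The function names `loo_predictions` and `mean_loo_correlation`, the sample size, and the number of simulations are illustrative choices and are not taken from the post's code repository. It computes the correlation between leave-one-out predictions and the actual values, once with a fitted intercept and once with the intercept fixed at zero.

```python
import numpy as np


def loo_predictions(x, y, fit_intercept=True):
    """Leave-one-out predictions from a one-predictor least-squares fit."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # boolean mask that drops observation i
        xt, yt = x[train], y[train]
        if fit_intercept:
            xm, ym = xt.mean(), yt.mean()
            slope = ((xt - xm) @ (yt - ym)) / ((xt - xm) @ (xt - xm))
            intercept = ym - slope * xm
        else:
            slope = (xt @ yt) / (xt @ xt)  # regression through the origin
            intercept = 0.0
        preds[i] = intercept + slope * x[i]
    return preds


def mean_loo_correlation(n_obs=40, n_sims=500, fit_intercept=True, seed=0):
    """Average predicted-vs-actual correlation over simulated null datasets."""
    rng = np.random.default_rng(seed)
    rs = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.standard_normal(n_obs)
        y = rng.standard_normal(n_obs)  # independent of x: true correlation is zero
        preds = loo_predictions(x, y, fit_intercept)
        rs[s] = np.corrcoef(preds, y)[0, 1]
    return rs.mean()


if __name__ == "__main__":
    # Same seed for both calls, so both variants see identical datasets.
    print("with intercept:   ", mean_loo_correlation(fit_intercept=True))
    print("without intercept:", mean_loo_correlation(fit_intercept=False))
```

Under these assumptions the with-intercept mean should come out clearly negative even though the true correlation is zero, and the no-intercept mean should sit closer to zero. That is in line with the comments above that the bias largely, though perhaps not entirely, goes away when the intercept is fixed at zero.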