Response to “Perils of LOO crossvalidation”

In a blog post from last year, Russ Poldrack draws attention to a negative correlation phenomenon when using leave-one-out (LOO) cross-validation. Suppose we fit a linear regression and compute the LOO predictions $\hat y_{-i}$ at each point $x_i$, for $i=1,\ldots,N$. Poldrack shows that if you plot $y_i$ versus $\hat y_{-i}$, you see negative correlation.

I must confess I was surprised by the negative correlation phenomenon.
However, after closer examination by me and my Stanford Statistics graduate student Will Fithian, it’s not as bad as it looks. We give some observations below. Our presentation here refers to Poldrack’s blog post.

The overall summary is that it’s all to do with the mean, and hence the intercept in the regression. Leave out a large $y_i$, and the intercept for the LOO fit moves in the opposite direction. Vice versa for a small $y_i$. The most dramatic case is that of a 0-variable regression (i.e. the fit is the mean of the $y_i$), where the correlation is -1! However, in all cases this correlation does not bias the LOO estimate of error.
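The 0-variable case is easy to check numerically. Here is a minimal sketch in Python with NumPy; the seed and the sample size $N=20$ are arbitrary choices of ours for illustration, not values taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (our assumption)
N = 20                          # arbitrary sample size (our assumption)
y = rng.standard_normal(N)

# LOO prediction for the intercept-only fit: the mean of the other N-1 values.
y_loo = (y.sum() - y) / (N - 1)

# Sample correlation between each y_i and its own LOO prediction.
corr = np.corrcoef(y, y_loo)[0, 1]

# LOO estimate of prediction error for the sample mean.
err = np.mean((y - y_loo) ** 2)

print("sample correlation:", corr)
print("LOO error estimate:", err)
```

The correlation should come out at -1 up to floating-point rounding, whatever the seed, while the error estimate remains a perfectly ordinary number.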

  1. LOO is intended to estimate prediction error. The estimate is $$err=\frac1N \sum_{i=1}^N (y_i  -\hat{y}_{-i})^2.$$ Here $\hat{y}_{-i}$ refers to the LOO prediction for observation $i$, using all the data except observation $i$. If the data are i.i.d., then $y_i$ really is independent of $\hat{y}_{-i}$, a function of the other observations, so each term in the sum is nearly unbiased for out-of-sample prediction error; hence so is the average. The correlations between these $N$ terms are not really relevant, although they may affect the variance of the average. The accompanying figure (right plot) shows the average value of $err$ over $B=500$ simulations from null models. Each symbol refers to a particular sample size $N$ and number of variables in the linear regression. The response $y$ and the predictor variables are all i.i.d. standard Gaussian.

    The lines in these plots are the true expected prediction errors for this scenario, so we see that LOO is doing a good job.

  2. The left plot shows the correlations from these simulations - that is, the sample correlation between the $N$ pairs $(y_i,\hat{y}_{-i})$. As in Russ Poldrack’s blog article, there is negative correlation here. Notice that the negative correlation goes away as the number of variables increases, and as $N$ increases.
    This may appear highly counterintuitive, since we know that as random variables each $y_i$ is independent of $\hat y_{-i}$. However, the negative sample correlations arise because $y_i$ is not independent of $\hat y_{-j}$ for $i\neq j$.
  3. We have included the 0-variable (intercept-only) case. This is the LOO estimate for the sample mean’s prediction error.  Here the sample correlation of $y_i$ with $\hat y_{-i}$ is always exactly -1!
    The behavior is simple to explain in this case.  It is easy to show that the average of the $\hat y_{-i}$ is just $\bar y=\frac1N\sum_{i=1}^N y_i$, and ${\hat y_{-i}-\bar{y} = -\frac{1}{N-1}(y_i-\bar{y})}$.  Here $y_i$ is independent of $\hat y_{-i}$, but nevertheless $y_i-\bar y$ and $\hat y_{-i}-\bar y$ are perfectly dependent. In effect, a big $y_i$ pushes up all the other $\hat y_{-j}$ but leaves $\hat y_{-i}$ unchanged.
  4.  We have done some additional analysis of the phenomenon, and can explain the behavior in the left-hand plot. Looking at the formula for the correlation coefficient, we might try to compute its expectation under the null model (our Gaussian variables situation). This is hard to do (ratios of nasty terms), so instead we look at the numerator and denominator terms separately. It turns out that the numerator has expectation $-\frac1N$. The first term in the denominator is the square root of the sample variance of the $y_i$, which has expectation 1. The second term is the square root of the sample variance of the set of values $\hat{y}_{-i}$. These expressions don’t simplify exactly except in the case of zero variables. But as variances of regression fits, they increase as the number of predictors increases, and hence cause the correlation to approach zero.
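For readers who want to try this themselves, the null-model simulations described above can be sketched roughly as follows. This is not the code behind the figure: the function names `loo_predictions` and `simulate`, the seed, and the particular values of $N$ and the number of variables are our own illustrative choices. The sketch relies on the standard least-squares identity $y_i-\hat y_{-i} = e_i/(1-h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ the leverage, so no model needs to be refit $N$ times.

```python
import numpy as np

def loo_predictions(X, y):
    """LOO predictions for least squares with an intercept.

    Uses y_i - yhat_{-i} = e_i / (1 - h_ii), with e_i the ordinary
    residual and h_ii the i-th diagonal entry of the hat matrix.
    """
    N = len(y)
    A = np.column_stack([np.ones(N), X])  # design matrix with intercept
    Q, _ = np.linalg.qr(A)                # thin QR; hat matrix is Q Q^T
    h = np.sum(Q ** 2, axis=1)            # leverages h_ii
    resid = y - Q @ (Q.T @ y)             # ordinary residuals e_i
    return y - resid / (1.0 - h)

def simulate(N, p, B=500, seed=0):
    """Average LOO error and average sample correlation over B null datasets."""
    rng = np.random.default_rng(seed)
    errs = np.empty(B)
    cors = np.empty(B)
    for b in range(B):
        X = rng.standard_normal((N, p))   # predictors: i.i.d. standard Gaussian
        y = rng.standard_normal(N)        # response: independent of X
        y_loo = loo_predictions(X, y)
        errs[b] = np.mean((y - y_loo) ** 2)
        cors[b] = np.corrcoef(y, y_loo)[0, 1]
    return errs.mean(), cors.mean()

if __name__ == "__main__":
    for p in (0, 1, 5):                   # illustrative numbers of variables
        err, cor = simulate(N=20, p=p)
        print(f"p={p}: mean err={err:.3f}, mean corr={cor:.3f}")
```

Setting `p=0` reproduces the intercept-only case, where the correlation is -1 in every simulated dataset.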

Trevor Hastie and Will Fithian
Stanford Statistics July 22 2013