Exercise 4
2024-02-05
Cross-validation
(a)-(b) Simulate data and scatter plot
# simulate data
set.seed(1)
y <- rnorm(100)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
# scatter plot
plot(x, y)
We simulate from the following model:
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i, \quad \epsilon_i \sim N(0, 1), \\ \beta_0 = 0, \quad \beta_1 = 1, \quad \beta_2 = -2. \]
(c)-(d) Run LOOCV
Here, I have written a function that performs k-fold CV for the linear regression case. It operates on a data frame and uses the stats::lm functionality:
# function that computes the CV-error of LS-estimator:
# - k: (integer) number of folds
# - df: (data.frame) training data, includes both outcome and predictors
# - yvar: (character) name of outcome variable in df
# - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~."
cv_using_lm <- function(k, df, yvar, formula = as.formula(paste0(yvar, "~."))) {
# number of observations
N <- nrow(df)
# randomly assign the observations to k folds
fold <- sample(rep(seq_len(k), length.out = N))
#table(fold)
# init vector for storing errors
err <- numeric(k)
for (kk in seq_len(k)) {
# indicator for the test-observations in current fold
inFold <- fold == kk
# fit model to current train-set
fit <- lm(formula, data = df[!inFold, ])
# predict current test-set outcomes
yhat <- predict(fit, newdata = df[inFold, ])
# compute the test MSE for this fold
y <- df[[yvar]][inFold]
err[kk] <- mean((y - yhat)^2)
}
return(err)
}
# for each model, store the outcome and predictors in a data frame
dfs <- list(data.frame(y = y, x))
dfs[[2]] <- cbind(dfs[[1]], x**2)
dfs[[3]] <- cbind(dfs[[2]], x**3)
dfs[[4]] <- cbind(dfs[[3]], x**4)
# estimate prediction error with N-fold CV
n <- length(y)
set.seed(007)
sapply(dfs, function(df) mean(cv_using_lm(n, df, "y")))
# repeat with a different seed; LOOCV does not depend on the random fold assignment
set.seed(007+1)
sapply(dfs, function(df) mean(cv_using_lm(n, df, "y")))
## [1] 5.890979 1.086596 1.102585 1.114772
## [1] 5.890979 1.086596 1.102585 1.114772
There is only one way to split the data into N folds of size one, so LOOCV involves no randomness and the two error estimates are identical.
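As an optional check (not part of the original solution), the LOOCV error of a least-squares fit can be computed from a single fit via the leverage values, without any refitting; this should reproduce the estimates above:
# optional check: LOOCV for least squares via the leverage shortcut,
# CV = mean( ((y_i - yhat_i) / (1 - h_i))^2 )
sapply(dfs, function(df) {
fit <- lm(y ~ ., data = df)
mean((residuals(fit) / (1 - hatvalues(fit)))^2)
})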
(e) Which model has the lowest error?
The model with the second-order polynomial has the lowest estimated error. Since we simulated the data, we know that this is the true model, and we therefore expect it to fit the held-out observations well in each fold. The training error, on the other hand, always decreases when additional regressors are included.
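To illustrate the last point, here is a quick check (not part of the original exercise) showing that the training MSE keeps decreasing as polynomial terms are added:
# training MSE of each of the four models; it can only decrease as regressors are added
sapply(dfs, function(df) mean(residuals(lm(y ~ ., data = df))^2))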
(f) Compare significance of coefficient estimates
# compute p-vals for the coefficients and store them in a matrix
p_vals <- matrix(nrow = 5, ncol = length(dfs))
for (i in seq_along(dfs)) {
fit <- lm(y~., data = dfs[[i]])
coeffs <- coefficients(fit)
sds <- sqrt(diag(vcov(fit)))
t_vals <- coeffs/sds
p_vals[seq_len(i+1), i] <- 2*(1-pt(abs(t_vals), n-length(coeffs)))
}
rownames(p_vals) <- names(coeffs)
colnames(p_vals) <- paste0("Model", seq_along(dfs))
round(p_vals, 3)
## Model1 Model2 Model3 Model4
## (Intercept) 0.000 0.476 0.465 0.386
## x 0.329 0.000 0.000 0.000
## `x^2` NA 0.000 0.000 0.000
## `x^3` NA NA 0.769 0.948
## `x^4` NA NA NA 0.637
We see that the coefficients on the first- and second-order terms are significant in every model except the first. The intercept is significant only in this first model. The higher-order terms are not significant. From the scatter plot in (b) we see a clear non-linear relationship between y and x; no straight line fits these data well. Therefore, in model 1 we wrongly conclude that the intercept differs from zero while the slope does not.
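As a supplementary check (not in the original solution), an F-test comparing the nested quadratic and quartic models should confirm that the third- and fourth-order terms add no significant explanatory power:
# F-test for the joint significance of the cubic and quartic terms
anova(lm(y ~ ., data = dfs[[2]]), lm(y ~ ., data = dfs[[4]]))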
Ex 7.9 (Hastie 2009, p. 259)
The exercise reads: "Ex. 7.9 For the prostate data of Chapter 3, carry out a best-subset linear regression analysis, as in Table 3.3 (third column from left). Compute the AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates of prediction error. Discuss the results."
Data can be downloaded from the book website, https://hastie.su.domains/ElemStatLearn/:

- columns 1-8 are predictors
- column 9 is the outcome (lpsa)
- column 10 divides the data into a test and a train set
Import data
# I let the -here- package help me with relative paths
# this command defines the root-directory
here::i_am("exercises/exercises_4_sol.Rmd")
# read data from local file
data <- read.csv(here::here("./data/prostate.txt"),
sep = "\t",
header = TRUE,
stringsAsFactors = TRUE)
data[1] <- NULL # remove first column (row-numbers)
head(data)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
As described in ch. 3.2.1, the predictors are scaled to a variance of 1 and the data are split into training and test sets.
# construct data.frame including scaled predictors and the outcome
df <- data.frame(lapply(data[, 1:8], function(x) (x-mean(x))/sd(x)),
lpsa = data$lpsa)
sapply(df, var) # variance of each scaled variable
# split the data into training and test sets using the train indicator
dfs <- list(train = df[data$train == TRUE, ],
test = df[data$train == FALSE, ])
## lcavol lweight age lbph svi lcp gleason pgg45
## 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
## lpsa
## 1.332476
Now we can reproduce the LS estimates in Table 3.3:
# fit linear regression model
fit <- lm(lpsa ~ ., data = dfs$train)
summary(fit)
##
## Call:
## lm(formula = lpsa ~ ., data = dfs$train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64870 -0.34147 -0.05424 0.44941 1.48675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.46493 0.08931 27.598 < 2e-16 ***
## lcavol 0.67953 0.12663 5.366 1.47e-06 ***
## lweight 0.26305 0.09563 2.751 0.00792 **
## age -0.14146 0.10134 -1.396 0.16806
## lbph 0.21015 0.10222 2.056 0.04431 *
## svi 0.30520 0.12360 2.469 0.01651 *
## lcp -0.28849 0.15453 -1.867 0.06697 .
## gleason -0.02131 0.14525 -0.147 0.88389
## pgg45 0.26696 0.15361 1.738 0.08755 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7123 on 58 degrees of freedom
## Multiple R-squared: 0.6944, Adjusted R-squared: 0.6522
## F-statistic: 16.47 on 8 and 58 DF, p-value: 2.042e-12
Best subset selection
We want to find the best subset of predictors to include in a linear regression model, using different criteria (AIC, BIC, CV, .632-bootstrap).
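For reference, the AIC and BIC forms used in the code below are (in the spirit of ch. 7 of the book; note that this AIC is on a per-observation scale while the BIC is on the sum scale, so the two are not comparable in magnitude):
\[ \mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}, \qquad \mathrm{BIC} = -2\,\mathrm{loglik} + d \log N, \]
where \(d\) is the number of estimated parameters (regression coefficients including the intercept, plus the residual variance) and \(N\) is the number of training observations.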
We can re-use the CV function we defined in problem 1 to apply 5- and 10-fold CV. To apply the .632 bootstrap estimator, we need a similar function, for example:
# function that computes the .632 bootstrap error of the LS-estimator:
# - B: (integer) number of bootstrap samples
# - df: (data.frame) training data, includes both outcome and predictors
# - yvar: (character) name of outcome variable in df
# - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~."
bootstrap_632_err <- function(B, df, yvar, formula = as.formula(paste0(yvar, "~."))) {
N <- nrow(df)
# compute bootstrap error
bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)
for (b in 1:B) {
set.seed(b) # set seed, useful for data replication
index <- sample(N, replace = TRUE)
# the selected observations form the training set
temp.train.data <- df[index, ]
# the remaining (out-of-bag) observations form the test set
temp.test.data <- df[-index, ]
# fit model on temporary train set
fit <- lm(formula, data = temp.train.data)
# compute the error
computeError <- function(model, newdata)
mean((newdata[ , yvar] - predict(model, newdata = newdata))^2)
# save the error in the matrix initialized above
bootstrap.err.b[b] <- computeError(fit, temp.test.data)
#if(b%%100 == 0) print(b)
}
# compute training error:
fit <- lm(formula, data = df)
err.train <- computeError(fit, df)
# return 0.632 bootstrap error
return(0.368*err.train + 0.632*mean(bootstrap.err.b))
}
# test function on full training data set
bootstrap_632_err(1000, dfs$train, "lpsa")
## [1] 0.5910481
Now, we can loop through every possible subset of the predictors and estimate the prediction error with the different methods. Note: there are probably more clever ways to do this (see the leaps-based sketch after the code block), but this was the first that came to mind during the unexpected live-coding session in class.
set.seed(007)
p <- 8
N <- nrow(dfs$train)
err <- list()
# for each subset of size m ...
for (m in seq_len(p)) {
# list all subset of size m of the p variables
all_comb <- combn(p, m)
# initialize a matrix for storing the errors
tmp <- matrix(nrow = 5, ncol = ncol(all_comb))
rownames(tmp) <- c("aic", "bic", "cv5", "cv10", "bootstrap632")
err[[m]] <- tmp
# for each size m subset...
for (i in seq_len(ncol(all_comb))){
# construct a data.frame containing the outcome and the current predictor variables
vars <- all_comb[, i]
temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])
# fit the model
fit <- lm(lpsa~., data = temp_df)
d <- m + 2 # number of estimated params: number of coefficients + intercept + residual variance
# evaluate log-lik
logLik <- logLik(fit)
# aic
r <- 1
err[[m]][r, i] <- 2*d/N - 2/N*logLik
# bic
r <- r+1
err[[m]][r, i] <- d*log(N) - 2*logLik
# cv 5-fold
r <- r+1
err[[m]][r, i] <- mean(cv_using_lm(5, temp_df, "lpsa"))
# cv 10-fold
r <- r+1
err[[m]][r, i] <- mean(cv_using_lm(10, temp_df, "lpsa"))
# bootstrap .632
# (note that I set B = 100 to reduce computation time)
r <- r+1
err[[m]][r, i] <- bootstrap_632_err(B = 100, df = temp_df, yvar = "lpsa")
}
}
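As an alternative sketch (assuming the leaps package is installed; it is not used in the original solution), regsubsets() performs the exhaustive subset search directly. For a fixed subset size, the RSS ordering coincides with the AIC/BIC ordering, so the BIC-best subset can be read off its summary; the CV and bootstrap errors would still have to be computed per subset as above.
# sketch using leaps::regsubsets (assumes the 'leaps' package is available)
library(leaps)
best_sub <- regsubsets(lpsa ~ ., data = dfs$train, nvmax = 8)
best_sum <- summary(best_sub)
# predictors included in the subset with the lowest BIC
best_sum$which[which.min(best_sum$bic), ]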
Now, we want to find which subset is the best according to the different criteria. Here, I do so by collecting the estimates in a data frame in order to use the dplyr package to find the minimum over all subsets.
library(dplyr)
df <- setNames(reshape2::melt(err), c("crit", "comb", "value", "size")) %>%
select(crit, size, comb, value)
head(df)
## crit size comb value
## 1 aic 1 1 2.4893157
## 2 bic 1 1 175.3782322
## 3 cv5 1 1 0.7372446
## 4 cv10 1 1 0.7277851
## 5 bootstrap632 1 1 0.7073414
## 6 aic 1 2 2.9920134
Then, I can group the data by criterion and find the minimum error.
res <- df %>%
group_by(crit) %>%
arrange(crit) %>%
filter(value == min(value))
head(res)
## # A tibble: 5 × 4
## # Groups: crit [5]
## crit size comb value
## <fct> <int> <int> <dbl>
## 1 aic 7 2 2.20
## 2 bic 2 1 167.
## 3 cv5 6 21 0.575
## 4 cv10 7 2 0.628
## 5 bootstrap632 5 18 0.587
We see that different subsets (identified by the size and comb variables) attain the lowest error depending on the criterion used. Finally, we can re-fit each of the "best subset" models chosen by the criteria on the full training data set.
# fit linear regresion full model
fit <- lm(lpsa ~., dfs$train)
beta <- coefficients(fit)
# init matrix for storing the coefficient estimates of each subset
coeffs <- matrix(nrow = length(beta), ncol = nrow(res) +1)
coeffs[, 1] <- beta
colnames(coeffs) <- c("LS", levels(res$crit)[res$crit])
rownames(coeffs) <- names(beta)
# fit models to each of the chosen subsets
for (i in 1:nrow(res)) {
m <- res$size[i]
comb <- res$comb[i]
# find the variables included in the subset
vars <- combn(p, m)[, comb]
# construct a data frame with this subset of variables
temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])
# fit the model and collect coefficient estimates
fit <- lm(lpsa~., data = temp_df)
coeffs[c(1, vars+1), i+1] <- coefficients(fit)
}
#
coeffs
## LS aic bic cv5 cv10 bootstrap632
## (Intercept) 2.46493292 2.4668675 2.4773573 2.4464512 2.4668675 2.4478730
## lcavol 0.67952814 0.6764486 0.7397137 0.7238923 0.6764486 0.6700038
## lweight 0.26305307 0.2652760 0.3163282 NA 0.2652760 0.3173287
## age -0.14146483 -0.1450300 NA NA -0.1450300 NA
## lbph 0.21014656 0.2095349 NA 0.3025434 0.2095349 NA
## svi 0.30520060 0.3070936 NA 0.3419951 0.3070936 0.2688665
## lcp -0.28849277 -0.2872242 NA -0.2768828 -0.2872242 -0.2926930
## gleason -0.02130504 NA NA -0.1109199 NA NA
## pgg45 0.26695576 0.2522850 NA 0.2703702 0.2522850 0.2277896
Extra: *Compare estimates of the prediction error*
I first misread the exercise and ignored the "best-subset" part (sorry). Whereas above we used the prediction-error estimates for model selection (to find the best predictor subset), we could also be interested in estimating the prediction error of a fixed model. In the following, I apply the different procedures to estimate the prediction error of the full regression model and compare these estimates to the test error obtained on the given test set.
Test and train:
# initialize list for storing the error estimates
err <- list()
# re-fit the full model on the training data
# (after the subset loop above, `fit` holds the last subset model, not the full model)
fit <- lm(lpsa ~ ., data = dfs$train)
# compute error in training set
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
err$train <- mean((y-yhat)**2)
# compute error in test set
y <- dfs$test$lpsa
yhat <- predict(fit, newdata = dfs$test)
err$test <- mean((y-yhat)**2)
AIC and BIC:
We estimate the AIC and BIC by computing the log-likelihood of the fitted model:
N <- nrow(dfs$train) # number of training samples
d <- 8 +2 # number of estimated params (incl. intercept and res variance)
# compute log-likelihood of fitted model
logL <- as.numeric(logLik(fit))
cat("logLik:", logL)
# compare with "by-hand"-computation
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
s2 <- mean((y-yhat)**2) # MLE of the res. variance (biased estimator)
cat("by-hand:", sum(log(dnorm(y, mean = yhat, sd = sqrt(s2)))))
err$aic <- 2*d/N - 2/N*logL
err$bic <- d*log(N) - 2*logL
## logLik: -70.54079
## by-hand: -70.54079
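As a side note (not in the original), the by-hand sum above equals the closed form of the Gaussian log-likelihood evaluated at the MLE,
\[ \mathrm{loglik} = -\frac{N}{2}\left(\log(2\pi\hat\sigma^2) + 1\right), \]
which can be checked directly with the quantities already computed:
# closed-form log-likelihood at the MLE (uses N and s2 from above)
-N/2 * (log(2*pi*s2) + 1)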
Cross-validation:
Use the function from problem 1 to apply CV:
yvar <- "lpsa"
err$cv5 <- mean(cv_using_lm(5, dfs$train, yvar))
err$cv10 <- mean(cv_using_lm(10, dfs$train, yvar))
Bootstrap .632:
The .632 bootstrap estimator estimates the prediction error as a weighted average of the training error and the leave-one-out bootstrap estimator. Here, I have re-used the code in r_code_lecture_6.r to compute the bootstrap error:
B <- 1000
bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)
for (b in 1:B) {
set.seed(b) # set seed, useful for data replication
index <- sample(N, replace = TRUE)
# the selected observations form the training set
temp.train.data <- dfs$train[index, ]
# the remaining (out-of-bag) observations form the test set
temp.test.data <- dfs$train[-index, ]
fit <- lm(lpsa ~ ., data = temp.train.data)
# compute the error
computeError <- function(model, newdata)
mean((newdata[ , 'lpsa'] - predict(model, newdata = newdata))^2)
# save the error in the matrix initialized above
bootstrap.err.b[b] <- computeError(fit, temp.test.data)
if(b%%100 == 0) print(b)
}
# compute the average error
err$bootstrap <- mean(bootstrap.err.b)
# 0.632 bootstrap
err$bootstrap_632 <- 0.368*err$train + 0.632*err$bootstrap
## [1] 100
## [1] 200
## [1] 300
## [1] 400
## [1] 500
## [1] 600
## [1] 700
## [1] 800
## [1] 900
## [1] 1000
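For reference, the last line of the code block above implements the .632 weighting described earlier, restated here with \(\overline{\mathrm{err}}\) the training error and \(\widehat{\mathrm{Err}}^{(1)}\) the leave-one-out bootstrap estimate:
\[ \widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}. \]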
Compare the error estimates:
err
## $train
## [1] 0.4808587
##
## $test
## [1] 0.5098823
##
## $aic
## [1] 2.404203
##
## $bic
## [1] 183.1285
##
## $cv5
## [1] 0.6002474
##
## $cv10
## [1] 0.6697008
##
## $bootstrap
## [1] 0.6794661
##
## $bootstrap_632
## [1] 0.6063786
The different error estimates are quite different, partly because they target different quantities. Normally, we are interested in the test error conditional on the particular training data we have (\(Err_\mathcal{T}\) in the notation of Hastie et al.). Cross-validation and the bootstrap typically estimate something closer to the unconditional error, the expected test error \(Err = \mathrm{E}[Err_\mathcal{T}]\); see ch. 7.12 of the ESL book for a discussion. The AIC targets yet another quantity, the in-sample error (\(Err_{in}\)), while the BIC criterion has a Bayesian motivation: finding the most likely model given the data. Moreover, as computed here the AIC is on a per-observation scale and the BIC on the sum scale, so their magnitudes are not directly comparable to the MSE-based estimates. In other words, a direct comparison of these numbers is difficult.