All the numbered exercises are from the course book (ESL).
Exercise set 1
Pen-and-paper
Coding
- Problem 2.8: For the classification-with-linear-regression approach: (i) fit a linear model using the class labels 2 and 3 as the response, and (ii) use the classification rule that assigns class 2 if the predicted response is less than 2.5, and class 3 otherwise. In R, you can use the function \(\texttt{lm()}\) for linear regression and the \(\texttt{knn()}\) function from the R package \(\texttt{class}\) for \(k\)-nearest neighbours.
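A minimal sketch of the two classifiers, assuming the 2s and 3s of the zipcode data have been read into data frames \(\texttt{train}\) and \(\texttt{test}\) whose first column \(\texttt{y}\) holds the digit label (these names are assumptions, not part of the exercise):

```r
## Sketch only: assumes data frames `train` and `test` whose first
## column `y` is the digit label (2 or 3) and the rest are pixel features.
fit  <- lm(y ~ ., data = train)                  # regression on the labels
yhat <- ifelse(predict(fit, test) < 2.5, 2, 3)   # the classification rule
mean(yhat != test$y)                             # test misclassification rate

library(class)                                   # k-NN for comparison
knn_hat <- knn(train[, -1], test[, -1], cl = train$y, k = 3)
mean(knn_hat != test$y)
```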
Exercise set 2
Pen-and-paper
Coding
- We are going to consider a data set consisting of 252 observations of an estimated percentage of body fat along with 13 continuous input variables (age, weight, height and 10 body circumference measurements). You can find the data in edu_bodyfat_both > edu_bodyfat > edu_bodyfat.csv in the file downloaded from this link. The aim is to predict the percentage of body fat (variable \(\texttt{pcfat}\)) from the input variables using a linear model with subset selection. More specifically, apply best-subset, forward and backward selection, plot the RSS for each method against the number of included predictors, and comment on the results. Is there any clear difference between the methods in terms of RSS? Which predictors appear to be the most important?
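One way to carry this out is with \(\texttt{regsubsets()}\) from the \(\texttt{leaps}\) package (a suggestion, not prescribed by the exercise); the sketch assumes the CSV has been read into a data frame \(\texttt{bodyfat}\):

```r
## Sketch only: best-subset, forward and backward selection with leaps,
## assuming the data frame `bodyfat` contains `pcfat` plus 13 inputs.
library(leaps)
bodyfat <- read.csv("edu_bodyfat.csv")   # adjust the path as needed
p <- ncol(bodyfat) - 1                   # number of input variables

rss <- sapply(c("exhaustive", "forward", "backward"), function(m)
  summary(regsubsets(pcfat ~ ., data = bodyfat, nvmax = p, method = m))$rss)

matplot(1:p, rss, type = "b", pch = 1:3,
        xlab = "Number of predictors", ylab = "RSS")
legend("topright", c("best subset", "forward", "backward"),
       pch = 1:3, col = 1:3, lty = 1:3)
```

The selected variables at each subset size can be inspected with \(\texttt{summary()}\) on each fit, which helps answer which predictors matter most.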
Exercise set 3
Pen-and-paper
Coding
- Problem 3.17: The data set can be downloaded from here (and here is some information about it). Treat the binary 0/1 spam indicator as a continuous outcome. Summarize the analysis by computing the training and test error for each method. An indicator variable for splitting the data into training and test sets can be downloaded here. Use e.g. cross-validation to select the value of any tuning parameter. In R, you can use the package \(\texttt{glmnet}\) for ridge/lasso regression and the package \(\texttt{pls}\) for PCR/PLS regression; both packages contain functions for cross-validation.
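A rough outline of the fits, assuming the predictors are in a matrix \(\texttt{x}\), the 0/1 spam indicator in \(\texttt{y}\), and the train/test split in a logical vector \(\texttt{is\_test}\) (all names are assumptions):

```r
## Sketch only: ridge/lasso via cv.glmnet, PCR via the pls package.
library(glmnet)
xtr <- x[!is_test, ]; ytr <- y[!is_test]
xte <- x[is_test, ];  yte <- y[is_test]
mse <- function(pred) mean((pred - yte)^2)

ridge <- cv.glmnet(xtr, ytr, alpha = 0)   # lambda chosen by 10-fold CV
lasso <- cv.glmnet(xtr, ytr, alpha = 1)
mse(predict(ridge, xte, s = "lambda.min"))
mse(predict(lasso, xte, s = "lambda.min"))

library(pls)
dtr <- data.frame(y = ytr); dtr$x <- xtr
dte <- data.frame(y = yte); dte$x <- xte
pcr_fit <- pcr(y ~ x, data = dtr, validation = "CV")
validationplot(pcr_fit)   # pick ncomp where the CV error levels off
mse(drop(predict(pcr_fit, newdata = dte, ncomp = 20)))  # 20 is illustrative
```

\(\texttt{plsr()}\) is used the same way for PLS regression.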
- Download the handwritten ZIP code data from the ESL repository (note that there are separate training and test sets). Use the \(\texttt{glmnet}\) package to fit a multinomial logistic regression model on the training set using the lasso (\(\alpha = 1\)), ridge regression (\(\alpha = 0\)) and the elastic net with \(\alpha = 0.5\). Use cross-validation to select the value of the penalty parameter \(\lambda\). Compare the prediction errors of the different methods on the test set and comment on the results. Is there any particular pair of digits that appears more difficult to distinguish from each other?
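The three penalized fits can be handled in one loop over \(\alpha\); the sketch assumes training features/labels in \(\texttt{xtr}\)/\(\texttt{ytr}\) and test data in \(\texttt{xte}\)/\(\texttt{yte}\) (hypothetical names):

```r
## Sketch only: multinomial fits for three values of alpha, with
## lambda chosen by cross-validation in each case.
library(glmnet)
fit_one <- function(a) {
  cv  <- cv.glmnet(xtr, ytr, family = "multinomial", alpha = a)
  hat <- drop(predict(cv, xte, s = "lambda.min", type = "class"))
  list(err = mean(hat != yte),
       confusion = table(truth = yte, predicted = hat))
}
res <- lapply(c(lasso = 1, enet = 0.5, ridge = 0), fit_one)
sapply(res, `[[`, "err")   # test error for each method
res$lasso$confusion        # off-diagonal cells reveal confused digit pairs
```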
Exercise set 4
Pen-and-paper
Coding
- Problem 7.9: Consider only AIC and BIC for now. In R, you can use the package \(\texttt{leaps}\) for best-subset selection and for computing BIC and the \(C_p\) statistic (which is equivalent to AIC in the setting considered here).
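Assuming the problem is worked on the book's prostate data (response \(\texttt{lpsa}\)) loaded into a data frame \(\texttt{prostate}\), the \(\texttt{leaps}\) part might look like:

```r
## Sketch only: best-subset selection, then Cp (~ AIC) and BIC per size.
library(leaps)
fit <- regsubsets(lpsa ~ ., data = prostate, nvmax = 8)
s   <- summary(fit)
s$cp                                              # Cp for each subset size
s$bic                                             # BIC for each subset size
c(cp = which.min(s$cp), bic = which.min(s$bic))   # selected model sizes
```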