9 Predictive modelling and machine learning

In predictive modelling, we fit statistical models that use historical data to make predictions about future (or unknown) outcomes. This practice is a cornerstone of modern statistics, and includes methods ranging from classical parametric linear regression to black-box machine learning models.

After reading this chapter, you will be able to use R to:

Fit predictive models for regression and classification,
Evaluate predictive models,
Use cross-validation and the bootstrap for out-of-sample evaluations,
Handle imbalanced classes in classification problems,
Fit regularised (and possibly also generalised) linear models, e.g. using the lasso,
Fit a number of machine learning models, including kNN, decision trees, random forests, and boosted trees,
Make forecasts based on time series data.

9.1 Evaluating predictive models

In many ways, modern predictive modelling differs from the more traditional inference problems that we studied in the previous chapter. The goal of predictive modelling is (usually) not to test whether some variable affects another or to study causal relationships. Instead, our only goal is to make good predictions. It is little surprise then that the tools we use to evaluate predictive models differ from those used to evaluate models used for other purposes, like hypothesis testing. In this section, we will have a look at how to evaluate predictive models.

The terminology used in predictive modelling differs a little from that used in traditional statistics. For instance, explanatory variables are often called features or predictors, and predictive modelling is often referred to as supervised learning. We will stick with the terms used in Section 7, to keep the terminology consistent within the book.

Predictive models can be divided into two categories:

Regression, where we want to make predictions for a numeric variable,
Classification, where we want to make predictions for a categorical variable.

There are many similarities between these two, but we need to use different measures when evaluating their predictive performance. Let’s start with models for numeric predictions, i.e. regression models.

9.1.1 Evaluating regression models

Let’s return to the mtcars data that we studied in Section 8.1. There, we fitted a linear model to explain the fuel consumption of cars:

m <- lm(mpg ~ ., data = mtcars)

(Recall that the formula mpg ~ . means that all variables in the dataset, except mpg, are used as explanatory variables in the model.)
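To see what the dot does, we can spell out the formula by hand and check that both versions give the same fit. This is a small sketch; the object names m_dot and m_explicit are our own, and the explicit formula simply lists the ten remaining columns of mtcars in the order they appear in the data.

```r
# Model using the dot shorthand: all columns except mpg are used
m_dot <- lm(mpg ~ ., data = mtcars)

# The same model with every explanatory variable written out
m_explicit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec +
                   vs + am + gear + carb, data = mtcars)

# The estimated coefficients should agree
all.equal(coef(m_dot), coef(m_explicit))
```

The shorthand is convenient, but it also means that any column added to the data frame later will silently end up in the model.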

A number of measures of how well the model fits the data have been proposed. Without going into details (it will soon be apparent why), we can mention examples like the coefficient of determination \(R^2\), and information criteria like \(AIC\) and \(BIC\). All of these are straightforward to compute for our model:

summary(m)$r.squared     # R^2
summary(m)$adj.r.squared # Adjusted R^2
AIC(m)                   # AIC
BIC(m)                   # BIC

\(R^2\) is a popular tool for assessing model fit, with values close to 1 indicating a good fit and values close to 0 indicating a poor fit (i.e. that most of the variation in the data isn’t accounted for).
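For a linear model with an intercept, \(R^2\) can be written as one minus the ratio between the residual sum of squares and the total sum of squares. The following sketch computes it by hand for the mtcars model and compares it to the value reported by summary; the names rss, tss and r2_manual are our own.

```r
m <- lm(mpg ~ ., data = mtcars)

# Residual sum of squares: variation not accounted for by the model
rss <- sum(residuals(m)^2)

# Total sum of squares: variation of mpg around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R^2 is the proportion of the variation accounted for by the model
r2_manual <- 1 - rss / tss

# Compare with the value computed by summary()
all.equal(r2_manual, summary(m)$r.squared)
```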

It is nice if our model fits the data well, but what really matters in predictive modelling is how close the predictions from the model are to the truth. We therefore need ways to measure the distance between predicted values and observed values - ways to measure the size of the average prediction error. A common measure is the root-mean-square error (RMSE). Given \(n\) observations \(y_1,y_2,\ldots,y_n\) for which our model makes the predictions \(\hat{y}_1,\ldots,\hat{y}_n\), this is defined as \[RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}},\] that is, as the name implies, the square root of the mean of the squared errors \((\hat{y}_i-y_i)^2\).

Another common measure is the mean absolute error (MAE):

\[MAE = \frac{\sum_{i=1}^n|\hat{y}_i-y_i|}{n}.\]

Let’s compare the predicted values \(\hat{y}_i\) to the observed values \(y_i\) for our mtcars model m:

rmse <- sqrt(mean((predict(m) - mtcars$mpg)^2))
mae <- mean(abs(predict(m) - mtcars$mpg))
rmse; mae
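To get a feeling for how the two measures differ, it can help to compute them by hand for a tiny made-up example. The three observed values and predictions below are invented purely for illustration, and the object names are our own.

```r
# Invented toy data: three observations and three predictions
y_obs  <- c(1, 4, 8)
y_pred <- c(2, 4, 6)

errors <- y_pred - y_obs          # 1, 0, -2

# RMSE: square the errors, average them, take the square root
rmse_toy <- sqrt(mean(errors^2))  # sqrt((1 + 0 + 4)/3), about 1.29

# MAE: average the absolute errors
mae_toy <- mean(abs(errors))      # (1 + 0 + 2)/3 = 1

rmse_toy; mae_toy
```

Because the errors are squared before they are averaged, the single large error weighs more heavily in the RMSE than in the MAE. The RMSE is never smaller than the MAE, and the two coincide only when all errors have the same absolute size.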

There is a problem with this computation, and it is a big one. What we just computed was the difference between predicted values and observed values for the sample that was used to fit the model. This doesn’t necessarily tell us anything about how well the model will fare when used to make predictions about new observations. It is, for instance, entirely possible that our model has overfitted to the sample, and essentially has learned the examples therein by heart, ignoring the general patterns that we were trying to model. This would lead to a small \(RMSE\) and \(MAE\), and a high \(R^2\), but would render the model useless for predictive purposes.
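One way to see how misleading in-sample measures can be is to add explanatory variables that we know are pure noise. The sketch below appends 15 columns of random numbers to mtcars (the seed, the number of columns and the object names are arbitrary choices of ours) and refits the model. Since the smaller model is nested within the larger one, the in-sample \(R^2\) cannot decrease and the in-sample \(RMSE\) cannot increase, even though the new variables carry no information about mpg.

```r
set.seed(1)  # Arbitrary seed, for reproducibility

# 15 columns of random noise, unrelated to fuel consumption
noise <- as.data.frame(matrix(rnorm(nrow(mtcars) * 15), ncol = 15))
names(noise) <- paste0("noise", 1:15)
mtcars_noise <- cbind(mtcars, noise)

m_base  <- lm(mpg ~ ., data = mtcars)        # Original model
m_noise <- lm(mpg ~ ., data = mtcars_noise)  # Model with noise added

# In-sample R^2 for the two models
r2_base  <- summary(m_base)$r.squared
r2_noise <- summary(m_noise)$r.squared
r2_base; r2_noise

# In-sample RMSE for the two models
rmse_base  <- sqrt(mean(residuals(m_base)^2))
rmse_noise <- sqrt(mean(residuals(m_noise)^2))
rmse_base; rmse_noise
```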

All the computations that we’ve just done - \(R^2\), \(AIC\), \(BIC\), \(RMSE\) and \(MAE\) - were examples of in-sample evaluations of our model. There are a number of problems associated with in-sample evaluations, all of which have been known for a long time - see e.g. Picard & Cook (1984). In general, they tend to be overly optimistic and overestimate how well the model will perform for new data. It is about time that we got rid of them for good.

A fundamental principle of predictive modelling is that the model should chiefly be judged on how well it makes predictions for new data. To evaluate its performance, we therefore need to carry out some form of out-of-sample evaluation, i.e. to use the model to make predictions for new data (that weren’t used to fit the model). We can then compare those predictions to the actual observed values for those data, and e.g. compute the \(RMSE\) or \(MAE\) to measure the size of the average prediction error. Out-of-sample evaluations, when done right, are less overoptimistic than in-sample evaluations, and are also better in the sense that they actually measure the right thing.
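As a minimal sketch of what an out-of-sample evaluation can look like, we can set aside some of the cars in mtcars, fit the model to the rest, and then make predictions for the cars that were set aside. The seed, the choice of holding out 8 of the 32 cars, and the object names are arbitrary choices made for this illustration; with so few observations, the results will vary quite a bit between different random splits.

```r
set.seed(1)  # Arbitrary seed, for reproducibility

# Randomly pick 8 cars to hold out; the rest are used for fitting
test_rows <- sample(nrow(mtcars), size = 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Fit the model using only the training data
m_train <- lm(mpg ~ ., data = train)

# Predict mpg for the held-out cars, which the model hasn't seen
pred <- predict(m_train, newdata = test)

# Out-of-sample measures of the prediction error
rmse_out <- sqrt(mean((pred - test$mpg)^2))
mae_out  <- mean(abs(pred - test$mpg))

# In-sample RMSE on the training data, for comparison
rmse_in <- sqrt(mean(residuals(m_train)^2))

rmse_in; rmse_out; mae_out
```

The out-of-sample errors will typically be larger than the in-sample error, which is exactly the optimism that in-sample evaluations suffer from.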

\[\sim\]

Exercise 9.1 To see that a high \(R^2\) and low p-values say very little about the predictive performance of a model, consider the following dataset with 30 randomly generated observations of four variables:

exdata <- data.frame(x1 = c(0.87, -1.03, 0.02, -0.25, -1.09, 0.74,
          0.09, -1.64, -0.32, -0.33, 1.40, 0.29, -0.71, 1.36, 0.64,
          -0.78, -0.58, 0.67, -0.90, -1.52, -0.11, -0.65, 0.04,
          -0.72, 1.71, -1.58, -1.76, 2.10, 0.81, -0.30),
          x2 = c(1.38, 0.14, 1.46, 0.27, -1.02, -1.94, 0.12, -0.64,
          0.64, -0.39, 0.28, 0.50, -1.29, 0.52, 0.28, 0.23, 0.05,
          3.10, 0.84, -0.66, -1.35, -0.06, -0.66, 0.40, -0.23,
          -0.97, -0.78, 0.38, 0.49, 0.21),
          x3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
          1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1),
          y = c(3.47, -0.80, 4.57, 0.16, -1.77, -6.84, 1.28, -0.52,
          1.00, -2.50, -1.99, 1.13, -4.26, 1.16, -0.69, 0.89, -1.01,
          7.56, 2.33, 0.36, -1.11, -0.53, -1.44, -0.43, 0.69, -2.30,
          -3.55, 0.99, -0.50, -1.67))

The true relationship between the variables, used to generate the y values, is \(y = 2x_1-x_2+x_3\cdot x_2\). Plot the y values in the data against this expected value. Does a linear model seem appropriate?
Fit a linear regression model with x1, x2 and x3 as explanatory variables (without any interactions) using the first 20 observations of the data. Do the p-values and \(R^2\) indicate a good fit?
Make predictions for the remaining 10 observations. Are the predictions accurate?
A common (mal)practice is to remove explanatory variables that aren’t significant from a linear model (see Section 8.1.9 for some comments on this). Remove any variables from the regression model with a p-value above 0.05, and refit the model using the first 20 observations. Do the p-values and \(R^2\) indicate a good fit? Do the predictions for the remaining 10 observations improve?
Finally, fit a model with x1, x2 and x3*x2 as explanatory variables (i.e. a correctly specified model) to the first 20 observations. Do the predictions for the remaining 10 observations improve?
