11 Predictive modelling and machine learning

In predictive modelling, we fit statistical models that use historical data to make predictions about future (or unknown) outcomes. This practice is a cornerstone of modern statistics and includes methods ranging from classical parametric linear regression to black-box machine learning models.

After reading this chapter, you will be able to use R to:

Fit predictive models for regression and classification,
Evaluate predictive models,
Use cross-validation and the bootstrap for out-of-sample evaluations,
Handle imbalanced classes in classification problems,
Fit regularised (and possibly also generalised) linear models, e.g., using the lasso,
Fit a number of machine learning models, including k-nearest neighbours (kNN), decision trees, random forests, and boosted trees, and
Make forecasts based on time series data.

11.1 Evaluating predictive models

In many ways, modern predictive modelling differs from the more traditional inference problems that we studied in the previous chapter. The goal of predictive modelling is (usually) not to test whether some variable affects another or to study causal relationships. Instead, our only goal is to make good predictions. It is little surprise then that the tools we use to evaluate predictive models differ from those used to evaluate models used for other purposes, like hypothesis testing. In this section, we will have a look at how to evaluate predictive models.

The terminology used in predictive modelling differs a little from that used in traditional statistics. For instance, explanatory variables are often called features or predictors, and predictive modelling is often referred to as supervised learning. We will stick with the terms used in Chapter 7, to keep the terminology consistent within the book.

Predictive models can be divided into two categories:

Regression, where we want to make predictions for a numeric variable,
Classification, where we want to make predictions for a categorical variable.

There are many similarities between these two, but we need to use different measures when evaluating their predictive performance. Let’s start with models for numeric predictions, i.e., regression models.

11.1.1 Evaluating regression models

Let’s return to the mtcars data that we studied in Section 8.1. There, we fitted a linear model to explain the fuel consumption of cars:

m <- lm(mpg ~ ., data = mtcars)

(Recall that the formula mpg ~ . means that all variables in the dataset, except mpg, are used as explanatory variables in the model.)

A number of measures of how well the model fits the data have been proposed. Without going into details (it will soon be apparent why), we can mention examples like the coefficient of determination \(R^2\), and information criteria like \(AIC\) and \(BIC\). All of these are straightforward to compute for our model:

summary(m)$r.squared     # R^2
summary(m)$adj.r.squared # Adjusted R^2
AIC(m)                   # AIC
BIC(m)                   # BIC

\(R^2\) is a popular tool for assessing model fit, with values close to 1 indicating a good fit and values close to 0 indicating a poor fit (i.e., that most of the variation in the data isn’t accounted for).

It is nice if our model fits the data well, but what really matters in predictive modelling is how close the predictions from the model are to the truth. We therefore need ways to measure the distance between predicted values and observed values – ways to measure the size of the average prediction error. A common measure is the root mean square error (RMSE). Given \(n\) observations \(y_1,y_2,\ldots,y_n\) for which our model makes the predictions \(\hat{y}_1,\ldots,\hat{y}_n\), this is defined as \[RMSE = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}},\] that is, as the named implies, the square root of the mean of the squared errors \((\hat{y}_i-y_i)^2\).

Another common measure is the mean absolute error (MAE):

\[MAE = \frac{\sum_{i=1}^n|\hat{y}_i-y_i|}{n}.\]

Let’s compare the predicted values \(\hat{y}_i\) to the observed values \(y_i\) for our mtcars model m:

rmse <- sqrt(mean((predict(m) - mtcars$mpg)^2))
mae <- mean(abs(predict(m) - mtcars$mpg))
rmse; mae

There is a problem with this computation, and it is a big one. What we just computed was the difference between predicted values and observed values for the sample that was used to fit the model. This doesn’t necessarily tell us anything about how well the model will fare when used to make predictions about new observations. It is, for instance, entirely possible that our model has overfitted to the sample, and essentially has learned the examples therein by heart, ignoring the general patterns that we were trying to model. This would lead to a small \(RMSE\) and \(MAE\), and a high \(R^2\), but would render the model useless for predictive purposes.

All the computations that we’ve just done – \(R^2\), \(AIC\), \(BIC\), \(RMSE\), and \(MAE\) – were examples of in-sample evaluations of our model. There are a number of problems associated with in-sample evaluations, all of which have been known for a long time; see, e.g., Picard & Cook (1984). In general, they tend to be overly optimistic and overestimate how well the model will perform for new data. It is about time that we got rid of them for good.

A fundamental principle of predictive modelling is that the model chiefly should be judged on how well it makes predictions for new data. To evaluate its performance, we therefore need to carry out some form of out-of-sample evaluation, i.e., to use the model to make predictions for new data (that weren’t used to fit the model). We can then compare those predictions to the actual observed values for those data, and, e.g., compute the \(RMSE\) or \(MAE\) to measure the size of the average prediction error. Out-of-sample evaluations, when done right, are less overoptimistic than in-sample evaluations, and they are also better in the sense that they actually measure the right thing.

\[\sim\]

Exercise 11.1 To see that a high \(R^2\) and low p-values say very little about the predictive performance of a model, consider the following dataset with 30 randomly generated observations of four variables:

exdata <- data.frame(x1 = c(0.87, -1.03, 0.02, -0.25, -1.09, 0.74,
          0.09, -1.64, -0.32, -0.33, 1.40, 0.29, -0.71, 1.36, 0.64,
          -0.78, -0.58, 0.67, -0.90, -1.52, -0.11, -0.65, 0.04,
          -0.72, 1.71, -1.58, -1.76, 2.10, 0.81, -0.30),
          x2 = c(1.38, 0.14, 1.46, 0.27, -1.02, -1.94, 0.12, -0.64,
          0.64, -0.39, 0.28, 0.50, -1.29, 0.52, 0.28, 0.23, 0.05,
          3.10, 0.84, -0.66, -1.35, -0.06, -0.66, 0.40, -0.23,
          -0.97, -0.78, 0.38, 0.49, 0.21),
          x3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
          1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1),
          y = c(3.47, -0.80, 4.57, 0.16, -1.77, -6.84, 1.28, -0.52,
          1.00, -2.50, -1.99, 1.13, -4.26, 1.16, -0.69, 0.89, -1.01,
          7.56, 2.33, 0.36, -1.11, -0.53, -1.44, -0.43, 0.69, -2.30,
          -3.55, 0.99, -0.50, -1.67))

The true relationship between the variables, used to generate the y variables, is \(y = 2x_1-x_2+x_3\cdot x_2\). Plot the y values in the data against this expected value. Does a linear model seem appropriate?
Fit a linear regression model with x1, x2, and x3 as explanatory variables (without any interactions) using the first 20 observations of the data. Do the p-values and \(R^2\) indicate a good fit?
Make predictions for the remaining 10 observations. Are the predictions accurate?
A common (mal)practice is to remove explanatory variables that aren’t significant from a linear model (see Section 8.2.7 for some comments on this). Remove any variables from the regression model with a p-value above 0.05, and refit the model using the first 20 observations. Do the p-values and \(R^2\) indicate a good fit? Do the predictions for the remaining 10 observations improve?
Finally, fit a model with x1, x2, and x3*x2 as explanatory variables (i.e., a correctly specified model) to the first 20 observations. Do the predictions for the remaining 10 observations improve?

11 Predictive modelling and machine learning

11.1 Evaluating predictive models

11.1.1 Evaluating regression models

11.1.2 Test-training splits

11.1.3 Leave-one-out cross-validation and caret

11.1.4 k-fold cross-validation

11.1.5 Twinned observations

11.1.6 Bootstrapping

11.1.7 Evaluating classification models

11.1.8 Visualising decision boundaries

11.2 Ethical issues in predictive modelling

11.3 Challenges in predictive modelling

11.3.1 Handling class imbalance

11.3.2 Assessing variable importance

11.3.3 Extrapolation

11.3.4 Missing data and imputation

11.3.5 Endless waiting

11.3.6 Overfitting to the test set

11.4 Regularised regression models

11.4.1 Ridge regression

11.4.2 The lasso

11.4.3 Elastic net

11.4.4 Choosing the best model

11.4.5 Regularised mixed models

11.5 Machine learning models

11.5.1 Decision trees

11.5.2 Random forests

11.5.3 Boosted trees

11.5.4 Model trees

11.5.5 Discriminant analysis

11.5.6 Support vector machines

11.5.7 Nearest neighbours classifiers

11.6 Forecasting time series

11.6.1 Decomposition

11.6.2 Forecasting using ARIMA models

11.7 Deploying models

11.7.1 Creating APIs with plumber

11.7.2 Different types of output

11.1.3 Leave-one-out cross-validation and `caret`

11.7.1 Creating APIs with `plumber`