8 Regression models
Regression models, in which explanatory variables are used to model the behaviour of a response variable, is without a doubt the most commonly used class of models in the statistical toolbox. In this chapter, we will have a look at different types of regression models tailored to many different sorts of data and applications.
After reading this chapter, you will be able to use R to:
- Fit and evaluate linear and generalised linear models,
- Fit and evaluate mixed models,
- Fit survival analysis models,
- Analyse data with left-censored observations,
- Create matched samples.
8.1 Linear models
Being flexible enough to handle different types of data, yet simple enough to be useful and interpretable, linear models are among the most important tools in the statistics toolbox. In this section, we’ll discuss how to fit and evaluate linear models in R.
8.1.1 Fitting linear models
We had a quick glance at linear models in Section 3.7. There we used the mtcars
data:
?mtcarsView(mtcars)
First, we plotted fuel consumption (mpg
) against gross horsepower (hp
):
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point()
Given \(n\) observations of \(p\) explanatory variables (also known as predictors, covariates, independent variables, and features), the linear model is:
\[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\] where \(\epsilon_i\) is a random error with mean 0, meaning that the model also can be written as:
\[E(y_i)=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\]
We fitted a linear model using lm
, with mpg
as the response variable and hp
as the explanatory variable:
<- lm(mpg ~ hp, data = mtcars)
m summary(m)
We added the fitted line to the scatterplot by using geom_abline
:
# Check model coefficients:
coef(m)
# Add regression line to plot:
ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
geom_abline(aes(intercept = coef(m)[1], slope = coef(m)[2]),
colour = "red")
We had a look at some diagnostic plots given by applying plot
to our fitted model m
:
plot(m)
Finally, we added another variable, the car weight wt
, to the model:
<- lm(mpg ~ hp + wt, data = mtcars)
m summary(m)
Next, we’ll look at what more R has to offer when it comes to regression. Before that though, it’s a good idea to do a quick exercise to make sure that you remember how to fit linear models.
\[\sim\]
Exercise 8.1 The sales-weather.csv
data from Section 5.12 describes the weather in a region during the first quarter of 2020. Download the file from the book’s web page. Fit a linear regression model with TEMPERATURE
as the response variable and SUN_HOURS
as an explanatory variable. Plot the results. Is there a connection?
You’ll return to and expand this model in the next few exercises, so make sure to save your code.
(Click here to go to the solution.)
Exercise 8.2 Fit a linear model to the mtcars
data using the formula mpg ~ .
. What happens? What is ~ .
a shorthand for?
8.1.2 Interactions and polynomial terms
It seems plausible that there could be an interaction between gross horsepower and weight. We can include an interaction term by adding hp:wt
to the formula:
<- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
m summary(m)
Alternatively, to include the main effects of hp
and wt
along with the interaction effect, we can use hp*wt
as a shorthand for hp + wt + hp:wt
to write the model formula more concisely:
<- lm(mpg ~ hp*wt, data = mtcars)
m summary(m)
It is often recommended to centre the explanatory variables in regression models, i.e. to shift them so that they all have mean 0. There are a number of benefits to this: for instance that the intercept then can be interpreted as the expected value of the response variable when all explanatory variables are equal to their means, i.e. in an average case56. It can also reduce any multicollinearity in the data, particularly when including interactions or polynomial terms in the model. Finally, it can reduce problems with numerical instability that may arise due to floating point arithmetics. Note however, that there is no need to centre the response variable57.
Centring the explanatory variables can be done using scale
:
# Create a new data frame, leaving the response variable mpg
# unchanged, while centring the explanatory variables:
<- data.frame(mpg = mtcars[,1],
mtcars_scaled scale(mtcars[,-1], center = TRUE,
scale = FALSE))
<- lm(mpg ~ hp*wt, data = mtcars_scaled)
m summary(m)
If we wish to add a polynomial term to the model, we can do so by wrapping the polynomial in I()
. For instance, to add a quadratic effect in the form of the square weight of a vehicle to the model, we’d use:
<- lm(mpg ~ hp*wt + I(wt^2), data = mtcars_scaled)
m summary(m)
8.1.3 Dummy variables
Categorical variables can be included in regression models by using dummy variables. A dummy variable takes the values 0 and 1, indicating that an observation either belongs to a category (1) or not (0). If the original categorical variable has more than two categories, \(c\) categories, say, the number of dummy variables included in the regression model should be \(c-1\) (with the last category corresponding to all dummy variables being 0). R does this automatically for us if we include a factor
variable in a regression model:
# Make cyl a categorical variable:
$cyl <- factor(mtcars$cyl)
mtcars
<- lm(mpg ~ hp*wt + cyl, data = mtcars)
m summary(m)
Note how only two categories, 6 cylinders and 8 cylinders, are shown in the summary table. The third category, 4 cylinders, corresponds to both those dummy variables being 0. Therefore, the coefficient estimates for cyl6
and cyl8
are relative to the remaining reference category cyl4
. For instance, compared to cyl4
cars, cyl6
cars have a higher fuel consumption, with their mpg
being \(1.26\) lower.
We can control which category is used as the reference category by setting the order of the factor
variable, as in Section 5.4. The first factor
level is always used as the reference, so if for instance we want to use cyl6
as our reference category, we’d do the following:
# Make cyl a categorical variable with cyl6 as
# reference variable:
$cyl <- factor(mtcars$cyl, levels =
mtcarsc(6, 4, 8))
<- lm(mpg ~ hp*wt + cyl, data = mtcars)
m summary(m)
Dummy variables are frequently used for modelling differences between different groups. Including only the dummy variable corresponds to using different intercepts for different groups. If we also include an interaction with the dummy variable, we can get different slopes for different groups. Consider the model \[E(y_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_{12} x_{i1}x_{i2},\qquad i=1,\ldots,n\] where \(x_1\) is numeric and \(x_2\) is a dummy variable. Then the intercept and slope changes depending on the value of \(x_2\) as follows:
\[E(y_i)=\beta_0+\beta_1 x_{i1},\qquad \mbox{if } x_2=0,\] \[E(y_i)=(\beta_0+\beta_2)+(\beta_1+\beta_{12}) x_{i1},\qquad \mbox{if } x_2=1.\] This yields a model where the intercept and slope differs between the two groups that \(x_2\) represents.
\[\sim\]
Exercise 8.3 Return to the weather model from Exercise 8.1. Create a dummy variable for precipitation (zero precipitation or non-zero precipitation) and add it to your model. Also include an interaction term between the precipitation dummy and the number of sun hours. Are any of the coefficients significantly non-zero?
8.1.4 Model diagnostics
There are a few different ways in which we can plot the fitted model. First, we can of course make a scatterplot of the data and add a curve showing the fitted values corresponding to the different points. These can be obtained by running predict(m)
with our fitted model m
.
# Fit two models:
$cyl <- factor(mtcars$cyl)
mtcars<- lm(mpg ~ hp + wt, data = mtcars) # Simple model
m1 <- lm(mpg ~ hp*wt + cyl, data = mtcars) # Complex model
m2
# Create data frames with fitted values:
<- data.frame(hp = mtcars$hp, mpg_pred = predict(m1))
m1_pred <- data.frame(hp = mtcars$hp, mpg_pred = predict(m2))
m2_pred
# Plot fitted values:
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point() +
geom_line(data = m1_pred, aes(x = hp, y = mpg_pred),
colour = "red") +
geom_line(data = m2_pred, aes(x = hp, y = mpg_pred),
colour = "blue")
We could also plot the observed values against the fitted values:
<- nrow(mtcars)
n <- data.frame(Observed = rep(mtcars$mpg, 2),
models Fitted = c(predict(m1), predict(m2)),
Model = rep(c("Model 1", "Model 2"), c(n, n)))
ggplot(models, aes(Fitted, Observed)) +
geom_point(colour = "blue") +
facet_wrap(~ Model, nrow = 3) +
geom_abline(intercept = 0, slope = 1) +
xlab("Fitted values") + ylab("Observed values")
Linear models are fitted and analysed using a number of assumptions, most of which are assessed by looking at plots of the model residuals, \(y_i-\hat{y}_i\), where \(\hat{y}_i\) is the fitted value for observation \(i\). Some important assumptions are:
- The model is linear in the parameters: we check this by looking for non-linear patterns in the residuals, or in the plot of observed against fitted values.
- The observations are independent: which can be difficult to assess visually. We’ll look at models that are designed to handle correlated observations in Sections 8.4 and 9.6.
- Homoscedasticity: that the random errors all have the same variance. We check this by looking for non-constant variance in the residuals. The opposite of homoscedasticity is heteroscedasticity.
- Normally distributed random errors: this assumption is important if we want to use the traditional parametric p-values, confidence intervals and prediction intervals. If we use permutation p-values or bootstrap intervals (as we will later in this chapter), we no longer need this assumption.
Additionally, residual plots can be used to find influential points that (possibly) have a large impact on the model coefficients (influence is measured using Cook’s distance and potential influence using leverage). We’ve already seen that we can use plot(m)
to create some diagnostic plots. To get more and better-looking plots, we can use the autoplot
function for lm
objects from the ggfortify
package:
library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)
In each of the plots, we look for the following:
- Residuals versus fitted: look for patterns that can indicate non-linearity, e.g. that the residuals all are high in some areas and low in others. The blue line is there to aid the eye - it should ideally be relatively close to a straight line (in this case, it isn’t perfectly straight, which could indicate a mild non-linearity).
- Normal Q-Q: see if the points follow the line, which would indicate that the residuals (which we for this purpose can think of as estimates of the random errors) follow a normal distribution.
- Scale-Location: similar to the residuals versus fitted plot, this plot shows whether the residuals are evenly spread for different values of the fitted values. Look for patterns in how much the residuals vary - if they e.g. vary more for large fitted values, then that is a sign of heteroscedasticity. A horizontal blue line is a sign of homoscedasticity.
- Cook’s distance: look for points with high values. A commonly-cited rule-of-thumb (Cook & Weisberg, 1982) says that values above 1 indicate points with a high influence.
- Residuals versus leverage: look for points with a high residual and high leverage. Observations with a high residual but low leverage deviate from the fitted model but don’t affect it much. Observations with a high residual and a high leverage likely have a strong influence on the model fit, meaning that the fitted model could be quite different if these points were removed from the dataset.
- Cook’s distance versus leverage: look for observations with a high Cook’s distance and a high leverage, which are likely to have a strong influence on the model fit.
A formal test for heteroscedasticity, the Breusch-Pagan test, is available in the car
package as a complement to graphical inspection. A low p-value indicates statistical evidence for heteroscedasticity. To run the test, we use ncvTest
(where “ncv” stands for non-constant variance):
install.packages("car")
library(car)
ncvTest(m1)
A common problem in linear regression models is multicollinearity, i.e. explanatory variables that are strongly correlated. Multicollinearity can cause your \(\beta\) coefficients and p-values to change greatly if there are small changes in the data, rendering them unreliable. To check if you have multicollinearity in your data, you can create a scatterplot matrix of your explanatory variables, as in Section 4.8.1:
library(GGally)
ggpairs(mtcars[, -1])
In this case, there are some highly correlated pairs, hp
and disp
among them. As a numerical measure of collinearity, we can use the generalised variance inflation factor (GVIF), given by the vif
function in the car
package:
library(car)
<- lm(mpg ~ ., data = mtcars)
m vif(m)
A high GVIF indicates that a variable is highly correlated with other explanatory variables in the dataset. Recommendations for what a “high GVIF” is varies, from 2.5 to 10 or more.
You can mitigate problems related to multicollinearity by:
- Removing one or more of the correlated variables from the model (because they are strongly correlated, they measure almost the same thing anyway!),
- Centring your explanatory variables (particularly if you include polynomial terms),
- Using a regularised regression model (which we’ll do in Section 9.4).
\[\sim\]
y
as the response variable and x
as the explanatory variable for each dataset, and make some residual plots. Which dataset suffers from which problem?
<- data.frame(
exdata1 x = c(2.99, 5.01, 8.84, 6.18, 8.57, 8.23, 8.48, 0.04, 6.80,
7.62, 7.94, 6.30, 4.21, 3.61, 7.08, 3.50, 9.05, 1.06,
0.65, 8.66, 0.08, 1.48, 2.96, 2.54, 4.45),
y = c(5.25, -0.80, 4.38, -0.75, 9.93, 13.79, 19.75, 24.65,
6.84, 11.95, 12.24, 7.97, -1.20, -1.76, 10.36, 1.17,
15.41, 15.83, 18.78, 12.75, 24.17, 12.49, 4.58, 6.76,
-2.92))
<- data.frame(
exdata2 x = c(5.70, 8.03, 8.86, 0.82, 1.23, 2.96, 0.13, 8.53, 8.18,
6.88, 4.02, 9.11, 0.19, 6.91, 0.34, 4.19, 0.25, 9.72,
9.83, 6.77, 4.40, 4.70, 6.03, 5.87, 7.49),
y = c(21.66, 26.23, 19.82, 2.46, 2.83, 8.86, 0.25, 16.08,
17.67, 24.86, 8.19, 28.45, 0.52, 19.88, 0.71, 12.19,
0.64, 25.29, 26.72, 18.06, 10.70, 8.27, 15.49, 15.58,
19.17))
(Click here to go to the solution.)
Exercise 8.5 We continue our investigation of the weather models from Exercises 8.1 and 8.3.
Plot the observed values against the fitted values for the two models that you’ve fitted. Does either model seem to have a better fit?
Create residual plots for the second model from Exercise 8.3. Are there any influential points? Any patterns? Any signs of heteroscedasticity?
8.1.5 Transformations
If your data displays signs of heteroscedasticity or non-normal residuals, you can sometimes use a Box-Cox transformation (Box & Cox, 1964) to mitigate those problems. The Box-Cox transformation is applied to your dependent variable \(y\). What it looks like is determined by a parameter \(\lambda\). The transformation is defined as \(\frac{y_i^\lambda-1}{\lambda}\) if \(\lambda\neq 0\) and \(\ln(y_i)\) if \(\lambda=0\). \(\lambda=1\) corresponds to no transformation at all. The boxcox
function in MASS
is useful for finding an appropriate choice of \(\lambda\). Choose a \(\lambda\) that is close to the peak (inside the interval indicated by the outer dotted lines) of the curve plotted by boxcox
:
<- lm(mpg ~ hp + wt, data = mtcars)
m
library(MASS)
boxcox(m)
In this case, the curve indicates that \(\lambda=0\), which corresponds to a log-transformation, could be a good choice. Let’s give it a go:
$logmpg <- log(mtcars$mpg)
mtcars<- lm(logmpg ~ hp + wt, data = mtcars)
m_bc summary(m_bc)
library(ggfortify)
autoplot(m_bc, which = 1:6, ncol = 2, label.size = 3)
The model fit seems to have improved after the transformation. The downside is that we now are modelling the log-mpg rather than mpg, which make the model coefficients a little difficult to interpret.
\[\sim\]
Exercise 8.6 Run boxcox
with your model from Exercise 8.3. Does it indicate that a transformation can be useful for your model?
8.1.6 Alternatives to lm
Non-normal regression errors can sometimes be an indication that you need to transform your data, that your model is missing an important explanatory variable, that there are interaction effects that aren’t accounted for, or that the relationship between the variables is non-linear. But sometimes, you get non-normal errors simply because the errors are non-normal.
The p-values reported by summary
are computed under the assumption of normally distributed regression errors, and can be sensitive to deviations from normality. An alternative is to use the lmp
function from the lmPerm
package, which provides permutation test p-values instead. This doesn’t affect the model fitting in any way - the only difference is how the p-values are computed. Moreover, the syntax for lmp
is identical to that of lm
:
# First, install lmPerm:
install.packages("lmPerm")
# Get summary table with permutation p-values:
library(lmPerm)
<- lmp(mpg ~ hp + wt, data = mtcars)
m summary(m)
In some cases, you need to change the arguments of lmp
to get reliable p-values. We’ll have a look at that in Exercise 8.12. Relatedly, in Section 8.1.7 we’ll see how to construct bootstrap confidence intervals for the parameter estimates.
Another option that does affect the model fitting is to use a robust regression model based on M-estimators. Such models tend to be less sensitive to outliers, and can be useful if you are concerned about the influence of deviating points. The rlm
function in MASS
is used for this. As was the case for lmp
, the syntax for rlm
is identical to that of lm
:
library(MASS)
<- rlm(mpg ~ hp + wt, data = mtcars)
m summary(m)
Another option is to use Bayesian estimation, which we’ll discuss in Section 8.1.13.
\[\sim\]
Exercise 8.7 Refit your model from Exercise 8.3 using lmp
. Are the two main effects still significant?
8.1.7 Bootstrap confidence intervals for regression coefficients
Assuming normality, we can obtain parametric confidence intervals for the model coefficients using confint
:
<- lm(mpg ~ hp + wt, data = mtcars)
m
confint(m)
I usually prefer to use bootstrap confidence intervals, which we can obtain using boot
and boot.ci
, as we’ll do next. Note that the only random part in the linear model
\[y_i=\beta_0 +\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip} + \epsilon_i,\qquad i=1,\ldots,n\]
is the error term \(\epsilon_i\). In most cases, it is therefore this term (and this term only) that we wish to resample. The explanatory variables should remain constant throughout the resampling process; the inference is conditioned on the values of the explanatory variables.
To achieve this, we’ll resample from the model residuals, and add those to the values predicted by the fitted function, which creates new bootstrap values of the response variable. We’ll then fit a linear model to these values, from which we obtain observations from the bootstrap distribution of the model coefficients.
It turns out that the bootstrap performs better if we resample not from the original residuals \(e_1,\ldots,e_n\), but from scaled and centred residuals \(r_i-\bar{r}\), where each \(r_i\) is a scaled version of residual \(e_i\), scaled by the leverage \(h_i\):
\[r_i=\frac{e_i}{\sqrt{1-h_i}},\]
see Chapter 6 of Davison & Hinkley (1997) for details. The leverages can be computed using lm.influence
.
We implement this procedure in the code below (and will then have a look at convenience functions that help us achieve the same thing more easily). It makes use of formula
, which can be used to extract the model formula from regression models:
library(boot)
<- function(formula, data, i, predictions, residuals) {
coefficients # Create the bootstrap value of response variable by
# adding a randomly drawn scaled residual to the value of
# the fitted function for each observation:
all.vars(formula)[1]] <- predictions + residuals[i]
data[,
# Fit a new model with the bootstrap value of the response
# variable and the original explanatory variables:
<- lm(formula, data = data)
m return(coef(m))
}
# Fit the linear model:
<- lm(mpg ~ hp + wt, data = mtcars)
m
# Compute scaled and centred residuals:
<- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)
res
# Run the bootstrap, extracting the model formula and the
# fitted function from the model m:
<- boot(data = mtcars, statistic = coefficients,
boot_res R = 999, formula = formula(m),
predictions = predict(m),
residuals = res)
# Compute 95 % confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # hp
boot.ci(boot_res, type = "perc", index = 3) # wt
The argument index
in boot.ci
should be the row number of the parameter in the table given by summary
. The intercept is on the first row, and so its index
is 1, hp
is on the second row and its index
is 2, and so on.
Clearly, the above code is a little unwieldy. Fortunately, the car
package contains a function called Boot
that can be used to bootstrap regression models in the exact same way:
library(car)
<- Boot(m, method = "residual", R = 9999)
boot_res
# Compute 95 % confidence intervals:
confint(boot_res, type = "perc")
Finally, the most convenient approach is to use boot_summary
from the boot.pval
package. It provides a data frame with estimates, bootstrap confidence intervals, and bootstrap p-values (computed using interval inversion) for the model coefficients. The arguments specify what interval type and resampling strategy to use (more on the latter in Exercise 8.9):
library(boot.pval)
boot_summary(m, type = "perc", method = "residual", R = 9999)
\[\sim\]
Exercise 8.8 Refit your model from Exercise 8.3 using a robust regression estimator with rlm
. Compute confidence intervals for the coefficients of the robust regression model.
(Click here to go to the solution.)
Exercise 8.9 In an alternative bootstrap scheme for regression models, often referred to as case resampling, the observations (or cases) \((y_i, x_{i1},\ldots,x_{ip})\) are resampled instead of the residuals. This approach can be applied when the explanatory variables can be treated as being random (but measured without error) rather than fixed. It can also be useful for models with heteroscedasticity, as it doesn’t rely on assumptions about constant variance (which, on the other hand, makes it less efficient if the errors actually are homoscedastic).
Read the documentation for boot_summary
to see how you can compute confidence intervals for the coefficients in the model m <- lm(mpg ~ hp + wt, data = mtcars)
using case resampling. Do they differ substantially from those obtained using residual resampling in this case?
8.1.8 Alternative summaries with broom
The broom
package contains some useful functions when working with linear models (and many other common models), which allow us to get various summaries of the model fit in useful formats. Let’s install it:
install.packages("broom")
A model fitted with m
is stored as a list
with lots of elements:
<- lm(mpg ~ hp + wt, data = mtcars)
m str(m)
How can we access the information about the model? For instance, we may want to get the summary table from summary
, but as a data frame rather than as printed text. Here are two ways of doing this, using summary
and the tidy
function from broom
:
# Using base R:
summary(m)$coefficients
# Using broom:
library(broom)
tidy(m)
tidy
is the better option if you want to retrieve the table as part of a pipeline. For instance, if you want to adjust the p-values for multiplicity using Bonferroni correction (Section 7.2.5), you could do as follows:
library(magrittr)
%>%
mtcars lm(mpg ~ hp + wt, data = .) %>%
tidy() %$%
p.adjust(p.value, method = "bonferroni")
If you prefer bootstrap p-values, you can use boot_summary
from boot.pval
similarly. That function also includes an argument for adjusting the p-values for multiplicity:
library(boot.pval)
lm(mpg ~ hp + wt, data = mtcars) %>%
boot_summary(adjust.method = "bonferroni")
Another useful function in broom
is glance
, which lets us get some summary statistics about the model:
glance(m)
Finally, augment
can be used to add predicted values, residuals, and Cook’s distances to the dataset used for fitting the model, which of course can be very useful for model diagnostics:
# To get the data frame with predictions and residuals added:
augment(m)
# To plot the observed values against the fitted values:
library(ggplot2)
%>%
mtcars lm(mpg ~ hp + wt, data = .) %>%
augment() %>%
ggplot(aes(.fitted, mpg)) +
geom_point() +
xlab("Fitted values") + ylab("Observed values")
8.1.9 Variable selection
A common question when working with linear models is what variables to include in your model. Common practices for variable selection include stepwise regression methods, where variables are added to or removed from the model depending on p-values, \(R^2\) values, or information criteria like AIC or BIC.
Don’t ever do this if your main interest is p-values. Stepwise regression increases the risk of type I errors, renders the p-values of your final model invalid, and can lead to over-fitting; see e.g. Smith (2018). Instead, you should let your research hypothesis guide your choice of variables, or base your choice on a pilot study.
If your main interest is prediction, then that is a completely different story. For predictive models, it is usually recommended that variable selection and model fitting should be done simultaneously. This can be done using regularised regression models, to which Section 9.4 is devoted.
8.1.10 Prediction
An important use of linear models is prediction. In R, this is done using predict
. By providing a fitted model and a new dataset, we can get predictions.
Let’s use one of the models that we fitted to the mtcars
data to make predictions for two cars that aren’t from the 1970’s. Below, we create a data frame with data for a 2009 Volvo XC90 D3 AWD (with a fuel consumption of 29 mpg) and a 2019 Ferrari Roma (15.4 mpg):
<- data.frame(hp = c(161, 612), wt = c(4.473, 3.462),
new_cars row.names = c("Volvo XC90", "Ferrari Roma"))
To get the model predictions for these new cars, we run the following:
predict(m, new_cars)
predict
also lets us obtain prediction intervals for our prediction, under the assumption of normality58. To get 90 % prediction intervals, we add interval = "prediction"
and level = 0.9
:
<- lm(mpg ~ hp + wt, data = mtcars)
m predict(m, new_cars,
interval = "prediction",
level = 0.9)
If we were using a transformed \(y\)-variable, we’d probably have to transform the predictions back to the original scale for them to be useful:
$logmpg <- log(mtcars$mpg)
mtcars<- lm(logmpg ~ hp + wt, data = mtcars)
m_bc
<- predict(m_bc, new_cars,
preds interval = "prediction",
level = 0.9)
# Predictions for log-mpg:
preds# Transform back to original scale:
exp(preds)
The lmp
function that we used to compute permutation p-values does not offer confidence intervals. We can however compute bootstrap prediction intervals using the code below. Prediction intervals try to capture two sources of uncertainty:
- Model uncertainty, which we will capture by resampling the data and make predictions for the expected value of the observation,
- Random noise, i.e. that almost all observations deviate from their expected value. We will capture this by resampling residuals from the fitted bootstrap models.
Consequently, the value that we generate in each bootstrap replication will be the sum of a prediction and a resampled residual (see Davison & Hinkley (1997), Section 6.3, for further details):
<- function(data, new_data, model, i,
boot_pred
formula, predictions, residuals){# Resample residuals and fit new model:
all.vars(formula)[1]] <- predictions + residuals[i]
data[,<- lm(formula, data = data)
m_boot
# We use predict to get an estimate of the
# expectation of new observations, and then
# add resampled residuals to also include the
# natural variation around the expectation:
predict(m_boot, newdata = new_data) +
sample(residuals, nrow(new_data))
}
library(boot)
<- lm(mpg ~ hp + wt, data = mtcars)
m
# Compute scaled and centred residuals:
<- residuals(m)/sqrt(1 - lm.influence(m)$hat)
res <- res - mean(res)
res
<- boot(data = m$model,
boot_res statistic = boot_pred,
R = 999,
model = m,
new_data = new_cars,
formula = formula(m),
predictions = predict(m),
residuals = res)
# 90 % bootstrap prediction intervals:
boot.ci(boot_res, type = "perc", index = 1, conf = 0.9) # Volvo
boot.ci(boot_res, type = "perc", index = 2, conf = 0.9) # Ferrari
\[\sim\]
Exercise 8.10 Use your model from Exercise 8.3 to compute a bootstrap prediction interval for the temperature on a day with precipitation but no sun hours.
8.1.11 Prediction for multiple datasets
In certain cases, we wish to fit different models to different subsets of the data. Functionals like apply
and map
(Section 6.5) are handy when you want to fit several models at once. Below is an example of how we can use split
(Section 5.2.1) and tools from the purrr
package (Section 6.5.3) to fit the models simultaneously, as well as for computing the fitted values in a single line of code:
# Split the dataset into three groups depending on the
# number of cylinders:
library(magrittr)
<- mtcars %>% split(.$cyl)
mtcars_by_cyl
# Fit a linear model to each subgroup:
library(purrr)
<- mtcars_by_cyl %>% map(~ lm(mpg ~ hp + wt, data = .))
models
# Compute the fitted values for each model:
map2(models, mtcars_by_cyl, predict)
We’ll make use of this approach when we study linear mixed models in Section 8.4.
8.1.12 ANOVA
Linear models are also used for analysis of variance (ANOVA) models to test whether there are differences among the means of different groups. We’ll use the mtcars
data to give some examples of this. Let’s say that we want to investigate whether the mean fuel consumption (mpg
) of cars differs depending on the number of cylinders (cyl
), and that we want to include the type of transmission (am
) as a blocking variable.
To get an ANOVA table for this problem, we must first convert the explanatory variables to factor
variables, as the variables in mtcars
all numeric
(despite some of them being categorical). We can then use aov
to fit the model, and then summary
:
# Convert variables to factors:
$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
mtcars
# Fit model and print ANOVA table:
<- aov(mpg ~ cyl + am, data = mtcars)
m summary(m)
(aov
actually uses lm
to fit the model, but by using aov
we specify that we want an ANOVA table to be printed by summary
.)
When there are different numbers of observations in the groups in an ANOVA, so that we have an unbalanced design, the sums of squares used to compute the test statistics can be computed in at least three different ways, commonly called type I, II and III. See Herr (1986) for an overview and discussion of this.
summary
prints a type I ANOVA table, which isn’t the best choice for unbalanced designs. We can however get type II or III tables by instead using Anova
from the car
package to print the table:
library(car)
Anova(m, type = "II")
Anova(m, type = "III") # Default in SAS and SPSS.
As a guideline, for unbalanced designs, you should use type II tables if there are no interactions, and type III tables if there are interactions. To look for interactions, we can use interaction.plot
to create a two-way interaction plot:
interaction.plot(mtcars$am, mtcars$cyl, response = mtcars$mpg)
In this case, there is no sign of an interaction between the two variables, as the lines are more or less parallel. A type II table is therefore probably the best choice here.
We can obtain diagnostic plots the same way we did for other linear models:
library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3)
To find which groups that have significantly different means, we can use a post hoc test like Tukey’s HSD, available through the TukeyHSD
function:
TukeyHSD(m)
We can visualise the results of Tukey’s HSD with plot
, which shows 95 % confidence intervals for the mean differences:
# When the difference isn't significant, the dashed line indicating
# "no differences" falls within the confidence interval for
# the difference:
plot(TukeyHSD(m, "am"))
# When the difference is significant, the dashed line does not
# fall within the confidence interval:
plot(TukeyHSD(m, "cyl"))
\[\sim\]
Exercise 8.11 Return to the residual plots that you created with autoplot
. Figure out how you can plot points belonging to different cyl
groups in different colours.
(Click here to go to the solution.)
Exercise 8.12 The aovp
function in the lmPerm
package can be utilised to perform permutation tests instead of the classical parametric ANOVA tests. Rerun the analysis in the example above, using aovp
instead. Do the conclusions change? What happens if you run your code multiple times? Does using summary
on a model fitted using aovp
generate a type I, II or III table by default? Can you change what type of table it produces?
(Click here to go to the solution.)
Exercise 8.13 In the case of a one-way ANOVA (i.e. ANOVA with a single explanatory variable), the Kruskal-Wallis test can be used as a nonparametric option. It is available in kruskal.test
. Use the Kruskal-Wallis test to run a one-way ANOVA for the mtcars
data, with mpg
as the response variable and cyl
as an explanatory variable.
8.1.13 Bayesian estimation of linear models
We can fit Bayesian linear models using the rstanarm
package. To fit a model to the mtcars
data using all explanatory variables, we can use stan_glm
in place of lm
as follows:
library(rstanarm)
<- stan_glm(mpg ~ ., data = mtcars)
m
# Print the estimates:
coef(m)
Next, we can plot the posterior distributions of the effects:
plot(m, "dens", pars = names(coef(m)))
To get 95 % credible intervals for the effects, we can use posterior_interval
:
posterior_interval(m,
pars = names(coef(m)),
prob = 0.95)
We can also plot them using plot
:
plot(m, "intervals",
pars = names(coef(m)),
prob = 0.95)
Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:
plot(m, "rhat")
Like for lm
, residuals(m)
provides the model residuals, which can be used for diagnostics. For instance, we can plot the residuals against the fitted values to look for signs of non-linearity, adding a curve to aid the eye:
<- data.frame(Fitted = predict(m),
model_diag Residual = residuals(m))
library(ggplot2)
ggplot(model_diag, aes(Fitted, Residual)) +
geom_point() +
geom_smooth(se = FALSE)
For fitting ANOVA models, we can instead use stan_aov
with the argument prior = R2(location = 0.5)
to fit the model.
8.2 Ethical issues in regression modelling
The p-hacking problem, discussed in Section 7.4, is perhaps particularly prevalent in regression modelling. Regression analysis often involves a large number of explanatory variables, and practitioners often try out several different models (e.g. by performing stepwise variable selection; see Section 8.1.9). Because so many hypotheses are tested, often in many different but similar models, there is a large risk of false discoveries.
In any regression analysis, there is a risk of finding spurious relationships. These are dependencies between the response variable and an explanatory variable that either are non-causal or are purely coincidental. As an example of the former, consider the number of deaths by drowning, which is strongly correlated with ice cream sales. Not because ice cream cause people to drown, but because both are affected by the weather: we are more likely to go swimming or buy ice cream on hot days. Lurking variables, like the temperature in the ice cream-drowning example, are commonly referred to as confounding factors. An effect may be statistically significant, but that does not necessarily mean that it is meaningful.
\[\sim\]
8.3 Generalised linear models
Generalised linear models, abbreviated GLM, are (yes) a generalisation of the linear model, that can be used when your response variable has a non-normal error distribution. Typical examples are when your response variable is binary (only takes two values, e.g. 0 or 1), or a count of something. Fitting GLM’s is more or less entirely analogous to fitting linear models in R, but model diagnostics are very different. In this section we will look at some examples of how it can be done.
8.3.1 Modelling proportions: Logistic regression
As the first example of binary data, we will consider the wine quality dataset wine
from Cortez et al. (2009), which is available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality. It contains measurements on white and red vinho verde wine samples from northern Portugal.
We start by loading the data. It is divided into two separate .csv
files, one for white wines and one for red, which we have to merge:
# Import data about white and red wines:
<- read.csv("https://tinyurl.com/winedata1",
white sep = ";")
<- read.csv("https://tinyurl.com/winedata2",
red sep = ";")
# Add a type variable:
$type <- "white"
white$type <- "red"
red
# Merge the datasets:
<- rbind(white, red)
wine $type <- factor(wine$type)
wine
# Check the result:
summary(wine)
We are interested in seeing if measurements like pH (pH
) and alcohol content (alcohol
) can be used to determine the colours of the wine. The colour is represented by the type
variable, which is binary.
Our model is that the type
of a randomly selected wine is binomial \(Bin(1, \pi_i)\)-distributed (Bernoulli distributed), where \(\pi_i\) depends on explanatory variables like pH and alcohol content. A common model for this situation is a logistic regression model. Given \(n\) observations of \(p\) explanatory variables, the model is:
\[\log\Big(\frac{\pi_i}{1-\pi_i}\Big)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\] Where we in linear regression models model the expected value of the response variable as a linear function of the explanatory variables, we now model the expected value of a function of the expected value of the response variable (that is, a function of \(\pi_i\)). In GLM terminology, this function is known as a link function.
Logistic regression models can be fitted using the glm
function. To specify what our model is, we use the argument family = binomial
:
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m summary(m)
The p-values presented in the summary table are based on a Wald test known to have poor performance unless the sample size is very large (Agresti, 2013). In this case, with a sample size of 6,497, it is probably safe to use, but for smaller sample sizes, it is preferable to use a bootstrap test instead, which you will do in Exercise 8.18.
The coefficients of a logistic regression model aren’t as straightforward to interpret as those in a linear model. If we let \(\beta\) denote a coefficient corresponding to an explanatory variable \(x\), then:
- If \(\beta\) is positive, then \(\pi_i\) increases when \(x_i\) increases.
- If \(\beta\) is negative, then \(\pi_i\) decreases when \(x_i\) increases.
- \(e^\beta\) is the odds ratio, which shows how much the odds \(\frac{\pi_i}{1-\pi_1}\) change when \(x_i\) is increased 1 step.
We can extract the coefficients and odds ratios using coef
:
coef(m) # Coefficients, beta
exp(coef(m)) # Odds ratios
To find the fitted probability that an observation belongs to the second class we can use predict(m, type = "response")
:
# Check which class is the second one:
levels(wine$type)
# "white" is the second class!
# Get fitted probabilities:
<- predict(m, type = "response")
probs
# Check what the average prediction is for
# the two groups:
mean(probs[wine$type == "red"])
mean(probs[wine$type == "white"])
It turns out that the model predicts that most wines are white - even the red ones! The reason may be that we have more white wines (4,898) than red wines (1,599) in the dataset. Adding more explanatory variables could perhaps solve this problem. We’ll give that a try in the next section.
\[\sim\]
Exercise 8.16 Download sharks.csv
file from the book’s web page. It contains information about shark attacks in South Africa. Using data on attacks that occurred in 2000 or later, fit a logistic regression model to investigate whether the age and sex of the individual that was attacked affect the probability of the attack being fatal.
Note: save the code for your model, as you will return to it in the subsequent exercises.
(Click here to go to the solution.)
Exercise 8.17 In Section 8.1.8 we saw how some functions from the broom
package could be used to get summaries of linear models. Try using them with the wine
data model that we created above. Do the broom
functions work for generalised linear models as well?
8.3.2 Bootstrap confidence intervals
In a logistic regression, the response variable \(y_i\) is a binomial (or Bernoulli) random variable with success probability \(\pi_i\). In this case, we don’t want to resample residuals to create confidence intervals, as it turns out that this can lead to predicted probabilities outside the range \((0,1)\). Instead, we can either use the case resampling strategy described in Exercise 8.9 or use a parametric bootstrap approach where we generate new binomial variables (Section 7.1.2) to construct bootstrap confidence intervals.
To use case resampling, we can use boot_summary
from boot.pval
:
library(boot.pval)
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m
boot_summary(m, type = "perc", method = "case")
In the parametric approach, for each observation, the fitted success probability from the logistic model will be used to sample new observations of the response variable. This method can work well if the model is well-specified but tends to perform poorly for misspecified models, so make sure to carefully perform model diagnostics (as described in the next section) before applying it. To use the parametric approach, we can do as follows:
library(boot)
<- function(formula, data, predictions, ...) {
coefficients # Check whether the response variable is a factor or
# numeric, and then resample:
if(is.factor(data[,all.vars(formula)[1]])) {
# If the response variable is a factor:
all.vars(formula)[1]] <-
data[,factor(levels(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
1, predictions)]) } else {
# If the response variable is numeric:
all.vars(formula)[1]] <-
data[,unique(data[,all.vars(formula)[1]])[1 + rbinom(nrow(data),
1, predictions)] }
<- glm(formula, data = data, family = binomial)
m return(coef(m))
}
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m
<- boot(data = wine, statistic = coefficients,
boot_res R = 999,
formula = formula(m),
predictions = predict(m, type = "response"))
# Compute confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # pH
boot.ci(boot_res, type = "perc", index = 3) # Alcohol
\[\sim\]
Exercise 8.18 Use the model that you fitted to the sharks.csv
data in Exercise 8.16 for the following:
When the
MASS
package is loaded, you can useconfint
to obtain (asymptotic) confidence intervals for the parameters of a GLM. Use it to compute confidence intervals for the parameters of your model for thesharks.csv
data.Compute parametric bootstrap confidence intervals and p-values for the parameters of your logistic regression model for the
sharks.csv
data. Do they differ from the intervals obtained usingconfint
? Note that there are a lot of missing values for the response variable. Think about how that will affect your bootstrap intervals and adjust your code accordingly.Use the confidence interval inversion method of Section 7.7.3 to compute bootstrap p-values for the effect of age.
8.3.3 Model diagnostics
It is notoriously difficult to assess model fit for GLM’s, because the behaviour of the residuals is very different from residuals in ordinary linear models. In the case of logistic regression, the response variable is always 0 or 1, meaning that there will be two bands of residuals:
# Store deviance residuals:
<- glm(type ~ pH + alcohol, data = wine, family = binomial)
m <- data.frame(Predicted <- predict(m),
res <- residuals(m, type ="deviance"),
Residuals Index = 1:nrow(m$data),
CooksDistance = cooks.distance(m))
# Plot fitted values against the deviance residuals:
library(ggplot2)
ggplot(res, aes(Predicted, Residuals)) +
geom_point()
# Plot index against the deviance residuals:
ggplot(res, aes(Index, Residuals)) +
geom_point()
Plots of raw residuals are of little use in logistic regression models. A better option is to use a binned residual plot, in which the observations are grouped into bins based on their fitted value. The average residual in each bin can then be computed, which will tell us if which parts of the model have a poor fit. A function for this is available in the arm
package:
install.packages("arm")
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
The grey lines show confidence bounds which are supposed to contain about 95 % of the bins. If too many points fall outside these bounds, it’s a sign that we have a poor model fit. In this case, there are a few points outside the bounds. Most notably, the average residuals are fairly large for the observations with the lowest fitted values, i.e. among the observations with the lowest predicted probability of being white wines.
Let’s compare the above plot to that for a model with more explanatory variables:
<- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
m2 data = wine, family = binomial)
binnedplot(predict(m2, type = "response"),
residuals(m2, type = "response"))
This looks much better - adding more explanatory variable appears to have improved the model fit.
It’s worth repeating that if your main interest is hypothesis testing, you shouldn’t fit multiple models and then pick the one that gives the best results. However, if you’re doing an exploratory analysis or are interested in predictive modelling, you can and should try different models. It can then be useful to do a formal hypothesis test of the null hypothesis that m
and m2
fit the data equally well, against the alternative that m2
has a better fit. If both fit the data equally well, we’d prefer m
, since it is a simpler model. We can use anova
to perform a likelihood ratio deviance test (see Section 12.4 for details), which tests this:
anova(m, m2, test = "LRT")
The p-value is very low, and we conclude that m2
has a better model fit.
Another useful function is cooks.distance
, which can be used to compute the Cook’s distance for each observation, which is useful for finding influential observations. In this case, I’ve chosen to print the row numbers for the observations with a Cook’s distance greater than 0.004 - this number has been arbitrarily chosen in order only to highlight the observations with the highest Cook’s distance.
<- data.frame(Index = 1:length(cooks.distance(m)),
res CooksDistance = cooks.distance(m))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > 0.004,
rownames(res), "")),
hjust = 1.1)
\[\sim\]
Exercise 8.19 Investigate the residuals for your sharks.csv
model. Are there any problems with the model fit? Any influential points?
8.3.4 Prediction
Just as for linear models, we can use predict
to make predictions for new observations using a GLM. To begin with, let’s randomly sample 10 rows from the wine
data and fit a model using all data except those ten observations:
# Randomly select 10 rows from the wine data:
<- sample(1:nrow(wine), 10)
rows
<- glm(type ~ pH + alcohol, data = wine[-rows,], family = binomial) m
We can now use predict
to make predictions for the ten observations:
<- predict(m, wine[rows,])
preds preds
Those predictions look a bit strange though - what are they? By default, predict
returns predictions on the scale of the link function. That’s not really what we want in most cases - instead, we are interested in the predicted probabilities. To get those, we have to add the argument type = "response"
to the call:
<- predict(m, wine[rows,], type = "response")
preds preds
Logistic regression models are often used for prediction, in what is known as classification. Section 9.1.7 is concerned with how to evaluate the predictive performance of logistic regression and other classification models.
8.3.5 Modelling count data
Logistic regression is but one of many types of GLM’s used in practice. One important example is Cox regression, which is used for survival data. We’ll return to that model in Section 8.5. For now, we’ll consider count data instead. Let’s have a look at the shark attack data in sharks.csv
, available on the book’s website. It contains data about shark attacks in South Africa, downloaded from The Global Shark Attack File (http://www.sharkattackfile.net/incidentlog.htm). To load it, we download the file and set file_path
to the path of sharks.csv
:
<- read.csv(file_path, sep =";")
sharks
# Compute number of attacks per year:
<- aggregate(Type ~ Year, data = sharks, FUN = length)
attacks
# Keep data for 1960-2019:
<- subset(attacks, Year >= 1960) attacks
The number of attacks in a year is not binary but a count that, in principle, can take any non-negative integer as its value. Are there any trends over time for the number of reported attacks?
# Plot data from 1960-2019:
library(ggplot2)
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks")
No trend is evident. To confirm this, let’s fit a regression model with Type
(the number of attacks) as the response variable and Year
as an explanatory variable. For count data like this, a good first model to use is Poisson regression. Let \(\mu_i\) denote the expected value of the response variable given the explanatory variables. Given \(n\) observations of \(p\) explanatory variables, the Poisson regression model is:
\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n\]
To fit it, we use glm
as before, but this time with family = poisson
:
<- glm(Type ~ Year, data = attacks, family = poisson)
m summary(m)
We can add the curve corresponding to the fitted model to our scatterplot as follows:
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred predict(m, type = "response"))
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks") +
geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
colour = "red")
The fitted model seems to confirm our view that there is no trend over time in the number of attacks.
For model diagnostics, we can use a binned residual plot and a plot of Cook’s distance to find influential points:
# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
# Plot index against the Cook's distance to find
# influential points:
<- data.frame(Index = 1:nrow(m$data),
res CooksDistance = cooks.distance(m))
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > 0.1,
rownames(res), "")),
hjust = 1.1)
A common problem in Poisson regression models is excess zeros, i.e. more observations with value 0 than what is predicted by the model. To check the distribution of counts in the data, we can draw a histogram:
ggplot(attacks, aes(Type)) +
geom_histogram(binwidth = 1, colour = "black")
If there are a lot of zeroes in the data, we should consider using another model, such as a hurdle model or a zero-inflated Poisson regression. Both of these are available in the pscl
package.
Another common problem is overdispersion, which occurs when there is more variability in the data than what is predicted by the GLM. A formal test of overdispersion (Cameron & Trivedi, 1990) is provided by dispersiontest
in the AER
package. The null hypothesis is that there is no overdispersion, and the alternative that there is overdispersion:
install.packages("AER")
library(AER)
dispersiontest(m, trafo = 1)
There are several alternative models that can be considered in the case of overdispersion. One of them is negative binomial regression, which uses the same link function as Poisson regression. We can fit it using the glm.nb
function from MASS
:
library(MASS)
<- glm.nb(Type ~ Year, data = attacks)
m_nb
summary(m_nb)
For the shark attack data, the predictions from the two models are virtually identical, meaning that both are equally applicable in this case:
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred predict(m, type = "response"))
<- data.frame(Year = attacks$Year, at_pred =
attacks_pred_nb predict(m_nb, type = "response"))
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks") +
geom_line(data = attacks_pred, aes(x = Year, y = at_pred),
colour = "red") +
geom_line(data = attacks_pred_nb, aes(x = Year, y = at_pred),
colour = "blue", linetype = "dashed")
Finally, we can obtain bootstrap confidence intervals e.g. using case resampling, using boot_summary
:
library(boot.pval)
boot_summary(m_nb, type = "perc", method = "case")
\[\sim\]
Exercise 8.20 The quakes
dataset, available in base R, contains information about seismic events off Fiji. Fit a Poisson regression model with stations
as the response variable and mag
as an explanatory variable. Are there signs of overdispersion? Does using a negative binomial model improve the model fit?
8.3.6 Modelling rates
Poisson regression models, and related models like negative binomial regression, can not only be used to model count data. They can also be used to model rate data, such as the number of cases per capita or the number of cases per unit area. In that case, we need to include an exposure variable \(N\) that describes e.g. the population size or area corresponding to each observation. The model will be that:
\[\log(\mu_i/N_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip},\qquad i=1,\ldots,n.\] Because \(\log(\mu_i/N_i)=\log(\mu_i)-\log(N_i)\), this can be rewritten as:
\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip}+\log(N_i),\qquad i=1,\ldots,n.\]
In other words, we should include \(\log(N_i)\) on the right-hand side of our model, with a known coefficient equal to 1. In regression, such a term is known as an offset. We can add it to our model using the offset
function.
As an example, we’ll consider the ships
data from the MASS
package. It describes the number of damage incidents for different ship types operating in the 1960’s and 1970’s, and includes information about how many months each ship type was in service (i.e. each ship type’s exposure):
library(MASS)
?shipsView(ships)
For our example, we’ll use ship type
as the explanatory variable, incidents
as the response variable and service
as the exposure variable. First, we remove observations with 0 exposure (by definition, these can’t be involved in incidents, and so there is no point in including them in the analysis). Then, we fit the model using glm
and offset
:
<- ships[ships$service != 0,]
ships
<- glm(incidents ~ type + offset(log(service)),
m data = ships,
family = poisson)
summary(m)
Model diagnostics can be performed as in the previous sections.
Rate models are usually interpreted in terms of the rate ratios \(e^{\beta_j}\), which describe the multiplicative increases of the intensity of rates when \(x_j\) is increased by one unit. To compute the rate ratios for our model, we use exp
:
exp(coef(m))
\[\sim\]
ships
data.
8.3.7 Bayesian estimation of generalised linear models
We can fit a Bayesian GLM with the rstanarm
package, using stan_glm
in the same way we did for linear models. Let’s look at an example with the wine
data. First, we load and prepare the data:
# Import data about white and red wines:
<- read.csv("https://tinyurl.com/winedata1",
white sep = ";")
<- read.csv("https://tinyurl.com/winedata2",
red sep = ";")
$type <- "white"
white$type <- "red"
red<- rbind(white, red)
wine $type <- factor(wine$type) wine
Now, we fit a Bayesian logistic regression model:
library(rstanarm)
<- stan_glm(type ~ pH + alcohol, data = wine, family = binomial)
m
# Print the estimates:
coef(m)
Next, we can plot the posterior distributions of the effects:
plot(m, "dens", pars = names(coef(m)))
To get 95 % credible intervals for the effects, we can use posterior_interval
. We can also use plot
to visualise them:
posterior_interval(m,
pars = names(coef(m)),
prob = 0.95)
plot(m, "intervals",
pars = names(coef(m)),
prob = 0.95)
Finally, we can use \(\hat{R}\) to check model convergence. It should be less than 1.1 if the fitting has converged:
plot(m, "rhat")
8.4 Mixed models
Mixed models are used in regression problems where measurements have been made on clusters of related units. As the first example of this, we’ll use a dataset from the lme4
package, which also happens to contain useful methods for mixed models. Let’s install it:
install.packages("lme4")
The sleepstudy
dataset from lme4
contains data from a study on reaction times in a sleep deprivation study. The participants were restricted to 3 hours of sleep per night, and their average reaction time on a series of tests were measured each day during the 9 days that the study lasted:
library(lme4)
?sleepstudystr(sleepstudy)
Let’s start our analysis by making boxplots showing reaction times for each subject. We’ll also superimpose the observations for each participant on top of their boxplots:
library(ggplot2)
ggplot(sleepstudy, aes(Subject, Reaction)) +
geom_boxplot() +
geom_jitter(aes(colour = Subject),
position = position_jitter(0.1))
We are interested in finding out if the reaction times increase when the participants have been starved for sleep for a longer period. Let’s try plotting reaction times against days, adding a regression line:
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
geom_smooth(method = "lm", colour = "black", se = FALSE)
As we saw in the boxplots, and can see in this plot too, some participants always have comparatively high reaction times, whereas others always have low values. There are clear differences between individuals, and the measurements for each individual will be correlated. This violates a fundamental assumption of the traditional linear model, namely that all observations are independent.
In addition to this, it also seems that the reaction times change in different ways for different participants, as can be seen if we facet the plot by test subject:
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~ Subject, nrow = 3) +
geom_smooth(method = "lm", colour = "black", se = FALSE)
Both the intercept and the slope of the average reaction time differs between individuals. Because of this, the fit given by the single model can be misleading. Moreover, the fact that the observations are correlated will cause problems for the traditional intervals and tests. We need to take this into account when we estimate the overall intercept and slope.
One approach could be to fit a single model for each subject. That doesn’t seem very useful though. We’re not really interested in these particular test subjects, but in how sleep deprivation affects reaction times in an average person. It would be much better to have a single model that somehow incorporates the correlation between measurements made on the same individual. That is precisely what a linear mixed regression model does.
8.4.1 Fitting a linear mixed model
A linear mixed model (LMM) has two types of effects (explanatory variables):
- Fixed effects, which are non-random. These are usually the variables of primary interest in the data. In the
sleepstudy
example,Days
is a fixed effect. - Random effects, which represent nuisance variables that cause measurements to be correlated. These are usually not of interest in and of themselves, but are something that we need to include in the model to account for correlations between measurements. In the
sleepstudy
example,Subject
is a random effect.
Linear mixed models can be fitted using lmer
from the lme4
package. The syntax is the same as for lm
, with the addition of random effects. These can be included in different ways. Let’s have a look at them.
First, we can include a random intercept, which gives us a model where the intercept (but not the slope) varies between test subjects. In our example, the formula for this is:
library(lme4)
<- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy) m1
Alternatively, we could include a random slope in the model, in which case the slope (but not the intercept) varies between test subjects. The formula would be:
<- lmer(Reaction ~ Days + (0 + Days|Subject), data = sleepstudy) m2
Finally, we can include both a random intercept and random slope in the model. This can be done in two different ways, as we can model the intercept and slope as being correlated or uncorrelated:
# Correlated random intercept and slope:
<- lmer(Reaction ~ Days + (1 + Days|Subject), data = sleepstudy)
m3
# Uncorrelated random intercept and slope:
<- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
Which model should we choose? Are the intercepts and slopes correlated? It could of course be the case that individuals with a high intercept have a smaller slope - or a greater slope! To find out, we can fit different linear models to each subject, and then make a scatterplot of their intercepts and slopes. To fit a model to each subject, we use split
and map
as in Section 8.1.11:
# Collect the coefficients from each linear model:
library(purrr)
%>% split(.$Subject) %>%
sleepstudy map(~ lm(Reaction ~ Days, data = .)) %>%
map(coef) -> coefficients
# Convert to a data frame:
<- data.frame(matrix(unlist(coefficients),
coefficients nrow = length(coefficients),
byrow = TRUE),
row.names = names(coefficients))
names(coefficients) <- c("Intercept", "Days")
# Plot the coefficients:
ggplot(coefficients, aes(Intercept, Days,
colour = row.names(coefficients))) +
geom_point() +
geom_smooth(method = "lm", colour = "black", se = FALSE) +
labs(fill = "Subject")
# Test the correlation:
cor.test(coefficients$Intercept, coefficients$Days)
The correlation test is not significant, and judging from the plot, there is little indication that the intercept and slope are correlated. We saw earlier that both the intercept and the slope seem to differ between subjects, and so m4
seems like the best choice here. Let’s stick with that, and look at a summary table for the model.
summary(m4, correlation = FALSE)
I like to add correlation = FALSE
here, which suppresses some superfluous output from summary
.
You’ll notice that unlike the summary
table for linear models, there are no p-values! This is a deliberate design choice from the lme4
developers, who argue that the approximate test available aren’t good enough for small sample sizes (Bates et al., 2015).
Using the bootstrap, as we will do in Section 8.4.3, is usually the best approach for mixed models. If you really want some quick p-values, you can load the lmerTest
package, which adds p-values computed using the Satterthwaite approximation (Kuznetsova et al., 2017). This is better than the usual approximate test, but still not perfect.
install.packages("lmerTest")
library(lmerTest)
<- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
summary(m4, correlation = FALSE)
If we need to extract the model coefficients, we can do so using fixef
(for the fixed effects) and ranef
(for the random effects):
fixef(m4)
ranef(m4)
If we want to extract the variance components from the model, we can use VarCorr
:
VarCorr(m4)
Let’s add the lines from the fitted model to our facetted plot, to compare the results of our mixed model to the lines that were fitted separately for each individual:
<- coef(m4)$Subject
mixed_mod $Subject <- row.names(mixed_mod)
mixed_mod
ggplot(sleepstudy, aes(Days, Reaction)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~ Subject, nrow = 3) +
geom_smooth(method = "lm", colour = "cyan", se = FALSE,
size = 0.8) +
geom_abline(aes(intercept = `(Intercept)`, slope = Days,
color = "magenta"),
data = mixed_mod, size = 0.8)
Notice that the lines differ. The intercept and slopes have been shrunk toward the global effects, i.e. toward the average of all lines.
\[\sim\]
Exercise 8.22 Consider the Oxboys
data from the nlme
package. Does a mixed model seem appropriate here? If so, is the intercept and slope for different subjects correlated? Fit a suitable model, with height
as the response variable.
Save the code for your model, as you will return to it in the next few exercises.
(Click here to go to the solution.)
Exercise 8.23 The broom.mixed
package allows you to get summaries of mixed models as data frames, just as broom
does for linear and generalised linear models. Install it and use it to get the summary table for the model for the Oxboys
data that you created in the previous exercise. How are fixed and random effects included in the table?
8.4.2 Model diagnostics
As for any linear model, residual plots are useful for diagnostics for linear mixed models. Of particular interest are signs of heteroscedasticity, as homoscedasticity is assumed in the mixed model. We’ll use fortify.merMod
to turn the model into an object that can be used with ggplot2
, and then create some residual plots:
library(ggplot2)
<- fortify.merMod(m4)
fm4
# Plot residuals:
ggplot(fm4, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
xlab("Fitted values") + ylab("Residuals")
# Compare the residuals of different subjects:
ggplot(fm4, aes(Subject, .resid)) +
geom_boxplot() +
coord_flip() +
ylab("Residuals")
# Observed values versus fitted values:
ggplot(fm4, aes(.fitted, Reaction)) +
geom_point(colour = "blue") +
facet_wrap(~ Subject, nrow = 3) +
geom_abline(intercept = 0, slope = 1) +
xlab("Fitted values") + ylab("Observed values")
## Q-Q plot of residuals:
ggplot(fm4, aes(sample = .resid)) +
geom_qq() + geom_qq_line()
## Q-Q plot of random effects:
ggplot(ranef(m4)$Subject, aes(sample = `(Intercept)`)) +
geom_qq() + geom_qq_line()
ggplot(ranef(m4)$Subject, aes(sample = `Days`)) +
geom_qq() + geom_qq_line()
The normality assumption appears to be satisfied, but there are some signs of heteroscedasticity in the boxplots of the residuals for the different subjects.
\[\sim\]
Exercise 8.24 Return to your mixed model for the Oxboys
data from Exercise 8.22. Make diagnostic plots for the model. Are there any signs of heteroscedasticity or non-normality?
8.4.3 Bootstrapping
Summary tables, including p-values, for the fixed effects are available through boot_summary
:
library(boot.pval)
boot_summary(m4, type = "perc")
boot_summary
calls a function called bootMer
, which performs parametric resampling from the model. In case you want to call it directly, you can do as follows:
<- bootMer(m4, fixef, nsim = 999)
boot_res
library(boot)
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Days
8.4.4 Nested random effects and multilevel/hierarchical models
In many cases, a random factor is nested within another. To see an example of this, consider the Pastes
data from lme4
:
library(lme4)
?Pastesstr(Pastes)
We are interested in the strength
of a chemical product. There are ten delivery batches (batch
), and three casks within each delivery (cask
). Because of variations in manufacturing, transportation, storage, and so on, it makes sense to include random effects for both batch
and cask
in a linear mixed model. However, each cask only appears within a single batch, which makes the cask
effect nested within batch
.
Models that use nested random factors are commonly known as multilevel models (the random factors exist at different “levels”), or hierarchical models (there is a hierarchy between the random factors). These aren’t really any different from other mixed models, but depending on how the data is structured, we may have to be a bit careful to get the nesting right when we fit the model with lmer
.
If the two effects weren’t nested, we could fit a model using:
# Incorrect model:
<- lmer(strength ~ (1|batch) + (1|cask),
m1 data = Pastes)
summary(m1, correlation = FALSE)
However, because the casks are labelled a
, b
, and c
within each batch, we’ve now fitted a model where casks from different batches are treated as being equal! To clarify that the labels a
, b
, and c
belong to different casks in different batches, we need to include the nesting in our formula. This is done as follows:
# Cask in nested within batch:
<- lmer(strength ~ (1|batch/cask),
m2 data = Pastes)
summary(m2, correlation = FALSE)
Equivalently, we can also use:
<- lmer(strength ~ (1|batch) + (1|batch:cask),
m3 data = Pastes)
summary(m3, correlation = FALSE)
8.4.5 ANOVA with random effects
The lmerTest
package provides ANOVA tables that allow us to use random effects in ANOVA models. To use it, simply load lmerTest
before fitting a model with lmer
, and then run anova(m, type = "III")
(or replace III
with II
or I
if you want a type II or type I ANOVA table instead).
As an example, consider the TVbo
data from lmerTest
. 3 types of TV sets were compared by 8 assessors for 4 different pictures. To see if there is a difference in the mean score for the colour balance of the TV sets, we can fit a mixed model. We’ll include a random intercept for the assessor. This is a balanced design (in which case the results from all three types of tables coincide):
library(lmerTest)
# TV data:
?TVbo
# Fit model with both fixed and random effects:
<- lmer(Colourbalance ~ TVset*Picture + (1|Assessor),
m data = TVbo)
# View fitted model:
m
# All three types of ANOVA table give the same results here:
anova(m, type = "III")
anova(m, type = "II")
anova(m, type = "I")
The interaction effect is significant at the 5 % level. As for other ANOVA models, we can visualise this with an interaction plot:
interaction.plot(TVbo$TVset, TVbo$Picture,
response = TVbo$Colourbalance)
\[\sim\]
Exercise 8.25 Fit a mixed effects ANOVA to the TVbo
data, using Coloursaturation
as the response variable, TVset
and Picture
as fixed effects, and Assessor
as a random effect. Does there appear to be a need to include the interaction between Assessor
and TVset
as a random effect? If so, do it.
8.4.6 Generalised linear mixed models
Everything that we have just done for the linear mixed models carries over to generalised linear mixed models (GLMM), which are GLM’s with both fixed and random effects.
A common example is the item response model, which plays an important role in psychometrics. This model is frequently used in psychological tests containing multiple questions or sets of questions (“items”), where both the subject the item are considered random effects. As an example, consider the VerbAgg
data from lme4
:
library(lme4)
?VerbAggView(VerbAgg)
We’ll use the binary version of the response, r2
, and fit a logistic mixed regression model to the data, to see if it can be used to explain the subjects’ responses. The formula syntax is the same as for linear mixed models, but now we’ll use glmer
to fit a GLMM. We’ll include Anger
and Gender
as fixed effects (we are interested in seeing how these affect the response) and item
and id
as random effects with random slopes (we believe that answers to the same item and answers from the same individual may be correlated):
<- glmer(r2 ~ Anger + Gender + (1|item) + (1|id),
m data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
We can plot the fitted random effects for item
to verify that there appear to be differences between the different items:
<- coef(m)$item
mixed_mod $item <- row.names(mixed_mod)
mixed_mod
ggplot(mixed_mod, aes(`(Intercept)`, item)) +
geom_point() +
xlab("Random intercept")
The situ
variable, describing situation type, also appears interesting. Let’s include it as a fixed effect. Let’s also allow different situational (random) effects for different respondents. It seems reasonable that such responses are random rather than fixed (as in the solution to Exercise 8.25), and we do have repeated measurements of these responses. We’ll therefore also include situ
as a random effect nested within id
:
<- glmer(r2 ~ Anger + Gender + situ + (1|item) + (1|id/situ),
m data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
Finally, we’d like to obtain bootstrap confidence intervals for fixed effects. Because this is a fairly large dataset (\(n=7,584\)) this can take a looong time to run, so stretch your legs and grab a cup of coffee or two while you wait:
library(boot.pval)
boot_summary(m, type = "perc", R = 100)
# Ideally, R should be greater, but for the sake of
# this example, we'll use a low number.
\[\sim\]
Exercise 8.26 Consider the grouseticks
data from the lme4
package (Elston et al., 2001). Fit a mixed Poisson regression model to the data, with TICKS
as the response variable and YEAR
and HEIGHT
as fixed effects. What variables are suitable to use for random effects? Compute a bootstrap confidence interval for the effect of HEIGHT
.
8.4.7 Bayesian estimation of mixed models
From a numerical point of view, using Bayesian modelling with rstanarm
is preferable to frequentist modelling with lme4
if you have complex models with many random effects. Indeed, for some models, lme4
will return a warning message about a singular fit, basically meaning that the model is too complex, whereas rstanarm
, powered by the use of a prior distribution, always will return a fitted model regardless of complexity.
After loading rstanarm
, fitting a Bayesian linear mixed model with a weakly informative prior is as simple as substituting lmer
with stan_lmer
:
library(lme4)
library(rstanarm)
<- stan_lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
m4 data = sleepstudy)
# Print the results:
m4
To plot the posterior distributions for the coefficients of the fixed effects, we can use plot
, specifying which effects we are interested in using pars
:
plot(m4, "dens", pars = c("(Intercept)", "Days"))
To get 95 % credible intervals for the fixed effects, we can use posterior_interval
as follows:
posterior_interval(m4,
pars = c("(Intercept)", "Days"),
prob = 0.95)
We can also plot them using plot
:
plot(m4, "intervals",
pars = c("(Intercept)", "Days"),
prob = 0.95)
Finally, we’ll check that the model fitting has converged:
plot(m4, "rhat")
8.5 Survival analysis
Many studies are concerned with the duration of time until an event happens: time until a machine fails, time until a patient diagnosed with a disease dies, and so on. In this section we will consider some methods for survival analysis (also known as reliability analysis in engineering and duration analysis in economics), which is used for analysing such data. The main difficulty here is that studies often end before all participants have had events, meaning that some observations are right-censored - for these observations, we don’t know when the event happened, but only that it happened after the end of the study.
The survival
package contains a number of useful methods for survival analysis. Let’s install it:
install.packages("survival")
We will study the lung cancer data in lung
:
library(survival)
?lungView(lung)
The survival times of the patients consist of two parts: time
(the time from diagnosis until either death or the end of the study) and status
(1 if the observations is censored, 2 if the patient died before the end of the study). To combine these so that they can be used in a survival analysis, we must create a Surv
object:
Surv(lung$time, lung$status)
Here, a +
sign after a value indicates right-censoring.
8.5.1 Comparing groups
Survival times are best visualised using Kaplan-Meier curves that show the proportion of surviving patients. Let’s compare the survival times of women and men. We first fit a survival model using survfit
, and then draw the Kaplan-Meier curve (with parametric confidence intervals) using autoplot
from ggfortify
:
library(ggfortify)
library(survival)
<- survfit(Surv(time, status) ~ sex, data = lung)
m autoplot(m)
To print the values for the survival curves at different time points, we can use summary
:
summary(m)
To test for differences between two groups, we can use the logrank test (also known as the Mantel-Cox test), given by survfit
:
survdiff(Surv(time, status) ~ sex, data = lung)
Another option is the Peto-Peto test, which puts more weight on early events (deaths, in the case of the lung
data), and therefore is suitable when such events are of greater interest. In contrast, the logrank test puts equal weights on all events regardless of when they occur. The Peto-Peto test is obtained by adding the argument rho = 1
:
survdiff(Surv(time, status) ~ sex, rho = 1, data = lung)
The Hmisc
package contains a function for obtaining confidence intervals based on the Kaplan-Meier estimator, called bootkm
. This allows us to get confidence intervals for the quantiles (including the median) of the survival distribution for different groups, as well as for differences between the quantiles of different groups. First, let’s install it:
install.packages("Hmisc")
We can now use bootkm
to compute bootstrap confidence intervals for survival times based on the lung
data. We’ll compute an interval for the median survival time for females, and one for the difference in median survival time between females and males:
library(Hmisc)
# Create a survival object:
<- Surv(lung$time, lung$status)
survobj
# Get bootstrap replicates of the median survival time for
# the two groups:
<- bootkm(survobj[lung$sex == 2],
median_surv_time_female q = 0.5, B = 999)
<- bootkm(survobj[lung$sex == 1],
median_surv_time_male q = 0.5, B = 999)
# 95 % bootstrap confidence interval for the median survival time
# for females:
quantile(median_surv_time_female,
c(.025,.975), na.rm=TRUE)
# 95 % bootstrap confidence interval for the difference in median
# survival time:
quantile(median_surv_time_female - median_surv_time_male,
c(.025,.975), na.rm=TRUE)
To obtain confidence intervals for other quantiles, we simply change the argument q
in bootkm
.
Exercise 8.27 Consider the ovarian
data from the survival
package. Plot Kaplan-Meier curves comparing the two treatment groups. Compute a bootstrap confidence interval for the difference in the 75 % quantile for the survival time for the two groups.
8.5.2 The Cox proportional hazards model
The hazard function, or hazard rate, is the rate of events at time \(t\) if a subject has survived until time \(t\). The higher the hazard, the greater the probability of an event. Hazard rates play an integral part in survival analysis, particularly in regression models. To model how the survival times are affected by different explanatory variables, we can use a Cox proportional hazards model (Cox, 1972), fitted using coxph
:
<- coxph(Surv(time, status) ~ age + sex, data = lung)
m summary(m)
The exponentiated coefficients show the hazard ratios, i.e. the relative increases (values greater than 1) or decreases (values below 1) of the hazard rate when a covariate is increased one step while all others are kept fixed:
exp(coef(m))
In this case, the hazard increases with age (multiply the hazard by 1.017 for each additional year that the person has lived), and is lower for women (sex=2
) than for men (sex=1
).
The censboot_summary
function from boot.pval
provides a table of estimates, bootstrap confidence intervals, and bootstrap p-values for the model coefficients. The coef
argument can be used to specify whether to print confidence intervals for the coefficients or for the exponentiated coeffientes (i.e. the hazard ratios):
# censboot_summary requires us to use model = TRUE
# when fitting our regression model:
<- coxph(Surv(time, status) ~ age + sex,
m data = lung, model = TRUE)
library(boot.pval)
# Original coefficients:
censboot_summary(m, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m, type = "perc", coef = "exp")
To manually obtain bootstrap confidence intervals for the exponentiated coefficients, we can use the censboot
function from boot
as follows:
# Function to get the bootstrap replicates of the exponentiated
# coefficients:
<- function(data, formula) {
boot_fun <- coxph(formula, data = data)
m_boot return(exp(coef(m_boot)))
}
# Run the resampling:
library(boot)
<- censboot(lung[,c("time", "status", "age", "sex")],
boot_res R = 999,
boot_fun, formula =
formula(Surv(time, status) ~ age + sex))
# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 1) # Age
boot.ci(boot_res, type = "perc", index = 2) # Sex
As the name implies, the Cox proportional hazards model relies on the assumption of proportional hazards, which essentially means that the effect of the explanatory variables is constant over time. This can be assessed visually by plotting the model residuals, using cox.zph
and the ggcoxzph
function from the survminer
package. Specifically, we will plot the scaled Schoenfeld (1982) residuals, which measure the difference between the observed covariates and the expected covariates given the risk at the time of an event. If the proportional hazards assumption holds, then there should be no trend over time for these residuals. Use the trend line to aid the eye:
install.packages("survminer")
library(survminer)
ggcoxzph(cox.zph(m), var = 1) # age
ggcoxzph(cox.zph(m), var = 2) # sex
# Formal p-values for a test of proportional
# hazards, for each variable:
cox.zph(m)
In this case, there are no apparent trends over time (which is in line with the corresponding formal hypothesis tests), indicating that the proportional hazards model could be applicable here.
\[\sim\]
Exercise 8.28 Consider the ovarian
data from the survival
package.
- Use a Cox proportional hazards regression to test whether there is a difference between the two treatment groups, adjusted for age.
- Compute bootstrap confidence interval for the hazard ratio of age.
(Click here to go to the solution.)
Exercise 8.29 Consider the retinopathy
data from the survival
package. We are interested in a mixed survival model, where id
is used to identify patients and type
, trt
and age
are fixed effects. Fit a mixed Cox proportional hazards regression (add cluster = id
to the call to coxph
to include this as a random effect). Is the assumption of proportional hazards fulfilled?
8.5.3 Accelerated failure time models
In many cases, the proportional hazards assumption does not hold. In such cases we can turn to accelerated failure time models (Wei, 1992), for which the effect of covariates is to accelerate or decelerate the life course of a subject.
While the proportional hazards model is semiparametric, accelerated failure time models are typically fully parametric, and thus involve stronger assumptions about an underlying distribution. When fitting such a model using the survreg
function, we must therefore specify what distribution to use. Two common choices are the Weibull distribution and the log-logistic distribution. The Weibull distribution is commonly used in engineering, e.g. in reliability studies. The hazard function of Weibull models is always monotonic, i.e. either always increasing or always decreasing. In contrast, the log-logistic distribution allows the hazard function to be non-monotonic, making it more flexible, and often more appropriate for biological studies. Let’s fit both types of models to the lung
data and have a look at the results:
library(survival)
# Fit Weibull model:
<- survreg(Surv(time, status) ~ age + sex, data = lung,
m_w dist = "weibull", model = TRUE)
summary(m_w)
# Fit log-logistic model:
<- survreg(Surv(time, status) ~ age + sex, data = lung,
m_ll dist = "loglogistic", model = TRUE)
summary(m_ll)
Interpreting the coefficients of accelerated failure time models is easier than interpreting coefficients from proportional hazards models. The exponentiated coefficients show the relative increase or decrease in the expected survival times when a covariate is increased one step while all others are kept fixed:
exp(coef(m_ll))
In this case, according to the log-logistic model, the expected survival time decreases by 1.4 % (i.e. multiply by \(0.986\)) for each additional year that the patient has lived. The expected survival time for females (sex=2
) is 61.2 % higher than for males (multiply by \(1.612\)).
To obtain bootstrap confidence intervals and p-values for the effects, we follow the same procedure as for the Cox model, using censboot_summary
. Here is an example for the log-logistic accelerated failure time model:
library(boot.pval)
# Original coefficients:
censboot_summary(m_ll, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m_ll, type = "perc", coef = "exp")
We can also use censboot
:
# Function to get the bootstrap replicates of the exponentiated
# coefficients:
<- function(data, formula, distr) {
boot_fun <- survreg(formula, data = data, dist = distr)
m_boot return(exp(coef(m_boot)))
}
# Run the resampling:
library(boot)
<- censboot(lung[,c("time", "status", "age", "sex")],
boot_res R = 999,
boot_fun, formula =
formula(Surv(time, status) ~ age + sex),
distr = "loglogistic")
# Compute the percentile bootstrap confidence intervals:
boot.ci(boot_res, type = "perc", index = 2) # Age
boot.ci(boot_res, type = "perc", index = 3) # Sex
\[\sim\]
Exercise 8.30 Consider the ovarian
data from the survival
package. Fit a log-logistic accelerated failure time model to the data, using all available explanatory variables. What is the estimated difference in survival times between the two treatment groups?
8.5.4 Bayesian survival analysis
At the time of this writing, the latest release of rstanarm
does not contain functions for fitting survival analysis models. You can check whether this still is the case by running ?stan_surv
in the Console. If you don’t find the documentation for the stan_surv
function, you will have to install the development version of the package from GitHub (which contains such functions), using the following code:
# Check if the devtools package is installed, and start
# by installing it otherwise:
if (!require(devtools)) {
install.packages("devtools")
}library(devtools)
# Download and install the development version of the package:
install_github("stan-dev/rstanarm", build_vignettes = FALSE)
Now, let’s have a look at how to fit a Bayesian model to the lung
data from survival
:
library(survival)
library(rstanarm)
# Fit proportional hazards model using cubic M-splines (similar
# but not identical to the Cox model!):
<- stan_surv(Surv(time, status) ~ age + sex, data = lung)
m m
Fitting a survival model with a random effect works similarly, and uses the same syntax as lme4
. Here is an example with the retinopathy
data:
<- stan_surv(Surv(futime, status) ~ age + type + trt + (1|id),
m data = retinopathy)
m
8.5.5 Multivariate survival analysis
Some trials involve multiple time-to-event outcomes that need to be assessed simultaneously in a multivariate analysis. Examples includes studies of the time until each of several correlated symptoms or comorbidities occur. This is analogous to the multivariate testing problem of Section 7.2.6, but with right-censored data. To test for group differences for a vector of right-censored outcomes, a multivariate version of the logrank test described in Persson et al. (2019) can be used. It is available through the MultSurvTests
package:
install.packages("MultSurvTests")
As an example, we’ll use the diabetes
dataset from MultSurvTest
. It contains two time-to-event outcomes: time until blindness in a treated eye and in an untreated eye.
library(MultSurvTests)
# Diabetes data:
?diabetes
We’ll compare two groups that received two different treatments. The survival times (time until blindness) and censoring statuses of the two groups are put in a matrices called z
and z.delta
, which are used as input for the test function perm_mvlogrank
:
# Survival times for the two groups:
<- as.matrix(subset(diabetes, LASER==1)[,c(6,8)])
x <- as.matrix(subset(diabetes, LASER==2)[,c(6,8)])
y
# Censoring status for the two groups:
<- as.matrix(subset(diabetes, LASER==1)[,c(7,9)])
delta.x <- as.matrix(subset(diabetes, LASER==2)[,c(7,9)])
delta.y
# Create the input for the test:
<- rbind(x, y)
z <- rbind(delta.x, delta.y)
delta.z
# Run the test with 499 permutations:
perm_mvlogrank(B = 499, z, delta.z, n1 = nrow(x))
8.5.6 Power estimates for the logrank test
The spower
function in Hmisc
can be used to compute the power of the univariate logrank test in different scenarios using simulation. The helper functions Weibull2
, Lognorm2
, and Gompertz2
can be used to define Weibull, lognormal and Gomperts distributions to sample from, using survival probabilities at different time points rather than the traditional parameters of those distributions. We’ll look at an example involving the Weibull distribution here. Additional examples can be found in the function’s documentation (?spower
).
Let’s simulate the power of a 3-year follow-up study with two arms (i.e. two groups, control and intervention). First, we define a Weibull distribution for (compliant) control patients. Let’s say that their 1-year survival is 0.9 and their 3-year survival is 0.6. To define a Weibull distribution that corresponds to these numbers, we use Weibull2
as follows:
<- Weibull2(c(1, 3), c(.9, .6)) weib_dist
We’ll assume that the treatment has no effect for the first 6 months, and that it then has a constant effect, leading to a hazard ratio of 0.75 (so the hazard ratio is 1 if the time in years is less than or equal to 0.5, and 0.75 otherwise). Moreover, we’ll assume that there is a constant drop-out rate, such that 20 % of the patients can be expected to drop out during the three years. Finally, there is no drop-in. We define a function to simulate survival times under these conditions:
# In the functions used to define the hazard ratio, drop-out
# and drop-in, t denotes time in years:
<- Quantile2(weib_dist,
sim_func hratio = function(t) { ifelse(t <= 0.5, 1, 0.75) },
dropout = function(t) { 0.2*t/3 },
dropin = function(t) { 0 })
Next, we define a function for the censoring distribution, which is assumed to be the same for both groups. Let’s say that each follow-up is done at a random time point between 2 and 3 years. We’ll therefore use a uniform distribution on the interval \((2,3)\) for the censoring distribution:
<- function(n)
rcens
{runif(n, 2, 3)
}
Finally, we define two helper functions required by spower
and then run the simulation study. The output is the simulated power using the settings that we’ve just created.
# Define helper functions:
<- function(n) { sim_func(n, "control") }
rcontrol <- function(n) { sim_func(n, "intervention") }
rinterv
# Simulate power when both groups have sample size 300:
spower(rcontrol, rinterv, rcens, nc = 300, ni = 300,
test = logrank, nsim = 999)
# Simulate power when both groups have sample size 450:
spower(rcontrol, rinterv, rcens, nc = 450, ni = 450,
test = logrank, nsim = 999)
# Simulate power when the control group has size 100
# and the intervention group has size 300:
spower(rcontrol, rinterv, rcens, nc = 100, ni = 300,
test = logrank, nsim = 999)
8.6 Left-censored data and nondetects
Survival data is typically right-censored. Left-censored data, on the other hand, is common in medical research (e.g. in biomarker studies) and environmental chemistry (e.g. measurements of chemicals in water), where some measurements fall below the laboratory’s detection limits (or limit of detection, LoD). Such data also occur in studies in economics. A measurement below the detection limit, a nondetect, is still more informative than having no measurement at all - we may not know the exact value, but we know that the measurement is below a given threshold.
In principle, all methods that are applicable to survival analysis can also be used for left-censored data (although the interpretation of coefficients and parameters may differ), but in practice the distributions of lab measurements and economic variables often differ from those that typically describe survival times. In this section we’ll look at methods tailored to the kind of left-censored data that appears in applications in the aforementioned fields.
8.6.1 Estimation
The EnvStats
package contains a number of functions that can be used to compute descriptive statistics and estimating parameters of distributions from data with nondetects. Let’s install it:
install.packages("EnvStats")
Estimates of the mean and standard deviation of a normal distribution that take the censoring into account in the right way can be obtained with enormCensored
, which allows us to use several different estimators (the details surrounding the available estimators can be found using ?enormCensored
). Analogous functions are available for other distributions, for instance elnormAltCensored
for the lognormal distribution, egammaCensored
for the gamma distribution, and epoisCensored
for the Poisson distribution.
To illustrate the use of enormCensored
, we will generate data from a normal distribution. We know the true mean and standard deviation of the distribution, and can compute the estimates for the generated sample. We will then pretend that there is a detection limit for this data, and artificially left-censor about 20 % of it. This allows us to compare the estimates for the full sample and the censored sample, to see how the censoring affects the estimates. Try running the code below a few times:
# Generate 50 observations from a N(10, 9)-distribution:
<- rnorm(50, 10, 3)
x
# Estimate the mean and standard deviation:
<- mean(x)
mean_full <- sd(x)
sd_full
# Censor all observations below the "detection limit" 8
# and replace their values by 8:
<- x<8
censored <- 8
x[censored]
# The proportion of censored observations is:
mean(censored)
# Estimate the mean and standard deviation in a naive
# manner, using the ordinary estimators with all
# nondetects replaced by 8:
<- mean(x)
mean_cens_naive <- sd(x)
sd_cens_naive
# Estimate the mean and standard deviation using
# different estimators that take the censoring
# into account:
library(EnvStats)
# Maximum likelihood estimate:
<- enormCensored(x, censored,
estimates_mle method = "mle")
# Biased-corrected maximum likelihood estimate:
<- enormCensored(x, censored,
estimates_bcmle method = "bcmle")
# Regression on order statistics, ROS, estimate:
<- enormCensored(x, censored,
estimates_ros method = "ROS")
# Compare the different estimates:
mean_full; sd_full
mean_cens_naive; sd_cens_naive$parameters
estimates_mle$parameters
estimates_bcmle$parameters estimates_ros
The naive estimators tend to be biased for data with nondetects (sometimes very biased!). Your mileage may vary depending on e.g. the sample size and the amount of censoring, but in general, the estimators that take censoring into account will fare much better.
After we have obtained estimates for the parameters of the normal distribution, we can plot the data against the fitted distribution to check the assumption of normality:
library(ggplot2)
# Compare to histogram, including a bar for nondetects:
ggplot(data.frame(x), aes(x)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = estimates_mle$parameters[1],
sd = estimates_mle$parameters[2]))
# Compare to histogram, excluding nondetects:
<- x[!censored]
x_noncens ggplot(data.frame(x_noncens), aes(x_noncens)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = estimates_mle$parameters[1],
sd = estimates_mle$parameters[2]))
To obtain percentile and BCa bootstrap confidence intervals for the mean, we can add the options ci = TRUE
and ci.method = "bootstrap"
:
# Using 999 bootstrap replicates:
enormCensored(x, censored, method = "mle",
ci = TRUE, ci.method = "bootstrap",
n.bootstraps = 999)$interval$limits
\[\sim\]
Exercise 8.31 Download the il2rb.csv
data from the book’s web page. It contains measurements of the biomarker IL-2RB made in serum samples from two groups of patients. The values that are missing are in fact nondetects, with detection limit 0.25.
Under the assumption that the biomarker levels follow a lognormal distribution, compute bootstrap confidence intervals for the mean of the distribution for the control group. What proportion of the data is left-censored?
8.6.2 Tests of means
When testing the difference between two groups’ means, nonparametric tests like the Wilcoxon-Mann-Whitney test often perform very well for data with nondetects, unlike the t-test (Zhang et al., 2009). For data with a high degree of censoring (e.g. more than 50 %), most tests perform poorly. For multivariate tests of mean vectors the situation is the opposite, with Hotelling’s \(T^2\) (Section 7.2.6) being a much better option than nonparametric tests (Thulin, 2016).
\[\sim\]
Exercise 8.32 Return to the il2rb.csv
data from Exercise 8.32. Test the hypothesis that there is no difference in location between the two groups.
8.6.3 Censored regression
Censored regression models can be used when the response variable is censored. A common model in economics is the Tobit regression model (Tobin, 1958), which is a linear regression model with normal errors, tailored to left-censored data. It can be fitted using survreg
.
As an example, consider the EPA.92c.zinc.df
dataset available in EnvStats
. It contains measurements of zinc concentrations from five wells, made on 8 samples from each well, half of which are nondetects. Let’s say that we are interested in comparing these five wells (so that the wells aren’t random effects). Let’s also assume that the 8 samples were collected at different time points, and that we want to investigate whether the concentrations change over time. Such changes could be non-linear, so we’ll include the sample number as a factor. To fit a Tobit model to this data, we use survreg
as follows.
library(EnvStats)
.92c.zinc.df
?EPA
# Note that in Surv, in the vector describing censoring 0 means
# censoring and 1 no censoring. This is the opposite of the
# definition used in EPA.92c.zinc.df$Censored, so we use the !
# operator to change 0's to 1's and vice versa.
library(survival)
<- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
m data = EPA.92c.zinc.df, dist = "gaussian")
summary(m)
Similarly, we can fit a model under the assumption of lognormality:
<- survreg(Surv(Zinc, !Censored, type = "left") ~ Sample + Well,
m data = EPA.92c.zinc.df, dist = "lognormal")
summary(m)
Fitting regression models where the explanatory variables are censored is more challenging. For prediction, a good option is models based on decision trees, studied in Section 9.5. For testing whether there is a trend over time, tests based on Kendall’s correlation coefficient can be useful. EnvStats
provides two functions for this - kendallTrendTest
for testing a monotonic trend, and kendallSeasonalTrendTest
for testing a monotonic trend within seasons.
8.7 Creating matched samples
Matching is used to balance the distribution of explanatory variables in the groups that are being compared. This is often required in observational studies, where the treatment variable is not randomly assigned, but determined by some external factor(s) that may be related to the treatment. For instance, if you wish to study the effect of smoking on mortality, you can recruit a group of smokers and non-smokers and follow them for a few years. But both mortality and smoking are related to confounding variables such as age and gender, meaning that imbalances in the age and gender distributions of smokers and non-smokers can bias the results. There are several methods for creating balanced or matched samples that seek to mitigate this bias, including propensity score matching, which we’ll use here. The MatchIt
and optmatch
packages contain the functions that we need for this.
To begin with, let’s install the two packages:
install.packages(c("MatchIt", "optmatch"))
We will illustrate the use of the packages using the lalonde
dataset, that is shipped with the MatchIt
package:
library(MatchIt)
data(lalonde)
?lalondeView(lalonde)
Note that the data has row names, which are useful e.g. for identifying which individuals have been paired - we can access them using rownames(lalonde)
.
8.7.1 Propensity score matching
To perform automated propensity score matching, we will use the matchit
function, which computes propensity scores and then matches participants from the treatment and control groups using these. Matches can be found in several ways. We’ll consider two of them here. As input, the matchit
function takes a formula describing the treatment variable and potential confounders, what datasets to use, which method to use and what ratio of control to treatment participants to use.
A common method is nearest neighbour matching, where each participant is matched to the participant in the other group with the most similar propensity score. By default, it starts by finding a match for the participant in the treatment group that has the largest propensity score, then finds a match for the participant in the treatment groups with the second largest score, and so on. Two participants cannot be matched with the same participant in the control group. The nearest neighbour match is locally optimal in the sense that it find the best (still) available match for each participant in the treatment group, ignoring if that match in fact would be even better for another participant in the treatment group.
To perform propensity score matching using nearest neighbour matching with 1 match each, evaluate the results and then extract the matched samples we can use matchit
as follows:
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "nearest", ratio = 1)
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
To view the matched pairs, you can use:
$match.matrix matches
To view the values of the re78
variable of the matched pairs, use:
<- "re78"
varName <- lalonde[row.names(matches$match.matrix), varName]
resMatrix for(i in 1:ncol(matches$match.matrix))
{<- cbind(resMatrix, lalonde[matches$match.matrix[,i],
resMatrix
varName])
}rownames(resMatrix) <- row.names(matches$match.matrix)
View(resMatrix)
As an alternative to nearest neighbour-matching, optimal matching can be used. This is similar to nearest neighbour-matching, but strives to obtain globally optimal matches rather than locally optimal. This means that each participant in the treatment group is paired with a participant in the control group, while also taking into account how similar the latter participant is to other participants in the treatment group.
To perform propensity score matching using optimal matching with 2 matches each:
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "optimal", ratio = 2)
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
You may also want to find all controls that match participants in the treatment group exactly. This is called exact matching:
<- matchit(treat ~ re74 + re75 + age + educ + married,
matches data = lalonde, method = "exact")
summary(matches)
plot(matches)
plot(matches, type = "hist")
<- match.data(matches)
matched_data summary(matched_data)
Participants with no exact matches won’t be included in matched_data
.
8.7.2 Stepwise matching
At times you will want to combine the above approaches. For instance, you may want to have an exact match for age, and then an approximate match using the propensity scores for other variables. This is also achievable but requires the matching to be done in several steps. To first match the participant exactly on age and then 1-to-2 via nearest-neighbour propensity score matching on re74
and re75
we can use a loop:
# Match exactly one age:
<- matchit(treat ~ age, data = lalonde, method = "exact")
matches <- match.data(matches)
matched_data
# Match the first subclass 1-to-2 via nearest-neighbour propensity
# score matching:
<- matchit(treat ~ re74 + re75,
matches2 data = matched_data[matched_data$subclass == 1,],
method = "nearest", ratio = 2)
<- match.data(matches2, weights = "weights2",
matched_data2 subclass = "subclass2")
<- matches2$match.matrix
matchlist
# Match the remaining subclasses in the same way:
for(i in 2:max(matched_data$subclass))
{<- matchit(treat ~ re74 + re75,
matches2 data = matched_data[matched_data$subclass == i,],
method = "nearest", ratio = 2)
<- rbind(matched_data2, match.data(matches2,
matched_data2 weights = "weights2",
subclass = "subclass2"))
<- rbind(matchlist, matches2$match.matrix)
matchlist
}
# Check results:
View(matchlist)
View(matched_data2)
If the variables aren’t centred, the intercept is the expected value of the response variable when all explanatory variables are 0. This isn’t always realistic or meaningful.↩︎
On the contrary, doing so will usually only serve to make interpretation more difficult.↩︎
Prediction intervals provide interval estimates for the new observations. They incorporate both the uncertainty associated with our model estimates, and the fact that the new observation is likely to deviate slightly from its expected value.↩︎