6 R programming

The tools in Chapters 2-5 will allow you to manipulate, summarise and visualise your data in all sorts of ways. But what if you need to compute some statistic that there isn’t a function for? What if you need automatic checks of your data and results? What if you need to repeat the same analysis for a large number of files? This is where the programming tools you’ll learn about in this chapter, like loops and conditional statements, come in handy. And this is where you take the step from being able to use R for routine analyses to being able to use R for any analysis.

After working with the material in this chapter, you will be able to use R to:

  • Write your own R functions,
  • Use several new pipe operators,
  • Use conditional statements to perform different operations depending on whether or not a condition is satisfied,
  • Iterate code operations multiple times using loops,
  • Iterate code operations multiple times using functionals,
  • Measure the performance of your R code.

6.1 Functions

Suppose that we wish to compute the mean of a vector x. One way to do this would be to use sum and length:

x <- 1:100
# Compute mean:
sum(x)/length(x)

Now suppose that we wish to compute the mean of several vectors. We could do this by repeated use of sum and length:

x <- 1:100
y <- 1:200
z <- 1:300

# Compute means:
sum(x)/length(x)
sum(y)/length(y)
sum(z)/length(x)

But wait! I made a mistake when I copied the code to compute the mean of z - I forgot to change length(x) to length(z)! This is an easy mistake to make when you repeatedly copy and paste code. In addition, repeating the same code multiple times just doesn’t look good. It would be much more convenient to have a single function for computing the means. Fortunately, such a function exists - mean:

# Compute means
mean(x)
mean(y)
mean(z)

As you can see, using mean makes the code shorter and easier to read and reduces the risk of errors induced by copying and pasting code (we only have to change the argument of one function instead of two).

You’ve already used a ton of different functions in R: functions for computing means, manipulating data, plotting graphics, and more. All these functions have been written by somebody who thought that they needed to repeat a task (e.g. computing a mean or plotting a bar chart) over and over again. And in such cases, it is much more convenient to have a function that does that task than to have to write or copy code every time you want to do it. This is true also for your own work - whenever you need to repeat the same task several times, it is probably a good idea to write a function for it. It will reduce the amount of code you have to write and lessen the risk of errors caused by copying and pasting old code. In this section, you will learn how to write your own functions.

6.1.1 Creating functions

For the sake of the example, let’s say that we wish to compute the mean of several vectors but that the function mean doesn’t exist. We would therefore like to write our own function for computing the mean of a vector. An R function takes some variables as input (arguments or parameters) and returns an object. Functions are defined using function. The definition follows a particular format:

function_name <- function(argument1, argument2, ...)
{
      # ...
      # Some lines of code that create some_object
      # ...
      return(some_object)
}

In the case of our function for computing a mean, this could look like:

average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}

This defines a function called average, which takes an object called x as input. It computes the sum of the elements of x, divides that by the number of elements in x, and returns the resulting mean.

If we now make a call to average(x), our function will compute the mean value of the vector x. Let’s try it out, to see that it works:

x <- 1:100
y <- 1:200
average(x)
average(y)

6.1.2 Local and global variables

Note that despite the fact that the vector was called x in the code we used to define the function, average works regardless of whether the input is called x or y. This is because R distinguishes between global variables and local variables. A global variable is created in the global environment outside a function, and is available to all functions (these are the variables that you can see in the Environment panel in RStudio). A local variable is created in the local environment inside a function, and is only available to that particular function. For instance, our average function creates a variable called avg, yet when we attempt to access avg after running average this variable doesn’t seem to exist:

average(x)
avg

Because avg is a local variable, it is only available inside of the average function. Local variables take precedence over global variables inside the functions to which they belong. Because we named the argument used in the function x, x becomes the name of a local variable in average. As far as average is concerned, there is only one variable named x, and that is whatever object was given as input to the function, regardless of what its original name was. Any operations performed on the local variable x won’t affect the global variable x at all.
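
Here is a small sketch (with a hypothetical function) illustrating this:

double_x <- function(x)
{
      x <- 2*x   # Changes the local x only
      return(x)
}

x <- 10
double_x(x)   # Returns 20
x             # The global x is unchanged - still 10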

Functions can access global variables:

y_squared <- function()
{
      return(y^2)
}

y <- 2
y_squared()

But an assignment to a global variable inside a function won’t change its value in the global environment:

add_to_y <- function(n)
{
      y <- y + n
}

y <- 1
add_to_y(1)
y

Suppose you really need to change a global variable inside a function. In that case, you can use an alternative assignment operator, <<-, which assigns a value to the variable in the parent environment of the current environment. If you use <<- for assignment inside a function that is called from the global environment, the assignment takes place in the global environment. But if you use <<- in a function (function 1) that is called by another function (function 2), the assignment will take place in the environment of function 2, thus affecting a local variable in function 2. Here is an example of a global assignment using <<-:

add_to_y_global <- function(n)
{
      y <<- y + n
}

y <- 1
add_to_y_global(1)
y
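
To see that <<- assigns to the parent environment rather than always to the global environment, here is a small sketch with nested functions (a made-up example, not from the code above):

outer_fun <- function()
{
      y <- 10                              # Local variable in outer_fun
      inner_fun <- function() { y <<- 0 }  # Assigns to outer_fun's y
      inner_fun()
      return(y)
}

y <- 1
outer_fun()   # Returns 0
y             # The global y is unaffected - still 1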

6.1.3 Will your function work?

It is always a good idea to test if your function works as intended, and to try to figure out what can cause it to break. Let’s return to our average function:

average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}

We’ve already seen that it seems to work when the input x is a numeric vector. But what happens if we input something else instead?

average(c(1, 5, 8)) # Numeric input
average(c(TRUE, TRUE, FALSE)) # Logical input
average(c("Lady Gaga", "Tool", "Dry the River")) # Character input
average(data.frame(x = c(1, 1, 1), y = c(2, 2, 1))) # Numeric df
average(data.frame(x = c(1, 5, 8), y = c("A", "B", "C"))) # Mixed type

The first two of these render the desired output (the logical values being represented by 0’s and 1’s), but the rest don’t. Many R functions include checks that the input is of the correct type, or checks to see which method should be applied depending on what data type the input is. We’ll learn how to perform such checks in Section 6.3.

As a side note, it is possible to write functions that don’t end with return. In that case, the output (i.e. what would be written in the Console if you’d run the code there) from the last line of the function will automatically be returned. I prefer to (almost) always use return though, as it is easy to accidentally make the function return nothing by finishing it with a line that yields no output. Below are two examples of how we could have written average without a call to return. The first doesn’t work as intended, because the function’s final (and only) line doesn’t give any output.

average_bad <- function(x)
{
      avg <- sum(x)/length(x)
}

average_ok <- function(x)
{
      sum(x)/length(x)
}

average_bad(c(1, 5, 8))
average_ok(c(1, 5, 8))

6.1.4 More on arguments

It is possible to create functions with as many arguments as you like, but it will become quite unwieldy if the user has to supply too many arguments to your function. It is therefore common to provide default values for arguments, which is done by setting a value in the argument list of the function definition. Here is an example of a function that computes \(x^n\), using \(n=2\) as the default:

power_n <- function(x, n = 2)
{
      return(x^n)
}

If we don’t supply n, power_n uses the default n = 2:

power_n(3)

But if we supply an n, power_n will use that instead:

power_n(3, 1)
power_n(3, 3)

For clarity, you can specify which value corresponds to which argument:

power_n(x = 2, n = 5)

…and can then even put the arguments in the wrong order:

power_n(n = 5, x = 2)

However, if we only supply n we get an error, because there is no default value for x:

power_n(n = 5)

It is possible to pass a function as an argument. Here is a function that takes a vector and a function as input, and applies the function to the first two elements of the vector:

apply_to_first2 <- function(x, func)
{
      result <- func(x[1:2])
      return(result)
}

By supplying different functions to apply_to_first2, we can make it perform different tasks:

x <- c(4, 5, 6)
apply_to_first2(x, sqrt)
apply_to_first2(x, is.character)
apply_to_first2(x, power_n)

But what if the function that we supply requires additional arguments? Using apply_to_first2 with sum and the vector c(4, 5, 6) works fine:

apply_to_first2(x, sum)

But if we instead use the vector c(4, NA, 6) the function returns NA:

x <- c(4, NA, 6)
apply_to_first2(x, sum)

Perhaps we’d like to pass na.rm = TRUE to sum to ensure that we get a numeric result, if at all possible. This can be done by adding ... to the list of arguments for both functions, which indicates additional parameters (to be supplied by the user) that will be passed to func:

apply_to_first2 <- function(x, func, ...)
{
      result <- func(x[1:2], ...)
      return(result)
}

x <- c(4, NA, 6)
apply_to_first2(x, sum)
apply_to_first2(x, sum, na.rm = TRUE)

\[\sim\]

Exercise 6.1 Write a function that converts temperature measurements in degrees Fahrenheit to degrees Celsius, and apply it to the Temp column of the airquality data.

(Click here to go to the solution.)


Exercise 6.2 Practice writing functions by doing the following:

  1. Write a function that takes a vector as input and returns a vector containing its minimum and maximum, without using min and max.

  2. Write a function that computes the mean of the squared values of a vector using mean, and that takes additional arguments that it passes on to mean (e.g. na.rm).

(Click here to go to the solution.)

6.1.5 Namespaces

It is possible, and even likely, that you will encounter functions in packages with the same name as functions in other packages. Or, similarly, that there are functions in packages with the same names as those you have written yourself. This is of course a bit of a headache, but it’s actually something that can be overcome without changing the names of the functions. Just like variables can live in different environments, R functions live in namespaces, usually corresponding to either the global environment or the package they belong to. By specifying which namespace to look for the function in, you can use multiple functions that all have the same name.

For example, let’s create a function called sqrt. There is already such a function in the base package (see ?sqrt), but let’s do it anyway:

sqrt <- function(x)
{
      return(x^10)
}

If we now apply sqrt to an object, the function that we just defined will be used:

sqrt(4)

But if we want to use the sqrt from base, we can specify that by writing the namespace (which almost always is the package name) followed by :: and the function name:

base::sqrt(4)

The :: notation can also be used to call a function or object from a package without loading the package’s namespace:

msleep # Doesn't work if ggplot2 isn't loaded
ggplot2::msleep # Works, without loading the ggplot2 namespace!

When you call a function, R will look for it in all active namespaces, following a particular order. To see the order of the namespaces, you can use search:

search()

Note that the global environment is first in this list - meaning that the functions that you define will always be preferred to functions in packages.

All this being said, note that it is bad practice to give your functions and variables the same names as common functions. Don’t name them mean, c or sqrt. Nothing good can ever come from that sort of behaviour.

Nothing.

6.1.6 Sourcing other scripts

If you want to reuse a function that you have written in a new script, you can of course copy it into that script. But if you then make changes to your function, you will quickly end up with several different versions of it. A better idea can therefore be to put the function in a separate script, which you then can call in each script where you need the function. This is done using source. If, for instance, you have code that defines some functions in a file called helper-functions.R in your working directory, you can run it (thus defining the functions) when the rest of your code is run by adding source("helper-functions.R") to your code.
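
As a minimal sketch, assuming that helper-functions.R contains the average function from Section 6.1.1:

# Contents of helper-functions.R:
# average <- function(x) { return(sum(x)/length(x)) }

# In the script where you need the function:
source("helper-functions.R")
average(1:100)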

Another option is to create an R package containing the function, but that is beyond the scope of this book. Should you choose to go down that route, I highly recommend reading R Packages by Wickham and Bryan.

6.2 More on pipes

We have seen how the magrittr pipe %>% can be used to chain functions together. But there are also other pipe operators that are useful. In this section we’ll look at some of them, and see how you can create functions using pipes.

6.2.1 Ce ne sont pas non plus des pipes (these are not pipes either)

Although %>% is the most used pipe operator, the magrittr package provides a number of other pipes that are useful in certain situations.

One example is when you want to pass variables rather than an entire dataset to the next function. This is needed for instance if you want to use cor to compute the correlation between two variables, because cor takes two vectors as input instead of a data frame. You can do it using ordinary %>% pipes:

library(magrittr)
airquality %>%
      subset(Temp > 80) %>%
      {cor(.$Temp, .$Wind)}

However, the curly brackets {} and the dots . make this a little awkward and difficult to read. A better option is to use the %$% pipe, which passes on the names of all variables in your data frame instead:

airquality %>% 
      subset(Temp > 80) %$% 
      cor(Temp, Wind)

If you want to modify a variable using a pipe, you can use the compound assignment pipe %<>%. The following three lines all yield exactly the same result:

x <- 1:8;   x <- sqrt(x);        x
x <- 1:8;   x %>% sqrt -> x;     x   
x <- 1:8;   x %<>% sqrt;         x

As long as the first pipe in the pipeline is the compound assignment operator %<>%, you can combine it with other pipes:

x <- 1:8
x %<>% subset(x > 5) %>% sqrt
x

Sometimes you want to do something in the middle of a pipeline, like creating a plot, before continuing to the next step in the chain. The tee operator %T>% can be used to execute a function without passing on its output (if any). Instead, it passes on its left-hand input to the next step in the pipeline. Here is an example:

airquality %>% 
      subset(Temp > 80) %T>%
      plot %$% 
      cor(Temp, Wind)

Note that if we’d used an ordinary pipe %>% instead, we’d get an error:

airquality %>% 
      subset(Temp > 80) %>% 
      plot %$% 
      cor(Temp, Wind)

The reason is that cor looks for the variables Temp and Wind in the plot object, and not in the data frame. The tee operator takes care of this by passing on the data from its left side.

Remember that if you have a function where data only appears within parentheses, you need to wrap the function in curly brackets:

airquality %>% 
      subset(Temp > 80) %T>%
      {cat("Number of rows in data:", nrow(.), "\n")} %$% 
      cor(Temp, Wind)

When using the tee operator, this is also true for calls to ggplot, where you additionally need to wrap the plot object in a call to print:

library(ggplot2)
airquality %>% 
      subset(Temp > 80) %T>%
      {print(ggplot(., aes(Temp, Wind)) + geom_point())} %$% 
      cor(Temp, Wind)

6.2.2 Writing functions with pipes

If you will be reusing the same pipeline multiple times, you may want to create a function for it. Let’s say that you have a data frame containing only numeric variables, and that you want to create a scatterplot matrix (which can be done using plot) and compute the correlations between all variables (using cor). As an example, you could do this for airquality as follows:

airquality %T>% plot %>% cor

To define a function for this combination of operators, we simply write:

plot_and_cor <- . %T>% plot %>% cor

Note that we don’t have to write function(...) when defining functions with pipes!

We can now use this function just like any other:

# With the airquality data:
airquality %>% plot_and_cor
plot_and_cor(airquality)

# With the bookstore data:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
bookstore <- data.frame(age, purchase)
bookstore %>% plot_and_cor

\[\sim\]

Exercise 6.3 Write a function that takes a data frame as input and uses pipes to print the number of NA values in the data, remove all rows with NA values and return a summary of the remaining data.

(Click here to go to the solution.)


Exercise 6.4 Pipes are operators, that is, functions that take two variables as input and can be written without parentheses (other examples of operators are + and *). You can define your own operators just as you would any other function. For instance, we can define an operator called quadratic that takes two numbers a and b as input and computes the quadratic expression \((a+b)^2\):
`%quadratic%` <- function(a, b) { (a + b)^2 }
2 %quadratic% 3

Create an operator called %against% that takes two vectors as input and draws a scatterplot of them.

(Click here to go to the solution.)

6.3 Checking conditions

Sometimes you’d like your code to perform different operations depending on whether or not a certain condition is fulfilled. Perhaps you want it to do something different if there is missing data, if the input is a character vector, or if the largest value in a numeric vector is greater than some number. In Section 3.2.3 you learned how to filter data using conditions. In this section, you’ll learn how to use conditional statements for a number of other tasks.

6.3.1 if and else

The most important functions for checking whether a condition is fulfilled are if and else. The basic syntax is

if(condition) { do something } else { do something else }

The condition should return a single logical value, so that it evaluates to either TRUE or FALSE. If the condition is fulfilled, i.e. if it is TRUE, the code inside the first pair of curly brackets will run, and if it’s not (FALSE), the code within the second pair of curly brackets will run instead.

As a first example, assume that you want to compute the reciprocal of \(x\), \(1/x\), unless \(x=0\), in which case you wish to print an error message:

x <- 2
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }

Now try running the same code with x set to 0:

x <- 0
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }

Alternatively, we could check if \(x\neq 0\) and then change the order of the segments within the curly brackets:

x <- 0
if(x != 0) { 1/x } else { cat("Error! Division by zero.") }

You don’t have to write all of the code on the same line, but you must make sure that else is placed on the same line as the closing } of the if block:

if(x == 0)
{ 
    cat("Error! Division by zero.")
} else
{ 
    1/x
}

You can also choose not to have an else part at all. In that case, the code inside the curly brackets will run if the condition is satisfied, and if not, nothing will happen:

x <- 0
if(x == 0) { cat("x is 0.") }

x <- 2
if(x == 0) { cat("x is 0.") }

Finally, if you need to check a number of conditions one after another, in order to list different possibilities, you can do so by repeated use of if and else:

if(x == 0)
{ 
    cat("Error! Division by zero.")
} else if(is.infinite(x))
{
    cat("Error! Division by infinity.")
} else if(is.na(x))
{
    cat("Error! Division by NA.")
} else
{ 
    1/x
}

6.3.2 & and &&

Just as when we used conditions for filtering in Sections 3.2.3 and 5.8.2, it is possible to combine several conditions into one using & (AND) and | (OR). However, the & and | operators are vectorised, meaning that they will return a vector of logical values whenever possible. This is not desirable in conditional statements, where the condition must evaluate to a single value. Using a condition that returns a vector results in a warning message (and, from R 4.2.0 onwards, an error):

if(c(1, 2) == 2) { cat("The vector contains the number 2.\n") }
if(c(2, 1) == 2) { cat("The vector contains the number 2.\n") }

As you can see, only the first element of the logical vector is evaluated by if (in R 4.2.0 and later, such a condition is instead treated as an error). Usually, if a condition evaluates to a vector, it is because you’ve made an error in your code. Remember, if you really need to evaluate a condition regarding the elements in a vector, you can collapse the resulting logical vector to a single value using any or all.
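
For instance (a small sketch), any and all collapse a logical vector into a single value that if can handle:

x <- c(1, 2)

# TRUE if at least one element equals 2:
if(any(x == 2)) { cat("The vector contains the number 2.\n") }

# TRUE only if every element equals 2:
if(all(x == 2)) { cat("All elements are 2.\n") }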

Some texts recommend using the operators && and || instead of & and | in conditional statements. These work almost like & and |, but force the condition to evaluate to a single logical. I prefer to use & and |, because I want to be notified if my condition evaluates to a vector - once again, that likely means that there is an error somewhere in my code!

There is, however, one case where I much prefer && and ||. & and | always evaluate all the conditions that you’re combining, while && and || don’t: && stops as soon as it encounters a FALSE and || stops as soon as it encounters a TRUE. Consequently, you can put the conditions you wish to combine in a particular order to make sure that they can be evaluated. For instance, you may want first to check that a variable exists, and then check a property. This can be done using exists to check whether or not it exists - note that the variable name must be written within quotes:

# a is a variable that doesn't exist

# Using && works:
if(exists("a") && a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is negative.") }

# But using & doesn't, because it attempts to evaluate a>0
# even though a doesn't exist:
if(exists("a") & a > 0)
{
    cat("The variable exists and is positive.")
} else { cat("a doesn't exist or is negative.") }

6.3.3 ifelse

It is common that you want to assign different values to a variable depending on whether or not a condition is satisfied:

x <- 2

if(x == 0)
{ 
    reciprocal <- "Error! Division by zero."
} else
{ 
    reciprocal <- 1/x
}

reciprocal

In fact, this situation is so common that there is a special function for it: ifelse:

reciprocal <- ifelse(x == 0, "Error! Division by zero.", 1/x)

ifelse evaluates a condition and then returns different answers depending on whether the condition is TRUE or FALSE. It can also be applied to vectors, in which case it checks the condition for each element of the vector and returns an answer for each element:

x <- c(-1, 1, 2, -2, 3)
ifelse(x > 0, "Positive", "Negative")

6.3.4 switch

For the sake of readability, it is usually a good idea to try to avoid chains of the type if() {} else if() {} else if() {} else {}. One function that can be useful for this is switch, which lets you list a number of possible results, either by position (a number) or by name:

position <- 2
switch(position,
      "First position",
      "Second position",
      "Third position")

name <- "First"
switch(name,
      First = "First name",
      Second = "Second name",
      Third = "Third name")

You can for instance use this to decide what function should be applied to your data:

x <- 1:3
y <- c(3, 5, 4)
method <- "nonparametric2"
cor_xy <- switch(method,
      parametric = cor(x, y, method = "pearson"),
      nonparametric1 = cor(x, y, method = "spearman"),
      nonparametric2 = cor(x, y, method = "kendall"))
cor_xy

6.3.5 Failing gracefully

Conditional statements are useful for ensuring that the input to a function you’ve written is of the correct type. In Section 6.1.3 we saw that our average function failed if we applied it to a character vector:

average <- function(x)
{
      avg <- sum(x)/length(x)
      return(avg)
}

average(c("Lady Gaga", "Tool", "Dry the River"))

By using a conditional statement, we can provide a more informative error message. We can check that the input is numeric and, if it’s not, stop the function and print an error message, using stop:

average <- function(x)
{
      if(is.numeric(x))
      {
            avg <- sum(x)/length(x)
            return(avg)
      } else
      { 
            stop("The input must be a numeric vector.")
      }
}

average(c(1, 5, 8))
average(c("Lady Gaga", "Tool", "Dry the River"))

\[\sim\]

Exercise 6.5 Which of the following conditions are TRUE? First think about the answer, and then check it using R.

x <- 2
y <- 3
z <- -3
  1. x > 2

  2. x > y | x > z

  3. x > y & x > z

  4. abs(x*z) >= y

(Click here to go to the solution.)


Exercise 6.6 Fix the errors in the following code:

x <- c(1, 2, pi, 8)

# Only compute square roots if x exists
# and contains positive values:
if(exists(x)) { if(x > 0) { sqrt(x) } }

(Click here to go to the solution.)

6.4 Iteration using loops

We have already seen how you can use functions to make it easier to repeat the same task over and over. But there is still a part of the puzzle missing - what if, for example, you wish to apply a function to each column of a data frame? What if you want to apply it to data from a number of files, one at a time? The solution to these problems is to use iteration. In this section, we’ll explore how to perform iteration using loops.

6.4.1 for loops

for loops can be used to run the same code several times, with different settings, e.g. different data, in each iteration. Their use is perhaps best explained by some examples. We create the loop using for, give the name of a control variable and a vector containing its values (the control variable controls how many iterations to run) and then write the code that should be repeated in each iteration of the loop. In each iteration, a new value of the control variable is used in the code, and the loop stops when all values have been used.

As a first example, let’s write a for loop that runs a block of code five times, where the block prints the current iteration number:

for(i in 1:5)
{
    cat("Iteration", i, "\n")
}

This is equivalent to writing:

cat("Iteration", 1, "\n")
cat("Iteration", 2, "\n")
cat("Iteration", 3, "\n")
cat("Iteration", 4, "\n")
cat("Iteration", 5, "\n")

The upside is that we didn’t have to copy and edit the same code multiple times - and as you can imagine, this benefit becomes even more pronounced if you have more complicated code blocks.

The values for the control variable are given in a vector, and the code block will be run once for each element in the vector - we say that we loop over the values in the vector. The vector doesn’t have to be numeric - here is an example with a character vector:

for(word in c("one", "two", "five hundred and fifty five"))
{
    cat("Iteration", word, "\n")
}

Of course, loops are used for so much more than merely printing text on the screen. A common use is to perform some computation and then store the result in a vector. In this case, we must first create an empty vector to store the result in, e.g. using vector, which creates an empty vector of a specific type and length:

squares <- vector("numeric", 5)

for(i in 1:5)
{
    squares[i] <- i^2
}
squares

In this case, it would have been both simpler and computationally faster to compute the squared values by running (1:5)^2. This is known as a vectorised solution, and is very important in R. We’ll discuss vectorised solutions in detail in Section 6.5.

When creating the values used for the control variable, we often wish to create different sequences of numbers. Two functions that are very useful for this are seq, which creates sequences, and rep, which repeats patterns:

seq(0, 100)
seq(0, 100, by = 10)
seq(0, 100, length.out = 21)

rep(1, 4)
rep(c(1, 2), 4)
rep(c(1, 2), c(4, 2))

Finally, seq_along can be used to create a sequence of indices for a vector or a data frame, which is useful if you wish to iterate some code for each element of a vector or each column of a data frame:

seq_along(airquality) # Gives the indices of all columns of the data
                      # frame
seq_along(airquality$Temp) # Gives the indices of all elements of the
                           # vector

Here is an example of how to use seq_along to compute the mean of each column of a data frame:

# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))

# Loop over the variables in airquality:
for(i in seq_along(airquality))
{
      means[i] <- mean(airquality[[i]], na.rm = TRUE)
}

# Check that the results agree with those from the colMeans function:
means
colMeans(airquality, na.rm = TRUE)

The line inside the loop could have read means[i] <- mean(airquality[,i], na.rm = TRUE), but that would have caused problems if we’d used it with a data.table or tibble object; see Section 5.9.4.
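
To see the difference, compare the two ways of extracting a column - a small sketch, assuming that the tibble package is installed:

library(tibble)
aq <- as_tibble(airquality)

aq[[1]]   # Double brackets return a vector, which mean can handle
aq[, 1]   # Single brackets return a one-column tibble, which it can't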

Finally, we can also change the values of the data in each iteration of the loop. Some machine learning methods require that the data is standardised, i.e. that all columns have mean 0 and standard deviation 1. This is achieved by subtracting the mean from each variable and then dividing each variable by its standard deviation. We can write a function for this that uses a loop, changing the values of a column in each iteration:

standardise <- function(df, ...)
{
      for(i in seq_along(df))
      {
          df[[i]] <- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
      }
      return(df)
}

# Try it out:
aqs <- standardise(airquality, na.rm = TRUE)
colMeans(aqs, na.rm = TRUE) # Non-zero due to floating point
                            # arithmetics!
sd(aqs$Wind)

\[\sim\]

Exercise 6.7 Practice writing for loops by doing the following:

  1. Compute the mean temperature for each month in the airquality dataset using a loop rather than an existing function.

  2. Use a for loop to compute the maximum and minimum value of each column of the airquality data frame, storing the results in a data frame.

  3. Make your solution to the previous task reusable by writing a function that returns the maximum and minimum value of each column of a data frame.

(Click here to go to the solution.)


Exercise 6.8 Use rep or seq to create the following vectors:

  1. 0.25 0.5 0.75 1

  2. 1 1 1 2 2 5

(Click here to go to the solution.)


Exercise 6.9 As an alternative to seq_along(airquality) and seq_along(airquality$Temp), we could create the same sequences using 1:ncol(airquality) and 1:length(airquality$Temp). Use x <- c() to create a vector of length zero. Then create loops that use seq_along(x) and 1:length(x) as values for the control variable. How many iterations do the two loops run? Which solution is preferable?

(Click here to go to the solution.)


Exercise 6.10 An alternative to standardisation is normalisation, where all numeric variables are rescaled so that their smallest value is 0 and their largest value is 1. Write a function that normalises the variables in a data frame containing numeric columns.

(Click here to go to the solution.)


Exercise 6.11 The function list.files can be used to create a vector containing the names of all files in a folder. The pattern argument can be used to supply a regular expression describing a file name pattern. For instance, if pattern = "\\.csv$" is used, only .csv files will be listed.

Create a loop that goes through all .csv files in a folder and prints the names of the variables for each file.

(Click here to go to the solution.)

6.4.2 Loops within loops

In some situations, you’ll want to put a loop inside another loop. Such loops are said to be nested. An example is if we want to compute the correlation between all pairs of variables in airquality, and store the result in a matrix:

cor_mat <- matrix(NA, nrow = ncol(airquality),
                  ncol = ncol(airquality))
for(i in seq_along(airquality))
{
    for(j in seq_along(airquality))
    {
        cor_mat[i, j] <- cor(airquality[[i]], airquality[[j]],
                             use = "pairwise.complete")
    }
}

# Element [i, j] of the matrix now contains the correlation between
# variables i and j:
cor_mat

Once again, there is a vectorised solution to this problem, given by cor(airquality, use = "pairwise.complete"). As we will see in Section 6.6, vectorised solutions like this can be several times faster than solutions that use nested loops. In general, solutions involving nested loops tend to be fairly slow - but on the other hand, they are often easy and straightforward to implement.

6.4.3 Keeping track of what’s happening

Sometimes each iteration of your loop takes a long time to run, and you’ll want to monitor its progress. This can be done using printed messages or a progress bar in the Console panel, or sound notifications. We’ll showcase each of these using a loop containing a call to Sys.sleep, which pauses the execution of R commands for a short time (determined by the user).

First, we can use cat to print a message describing the progress. Adding \r to the end of a string allows us to print all messages on the same line, with each new message replacing the old one:

# Print each message on a new line:
for(i in 1:5)
{
    cat("Step", i, "out of 5\n")
    Sys.sleep(1) # Sleep for 1 second
}

# Replace the previous message with the new one:
for(i in 1:5)
{
    cat("Step", i, "out of 5\r")
    Sys.sleep(1) # Sleep for one second
}

Adding a progress bar is a little more complicated, because we must first initialise the bar using txtProgressBar and then update it using setTxtProgressBar:

sequence <- 1:5
pbar <- txtProgressBar(min = 0, max = max(sequence), style = 3)
for(i in sequence)
{
    Sys.sleep(1) # Sleep for 1 second
    setTxtProgressBar(pbar, i)
}
close(pbar)

Finally, the beepr package can be used to play sounds, with the function beep:

install.packages("beepr")

library(beepr)
# Play all 11 sounds available in beepr:
for(i in 1:11)
{
    beep(sound = i)
    Sys.sleep(2) # Sleep for 2 seconds
}

6.4.4 Loops and lists

In our previous examples of loops, it has always been clear from the start how many iterations the loop should run and what the length of the output vector (or data frame) should be. This isn’t always the case. To begin with, let’s consider the case where the length of the output is unknown or difficult to know in advance. Let’s say that we want to go through the airquality data to find days that are extreme in the sense that at least one variable attains its maximum on those days. That is, we wish to find the index of the maximum of each variable, and store them in a vector. Because several days can have the same temperature or wind speed, there may be more than one such maximal index for each variable. For that reason, we don’t know the length of the output vector in advance.

In such cases, it is usually a good idea to store the result from each iteration in a list (Section 5.2), and then collect the elements from the list once the loop has finished. We can create an empty list with one element for each variable in airquality using vector:

# Create an empty list with one element for each variable in
# airquality:
max_list <- vector("list", ncol(airquality))

# Naming the list elements will help us see which variable the maximal
# indices belong to:
names(max_list) <- names(airquality)

# Loop over the variables to find the maxima:
for(i in seq_along(airquality))
{
      # Find indices of maximum values:
      max_index <- which(airquality[[i]] == max(airquality[[i]],
                                                na.rm = TRUE))
      
      # Add indices to list:
      max_list[[i]] <- max_index
}

# Check results:
max_list

# Collapse to a vector:
extreme_days <- unlist(max_list)

(In this case, only the variables Month and Day have duplicate maximum values.)

6.4.5 while loops

In some situations, we want to run a loop until a certain condition is met, meaning that we don’t know in advance how many iterations we’ll need. This is more common in numerical optimisation and simulation, but sometimes also occurs in data analyses.

When we don’t know in advance how many iterations are needed, we can use while loops. Unlike for loops, which iterate a fixed number of times, while loops keep iterating as long as some specified condition is met. Here is an example where the loop keeps iterating until i squared is greater than 100:

i <- 1

while(i^2 <= 100)
{
      cat(i, "squared is", i^2, "\n")
      i <- i + 1
}

i

The code block inside the loop keeps repeating until the condition i^2 <= 100 no longer is satisfied. We have to be a little bit careful with this condition - if we write it in such a way that it can remain satisfied forever, the loop will just keep running and running, creating what is known as an infinite loop. If you’ve accidentally created an infinite loop, you can break it by pressing the Stop button at the top of the Console panel in RStudio.
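
A simple safeguard (a small sketch, not needed for the example above) is to add a counter that caps the number of iterations:

i <- 1
max_iter <- 1000   # Hypothetical upper bound on the number of iterations

while(i^2 <= 100 & i <= max_iter)
{
      i <- i + 1
}
i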

In Section 5.3.3 we saw how rle can be used to find and compute the lengths of runs of equal values in a vector. We can use nested while loops to create something similar. while loops are a good choice here, because we don’t know how many runs are in the vector in advance. Here is an example, which you’ll study in more detail in Exercise 6.12:

# Create a vector of 0's and 1's:
x <- rep(c(0, 1, 0, 1, 0), c(5, 1, 4, 2, 7))

# Create empty vectors where the results will be stored:
run_values <- run_lengths <- c()

# Set the initial condition:
i <- 1

# Iterate over the entire vector:
while(i < length(x))
{
    # A new run starts:
    run_length <- 1
    cat("A run starts at i =", i, "\n")
    
    # Check how long the run continues:
    while(x[i+1] == x[i] & i < length(x))
    {
          run_length <- run_length + 1
          i <- i + 1
    }
    
    i <- i + 1
    
    # Save results:
    run_values <- c(run_values, x[i-1])
    run_lengths <- c(run_lengths, run_length)
}

# Present the results:
data.frame(run_values, run_lengths)

\[\sim\]

Exercise 6.12 Consider the nested while loops in the run length example above. Go through the code and think about what happens in each step. What happens when i is 1? When it is 5? When it is 6? Answer the following questions:

  1. What does the condition for the outer while loop check? Why is it needed?

  2. What does the condition for the inner while loop check? Why is it needed?

  3. What does the line run_values <- c(run_values, x[i-1]) do?

(Click here to go to the solution.)


Exercise 6.13 The control statements break and next can be used inside both for and while loops to control their behaviour further. break stops a loop, and next skips to the next iteration of it. Use these functions to modify the following piece of code so that the loop skips to the next iteration if x[i] is 0, and breaks if x[i] is NA:
x <- c(1, 5, 8, 0, 20, 0, 3, NA, 18, 2)

for(i in seq_along(x))
{
      cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}

(Click here to go to the solution.)


Exercise 6.14 Using the cor_mat computation from Section 6.4.2, write a function that computes all pairwise correlations in a data frame, and uses next to only compute correlations for numeric variables. Test your function by applying it to the msleep data from ggplot2. Could you achieve the same thing without using next?

(Click here to go to the solution.)

6.5 Iteration using vectorisation and functionals

Many operators and functions in R take vectors as input and handle them in a highly efficient way, usually by passing the vector on to an optimised function written in the C programming language. So if we want to compute the squares of the numbers in a vector, we don’t need to write a loop:

squares <- vector("numeric", 5)

for(i in 1:5)
{
    squares[i] <- i^2
}
squares

Instead, we can simply apply the ^ operator, which uses fast C code to compute the squares:

squares <- (1:5)^2

These types of functions and operators are called vectorised. They take a vector as input and apply a function to all its elements, meaning that we can avoid slower solutions utilising loops in R. Try to use vectorised solutions rather than loops whenever possible - it makes your code both easier to read and faster to run.

A related concept is that of functionals: functions that take another function as input and apply it in each iteration of an internal for loop. Instead of writing a for loop yourself, you can use a functional, supplying the data or vector to loop over and the function that should be applied in each iteration of the loop. This won’t necessarily make your loop run faster, but it does have other benefits:

  • Shorter code: functionals allow you to write more concise code. Some would argue that they also allow you to write code that is easier to read, but that is obviously a matter of taste.
  • Efficient: functionals handle memory allocation and other small tasks efficiently, meaning that you don’t have to worry about creating a vector of an appropriate size to store the result.
  • No changes to your environment: because all operations now take place in the local environment of the functional, you don’t run the risk of accidentally changing variables in your global environment.
  • No left-overs: a for loop leaves the control variable (e.g. i) in the environment; functionals do not.
  • Easy to use with pipes: because the loop has been wrapped in a function, it lends itself well to being used in a %>% pipeline.

Explicit loops are preferable when:

  • You think that they are easier to read and write.
  • Your functions take data frames or other non-vector objects as input.
  • Each iteration of your loop depends on the results from previous iterations.

In this section, we’ll see how we can apply functionals to obtain elegant alternatives to (explicit) loops.

6.5.1 A first example with apply

The prototypical functional is apply, which loops over either the rows or the columns of a data frame. The arguments are a dataset, the margin to loop over (1 for rows, 2 for columns) and then the function to be applied.

In Section 6.4.1 we wrote a for loop for computing the mean value of each column in a data frame:

# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))

# Loop over the variables in airquality:
for(i in seq_along(airquality))
{
      means[i] <- mean(airquality[[i]], na.rm = TRUE)
}

Using apply, we can reduce this to a single line. We wish to use the airquality data, loop over the columns (margin 2) and apply the function mean to each column:

apply(airquality, 2, mean)

Rather elegant, don’t you think?

Additional arguments can be passed to the function inside apply by adding them to the end of the function call:

apply(airquality, 2, mean, na.rm = TRUE)

\[\sim\]

Exercise 6.15 Use apply to compute the maximum and minimum value of each column of the airquality data frame. Can you write a function that allows you to compute both with a single call to apply?

(Click here to go to the solution.)

6.5.2 Variations on a theme

There are several variations of apply that are tailored to specific problems:

  • lapply: takes a function and vector/list as input, and returns a list.
  • sapply: takes a function and vector/list as input, and returns a vector or matrix.
  • vapply: a version of sapply with additional checks of the format of the output.
  • tapply: for looping over groups, e.g. when computing grouped summaries.
  • rapply: a recursive version of lapply.
  • mapply: for applying a function to multiple arguments; see Section 6.5.7.
  • eapply: for applying a function to all objects in an environment.

We have already seen several ways to compute the mean temperature for different months in the airquality data (Sections 3.8 and 5.7.7, and Exercise 6.7). The *apply family offer several more:

# Create a list:
temps <- split(airquality$Temp, airquality$Month)

lapply(temps, mean)
sapply(temps, mean)
vapply(temps, mean, vector("numeric", 1))
tapply(airquality$Temp, airquality$Month, mean)
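
mapply, which we haven’t used above, applies a function to the corresponding elements of several vectors. Here is a small sketch with made-up numbers:

# Compute x^n for each pair of x and n values:
mapply(function(x, n) { x^n }, x = c(2, 3, 4), n = c(1, 2, 3))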

There is, as that delightful proverb goes, more than one way to skin a cat.

\[\sim\]

Exercise 6.16 Use an *apply function to simultaneously compute the monthly maximum and minimum temperature in the airquality data frame.

(Click here to go to the solution.)


Exercise 6.17 Use an *apply function to simultaneously compute the monthly maximum and minimum temperature and windspeed in the airquality data frame.

Hint: start by writing a function that simultaneously computes the maximum and minimum temperature and windspeed for a data frame containing data from a single month.

(Click here to go to the solution.)

6.5.3 purrr

If you feel enthusiastic about skinning cats using functionals instead of loops, the tidyverse package purrr is a great addition to your toolbox. It contains a number of specialised alternatives to the *apply functions. More importantly, it also contains certain shortcuts that come in handy when working with functionals. For instance, it is fairly common to define a short function inside your functional, which is useful when you don’t want the function to take up space in your environment. This can be done a little more elegantly with purrr functions using a shortcut denoted by ~. Let’s say that we want to standardise all variables in airquality. The map function is the purrr equivalent of lapply. We can use it with or without the shortcut, and with or without pipes (we mention the use of pipes now because it will be important in what comes next):

# Base solution:
lapply(airquality, function(x) { (x-mean(x))/sd(x) })

# Base solution with pipe:
library(magrittr)
airquality %>% lapply(function(x) { (x-mean(x))/sd(x) })

# purrr solution:
library(purrr)
map(airquality, function(x) { (x-mean(x))/sd(x) })

# We can make the purrr solution less verbose using a shortcut:
map(airquality, ~(.-mean(.))/sd(.))

# purrr solution with pipe and shortcut:
airquality %>% map(~(.-mean(.))/sd(.))

Where this shortcut really shines is if you need to use multiple functionals. Let’s say that we want to standardise the airquality variables, compute a summary and then extract columns 2 and 5 from the summary (which contains the 1st and 3rd quartile of the data):

# Impenetrable base solution:
lapply(lapply(lapply(airquality,
                     function(x) { (x-mean(x))/sd(x) }),
              summary),
      function(x) { x[c(2, 5)] })

# Base solution with pipe:
airquality %>% 
      lapply(function(x) { (x-mean(x))/sd(x) }) %>%
      lapply(summary) %>%
      lapply(function(x) { x[c(2, 5)] })

# purrr solution:
airquality %>%
      map(~(.-mean(.))/sd(.)) %>%
      map(summary) %>%
      map(~.[c(2, 5)])

Once you know the meaning of ~, the purrr solution is a lot cleaner than the base solutions.

6.5.4 Specialised functions

So far, it may seem like map is just like lapply but with a shortcut for defining functions. Which is more or less true. But purrr contains a lot more functionals that you can use, each tailored to specific problems.

For instance, if you need to specify that the output should be a vector of a specific type, you can use:

  • map_dbl(data, function) instead of vapply(data, function, vector("numeric", length)),
  • map_int(data, function) instead of vapply(data, function, vector("integer", length)),
  • map_chr(data, function) instead of vapply(data, function, vector("character", length)),
  • map_lgl(data, function) instead of vapply(data, function, vector("logical", length)).

If you need to specify that the output should be a data frame, you can use:

  • map_dfr(data, function) instead of sapply(data, function).

The ~ shortcut for functions is available for all these map_* functions. In case you need to pass additional arguments to the function inside the functional, just add them at the end of the functional call:

airquality %>% map_dbl(max)
airquality %>% map_dbl(max, na.rm = TRUE)
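
The other map_* variants work in the same way. For instance, map_chr returns a character vector - here is a small sketch giving the class of each column in airquality:

airquality %>% map_chr(class)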

Another specialised function is the walk function. It works just like map, but doesn’t return anything. This is useful if you want to apply a function purely for its side effects, such as cat or write.csv:

# Returns a list of NULL values:
airquality %>% map(~cat("Maximum:", max(.), "\n"))

# Returns nothing:
airquality %>% walk(~cat("Maximum:", max(.), "\n")) 

\[\sim\]

Exercise 6.18 Use a map_* function to simultaneously compute the monthly maximum and minimum temperature in the airquality data frame, returning a vector.

(Click here to go to the solution.)

6.5.5 Exploring data with functionals

Functionals are great for creating custom summaries of your data. For instance, if you want to check the data type and number of unique values of each variable in your dataset, you can do that with a functional:

library(ggplot2)
diamonds %>% map_dfr(~(data.frame(unique_values = length(unique(.)),
                                  class = class(.))))

You can of course combine purrr functionals with functions from other packages, e.g. to replace length(unique(.)) with a function from your favourite data manipulation package:

# Using uniqueN from data.table:
library(data.table)
dia <- as.data.table(diamonds)
dia %>% map_dfr(~(data.frame(unique_values = uniqueN(.),
                             class = class(.))))

# Using n_distinct from dplyr:
library(dplyr)
diamonds %>% map_dfr(~(data.frame(unique_values = n_distinct(.),
                                  class = class(.))))

When creating summaries it can often be useful to be able to loop over both the elements of a vector and their indices. In purrr, this is done using the usual map* functions, but with an i (for index) in the beginning of their names, e.g. imap and iwalk:

# Returns a list of NULL values:
imap(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))  

# Returns nothing:
iwalk(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = "")) 

Note that .x is used to denote the variable, and that .y is used to denote the name of the variable. If i* functions are used on vectors without element names, indices are used instead. The names of elements of vectors can be set using set_names:

# Without element names:
x <- 1:5
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))

# Set element names:
x <- set_names(x, c("exp(1)", "exp(2)", "exp(3)", "exp(4)", "exp(5)"))
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))

\[\sim\]

Exercise 6.19 Write a function that takes a data frame as input and returns the following information about each variable in the data frame: variable name, number of unique values, data type and number of missing values. The function should, as you will have guessed, use a functional.

(Click here to go to the solution.)


Exercise 6.20 In Exercise 6.11 you wrote a function that printed the names and variables for all .csv files in a folder given by folder_path. Use purrr functionals to do the same thing.

(Click here to go to the solution.)

6.5.6 Keep calm and carry on

Another neat feature of purrr is the safely function, which can be used to wrap a function that will be used inside a functional, and makes sure that the functional returns a result even if there is an error. For instance, let’s say that we want to compute the logarithm of all variables in the msleep data:

library(ggplot2)
msleep

Note that some columns are character vectors, which will cause log to throw an error:

log(msleep$name)
log(msleep)
lapply(msleep, log)
map(msleep, log)

Note that the error messages we get from lapply and map here don’t give any information about which variable caused the error, making it more difficult to figure out what’s gone wrong.

If we first wrap log with safely, we get a list containing the correct output for the numeric variables, and error messages for the non-numeric variables:

safe_log <- safely(log)
lapply(msleep, safe_log)
map(msleep, safe_log)

Not only does this tell us where the errors occur, but it also returns the logarithms for all variables that log actually could be applied to.
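
Each element of the resulting list has two components, result and error. If we only want the successful results, we can extract them like this (a small sketch, assuming purrr is loaded):

safe_log <- safely(log)
log_list <- map(msleep, safe_log)

# Extract the result component of each element, then drop the NULL
# results from the variables where log failed:
results <- map(log_list, "result")
results[!map_lgl(results, is.null)]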

If you’d like your functional to return some default value, e.g. NA, instead of an error message, you can use possibly instead of safely:

pos_log <- possibly(log, otherwise = NA)
map(msleep, pos_log)

6.5.7 Iterating over multiple variables

A final important case is when you want to iterate over more than one variable. This is often the case when fitting statistical models that should be used for prediction, as you’ll see in Section 8.1.11. Another example is when you wish to create plots for several subsets in your data. For instance, we could create a plot of carat versus price for each combination of color and cut in the diamonds data. To do this for a single combination, we’d use something like this:

library(ggplot2)
library(dplyr)

diamonds %>% filter(cut == "Fair",
                    color == "D") %>% 
             ggplot(aes(carat, price)) +
                 geom_point() +
                 ggtitle("Fair, D")

To create such a plot for all combinations of color and cut, we must first create a data frame containing all unique combinations, which can be done using the distinct function from dplyr:

combos <- diamonds %>% distinct(cut, color)
cuts <- combos$cut
colours <- combos$color

map2 and walk2 from purrr loop over the elements of two vectors, x and y, say. They combine the first element of x with the first element of y, the second element of x with the second element of y, and so on - meaning that they won’t automatically loop over all combinations of elements. That is the reason why we use distinct above to create two vectors where each pair (x[i], y[i]) corresponds to a combination. Apart from the fact that we add a second vector to the call, map2 and walk2 work just like map and walk:

# Print all pairs:
walk2(cuts, colours, ~cat(.x, .y, "\n"))

# Create a plot for each pair:
combos_plots <- map2(cuts, colours, ~{
                     diamonds %>% filter(cut == .x,
                                         color == .y) %>% 
                      ggplot(aes(carat, price)) +
                             geom_point() +
                             ggtitle(paste(.x, .y, sep =", "))})

# View some plots:
combos_plots[[1]]
combos_plots[[30]]

# Save all plots in a pdf file, with one plot per page:
pdf("all_combos_plots.pdf", width = 8, height = 8)
combos_plots
dev.off()

The base function mapply could also have been used here. If you need to iterate over more than two vectors, you can use pmap or pwalk, which work analogously to map2 and walk2.
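
Here is a small sketch of pwalk, using three short made-up vectors:

library(purrr)
x <- c(1, 2)
y <- c(10, 20)
z <- c(100, 200)

# Print each triple (x[i], y[i], z[i]) on its own line:
pwalk(list(x, y, z), ~cat(..1, ..2, ..3, "\n"))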

\[\sim\]

Exercise 6.21 Using the gapminder data from the gapminder package, create scatterplots of pop and lifeExp for each combination of continent and year. Save each plot as a separate .png file.

(Click here to go to the solution.)

6.6 Measuring code performance

There are probably as many ideas about what good code is as there are programmers. Some prefer readable code; others prefer concise code. Some prefer to work with separate functions for each task, while others would rather continue to combine a few basic functions in new ways. Regardless of what you consider to be good code, there are a few objective measures that can be used to assess the quality of your code. In addition to writing code that works and is bug-free, you’d like your code to be:

  • Fast: meaning that it runs quickly. Some tasks can take anything from seconds to weeks, depending on what code you write for them. Speed is particularly important if you’re going to run your code many times.
  • Memory efficient: meaning that it uses as little of your computer’s memory as possible. Software running on your computer uses its memory - its RAM - to store data. If you’re not careful with RAM, you may end up with a full memory and a sluggish or frozen computer. Memory efficiency is critical if you’re working with big datasets that take up a lot of RAM to begin with.

In this section we’ll have a look at how you can measure the speed and memory efficiency of R functions. A caveat is that while speed and memory efficiency are important, the most important thing is to come up with a solution that works in the first place. You should almost always start by solving a problem, and then worry about speed and memory efficiency, not the other way around. The reason for this is that efficient code is often more difficult to write, read, and debug, which can slow down the process of writing it considerably.

Note also that speed and memory usage are system-dependent - the clock frequency and architecture of your processor and the speed and size of your RAM will affect how your code performs, as will your operating system and whatever other programs you are running at the same time. That means that if you wish to compare how two functions perform, you need to compare them on the same system under the same conditions.

As a side note, a great way to speed up functions that use either loops or functionals is parallelisation. We cover that topic in Section 10.2.

6.6.1 Timing functions

To measure how long a piece of code takes to run, we can use system.time as follows:

rtime <- system.time({
      x <- rnorm(1e6)
      mean(x)
      sd(x)
})

# elapsed is the total time it took to execute the code:
rtime
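
The output of system.time is a named vector, so if you want to use the elapsed time in further computations, you can extract it by name:

# Extract the elapsed time (in seconds):
rtime["elapsed"]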

This isn’t the best way of measuring computational time though, and doesn’t allow us to compare different functions easily. Instead, we’ll use the bench package, which contains a function called mark that is very useful for measuring the execution time of functions and blocks of code. Let’s start by installing it:

install.packages("bench")

In Section 6.1.1 we wrote a function for computing the mean of a vector:

average <- function(x)
{
      return(sum(x)/length(x))
}

Is this faster or slower than mean? We can use mark to apply both functions to a vector multiple times, and measure how long each execution takes:

library(bench)
x <- 1:100
bm <- mark(mean(x), average(x))
bm # Or use View(bm) if you don't want to print the results in the
   # Console panel.

mark has executed both functions n_itr times each and measured how long each execution took. The execution time varies - in the output you can see the shortest (min) and median (median) execution times, as well as the number of iterations per second (itr/sec). Be a little wary of the units for the execution times so that you don't get them confused - a millisecond (ms, \(10^{-3}\) seconds) is 1,000 microseconds (µs, 1 µs is \(10^{-6}\) seconds), and 1 microsecond is 1,000 nanoseconds (ns, 1 ns is \(10^{-9}\) seconds).
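
By default, mark decides how many iterations to run based on how long each execution takes. If you want more control, it should be possible to set the number of iterations yourself using the iterations argument (a sketch; see ?mark for the available settings):

# Run each expression exactly 100 times:
bm100 <- mark(mean(x), average(x), iterations = 100)
bm100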

The result here may surprise you - it appears that average is faster than mean! The reason is that mean does a lot of things that average does not: it checks the data type and warns you if the data is of the wrong type (e.g. character), and it traverses the vector twice to lower the risk of errors due to floating point arithmetic. All of this takes time, and makes the function slower (but safer to use).
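
We can illustrate the difference in safety (rather than speed) by passing a character vector to both functions. average relies on sum, which throws an error for character input, whereas mean instead warns us about the input type and returns NA:

# average fails because sum doesn't accept character vectors:
average(c("a", "b"))
# mean returns NA, with a warning about the argument type:
mean(c("a", "b"))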

We can plot the results using plot. The plot method relies on the ggbeeswarm package, so we install that first:

install.packages("ggbeeswarm")

plot(bm)
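
The plot method for the output of mark is built on ggplot2, which is why ggbeeswarm is needed. If you prefer, you should be able to get the same type of figure by calling ggplot2's autoplot directly:

# An alternative way of plotting the results:
library(ggplot2)
autoplot(bm)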

It is also possible to place blocks of code inside curly brackets, { }, in mark. Here is an example comparing a vectorised solution for computing the squares of a vector with a solution using a loop:

x <- 1:100
bm <- mark(x^2,
    {
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        y
    })
bm
plot(bm)

Although the above code works, it isn’t the prettiest, and the bm table looks a bit confusing because of the long expression for the code block. I prefer to put the code block inside a function instead:

squares <- function(x)
{
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        return(y)
}

x <- 1:100
bm <- mark(x^2, squares(x))
bm
plot(bm)

Note that squares(x) is faster than the original code block:

bm <- mark(squares(x),
    {
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        y
    })
bm

Functions in R are compiled the first time they are run, which often makes them run faster than the same code would run outside of a function. We'll discuss this further in the next section.

6.6.2 Measuring memory usage - and a note on compilation

mark also shows us how much memory is allocated when running different code blocks, in the mem_alloc column of the output⁴⁷.

Unfortunately, measuring memory usage is a little tricky. To see why, restart R (yes, really - this is important!), and then run the following code to benchmark x^2 versus squares(x):

library(bench)

squares <- function(x)
{
        y <- x
        for(i in seq_along(x))
        {
            y[i] <- x[i]*x[i]
        }
        return(y)
}

x <- 1:100
bm <- mark(x^2, squares(x))
bm

Judging from the mem_alloc column, it appears that the squares(x) loop not only is slower, but also uses more memory. But wait! Let’s run the code again, just to be sure of the result:

bm <- mark(x^2, squares(x))
bm

This time out, both functions use less memory, and squares now uses less memory than x^2. What’s going on?

Computers can't read code written in R or most other programming languages directly. Instead, the code must be translated to machine code that the computer's processor uses, in a process known as compilation. R uses just-in-time compilation of functions and loops⁴⁸, meaning that it translates the R code for new functions and loops to machine code during execution. Other languages, such as C, use ahead-of-time compilation, translating the code prior to execution. The latter can make the execution much faster, but some flexibility is lost, and the code needs to be run through a compiler ahead of execution, which also takes time. When doing the just-in-time compilation, R needs to use some of the computer's memory, which causes the memory usage to be greater the first time the function is run. However, if an R function is run again, it has already been compiled, meaning R doesn't have to allocate memory for compilation.
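
One way to see this in action is to print a function before and after running it: once a function has been compiled, its printout ends with a <bytecode: ...> line. Here is a small sketch using a made-up function cube (depending on its size, a function may need to be run once or twice before R compiles it):

# Define a new function and print it - no bytecode line yet:
cube <- function(x) { x^3 }
cube
# Run it a couple of times to trigger the just-in-time compilation:
cube(2)
cube(3)
# The printout should now include a <bytecode: ...> line:
cube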

In conclusion, if you want to benchmark the memory usage of functions, make sure to run them once before benchmarking. Alternatively, if your function takes a long time to run, you can compile it without running it using the cmpfun function from the compiler package:

library(compiler)
squares <- cmpfun(squares)
squares(1:10)
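
With squares compiled (and run once), a fresh benchmark shouldn't include the memory allocated for compilation, so the mem_alloc figures should now reflect the function itself:

# Benchmark again, this time without compilation overhead:
x <- 1:100
mark(x^2, squares(x))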

\[\sim\]

Exercise 6.22 Write a function for computing the mean of a vector using a for loop. How much slower than mean is it? Which function uses more memory?

(Click here to go to the solution.)


Exercise 6.23 We have seen three different ways of filtering a data frame to only keep rows that fulfil a condition: using base R, data.table and dplyr. Suppose that we want to extract all flights from 1 January from the flights data in the nycflights13 package:

library(data.table)
library(dplyr)
library(nycflights13)
# Read about the data:
?flights

# Make a data.table copy of the data:
flights.dt <- as.data.table(flights)

# Filtering using base R:
flights0101 <- flights[flights$month == 1 & flights$day == 1,]
# Filtering using data.table:
flights0101 <- flights.dt[month == 1 & day == 1,]
# Filtering using dplyr:
flights0101 <- flights %>% filter(month == 1, day == 1)

Compare the speed and memory usage of these three approaches. Which has the best performance?

(Click here to go to the solution.)


  41. Do you really?

  42. base is automatically loaded when you start R, and contains core functions such as sqrt.

  43. Arguably the best add-on package for R.

  44. Unlike R, C is a low-level language that allows the user to write highly specialised (and complex) code to perform operations very quickly.

  45. The vectorised functions often use loops, but loops written in C, which are much faster.

  46. Actually, over the rows or columns of a matrix - apply converts the data frame to a matrix object.

  47. But only if your version of R has been compiled with memory profiling. If you are using a standard build of R, i.e. have downloaded the base R binary from R-project.org, you should be good to go. You can check that memory profiling is enabled by checking that capabilities("profmem") returns TRUE. If not, you may need to reinstall R if you wish to enable memory profiling.

  48. Since R 3.4.