f1 <- function(string, prefix) {
str_sub(string, 1, str_length(prefix)) == prefix
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}Practical #2
Advanced Statistical Programming using R - Introduction
THE PRACTICAL NEXT WEEK WILL BE ON THURSDAY, APRIL 30th, RIGHT AFTER THE LECTURE!!
Since May 1st is a public holiday, we needed to reschedule.
Quiz
Before starting, work through this QUIZ to check your understanding of the concepts covered in the lecture, as well as preliminary knowledge about R.
General Remarks
In this practical, we will take a look at how to write functions in R, and why they are useful for data science tasks. We will cover the following topics:
- Naming conventions for functions and their arguments
- Why is it important to choose meaningful names for functions and their arguments?
- What are some common naming conventions in R (e.g. snake_case, camelCase, etc.) and when should you use them?
- Refactoring code into functions to make it more modular and reusable
- How can you identify which parts of your code can be turned into functions?
- What are the benefits of refactoring code into functions (e.g. improved readability, easier maintenance, reusability)?
- Error handling in functions
- How can you use mechanisms such as
stop()to handle errors in your functions? - What are some best practices for error handling in R (e.g. providing informative error messages, using assertions to check input validity)?
- Recursion and iteration
- What is the difference between recursion and iteration, and when should you use each approach?
- How can you implement recursive functions in R, and what are some common pitfalls to avoid (e.g. stack overflow errors)?
Exercise 1: Naming Functions
A good function name is a verb or short phrase that tells you what the function does without having to read its body. Badly named functions (f1, do_stuff, helper2) force every reader — including future-you — to re-derive intent from the code each time.
The exercises below are adapted from R4DS, Chapter 26.
- Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
Ask yourself:
- Does the name start with a verb (
compute_,plot_,load_,filter_)? - Is it specific enough that you could distinguish it from a sibling function (
plot_bills()vs.plot_penguins())? - Are the arguments named by what they are (
data,species,x_col) rather than where they came from (df1,tmp,input)?
- Make a case for why
norm_r(),norm_d()etc. would be better thanrnorm(),dnorm()1. Make a case for the opposite. How could you make the names even clearer?
Exercise 2: Using Functions to Package Code
Last week you wrote a bit of code to filter and plot the palmerpenguins data. Copy-pasting that code every time you want the same plot for a different species is a recipe for bugs — the moment you change one copy and forget the others, your plots start disagreeing. Turning the code into a function fixes this: there is one place that defines the behaviour, and you invoke it with different arguments.
- Take a look at the following code. Wrap this code in a function with a fitting name and appropriate arguments.
library(palmerpenguins)
library(tidyverse)
penguins %>%
filter(species == "Adelie") %>%
ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
labs(title = "Bill Length vs. Bill Depth for Adelie Penguins",
x = "Bill Length (mm)",
y = "Bill Depth (mm)")What pieces of this code would you want to vary between calls? The species is an obvious candidate. What about the two axis variables? And the title?
- Add a docstring to your function that explains what the function does, its arguments, and its return value.
In R, docstrings are written as comments starting with #' directly above the function definition, using roxygen2 tags like @param, @return, @examples. See the solution to Exercise 7 below for a fuller example.
Test your function by calling it with different arguments and checking the output. Make sure to test edge cases:
- What happens if you pass in a species name that doesn’t exist (e.g.
"Penguin")? - What happens if you pass a column name that isn’t in the data (e.g.
x_var = "fluffiness")? - What happens if you pass a non-numeric column (e.g.
x_var = "island")?
Write down what you expect to happen for each case before running it — are the actual errors informative? We’ll fix the bad ones in Exercise 3.
- What happens if you pass in a species name that doesn’t exist (e.g.
Split your function into smaller helper functions. One function should only perform one task.
Then rewrite
plot_penguin_bills()to just call these two helpers in sequence. The top-level function becomes shorter and easier to read, and each helper is independently testable.
Exercise 3: Error Handling
In Exercise 2, you probably noticed that passing a non-existent species name returns an empty plot with no explanation, and passing a bad column name gives a cryptic object not found error. A well-written function should fail fast with an informative message rather than produce silently broken output.
R has a few tools for this:
stop("message")throws an error with a custom message.stopifnot(condition1, condition2, ...)throws an error if any condition isFALSE. It’s a compact way to write several assertions at the top of a function.- (Extension)
cli::cli_abort()produces prettier, multi-line error messages with argument highlighting — it’s what the tidyverse uses internally.
Tasks
- Add input validation to
plot_penguin_bills()from Exercise 2 so that it fails with a useful message when:speciesis not one of the three valid species ("Adelie","Chinstrap","Gentoo").x_varory_varis not a column of thepenguinsdata frame.x_varory_varrefers to a non-numeric column.
- Compare the error messages you get now vs. before. Which version would be easier to debug at 11pm the night before an assignment deadline?
cli
The cli package (covered in the lecture) lets you write multi-line error messages with argument highlighting:
library(cli)
if (!species %in% valid_species) {
cli_abort(c(
"{.arg species} must be one of {.val {valid_species}}.",
"x" = "You supplied {.val {species}}."
))
}This is the style used throughout the tidyverse.
Exercise 4: Recursion and Iteration
In this section we’ll take a look at how to compute 2 raised to the power of 10 (\(2^{10}\)). There are two common approaches to solve this problem:
Approach 1: Repeat 10 Times
\(2^{10}\) can be computed by multiplying 2 by itself 10 times: \(\Pi_{i=1}^{10} 2 = 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 1024\). This is an example of iteration, where we repeat an operation a certain number of times.
Iteration is a programming technique where a block of code is repeated a certain number of times or until a certain condition is met. An iterative function typically uses loops (e.g. for, while).
- The same function \(f(a, b) = a^b\) can be implemented iteratively as follows:
- we initialize a variable
resultto 1 - Then, in a loop that runs
btimes, - we multiply result by
a.
- we initialize a variable
# Example of an iterative function to compute a to the power of b
power_iterative <- function(a, b) {
# We initialize a variable `result` to 1, which will hold the final result of a^b.
result <- 1
# Then, we use a for loop that runs `b` times.
# 1:b creates a sequence of numbers from 1 to b, and the loop iterates over this sequence.
for (i in 1:b) {
# In each iteration, we update `result` by multiplying it with `a`.
# After the loop finishes, `result` will contain the value of a^b.
result <- result * a
}
# Finally, we return the computed result.
return(result)
}
power_iterative(2, 3) # This will compute 2^3 = 8Iteration can be more efficient and avoid stack overflow errors, but it can also be less elegant and harder to read.
Approach 2: Well, I know \(2^9\) …
\(2^{10}\) can also be computed as \(2 * 2^9 = 2 * (2 * 2^8) = 2 * (2 * (2 * 2^7))\), and so on, until we reach the base case of \(2^0 = 1\). Formally, we have \(2^{n} = 2 * 2^{n-1}\) if \(n > 0\), and \(2^{0} = 1\). This is an example of recursion, where a function calls itself to solve a problem.
Recursion is a programming technique where a function calls itself in order to solve a problem. A recursive function typically consists of a recursive case (also called the recursion step) that breaks the problem into smaller subproblems and a base case that stops the recursion.
- Take the function \(f(a, b) = a^b\)
- This function can be implemented recursively as follows:
- Base case \(f(a, 0) = 1\)
- Recursive case \(f(a, b) = a * f(a, b-1)\)
# Example of a recursive function to compute a to the power of b
power_recursive <- function(a, b) {
if (b == 0) {
# This is the so called base case, which is necessary to prevent infinite recursion
# It defines, when the function should stop calling itself and return a value.
# In this case, any number to the power of 0 is 1, so we return 1.
return(1)
} else {
# This is the recursion step, where the function calls itself with modified arguments.
# In this case, we multiply `a` by the result of calling `power_recursive` with `b`
# decremented by 1, which effectively reduces the problem size with each recursive call.
return(a * power_recursive(a, b - 1))
}
}
power_recursive(2, 3) # This will compute 2^3 = 8Recursion can be more elegant and easier to read, but it can also be less efficient and lead to stack overflow errors if the recursion depth is too large.

Task: The Fibonacci Sequence
Do you remember the Fibonacci sequence from your math classes? The Fibonacci sequence is defined as follows:
- \(F(0) = 0\), \(F(1) = 1\), and \(F(n) = F(n-1) + F(n-2)\) for \(n > 1\), so:
- \(F(2) = F(1) + F(0) = 1 + 0 = 1\)
- \(F(3) = F(2) + F(1) = 1 + 1 = 2\)
- \(F(4) = F(3) + F(2) = 2 + 1 = 3\)
- \(F(5) = F(4) + F(3) = 3 + 2 = 5\)
- …

The Fibonacci sequence is a classic example of a problem that can be solved using both iteration and recursion, and it illustrates the trade-offs between these two approaches.
- Write a function in R to compute the nth Fibonacci number iteratively.
- Write a function in R to compute the nth Fibonacci number recursively.
- Compare the performance of the iterative and recursive implementations by computing the 30th Fibonacci number. What is the difference? Why does this difference occur?
Exercise 5: Refactoring
A guiding principle: DRY — Don’t Repeat Yourself. If the same block of code appears in more than one place, there’s one behaviour with multiple copies. Change one, forget the others, and they drift out of sync.
Task 1: Spot the repetition
Consider the following function that produces a small summary for a single species:
summarise_species <- function(species_name) {
adelie_mean_bill <- mean(penguins$bill_length_mm[penguins$species == "Adelie"], na.rm = TRUE)
gentoo_mean_bill <- mean(penguins$bill_length_mm[penguins$species == "Gentoo"], na.rm = TRUE)
chinstrap_mean_bill <- mean(penguins$bill_length_mm[penguins$species == "Chinstrap"], na.rm = TRUE)
adelie_mean_mass <- mean(penguins$body_mass_g[penguins$species == "Adelie"], na.rm = TRUE)
gentoo_mean_mass <- mean(penguins$body_mass_g[penguins$species == "Gentoo"], na.rm = TRUE)
chinstrap_mean_mass <- mean(penguins$body_mass_g[penguins$species == "Chinstrap"], na.rm = TRUE)
if (species_name == "Adelie") {
cat("Adelie — mean bill:", adelie_mean_bill, "mean mass:", adelie_mean_mass, "\n")
} else if (species_name == "Gentoo") {
cat("Gentoo — mean bill:", gentoo_mean_bill, "mean mass:", gentoo_mean_mass, "\n")
} else if (species_name == "Chinstrap") {
cat("Chinstrap — mean bill:", chinstrap_mean_bill, "mean mass:", chinstrap_mean_mass, "\n")
}
}- Identify every piece of repetition in this function. How many distinct ideas are actually being expressed?
- Rewrite the function so that:
- the per-species means are not hard-coded,
- there’s no long
if/elsechain onspecies_name, - the code would still work if a fourth species showed up in the data.
Task 2: Refactor a long function into helpers
Take the following monolithic function. It loads a dataset, cleans it, summarises it, and plots it — all in one place.
penguin_report <- function(species_name) {
# load
library(palmerpenguins)
library(tidyverse)
data <- penguins
# clean
data <- data[!is.na(data$bill_length_mm) & !is.na(data$body_mass_g), ]
# filter
data <- data[data$species == species_name, ]
# summarise
cat("n =", nrow(data), "\n")
cat("mean bill length:", mean(data$bill_length_mm), "\n")
cat("mean body mass:", mean(data$body_mass_g), "\n")
# plot
ggplot(data, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
labs(title = paste("Bill length vs. body mass —", species_name))
}
penguin_report("Gentoo")Refactor it into at least three helper functions. Each helper should do one thing, and the top-level penguin_report() should read like a recipe: load, then clean, then summarise, then plot.
Each helper is now independently testable, and the top-level function reads top-to-bottom as the story of the analysis. :::
Exercise 6: Reflection Log
- Take some time to write this week’s reflection log.
- Add and commit your reflection log to your Git repository (the one we initialized last week).
(Optional) Exercise 7: Freestyle
Write a function that takes the starwars dataset as input and performs some data transformation and visualization tasks. You can choose any dataset and any transformations/visualizations you like. The goal is to practice writing functions that are useful for data science tasks.
Follow these principles when writing your function: - Modularity: Don’t put one large bulk of code in one function. Instead, break it down into smaller helper functions that perform specific tasks. This will make your code easier to read and maintain. - Documentation: Add docstrings to your functions that explain what the function does, its arguments, and its return value. This will make it easier for others (or yourself in the future) to understand how to use the function. - Testing: Test each function with different inputs to ensure it works correctly. Consider edge cases (e.g. empty datasets, missing values) to make sure your function is robust. - Reusability: Design your functions to be reusable in different contexts. Avoid hardcoding specific values or assumptions that limit the function’s applicability. Instead use parameters to allow for flexibility in how the function can be used. E.g. - instead of hardcoding the name of a column to plot, allow the user to specify which column they want to visualize. - instead of hardcoding the type of plot, allow the user to specify which type of plot they want to create (e.g. scatter plot, bar plot, histogram).
Start with writing down the code and then identify which parts of the code can be turned into functions. For example, you might have a block of code that loads the data, another block that performs data transformations, and another block that creates a plot. Each of these blocks can be turned into a separate function. Then, you can write a main function that calls these helper functions in sequence to perform the entire analysis. This way, you can reuse the helper functions in other contexts, and your main function will be easier to read and understand.
Resources
- R4DS Ch. 25–27: Functions, Iteration
- Advanced R, Ch. 6: Functions (depth reference)
- https://fun2debug.njtierney.com - A blog by Nicholas J. Tierney about writing functions in R, with a focus on making them easy to debug and maintain.
Footnotes
rnorm()generates random numbers from the normal distribution, anddnorm()computes the density (probability density function) of the normal distribution.↩︎