Practical #8
Advanced Statistical Programming using R — Open Data & renv
Course Evaluation
Please take a few minutes to fill out the course evaluation survey before starting today’s practical.
Please evaluate the practicals led by Lisa and Leo only — not the full course.
Log in at lehrevaluation.uni-muenchen.de/evasys/public/online/index using the access code (Losung) E7D61, and submit your answers at the end of the questionnaire. The survey is anonymous.
Quiz
Before starting, work through this QUIZ to check your understanding of the concepts covered in this week’s lecture on open data and reproducible environments.
Overview
In this practical you will set up renv on your group project and work on documenting your dataset.
Work in your project group throughout. At least one person should have the group project open in RStudio.
The practical is structured into four parts:
- Proposal revision: use the feedback from yesterday’s lecture to sharpen your plan
- Set up renv: lock package versions in your group project repository
- Group project: audit your dataset’s documentation, write a data README, and review your repository structure
- Reflection log: record what you learned this week
Part 1: Proposal Revision
You received general feedback on your proposals in yesterday’s lecture. Before moving on to new material, take 20–30 minutes as a group to use that feedback to sharpen your plan and set yourselves up for the work ahead.
Open your proposal document in RStudio alongside yesterday’s slides.
Exercise 1.1: Commit to your research questions
Re-read your research questions and decide as a group which ones you will actually deliver by the final submission.
- For each question you keep, write one sentence describing the concrete output you will produce — a specific table, plot, or model result
- If you have more than two questions, prioritise: pick the ones you can execute well rather than attempting everything at a lower quality
- For each question, note which variables you will use — this will make it much easier to divide work and open GitHub Issues later in today’s session
“How does X relate to Y?” becomes actionable when you write: “We will fit a linear regression of Y on X, controlling for Z, and report a coefficient table and residual plot.” If you cannot write that sentence yet, the question needs more thought before anyone starts coding.
A good outcome: your proposal lists at most two research questions, each with a one-sentence output description and a list of the variables involved. Any question that would require more than two to three focused work sessions to complete has been cut or scoped down. This gives you a realistic plan for the coming weeks and a clear basis for assigning tasks.
Exercise 1.2: Evaluate your dataset
The dataset is the foundation of everything else. A clean, pre-packaged dataset limits what you can demonstrate — and a dataset that cannot actually answer your research questions will cause problems in every subsequent week.
Work through these questions as a group:
Is the dataset genuinely messy?
Load the raw file and run glimpse() and summary(). Are there encoding issues, inconsistent formatting, missing values, or columns that need transformation? If the data loads perfectly and every variable is already clean and well-named, it is probably not the right choice for this project.
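A minimal base-R sketch of such a messiness check follows. The embedded extract is invented for illustration (it is not the real file), and str() stands in for glimpse() so that the example has no package dependencies.

```r
# Hypothetical raw extract, embedded as text so the sketch runs anywhere.
raw_text <- "datum;zaehlstelle;gesamt;min-temp
2024.01.01;Arnulf;312;-1,5
2024.01.02;Arnulf;NA;0,3
2024.01.02;arnulf ;298;0,3"

# Read every column as character first so nothing is silently coerced.
raw <- read.csv(text = raw_text, sep = ";", colClasses = "character",
                check.names = FALSE, na.strings = "NA")

str(raw)                          # base R analogue of glimpse()
colSums(is.na(raw))               # missing values per column
table(raw$zaehlstelle)            # inconsistent spelling or stray whitespace shows up here
any(grepl(",", raw$`min-temp`))   # decimal commas must be converted before as.numeric()
```

If checks like these turn up nothing on your real file, that is a hint the dataset may be too clean for the project.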
Can it actually answer your research questions?
For each research question, identify the specific variables you would need. Are they in the dataset? Do they have enough variation and coverage to support the kind of claim you want to make? A dataset that only partially overlaps with your questions will produce results that are technically correct but not meaningful.
Can you say why you chose it?
Write one sentence explaining why this dataset — not just any dataset — is interesting for your questions. “It was available” is not a reason. “It contains X, which lets us examine Y, which matters because Z” is.
If your dataset is a well-known teaching dataset (e.g. palmerpenguins, nycflights13, gapminder) or was synthetically generated, consider whether it gives you enough to work with. These datasets are designed to be easy — which is the opposite of what makes a good portfolio piece.
There is no single right answer, but a group that has done this exercise well can say: “Our dataset has [specific messiness] that we will need to handle; it contains [specific variables] that map onto our research questions; and we chose it because [specific reason].” If any of those three sentences cannot be filled in, it is worth reconsidering the dataset now rather than in week 11.
Exercise 1.3: Make your analysis plan specific
Re-read your analysis plan. For each analytical step, you should be able to answer three questions: what exactly will you run, which variables, and why does this answer the research question?
For each step in your plan:
- Name the specific method and the specific variables — not “regression” but “regress Y on X1 and X2”
- Say what you will report — a coefficient table, a plot, a confidence interval
- Write one sentence connecting the result back to the research question: “This tells us whether X is associated with Y, which addresses RQ2 because…”
- Replace “we will clean the data” with the specific issue you know exists and how you plan to handle it
Stick to methods you have already studied. A simple linear regression with the right variables and a clear justification is more valuable than a method you cannot explain. During the oral exam you will be asked to defend every analytical choice — “the model said so” is not a sufficient answer.
If your proposal was drafted with heavy AI assistance, this is a good moment to make it yours. Read every sentence and ask: do I actually know what this means and why it is there? The coming weeks will cover more tools for writing targeted analyses — but you need to own the decisions, not just the output.
A specific plan looks like: “We will regress gesamt (total daily count) on sonnenstunden (sunshine hours) and niederschlag (precipitation), controlling for day of week. We will report a coefficient table and a residual plot. This addresses RQ1 because it quantifies the relationship between weather and cycling volume while accounting for weekly patterns.” Compare this to “We will use regression to analyse the effect of weather.” Both describe the same idea — but only the first gives your group members enough to start working independently, and only the first gives you something to defend. You will turn these steps into GitHub Issues at the end of Part 3.
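The plan above could be sketched in R roughly as follows. The data are simulated purely so the code runs: the coefficients in the data-generating step are invented, and wochentag is a hypothetical name for the day-of-week variable.

```r
set.seed(1)  # simulated data only; replace with your cleaned dataset

n <- 140
bikes <- data.frame(
  wochentag     = factor(rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                             length.out = n)),
  sonnenstunden = runif(n, 0, 12),
  niederschlag  = rexp(n, rate = 0.5)
)
# Invented data-generating process, purely so the model has something to fit.
bikes$gesamt <- 1500 + 90 * bikes$sonnenstunden - 40 * bikes$niederschlag +
  rnorm(n, sd = 200)

# Regress the daily count on weather, controlling for day of week.
fit <- lm(gesamt ~ sonnenstunden + niederschlag + wochentag, data = bikes)

# Coefficient table: estimate, standard error, t value, p value.
coef_table <- summary(fit)$coefficients
round(coef_table, 2)

# Residuals against fitted values: the residual plot named in the plan.
plot(fitted(fit), resid(fit), xlab = "Fitted gesamt", ylab = "Residual")
abline(h = 0, lty = 2)
```

Each line of the written plan maps onto one line of code here, which is a useful test of whether a plan is specific enough.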
Part 2: Set up renv
Without renv, everyone in your group installs packages independently. One person has dplyr 1.1.4, another has 1.0.9 — and a function that works on one machine produces a different result or a cryptic error on another. By the time you are debugging, it is rarely obvious that package versions are the cause.
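One quick way to check whether versions differ across machines is to have each group member print what they have installed. The sketch below uses only base R; installed_version is a hypothetical helper, and the package names are placeholders for whatever your scripts load.

```r
# Report the installed version of a package, or NA if it is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# "stats" and "utils" ship with R; swap in the packages your scripts load.
pkgs <- c("stats", "utils")
versions <- vapply(pkgs, installed_version, character(1))
versions

# Version objects compare component by component, not alphabetically.
package_version("1.1.4") > package_version("1.0.9")
```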
renv solves this by giving the project its own R library and recording the exact version of every package in a lockfile (renv.lock). Anyone who clones the repository runs renv::restore() and gets an identical environment. It also makes your work reproducible for anyone outside your group — a marker, a future employer, or yourself in six months.
Open your group project in RStudio via the .Rproj file before starting.
Exercise 2.1: Initialise renv
Run in the R console:
# Install renv if you do not have it
install.packages("renv")
# Initialise renv in the current project
renv::init()
After running, look at what changed in your project folder. Which files and folders did renv create?
renv::init() creates:
- renv/: the project-local library (packages install here instead of your global library)
- renv.lock: the lockfile, a JSON snapshot of every package name, version, and source
- .Rprofile: a small file that activates renv automatically whenever you open the project
renv also writes a renv/.gitignore file that excludes renv/library/ from version control. This is correct: the library contains compiled binaries that are platform-specific and can always be rebuilt from the lockfile.
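To make the lockfile less abstract, the sketch below parses a heavily trimmed, made-up lockfile and lists what it records. It assumes the jsonlite package is installed. A real renv.lock carries more fields per package (for example a hash) and usually many more packages.

```r
library(jsonlite)  # assumption: jsonlite is installed

# Made-up, trimmed lockfile content; a real renv.lock has more fields.
lock_text <- '{
  "R": {"Version": "4.4.1"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.4",
              "Source": "Repository", "Repository": "CRAN"},
    "readr": {"Package": "readr", "Version": "2.1.5",
              "Source": "Repository", "Repository": "CRAN"}
  }
}'

# Keep the nested list structure instead of simplifying to vectors.
lock <- fromJSON(lock_text, simplifyVector = FALSE)

# One row per locked package: name, version, and where it came from.
locked <- data.frame(
  package = vapply(lock$Packages, function(p) p$Package, character(1)),
  version = vapply(lock$Packages, function(p) p$Version, character(1)),
  source  = vapply(lock$Packages, function(p) p$Source, character(1)),
  row.names = NULL
)
locked
lock$R$Version
```

In your project you could pass the path "renv.lock" to fromJSON() instead of the embedded text to inspect the real file.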
Exercise 2.2: Check the status and snapshot
Run in the R console:
renv::status()
This compares what is installed in the project library against what is recorded in the lockfile.
If packages your project uses are missing from the lockfile, install them and snapshot:
# Install any packages your scripts use, e.g.:
install.packages(c("tidyverse", "skimr", "naniar", "visdat"))
# Record the current state in the lockfile
renv::snapshot()
What is the difference between renv::snapshot() and renv::restore()?
- renv::snapshot() writes the currently installed package versions to renv.lock. Run it after installing or updating packages to record the new state.
- renv::restore() installs the package versions recorded in the lockfile into the project library. Run it after cloning or pulling a repository to reproduce someone else’s environment.
They are inverses: snapshot() writes the lockfile from the library; restore() builds the library from the lockfile.
Exercise 2.3: Commit the lockfile to GitHub
Run in the terminal (not the R console):
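git add renv.lock .Rprofile renv/activate.R renv/settings.json renv/.gitignore
git commit -m "feat: initialise renv"
git push
renv.lock, .Rprofile, and renv/activate.R should always be committed and pushed; without renv/activate.R, the .Rprofile has nothing to source on a collaborator’s machine. Commit an updated renv.lock every time you add or update a package, and treat it the same way you would a DESCRIPTION file.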
Why is renv/library/ excluded from version control even though it contains the actual installed packages?
The renv/library/ folder contains compiled package binaries. These are:
- Platform-specific: macOS binaries will not run on Linux or Windows
- Large: a full tidyverse library can be several hundred MB
- Fully reproducible: any machine can rebuild them from renv.lock by running renv::restore()
Committing them would bloat the repository, cause merge conflicts across platforms, and provide no benefit — the lockfile already contains all the information needed.
Exercise 2.4: Test the restore
Have a different group member pull the latest changes, open the project in RStudio, and run in the R console:
renv::restore()
Confirm it succeeds by loading a package and running a short script from your project:
library(tidyverse)
dat <- read_csv("data/raw/your-data.csv")
glimpse(dat)
A successful renv::restore() prints something like:
- Restoring packages into 'renv/library'...
✓ Restored 42 packages.
If a package fails to install, it usually means a system-level dependency is missing (e.g. libxml2 for the xml2 package on Linux). Install the system library first and re-run renv::restore().
If renv::status() still reports issues after a restore, run renv::snapshot() on the machine with the full installation, commit the updated lockfile, and repeat.
Part 3: Group Project
Good data documentation answers five questions for someone who has never seen your data: what it is, where it comes from, what variables it contains, what its limitations are, and what processing has been applied. Most open datasets — including the ones you chose for your projects — only partially answer these questions.
In this part you will first audit what is and isn’t documented about your dataset, then use what you find to write a data/README.qmd directly in your project repository.
Use your group’s own dataset throughout. Open the original source page alongside the raw file in R. Make sure you are working inside your group project — with renv active from Part 2 — so that any packages you install here are recorded in the lockfile.
If your group does not yet have a dataset, you can work with the Munich bicycle counter daily values with weather, available at opendata.muenchen.de/dataset/daten-der-raddauerzaehlstellen-muenchen-2024 — download any monthly Tageswerte CSV (e.g. rad_2024_01_tage_korr.csv). We strongly recommend working with your own project dataset instead.
Exercise 3.1: Check the license
Find the license field on your dataset’s source page.
- What license does it use, and what does that allow you to do?
- Is attribution required? If so, what exactly must you write — is “Data from [source name]” sufficient?
- Are there any restrictions on commercial use or redistribution?
The bicycle counter dataset uses Datenlizenz Deutschland – Namensnennung – Version 2.0 (dl-de/by-2-0). This is a German open government licence roughly equivalent to CC BY 4.0: you can copy, use, and redistribute the data for any purpose, including commercially, as long as you attribute the source.
Attribution must be explicit. “Data from Munich Open Data” is not sufficient — the licence requires naming the publisher, the dataset title, the URL, and the licence: “Landeshauptstadt München, Daten der Raddauerzählstellen München, opendata.muenchen.de, Datenlizenz Deutschland Namensnennung 2.0.”
Exercise 3.2: Identify documentation gaps in the raw file
library(readr)
library(dplyr)
dat <- read_csv("your-data-file.csv")
glimpse(dat)
If you need to install packages here (e.g. janitor, skimr), do so inside the project and run renv::snapshot() in the R console afterwards to record them in the lockfile.
Look at the column names and types. Are there any you could not explain to someone else — where the meaning, units, or valid values are not obvious from the name alone?
Write down at least three columns where the documentation is incomplete or unclear. You will fill these in when you write the README.
The bicycle counter file has 12 columns. Several have documentation gaps:
- richtung_1 / richtung_2: directional counts, but which direction is “1”? This varies by station. At Arnulfstraße, richtung_2 counts cyclists going against the direction of travel, which is noted only in a prose aside on the portal page, not in the file.
- min-temp / max-temp: temperature, but units are not stated anywhere. Presumably °C.
- niederschlag: precipitation; units not stated (mm, presumably).
- bewoelkung: cloud cover; units not stated (%, presumably).
Note also that min-temp and max-temp contain hyphens — not valid in R variable names. Use janitor::clean_names() before working with them.
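The sketch below shows why the hyphens get in the way, using two made-up rows. To stay dependency-free it renames the columns with base R's gsub(); for these two names the result matches what janitor::clean_names() produces, though clean_names() handles many more cases.

```r
# check.names = FALSE keeps the hyphenated names exactly as they appear in the file.
dat <- data.frame(`min-temp` = c(-1.5, 0.3), `max-temp` = c(4.2, 6.8),
                  check.names = FALSE)

# Without backticks, dat$min-temp parses as (dat$min) - temp, not as the column.
dat$`min-temp`

# Base R stand-in for the renaming step: hyphens become underscores.
names(dat) <- gsub("-", "_", names(dat), fixed = TRUE)
names(dat)
dat$min_temp
```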
Exercise 3.3: Identify structural gaps
Look at the source page for your dataset:
- Is there a variable dictionary or codebook anywhere — something that formally defines what each column means?
- Are any known data quality issues documented (sensor outages, missing periods, revisions)?
- If you needed to answer a question about the data that the file itself doesn’t answer, where would you look? Is that link documented anywhere?
The bicycle counter portal page has inline station-specific notes, but no formal variable dictionary. The notes are written in prose and easy to miss when downloading.
Two stations are documented as absent from the 2024 data entirely (Bad-Kreuther-Str. and Margaretenstr.), and Erhardtstraße had a sensor defect from September to November 2024. None of this is flagged in the data files themselves — the rows simply do not appear. A user who loaded the file without reading the portal page would have no way of knowing those stations ever existed.
These are exactly the gaps your data/README.qmd should fill.
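Rows that simply do not appear can be surfaced by comparing the data against every station-day combination that should exist. The sketch below does this in base R on a small invented extract; the station labels "A" and "B" and all counts are hypothetical.

```r
# Hypothetical cleaned extract: station "B" has no row for 2024-01-02.
dat <- data.frame(
  datum       = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                          "2024-01-01", "2024-01-03")),
  zaehlstelle = c("A", "A", "A", "B", "B"),
  gesamt      = c(310L, 295L, 402L, 120L, 133L)
)

# Every station-day combination that should exist if nothing were missing.
expected <- expand.grid(
  datum       = seq(min(dat$datum), max(dat$datum), by = "day"),
  zaehlstelle = unique(dat$zaehlstelle),
  stringsAsFactors = FALSE
)

# Left join, then keep combinations without an observed count.
# Note: rows present in the file with gesamt = NA are flagged as well.
joined  <- merge(expected, dat, by = c("datum", "zaehlstelle"), all.x = TRUE)
missing <- joined[is.na(joined$gesamt), c("datum", "zaehlstelle")]
missing
```

This only finds gaps for stations that occur at least once in the file. Stations that are absent entirely can only be detected against an external station list, which is one more reason to record them in the README.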
Exercise 3.4: Write a data README
Create data/README.qmd in your project repository. Use the gaps you identified in Exercises 3.2 and 3.3 to guide what you write.
Start with a 2–4 sentence overview, then add a variable dictionary as a Markdown table covering your key variables.
Use skimr::skim() in the R console to get types, missingness counts, and ranges quickly:
library(skimr)
skim(dat)
Use the output to fill in the table, translating it into readable descriptions rather than pasting it directly.
Example overview:
The dataset contains daily bicycle counts from automated induction-loop stations operated by the Landeshauptstadt München, covering January–December 2024. Each row represents one counting station on one calendar day, combined with daily weather observations for Munich. The data are published under Datenlizenz Deutschland Namensnennung 2.0 and were downloaded from the Munich Open Data Portal on 2026-06-11.
Example variable dictionary:
| Variable | Type | Description | Example / range |
|---|---|---|---|
| datum | <date> | Observation date | 2024-01-01 – 2024-12-31 |
| zaehlstelle | <chr> | Counting station name | “Arnulf”, “Olympia” |
| gesamt | <int> | Total daily bicycle count (both directions) | 78 – 4,821 |
| min_temp | <dbl> | Daily minimum temperature (°C) | −10.2 – 23.4 |
| niederschlag | <dbl> | Daily precipitation (mm) | 0.0 – 38.1 |
| sonnenstunden | <dbl> | Daily sunshine duration (hours) | 0.0 – 14.9 |
The goal is human-readable documentation. “Daily precipitation in mm, range 0–38” is useful. “niederschlag, 0–38” is not.
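If skimr is not available, a skeleton for the dictionary can be generated with base R, as in the sketch below. The three rows of data are invented, and describe_column is a hypothetical helper. The description column is deliberately left for you to write by hand.

```r
# Hypothetical cleaned data; in the project this would be your own dat.
dat <- data.frame(
  datum        = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  zaehlstelle  = c("Arnulf", "Olympia", "Arnulf"),
  gesamt       = c(310L, NA, 402L),
  niederschlag = c(0.0, 3.2, 11.5)
)

# Summarise one column: first class, missing count, and observed values or range.
describe_column <- function(x) {
  obs <- x[!is.na(x)]
  rng <- if (is.character(x)) {
    paste(sort(unique(obs)), collapse = ", ")
  } else {
    paste(trimws(format(range(obs))), collapse = " to ")
  }
  data.frame(type = class(x)[1], n_missing = sum(is.na(x)), range = rng)
}

# One row per variable, with a placeholder for the human-written description.
dictionary <- do.call(rbind, lapply(dat, describe_column))
dictionary <- cbind(variable = rownames(dictionary), dictionary,
                    description = "TODO: write by hand")
rownames(dictionary) <- NULL
dictionary
```

The generated columns cover only what the computer can see. Units, meaning, and known quality issues still have to come from the source page and your own audit.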
Exercise 3.5: Review your repository structure
Check your project repository against the requirements for the final submission. Open the repository on GitHub and work through this checklist:
For any item that is missing, create the file or folder now and commit a placeholder.
A minimal contribution statement placeholder:
## Alice — Contribution Statement
*To be completed before the final submission deadline.*
**Primarily responsible for:**
**Key tasks:**
**Collaborative tasks:**
Commit it as contributions/alice.qmd (one file per group member).
Exercise 3.6: Plan your next analysis step
Review your proposal’s analysis plan and decide on the single most important next step. Open a GitHub Issue in your repository describing:
- What you are trying to answer or produce
- Which variables and dataset you will use
- Who is responsible
- A rough estimate of when it will be done
GitHub Issues are a natural record of project decisions. Opening issues and assigning them to people makes it easier for every group member to commit directly — which the project guidelines require.
Example issue title: “Plot seasonal trend in bicycle counts by station”
Example issue body:
Goal: Produce a faceted line chart showing monthly totals per station, to address RQ2 from the proposal.
Data: data/processed/bikes_clean.rds, variables datum, zaehlstelle, gesamt
Owner: Alice
Target: Draft ready for group review by Jun 18
Assign the issue to the responsible group member before closing the practical session.
Part 4: Reflection Log
Take a few minutes at the end of the session to add this week’s entry to your reflection log. Then commit and push everything — renv.lock, data/README.qmd, any new files from Exercise 3.5, and the reflection log entry can all go in one commit.
End-of-practical checklist
Before you leave today, make sure your group has:
- Proposal revised: research questions narrowed, dataset description improved, analysis plan made specific
- renv.lock committed and pushed, with renv::status() reporting no issues
- At least one group member has run renv::restore() from a clean pull without errors
- data/README.qmd created with a dataset overview, variable dictionary, and at least two quality notes
- Project repository structure reviewed; any missing folders or placeholder files committed
- At least one GitHub Issue opened and assigned for the next analysis step
- This week’s reflection log entry committed and pushed
Resources
Reproducible environments
- Introduction to renv — official getting-started guide
- Shannon Pileggi, renv — posit::conf(2025) — 20-minute talk covering renv in team workflows
Open data
- Munich Open Data Portal — all datasets used in this course
- FAIR Principles — Findable, Accessible, Interoperable, Reusable; the standard framework for evaluating open datasets
- Creative Commons licence chooser — understand what each CC licence allows
- The Turing Way — Open Data — practical guidance on working with and publishing open data
Data documentation
- The Turing Way — Research Data Management — covers documentation, file organisation, and metadata
- WCD Open Data lecture notes — used in preparing this session