Practical #8

Advanced Statistical Programming using R — Open Data & renv

Author

Leonhard Kestel, Lisa Bondo Andersen, Cynthia Huang

Published

June 11, 2026

Course Evaluation

Please take a few minutes to fill out the course evaluation survey before starting today’s practical.

Important

Please evaluate the practicals led by Lisa and Leo only — not the full course.

Log in at lehrevaluation.uni-muenchen.de/evasys/public/online/index using the access code (Losung) E7D61, and submit your answers at the end of the questionnaire. The survey is anonymous.

Quiz

Before starting, work through this QUIZ to check your understanding of the concepts covered in this week’s lecture on open data and reproducible environments.

Overview

In this practical you will set up renv on your group project and work on documenting your dataset.

Work in your project group throughout. At least one person should have the group project open in RStudio.

The practical is structured into four parts:

Proposal revision — use the feedback from yesterday’s lecture to sharpen your plan
Set up renv — lock package versions in your group project repository
Group project — audit your dataset’s documentation, write a data README, and review your repository structure
Reflection log — record what you learned this week

Part 1: Proposal Revision

You received general feedback on your proposals in yesterday’s lecture. Before moving on to new material, take 20–30 minutes as a group to use that feedback to sharpen your plan and set yourselves up for the work ahead.

Open your proposal document in RStudio alongside yesterday’s slides.

Exercise 1.1: Commit to your research questions

Re-read your research questions and decide as a group which ones you will actually deliver by the final submission.

For each question you keep, write one sentence describing the concrete output you will produce — a specific table, plot, or model result
If you have more than two questions, prioritise: pick the ones you can execute well rather than attempting everything at a lower quality
For each question, note which variables you will use — this will make it much easier to divide work and open GitHub Issues later in today’s session

Tip 1

“How does X relate to Y?” becomes actionable when you write: “We will fit a linear regression of Y on X, controlling for Z, and report a coefficient table and residual plot.” If you cannot write that sentence yet, the question needs more thought before anyone starts coding.

Tip 2

A good outcome: your proposal lists at most two research questions, each with a one-sentence output description and a list of the variables involved. Any question that would require more than two to three focused work sessions to complete has been cut or scoped down. This gives you a realistic plan for the coming weeks and a clear basis for assigning tasks.

Exercise 1.2: Evaluate your dataset

The dataset is the foundation of everything else. A clean, pre-packaged dataset limits what you can demonstrate — and a dataset that cannot actually answer your research questions will cause problems in every subsequent week.

Work through these questions as a group:

Is the dataset genuinely messy?

Load the raw file and run glimpse() and summary(). Are there encoding issues, inconsistent formatting, missing values, or columns that need transformation? If the data loads perfectly and every variable is already clean and well-named, it is probably not the right choice for this project.
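
As a rough sketch of what these checks can look like, the example below reads a tiny invented CSV string instead of a real file. The column names and values are made up for illustration, and str() stands in for glimpse() so that the sketch needs only base R.

```r
# Toy stand-in for a raw file; every value here is invented for illustration.
raw_text <- "Datum;Zaehlstelle;gesamt;min-temp
01.01.2024;Arnulf;312;-1,5
02.01.2024;arnulf ;NA;0,3
03.01.2024;Olympia;k.A.;2,1"

# Read everything as text first so nothing is silently coerced.
dat <- read.csv(text = raw_text, sep = ";", colClasses = "character",
                check.names = FALSE, na.strings = "NA")

str(dat)       # base-R counterpart of glimpse()
summary(dat)

# Missing values per column
n_missing <- colSums(is.na(dat))

# Non-missing entries that would fail numeric conversion
is_bad <- !is.na(dat$gesamt) & is.na(suppressWarnings(as.numeric(dat$gesamt)))
bad_counts <- dat$gesamt[is_bad]

# Inconsistent labels: the same station spelled in different ways
station_variants <- unique(dat$Zaehlstelle)
```

If checks like these turn up nothing at all on your real file, that is a hint the dataset may be too clean for this project.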

Can it actually answer your research questions?

For each research question, identify the specific variables you would need. Are they in the dataset? Do they have enough variation and coverage to support the kind of claim you want to make? A dataset that only partially overlaps with your questions will produce results that are technically correct but not meaningful.

Can you say why you chose it?

Write one sentence explaining why this dataset — not just any dataset — is interesting for your questions. “It was available” is not a reason. “It contains X, which lets us examine Y, which matters because Z” is.

Tip 1

If your dataset is a well-known teaching dataset (e.g. palmerpenguins, nycflights13, gapminder) or was synthetically generated, consider whether it gives you enough to work with. These datasets are designed to be easy — which is the opposite of what makes a good portfolio piece.

Tip 2

There is no single right answer, but a group that has done this exercise well can say: “Our dataset has [specific messiness] that we will need to handle; it contains [specific variables] that map onto our research questions; and we chose it because [specific reason].” If any of those three sentences cannot be filled in, it is worth reconsidering the dataset now rather than in week 11.

Exercise 1.3: Make your analysis plan specific

Re-read your analysis plan. For each analytical step, you should be able to answer three questions: what exactly will you run, which variables, and why does this answer the research question?

For each step in your plan:

Name the specific method and the specific variables — not “regression” but “regress Y on X1 and X2”
Say what you will report — a coefficient table, a plot, a confidence interval
Write one sentence connecting the result back to the research question: “This tells us whether X is associated with Y, which addresses RQ2 because…”
Replace “we will clean the data” with the specific issue you know exists and how you plan to handle it

Tip 1

Stick to methods you have already studied. A simple linear regression with the right variables and a clear justification is more valuable than a method you cannot explain. During the oral exam you will be asked to defend every analytical choice — “the model said so” is not a sufficient answer.

Important

If your proposal was drafted with heavy AI assistance, this is a good moment to make it yours. Read every sentence and ask: do I actually know what this means and why it is there? The coming weeks will cover more tools for writing targeted analyses — but you need to own the decisions, not just the output.

Tip 2

A specific plan looks like: “We will regress gesamt (total daily count) on sonnenstunden (sunshine hours) and niederschlag (precipitation), controlling for day of week. We will report a coefficient table and a residual plot. This addresses RQ1 because it quantifies the relationship between weather and cycling volume while accounting for weekly patterns.” Compare this to “We will use regression to analyse the effect of weather.” Both describe the same idea — but only the first gives your group members enough to start working independently, and only the first gives you something to defend. You will turn these steps into GitHub Issues at the end of Part 3.
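
A plan written at this level of detail maps almost directly onto code. The sketch below is one possible translation, assuming a day-of-week factor called wochentag (a hypothetical name) and using simulated values rather than the real bicycle counts.

```r
# Simulated stand-in data: values are random, not real bicycle counts.
set.seed(1)
n <- 140
dat <- data.frame(
  datum         = as.Date("2024-01-01") + 0:(n - 1),
  sonnenstunden = runif(n, 0, 12),
  niederschlag  = rexp(n, rate = 0.5)
)
# Day of week as a factor: 1 = Monday, ..., 7 = Sunday (hypothetical column)
dat$wochentag <- factor(format(dat$datum, "%u"))
dat$gesamt <- round(1000 + 80 * dat$sonnenstunden - 40 * dat$niederschlag +
                      rnorm(n, sd = 150))

# Regress gesamt on sunshine and precipitation, controlling for day of week
fit <- lm(gesamt ~ sonnenstunden + niederschlag + wochentag, data = dat)

# Coefficient table: estimate, standard error, t value, p value
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# Residual plot: residuals against fitted values
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Because the data are simulated, the numbers themselves mean nothing; the point is that each sentence of the plan corresponds to one identifiable line of code.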

Part 2: Set up `renv`

Without renv, everyone in your group installs packages independently. One person has dplyr 1.1.4, another has 1.0.9 — and a function that works on one machine produces a different result or a cryptic error on another. By the time you are debugging, it is rarely obvious that package versions are the cause.

renv solves this by giving the project its own R library and recording the exact version of every package in a lockfile (renv.lock). Anyone who clones the repository runs renv::restore() and gets the same package versions. Note that renv records the R version in the lockfile but does not install R itself or any system libraries. It also makes your work much easier to reproduce for anyone outside your group: a marker, a future employer, or yourself in six months.

Open your group project in RStudio via the .Rproj file before starting.

Exercise 2.1: Initialise `renv`

Run in the R console:

# Install renv if you do not have it
install.packages("renv")

# Initialise renv in the current project
renv::init()

After running, look at what changed in your project folder. Which files and folders did renv create?

Solution

renv::init() creates:

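renv/: the renv infrastructure folder. It holds the project-local library in renv/library/ (packages install here instead of your global library), the activation script renv/activate.R, and a settings file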
renv.lock — the lockfile: a JSON snapshot of every package name, version, and source
.Rprofile — a small file that activates renv automatically whenever you open the project

renv also writes its own renv/.gitignore, which excludes renv/library/ from version control. This is correct: the library contains compiled binaries that are platform-specific and can always be rebuilt from the lockfile.
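
Because the lockfile is plain JSON, you can inspect it from R. The helper below is a minimal sketch, assuming the jsonlite package is installed; read_lock_versions is a hypothetical name, not part of renv.

```r
# Sketch: list the packages and versions recorded in a lockfile.
# Assumes jsonlite is installed; the function name is made up for illustration.
read_lock_versions <- function(path = "renv.lock") {
  lock <- jsonlite::fromJSON(path, simplifyVector = FALSE)
  pkgs <- lock$Packages
  data.frame(
    package = names(pkgs),
    version = unname(vapply(pkgs, function(p) p$Version, character(1))),
    row.names = NULL
  )
}

# Only runs when a lockfile exists in the working directory
if (file.exists("renv.lock")) {
  print(head(read_lock_versions()))
}
```

Reading the lockfile this way is only for inspection; changes to it should still go through renv::snapshot().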

Exercise 2.2: Check the status and snapshot

Run in the R console:

renv::status()

This reports any inconsistencies between the packages your code uses, the packages installed in the project library, and the versions recorded in the lockfile.

If packages your project uses are missing from the lockfile, install them and snapshot:

# Install any packages your scripts use, e.g.:
install.packages(c("tidyverse", "skimr", "naniar", "visdat"))

# Record the current state in the lockfile
renv::snapshot()
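
To see which packages your scripts actually reference before installing and snapshotting, renv can scan the project for library() and pkg::fun() calls. The sketch below wraps renv::dependencies() in a small helper; used_packages is a hypothetical name.

```r
# Sketch: which packages does the code under `path` reference?
# The helper name is made up; renv::dependencies() does the scanning.
used_packages <- function(path = ".") {
  deps <- renv::dependencies(path = path, quiet = TRUE, progress = FALSE)
  sort(unique(deps$Package))
}

if (requireNamespace("renv", quietly = TRUE)) {
  used <- used_packages(".")
  print(used)

  # Referenced in code but not installed in the active library
  print(setdiff(used, rownames(installed.packages())))
}
```

Anything listed by the last line is a candidate for install.packages() followed by renv::snapshot().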

What is the difference between renv::snapshot() and renv::restore()?

Solution

renv::snapshot() records the currently installed versions of the packages your project uses in renv.lock. Run it after installing or updating packages to record the new state.
renv::restore() installs the package versions recorded in the lockfile into the project library. Run it after cloning or pulling a repository to reproduce someone else’s environment.

They are inverses: snapshot() writes the lockfile from the library; restore() builds the library from the lockfile.

Exercise 2.3: Commit the lockfile to GitHub

Run in the terminal (not the R console):

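git add renv.lock .Rprofile renv/
git commit -m "feat: initialise renv"
git push

Tip

renv.lock, .Rprofile, and renv/activate.R should always be committed and pushed; .Rprofile sources renv/activate.R, so the project cannot activate on a fresh clone without it. Adding the whole renv/ folder is safe because renv/.gitignore excludes the library. Commit an updated renv.lock every time you add or update a package, and treat it the same way you would a DESCRIPTION file.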

Why is renv/library/ excluded from version control even though it contains the actual installed packages?

Solution

The renv/library/ folder contains compiled package binaries. These are:

Platform-specific: macOS binaries will not run on Linux or Windows
Large: a full tidyverse library can be several hundred MB
Fully reproducible: any machine can rebuild them from renv.lock by running renv::restore()

Committing them would bloat the repository, cause merge conflicts across platforms, and provide no benefit — the lockfile already contains all the information needed.

Exercise 2.4: Test the restore

Have a different group member pull the latest changes, open the project in RStudio, and run in the R console:

renv::restore()

Confirm it succeeds by loading a package and running a short script from your project:

library(tidyverse)
dat <- read_csv("data/raw/your-data.csv")
glimpse(dat)

Solution

A successful renv::restore() prints something like:

- Restoring packages into 'renv/library'...
✓ Restored 42 packages.

If a package fails to install, it usually means a system-level dependency is missing (e.g. libxml2 for the xml2 package on Linux). Install the system library first and re-run renv::restore().

If renv::status() still reports issues after a restore, run renv::snapshot() on the machine with the full installation, commit the updated lockfile, and repeat.
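
A quick way for two group members to confirm they really ended up with the same setup is to compare the active library path and a few package versions. This is a minimal sketch; installed_version is a hypothetical helper and the package list is only an example.

```r
# With renv active, the first library path should point into the project
print(.libPaths()[1])

# Hypothetical helper: installed version of a package, or NA if it is absent
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Example package list; replace with the packages your project relies on
pkgs <- c("dplyr", "readr")
print(vapply(pkgs, installed_version, character(1)))
```

If both machines print the same versions, the restore did its job; an NA means the package is not installed in the active library.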

Part 3: Group Project

Good data documentation answers five questions for someone who has never seen your data: what it is, where it comes from, what variables it contains, what its limitations are, and what processing has been applied. Most open datasets — including the ones you chose for your projects — only partially answer these questions.

In this part you will first audit what is and isn’t documented about your dataset, then use what you find to write a data/README.qmd directly in your project repository.

Use your group’s own dataset throughout. Open the original source page alongside the raw file in R. Make sure you are working inside your group project — with renv active from Part 2 — so that any packages you install here are recorded in the lockfile.

If your group does not yet have a dataset, you can work with the Munich bicycle counter daily values with weather, available at opendata.muenchen.de/dataset/daten-der-raddauerzaehlstellen-muenchen-2024 — download any monthly Tageswerte CSV (e.g. rad_2024_01_tage_korr.csv). We strongly recommend working with your own project dataset instead.

Exercise 3.1: Check the licence

Find the licence field on your dataset’s source page.

What licence does it use, and what does that allow you to do?
Is attribution required? If so, what exactly must you write — is “Data from [source name]” sufficient?
Are there any restrictions on commercial use or redistribution?

Solution — backup dataset

The bicycle counter dataset uses Datenlizenz Deutschland – Namensnennung – Version 2.0 (dl-de/by-2-0). This is a German open government licence roughly equivalent to CC BY 4.0: you can copy, use, and redistribute the data for any purpose, including commercially, as long as you attribute the source.

Attribution must be explicit. “Data from Munich Open Data” is not sufficient — the licence requires naming the publisher, the dataset title, the URL, and the licence: “Landeshauptstadt München, Daten der Raddauerzählstellen München, opendata.muenchen.de, Datenlizenz Deutschland Namensnennung 2.0.”

Exercise 3.2: Identify documentation gaps in the raw file

library(readr)
library(dplyr)

dat <- read_csv("your-data-file.csv")
glimpse(dat)

Tip

If you need to install packages here (e.g. janitor, skimr), do so inside the project and run renv::snapshot() in the R console afterwards to record them in the lockfile.

Look at the column names and types. Are there any you could not explain to someone else — where the meaning, units, or valid values are not obvious from the name alone?

Write down at least three columns where the documentation is incomplete or unclear. You will fill these in when you write the README.

Solution — backup dataset

The bicycle counter file has 12 columns. Several have documentation gaps:

richtung_1 / richtung_2 — directional counts, but which direction is “1”? This varies by station. At Arnulfstraße, richtung_2 counts cyclists going against the direction of travel — noted only in a prose aside on the portal page, not in the file.
min-temp / max-temp — temperature, but units not stated anywhere. Presumably °C.
niederschlag — precipitation; units not stated (mm, presumably).
bewoelkung — cloud cover; units not stated (%, presumably).

Note also that min-temp and max-temp contain hyphens, so they are not syntactic R names and can only be referenced with backticks (dat$`min-temp`). Use janitor::clean_names() before working with them.
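
The effect of hyphenated column names, and of cleaning them, can be seen on a toy example. The values below are invented, and the base-R fallback is an assumption added so the sketch also runs where janitor is not installed.

```r
# Toy data with the same kind of hyphenated column names (values invented)
dat <- read.csv(text = "datum,min-temp,max-temp\n2024-01-01,-1.5,4.2",
                check.names = FALSE)

names(dat)        # "datum" "min-temp" "max-temp"
dat$`min-temp`    # works, but only with backticks

if (requireNamespace("janitor", quietly = TRUE)) {
  dat_clean <- janitor::clean_names(dat)
} else {
  # Base-R fallback for this simple case: replace hyphens with underscores
  dat_clean <- dat
  names(dat_clean) <- gsub("-", "_", names(dat_clean), fixed = TRUE)
}

names(dat_clean)  # "datum" "min_temp" "max_temp"
```

Whichever route you take, note the renaming in your data README so that readers can map the cleaned names back to the raw file.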

Exercise 3.3: Identify structural gaps

Look at the source page for your dataset:

Is there a variable dictionary or codebook anywhere — something that formally defines what each column means?
Are any known data quality issues documented (sensor outages, missing periods, revisions)?
If you needed to answer a question about the data that the file itself doesn’t answer, where would you look? Is that link documented anywhere?

Solution — backup dataset

The bicycle counter portal page has inline station-specific notes, but no formal variable dictionary. The notes are written in prose and easy to miss when downloading.

Two stations are documented as absent from the 2024 data entirely (Bad-Kreuther-Str. and Margaretenstr.), and Erhardtstraße had a sensor defect from September to November 2024. None of this is flagged in the data files themselves — the rows simply do not appear. A user who loaded the file without reading the portal page would have no way of knowing those stations ever existed.

These are exactly the gaps your data/README.qmd should fill.
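
Rows that "simply do not appear" can be made visible by comparing the file against the full grid of expected station-day combinations. The sketch below does this in base R on a toy data frame; the station labels and counts are invented.

```r
# Toy example: station "B" has no row for 2024-01-02 (values invented)
dat <- data.frame(
  datum       = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01")),
  zaehlstelle = c("A", "A", "B"),
  gesamt      = c(120, 95, 300)
)

# Every station-day combination that should exist
expected <- expand.grid(
  datum       = seq(min(dat$datum), max(dat$datum), by = "day"),
  zaehlstelle = unique(dat$zaehlstelle),
  stringsAsFactors = FALSE
)

# Left join: combinations without a matching row get NA in gesamt
full <- merge(expected, dat, by = c("datum", "zaehlstelle"), all.x = TRUE)
missing_rows <- full[is.na(full$gesamt), c("datum", "zaehlstelle")]
print(missing_rows)
```

This only finds gaps for stations that appear at least once in the file. A station that is absent entirely can only be detected by comparing against an external station list, such as the one on the portal page.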

Exercise 3.4: Write a data README

Create data/README.qmd in your project repository. Use the gaps you identified in Exercises 3.2 and 3.3 to guide what you write.

Start with a 2–4 sentence overview, then add a variable dictionary as a Markdown table covering your key variables.

Tip

Use skimr::skim() in the R console to get types, missingness counts, and ranges quickly:

library(skimr)
skim(dat)

Use the output to fill in the table — translate it into readable descriptions rather than pasting it directly.
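
If you prefer to start from a pre-filled skeleton, a few lines of base R can collect the type, missingness, and range of every column. This is a minimal sketch; make_dictionary is a hypothetical helper, and the description column is deliberately left empty for you to write by hand.

```r
# Sketch: build a variable-dictionary skeleton from any data frame.
# The helper name is made up for illustration.
make_dictionary <- function(dat) {
  data.frame(
    variable  = names(dat),
    type      = unname(vapply(dat, function(x) class(x)[1], character(1))),
    n_missing = unname(vapply(dat, function(x) sum(is.na(x)), integer(1))),
    range     = unname(vapply(dat, function(x) {
      # Ranges only make sense for numeric and date columns
      if ((is.numeric(x) || inherits(x, "Date")) && !all(is.na(x))) {
        paste(as.character(range(x, na.rm = TRUE)), collapse = " to ")
      } else {
        NA_character_
      }
    }, character(1))),
    description = "",   # to be written by hand
    row.names   = NULL
  )
}

# Small invented example
toy <- data.frame(niederschlag = c(0, NA, 3), zaehlstelle = c("A", "B", "C"))
dict <- make_dictionary(toy)
print(dict)
```

If knitr is installed, knitr::kable(dict, format = "pipe") prints the result as a Markdown table that you can paste into data/README.qmd and then edit.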

Solution

Example overview:

The dataset contains daily bicycle counts from automated induction-loop stations operated by the Landeshauptstadt München, covering January–December 2024. Each row represents one counting station on one calendar day, combined with daily weather observations for Munich. The data are published under Datenlizenz Deutschland Namensnennung 2.0 and were downloaded from the Munich Open Data Portal on 2026-06-11.

Example variable dictionary:

| Variable | Type | Description | Example / range |
|---|---|---|---|
| `datum` | `<date>` | Observation date | 2024-01-01 – 2024-12-31 |
| `zaehlstelle` | `<chr>` | Counting station name | “Arnulf”, “Olympia” |
| `gesamt` | `<int>` | Total daily bicycle count (both directions) | 78 – 4,821 |
| `min_temp` | `<dbl>` | Daily minimum temperature (°C) | −10.2 – 23.4 |
| `niederschlag` | `<dbl>` | Daily precipitation (mm) | 0.0 – 38.1 |
| `sonnenstunden` | `<dbl>` | Daily sunshine duration (hours) | 0.0 – 14.9 |

The goal is human-readable documentation. “Daily precipitation in mm, range 0–38” is useful. “niederschlag, 0–38” is not.

Exercise 3.5: Review your repository structure

Check your project repository against the requirements for the final submission. Open the repository on GitHub and work through this checklist:

README.md at the root with dataset description, research questions, and group members
Raw data in data/raw/ (or a download script if the file is too large to commit) with the new data/README.qmd
A cleaning script — e.g. scripts/01_clean.R or data-cleaning.qmd
renv.lock committed (from Part 2)
contributions/ folder or a CONTRIBUTING.qmd — even a placeholder is fine

For any item that is missing, create the file or folder now and commit a placeholder.
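
If your raw file is too large to commit, a download script can take its place in the repository. The following is a sketch under stated assumptions: download_raw and the file name scripts/00_download.R are hypothetical, and the URL is left as a placeholder for the link on your own source page.

```r
# Sketch of a download script, e.g. scripts/00_download.R (hypothetical name).
# `data_url` must be replaced with the download link from your source page.
download_raw <- function(data_url, dest) {
  # Skip the download if the file is already there
  if (file.exists(dest)) {
    message("Already present, skipping: ", dest)
    return(invisible(dest))
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  utils::download.file(data_url, destfile = dest, mode = "wb")
  invisible(dest)
}

# Example call (commented out; the URL is deliberately a placeholder):
# download_raw("<URL from your source page>", "data/raw/your-data.csv")
```

Record the download date in data/README.qmd, since the file behind a URL can change without notice.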

Solution

A minimal contribution statement placeholder:

## Alice — Contribution Statement

*To be completed before the final submission deadline.*

**Primarily responsible for:**

**Key tasks:**

**Collaborative tasks:**

Commit it as contributions/alice.qmd (one file per group member).

Exercise 3.6: Plan your next analysis step

Review your proposal’s analysis plan and decide on the single most important next step. Open a GitHub Issue in your repository describing:

What you are trying to answer or produce
Which variables and dataset you will use
Who is responsible
A rough estimate of when it will be done

Tip

GitHub Issues are a natural record of project decisions. Opening issues and assigning them to people makes it easier for every group member to commit directly — which the project guidelines require.

Solution

Example issue title: “Plot seasonal trend in bicycle counts by station”

Example issue body:

Goal: Produce a faceted line chart showing monthly totals per station, to address RQ2 from the proposal.

Data: data/processed/bikes_clean.rds — variables datum, zaehlstelle, gesamt

Owner: Alice

Target: Draft ready for group review by Jun 18

Assign the issue to the responsible group member before closing the practical session.
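
For orientation, the task in the example issue could be approached roughly as follows. The data here are simulated stand-ins for data/processed/bikes_clean.rds, the helper column monat is a hypothetical name, and the plot is only drawn if ggplot2 is installed.

```r
# Simulated stand-in for the cleaned data (values invented)
set.seed(42)
days <- seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day")
dat <- expand.grid(datum = days, zaehlstelle = c("Arnulf", "Olympia"),
                   stringsAsFactors = FALSE)
dat$gesamt <- rpois(nrow(dat), lambda = 1500)

# Monthly totals per station
dat$monat <- as.integer(format(dat$datum, "%m"))
monthly <- aggregate(gesamt ~ monat + zaehlstelle, data = dat, FUN = sum)

# Faceted line chart, one panel per station
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(monthly, ggplot2::aes(x = monat, y = gesamt)) +
    ggplot2::geom_line() +
    ggplot2::facet_wrap(~ zaehlstelle) +
    ggplot2::scale_x_continuous(breaks = 1:12) +
    ggplot2::labs(x = "Month (2024)", y = "Total bicycle count")
  print(p)
}
```

Linking a sketch like this in the issue gives the owner a concrete starting point without prescribing the final design.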

Part 4: Reflection Log

Take a few minutes at the end of the session to add this week’s entry to your reflection log. Then commit and push everything — renv.lock, data/README.qmd, any new files from Exercise 3.5, and the reflection log entry can all go in one commit.

End-of-practical checklist

Before you leave today, make sure your group has:

Proposal revised: research questions narrowed, dataset description improved, analysis plan made specific
renv.lock committed and pushed, with renv::status() reporting no issues
At least one group member has run renv::restore() from a clean pull without errors
data/README.qmd created with a dataset overview, variable dictionary, and at least two quality notes
Project repository structure reviewed; any missing folders or placeholder files committed
At least one GitHub Issue opened and assigned for the next analysis step
This week’s reflection log entry committed and pushed

Resources

Reproducible environments

Introduction to renv — official getting-started guide
Shannon Pileggi, renv — posit::conf(2025) — 20-minute talk covering renv in team workflows

Open data

Munich Open Data Portal — all datasets used in this course
FAIR Principles — Findable, Accessible, Interoperable, Reusable; the standard framework for evaluating open datasets
Creative Commons licence chooser — understand what each CC licence allows
The Turing Way — Open Data — practical guidance on working with and publishing open data

Data documentation

The Turing Way — Research Data Management — covers documentation, file organisation, and metadata
WCD Open Data lecture notes — used in preparing this session

--- title: "Practical #8" subtitle: "Advanced Statistical Programming using R — Open Data & renv" author: "Leonhard Kestel, Lisa Bondo Andersen, Cynthia Huang" date: "June 11, 2026" format: html: theme: default toc: true toc-depth: 2 code-tools: true highlight-style: github execute: eval: false message: false warning: false draft: false --- ## Course Evaluation Please take a few minutes to fill out the course evaluation survey before starting today's practical. ::: {.callout-important} Please evaluate the **practicals led by Lisa and Leo only** — not the full course. ::: Log in at [lehrevaluation.uni-muenchen.de/evasys/public/online/index](https://www.lehrevaluation.uni-muenchen.de/evasys/public/online/index){target="_blank"} using the Losung **E7D61**, and submit your answers at the end of the questionnaire. The survey is anonymous. ------------------------------------------------------------------------ ## Quiz Before starting, work through this [QUIZ](quiz.qmd){target="_blank"} to check your understanding of the concepts covered in this week's lecture on open data and reproducible environments. ------------------------------------------------------------------------ # Overview In this practical you will set up `renv` on your group project and work on documenting your dataset. Work in your project group throughout. At least one person should have the group project open in RStudio. The practical is structured into four parts: 1. **Proposal revision** — use the feedback from yesterday's lecture to sharpen your plan 2. **Set up `renv`** — lock package versions in your group project repository 3. **Group project** — audit your dataset's documentation, write a data README, and review your repository structure 4. **Reflection log** — record what you learned this week --- # Part 1: Proposal Revision You received general feedback on your proposals in yesterday's lecture. 
Before moving on to new material, take 20–30 minutes as a group to use that feedback to sharpen your plan and set yourselves up for the work ahead. Open your proposal document in RStudio alongside yesterday's slides. ## Exercise 1.1: Commit to your research questions Re-read your research questions and decide as a group which ones you will actually deliver by the final submission. - For each question you keep, write one sentence describing the concrete output you will produce — a specific table, plot, or model result - If you have more than two questions, prioritise: pick the ones you can execute well rather than attempting everything at a lower quality - For each question, note which variables you will use — this will make it much easier to divide work and open GitHub Issues later in today's session ::: {.callout-tip title="Tip 1" collapse="true"} "How does X relate to Y?" becomes actionable when you write: "We will fit a linear regression of Y on X, controlling for Z, and report a coefficient table and residual plot." If you cannot write that sentence yet, the question needs more thought before anyone starts coding. ::: ::: {.callout-tip title="Tip 2" collapse="true"} A good outcome: your proposal lists at most two research questions, each with a one-sentence output description and a list of the variables involved. Any question that would require more than two to three focused work sessions to complete has been cut or scoped down. This gives you a realistic plan for the coming weeks and a clear basis for assigning tasks. ::: ## Exercise 1.2: Evaluate your dataset The dataset is the foundation of everything else. A clean, pre-packaged dataset limits what you can demonstrate — and a dataset that cannot actually answer your research questions will cause problems in every subsequent week. Work through these questions as a group: **Is the dataset genuinely messy?** Load the raw file and run `glimpse()` and `summary()`. 
Are there encoding issues, inconsistent formatting, missing values, or columns that need transformation? If the data loads perfectly and every variable is already clean and well-named, it is probably not the right choice for this project. **Can it actually answer your research questions?** For each research question, identify the specific variables you would need. Are they in the dataset? Do they have enough variation and coverage to support the kind of claim you want to make? A dataset that only partially overlaps with your questions will produce results that are technically correct but not meaningful. **Can you say why you chose it?** Write one sentence explaining why this dataset — not just any dataset — is interesting for your questions. "It was available" is not a reason. "It contains X, which lets us examine Y, which matters because Z" is. ::: {.callout-tip title="Tip 1" collapse="true"} If your dataset is a well-known teaching dataset (e.g. `palmerpenguins`, `nycflights13`, `gapminder`) or was synthetically generated, consider whether it gives you enough to work with. These datasets are designed to be easy — which is the opposite of what makes a good portfolio piece. ::: ::: {.callout-tip title="Tip 2" collapse="true"} There is no single right answer, but a group that has done this exercise well can say: "Our dataset has [specific messiness] that we will need to handle; it contains [specific variables] that map onto our research questions; and we chose it because [specific reason]." If any of those three sentences cannot be filled in, it is worth reconsidering the dataset now rather than in week 11. ::: ## Exercise 1.3: Make your analysis plan specific Re-read your analysis plan. For each analytical step, you should be able to answer three questions: *what exactly will you run*, *which variables*, and *why does this answer the research question*? 
For each step in your plan: - Name the specific method and the specific variables — not "regression" but "regress Y on X1 and X2" - Say what you will report — a coefficient table, a plot, a confidence interval - Write one sentence connecting the result back to the research question: "This tells us whether X is associated with Y, which addresses RQ2 because…" - Replace "we will clean the data" with the specific issue you know exists and how you plan to handle it ::: {.callout-tip title="Tip 1" collapse="true"} Stick to methods you have already studied. A simple linear regression with the right variables and a clear justification is more valuable than a method you cannot explain. During the oral exam you will be asked to defend every analytical choice — "the model said so" is not a sufficient answer. ::: ::: {.callout-important} If your proposal was drafted with heavy AI assistance, this is a good moment to make it yours. Read every sentence and ask: do I actually know what this means and why it is there? The coming weeks will cover more tools for writing targeted analyses — but you need to own the decisions, not just the output. ::: ::: {.callout-tip title="Tip 2" collapse="true"} A specific plan looks like: "We will regress `gesamt` (total daily count) on `sonnenstunden` (sunshine hours) and `niederschlag` (precipitation), controlling for day of week. We will report a coefficient table and a residual plot. This addresses RQ1 because it quantifies the relationship between weather and cycling volume while accounting for weekly patterns." Compare this to "We will use regression to analyse the effect of weather." Both describe the same idea — but only the first gives your group members enough to start working independently, and only the first gives you something to defend. You will turn these steps into GitHub Issues at the end of Part 3. ::: --- # Part 2: Set up `renv` Without `renv`, everyone in your group installs packages independently. 
One person has `dplyr` 1.1.4, another has 1.0.9 — and a function that works on one machine produces a different result or a cryptic error on another. By the time you are debugging, it is rarely obvious that package versions are the cause. `renv` solves this by giving the project its own R library and recording the exact version of every package in a lockfile (`renv.lock`). Anyone who clones the repository runs `renv::restore()` and gets an identical environment. It also makes your work reproducible for anyone outside your group — a marker, a future employer, or yourself in six months. Open your group project in RStudio via the `.Rproj` file before starting. ## Exercise 2.1: Initialise `renv` Run in the **R console**: ```r # Install renv if you do not have it install.packages("renv") # Initialise renv in the current project renv::init() ``` After running, look at what changed in your project folder. Which files and folders did `renv` create? ::: {.callout-note title="Solution" collapse="true"} `renv::init()` creates: - `renv/` — the project-local library (packages install here instead of your global library) - `renv.lock` — the lockfile: a JSON snapshot of every package name, version, and source - `.Rprofile` — a small file that activates `renv` automatically whenever you open the project `renv` also adds `renv/library/` to `.gitignore`. This is correct — the library contains compiled binaries that are platform-specific and can always be rebuilt from the lockfile. ::: ## Exercise 2.2: Check the status and snapshot Run in the **R console**: ```r renv::status() ``` This compares what is installed in the project library against what is recorded in the lockfile. 
If packages your project uses are missing from the lockfile, install them and snapshot: ```r # Install any packages your scripts use, e.g.: install.packages(c("tidyverse", "skimr", "naniar", "visdat")) # Record the current state in the lockfile renv::snapshot() ``` What is the difference between `renv::snapshot()` and `renv::restore()`? ::: {.callout-note title="Solution" collapse="true"} - `renv::snapshot()` writes the **currently installed** package versions to `renv.lock`. Run it after installing or updating packages to record the new state. - `renv::restore()` installs the package versions **recorded in the lockfile** into the project library. Run it after cloning or pulling a repository to reproduce someone else's environment. They are inverses: `snapshot()` writes the lockfile from the library; `restore()` builds the library from the lockfile. ::: ## Exercise 2.3: Commit the lockfile to GitHub Run in the **terminal** (not the R console): ```bash git add renv.lock .Rprofile .gitignore git commit -m "feat: initialise renv" git push ``` ::: {.callout-tip} `renv.lock` and `.Rprofile` should always be committed and pushed. Commit an updated `renv.lock` every time you add or update a package — treat it the same way you would a `DESCRIPTION` file. ::: Why is `renv/library/` excluded from version control even though it contains the actual installed packages? ::: {.callout-note title="Solution" collapse="true"} The `renv/library/` folder contains compiled package binaries. These are: - **Platform-specific**: macOS binaries will not run on Linux or Windows - **Large**: a full tidyverse library can be several hundred MB - **Fully reproducible**: any machine can rebuild them from `renv.lock` by running `renv::restore()` Committing them would bloat the repository, cause merge conflicts across platforms, and provide no benefit — the lockfile already contains all the information needed. 
::: ## Exercise 2.4: Test the restore Have a different group member pull the latest changes, open the project in RStudio, and run in the **R console**: ```r renv::restore() ``` Confirm it succeeds by loading a package and running a short script from your project: ```r library(tidyverse) dat <- read_csv("data/raw/your-data.csv") glimpse(dat) ``` ::: {.callout-note title="Solution" collapse="true"} A successful `renv::restore()` prints something like: ``` - Restoring packages into 'renv/library'... ✓ Restored 42 packages. ``` If a package fails to install, it usually means a system-level dependency is missing (e.g. `libxml2` for the `xml2` package on Linux). Install the system library first and re-run `renv::restore()`. If `renv::status()` still reports issues after a restore, run `renv::snapshot()` on the machine with the full installation, commit the updated lockfile, and repeat. ::: --- # Part 3: Group Project Good data documentation answers five questions for someone who has never seen your data: what it is, where it comes from, what variables it contains, what its limitations are, and what processing has been applied. Most open datasets — including the ones you chose for your projects — only partially answer these questions. In this part you will first audit what is and isn't documented about your dataset, then use what you find to write a `data/README.qmd` directly in your project repository. **Use your group's own dataset throughout.** Open the original source page alongside the raw file in R. Make sure you are working inside your group project — with `renv` active from Part 2 — so that any packages you install here are recorded in the lockfile. 
If your group does not yet have a dataset, you can work with the Munich bicycle counter daily values with weather, available at [opendata.muenchen.de/dataset/daten-der-raddauerzaehlstellen-muenchen-2024](https://opendata.muenchen.de/dataset/daten-der-raddauerzaehlstellen-muenchen-2024){target="_blank"} — download any monthly **Tageswerte** CSV (e.g. `rad_2024_01_tage_korr.csv`). We strongly recommend working with your own project dataset instead.

## Exercise 3.1: Check the license

Find the license field on your dataset's source page.

- What license does it use, and what does that allow you to do?
- Is attribution required? If so, what exactly must you write — is "Data from [source name]" sufficient?
- Are there any restrictions on commercial use or redistribution?

::: {.callout-note title="Solution — backup dataset" collapse="true"}
The bicycle counter dataset uses **Datenlizenz Deutschland – Namensnennung – Version 2.0** (dl-de/by-2-0). This is a German open government licence roughly equivalent to CC BY 4.0: you can copy, use, and redistribute the data for any purpose, including commercially, as long as you attribute the source.

Attribution must be explicit. "Data from Munich Open Data" is not sufficient — the licence requires naming the publisher, the dataset title, the URL, and the licence: "Landeshauptstadt München, Daten der Raddauerzählstellen München, opendata.muenchen.de, Datenlizenz Deutschland Namensnennung 2.0."
:::

## Exercise 3.2: Identify documentation gaps in the raw file

```r
library(readr)
library(dplyr)

dat <- read_csv("your-data-file.csv")
glimpse(dat)
```

::: {.callout-tip}
If you need to install packages here (e.g. `janitor`, `skimr`), do so inside the project and run `renv::snapshot()` in the **R console** afterwards to record them in the lockfile.
:::

Look at the column names and types. Are there any you could not explain to someone else — where the meaning, units, or valid values are not obvious from the name alone?
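
One way to make this review systematic is to tabulate every column with its type and missing-value count, then mark the unclear ones by hand. The sketch below uses base R only and a small made-up data frame standing in for `dat`; the values are illustrative, and `column_inventory()` is a hypothetical helper written for this example, not a function from any package.

```r
# Made-up stand-in for `dat`; replace with your own data frame.
dat_example <- data.frame(
  datum = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  zaehlstelle = c("Arnulf", "Arnulf", "Olympia"),
  gesamt = c(812L, NA, 1040L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per column with its type and missingness,
# plus an empty `unclear` column to fill in by hand.
column_inventory <- function(df) {
  data.frame(
    variable = names(df),
    type = vapply(df, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1), USE.NAMES = FALSE),
    unclear = "",
    stringsAsFactors = FALSE
  )
}

inventory <- column_inventory(dat_example)
inventory
```

Calling `column_inventory(dat)` on your own data gives a starting table; the `unclear` column is where you note missing units, ambiguous codes, or undefined valid ranges.
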
Write down at least three columns where the documentation is incomplete or unclear. You will fill these in when you write the README.

::: {.callout-note title="Solution — backup dataset" collapse="true"}
The bicycle counter file has 12 columns. Several have documentation gaps:

- `richtung_1` / `richtung_2` — directional counts, but which direction is "1"? This varies by station. At Arnulfstraße, `richtung_2` counts cyclists going *against* the direction of travel — noted only in a prose aside on the portal page, not in the file.
- `min-temp` / `max-temp` — temperature, but units not stated anywhere. Presumably °C.
- `niederschlag` — precipitation; units not stated (mm, presumably).
- `bewoelkung` — cloud cover; units not stated (%, presumably).

Note also that `min-temp` and `max-temp` contain hyphens — not valid in R variable names. Use `janitor::clean_names()` before working with them.
:::

## Exercise 3.3: Identify structural gaps

Look at the source page for your dataset:

- Is there a variable dictionary or codebook anywhere — something that formally defines what each column means?
- Are any known data quality issues documented (sensor outages, missing periods, revisions)?
- If you needed to answer a question about the data that the file itself doesn't answer, where would you look? Is that link documented anywhere?

::: {.callout-note title="Solution — backup dataset" collapse="true"}
The bicycle counter portal page has inline station-specific notes, but no formal variable dictionary. The notes are written in prose and easy to miss when downloading.

Two stations are documented as absent from the 2024 data entirely (Bad-Kreuther-Str. and Margaretenstr.), and Erhardtstraße had a sensor defect from September to November 2024. None of this is flagged in the data files themselves — the rows simply do not appear. A user who loaded the file without reading the portal page would have no way of knowing those stations ever existed.
These are exactly the gaps your `data/README.qmd` should fill.
:::

## Exercise 3.4: Write a data README

Create `data/README.qmd` in your project repository. Use the gaps you identified in Exercises 3.2 and 3.3 to guide what you write.

Start with a 2–4 sentence overview, then add a variable dictionary as a Markdown table covering your key variables.

::: {.callout-tip}
Use `skimr::skim()` in the **R console** to get types, missingness counts, and ranges quickly:

```r
library(skimr)
skim(dat)
```

Use the output to fill in the table — translate it into readable descriptions rather than pasting it directly.
:::

::: {.callout-note title="Solution" collapse="true"}
*Example overview:*

> The dataset contains daily bicycle counts from automated induction-loop stations operated by the Landeshauptstadt München, covering January–December 2024. Each row represents one counting station on one calendar day, combined with daily weather observations for Munich. The data are published under Datenlizenz Deutschland Namensnennung 2.0 and downloaded from the Munich Open Data Portal on 2026-06-11.

*Example variable dictionary:*

| Variable | Type | Description | Example / range |
|----------|------|-------------|-----------------|
| `datum` | `<date>` | Observation date | 2024-01-01 – 2024-12-31 |
| `zaehlstelle` | `<chr>` | Counting station name | "Arnulf", "Olympia" |
| `gesamt` | `<int>` | Total daily bicycle count (both directions) | 78 – 4,821 |
| `min_temp` | `<dbl>` | Daily minimum temperature (°C) | −10.2 – 23.4 |
| `niederschlag` | `<dbl>` | Daily precipitation (mm) | 0.0 – 38.1 |
| `sonnenstunden` | `<dbl>` | Daily sunshine duration (hours) | 0.0 – 14.9 |

The goal is human-readable documentation. "Daily precipitation in mm, range 0–38" is useful. "niederschlag, 0–38" is not.
:::

## Exercise 3.5: Review your repository structure

Check your project repository against the requirements for the final submission.
Open the repository on GitHub and work through this checklist:

- [ ] `README.md` at the root with dataset description, research questions, and group members
- [ ] Raw data in `data/raw/` (or a download script if the file is too large to commit) with the new `data/README.qmd`
- [ ] A cleaning script — e.g. `scripts/01_clean.R` or `data-cleaning.qmd`
- [ ] `renv.lock` committed (from Part 2)
- [ ] `contributions/` folder or a `CONTRIBUTING.qmd` — even a placeholder is fine

For any item that is missing, create the file or folder now and commit a placeholder.

::: {.callout-note title="Solution" collapse="true"}
A minimal contribution statement placeholder:

```markdown
## Alice — Contribution Statement

*To be completed before the final submission deadline.*

**Primarily responsible for:**

**Key tasks:**

**Collaborative tasks:**
```

Commit it as `contributions/alice.qmd` (one file per group member).
:::

## Exercise 3.6: Plan your next analysis step

Review your proposal's analysis plan and decide on the single most important next step. Open a GitHub Issue in your repository describing:

- What you are trying to answer or produce
- Which variables and dataset you will use
- Who is responsible
- A rough estimate of when it will be done

::: {.callout-tip}
GitHub Issues are a natural record of project decisions. Opening issues and assigning them to people makes it easier for every group member to commit directly — which the project guidelines require.
:::

::: {.callout-note title="Solution" collapse="true"}
*Example issue title:* "Plot seasonal trend in bicycle counts by station"

*Example issue body:*

> **Goal:** Produce a faceted line chart showing monthly totals per station, to address RQ2 from the proposal.
>
> **Data:** `data/processed/bikes_clean.rds` — variables `datum`, `zaehlstelle`, `gesamt`
>
> **Owner:** Alice
>
> **Target:** Draft ready for group review by Jun 18

Assign the issue to the responsible group member before closing the practical session.
:::

---

# Part 4: Reflection Log

Take a few minutes at the end of the session to add this week's entry to your [reflection log](_reflection-prompts.qmd).

Then commit and push everything — `renv.lock`, `data/README.qmd`, any new files from Exercise 3.5, and the reflection log entry can all go in one commit.

---

## End-of-practical checklist

Before you leave today, make sure your group has:

- Proposal revised: research questions narrowed, dataset description improved, analysis plan made specific
- `renv.lock` committed and pushed, with `renv::status()` reporting no issues
- At least one group member has run `renv::restore()` from a clean pull without errors
- `data/README.qmd` created with a dataset overview, variable dictionary, and at least two quality notes
- Project repository structure reviewed; any missing folders or placeholder files committed
- At least one GitHub Issue opened and assigned for the next analysis step
- This week's reflection log entry committed and pushed

---

# Resources

**Reproducible environments**

- [Introduction to renv](https://rstudio.github.io/renv/articles/renv.html) — official getting-started guide
- [Shannon Pileggi, renv — posit::conf(2025)](https://www.youtube.com/watch?v=l01u7Ue9pIQ) — 20-minute talk covering renv in team workflows

**Open data**

- [Munich Open Data Portal](https://opendata.muenchen.de) — all datasets used in this course
- [FAIR Principles](https://www.go-fair.org/fair-principles/) — Findable, Accessible, Interoperable, Reusable; the standard framework for evaluating open datasets
- [Creative Commons licence chooser](https://chooser-beta.creativecommons.org/) — understand what each CC licence allows
- [The Turing Way — Open Data](https://the-turing-way.netlify.app/reproducible-research/open.html) — practical guidance on working with and publishing open data

**Data documentation**

- [The Turing Way — Research Data Management](https://the-turing-way.netlify.app/reproducible-research/rdm.html) — covers documentation, file organisation, and metadata
- [WCD Open Data lecture notes](https://wcd.numbat.space/week1/#/open-data-is) — used in preparing this session
