Week 9: Reproducibility, Open Data & renv
2026-06-11
Questions about your specific proposal or project?
We will end the lecture a little earlier today for Teaching Evaluation. If there is additional time, you may ask questions about your specific projects.
You should also attend the upcoming practicals to get additional feedback and suggestions for improving your projects.
Good enough proposals showed:
Weaker proposals:
Note that you were not expected to address all these areas in your proposals. You will have the chance to revise your project plans as we cover relevant content in upcoming weeks.
Scope & Focus
Choice of Datasets & Describing Your Data
100 rows, 10 variables tell you?
Analysis Plans
Communication Style and Clarity
Let’s look briefly at an actual project from your peers!
Exam questions beyond your projects
We may ask you about topics not demonstrated in your group project (e.g. R packages, even if your project doesn’t include one).
Tip
This week’s practical is an opportunity to update your project plans, set up renv, and improve your data and overall documentation.
Choosing clean/simple datasets is not as clever as you think! How will you answer exam questions about cleaning datasets?
For many of you, this might be the first time you’ve worked with real-world data, and it might be a little bit overwhelming.
Tip
Feeling stuck is a sign you are thinking carefully, not that you are doing it wrong.
You can often learn more from a partial attempt at something challenging than from a polished version of something trivial. Some suggestions for stretching yourself:
Once you have developed your baseline analysis and results, shift your attention to practising your presentation and communication skills.
Only one final URL
Make sure any additional outputs can be reached from your website homepage — i.e. make sure to add links to all presentation outputs to the navigation bar.
Learning is a reflective and reinforcing process. Try to keep track of what has been done, what has been decided, and what you’ve learnt. In addition to the weekly reflection logs, you might also use GitHub Issues to coordinate your group’s work:
Tip
Issues create a record of who was responsible for what — this is exactly what your contribution statement will draw on. One issue per task is better than one big issue for everything.
| Milestone | Date | Submission |
|---|---|---|
| ✓ General feedback in class today | | |
| Progress Update | 29 Jun | URL to rendered site, updated Project Proposal |
| Group reflection — in-class discussion | W14 practical | N/A |
| Final submission + group-reflection.qmd | Due 22 Jul | URL to final website/webpage, group reflection doc |
| Individual contribution statement | Due 23 Jul | PDF via Moodle |
| Oral exam | 29 Jul | — |
Make sure your repository includes:
- README.md with all group member names
- contributions/ folder (e.g. contributions/alice.qmd):
Part 1: Statistical Programming Foundations
Part 2: Working with Real World Data
Part 3: Advanced Topics & Summary
- lubridate for dates; other specialist packages for geospatial (sf), time-of-day (hms), factors (forcats) etc.
- glimpse() for the whole table; str() for a single column
- summary() hides
- as.*()), flag/replace out-of-range values, standardise encodings
- distinct()), reshape (pivot_*()), split/merge columns (separate(), unite())
- NA): visible; find with is.na(), summary(), naniar::vis_miss()
- tidyr::complete() or tsibble::fill_gaps() to make implicit missingness explicit
Adapted from: https://rcp.numbat.space/week1/#/etc5513-title
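As a minimal sketch of the last recap point, the example below uses a made-up table (the `counts` data, site names and months are all hypothetical) to show why implicit missingness is easy to overlook and how tidyr::complete() can surface it.

```r
library(tidyr)

# Hypothetical monthly counts: site "B" has no row for February,
# so that value is implicitly missing (no NA appears anywhere).
counts <- data.frame(
  site  = c("A", "A", "B"),
  month = c("2026-01", "2026-02", "2026-01"),
  n     = c(12, 15, 9)
)

# is.na() cannot see a row that does not exist
n_na_before <- sum(is.na(counts$n))

# complete() adds every site/month combination and fills the gap with NA
counts_full <- complete(counts, site, month)

# The missing February row for site B is now an explicit NA
n_na_after <- sum(is.na(counts_full$n))

counts_full
```

Only after the gap is explicit will tools such as is.na() or summary() report it.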
The problem
The payoff
From the US National Academies (2019):
Tip
A result can be computationally reproducible but not replicable — and vice versa.
Reproducing a result requires the same environment at every level:
| Level | What it covers | Tools |
|---|---|---|
| 1–3 | OS, system libraries, R version | Docker, Singularity, rang |
| 4 | R packages | renv, packrat, checkpoint |
Tip
renv does not replace Docker — it solves a narrower, more common problem: fixing which package versions a project uses.
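To make the levels concrete, the sketch below collects one piece of information per level from a running R session. It uses only base R; the `env` list is an illustrative container, and the `stats` package stands in for any package a project depends on.

```r
# Levels 1-3: operating system and R version
# Level 4: the version of one installed package ("stats" ships with R)
env <- list(
  os            = Sys.info()[["sysname"]],
  r_version     = R.version.string,
  stats_version = as.character(packageVersion("stats"))
)

str(env)
```

renv records the level 4 information for every package in a project, while the first two entries are the kind of detail a container image would pin.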
The scope of reproducibility practices and tools is extremely broad. We will focus mainly on:
For an extended tutorial, see:

renv.lock: a plain-text snapshot of your environment
renv::init() creates:
- renv/: the project-local library; packages install here instead of your global library
- renv.lock: a JSON snapshot of every package name, version, and source
- .Rprofile: activates renv automatically when you open the project
Tip
renv/library/ is automatically excluded from git (renv writes a renv/.gitignore for it): compiled binaries are platform-specific and can always be rebuilt from renv.lock.
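Because renv.lock is plain JSON, it can be inspected like any other data file. The sketch below parses a trimmed, made-up lockfile (the package list, versions and repository URL are illustrative, not taken from a real project) and tabulates name, version and source. It assumes the jsonlite package is installed.

```r
library(jsonlite)

# A trimmed, made-up lockfile; a real renv.lock lists every package used
lock_json <- '{
  "R": {
    "Version": "4.4.1",
    "Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]
  },
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.1.4", "Source": "Repository", "Repository": "CRAN"},
    "tidyr": {"Package": "tidyr", "Version": "1.3.1", "Source": "Repository", "Repository": "CRAN"}
  }
}'

lock <- fromJSON(lock_json, simplifyVector = FALSE)

# Pull one field out of every package record
field <- function(name) {
  unname(vapply(lock$Packages, function(p) p[[name]], character(1)))
}

pkgs <- data.frame(
  package = field("Package"),
  version = field("Version"),
  source  = field("Source")
)

lock$R$Version  # the R version recorded alongside the packages
pkgs
```

For a real project, replacing `lock_json` with the path `"renv.lock"` in the fromJSON() call should give the same kind of table.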
- renv::status(): compare installed packages against renv.lock; run this first to diagnose problems
- renv::snapshot(): update renv.lock to match what is currently installed
- renv::restore(): install packages as recorded in renv.lock
- renv::install(): install from CRAN, Bioconductor, GitHub, etc.
- renv::update(): get latest versions of all dependencies
- renv::init() then renv::snapshot()
- Commit renv.lock, .Rprofile, and .gitignore; push to GitHub
- renv::restore()
- renv::snapshot() and commit the updated lockfile
Important
Treat renv.lock like a DESCRIPTION file — commit it every time you add or update a package.
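Which renv call comes next depends on the state of the project folder. The helper below is a hypothetical illustration of that decision, not part of renv: `next_renv_step()` is an invented name, and it only checks whether renv.lock and renv/library/ exist, which is a rough heuristic rather than a substitute for renv::status().

```r
# Hypothetical helper (not part of renv): suggest the next renv call
# for a project directory, based only on which files are present.
next_renv_step <- function(project = ".") {
  has_lock    <- file.exists(file.path(project, "renv.lock"))
  has_library <- dir.exists(file.path(project, "renv", "library"))

  if (!has_lock) {
    "renv::init()"      # new project: creates renv/, renv.lock, .Rprofile
  } else if (!has_library) {
    "renv::restore()"   # fresh clone: lockfile present, library not yet built
  } else {
    "renv::status()"    # established project: compare library with lockfile
  }
}

# Demonstration on an empty temporary folder standing in for a new project
demo_project <- file.path(tempdir(), "renv-step-demo")
unlink(demo_project, recursive = TRUE)
dir.create(demo_project)
next_renv_step(demo_project)
```

The suggested calls are meant to be run interactively from the project root, followed by renv::snapshot() and a commit whenever packages are added or updated.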
“Open data can be freely used, modified, and shared by anyone for any purpose.” — Open Definition
Legal openness
In the public domain or under liberal terms — commercial and non-commercial use permitted, minimal restrictions
Technical openness
Machine-readable, non-proprietary formats; accessible via freely available tools; no passwords or paywalls
Examples from: Open Data Handbook — Why Open Data?
Tip
Open data creates value beyond the original collection purpose — but only if it is findable, accessible, and well-documented.
Introduced by Wilkinson et al. (2016) to guide sharing of scientific data — now widely applied in research and open data contexts.
Findable
Persistent identifier (e.g. DOI); rich, searchable metadata
Accessible
Retrievable via standard open protocols; metadata remains accessible even if data is not
Interoperable
Standard, open formats; shared vocabularies and schemas
Reusable
Clear licence; documented provenance; meets community standards
| Licence | Attribution? | Share-alike? | Commercial use? |
|---|---|---|---|
| CC0 (Public Domain) | No | No | Yes |
| CC BY | Yes | No | Yes |
| CC BY-SA | Yes | Yes | Yes |
| ODbL (databases) | Yes | Yes | Yes |
Tip
“Attribution required” means naming the publisher, dataset title, URL, and licence — not just “Data from [website]”.
- LICENSE (or LICENSE.md) in the root of your repository
- README.md and any data documentation
For software licences, see: The Turing Way, Licensing
Just like broader reproducibility practices, open data involves lots of different tools, platforms and standards. We focus on:
- renv.lock, sessionInfo())
What would a new collaborator get wrong about your data if they only read your description and not the source page?
Tip
Describing a data file is not describing a dataset — what can someone learn about the world from this data?
Most datasets — including open ones — are only partially documented. Check for:
| Stage | Goal | Documentation output |
|---|---|---|
| IDA | Check data integrity before analysis | Issues log, decisions + justifications, cleaning script |
| EDA | Generate hypotheses, explore patterns | Annotated notebooks, summary of key findings |
| CDA | Answer pre-defined research questions | Methods, results, discussion; formal report |
All stages require a reproducible record of decisions — not just what you did, but why.
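One way to keep such a record at the IDA stage is to let the cleaning script write its own decision log. The sketch below is a minimal base R illustration; the `raw` data, the age threshold of 120 and the stated reasons are all made up for the example.

```r
# Hypothetical raw data with two common problems
raw <- data.frame(
  id  = c(1, 2, 2, 3),
  age = c(34, 29, 29, 210)   # 210 is outside any plausible range
)

decisions <- character(0)    # running log: what was done and why

# Decision 1: drop exact duplicate rows
n_dup <- sum(duplicated(raw))
clean <- raw[!duplicated(raw), ]
decisions <- c(decisions, sprintf(
  "Dropped %d exact duplicate row(s): same record exported twice.", n_dup
))

# Decision 2: set implausible ages to NA instead of deleting the rows
implausible <- !is.na(clean$age) & clean$age > 120
clean$age[implausible] <- NA
decisions <- c(decisions, sprintf(
  "Set %d age value(s) above 120 to NA: implausible, true value unknown.",
  sum(implausible)
))

# The log can be printed, or written next to the cleaned data
writeLines(decisions)
```

Keeping the reason in the same line of code as the action makes it harder for the log and the script to drift apart.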
What would someone need to read to understand how and why you prepared and used your data?
- data-raw/) that runs from raw → clean
- data-raw/ for scripts, vignettes for narrative: IDA and packages are friends!
What would someone need to read to understand why you asked your research question?
What decisions in your analysis would a reader need to know about to evaluate your conclusions?
Documentation needs vary by context — there is no universal standard:
Research
Formal stages (IDA/EDA/CDA), pre-registration, discipline-specific reporting standards
Business / industry
Reproducibility within the organisation; changelogs, access controls, and data dictionaries often matter more than formal stages
What might “good documentation” mean across different audiences and use cases?
Broman & Woo (2018), Data Organization in Spreadsheets, explain a set of principles that apply regardless of context, including:
- YYYY-MM-DD: unambiguous, sorts correctly
Formal reporting standards exist for specific study types:
Important
Designing documentation standards is hard, ongoing work — STROBE took years of community effort and is still evolving. Most data science contexts have no agreed standard yet.
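Returning to the YYYY-MM-DD principle from Broman & Woo, the short base R sketch below illustrates both claims (sorts correctly, unambiguous) with three made-up dates written in two styles.

```r
# The same three (made-up) days, written two ways
dates_iso <- c("2026-06-11", "2025-12-01", "2026-01-15")
dates_dmy <- c("11/06/2026", "01/12/2025", "15/01/2026")  # day/month/year

sort(dates_iso)   # text order matches chronological order
sort(dates_dmy)   # text order does not match chronological order

# "01/12/2025" is ambiguous: the assumed format decides the result
as_dmy <- as.Date("01/12/2025", format = "%d/%m/%Y")  # read as 1 December 2025
as_mdy <- as.Date("01/12/2025", format = "%m/%d/%Y")  # read as 12 January 2025
as_dmy == as_mdy
```

With ISO dates stored as plain text, no format argument is needed and a simple sort is already chronological.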
- renv::init() creates a project-local library and records versions in renv.lock
- renv::snapshot() updates the lockfile; renv::restore() rebuilds the library from it
- Commit renv.lock to git; add renv/library/ to .gitignore
- data-raw/
- YYYY-MM-DD dates, one value per cell, plain text files
Key reminders from proposal feedback:
Please log on to:
https://www.lehrevaluation.uni-muenchen.de/evasys/public/online/index
Access code: 4JQS5