Quiz: Open Data & Renv

Week 9 · Advanced Statistical Programming using R

Recap: IDA & Data Cleaning

Q1

Your dataset has monthly temperature readings, but no row exists for July 2024. What kind of missingness is this, and how do you make it visible?

1) Explicit missingness — call is.na()
2) Implicit missingness — use tidyr::complete() to fill the absent row with NA
3) Structural missingness — drop the row from your dataset
4) Random missingness — no action needed

Answer: 2) Implicitly missing values are absent rows, not NA values, so is.na() cannot detect them. tidyr::complete() adds the absent row and fills it with NA so it shows up in your missingness checks.
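A minimal sketch of the idea in the answer, assuming the tidyr package is installed. The data frame `temps` and its values are made up for illustration.

```r
library(tidyr)

# Hypothetical monthly readings; the row for July 2024 is absent, not NA.
temps <- data.frame(
  month  = as.Date(c("2024-05-01", "2024-06-01", "2024-08-01")),
  temp_c = c(14.2, 18.9, 21.3)
)

# is.na() cannot see a row that does not exist.
sum(is.na(temps$temp_c))

# Expand to every month in the observed range; the added row gets NA.
full <- complete(temps, month = seq(min(month), max(month), by = "month"))

# The July row is now an explicit NA that missingness checks can find.
full[is.na(full$temp_c), ]
```

The sequence passed to complete() defines which months are expected, so the choice of range is itself an assumption about the data.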

Q2

Why should you plot your data before trusting summary statistics like mean() or median()?

1) Plots are required by the tidyverse style guide
2) summary() is too slow on large datasets
3) Plots reveal patterns and issues that summary numbers hide
4) ggplot2 automatically fixes data errors

Answer: 3) Two datasets with identical mean and SD can look completely different (Anscombe’s quartet). Plot first, hypothesise about what you see, then decide on a fix.
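Anscombe's quartet ships with base R as the `anscombe` data frame, so the point can be checked directly. The sketch below compares the mean and SD of the four y columns and then plots each x/y pair using base graphics.

```r
# `anscombe` is part of the datasets package: columns x1..x4 and y1..y4.
ys <- anscombe[, c("y1", "y2", "y3", "y4")]

means <- sapply(ys, mean)
sds   <- sapply(ys, sd)

# Rounded to two decimals, the four columns are indistinguishable.
round(rbind(mean = means, sd = sds), 2)

# The scatterplots tell four different stories.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Dataset", i))
}
par(op)
```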

Reproducibility

Q3

What does Replicability mean?

1) Same input data, same code, same conditions → same results
2) Same data, different reasonable analysis choices → consistent conclusions
3) Same code, different programming languages → equivalent results
4) Same scientific question, new independently collected data → consistent results

Answer: 4) Replicability checks whether a finding holds up when the study is repeated with new data. Option 1 describes computational reproducibility; option 2 describes robustness.
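The difference between reproducing and replicating a result can be illustrated with a toy simulation. This is only a sketch: `run_study()` is a hypothetical stand-in for a whole study, and a new random seed stands in for newly collected data.

```r
# Hypothetical "study": estimate the mean of a noisy measurement.
run_study <- function(seed, n = 200) {
  set.seed(seed)
  mean(rnorm(n, mean = 10, sd = 2))
}

# Reproducibility: same data and same code give exactly the same number.
identical(run_study(seed = 1), run_study(seed = 1))

# Replicability: a fresh sample gives a different number that should still
# support the same conclusion (an estimate close to 10).
original    <- run_study(seed = 1)
replication <- run_study(seed = 2)
c(original = original, replication = replication)
```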

renv

Q4

Which files and folders does renv::init() create in your project?

1) renv/, renv.lock, and an updated .Rprofile
2) A single packages.txt file
3) Dockerfile and docker-compose.yml
4) .libPaths set globally for your user

Answer: 1) renv/ holds the project-local library, renv.lock records exact versions, and .Rprofile activates renv whenever you open the project.
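A typical renv workflow is sketched in the comments below. The renv calls are left commented out because they modify the project they run in. The helper `renv_files_present()` is hypothetical (it is not part of renv) and only checks whether the three artefacts named in the answer exist.

```r
# Typical renv workflow (run interactively inside the project directory):
#   renv::init()      # create renv/, renv.lock and add a line to .Rprofile
#   renv::snapshot()  # update renv.lock after installing or upgrading packages
#   renv::restore()   # rebuild the library from renv.lock on another machine

# Hypothetical helper: report which renv artefacts exist in a project folder.
renv_files_present <- function(project = ".") {
  artefacts <- c("renv", "renv.lock", ".Rprofile")
  stats::setNames(file.exists(file.path(project, artefacts)), artefacts)
}

renv_files_present()
```

Running the helper before and after renv::init() should show the three entries switch from FALSE to TRUE.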

Q5

When using renv, which files should you commit to Git? (multiple correct answers)

1) renv.lock
2) .Rprofile
3) renv/library/ (compiled package binaries)
4) .gitignore (with renv/library/ excluded)

Answer: 1, 2 & 4) The lockfile and the activation setup (.Rprofile, which sources renv/activate.R, so commit that script too) are needed for others to rebuild the library with renv::restore(). The compiled binaries in renv/library/ are platform-specific and should stay gitignored.

Open Data

Q6

Which combination correctly defines open data?

1) Anything available on a website
2) Government data only
3) Legally open (free to use, modify, share) AND technically open (machine-readable, accessible)
4) Data with at least 1000 rows

Answer: 3) Open data must be both legally open (free to use, modify, and share) and technically open (machine-readable and accessible). Being available on a website is not enough on its own.

Q7

What does FAIR stand for in the context of open data?

1) Fast, Accurate, Inclusive, Reliable
2) Free, Anonymous, Indexed, Reviewed
3) Findable, Accessible, Interoperable, Reusable
4) Filtered, Annotated, Inspected, Reproducible

Answer: 3) Introduced by Wilkinson et al. (2016). Findable (persistent ID), Accessible (open protocols), Interoperable (standard formats), Reusable (clear licence and provenance).

Q8

You use a dataset published under CC BY 4.0 in your project. What must you do?

1) Pay a licence fee
2) Apply the same licence to all your derived work
3) Nothing, because the data is in the public domain
4) Credit the source by naming publisher, dataset, URL, and licence

Answer: 4) “BY” stands for attribution. Just writing “Data from [website]” is not enough. You have to name the publisher, dataset title, URL, and licence so others can trace it, and you should indicate whether you changed the data.
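One way to keep attributions consistent is to build the credit line from its four parts. This is a sketch only: `cite_dataset()` is a hypothetical helper, the output format is one possible convention, and every value in the example call is a made-up placeholder.

```r
# Hypothetical helper: build a credit line from publisher, title, URL, licence.
cite_dataset <- function(publisher, title, url, licence) {
  sprintf("%s. \"%s\". %s. Licensed under %s.", publisher, title, url, licence)
}

# All values below are placeholders, not a real dataset.
cite_dataset(
  publisher = "Example Statistics Office",
  title     = "Monthly Temperature Readings",
  url       = "https://example.org/data/temperature",
  licence   = "CC BY 4.0"
)
```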

Documenting Data Science

Q9

Which documentation of a dataset helps a new collaborator the most? (multiple correct answers)

1) A README explaining what the data measures and how it was collected
2) A PDF with screenshots of the data in Excel
3) A data dictionary with variable names, descriptions, and units
4) The licence and known limitations of the dataset

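Answer: 1, 3 & 4) Describing a data file is not describing a dataset. Screenshots only show what the file looks like; the README, data dictionary, licence, and known limitations explain what the data means and how it may be used.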
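A data dictionary can itself be a small table that is checked against the dataset. The sketch below uses made-up data; the column names, descriptions, and units are illustrative assumptions.

```r
# Hypothetical dataset.
temps <- data.frame(
  station_id = c("A1", "A1"),
  month      = as.Date(c("2024-05-01", "2024-06-01")),
  temp_c     = c(14.2, 18.9)
)

# Hypothetical data dictionary: one row per variable.
dictionary <- data.frame(
  variable    = c("station_id", "month", "temp_c"),
  description = c("Weather station identifier",
                  "First day of the month the reading refers to",
                  "Mean air temperature"),
  unit        = c(NA, "date (YYYY-MM-DD)", "degrees Celsius")
)

# Columns without a dictionary entry, and entries without a column.
undocumented <- setdiff(names(temps), dictionary$variable)
orphaned     <- setdiff(dictionary$variable, names(temps))
stopifnot(length(undocumented) == 0, length(orphaned) == 0)
```

Keeping the check next to the data means a renamed or added column fails loudly instead of silently going undocumented.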

Q10

What describes pre-registration in confirmatory data analysis?

1) Reviewing your results before submitting the paper
2) Recording hypotheses and analysis plan before data collection
3) Registering your dataset with a public repository
4) Running multiple models and picking the best one

Answer: 2) Pre-registration commits you to a plan before seeing the data, so post-hoc hypothesising can’t undermine the results.