Quiz: Open Data & renv

Week 9 · Advanced Statistical Programming using R

Recap: IDA & Data Cleaning

Q1

Your dataset has monthly temperature readings, but no row exists for July 2024. What kind of missingness is this, and how do you make it visible?

  1. Explicit missingness — call is.na()
  2. Implicit missingness — use tidyr::complete() to fill the absent row with NA
  3. Structural missingness — drop the row from your dataset
  4. Random missingness — no action needed

Answer: 2) Implicitly missing values are rows that are absent altogether, not cells containing NA, so is.na() cannot find them. tidyr::complete() adds the absent row and fills it with NA so it shows up in your missingness checks.
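A minimal sketch of the idea, assuming the tidyr package is installed. The data frame `temps` and its values are made up for illustration; only `complete()` and `is.na()` come from the question.

```r
library(tidyr)

# Hypothetical monthly readings with no row for July (month 7)
temps <- data.frame(
  month  = c(1:6, 8:12),
  temp_c = c(2.1, 3.4, 7.0, 11.2, 15.8, 19.5, 20.9, 16.3, 11.0, 6.2, 3.0)
)

# The gap is invisible here: no cell holds NA, the row simply does not exist
n_na_before <- sum(is.na(temps$temp_c))

# complete() adds a row for every month in 1:12 that is absent
# and fills the other columns (temp_c) with NA
temps_full <- complete(temps, month = 1:12)

n_na_after    <- sum(is.na(temps_full$temp_c))
missing_month <- temps_full$month[is.na(temps_full$temp_c)]
```

After `complete()`, the July row exists with `temp_c = NA`, so the usual `is.na()` checks and missingness plots can pick it up.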

Q2

Why should you plot your data before trusting summary statistics like mean() or median()?

  1. Plots are required by the tidyverse style guide
  2. summary() is too slow on large datasets
  3. Plots reveal patterns and issues that summary numbers hide
  4. ggplot2 automatically fixes data errors

Answer: 3) Datasets with near-identical means and SDs can look completely different when plotted (Anscombe’s quartet). Plot first, hypothesise about what you see, then decide on a fix.
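This can be checked with the `anscombe` data frame that ships with base R (columns x1 to x4 and y1 to y4). The sketch below compares the summaries of the four y variables and then draws the four scatterplots; the panel layout is just one possible choice.

```r
# Summary statistics of the four y variables in Anscombe's quartet
y_cols  <- c("y1", "y2", "y3", "y4")
y_means <- sapply(anscombe[, y_cols], mean)
y_sds   <- sapply(anscombe[, y_cols], sd)

print(round(y_means, 2))  # near-identical across the four sets
print(round(y_sds, 2))    # near-identical across the four sets

# The plots tell four different stories despite the matching summaries
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(
    anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
    xlab = paste0("x", i), ylab = paste0("y", i),
    main = paste("Anscombe set", i)
  )
}
par(op)
```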

Reproducibility

Q3

What does replicability mean?

  1. Same input data, same code, same conditions → same results
  2. Same data, different reasonable analysis choices → consistent conclusions
  3. Same code, different programming languages → equivalent results
  4. Same scientific question, new independently collected data → consistent results

Answer: 4) Replicability checks whether a finding holds up when the study is repeated with new data. Option 1 describes computational reproducibility; option 2 describes robustness.

renv

Q4

Which files and folders does renv::init() create in your project?

  1. renv/, renv.lock, and an updated .Rprofile
  2. A single packages.txt file
  3. Dockerfile and docker-compose.yml
  4. .libPaths set globally for your user

Answer: 1) renv/ holds the project-local library, renv.lock records exact versions, and .Rprofile activates renv whenever you open the project.

Q5

When using renv, which files should you commit to Git? (multiple correct answers)

  1. renv.lock
  2. .Rprofile
  3. renv/library/ (compiled package binaries)
  4. .gitignore (with renv/library/ excluded)

Answer: 1, 2 & 4) The lockfile and .Rprofile (which sources renv/activate.R, so commit that script too) let others rebuild the library with renv::restore(). The installed packages in renv/library/ are platform-specific, so they stay gitignored.
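The typical renv workflow can be sketched as follows. The three stages are run at different times (and often on different machines), not as one script, and they modify the project, so they are meant to be typed at the R console from the project root.

```r
# Stage 1, once per project: creates renv/, renv.lock and .Rprofile
renv::init()

# Stage 2, after installing or updating packages:
# record the exact versions currently in use in renv.lock
renv::snapshot()

# Stage 3, on a collaborator's machine after cloning the repository:
# rebuild the project library from renv.lock
renv::restore()

# At any time: report whether the library and the lockfile are in sync
renv::status()
```

Because the library is rebuilt from renv.lock by `renv::restore()`, the lockfile is what needs to be in Git, not the installed packages.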

Open Data

Q6

Which combination correctly defines open data?

  1. Anything available on a website
  2. Government data only
  3. Legally open (free to use, modify, share) AND technically open (machine-readable, accessible)
  4. Data with at least 1000 rows

Answer: 3) Both dimensions matter. An openly licensed table published only as a scanned PDF is legally open but technically locked. A CSV with “all rights reserved” is technically open but legally closed.

Q7

What does FAIR stand for in the context of open data?

  1. Fast, Accurate, Inclusive, Reliable
  2. Free, Anonymous, Indexed, Reviewed
  3. Findable, Accessible, Interoperable, Reusable
  4. Filtered, Annotated, Inspected, Reproducible

Answer: 3) Introduced by Wilkinson et al. (2016). Findable (persistent ID), Accessible (open protocols), Interoperable (standard formats), Reusable (clear licence and provenance).

Q8

You use a dataset published under CC BY 4.0 in your project. What must you do?

  1. Pay a licence fee
  2. Apply the same licence to all your derived work
  3. Nothing, because the data is in the public domain
  4. Credit the source by naming publisher, dataset, URL, and licence

Answer: 4) “BY” stands for attribution. Writing just “Data from [website]” is not enough: name the publisher, dataset title, URL, and licence so others can trace the source, and indicate any changes you made to the data.
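One way to keep attributions complete is to build them from named parts, so a missing piece is noticed early. This is a minimal sketch: the helper `attribution()` and every value passed to it are hypothetical placeholders, not a real dataset or a prescribed citation format.

```r
# Hypothetical helper: assemble an attribution line from its four parts
attribution <- function(publisher, title, url, licence) {
  parts <- c(publisher = publisher, title = title, url = url, licence = licence)
  # Refuse to build an incomplete attribution
  if (any(is.na(parts) | !nzchar(parts))) {
    stop("All four parts (publisher, title, url, licence) are required.")
  }
  sprintf("%s: \"%s\", %s, licensed under %s.", publisher, title, url, licence)
}

# Placeholder values for illustration only
credit <- attribution(
  publisher = "Example Statistics Office",
  title     = "Monthly Temperature Readings",
  url       = "https://example.org/data/temperature",
  licence   = "CC BY 4.0"
)
print(credit)
```

The resulting line can go into the project README or a figure caption, next to a note on any changes made to the data.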

Documenting Data Science

Q9

Which documentation of a dataset helps a new collaborator the most? (multiple correct answers)

  1. A README explaining what the data measures and how it was collected
  2. A PDF with screenshots of the data in Excel
  3. A data dictionary with variable names, descriptions, and units
  4. The licence and known limitations of the dataset

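Answer: 1, 3 & 4) Screenshots only show what a data file looks like. Describing a dataset means explaining what it measures, how it was collected, what each variable means, and under which licence and with which limitations it can be used.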
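A data dictionary can itself be a small table that is checked against the data. The sketch below uses base R only; the dataset `readings`, its columns, and the descriptions are invented for illustration.

```r
# Hypothetical dataset
readings <- data.frame(
  station_id = c("A1", "B2"),
  month      = c(1L, 2L),
  temp_c     = c(2.1, 3.4)
)

# Data dictionary: one row per variable, with description and unit
dictionary <- data.frame(
  variable    = c("station_id", "month", "temp_c"),
  description = c(
    "Identifier of the weather station",
    "Calendar month of the reading",
    "Mean air temperature for the month"
  ),
  unit = c(NA, "month number (1-12)", "degrees Celsius"),
  stringsAsFactors = FALSE
)

# Columns present in the data but not described in the dictionary
undocumented <- setdiff(names(readings), dictionary$variable)
if (length(undocumented) > 0) {
  warning("Undocumented variables: ", paste(undocumented, collapse = ", "))
}
```

Keeping the dictionary as a file next to the data (for example a CSV) lets a check like this run whenever the dataset changes.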

Q10

What describes pre-registration in confirmatory data analysis?

  1. Reviewing your results before submitting the paper
  2. Recording hypotheses and analysis plan before data collection
  3. Registering your dataset with a public repository
  4. Running multiple models and picking the best one

Answer: 2) Pre-registration commits you to a plan before seeing the data, so post-hoc hypothesising can’t undermine the results.