Week 9 · Advanced Statistical Programming using R
Your dataset has monthly temperature readings, but no row exists for July 2024. What kind of missingness is this, and how do you make it visible?
is.na()
tidyr::complete() to fill the absent row with NA
Answer: 2) This is implicit missingness: the row is absent rather than holding an NA value. tidyr::complete() adds the missing row and fills it with NA, so the gap shows up in your missingness checks.
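To make the July 2024 case concrete, here is a minimal sketch. The data frame `temps`, its column names, and the temperature values are made up for illustration; only `tidyr::complete()` and `is.na()` come from the question itself.

```r
library(tidyr)

# Made-up monthly readings for illustration; July 2024 (month 7) has no row at all.
temps <- data.frame(
  year  = 2024,
  month = c(5, 6, 8, 9),
  temp  = c(14.2, 18.1, 19.4, 15.0)
)

# No NA values exist yet, so an is.na() check finds nothing.
sum(is.na(temps$temp))

# complete() adds a row for every month in 5:9 and fills temp with NA where absent.
temps_full <- complete(temps, year, month = 5:9)

# The gap is now an explicit NA that missingness checks can see.
temps_full[is.na(temps_full$temp), ]
```

Passing the full range (`month = 5:9`) rather than the bare column matters here: with only the observed values, `complete()` would have no way to know that month 7 should exist.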
Why should you plot your data before trusting summary statistics like mean() or median()?
summary() is too slow on large datasets
Answer: 3) Two datasets with identical mean and SD can look completely different (Anscombe’s quartet). Plot first, hypothesise about what you see, then decide on a fix.
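The point about Anscombe’s quartet can be checked directly, since the `anscombe` data frame ships with base R. The sketch below is one possible way to compare the summaries and the plots; the object names `y_cols` and `y_summary` are illustrative.

```r
# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4.
data(anscombe)

# Mean and SD of each y column are nearly identical across the four sets.
y_cols <- anscombe[, c("y1", "y2", "y3", "y4")]
y_summary <- rbind(mean = sapply(y_cols, mean), sd = sapply(y_cols, sd))
round(y_summary, 2)

# Plotting the four sets side by side shows how different they really are.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)
```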
What does replicability mean?
Answer: 4) Replicability checks whether a finding holds up when the study is repeated with new data. Answer 1 = computational reproducibility; Answer 2 = robustness.
Which files and folders does renv::init() create in your project?
1) renv/, renv.lock, and an updated .Rprofile
2) packages.txt file
3) Dockerfile and docker-compose.yml
4) .libPaths set globally for your user
Answer: 1) renv/ holds the project-local library, renv.lock records exact versions, and .Rprofile activates renv whenever you open the project.
Which files should you commit to Git when using renv? (multiple correct answers)
1) renv.lock
2) .Rprofile
3) renv/library/ (compiled package binaries)
4) .gitignore (with renv/library/ excluded)
Answer: 1, 2 & 4. The lockfile and activation script are needed for others to rebuild the library with renv::restore(). The compiled binaries in renv/library/ are platform-specific and gitignored.
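The two renv questions above can be tied together in a short sketch. The renv calls are shown as comments because they modify the project they run in. `renv_files_present()` is a hypothetical helper written for this example, not part of renv; it only reports whether the three items created by renv::init() exist in a given folder.

```r
# Typical renv workflow (commented out because it modifies the project):
# renv::init()      # create renv/, renv.lock and update .Rprofile
# renv::snapshot()  # record the current package versions in renv.lock
# renv::restore()   # rebuild the library from renv.lock on another machine

# Hypothetical helper: report which renv infrastructure files exist in a project.
renv_files_present <- function(project = ".") {
  paths <- c("renv", "renv.lock", ".Rprofile")
  present <- file.exists(file.path(project, paths))
  names(present) <- paths
  present
}

# Check the current working directory.
renv_files_present()
```

Running the helper in a freshly cloned repository is a quick way to see whether the lockfile and activation script were committed before calling renv::restore().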
Which combination correctly defines open data?
Answer: 3) Both dimensions matter. A PDF behind a paywall is legally open but technically locked. A CSV with “all rights reserved” is technically open but legally closed.
What does FAIR stand for in the context of open data?
Answer: 3) Introduced by Wilkinson et al. (2016). Findable (persistent ID), Accessible (open protocols), Interoperable (standard formats), Reusable (clear licence and provenance).
You use a dataset published under CC BY 4.0 in your project. What must you do?
Answer: 4) “BY” stands for attribution. Just writing “Data from [website]” is not enough. You have to name the publisher, dataset title, URL, and licence so others can trace it.
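One way to keep attributions complete is to build them from the four elements named in the answer. `cc_attribution()` below is a hypothetical helper, and the publisher, title, and URL are placeholders rather than a real dataset; the output format is an assumption, not a wording prescribed by the licence.

```r
# Hypothetical helper: build an attribution line from publisher, title, URL and licence.
cc_attribution <- function(publisher, title, url, licence = "CC BY 4.0") {
  sprintf("%s: \"%s\". %s. Licensed under %s.", publisher, title, url, licence)
}

# Placeholder values, not a real dataset.
attribution <- cc_attribution(
  publisher = "Example Statistics Office",
  title     = "Monthly Temperature Readings",
  url       = "https://example.org/data/temperature"
)
attribution
```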
Which documentation of a dataset helps a new collaborator the most? (multiple correct answers)
Answer: 1, 3 & 4) Describing a data file is not describing a dataset.
What describes pre-registration in confirmatory data analysis?
Answer: 2) Pre-registration commits you to a plan before seeing the data, so post-hoc hypothesising can’t undermine the results.