Week 7 · Advanced Statistical Programming using R
What does devtools::check() do?
DESCRIPTION
R CMD check. Tests, examples, documentation, package structure
Answer: 2) check() runs R CMD check, the complete validation covering tests, examples, documentation, and package structure. Run it before sharing or committing.
In your data package, how do you document the dataset museum_visitors?
R/data.R ending with the string "museum_visitors"
README-museum_visitors.md file
data/museum_visitors-docs.txt file
NAMESPACE directly
Answer: 1) A dataset is documented like a function, except that the roxygen block ends with the dataset name as a string instead of a function definition. devtools::document() then generates the ?museum_visitors help page.
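A minimal sketch of what such a block in R/data.R could look like. The title, column names, and descriptions are hypothetical placeholders, since the quiz does not describe the contents of museum_visitors.

```r
# R/data.R -- sketch only; every field description below is a placeholder.

#' Museum visitor counts
#'
#' A short description of what the dataset contains.
#'
#' @format A data frame with one row per observation:
#' \describe{
#'   \item{museum}{Name of the museum (hypothetical column).}
#'   \item{visitors}{Number of visitors (hypothetical column).}
#' }
#' @source Where the data came from.
"museum_visitors"
```

The last line is the part that differs from function documentation: a quoted dataset name rather than a function definition.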
What does Initial Data Analysis (IDA) refer to?
Answer: 2) IDA is the preparation phase. It checks types, ranges, and missing values so the data matches your assumptions before any inference.
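The three checks named in the answer can be sketched in base R as follows. The data frame and its columns are hypothetical stand-ins for whatever you are actually analysing.

```r
# Hypothetical example data; a real analysis would read this from a file.
visits <- data.frame(
  museum   = c("A", "B", "C", "D"),
  visitors = c(120, 95, NA, 310),
  stringsAsFactors = FALSE
)

# 1. Types: is every column stored the way you expect?
col_classes <- vapply(visits, function(x) class(x)[1], character(1))

# 2. Ranges: do the numeric values fall in a plausible interval?
visitor_range <- range(visits$visitors, na.rm = TRUE)

# 3. Missing values: how many NAs does each column contain?
na_counts <- colSums(is.na(visits))

print(col_classes)
print(visitor_range)
print(na_counts)
```

Keeping these checks in a script rather than typing them into the console means they can be re-run after every cleaning step.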
Why should IDA decisions be documented in a reproducible script rather than made ad-hoc in the console?
Answer: 4) Undocumented decisions made during cleaning can bias your results, a problem known as “data snooping”. A script makes every choice visible, reviewable, and reproducible.
Why is it good practice to specify col_types explicitly when reading a CSV?
read_csv() will not run
Answer: 1) read_csv() guesses column types from a sample of rows at the top of the file (the first 1,000 by default), which can go wrong with dates or ID columns. Example: dates read as text, or ID codes read as numbers.
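A short sketch of an explicit column specification, assuming readr 2.0 or later (needed for literal data via I()). The CSV content and column names are made up for illustration.

```r
library(readr)

# Hypothetical CSV content, passed as literal data via I().
csv_text <- I("ticket_id,visit_date,visitors\n1001,05/01/2024,120\n1002,06/01/2024,95\n")

visits <- read_csv(
  csv_text,
  col_types = cols(
    ticket_id  = col_character(),               # IDs are labels, not quantities
    visit_date = col_date(format = "%d/%m/%Y"), # day/month/year, stated explicitly
    visitors   = col_integer()
  )
)

# Show the column specification that was actually applied.
spec(visits)
```

Stating the date format also removes the ambiguity between day-first and month-first dates, which guessing cannot resolve.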
What does dplyr::glimpse(data) show you?
Answer: 4) glimpse() rotates the data so each column becomes one row. Great for wide datasets where head() cuts columns off.
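For illustration, a call on a small hypothetical tibble (the column names are assumptions, loosely borrowed from the other questions):

```r
library(dplyr)

# Hypothetical data; real datasets where glimpse() helps are usually much wider.
visits <- tibble(
  museum   = c("A", "B", "C"),
  monat    = c("2024-01", "2024-02", "2024-03"),
  visitors = c(120L, 95L, 310L)
)

# Prints one line per column: name, type, and the first few values.
glimpse(visits)
```

The type abbreviations in the output (such as <chr> or <int>) are the quickest way to spot a column stored with the wrong type.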
Which is an example of a domain knowledge check in data cleaning?
999
tests/testthat/
Answer: 3) Domain checks use what you know about the world to spot impossible values. Sentinel codes like 999 or -1 often mean “missing” in disguise.
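A base R sketch of both ideas: recoding sentinel values and flagging implausible ones. The age vector and the 0 to 120 plausibility range are assumptions chosen for the example.

```r
# Hypothetical survey ages where 999 and -1 were used as "missing" codes.
age <- c(34, 999, 27, -1, 61, 140)

# Recode the known sentinel values to NA.
sentinels <- c(999, -1)
age_clean <- ifelse(age %in% sentinels, NA, age)

# Domain check: flag ages outside a plausible human range (0 to 120, an assumption).
implausible <- !is.na(age_clean) & (age_clean < 0 | age_clean > 120)

print(age_clean)
print(which(implausible))
```

Recoding first matters: left in place, a 999 would pull the mean upwards instead of being counted as missing.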
A missing value is structural when:
NA
Answer: 1) Structural missings reflect the population being measured. Example: ‘spouse’s age’ for an unmarried respondent couldn’t exist in the first place.
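The spouse's age example can be sketched in base R like this. The survey data frame is hypothetical, and separating the two kinds of NA by a second column is one possible approach, not a prescribed one.

```r
# Hypothetical survey: spouse_age can only exist for married respondents.
survey <- data.frame(
  married    = c(TRUE, FALSE, TRUE, FALSE),
  spouse_age = c(41, NA, NA, NA)
)

# NA for an unmarried respondent is structural: there is nothing to measure.
structural_na <- is.na(survey$spouse_age) & !survey$married

# NA for a married respondent is a genuine gap in the data.
unknown_na <- is.na(survey$spouse_age) & survey$married

print(sum(structural_na))
print(sum(unknown_na))
```

Only the second kind is a candidate for imputation; filling in a structural missing would invent a value for something that does not exist.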
Which statement about imputation is correct?
NAs before analysing
Answer: 4) Tools like simputation::impute_lm() fill the gaps without explaining how plausible the filled-in values are. Always sense-check imputed values.
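A minimal sketch of regression imputation with a sense-check. It uses base R's lm() and predict() as a stand-in rather than simputation itself, and the data and the imputed flag column are hypothetical.

```r
# Hypothetical data: visitors is missing for one day.
visits <- data.frame(
  temperature = c(10, 15, 20, 25, 30),
  visitors    = c(100, 150, NA, 250, 300)
)

# Keep a flag so imputed values stay identifiable afterwards.
visits$imputed <- is.na(visits$visitors)

# Fit on the complete rows (lm drops NA rows by default), then predict the gaps.
fit <- lm(visitors ~ temperature, data = visits)
visits$visitors[visits$imputed] <- predict(fit, newdata = visits[visits$imputed, ])

# Sense-check: does the imputed value fall inside the observed range?
observed_range <- range(visits$visitors[!visits$imputed])
print(visits)
print(observed_range)
```

The flag column is the design choice worth copying: without it, imputed and observed values become indistinguishable in later analysis.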
You run glimpse() on a dataset and notice monat is stored as <chr> instead of a date. After fixing it with lubridate::ym(), what should you do next?
monat column so there’s no confusion
Answer: 3) IDA is iterative: each fix can reveal new issues or change how the data looks. Plotting before and after a change is the recommended workflow.
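A sketch of the fix followed by a re-check, assuming monat holds year-month strings such as "2024-01". The data frame is hypothetical, and the missing-month check is one example of an issue that only becomes visible after the conversion.

```r
library(lubridate)

# Hypothetical monthly data with monat stored as text.
visits <- data.frame(
  monat    = c("2024-01", "2024-02", "2024-04"),
  visitors = c(120, 95, 310)
)

# The fix: parse year-month strings into Date (first day of each month).
visits$monat <- ym(visits$monat)

# Re-check after the fix: type, range, and NAs introduced by parsing.
print(class(visits$monat))
print(range(visits$monat))
print(sum(is.na(visits$monat)))

# A check that is only possible once monat is a date: are any months missing?
expected <- seq(min(visits$monat), max(visits$monat), by = "month")
missing_months <- expected[!expected %in% visits$monat]
print(missing_months)

# Plot after the change; gaps on the time axis are now visible.
plot(visits$monat, visits$visitors, type = "b", xlab = "monat", ylab = "visitors")
```

Here the conversion succeeds for every row, but the follow-up check reveals that March 2024 is absent, which the text column would have hidden.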