Quiz: Initial Data Analysis & Data Cleaning

Week 7 · Advanced Statistical Programming using R

Recap: R Packages

Q1

What does devtools::check() do?

1) Installs all dependencies listed in DESCRIPTION
2) Runs the full R CMD check: tests, examples, documentation, and package structure
3) Pushes your package to CRAN
4) Renders the package vignettes as PDFs

Answer: 2) check() runs the complete validation including tests, examples, docs, and package structure. Run it before sharing or committing.

Q2

In your data package, how do you document the dataset museum_visitors?

1) Write a roxygen2 block in R/data.R ending with the string "museum_visitors"
2) Add a separate README-museum_visitors.md file
3) Save a data/museum_visitors-docs.txt file
4) Edit NAMESPACE directly

Answer: 1) Data is documented like a function, but the roxygen block ends with the dataset name as a string. devtools::document() then generates the ?museum_visitors help page.
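A minimal sketch of what such a block might look like. The quiz does not describe the contents of museum_visitors, so the title, description, column names, and source below are placeholders.

```r
# R/data.R
# Sketch only: the title, column names, and source are hypothetical.

#' Monthly museum visitors
#'
#' Placeholder description of the dataset.
#'
#' @format A data frame with one row per museum and month:
#' \describe{
#'   \item{museum}{Name of the museum (hypothetical column)}
#'   \item{visitors}{Number of visitors (hypothetical column)}
#' }
#' @source Placeholder; state where the data came from.
"museum_visitors"
```

Running devtools::document() afterwards should write the corresponding help file into man/.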

Initial Data Analysis

Q3

What does Initial Data Analysis (IDA) refer to?

1) The first time you open a dataset in R
2) Systematic checks of data quality before formal analysis
3) Running a regression model on a fresh dataset
4) Creating publication-ready plots

Answer: 2) IDA is the preparation phase. It checks types, ranges, and missing values so the data matches your assumptions before any inference.
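The three checks named in the answer can be sketched in base R. The survey data frame is invented for illustration.

```r
# Toy data; the columns and values are invented for illustration.
survey <- data.frame(
  id     = c("a1", "a2", "a3", "a4"),
  age    = c(34, 51, NA, 27),
  income = c(42000, -1, 38000, 55000)
)

# 1. Types: does each column have the class you expect?
col_classes <- vapply(survey, function(x) class(x)[1], character(1))

# 2. Ranges: are the numeric values plausible? (-1 looks suspicious here.)
income_range <- range(survey$income, na.rm = TRUE)

# 3. Missing values: how many NAs per column?
na_counts <- colSums(is.na(survey))

print(col_classes)
print(income_range)
print(na_counts)
```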

Q4

Why should IDA decisions be documented in a reproducible script rather than made ad-hoc in the console?

1) Scripts run faster than interactive R
2) R requires all data work to live in scripts
3) It is required by GDPR
4) Undocumented choices (outliers, exclusions, transformations) can bias results

Answer: 4) Undocumented decisions made during cleaning can bias your results; this is sometimes called “data snooping”. A script makes every choice visible, reviewable, and reproducible.
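One way a script can make cleaning decisions visible is sketched below. The data and both decisions are invented for illustration; the point is only that each choice is written down and its effect is counted.

```r
# Toy data; the values and the cleaning rules are invented for illustration.
visits <- data.frame(
  visitor_id = 1:6,
  minutes    = c(45, 60, 0, 120, 900, 75)
)

n_before <- nrow(visits)

# Decision 1: drop visits of 0 minutes (assumed here to be ticket scans
# without entry). The rule lives in the script, not in console history.
visits_clean <- visits[visits$minutes > 0, ]

# Decision 2: keep the 900-minute visit, but flag it for review
# instead of deleting it.
visits_clean$flag_long <- visits_clean$minutes > 480

n_after <- nrow(visits_clean)
message("Rows removed by cleaning: ", n_before - n_after)
```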

Data Cleaning

Q5

Why is it good practice to specify col_types explicitly when reading a CSV?

1) It prevents R from misinterpreting columns
2) It makes the file load faster
3) It reduces the file size on disk
4) Without it, read_csv() will not run

Answer: 1) read_csv() guesses column types from a limited number of rows (guess_max, 1000 by default), which can go wrong with dates or ID columns. Example: dates read as text, IDs read as numbers.
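A minimal sketch of an explicit column specification, assuming readr 2.0 or later (for the I() literal-data input). The CSV content and column names are invented for illustration.

```r
library(readr)

# Inline CSV so the example is self-contained; I() marks the string as
# literal data rather than a file path. Columns are hypothetical.
csv_text <- I("id,visit_date,visitors\n1001,05.01.2024,120\n1002,06.01.2024,98\n")

# Explicit types: id stays text, visit_date is parsed with a stated format,
# visitors is an integer. Nothing is left to guessing.
typed <- read_csv(
  csv_text,
  col_types = cols(
    id         = col_character(),
    visit_date = col_date(format = "%d.%m.%Y"),
    visitors   = col_integer()
  )
)

print(typed)
```

If a value does not match its declared type, readr records a parsing problem that can be inspected with problems(typed).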

Q6

What does dplyr::glimpse(data) show you?

1) Only the first 6 rows
2) A correlation matrix of all numeric columns
3) A summary table with means and medians
4) A transposed view: column names, types, and example values per column

Answer: 4) glimpse() rotates the data so each column becomes one row. Great for wide datasets where head() cuts columns off.
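A quick way to see the difference, using the mtcars dataset that ships with R:

```r
library(dplyr)

# glimpse() prints one line per column: name, type, and the first values.
glimpse(mtcars)

# For comparison: head() prints the first six rows, one row per line.
head(mtcars)
```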

Q7

Which is an example of a domain knowledge check in data cleaning?

1) Verifying that all column names use snake_case
2) Confirming the file is UTF-8 encoded
3) Checking that survey ages are between 0 and ~120 (not negative, not 999)
4) Running unit tests in tests/testthat/

Answer: 3) Domain checks use what you know about the world to spot impossible values. Sentinel codes like 999 or -1 often mean “missing” in disguise.
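A sketch of such a check in base R. The age values and the 0 to 120 rule are assumptions made for this example.

```r
# Toy survey ages; 999 and -1 are assumed to be sentinel codes for "missing".
ages <- c(23, 45, 999, 31, -1, 67)

# Domain rule (an assumption for this sketch): a valid age lies in [0, 120].
implausible <- !is.na(ages) & (ages < 0 | ages > 120)

# Inspect the offending values before changing anything.
print(ages[implausible])

# Recode them as NA so they cannot distort means or models.
ages_clean <- replace(ages, implausible, NA)
print(mean(ages_clean, na.rm = TRUE))
```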

Missing Values

Q8

A missing value is structural when:

1) The data point logically cannot exist (e.g. “spouse’s age” for unmarried respondents)
2) It was lost randomly during data transfer
3) The dataset has fewer than 100 rows
4) The variable type is set to NA

Answer: 1) Structural missings reflect the population being measured. Example: ‘spouse’s age’ for an unmarried respondent couldn’t exist in the first place.
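The distinction can be checked in code when a second column explains the gap. The respondents data frame below is invented for illustration.

```r
# Toy data; values invented for illustration.
respondents <- data.frame(
  married    = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  spouse_age = c(41, NA, NA, NA, 36)
)

# Structural: spouse_age is NA because there is no spouse.
structural <- is.na(respondents$spouse_age) & !respondents$married

# Not structural: a spouse exists, but the value was not recorded.
unexplained <- is.na(respondents$spouse_age) & respondents$married

print(c(structural = sum(structural), unexplained = sum(unexplained)))
```

Only the unexplained gaps are candidates for follow-up or imputation; the structural ones are correct as NA.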

Q9

Which statement about imputation is correct?

1) Imputing missing values always improves your analysis
2) Imputation substitutes for understanding why values are missing
3) Imputed values are best: always replace NAs before analysing
4) Imputation can be useful, but it doesn’t fix bad data. Sense-check every imputed value

Answer: 4) Tools like simputation::impute_lm() fill the gap but do not explain why the value was missing. Always sense-check imputed values.
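A sketch of imputing and then sense-checking, using iris (which ships with R) with a few values removed artificially. The choice of predictors is arbitrary and only for illustration.

```r
library(simputation)

# iris has no missing values, so some are introduced artificially here.
dat <- iris
dat$Sepal.Length[c(3, 50, 120)] <- NA

# Remember which rows were missing before imputing.
was_missing <- is.na(dat$Sepal.Length)

# Fill the gaps with predictions from a linear model.
imputed <- impute_lm(dat, Sepal.Length ~ Sepal.Width + Species)

# Sense-check: compare the imputed values with the observed range.
observed_range <- range(dat$Sepal.Length, na.rm = TRUE)
imputed_values <- imputed$Sepal.Length[was_missing]
print(observed_range)
print(imputed_values)
```

Keeping the was_missing flag also makes it possible to mark imputed points in later plots.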

Q10

You run glimpse() on a dataset and notice monat is stored as <chr> instead of a date. After fixing it with lubridate::ym(), what should you do next?

1) Proceed directly to modelling, since the type is now correct
2) Delete the original monat column so there’s no confusion
3) Plot the data again to check the fix worked and look for any new issues
4) Save the file as CSV immediately

Answer: 3) IDA is iterative — each fix can reveal new issues or change how the data looks. Plotting before and after a change is the recommended workflow.
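A sketch of the fix-then-check loop. The data frame, the value column, and the malformed entry are invented for illustration; only monat and lubridate::ym() come from the question.

```r
library(lubridate)

# Toy data; one monat entry is deliberately not a year-month string.
df <- data.frame(
  monat = c("2023-01", "2023-02", "unknown", "2023-04"),
  value = c(10, 12, 11, 15)
)

# Parse to Date; entries that fail to parse become NA (with a warning,
# suppressed here because the failures are inspected explicitly below).
df$monat_date <- suppressWarnings(ym(df$monat))

# Check 1: did the conversion introduce NAs? Which inputs caused them?
failed <- df$monat[is.na(df$monat_date)]
print(failed)

# Check 2: plot again and look for gaps or odd patterns.
plot(df$monat_date, df$value, type = "b", xlab = "Month", ylab = "Value")
```

The original monat column is kept, so any failed conversions can be traced back to their raw values.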