Quiz: Initial Data Analysis & Data Cleaning

Week 7 · Advanced Statistical Programming using R

Recap: R Packages

Q1

What does devtools::check() do?

  1. Installs all dependencies listed in DESCRIPTION
  2. Runs the full R CMD check: tests, examples, documentation, and package structure
  3. Pushes your package to CRAN
  4. Renders the package vignettes as PDFs

Answer: 2) check() runs the complete validation including tests, examples, docs, and package structure. Run it before sharing or committing.

Q2

In your data package, how do you document the dataset museum_visitors?

  1. Write a roxygen2 block in R/data.R ending with the string "museum_visitors"
  2. Add a separate README-museum_visitors.md file
  3. Save a data/museum_visitors-docs.txt file
  4. Edit NAMESPACE directly

Answer: 1) Data is documented like a function, but the roxygen block ends with the dataset name as a string. devtools::document() then generates the ?museum_visitors help page.
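
A minimal sketch of what such a block in R/data.R might look like. The title, column names, and descriptions are hypothetical placeholders, since the quiz does not list the actual columns of museum_visitors.

```r
#' Monthly museum visitors (hypothetical description)
#'
#' @format A data frame with one row per museum and month.
#'   The columns below are assumed for illustration only:
#' \describe{
#'   \item{museum}{Name of the museum (character).}
#'   \item{month}{Month of observation (Date).}
#'   \item{visitors}{Number of visitors (integer).}
#' }
#' @source Placeholder; replace with the real data source.
"museum_visitors"
```

After editing the block, running devtools::document() regenerates the help page.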

Initial Data Analysis

Q3

What does Initial Data Analysis (IDA) refer to?

  1. The first time you open a dataset in R
  2. Systematic checks of data quality before formal analysis
  3. Running a regression model on a fresh dataset
  4. Creating publication-ready plots

Answer: 2) IDA is the preparation phase. It checks types, ranges, and missing values so you can confirm the data matches your assumptions before any inference.
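
The three checks named in the answer can be sketched in base R roughly as follows. The data frame and its column names are invented for illustration.

```r
# Hypothetical toy data standing in for a freshly loaded dataset
visits <- data.frame(
  museum   = c("A", "B", "C", "D"),
  visitors = c(120, NA, 95, 310),
  age      = c(34, 51, 999, 27),
  stringsAsFactors = FALSE
)

# 1. Types: is every column stored the way you expect?
col_classes <- sapply(visits, class)

# 2. Missing values: how many NAs per column?
missing_per_col <- colSums(is.na(visits))

# 3. Ranges: do the minimum and maximum look plausible?
age_range <- range(visits$age, na.rm = TRUE)

print(col_classes)
print(missing_per_col)
print(age_range)
```

Here the age range would expose the suspicious value 999, which a later cleaning step has to deal with.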

Q4

Why should IDA decisions be documented in a reproducible script rather than made ad-hoc in the console?

  1. Scripts run faster than interactive R
  2. R requires all data work to live in scripts
  3. It is required by GDPR
  4. Undocumented choices (outliers, exclusions, transformations) can bias results

Answer: 4) Undocumented decisions made during cleaning can bias your results, a problem sometimes called “data snooping”. A script makes every choice visible, reviewable, and reproducible.

Data Cleaning

Q5

Why is it good practice to specify col_types explicitly when reading a CSV?

  1. It prevents R from misinterpreting columns
  2. It makes the file load faster
  3. It reduces the file size on disk
  4. Without it, read_csv() will not run

Answer: 1) read_csv() guesses column types from a sample of rows (the first 1,000 by default, controlled by guess_max), which can go wrong with dates or ID columns. For example, dates may be read as text and IDs as numbers.
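
A small sketch of explicit column types with readr. The CSV content and the column names (visitor_id, visit_date, visitors) are made up for illustration, and the text is passed inline via I() so no file is needed.

```r
library(readr)

# Hypothetical CSV content, inlined with I() so the example needs no file
csv_text <- I("visitor_id,visit_date,visitors\n1001,2024-01-15,120\n1002,2024-02-15,95\n")

typed <- read_csv(
  csv_text,
  col_types = cols(
    visitor_id = col_character(),  # keep IDs as text, not numbers
    visit_date = col_date(),       # parse ISO dates (YYYY-MM-DD)
    visitors   = col_integer()     # counts are whole numbers
  )
)

print(typed)
```

If a value does not fit the declared type, readr reports a parsing problem instead of silently picking a different type.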

Q6

What does dplyr::glimpse(data) show you?

  1. Only the first 6 rows
  2. A correlation matrix of all numeric columns
  3. A summary table with means and medians
  4. A transposed view: column names, types, and example values per column

Answer: 4) glimpse() rotates the data so each column becomes one row. Great for wide datasets where head() cuts columns off.

Q7

Which is an example of a domain knowledge check in data cleaning?

  1. Verifying that all column names use snake_case
  2. Confirming the file is UTF-8 encoded
  3. Checking that survey ages are between 0 and ~120 — not negative, not 999
  4. Running unit tests in tests/testthat/

Answer: 3) Domain checks use what you know about the world to spot impossible values. Sentinel codes like 999 or -1 often mean “missing” in disguise.
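
One way such a domain check might look in base R. The sentinel codes and the 0 to 120 bound are assumptions taken from the quiz example; a real dataset's codebook should confirm them.

```r
# Hypothetical survey ages with two sentinel codes and one implausible value
ages <- c(34, -1, 51, 999, 27, 130)

# Assumed sentinel codes meaning "missing"; check the codebook before reusing
sentinels <- c(-1, 999)
ages_clean <- replace(ages, ages %in% sentinels, NA)

# Flag remaining values outside the plausible range for a human to review
implausible <- !is.na(ages_clean) & (ages_clean < 0 | ages_clean > 120)

print(ages_clean)
print(which(implausible))
```

The sentinel codes are recoded to NA, while the value 130 is only flagged, because deciding what to do with it is a judgment call that belongs in the script.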

Missing Values

Q8

A missing value is structural when:

  1. The data point logically cannot exist (e.g. “spouse’s age” for unmarried respondents)
  2. It was lost randomly during data transfer
  3. The dataset has fewer than 100 rows
  4. The variable type is set to NA

Answer: 1) Structurally missing values reflect the population being measured. Example: “spouse’s age” for an unmarried respondent could not exist in the first place.
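
A rough sketch of how structural and unexplained missing values could be told apart in code. The survey data frame and its columns (married, spouse_age) are hypothetical.

```r
# Hypothetical survey: spouse_age can only exist for married respondents
survey <- data.frame(
  married    = c(TRUE, FALSE, TRUE, FALSE),
  spouse_age = c(41, NA, NA, NA)
)

# Label each value: observed, structurally missing, or missing without explanation
survey$missing_type <- ifelse(
  !is.na(survey$spouse_age), "observed",
  ifelse(survey$married, "unexplained", "structural")
)

print(survey)
```

Only the "unexplained" rows are candidates for follow-up or imputation; the "structural" ones are not gaps in the data.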

Q9

Which statement about imputation is correct?

  1. Imputing missing values always improves your analysis
  2. Imputation substitutes for understanding why values are missing
  3. Imputed values are best — always replace NAs before analysing
  4. Imputation can be useful, but it doesn’t fix bad data — sense-check every imputed value

Answer: 4) Tools like simputation::impute_lm() fill the gaps but do not explain why the values were missing. Always sense-check imputed values.
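
To illustrate the sense-check, here is a sketch that does a linear-model imputation by hand in base R (a stand-in for simputation::impute_lm(), not that function itself) and keeps a flag marking which values were imputed. The data are invented.

```r
# Hypothetical data with one missing y value
dat <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, NA, 8.2, 9.8)
)

# Remember which rows are imputed so they can be reviewed later
dat$y_imputed <- is.na(dat$y)

# Fit on complete rows (lm() drops rows with NA by default), then predict the gap
fit <- lm(y ~ x, data = dat)
dat$y[dat$y_imputed] <- predict(fit, newdata = dat[dat$y_imputed, , drop = FALSE])

# Sense-check: does the imputed value sit within the observed range?
observed_range <- range(dat$y[!dat$y_imputed])
print(dat)
print(observed_range)
```

Keeping the y_imputed flag means imputed values can be plotted in a different colour or excluded in a sensitivity analysis.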

Q10

You run glimpse() on a dataset and notice monat is stored as <chr> instead of a date. After fixing it with lubridate::ym(), what should you do next?

  1. Proceed directly to modelling — the type is now correct
  2. Delete the original monat column so there’s no confusion
  3. Plot the data again to check the fix worked and look for any new issues
  4. Save the file as CSV immediately

Answer: 3) IDA is iterative — each fix can reveal new issues or change how the data looks. Plotting before and after a change is the recommended workflow.
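
A sketch of the fix-then-check loop for the monat column. The data frame, the visitors column, and the assumption that monat uses a "YYYY-MM" format are all hypothetical; the quiz does not show the actual values.

```r
library(lubridate)

# Hypothetical data: monat stored as text, with one entry that is not a month
visits <- data.frame(
  monat    = c("2023-01", "2023-02", "n/a"),
  visitors = c(120, 95, 310),
  stringsAsFactors = FALSE
)

# Convert to Date (first day of each month); quiet = TRUE silences the parse warning
visits$monat_date <- ym(visits$monat, quiet = TRUE)

# New issue revealed by the fix: values that were text but failed to parse
failed_to_parse <- is.na(visits$monat_date) & !is.na(visits$monat)
print(visits[failed_to_parse, ])

# Plot again after the change to see whether the time axis now looks right
plot(visits$monat_date, visits$visitors, xlab = "Month", ylab = "Visitors")
```

Keeping the original monat column next to the converted one makes it easy to see which entries failed to parse.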