Advanced Statistical Programming using R

Week 7: Initial Data Analysis & Data Cleaning

2026-05-28

Announcements

Reminders

  • All groups should be registered:
    • Group member names & LMU campus emails
    • GitHub repo (set up this week in lecture/practical)
  • Proposal submission deadline extended to start of W8 (Mon Jun 8)

Pick your datasets by tomorrow (May 29)!

This week’s practical will be based around helping you prepare your group project proposals. The practical will be most helpful if you have collected the data you plan to use in your projects.

Tips

Projects must be based on a cohesive set of data (i.e. your dataset), in a single GitHub repo, and available on the web. It is intended to help you gain hands-on experience with statistical programming skills.

If you don’t know where to start, try:

  • Turning code for data import, cleaning, or tidying into functions — and collecting them into an R package
  • Putting visualisations into different layouts on a Quarto webpage
  • Describing how the dataset was collected and what it contains (variables, rows, licence)
  • Documenting discussions — e.g. “We wanted to look at daily temperatures vs. museum visitors, but visitor numbers were only monthly, so we instead explored…”

FAQs

Q: What format is the project?

A: Up to you! Be creative.

Q: Do we have to use a particular dataset?

A: No — any openly licensed data source is fine.

Q: Can we work in groups of 2? or alone?

A: Yes, though 3–4 is recommended so you have more chances to encounter merge conflicts and manage pull requests.

Suggestions for Final Output

The format should suit your analysis context and target audience:

Key dates

Milestone Date Submission
Group registration May 22 Group members, GitHub repo URL
Proposal Start in W7 practical, due 8 Jun URL to rendered proposal
Group reflection — in-class discussion W14 practical N/A
Final submission + group-reflection.qmd Due 22 Jul URL to final website/webpage, group reflection doc
Individual contribution statement Due 23 Jul PDF via Moodle
Oral exam 29 Jul

Submission Formats

  • All submissions will be via forms on Moodle.
  • The final submission should be a URL to a rendered website.
  • The group-reflection.qmd documents what your group did, how you used LLMs, and key decisions
  • contribution.pdf documents your contribution to your group project, and the specific tasks and skills you worked on.
  • For anyone working alone (i.e. groups of 1), your group reflection & contribution statements can be the same.

Relation to Oral Exam

Students who submit individual contribution statements will have oral exam questions related to their own work. Otherwise, the oral exam questions will involve generic projects.

Syllabus

Syllabus

Part 1: Statistical Programming Foundations

  • W02: Scripts, Functions & Refactoring
  • W03: Debugging
  • W04: Version Control & Remotes
  • W05: Quarto Websites & Collaborative Coding
  • W06: R Packages

Part 2: Working with Real World Data

  • W07: Initial Data Analysis & Data Cleaning
  • W09: Open Data & Renv
  • W10: Modelling & Analysis
  • W11: Statistical Communication & Visualisation

Part 3: Advanced Topics & Summary

  • W12: Interactive Data Storytelling
  • W13: Multi-lingual Analysis (Python/R)
  • W14: Review & Outro

Last Week

R Packages

  • A package is the standard unit for sharing functions, data, and documentation in R
  • usethis & devtools streamline the development workflow: create → document → check → install
  • Data packages (e.g. munichvisitors) distribute clean, documented datasets — a great first package
  • roxygen2 turns #' comments into .Rd help files — document as you write

Documentation & Design

  • Every exported object needs @param, @returns, @export, and @examples
  • Expose data as an argument (data = museum_visitors) when users may want to filter or swap it
  • Single responsibility — each function does one thing; package groups functions with a shared theme
  • Consistent interfacesdata first, options last with sensible defaults
  • Limit dependencies — every imported package is a liability your users inherit

This Week

  • Initial Data Analysis
  • Data Screening
  • Data Cleaning
  • Missing Data

Initial Data Analysis

Based on: https://ddde.numbat.space/week3/slides.html#/title-slide, Huebner et al. (2018), Huebner et al. (2020)

Idealised Statistical Investigation

From Chatfield (1985):

  1. Clarify the objectives of the investigation
  2. Collect the data in an appropriate way
  3. Investigate the structure and quality of the data
  4. Carry out an initial examination of the data
  5. Select and carry out an appropriate formal statistical analysis
  6. Compare findings with previous results or acquire further data
  7. Interpret and communicate the results

Note

Steps 3 & 4 describe Initial Data Analysis (IDA).

What is Initial Data Analysis?

“IDA is a systematic and documented pre-analysis examination of a dataset, carried out before formal statistical analysis and without reference to the main research questions.” — adapted from Huebner et al. (2018)

Where does IDA fit?

  • Occurs between data collection and planned statistical analyses
  • Does not touch the research questions

What does IDA uncover?

  • Missing values
  • Subgroups with small sample sizes
  • Data collection shortcomings
  • Impact of these issues on research results

IDA, EDA, and CDA

Exploratory Data Analysis

  • Hypothesis-generating
  • Open-ended exploration
  • May shift research direction

Confirmatory Data Analysis

  • Executes the statistical analysis plan (SAP)
  • Answers predefined research questions

Initial Data Analysis

  • Ensures transparency & integrity
  • Checks preconditions for analysis
  • Does not directly answer research questions

Shared methods and tools

All three types of analysis use statistical visualisation and summaries. The difference is intent and timing, not technique.

IDA vs EDA vs CDA

IDA EDA CDA
Goal Check preconditions; ensure data integrity Generate hypotheses Answer predefined research questions
Scope Systematic, pre-specified Open-ended Planned (statistical analysis plan)
Relation to research questions Must not touch them May reshape them Answers them
When Before analysis begins Early–mid analysis After IDA + EDA
Output Documentation of data issues & decisions New hypotheses, patterns Statistical inference, conclusions

Relation to Data Science Workflow

From R4DS:

  • Statistical analysis involves both planned (sequential) and exploratory (iterative) phases and subphases.
  • IDA maps most easily to import, tidy & transform stages
  • But IDA investigations may also involve all components of a data science workflow:
    • e.g. using visualisation to identify data quality issues

IDA in practice

Guiding principle

IDA strategies require careful thinking and design based on problem and context, not analyst preference.

“There are no routine statistical questions, only questionable statistical routines.” — David R. Cox

  • Reporting is mixed and varied across data analysis reports and research papers
  • Part of the statistical value chain in official statistics
  • A skill developed through practice and hands-on reasoning about data
  • Forces you to engage with assumptions you might unconsciously make about data

Uncovering data issues in IDA

Issues found during IDA can occur at multiple levels:

  • Univariate — problems within a single variable (wrong type, outliers, unexpected distribution)
  • Multivariate — problems across variables (unexpected correlations, group imbalances, distributional differences by subgroup)
  • Dataset — structural problems (missing rows, duplicate observations, wrong unit of analysis)

IDA & the Analysis Plan

Issues uncovered during IDA may require updating the statistical analysis plan before proceeding to CDA.

Topic Examples Possible influence on Analysis Plan
Suspicious values Outliers; inconsistent follow-up dates; conflicting diagnosis sources Rules for whether/how to use inconsistent information
Suspicious subjects Doubts on inclusion criteria; highly corrupt subject data Exclusion decisions balancing data cleanliness vs. selection bias
Unexpected heterogeneity Incompatible education systems; centre-level measurement differences Exclusion or stratification; re-examining measurement instructions
Variable distribution Unexpectedly skewed covariates; bimodal continuous variables Transformations; splitting variables; adapting statistical methods
Violated method requirements Non-normal outcomes; asymptotic tests with small subgroups Refinement, extension, or reduction of models

IDA and Inference

The Goal of IDA: is to avoid misleading or inappropriate statistical analyses. However, IDA can compromise validity of inference if you:

  • Change inference or research questions based on IDA findings
  • Make undocumented or unjustified decisions about:
    • Outlier removal or retention
    • Missing value imputation strategy
    • Treatment of zeros
    • Variable type handling (categorical, temporal)
    • Variable and observation selection
  • Skip multivariate relationship checks or subgroup-level checks

Protecting CDA validity after IDA

Justify key decisions

  • Outlier removal or retention
  • Variable selection & recoding
  • Sampling choices
  • Handling of zeros

Document everything

  • Record ALL IDA in a reproducible analysis script
  • Pre-register your CDA plan so research questions cannot drift — ensures against accusations of data snooping, data dredging, or data fishing.

IDA in StatProg2

We cover IDA across several weeks — each builds on the last:

IDA component When
Data screening — checking types, context, distributions W7 (this lecture)
Data cleaning / preprocessing — fixing issues found in screening W7 (this lecture + practical)
Analysis planning & updating — refining the analysis plan based on IDA findings W7 practical + W10
Reporting & documentation — recording IDA decisions reproducibly W9 (next lecture)

Group project

Your project proposal should include both an IDA plan (what you will screen and clean) and a CDA plan (the research questions you will answer). IDA findings may lead you to refine your CDA plan before analysis.

Review: Data Import & Tidying

  • getting data into R
  • tidying and reshaping data

Reading data into R

Format Function Package
CSV / TSV read_csv(), read_tsv() readr
Excel (.xlsx, .xls) read_excel() readxl
SPSS / Stata / SAS read_sav(), read_dta(), read_sas() haven
JSON fromJSON() jsonlite
Parquet read_parquet() arrow
Database dbReadTable(), tbl() DBI / dbplyr

Tip

Prefer readr::read_csv() over base read.csv() — it produces a tibble, infers column types with explicit feedback, and is faster on large files.

Raw, Interim & Final Data

Folder convention

  • data/raw/ — original source files; never modified
  • data/interim/ — intermediate .rds snapshots from cleaning scripts
  • data/final/ — analysis-ready dataset; reproducibly derived from raw/
my-project/
├── data/
│   ├── raw/          ← source files; read-only
│   ├── interim/      ← .rds cleaning snapshots
│   └── final/        ← analysis-ready dataset
├── code/
│   ├── 01-import.R
│   ├── 02-clean.R
│   └── 03-analyse.R
├── docs/
└── my-project.Rproj

Importing and caching

Raw data files (Excel, SPSS, database exports) can be slow to re-import. Tidied or cleaned intermediates are often saved as .rds snapshots in data/interim/ rather than re-derived each run.

Import, process and cache

library(readxl)
raw <- read_excel("data/raw/raw-data.xlsx")

tidy <- raw |>
  janitor::clean_names() |>
  tidyr::pivot_longer(...) |>
  ...

saveRDS(tidy, "data/interim/tidy-data.rds")

RDS preserves R data types (factors, dates, ordered levels) without re-parsing, unlike saving back to CSV.

Load for analysis or more processing

tidy <- readRDS("data/interim/tidy-data.rds")

Export final analysis-ready dataset

final <- tidy |>
  dplyr::mutate(...) |>
  dplyr::filter(...) |>
  ...

readr::write_csv(final, "data/final/analysis-data.csv")

Tip

Never overwrite the original raw data file. Save processed versions separately. Write final datasets to common formats like CSV for easier sharing.

Tidy data

Three rules (Wickham, 2014)

  1. Each variable is a column
  2. Each observation is a row
  3. Each value is a single cell

Tidy data is the target state — raw data almost never arrives this way.

Why it matters

  • Tidyverse tools (dplyr, ggplot2, tidyr) assume tidy input
  • Tidy data makes vectorised operations natural and pipe-friendly
  • A shared standard means collaborators can read your data without explanation

Tidying vs. IDA & Cleaning

Tidying is not the same as cleaning — a dataset can be tidy but still have wrong values, bad types, or missing data.

Common tidying tools

Task Function Package
Standardise column names clean_names() janitor
Tabulate a categorical variable tabyl() janitor
Reshape wide → long pivot_longer() tidyr
Reshape long → wide pivot_wider() tidyr
Fill implicit gaps in time series complete() tidyr
Recode / transform variables mutate() dplyr
Remove duplicate rows distinct() dplyr
Filter to valid rows filter() dplyr

Tip

janitor::clean_names() should almost always be the first step after import — it converts AUSPRAEGUNGauspraegung, removing spaces and special characters so column names are pipe-safe.

Data screening

  • making sure data are represented correctly in R
  • identifying data type and format errors
  • preliminary characterisation of dataset

Data types & formats

  • Data types are semantic vectors — they encode the kind of data, not just its storage
    • Basic R types (review): logical, integer, double, character, factor
    • Complex types built on top: Date/POSIXct on numerics; factor on integers
  • Data formats describe how data is encoded on disk
    • Some types need specialist import handling (geospatial, images, nested JSON)

Best Practices

Declare types explicitly with col_types when importing, then verify with glimpse(). The R data type should closely match the real-world meaning of the variable.

Example: New York Temperatures

new-york-temp.csv
id,date,loc,temp
1,3/1/10,New York,42
2,3/2/10,New York,41.4
3,3/3/10,New York,38.5
4,3/4/10,New York,41.1
5,3/5/10,New York,39.8

Checking data types

Whole table: glimpse()

library(readr)
df <- read_csv("data/date-example.csv")
glimpse(df)
Rows: 5
Columns: 4
$ id   <dbl> 1, 2, 3, 4, 5
$ date <chr> "3/1/10", "3/2/10", "3/3/10", "3/4/10", "3/5/10"
$ loc  <chr> "New York", "New York", "New York", "New York", "New York"
$ temp <dbl> 42.0, 41.4, 38.5, 41.1, 39.8

Single column: str()

str(df$date)
 chr [1:5] "3/1/10" "3/2/10" "3/3/10" "3/4/10" "3/5/10"
str(df$id)
 num [1:5] 1 2 3 4 5

Tip

The R type shown should match the real-world meaning of the variable.

What happens if you don’t check?

What column types do we get when reading in the New York temperature data?

library(readr)
read_csv("data/date-example.csv")
# A tibble: 5 × 4
     id date   loc       temp
  <dbl> <chr>  <chr>    <dbl>
1     1 3/1/10 New York  42  
2     2 3/2/10 New York  41.4
3     3 3/3/10 New York  38.5
4     4 3/4/10 New York  41.1
5     5 3/5/10 New York  39.8
  • What problems are there with R’s interpretation of the data type?
  • What context-specific factors suggest an incorrect interpretation?

Compare with .xlsx import:

library(readxl)
library(readr)
df <- read_excel("data/date-example.xlsx")
df
# A tibble: 5 × 4
     id date                loc       temp
  <dbl> <dttm>              <chr>    <dbl>
1     1 2010-01-03 00:00:00 New York  42  
2     2 2010-02-03 00:00:00 New York  41.4
3     3 2010-03-03 00:00:00 New York  38.5
4     4 2010-04-03 00:00:00 New York  41.1
5     5 2010-05-03 00:00:00 New York  39.8
  • Excel files carry column type metadata — so read_excel() can infer datetime.
  • CSV is plain text with no type information — read_csv() must guess, and guesses chr for ambiguous strings like "1/3/2010".
  • R has no way to know it is a date without being told.

Fixing data types

library(lubridate)
library(dplyr)

# Step 1: set raw storage type on import
df <- read_excel("data/date-example.xlsx",
                 col_types = c("text",    # id
                               "date",    # date → POSIXct
                               "text",    # loc
                               "numeric"))# temp

# Step 2: convert to intended R types; extract date components
df |>
  mutate(id   = as.factor(id),  # chr → factor
         date = ydm(date)) |>   # POSIXct → Date (year-day-month)
  mutate(day   = day(date),
         month = month(date),
         year  = year(date))
# A tibble: 5 × 7
  id    date       loc       temp   day month  year
  <fct> <date>     <chr>    <dbl> <int> <dbl> <dbl>
1 1     2010-03-01 New York  42       1     3  2010
2 2     2010-03-02 New York  41.4     2     3  2010
3 3     2010-03-03 New York  38.5     3     3  2010
4 4     2010-03-04 New York  41.1     4     3  2010
5 5     2010-03-05 New York  39.8     5     3  2010
  • id is now a factor instead of integer
  • day, month, year extracted from date
  • Is the data screening issue fixed now?

Ambiguity in Date Formats!

  • There are multiple different date formats in widespread usage:
    • MM/DD/YYYY (US)
    • DD/MM/YYYY (international)
    • YYYY/MM/DD (ISO 8601)
  • What’s the most likely format in this example?
  • Most likely 1st–5th March, not 3rd of Jan–May
  • Could check against some reference facts (e.g. look up weather history for New York)

Working with dates in R {lubridate}

Lubridate is a package designed to make working with dates smoother and easier in R.

Parse dates from strings

library(lubridate)
ymd("2024-05-15")      # ISO 8601 → Date
dmy("15/05/2024")      # day-month-year
mdy("05/15/2024")      # US format
ym("202405")           # year-month only

Extract components

d <- ymd("2024-05-15")
year(d)                # 2024
month(d)               # 5
day(d)                 # 15
wday(d, label = TRUE)  # Wed

Tip

Other special data types have dedicated helper packages — e.g. sf for geospatial data, hms for time-of-day, forcats for factors.

Specifying types explicitly

Explicit types catch data errors before they silently affect analysis — a misread date or factor-stored-as-number can corrupt results without any warning.

read_excel() with col_types

xlsx_df <- read_excel(
  "data/date-example.xlsx",
  col_types = c("text", "date", "text", "numeric")) |>
  mutate(id   = as.factor(id),
         date = as.character(date),
         date = as.Date(date, format = "%Y-%d-%m"))

read_csv() with cols() spec

csv_df <- read_csv(
  "data/date-example.csv",
  col_types = cols(
    id   = col_factor(),
    date = col_date(format = "%m/%d/%y"),
    loc  = col_character(),
    temp = col_double()))

Dataset Definition and Quality

Defining and constructing datasets

Dataset conceptualisation is the process of deciding what a dataset should contain before assessing what it does contain.

  • What is the unit of observation?
  • What is the target population?
  • What time period does it cover?
  • What variables are required for the analysis?

Why it matters

You cannot assess whether data is complete, suitable, or correctly structured without first defining what the dataset should look like.

The same raw data can produce very different datasets depending on the conceptualisation — e.g. one row per museum visit vs. one row per museum per month.

What is data quality?

Data quality depends on both the dataset itself and the intended use — a dataset can be high quality for one purpose and unsuitable for another.

Context-independent quality

  • Is it well documented? (codebook, variable descriptions)
  • Are values internally consistent?
  • Is the target population and sampling frame clearly defined?

Context-specific quality

  • Is the dataset suitable for this analysis task?
  • Were appropriate collection methods used for this research question?
  • Does the sample represent the population of interest?

Data Cleaning

  • investigating
  • modifying and fixing

What is data cleaning?

No universal definition

  • Data cleaning is commonly used to refer to a variety of data tasks performed before modelling or inference on data.
  • Also called data preparation, data preprocessing, or data wrangling
  • Think in terms of “preparation” or “preprocessing” to avoid assuming there is one “true clean” version of a dataset

What it involves

  • Investigating possible data quality issues
  • Making decisions about outliers, missing values, and variable types
  • Performed in the context of particular target dataset concept
  • Can overlap with ‘data screening’ — there’s no hard rule for which stage you can or will spot certain issues with the data

Data screening, tidying and cleaning

Compared to data screening and tidying, data cleaning requires more judgement and iterative investigation — the tools and methods used depend on what problems are actually found in the data.

Issues within and across variables

Check for issues within individual variables or across the whole dataset:

Variable-level issues

  • Values outside plausible range (e.g. negative heights, temperatures above 100°C)
  • Non-integer counts; compositional values that don’t sum to 1
  • Wrong data type (e.g. date stored as text)
  • Missing values (NA): unrecorded, censored, or not applicable
  • Skewed, bimodal, or otherwise unexpected distributions
  • Detect with: glimpse(), summary(), histograms, boxplots

Structural / cross-variable issues

  • Imbalanced groups (e.g. treatment vs. control demographics differ)
  • Train/test distribution mismatch
  • Insufficient data at levels of a categorical variable
  • CDA assumption violations: non-normality, unequal variance, unexpected discreteness
  • Detect with: cross-tabulations, group comparisons, scatter plots

Approaches to identifying issues

Both variable-level and structural / cross-variable issues (see previous section) can be detected with a mix of numerical and graphical tools.

Look at your data

  • Eyeballing raw values — catches obvious entry errors and formatting problems
  • Numerical summaries: summary(), skimr::skim()
  • Graphical summaries: histograms, boxplots, scatter plots

Prioritise graphical checks

  • Distributions are easier to read as plots than as tables
  • Relationships between variables may only be visible visually

Look and then look again at your data!

Always plot your data before, after and between preparation steps. Exactly what to plot depends on what you are trying to see.

Fixing identified issues

Variable-level fixes

  • Recode wrong types: mutate() + as.*()
  • Replace or flag out-of-range values
  • Standardise inconsistent encodings (e.g. "M" / "Male" → consistent factor)
  • Handle missing values: drop, impute, or flag

Structural fixes

  • Remove duplicate rows: distinct()
  • Fill implicitly missing rows: tidyr::complete()
  • Correct unit of analysis: reshape with pivot_*()
  • Split or merge variables: separate(), unite()

Handling missing values

Can you just drop missing values?

Common practice: drop rows with NA using na.omit() or filter(!is.na(x))

Pitfalls:

  • Reduces sample size and statistical power
  • May introduce selection bias if values are not missing at random
  • Silently changes the population your analysis represents

Warning

Most R functions silently drop NA with na.rm = TRUE — this hides the problem rather than resolving it.

Why missingness is conceptually hard

  • The absence of a value is itself informative
  • You cannot directly observe why a value is missing
  • The right approach depends on the missing data mechanism — not just the proportion missing

Explicit vs. Implicit Missingness

Explicit missingness

  • NA values present in the data
  • Identified with is.na(), summary(), or naniar::vis_miss()
  • Need to decide: remove, impute, or model the missingness mechanism

Implicit missingness

  • Rows that should exist are absent entirely
  • Common in longitudinal/time-series data (missing time points) or categorical combinations that were never recorded
  • Harder to detect — requires domain knowledge about expected structure

Tools for missingness

naniar — tidy missing data tools

library(naniar)

# summary of missingness by variable
miss_var_summary(df)

# heatmap of NA pattern across rows & columns
vis_miss(df)

# upset plot of co-missingness patterns
gg_miss_upset(df)

visdat — visual data overview

library(visdat)

# colour-coded type + NA overview
vis_dat(df)

# proportion NA per variable
vis_miss(df, sort_miss = TRUE)

Tip

vis_dat() is often the fastest way to spot type problems and missing values in one glance.

Example: Olympic medals

Explicit NAs are relatively easy to detect — but implicit missingness often requires reasoning about what rows should exist.

             country totalmedal
1       UnitedStates        104
2              China         88
3             Russia         82
4       GreatBritain         65
5            Germany         44
6              Japan         38
7          Australia         35
8             France         34
9         SouthKorea         28
10             Italy         28
11       Netherlands         20
12           Ukraine         20
13            Canada         18
14           Hungary         17
15             Spain         17
16            Brazil         17
17              Cuba         14
18        Kazakhstan         13
19        NewZealand         13
20              Iran         12
21           Jamaica         12
22           Belarus         12
23             Kenya         11
24     CzechRepublic         10
25            Poland         10
26        Azerbaijan         10
27           Romania          9
28           Denmark          9
29            Sweden          8
30          Colombia          8
31          Ethiopia          7
32            Mexico          7
33           Georgia          7
34        NorthKorea          6
35       SouthAfrica          6
36           Croatia          6
37             India          6
38            Turkey          5
39         Lithuania          5
40           Ireland          5
41          Mongolia          5
42       Switzerland          4
43            Norway          4
44          Slovenia          4
45            Serbia          4
46         Argentina          4
47        Uzbekistan          4
48 TrinidadandTobago          4
49          Slovakia          4
50           Tunisia          3
51          Thailand          3
52           Finland          3
53           Belgium          3
54           Armenia          3
55 DominicanRepublic          2
56            Latvia          2
57             Egypt          2
58        PuertoRico          2
59          Malaysia          2
60         Indonesia          2
61           Estonia          2
62            Taiwan          2
63          Bulgaria          2
64         Singapore          2
65             Qatar          2
66           Moldova          2
67            Greece          2
68         Venezuela          1
69            Uganda          1
70           Grenada          1
71           Bahamas          1
72           Algeria          1
73          Portugal          1
74        Montenegro          1
75         Guatemala          1
76             Gabon          1
77            Cyprus          1
78          Botswana          1
79        Tajikistan          1
80       SaudiArabia          1
81           Morocco          1
82            Kuwait          1
83          HongKong          1
84           Bahrain          1
85       Afghanistan          1

What is the average medals per country?

n_in_data   <- nrow(olym12)
total_medals <- sum(olym12$totalmedal)
round(total_medals / n_in_data, 2)
[1] 11.32

What is missing?

Countries that won zero medals are absent — only medal-winning countries are recorded.

Correct average requires knowing how many nations competed (204):

n_competing <- 204
round(total_medals / n_competing, 2)
[1] 4.72

How to handle missingness?

When dropping is defensible

  • Missingness is a small fraction of observations
  • Domain knowledge supports MCAR (missing completely at random)
  • Sensitivity analysis confirms results are robust to the dropped rows

When to consider alternatives

  • Large proportion of missing values
  • Missing values concentrated in a subgroup (potential bias)
  • The outcome variable itself is missing
  • Alternatives: imputation, modelling the missingness mechanism, or retaining as a separate "Unknown" category.

Group projects

Use AI/LLMs to help reason through your handling of any missingness — then document and justify the decision in your analysis script.

Making implicit missing data explicit

Make implicit missing data explicit when you need to:

  • calculate statistics that account for absent rows (e.g. averages over all time periods, including unrecorded months)
  • visualise time-series gaps correctly without interpolating across missing periods
  • verify coverage before CDA

Warning

Don’t make all combinations explicit without thought — with many grouping variables, tidyr::complete() can generate millions of placeholder rows.

Tools for converting implicit to explicit

tidyr::complete()

Adds rows for all combinations of specified variables, filling gaps with NA:

library(tidyr)
df |>
  complete(museum, year_month)

tsibble::fill_gaps()

For regular time series, fills gaps in the time index:

library(tsibble)
ts_df |>
  fill_gaps()

Extension: Missing data mechanisms

The missing data mechanism determines whether dropping or imputing is valid, and what modelling adjustments may be needed.

MCAR — Missing Completely At Random

  • Probability of missing is unrelated to any variable
  • Safe to drop; complete-case analysis remains unbiased
  • Testable with Little’s MCAR test

MAR — Missing At Random

  • Missingness depends on observed variables, not the missing value itself
  • Imputation is valid (e.g. predictive mean matching with {mice})

MNAR — Missing Not At Random

  • Missingness depends on the unobserved value itself
    • e.g. sicker patients drop out of a health study
  • Dropping or standard imputation introduces bias
  • Requires explicitly modelling the missingness mechanism

Warning

MNAR cannot be confirmed from the data alone — it requires domain knowledge and assumptions.

Illustrative Case Study

Data: Statistisches Amt München. Monatszahlen Museen. Landeshauptstadt München Open Data Portal. License: Datenlizenz Deutschland Namensnennung 2.0 (dl-by-de). https://opendata.muenchen.de/dataset/monatszahlen-museen

Munich Museums from Last Week

IDA for Munich Museums

It’s helpful to list out possible issues when investigating a new dataset. For example in museum_visitors:

  • Data types: what does glimpse() show — are monat, jahr, auspraegung the right types for analysis?
  • Data format: the MONAT column encodes year-month as a 6-digit integer (e.g. 202301). What transformation is needed before time-series analysis?
  • Explicit missingness: which museums and months have NA in wert?
  • Implicit missingness: the data dictionary states different start dates per museum (as early as January 2000, as late as May 2009). Are earlier months present as rows, or absent entirely?
  • Completeness: which museums have the fewest observations? Are any museums absent from the dataset entirely?
  • Anomalies: are there months with unexpectedly low visitor counts? What real-world event might explain a sharp drop in 2020?

Example: Initial plots

Let’s plot museum_visitors as a line chart with monat on the x-axis, and wert on the y:

museum_visitors |>
    ggplot2::ggplot(ggplot2::aes(x = monat, y = wert, colour = auspraegung)) +
    geom_line()

What went wrong?

  • Why do you think there is a vertical line?
  • Can you see data for all the museums?
  • What was I assuming about the values in monat and wert?
  • What should we investigate first?

Example: Screening the data

What do you notice about the column types?

## lets start with glimpse()
glimpse(museum_visitors)
Rows: 3,429
Columns: 9
$ monatszahl                     <chr> "Besucher*innen", "Besucher*innen", "Be…
$ auspraegung                    <chr> "Alte Pinakothek", "Alte Pinakothek", "…
$ jahr                           <dbl> 2026, 2026, 2026, 2026, 2026, 2026, 202…
$ monat                          <chr> "202601", "202602", "202603", "202604",…
$ wert                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ vorjahreswert                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ veraend_vormonat_prozent       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ veraend_vorjahresmonat_prozent <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ zwoelf_monate_mittelwert       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Could we fix this with {lubridate} parsing?

head(museum_visitors$monat)
[1] "202601" "202602" "202603" "202604" "202605" "202606"
head(museum_visitors$monat) |> lubridate::ym()
[1] "2026-01-01" "2026-02-01" "2026-03-01" "2026-04-01" "2026-05-01"
[6] "2026-06-01"

Looks good so far, let’s apply the fix to the table

museum_visitors <- museum_visitors |>
  dplyr::mutate(monat_fix = lubridate::ym(monat))

Example: Fixed data?

Did we correctly parse monat?

museum_visitors |>
  relocate(monat_fix, .after = monat) |>
  glimpse()
Rows: 3,429
Columns: 10
$ monatszahl                     <chr> "Besucher*innen", "Besucher*innen", "Be…
$ auspraegung                    <chr> "Alte Pinakothek", "Alte Pinakothek", "…
$ jahr                           <dbl> 2026, 2026, 2026, 2026, 2026, 2026, 202…
$ monat                          <chr> "202601", "202602", "202603", "202604",…
$ monat_fix                      <date> 2026-01-01, 2026-02-01, 2026-03-01, 20…
$ wert                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ vorjahreswert                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ veraend_vormonat_prozent       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ veraend_vorjahresmonat_prozent <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ zwoelf_monate_mittelwert       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Example: Plot again!

No more data issues?

museum_visitors |>
    ggplot2::ggplot(ggplot2::aes(x = monat_fix, y = wert, colour = auspraegung)) +
    geom_line()

Example: Look again – overview

visdat::vis_dat(museum_visitors) + display_adjustments

Example: Look again – closely

Do you notice anything odd in monat?

unique(museum_visitors$monat)
  [1] "202601" "202602" "202603" "202604" "202605" "202606" "202607" "202608"
  [9] "202609" "202610" "202611" "202612" "202501" "202502" "202503" "202504"
 [17] "202505" "202506" "202507" "202508" "202509" "202510" "202511" "202512"
 [25] "202401" "202402" "202403" "202404" "202405" "202406" "202407" "202408"
 [33] "202409" "202410" "202411" "202412" "202301" "202302" "202303" "202304"
 [41] "202305" "202306" "202307" "202308" "202309" "202310" "202311" "202312"
 [49] "202201" "202202" "202203" "202204" "202205" "202206" "202207" "202208"
 [57] "202209" "202210" "202211" "202212" "202101" "202102" "202103" "202104"
 [65] "202105" "202106" "202107" "202108" "202109" "202110" "202111" "202112"
 [73] "202001" "202002" "202003" "202004" "202005" "202006" "202007" "202008"
 [81] "202009" "202010" "202011" "202012" "201901" "201902" "201903" "201904"
 [89] "201905" "201906" "201907" "201908" "201909" "201910" "201911" "201912"
 [97] "201801" "201802" "201803" "201804" "201805" "201806" "201807" "201808"
[105] "201809" "201810" "201811" "201812" "201701" "201702" "201703" "201704"
[113] "201705" "201706" "201707" "201708" "201709" "201710" "201711" "201712"
[121] "Summe"  "201601" "201602" "201603" "201604" "201605" "201606" "201607"
[129] "201608" "201609" "201610" "201611" "201612" "201501" "201502" "201503"
[137] "201504" "201505" "201506" "201507" "201508" "201509" "201510" "201511"
[145] "201512" "201401" "201402" "201403" "201404" "201405" "201406" "201407"
[153] "201408" "201409" "201410" "201411" "201412" "201301" "201302" "201303"
[161] "201304" "201305" "201306" "201307" "201308" "201309" "201310" "201311"
[169] "201312" "201201" "201202" "201203" "201204" "201205" "201206" "201207"
[177] "201208" "201209" "201210" "201211" "201212" "201101" "201102" "201103"
[185] "201104" "201105" "201106" "201107" "201108" "201109" "201110" "201111"
[193] "201112" "201001" "201002" "201003" "201004" "201005" "201006" "201007"
[201] "201008" "201009" "201010" "201011" "201012" "200901" "200902" "200903"
[209] "200904" "200905" "200906" "200907" "200908" "200909" "200910" "200911"
[217] "200912" "200801" "200802" "200803" "200804" "200805" "200806" "200807"
[225] "200808" "200809" "200810" "200811" "200812" "200701" "200702" "200703"
[233] "200704" "200705" "200706" "200707" "200708" "200709" "200710" "200711"
[241] "200712" "200601" "200602" "200603" "200604" "200605" "200606" "200607"
[249] "200608" "200609" "200610" "200611" "200612" "200501" "200502" "200503"
[257] "200504" "200505" "200506" "200507" "200508" "200509" "200510" "200511"
[265] "200512" "200401" "200402" "200403" "200404" "200405" "200406" "200407"
[273] "200408" "200409" "200410" "200411" "200412" "200301" "200302" "200303"
[281] "200304" "200305" "200306" "200307" "200308" "200309" "200310" "200311"
[289] "200312" "200201" "200202" "200203" "200204" "200205" "200206" "200207"
[297] "200208" "200209" "200210" "200211" "200212" "200101" "200102" "200103"
[305] "200104" "200105" "200106" "200107" "200108" "200109" "200110" "200111"
[313] "200112" "200001" "200002" "200003" "200004" "200005" "200006" "200007"
[321] "200008" "200009" "200010" "200011" "200012"

What about wert?

boxplot(museum_visitors$wert)

hist(museum_visitors$wert)

Example: ‘hidden’ annual data!

Notice that monat also contains the value Summe

annual <- museum_visitors |>
    dplyr::filter(
      monat == "Summe"
    )
annual
# A tibble: 213 × 10
   monatszahl     auspraegung      jahr monat   wert vorjahreswert
   <chr>          <chr>           <dbl> <chr>  <dbl>         <dbl>
 1 Besucher*innen Alte Pinakothek  2016 Summe 189996        252404
 2 Besucher*innen Alte Pinakothek  2015 Summe 252404        242740
 3 Besucher*innen Alte Pinakothek  2014 Summe 242740        277484
 4 Besucher*innen Alte Pinakothek  2013 Summe 277484        270534
 5 Besucher*innen Alte Pinakothek  2012 Summe 270534        359183
 6 Besucher*innen Alte Pinakothek  2011 Summe 359183        274524
 7 Besucher*innen Alte Pinakothek  2010 Summe 274524        232143
 8 Besucher*innen Alte Pinakothek  2009 Summe 232143        225231
 9 Besucher*innen Alte Pinakothek  2008 Summe 225231        218386
10 Besucher*innen Alte Pinakothek  2007 Summe 218386        272646
# ℹ 203 more rows
# ℹ 4 more variables: veraend_vormonat_prozent <dbl>,
#   veraend_vorjahresmonat_prozent <dbl>, zwoelf_monate_mittelwert <dbl>,
#   monat_fix <date>

Example: Annual and Monthly rows?

  • What issues might arise if we didn’t know about the annual rows?
  • How does this change your interpretation of the boxplot outliers?

Summary

IDA: Key ideas

  • IDA is a systematic pre-analysis step — it checks data integrity without touching the research questions
  • IDA issues occur at univariate, multivariate, and dataset levels — check all three
  • Tidying (reshape/rename) ≠ cleaning (fix values/types) ≠ screening (characterise) — they overlap but have different goals
  • All preparation decisions must be documented and justified in a reproducible script

Screening: Data types & formats

  • Declare column types explicitly on import — avoid relying on guessing
  • Use lubridate for dates; other specialist packages for geospatial (sf), time-of-day (hms), factors (forcats) etc.
  • Check every column: does the R type match the real-world meaning of the variable?
  • Use glimpse() for the whole table; str() for a single column

Cleaning: tools & fixes

  • Plot first — histograms, boxplots, and line plots reveal issues that summary() hides
  • Hypothesise before fixing — ask why a value looks wrong before deciding how to handle it
  • Variable-level: recode types (as.*()), flag/replace out-of-range values, standardise encodings
  • Structural: remove duplicates (distinct()), reshape (pivot_*()), split/merge columns (separate(), unite())
  • Not handling missingness is itself a decision — document it either way

Missingness

  • Explicit missingness (NA) — visible; find with is.na(), summary(), naniar::vis_miss()
  • Implicit missingness — absent rows; requires domain knowledge about expected structure
  • Use tidyr::complete() or tsibble::fill_gaps() to make implicit missingness explicit
  • The missing data mechanism (MCAR / MAR / MNAR) determines whether dropping or imputing is valid

For your group project

  • Structure your data folder: raw/ (read-only) → interim/ (.rds snapshots) → final/
  • Start every dataset with glimpse(), summary(), and a plot — before writing any analysis code
  • Use tidyr::complete() to surface implicit missingness in time-series data
  • Document why you made each data decision, not just what you did
  • IDA findings may require revising your CDA plan — that is expected and normal

IDA in the oral exam

Aim to understand and explain:

  • where your data comes from,
  • what it tried to measure, what it actually measures
  • if your questions and data match
  • screening, tidying and cleaning issues and fixes
  • missing data handling decisions

References

Chatfield, C. 1985. “The Initial Examination of Data.” Journal of the Royal Statistical Society. Series A (General) 148 (3): 214. https://doi.org/10.2307/2981969.
Huebner, Marianne, Saskia Le Cessie, Carsten Oliver Schmidt, and Lara Lusa. 2020. “Hidden Analyses: A Review of Reporting Practice and Recommendations for More Transparent Reporting of Initial Data Analyses.” BMC Medical Research Methodology 20 (1): 61. https://doi.org/10.1186/s12874-020-00942-y.
Huebner, Marianne, Saskia Le Cessie, Carsten O. Schmidt, and Werner Vach. 2018. “A Contemporary Conceptual Framework for Initial Data Analysis.” Observational Studies 4 (1): 171–92. https://doi.org/10.1353/obs.2018.0014.