Practical #7

Advanced Statistical Programming using R — Initial Data Analysis

Author

Leonhard Kestel, Lisa Bondo Andersen, Cynthia Huang

Published

May 28, 2026

Quiz

Before starting, work through this QUIZ to check your understanding of the concepts covered in this week’s lecture on initial data analysis and data cleaning.


Overview

In this practical you will apply the IDA steps from the lecture to your own group project dataset. By the end of the session you should have a working draft of the Initial data analysis section of your project proposal, which is due Mon 8 Jun.

For full proposal requirements, see Group project guidelines.

Work in your project group. At least one person should have the dataset loaded and a proposal.qmd open in your project repository.

The practical is structured into seven parts:

  0. Warm-up on untidy data: practise cleaning on a small messy dataset
  1. Proposal framing: define dataset scope, research questions, and audience
  2. Data screening: verify import and data types
  3. Data cleaning: check plausibility, consistency, and missing values
  4. EDA plots: produce proposal-ready figures
  5. IDA write-up: draft the IDA section in proposal.qmd
  6. Analysis plan & workflow: complete the remaining proposal sections

Part 0: Warm-up on untidy data

We will use a publicly available Munich open dataset on monthly traffic accidents as the warm-up. Download it from opendata.muenchen.de.

Exercise 0.1: Load and inspect the raw file

library(readr)
library(dplyr)
library(lubridate)

accidents <- read_csv("data.txt")
glimpse(accidents)

Look at the MONAT column carefully. What values does it contain?

Solution

MONAT is read as <chr> (character). It contains "YYYYMM" month codes such as "202501" alongside the string "Summe" for annual total rows. The dataset has 2450 rows in total.

Exercise 0.2: Identify the data quality issues

Before cleaning, note down what you observe:

  • What type is MONAT read as? What should it be?
  • What unexpected values appear in MONAT alongside the month codes?
  • Which columns contain NA values? Why might that be?
  • What does one row represent — is that consistent throughout the file?

Solution

  • MONAT is read as <chr> — it should be <date> for monthly observations.
  • "Summe" rows appear mixed in with monthly codes. These are annual totals, not monthly observations — they need to be removed before any time-series work.
  • WERT has 84 missing values, all from 2026. These are expected: the current year’s counts have not been published yet.
  • One row represents one accident category (MONATSZAHL), one sub-category (AUSPRAEGUNG), in one calendar month. The "Summe" rows break this — they represent annual totals for a category rather than a monthly count.
  • There are three accident categories: Alkoholunfälle, Fluchtunfälle, Verkehrsunfälle.
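
The observations above can also be checked in code. The following is a minimal sketch on a small invented data frame (accidents_toy and its values are assumptions for illustration, not rows from the Munich file); the same calls should work on accidents after the import in Exercise 0.1.

```r
library(dplyr)

# Toy stand-in for the imported data; all values are invented.
accidents_toy <- data.frame(
  MONATSZAHL  = c("Alkoholunfälle", "Alkoholunfälle", "Alkoholunfälle", "Fluchtunfälle"),
  AUSPRAEGUNG = c("insgesamt", "insgesamt", "insgesamt", "insgesamt"),
  MONAT       = c("202501", "202502", "Summe", "202601"),
  WERT        = c(20, 18, 38, NA)
)

# MONAT values that are not six-digit month codes
non_month <- accidents_toy |>
  filter(!grepl("^[0-9]{6}$", MONAT)) |>
  count(MONAT)

# Number of missing values in each column
na_counts <- accidents_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

non_month
na_counts
```

Counting the non-matching values first shows exactly which strings need to be filtered out before any date parsing is attempted.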

Exercise 0.3: Clean the date column with lubridate

The MONAT column uses a "YYYYMM" format (e.g. "202503"), but also contains the string "Summe" for annual summary rows. Clean it by:

  1. Filtering out "Summe" rows (annual totals — not monthly observations)
  2. Parsing the remaining values into a proper date with lubridate::ym()

accidents_clean <- accidents |>
  filter(MONAT != "Summe") |>
  mutate(date = ym(MONAT))

glimpse(accidents_clean)

Solution

After filtering, 182 "Summe" rows are removed, leaving 2268 monthly rows. lubridate::ym() parses "202503" as 2025-03-01 (first day of the month), giving a <date> column that can be used directly in time-series plots.
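
To see why the filter has to come before the parsing step, here is a small sketch on a hand-written vector (month_codes is an invented example, not taken from the file). It assumes a lubridate version in which ym() accepts the quiet argument.

```r
library(lubridate)

# Example values in the format described above
month_codes <- c("202503", "202512", "Summe")

# Parse only the genuine month codes
parsed <- ym(month_codes[month_codes != "Summe"])
parsed
class(parsed)

# Without the filter, "Summe" cannot be parsed and becomes NA
# (quiet = TRUE only suppresses the parsing warning)
unparsed <- ym("Summe", quiet = TRUE)
is.na(unparsed)
```

If the "Summe" rows were left in, they would end up as rows with a missing date rather than raising an error, which is easy to overlook in later plots.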

Exercise 0.4: Quick quality check

Run glimpse() and check how many rows remain and whether date now has type <date>. Are the NA values in WERT for recent months expected or surprising?

Solution

glimpse(accidents_clean)

date should now appear as <date>. The 84 NA values in WERT are all from 2026 — the current year’s counts have not yet been published in the open data portal, so this is expected, not a data error.
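
One way to confirm that the missing values are confined to the most recent year is to count them per year. This is a sketch on invented values (accidents_clean_toy is a hypothetical stand-in); on the real data, replace it with accidents_clean.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for accidents_clean; values are invented.
accidents_clean_toy <- data.frame(
  date = as.Date(c("2025-11-01", "2025-12-01", "2026-01-01", "2026-02-01")),
  WERT = c(30, 25, NA, NA)
)

# Rows and missing WERT values per calendar year
na_by_year <- accidents_clean_toy |>
  mutate(year = year(date)) |>
  group_by(year) |>
  summarise(n_rows = n(), n_missing = sum(is.na(WERT)))

na_by_year
```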


Part 1: Proposal framing

The proposal does not need to be polished — it is a working document that helps you and your group get started, and gives us a chance to give early feedback. We will not grade it; just make sure it is complete enough to be useful.

Open proposal.qmd in your group project repository and work through the two exercises below.

Exercise 1.1: Dataset and research questions

Source and licence: Where did the data come from? What licence does it use, and what does that licence allow you to do with it?

What is in it: How many rows and columns does the dataset have? What does each row represent? What are the key variables you expect to use?

Research questions: State 1–3 questions you want to explore. For each question, say what kind of claim you are aiming to make:

  • Descriptive — What does the data show? How has X changed over time?
  • Exploratory — Are there patterns or relationships worth investigating further?
  • Explanatory/causal — Does X seem to influence Y? Note any limitations in making causal claims from observational data.
  • Predictive — Can we build a simple model to predict Y?

Exercise 1.2: Context and target audience

Who is this analysis for, and in what setting? State the context (e.g. business, academic, journalistic), describe the target audience, and note which presentation formats would be appropriate for that audience.


Part 2: Data screening

Exercise 2.1: Load and inspect your dataset

Load your dataset and check how R has interpreted it.

library(readr)   # or readxl for Excel files
library(dplyr)
library(visdat)

# Load your data
dat <- read_csv("data/raw/your-data.csv")

# Check structure
glimpse(dat)

# Visualise column types
vis_dat(dat)

Exercise 2.2: Check whether types were interpreted correctly

Questions to answer:

  • Are dates read as <date> or as character?
  • Are categorical variables read as <chr>, <fct>, or <dbl>?
  • Are there any columns where the type is wrong or unexpected?
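
These questions can be answered programmatically as well as by reading the glimpse() output. A minimal base R sketch, using an invented data frame (dat_toy and its column names are placeholders):

```r
# Toy data in which every column was imported as character
dat_toy <- data.frame(
  date_col  = c("2025-01-15", "2025-02-15"),
  group_col = c("a", "b"),
  value_col = c("1.5", "n/a"),
  stringsAsFactors = FALSE
)

# First class of every column, as a named character vector
col_classes <- vapply(dat_toy, function(x) class(x)[1], character(1))
col_classes

# Values that prevent value_col from being read as numeric
as_num <- suppressWarnings(as.numeric(dat_toy$value_col))
bad_values <- dat_toy$value_col[is.na(as_num) & !is.na(dat_toy$value_col)]
bad_values
```

Listing the offending values often reveals the cause of a wrong type, such as a text code used for missing data.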

Exercise 2.3: Fix import types explicitly (if needed)

If you find type problems, fix them with explicit col_types:

dat <- read_csv("data/raw/your-data.csv",
  col_types = cols(
    date_col  = col_date(format = "%Y-%m-%d"),
    group_col = col_factor(),
    value_col = col_double()
  ))

Document what you changed and why in a comment above the import call.
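
As an illustration of such a documented import, the sketch below reads a small inline CSV instead of a file. The data, the column names, and the DD.MM.YYYY date format are invented for the example, and passing literal text via I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical raw file contents with dates in DD.MM.YYYY format
csv_text <- "date_col,group_col,value_col\n15.01.2025,a,1.5\n16.01.2025,b,2.0\n"

# Changed: date_col was guessed as <chr> because of the DD.MM.YYYY format,
# so the format is given explicitly; group_col is a category, not free text.
dat_toy <- read_csv(
  I(csv_text),
  col_types = cols(
    date_col  = col_date(format = "%d.%m.%Y"),
    group_col = col_factor(),
    value_col = col_double()
  )
)

# Rows that did not match the declared types (empty if everything parsed)
import_problems <- problems(dat_toy)
import_problems
```

Checking problems() after an explicit import is a quick way to catch values that were silently turned into NA.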


Part 3: Data cleaning

Exercise 3.1: Run summary checks

Use summary statistics and domain knowledge to check for quality issues.

library(skimr)

skim(dat)

Exercise 3.2: Run domain-informed quality checks

Work through these checks for your key variables:

  • Ranges: Are numeric values within plausible bounds? (e.g. percentages between 0–100, counts non-negative)
  • Counts: If a variable should be integer-valued, are there decimal values?
  • Categories: Are the levels of categorical variables what you expect? Any typos or inconsistent capitalisation?
  • Missing values: Which variables have missing values? How many?
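
The checks above could be sketched as follows. The data frame dat_toy and its columns (pct, n_events, region) are invented for illustration; substitute your own variables and plausible bounds.

```r
library(dplyr)

# Toy data with one problem of each kind
dat_toy <- data.frame(
  pct      = c(12.5, 101, 40, NA),
  n_events = c(3, 2.5, 0, 7),
  region   = c("North", "north", "South", "South ")
)

# Ranges: percentages outside 0 to 100 (rows with NA are not returned)
out_of_range <- dat_toy |> filter(pct < 0 | pct > 100)

# Counts: values that should be whole numbers but are not
non_integer <- dat_toy |> filter(n_events != round(n_events))

# Categories: compare raw levels with levels after basic normalisation
n_levels_raw  <- n_distinct(dat_toy$region)
n_levels_norm <- n_distinct(tolower(trimws(dat_toy$region)))

# Missing values per variable
n_missing <- colSums(is.na(dat_toy))
```

A drop from n_levels_raw to n_levels_norm points to categories that differ only by capitalisation or stray whitespace.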

Exercise 3.3: Inspect missingness patterns

library(naniar)

# Summary of missings by variable
miss_var_summary(dat)

# Visualise missing pattern
vis_miss(dat)

Exercise 3.4: Record quality issues

Write a short bullet list of data quality issues you found.


Part 4: EDA plots

Exercise 4.1: Plot the outcome distribution

Produce at least two plots. These go directly into your proposal.

Plot 1 — Distribution of the outcome variable:

library(ggplot2)

ggplot(dat, aes(x = outcome_variable)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of [variable name]",
       x = "[variable label]", y = "Count")

Exercise 4.2: Plot a relationship or trend

Plot 2 — Relationship or trend relevant to a research question:

ggplot(dat, aes(x = x_variable, y = outcome_variable, colour = group_variable)) +
  geom_point(alpha = 0.6) +
  labs(title = "[Relationship description]")
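
If your research question concerns change over time, a line plot may suit it better than a scatterplot. A sketch on invented data (trend_toy and its columns are placeholders, and the labels are left as templates in the same way as above):

```r
library(ggplot2)

# Toy monthly series for two groups; values are invented.
trend_toy <- data.frame(
  date  = rep(as.Date(c("2025-01-01", "2025-02-01", "2025-03-01")), times = 2),
  group = rep(c("A", "B"), each = 3),
  value = c(10, 12, 9, 20, 18, 21)
)

# One line per group, with month-year labels on the date axis
p_trend <- ggplot(trend_toy, aes(x = date, y = value, colour = group)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  labs(title = "[Trend description]", x = "Month", y = "[variable label]")

p_trend
```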

Exercise 4.3: Highlight missingness in scatterplots (optional)

If your data has missing values, use geom_miss_point() instead of geom_point() to show where they fall:

ggplot(dat, aes(x = x_variable, y = outcome_variable)) +
  naniar::geom_miss_point()

Part 5: IDA write-up in proposal.qmd

Exercise 5.1: Draft your IDA section

Open your proposal.qmd and fill in the Initial data analysis section. Paste or include:

  1. The glimpse() output (use a code chunk with #| eval: false if it’s long)
  2. Summary statistics for key variables
  3. Your two plots with captions
  4. A bullet list of data quality issues and how you plan to address them
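
For item 2, one possible layout is a small summary table with a caption. The sketch below uses invented data (dat_toy and outcome_variable are placeholders) and assumes knitr is available. In proposal.qmd the code would sit inside an executable R chunk, where the two #| lines act as Quarto chunk options; run as a plain script they are ordinary comments.

```r
#| label: tbl-summary
#| tbl-cap: "Summary statistics for the outcome variable (toy data)."

library(dplyr)

# Toy outcome with one missing value
dat_toy <- data.frame(outcome_variable = c(2, 4, 6, 8, NA))

summary_tbl <- dat_toy |>
  summarise(
    n_rows     = n(),
    n_missing  = sum(is.na(outcome_variable)),
    mean_value = mean(outcome_variable, na.rm = TRUE),
    sd_value   = sd(outcome_variable, na.rm = TRUE),
    min_value  = min(outcome_variable, na.rm = TRUE),
    max_value  = max(outcome_variable, na.rm = TRUE)
  )

# Render as a formatted table in the document
knitr::kable(summary_tbl, digits = 2)
```

Reporting the number of missing values next to the mean makes it clear how many observations each statistic is based on.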

Part 6: Analysis plan & workflow organisation

If you have time in the practical, make a start on these sections. If not, complete them before Mon 8 Jun — there is no class next week.

Exercise 6.1: Analysis plan

In proposal.qmd, add an Analysis plan section. For each research question, outline:

  • Which variables will you use?
  • What analytical approach or method seems appropriate, and why?
  • What cleaning or transformation steps do you anticipate needing?
  • What are the main uncertainties or risks (e.g. not enough data, confounding variables, missing values)?

You do not need to have the answers yet — the goal is to show that you have thought carefully about what is feasible with your data.

Exercise 6.2: Workflow organisation

In proposal.qmd, add a Workflow organisation section covering:

  • Code style: Which style guide will you follow (e.g. tidyverse style)? How will you enforce it: formatting with styler, linting with lintr, a shared AI system prompt, or a documented convention?
  • Packages: Which packages will you rely on? Will you use renv to lock package versions so the project is reproducible across machines?
  • Git workflow: How will you divide work? Describe your branching strategy — will each person work on their own branch, or will you split by task?
  • .gitignore: What will you exclude from version control (e.g. raw data files, rendered outputs, .env secrets)?
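
Of the enforcement options for code style, automatic formatting is the simplest to try out. A minimal sketch, assuming the styler package is installed and its default tidyverse style is used (the string messy is an invented example):

```r
library(styler)

# A line that ignores the tidyverse conventions for assignment and spacing
messy <- "x=c(1,2,3)"

# style_text() returns the reformatted code without touching any files
tidy <- as.character(style_text(messy))
tidy
```

For whole scripts or project folders, styler also provides style_file() and style_dir(), which rewrite files in place, so it is worth committing your work before running them.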

Wrap-up

There is no class next week. Use that time to complete and refine your project proposal as a group. The full proposal requires four sections — make sure all are in proposal.qmd before the Mon 8 Jun deadline:

  1. Dataset & research questions — source, licence, row description, research questions with claim types, context and audience
  2. Initial data analysis — glimpse() output, summary statistics, two plots, data quality notes
  3. Analysis plan — variables, approach, anticipated cleaning, uncertainties per research question
  4. Workflow organisation — code style, packages, git branching, .gitignore

For the full requirements of each section, see the Group project guidelines.

Submit a link to the proposal.qmd in your group repository via Moodle.

End-of-practical checklist

Before you leave today, make sure your group has:

  • committed a reproducible import script with any explicit col_types
  • documented source, licence, research questions, and audience in proposal.qmd
  • saved two IDA plots and inserted them (with captions) into proposal.qmd
  • written 3–5 bullets on data quality issues and planned handling
  • made a start on the analysis plan and workflow organisation sections
  • assigned owners for any remaining sections before Mon 8 Jun