Practical #7
Advanced Statistical Programming using R — Initial Data Analysis
Quiz
Before starting, work through this QUIZ to check your understanding of the concepts covered in this week’s lecture on initial data analysis and data cleaning.
Overview
In this practical you will apply the IDA steps from the lecture to your own group project dataset. By the end of the session you should have a working draft of the Initial data analysis section of your project proposal, which is due Mon 8 Jun.
For full proposal requirements, see Group project guidelines.
Work in your project group. At least one person should have the dataset loaded and a proposal.qmd open in your project repository.
The practical is structured into seven parts:
- Warm-up on untidy data: practise cleaning on a small messy dataset
- Proposal framing: define dataset scope, research questions, and audience
- Data screening: verify import and data types
- Data cleaning: check plausibility, consistency, and missings
- EDA plots: produce proposal-ready figures
- IDA write-up: draft the IDA section in proposal.qmd
- Analysis plan & workflow: complete the remaining proposal sections
Part 0: Warm-up on untidy data
We will use a publicly available Munich open dataset on monthly traffic accidents as the warm-up. Download it from opendata.muenchen.de.
Exercise 0.1: Load and inspect the raw file
library(readr)
library(dplyr)
library(lubridate)
accidents <- read_csv("data.txt")
glimpse(accidents)
Look at the MONAT column carefully. What values does it contain?
MONAT is read as <chr> (character). It contains "YYYYMM" month codes such as "202501" alongside the string "Summe" for annual total rows. The dataset has 2450 rows in total.
Exercise 0.2: Identify the data quality issues
Before cleaning, note down what you observe:
- What type is MONAT read as? What should it be?
- What unexpected values appear in MONAT alongside the month codes?
- Which columns are mostly NA? Why might that be?
- What does one row represent, and is that consistent throughout the file?
- MONAT is read as <chr>; it should be <date> for monthly observations.
- "Summe" rows appear mixed in with monthly codes. These are annual totals, not monthly observations, so they need to be removed before any time-series work.
- WERT has 84 missing values, all from 2026. These are expected: the current year’s counts have not been published yet.
- One row represents one accident category (MONATSZAHL), one sub-category (AUSPRAEGUNG), in one calendar month. The "Summe" rows break this: they represent annual totals for a category rather than a monthly count.
- There are three accident categories: Alkoholunfälle, Fluchtunfälle, Verkehrsunfälle.
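These observations can also be confirmed in code rather than by eye. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative and not taken from the Munich file); only the column names MONAT and WERT come from the practical.

```r
library(dplyr)

# Toy stand-in for the accidents data; all values are made up for illustration
toy <- data.frame(
  MONAT = c("202501", "202502", "Summe", "202601"),
  WERT  = c(12, 15, 27, NA),
  stringsAsFactors = FALSE
)

# Values of MONAT that do not look like a six-digit "YYYYMM" code
unexpected <- toy |>
  filter(!grepl("^[0-9]{6}$", MONAT)) |>
  count(MONAT)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

unexpected
na_counts
```

Running the same two checks on `accidents` should surface the "Summe" rows and the missing WERT values described above.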
Exercise 0.3: Clean the date column with lubridate
The MONAT column uses a "YYYYMM" format (e.g. "202503"), but also contains the string "Summe" for annual summary rows. Clean it by:
- Filtering out "Summe" rows (annual totals, not monthly observations)
- Parsing the remaining values into a proper date with lubridate::ym()
accidents_clean <- accidents |>
  filter(MONAT != "Summe") |>
  mutate(date = ym(MONAT))
glimpse(accidents_clean)
After filtering, 182 "Summe" rows are removed, leaving 2268 monthly rows. lubridate::ym() parses "202503" as 2025-03-01 (first day of the month), giving a <date> column that can be used directly in time-series plots.
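The order of the two steps matters: if "Summe" is still present when ym() runs, that entry cannot be parsed and silently becomes NA. A small sketch on a made-up vector (the values are illustrative, in the same layout as MONAT) shows the difference.

```r
library(lubridate)

# Made-up MONAT values in the same "YYYYMM" / "Summe" layout as the file
monat <- c("202501", "202502", "Summe")

# Without filtering, the "Summe" entry cannot be parsed and becomes NA
# (lubridate also emits a "failed to parse" warning, silenced here)
parsed_all <- suppressWarnings(ym(monat))

# With filtering first, every remaining value parses to a Date
parsed_clean <- ym(monat[monat != "Summe"])

sum(is.na(parsed_all))
class(parsed_clean)
```

A quick `anyNA(accidents_clean$date)` after cleaning is a cheap way to confirm that no value failed to parse.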
Exercise 0.4: Quick quality check
Run glimpse() and check how many rows remain and whether date now has type <date>. Are the NA values in WERT for recent months expected or surprising?
glimpse(accidents_clean)
date should now appear as <date>. The 84 NA values in WERT are all from 2026: the current year’s counts have not yet been published in the open data portal, so this is expected, not a data error.
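To check where the missing WERT values fall, one option is to count them per calendar year. This is a sketch on made-up rows shaped like accidents_clean (`toy_clean` and its values are illustrative); the same pipeline should work on the real data as long as it has `date` and `WERT` columns.

```r
library(dplyr)
library(lubridate)

# Made-up rows in the shape of accidents_clean; WERT values are illustrative
toy_clean <- data.frame(
  date = as.Date(c("2025-11-01", "2025-12-01", "2026-01-01", "2026-02-01")),
  WERT = c(30, 28, NA, NA)
)

# Count missing WERT values per calendar year
na_by_year <- toy_clean |>
  mutate(year = year(date)) |>
  group_by(year) |>
  summarise(n_missing = sum(is.na(WERT)), .groups = "drop")

na_by_year
```

If all missing values sit in the most recent year, the "not yet published" explanation is plausible; missing values scattered across earlier years would need a closer look.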
Part 1: Proposal framing
The proposal does not need to be polished — it is a working document that helps you and your group get started, and gives us a chance to give early feedback. We will not grade it; just make sure it is complete enough to be useful.
Open proposal.qmd in your group project repository and work through the two exercises below.
Exercise 1.1: Dataset and research questions
Source and licence: Where did the data come from? What licence does it use, and what does that licence allow you to do with it?
What is in it: How many rows and columns does the dataset have? What does each row represent? What are the key variables you expect to use?
Research questions: State 1–3 questions you want to explore. For each question, say what kind of claim you are aiming to make:
- Descriptive — What does the data show? How has X changed over time?
- Exploratory — Are there patterns or relationships worth investigating further?
- Explanatory/causal — Does X seem to influence Y? Note any limitations in making causal claims from observational data.
- Predictive — Can we build a simple model to predict Y?
Exercise 1.2: Context and target audience
Who is this analysis for, and in what setting? State the context (e.g. business, academic, journalistic), describe the target audience, and note which presentation formats would be appropriate for that audience.
Part 2: Data screening
Exercise 2.1: Load and inspect your dataset
Load your dataset and check how R has interpreted it.
library(readr) # or readxl for Excel files
library(dplyr)
library(visdat)
# Load your data
dat <- read_csv("data/raw/your-data.csv")
# Check structure
glimpse(dat)
# Visualise column types
vis_dat(dat)
Exercise 2.2: Check whether types were interpreted correctly
Questions to answer:
- Are dates read as <date> or as character?
- Are categorical variables read as <chr>, <fct>, or <dbl>?
- Are there any columns where the type is wrong or unexpected?
Exercise 2.3: Fix import types explicitly (if needed)
If you find type problems, fix them with explicit col_types:
dat <- read_csv("data/raw/your-data.csv",
  col_types = cols(
    date_col = col_date(format = "%Y-%m-%d"),
    group_col = col_factor(),
    value_col = col_double()
  ))
Document what you changed and why in a comment above the import call.
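Declaring types explicitly has a side effect worth checking: values that do not match the declared type are read as NA. The sketch below uses inline CSV text in place of a real file (the column names and the value "oops" are made up; passing literal data via I() assumes readr 2.0 or later) and then asks readr which cells failed to parse.

```r
library(readr)

# Inline CSV text standing in for a real file; I() marks it as literal data.
# Column names and values are hypothetical.
csv_text <- I("date_col,group_col,value_col\n2025-01-01,a,1.5\n2025-02-01,b,oops\n")

dat_toy <- read_csv(
  csv_text,
  col_types = cols(
    date_col  = col_date(format = "%Y-%m-%d"),
    group_col = col_factor(),
    value_col = col_double()
  )
)

# Rows and columns where a value did not match the declared type
problems(dat_toy)

# The unparseable value is stored as NA
dat_toy$value_col
```

Calling problems() right after an import with explicit col_types helps separate genuine missing values from values lost to a type mismatch.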
Part 3: Data cleaning
Exercise 3.1: Run summary checks
Use summary statistics and domain knowledge to check for quality issues.
library(skimr)
skim(dat)
Exercise 3.2: Run domain-informed quality checks
Work through these checks for your key variables:
- Ranges: Are numeric values within plausible bounds? (e.g. percentages between 0–100, counts non-negative)
- Counts: If a variable should be integer-valued, are there decimal values?
- Categories: Are the levels of categorical variables what you expect? Any typos or inconsistent capitalisation?
- Missing values: Which variables have missing values? How many?
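The range, count, and category checks above can be written as explicit rules. This is a minimal sketch on made-up data: the column names `pct`, `n_events`, and `region` are hypothetical and should be swapped for your own key variables and bounds.

```r
library(dplyr)

# Made-up data; column names (pct, n_events, region) are hypothetical
dat_toy <- data.frame(
  pct      = c(12.5, 101, 40),
  n_events = c(3, 2.5, -1),
  region   = c("North", "north", "South"),
  stringsAsFactors = FALSE
)

# One flag per rule; TRUE marks a row that needs a closer look
checks <- dat_toy |>
  mutate(
    pct_out_of_range = pct < 0 | pct > 100,
    count_not_whole  = n_events != round(n_events),
    count_negative   = n_events < 0
  )

# Rows violating at least one rule
flagged <- checks |>
  filter(pct_out_of_range | count_not_whole | count_negative)

# Category labels that differ only by capitalisation
case_clash <- dat_toy |>
  distinct(region) |>
  count(region_lower = tolower(region)) |>
  filter(n > 1)

flagged
case_clash
```

Keeping each rule as its own logical column makes it easy to report which check a row failed, which feeds directly into the list of quality issues for the proposal.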
Exercise 3.3: Inspect missingness patterns
library(naniar)
# Summary of missings by variable
miss_var_summary(dat)
# Visualise missing pattern
vis_miss(dat)
Exercise 3.4: Record quality issues
Write a short bullet list of data quality issues you found.
Part 4: EDA plots
Exercise 4.1: Plot the outcome distribution
Produce at least two plots. These go directly into your proposal.
Plot 1 — Distribution of the outcome variable:
library(ggplot2)
ggplot(dat, aes(x = outcome_variable)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of [variable name]",
       x = "[variable label]", y = "Count")
Exercise 4.2: Plot a relationship or trend
Plot 2 — Relationship or trend relevant to a research question:
ggplot(dat, aes(x = x_variable, y = outcome_variable, colour = group_variable)) +
  geom_point(alpha = 0.6) +
  labs(title = "[Relationship description]")
Exercise 4.3: Highlight missingness in scatterplots (optional)
If your data has missing values, use geom_miss_point() instead of geom_point() to show where they fall:
ggplot(dat, aes(x = x_variable, y = outcome_variable)) +
  naniar::geom_miss_point()
Part 5: IDA write-up in proposal.qmd
Exercise 5.1: Draft your IDA section
Open your proposal.qmd and fill in the Initial data analysis section. Paste or include:
- The glimpse() output (use a code chunk with #| eval: false if it’s long)
- Summary statistics for key variables
- Your two plots with captions
- A bullet list of data quality issues and how you plan to address them
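For the plots with captions, one common approach in Quarto is to set chunk options as `#|` comment lines at the top of the R chunk. The sketch below shows only the body of such a chunk; the label, the caption text, and the data are made up, and `outcome_variable` is the same placeholder name used in Part 4.

```r
#| label: fig-outcome-dist
#| fig-cap: "Distribution of the outcome variable (made-up data)."

library(ggplot2)

# Made-up data standing in for `dat`; outcome_variable is a placeholder name
dat_toy <- data.frame(outcome_variable = c(2, 3, 3, 4, 5, 5, 5, 7, 8, 12))

# Build the histogram; the caption comes from fig-cap, so no title is set here
p <- ggplot(dat_toy, aes(x = outcome_variable)) +
  geom_histogram(bins = 5) +
  labs(x = "Outcome", y = "Count")

p
```

Because the `#|` lines are ordinary R comments, the same code also runs unchanged in the console, which is convenient while drafting the figure.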
Part 6: Analysis plan & workflow organisation
If you have time in the practical, make a start on these sections. If not, complete them before Mon 8 Jun — there is no class next week.
Exercise 6.1: Analysis plan
In proposal.qmd, add an Analysis plan section. For each research question, outline:
- Which variables will you use?
- What analytical approach or method seems appropriate, and why?
- What cleaning or transformation steps do you anticipate needing?
- What are the main uncertainties or risks (e.g. not enough data, confounding variables, missing values)?
You do not need to have the answers yet — the goal is to show that you have thought carefully about what is feasible with your data.
Exercise 6.2: Workflow organisation
In proposal.qmd, add a Workflow organisation section covering:
- Code style: Which style guide will you follow (e.g. tidyverse style)? How will you enforce it: formatting with styler, linting with lintr, a shared AI system prompt, or a documented convention?
- Packages: Which packages will you rely on? Will you use renv to lock package versions so the project is reproducible across machines?
- Git workflow: How will you divide work? Describe your branching strategy: will each person work on their own branch, or will you split by task?
- .gitignore: What will you exclude from version control (e.g. raw data files, rendered outputs, .env secrets)?
Wrap-up
There is no class next week. Use that time to complete and refine your project proposal as a group. The full proposal requires four sections — make sure all are in proposal.qmd before the Mon 8 Jun deadline:
- Dataset & research questions: source, licence, row description, research questions with claim types, context and audience
- Initial data analysis: glimpse() output, summary statistics, two plots, data quality notes
- Analysis plan: variables, approach, anticipated cleaning, uncertainties per research question
- Workflow organisation: code style, packages, git branching, .gitignore
For the full requirements of each section, see the Group project guidelines.
On Moodle, submit a link to the proposal.qmd in your group repository.
End-of-practical checklist
Before you leave today, make sure your group has:
- committed a reproducible import script with any explicit col_types
- documented source, licence, research questions, and audience in proposal.qmd
- saved two IDA plots and inserted them (with captions) into proposal.qmd
proposal.qmd - written 3–5 bullets on data quality issues and planned handling
- made a start on the analysis plan and workflow organisation sections
- assigned owners for any remaining sections before Mon 8 Jun