StatProg2

Group Project

Project purpose

The purpose of the second part of this course is to give you hands-on experience learning and applying statistical programming ideas and tools to real-world data. Each week we will cover a different theme of tasks within typical and fairly simple data-driven workflows. These themes are based on the R4DS workflow, but are not strictly linear stages or iterations, and will not cover every possible data science or statistical task in every possible context.

The aim each week is to practise and develop your ability to plan and execute tasks within each theme, as well as to evaluate and integrate results, artefacts and insights across different themes and collaborators. By working in groups you will have the opportunity to coordinate with and learn from others to craft a meaningful data-driven report or story based on your chosen dataset(s).

For each theme, the minimum recommendation is to:

  • attend the lecture
  • complete the practical exercise
  • add at least one section or additional page to the project repository

This project is adapted from the ModernDive Term Project.

Relation to exam

The project is optional but strongly recommended. Students who complete the project will be examined on their own work — you will be asked to explain decisions, defend choices, and reason about alternatives. Students who do not submit a project will need to reason about an unfamiliar dataset and analysis context on the spot during the exam.

A complete project repository is also a useful portfolio piece for demonstrating collaborative data analysis skills.

Submission deadline: 2026-07-22

Late submissions will not be accepted for exam preparation purposes.

Milestones

| Milestone | Due | Description |
|---|---|---|
| Group registration | Fri 22 May | Form groups of 3–4; choose a dataset |
| Project proposal | Mon 8 Jun | Initial EDA, research questions, analysis plan |
| Updated proposal (new) | Mon 29 Jun | Analysis complete; pivot to communication, visualisation & presentation |
| Final submission | Wed 22 Jul | Complete analysis and presentation in project repository |
| Group reflection | W14 practical (17 Jul); written submission due with final submission | In-class discussion + written group-reflection.qmd submitted with final repo |
| Individual contributions | Thu 23 Jul via Moodle | Rendered PDF of individual contribution statements |

Group Formation

You may work in groups of up to 4 people. We recommend groups of 3–4 that mix computer science and other majors. This will give you diverse perspectives and let you practise teaching and learning from each other.

Project Proposal

Due: Mon 8 Jun. Submit a link to your Quarto document (within your project repo) on Moodle.

The proposal does not need to be polished — it is a working document that helps you and your group get started, and gives us a chance to give early feedback.

Dataset & research questions

  • Source and licence: Where did the data come from? What licence does it use?
  • What is in it: How many rows and columns? What does each row represent? What are the key variables?
  • Research questions: State 1–3 questions you want to explore with this dataset. For each question, say what kind of claim you are aiming to make:
    • Descriptive: What does the data show? How has X changed over time?
    • Exploratory: Are there patterns or relationships worth investigating further?
    • Explanatory/causal: Does X seem to influence Y? (Note any limitations in making causal claims from observational data.)
    • Predictive: Can we build a simple model to predict Y?
  • Context and target audience: In what context and for which target audience are you exploring your research questions (business, academic, journalistic)? Which presentation formats suit this audience?

Initial data analysis

Run an initial exploration of the dataset and include in your proposal:

  • glimpse() or equivalent output showing structure
  • Summary statistics for key variables (missing values, distributions, ranges)
  • At least two plots that help you understand the data — one showing distributions, one showing a relationship or trend relevant to your research questions
  • Notes on data quality issues you have already noticed (missing values, inconsistent formatting, unexpected values)
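
The checklist above could be covered with a few lines of base R. The sketch below uses the built-in `airquality` data frame purely as a stand-in for your own data (an assumption for illustration; it is not meant as a project dataset), and `str()` as a base R equivalent of `glimpse()`. The variable choices are likewise only examples.

```r
# Stand-in data: the built-in airquality data frame (illustration only).
dat <- airquality

# Structure: base R equivalent of dplyr::glimpse(dat).
str(dat)

# Summary statistics: ranges, quartiles and NA counts for every column.
print(summary(dat))

# Missing values per variable.
missing_counts <- colSums(is.na(dat))
print(missing_counts)

# Plot 1: distribution of a candidate outcome variable.
hist(dat$Ozone, main = "Distribution of ozone", xlab = "Ozone (ppb)")

# Plot 2: relationship between the outcome and a numeric explanatory variable.
plot(dat$Temp, dat$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)",
     main = "Ozone against temperature")

# Simple data quality check: count month values outside the valid range 1-12.
implausible_months <- sum(!dat$Month %in% 1:12)
print(implausible_months)
```

In a Quarto document, each of these steps would typically sit in its own code chunk, followed by a sentence or two noting what you observed.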

Analysis plan

Outline how you plan to answer each research question:

  • What variables will you use?
  • What analytical approach or method seems appropriate, and why?
  • What cleaning or transformation steps do you anticipate needing?
  • What are the main uncertainties or risks in your plan (e.g. not enough data, confounding variables, missing values)?

You do not need to have the answers yet — the goal is to show that you have thought carefully about what is feasible with your data.

Updated Proposal (new)

Due: Mon 29 Jun, 23:59. Re-submit on Moodle with updated group members, GitHub repo link, and rendered proposal (PDF or webpage).

By this point you have received general feedback and have three more weeks of analysis content ahead. The goal of the update is to lock in your scope so you can shift focus to communication and presentation.

Your updated proposal should reflect:

  • Focused research questions — at most two, each with a specific planned output (a named plot, table, or model result with the variables you will use)
  • Analysis complete enough to communicate — your core results should be in place; the remaining weeks are for presentation, visualisation, and polishing your hand-in material, not for running new analyses
  • Revised analysis plan — specific enough that each step can be assigned to a group member and tracked as a GitHub Issue

You do not need to rewrite the full proposal — targeted edits to the research questions and analysis plan sections are sufficient.

If you have specific questions you would like feedback on, enter them in the Moodle comments. You may also optionally link to your project website homepage for feedback on your presentation format.

Group Reflection

W14 practical (17 Jul); written submission due 2026-07-22. Submit a link to your rendered group-reflection.qmd page (within your project repo) on Moodle.

In-class discussion (W14 practical)

In the final session, you will discuss your project experiences with other groups in class. This is not a formal presentation of results; it is a structured conversation about what you learned, what surprised you, and what you would do differently.

Come prepared to speak to the following:

About your data and analysis

  • What did you learn about your dataset that you did not expect when you chose it?
  • Which analytical decision was hardest to make, and how did you reason through it?
  • What is one thing your analysis cannot answer, and why?

About your workflow and collaboration

  • What worked well in how your group coordinated? What was harder than expected?
  • How did you use LLM assistance during the project — what was useful, what was not, and when did you have to override or correct it?

About your own skill development

  • What is one concept or tool from the course that made more sense once you applied it to your own data?
  • What would you want to learn next, given what this project revealed?
  • What is one error or mistake in LLM output that you caught?

The reflection is also good preparation for the oral exam: the questions above are representative of the kinds of reasoning you will be asked to demonstrate.

Written submission

Add a group-reflection.qmd to your project repository and render it as part of your Quarto website. It does not need to be polished — honest and specific answers are more useful than complete ones. This document should summarise things you’ve already documented in your git commit messages, GitHub issues, Contributions and AI disclosure statements.

group-reflection.qmd
---
title: "Group Reflection"
---

## Skills inventory

For each skill area, describe what your group actually used it for.
If you did not use a skill, say so and note why.

### Functions & refactoring (W2)

*What we did:*

*What worked well:*

*What was hard:*

### Debugging (W3)

*What we did:*

*What worked well:*

*What was hard:*

### Version control & remotes (W4)

*What we did:*

*What worked well:*

*What was hard:*

### Quarto websites & collaborative coding (W5)

*What we did:*

*What worked well:*

*What was hard:*

### R packages (W6)

*What we did:*

*What worked well:*

*What was hard:*

### Initial data analysis & data cleaning (W7)

*What we did:*

*What worked well:*

*What was hard:*

### Reproducibility, open data & renv (W9)

*What we did:*

*What worked well:*

*What was hard:*

### Modelling & analysis strategies (W10)

*What we did:*

*What worked well:*

*What was hard:*

### Communication & visualisation (W11)

*What we did:*

*What worked well:*

*What was hard:*

## Key analytical decisions

Describe 2–3 decisions that had a meaningful effect on your results or workflow.

### Decision 1: [short title]

*What were we deciding between?*

*How did we reason through it?*

*In hindsight, would we decide differently?*

### Decision 2: [short title]

*What were we deciding between?*

*How did we reason through it?*

*In hindsight, would we decide differently?*

## Workflow & collaboration

*How did we divide work across group members?*

*What branching or Git workflow did we actually use — and did it match the proposal?*

*What would we change about how we coordinated?*

## LLM usage

Describe up to 5 significant uses of an LLM tool during the project.

### Usage 1: [task]

- **Tool / model:**
- **What it produced:**
- **What we had to correct or verify:**

### Usage 2: [task]

- **Tool / model:**
- **What it produced:**
- **What we had to correct or verify:**

Project Format

Projects must:

  • be based around one dataset
  • be associated with a single GitHub repo
  • be available for viewing on the web

Choosing a dataset

The dataset:

  • should be available under an open licence
  • must have variation across at least two dimensions (e.g. time and space, or group and time)
  • may contain more than one data table (like nycflights13)
  • should have an identification variable in every data table (you may create one) that uniquely identifies each observation
  • should be no larger than 10 MB (as a raw CSV file)

At least one data table in your dataset should contain:

  • more than 50 rows,
  • one numeric (metric) outcome variable y (not binary 0/1),
  • two explanatory variables:
    • one numeric variable: x_1 (can be time)
    • one categorical variable: x_2
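
Before committing to a dataset, it may help to check these criteria in code. The following is a minimal sketch in base R: `check_table()`, `csv_small_enough()`, their argument names, and the `toy` data frame are all hypothetical and exist only for illustration. Reading "10 MB" as 10 * 1024^2 bytes is also an assumption.

```r
# Hypothetical helper: checks one data table against the criteria listed above.
# Argument names (id, y, x_num, x_cat) are illustrative column names as strings.
check_table <- function(df, id, y, x_num, x_cat) {
  c(
    unique_id          = !anyNA(df[[id]]) && anyDuplicated(df[[id]]) == 0,
    enough_rows        = nrow(df) > 50,
    numeric_outcome    = is.numeric(df[[y]]),
    outcome_not_binary = length(unique(na.omit(df[[y]]))) > 2,
    numeric_x          = is.numeric(df[[x_num]]),
    categorical_x      = is.factor(df[[x_cat]]) || is.character(df[[x_cat]])
  )
}

# Hypothetical helper: raw CSV size limit, taking 10 MB as 10 * 1024^2 bytes.
csv_small_enough <- function(path) file.size(path) <= 10 * 1024^2

# Toy table (made-up values) with an id, a numeric outcome,
# a numeric explanatory variable and a categorical one.
set.seed(1)
toy <- data.frame(
  id       = seq_len(60),
  rent     = rnorm(60, mean = 15, sd = 3),
  year     = rep(2015:2020, times = 10),
  district = rep(c("A", "B", "C"), each = 20)
)

print(check_table(toy, id = "id", y = "rent", x_num = "year", x_cat = "district"))
```

Each element of the result is TRUE when the corresponding criterion is met, so a single FALSE points directly at what needs fixing. The licence and the "variation across two dimensions" criteria still need to be judged by hand.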

Recommended sources:

  • Munich Open Data Portal
  • Indikatorenatlas München
  • Google Dataset Search

The quality, size, and topic of the dataset you choose will determine what kinds of analysis and data stories are feasible. A dataset that varies across more dimensions, or has richer metadata, will give you more to work with.

Project repository

The project repository should be on GitHub (GitLab is permitted but not supported). Every group member should contribute commits.

By the final submission, the repository should contain:

  • README.md with dataset description, research questions, and group members
  • Proposal document (due Mon 8 Jun) and updated proposal (due Mon 29 Jun)
  • Reflection document
  • Raw data (or a script that downloads it) with licence documentation
  • Analysis scripts and/or notebooks (Quarto)
  • Presentation graphics
  • Final presentation output reachable from the rendered Quarto website (e.g. PDFs should also be available online)
  • Contribution statement including AI Tools disclosure per department guidelines
  • other elements as suitable
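
For the "script that downloads it" option in the list above, one possible shape is a small helper that fetches the raw file once and reuses the local copy afterwards. This is only a sketch: `fetch_raw_data()`, the `data/raw/` path, and the example URL are hypothetical placeholders, not course requirements.

```r
# Hypothetical helper: download the raw file once and keep it at `dest`.
# If a local copy already exists, it is reused and nothing is downloaded.
fetch_raw_data <- function(url, dest) {
  if (file.exists(dest)) {
    message("Using existing copy: ", dest)
    return(invisible(dest))
  }
  # Create the target folder if needed, then download in binary mode.
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  utils::download.file(url, destfile = dest, mode = "wb")
  invisible(dest)
}

# Usage (placeholder URL and path; replace with your real data source):
# fetch_raw_data("https://example.org/dataset.csv", "data/raw/dataset.csv")
```

Recording the source URL, the download date, and the licence next to the script (for example in README.md) keeps the licence documentation together with the data.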

Presentation output formats

The most appropriate output formats for your projects will depend on the analysis context and target audience you choose to pursue, and may include:

  • statistical report (e.g., like https://moderndive.github.io/moderndive_labs/static/term_project/resubmission_example.html)
  • collection of charts (e.g., like VisTales example: https://vistales.github.io/Home/_site/chart/)
  • interactive dashboards (e.g., using Dashboards format)
  • data storytelling article (e.g., using the Closeread extension)
  • R data package (e.g. like palmerpenguins)

For an R package, you should also create a pkgdown website, and link to this from your main project website.

Tracking and reporting contributions

Commit messages

GitHub provides a natural record of contributions through commits, pull requests, and issues. To make this record meaningful:

  • Commit regularly and specifically — commit after completing a discrete task (e.g., “clean district variable”, “add EDA plot for income”), not in one large batch at the end
  • Use issues or a project board to assign tasks to group members before starting work — this creates a record of who was responsible for what
  • Every group member should commit directly — avoid one person committing on behalf of others, as this obscures the actual contribution record

Git history alone is not a complete record. Document decisions, reviews, and non-coding contributions (e.g., data sourcing, writing, feedback) in issues or a CONTRIBUTING.md file as you go.

Contribution statement format

Include one file per group member in a contributions/ folder (e.g. contributions/alice.qmd), or a combined CONTRIBUTING.qmd. For each group member, list:

  1. Sections or files they were primarily responsible for
  2. Key tasks they completed (1–3 bullet points)
  3. Collaborative tasks they participated in

Example format:

Example file: contributions/alice.qmd

## Alice — Contribution Statement

**Primarily responsible for:** data cleaning (`data/clean.R`), EDA section

- Sourced and documented the dataset licence
- Wrote data cleaning pipeline including handling of missing postcodes
- Co-authored the research questions section

**Collaborative tasks:** research question design, final report review

Example file: contributions/bob.qmd

## Bob — Contribution Statement

**Primarily responsible for:** modelling section, final report assembly

- Built and evaluated the regression model
- Produced all presentation graphics
- Reviewed and integrated team members' sections into the final report

**Collaborative tasks:** research question design, data sourcing discussion