Project purpose
The purpose of the second part of this course is to give you hands-on experience learning and applying statistical programming ideas and tools to real-world data. Each week we will cover a different theme of tasks within typical and fairly simple data-driven workflows. These themes are based on the R4DS workflow, but are not strictly linear stages or iterations, and will not cover every possible data science or statistical task in every possible context.
The aim in each week is to practice and develop your ability to plan and execute tasks within each theme, as well as evaluate and integrate results, artefacts and insights across different themes and collaborators. By working in groups you will have the opportunity to coordinate with and learn from others to craft a meaningful data-driven report or story based on your chosen dataset(s).
For each theme, the minimum recommendation is to:
- attend the lecture
- complete the practical exercise
- add at least one section or additional page to the project repository
This project is adapted from the ModernDive Term Project.
Relation to exam
The project is optional but strongly recommended. Students who complete the project will be examined on their own work — you will be asked to explain decisions, defend choices, and reason about alternatives. Students who do not submit a project will need to reason about an unfamiliar dataset and analysis context on the spot during the exam.
A complete project repository is also a useful portfolio piece for demonstrating collaborative data analysis skills.
Submission deadline: TBD
Late submission will not be accepted for exam preparation purposes.
Milestones
| Milestone | Due | Description |
|---|---|---|
| Group formation | W05 practical | Form groups of 3–4; choose a dataset |
| Project proposal | W07 practical | Initial EDA, research questions, analysis plan |
| Final submission | TBD (after W13, before exams) | Complete analysis and presentation in project repository |
| Group reflection | W13 practical (17 Jul) | In-class discussion of learnings across the project |
Choosing a dataset
The dataset:
- should be available under an open licence
- must have variation across at least two dimensions (e.g. time and space, or group and time)
- may contain more than one data table (e.g. `nycflights13`)
- every data table should have an identification variable (you may create one) that uniquely identifies each observation
- should be no larger than 10 MB (as raw CSV files)
At least one data table in your dataset should contain:
- more than 50 rows
- one numeric (metric) outcome variable `y` (not binary 0/1)
- two explanatory variables:
  - one numeric variable `x_1` (can be time)
  - one categorical variable `x_2`
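As a quick sanity check, the requirements above can be verified in R. This is a sketch only: it assumes your candidate table is loaded as a data frame `d` with columns named `y`, `x_1`, and `x_2` (rename to match your actual variables).

```r
# Hypothetical check that a candidate table meets the minimum requirements;
# assumes a data frame `d` with columns y, x_1, and x_2.
stopifnot(
  nrow(d) > 50,                                 # more than 50 rows
  is.numeric(d$y),                              # numeric outcome
  length(unique(na.omit(d$y))) > 2,             # not binary 0/1
  is.numeric(d$x_1),                            # numeric explanatory variable
  is.factor(d$x_2) || is.character(d$x_2)       # categorical explanatory variable
)
```

If any condition fails, `stopifnot()` raises an error naming the failing expression, which tells you which requirement your dataset does not yet meet.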
Recommended sources:
The quality, size, and topic of the dataset you choose will determine what kinds of analysis and data stories are feasible. A dataset that varies across more dimensions, or has richer metadata, will give you more to work with.
Project repository
The project repository should be on GitHub (GitLab is permitted but not supported). Every group member should contribute commits.
By the final submission, the repository should contain:
- `README.md` with dataset description, research questions, and group members
- Raw data (or a script that downloads it) with licence documentation
- Proposal document (from W07)
- Analysis scripts and/or notebooks (Quarto)
- Presentation graphics
- Final presentation output as rendered Quarto document or website
- Contribution statement including AI Tools disclosure per department guidelines
- Other elements as suitable
Analysis presentation formats
The most appropriate output formats for your projects will depend on the analysis context and target audience you choose to pursue, and may include:
- statistical consulting report
- interactive dashboard
- data storytelling article
Project Proposal
Milestone 1: Project proposal
The proposal does not need to be polished — it is a working document that helps you and your group get started, and gives us a chance to give early feedback.
1. Dataset & research questions
- Source and licence: Where did the data come from? What licence does it use?
- What is in it: How many rows and columns? What does each row represent? What are the key variables?
- Research questions: State 1–3 questions you want to explore with this dataset. For each question, say what kind of claim you are aiming to make:
- Descriptive: What does the data show? How has X changed over time?
- Exploratory: Are there patterns or relationships worth investigating further?
- Explanatory/causal: Does X seem to influence Y? (Note any limitations in making causal claims from observational data.)
- Predictive: Can we build a simple model to predict Y?
- Context and target audience: In what context and for what target audience are you exploring your research questions (business, academic, journalistic)? What presentation formats are suitable for this audience?
2. Initial data analysis
Run an initial exploration of the dataset and include in your proposal:
- `glimpse()` or equivalent output showing structure
- Summary statistics for key variables (missing values, distributions, ranges)
- At least two plots that help you understand the data — one showing distributions, one showing a relationship or trend relevant to your research questions
- Notes on data quality issues you have already noticed (missing values, inconsistent formatting, unexpected values)
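The checklist above can be covered in a few lines of R. The following is a sketch, again assuming a data frame `d` with a numeric outcome `y` and a numeric explanatory variable `x_1`; substitute your own variable names.

```r
# Sketch of an initial exploration for the proposal.
library(dplyr)
library(ggplot2)

glimpse(d)             # structure: column types and a preview of values
summary(d)             # ranges, quartiles, and NA counts per variable
colSums(is.na(d))      # missing values per column

# One plot showing a distribution, one showing a relationship:
ggplot(d, aes(y)) + geom_histogram()     # distribution of the outcome
ggplot(d, aes(x_1, y)) + geom_point()    # relationship with x_1
```

Unexpected values often show up already in `summary()` output (e.g. impossible minima or maxima), so note them down for the data quality section as you go.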
3. Analysis plan
Outline how you plan to answer each research question:
- What variables will you use?
- What analytical approach or method seems appropriate, and why?
- What cleaning or transformation steps do you anticipate needing?
- What are the main uncertainties or risks in your plan (e.g. not enough data, confounding variables, missing values)?
You do not need to have the answers yet — the goal is to show that you have thought carefully about what is feasible with your data.
4. Code style and package usage
Outline how you plan to ensure consistent code style and coherent package usage in your code base.
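One possible approach (assuming you use the `styler` and `lintr` packages, and that your scripts live in an `R/` directory) is to agree on the tidyverse style guide and automate the checks:

```r
# Sketch: enforce a shared code style across the repository.
styler::style_dir("R/")   # reformat all scripts to the tidyverse style guide
lintr::lint_dir("R/")     # flag remaining style and usage issues for review
```

Whatever tools you choose, state them in the proposal so every group member formats code the same way from the start.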
Tracking and reporting contributions
Tracking contributions
GitHub provides a natural record of contributions through commits, pull requests, and issues. To make this record meaningful:
- Commit regularly and specifically — commit after completing a discrete task (e.g., “clean district variable”, “add EDA plot for income”), not in one large batch at the end
- Use issues or a project board to assign tasks to group members before starting work — this creates a record of who was responsible for what
- Every group member should commit directly — avoid one person committing on behalf of others, as this obscures the actual contribution record
Git history alone is not a complete record. Document decisions, reviews, and non-coding contributions (e.g., data sourcing, writing, feedback) in issues or a CONTRIBUTING.md file as you go.
Contribution statement format
Include a CONTRIBUTING.md (or equivalent section in your README.md) with a brief contribution statement. For each group member, list:
- Sections or files they were primarily responsible for
- Key tasks they completed (1–3 bullet points)
- Collaborative tasks they participated in
Example format:
## Contributions
**Alice** — data cleaning (`data/clean.R`), EDA section
- Sourced and documented the dataset licence
- Wrote data cleaning pipeline including handling of missing postcodes
- Co-authored the research questions section
**Bob** — modelling section, final report assembly
- Built and evaluated the regression model
- Produced all presentation graphics
- Reviewed and integrated team members' sections into the final report