Project purpose
The purpose of the second part of this course is to give you hands-on experience learning and applying statistical programming ideas and tools to real-world data. Each week we will cover a different theme of tasks within typical and fairly simple data-driven workflows. These themes are based on the R4DS workflow, but are not strictly linear stages or iterations, and will not cover every possible data science or statistical task in every possible context.
The aim in each week is to practice and develop your ability to plan and execute tasks within each theme, as well as evaluate and integrate results, artefacts and insights across different themes and collaborators. By working in groups you will have the opportunity to coordinate with and learn from others to craft a meaningful data-driven report or story based on your chosen dataset(s).
For each theme, the minimum recommendation is to:
- attend the lecture
- complete the practical exercise
- add at least one section or additional page to the project repository
This project is adapted from the ModernDive Term Project.
Relation to exam
The project is optional but strongly recommended. Students who complete the project will be examined on their own work — you will be asked to explain decisions, defend choices, and reason about alternatives. Students who do not submit a project will need to reason about an unfamiliar dataset and analysis context on the spot during the exam.
A complete project repository is also a useful portfolio piece for demonstrating collaborative data analysis skills.
Submission deadline: TBD
Late submission will not be accepted for exam preparation purposes.
Milestones
| Milestone | Due | Description |
|---|---|---|
| Group formation | W05 practical | Form groups of 3–4; choose a dataset |
| Project proposal | W07 practical | Initial EDA, research questions, analysis plan |
| Final submission | TBD (after W13, before exams) | Complete analysis and presentation in project repository |
| Group reflection | W13 practical (17 Jul) | In-class discussion of learnings across the project |
Choosing a dataset
The dataset:
- should be available under an open licence
- must have variation across at least two dimensions (e.g. time and space, or group and time)
- may contain more than one data table (e.g. `nycflights13`)
- every data table should have an identification variable (you may create one) that uniquely identifies each observation
- should be no larger than 10 MB (as raw CSV files)
At least one data table in your dataset should contain:
- more than 50 rows
- one numeric (metric) outcome variable `y` (not binary 0/1)
- two explanatory variables:
  - one numeric variable `x_1` (can be time)
  - one categorical variable `x_2`
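As a quick sanity check, the requirements above can be verified in R. This is a sketch only: it assumes your candidate table is loaded as a data frame `d` with columns named `y`, `x_1`, and `x_2` (rename to match your actual variables).

```r
# Hypothetical check that a candidate table meets the minimum requirements;
# assumes a data frame `d` with columns y, x_1, and x_2.
stopifnot(
  nrow(d) > 50,                                 # more than 50 rows
  is.numeric(d$y),                              # numeric outcome
  length(unique(na.omit(d$y))) > 2,             # not binary 0/1
  is.numeric(d$x_1),                            # numeric explanatory variable
  is.factor(d$x_2) || is.character(d$x_2)       # categorical explanatory variable
)
```

If any condition fails, `stopifnot()` raises an error naming the failing expression, which tells you which requirement your dataset does not yet meet.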
Recommended sources:
The quality, size, and topic of the dataset you choose will determine what kinds of analysis and data stories are feasible. A dataset that varies across more dimensions, or has richer metadata, will give you more to work with.
Project repository
The project repository should be on GitHub (GitLab is permitted but not supported). Every group member should contribute commits.
By the final submission, the repository should contain:
- `README.md` with dataset description, research questions, and group members
- Raw data (or a script that downloads it) with licence documentation
- Proposal document (from W07)
- Analysis scripts and/or notebooks (Quarto)
- Presentation graphics
- Final presentation output as rendered Quarto document or website
- Contribution statement including AI Tools disclosure per department guidelines
- Other elements as suitable
Analysis presentation formats
The most appropriate output formats for your projects will depend on the analysis context and target audience you choose to pursue, and may include:
- statistical consulting report
- interactive dashboard
- data storytelling article
Project Proposal
Milestone 1: Project proposal
The proposal does not need to be polished — it is a working document that helps you and your group get started, and gives us a chance to give early feedback.
1. Dataset & research questions
- Source and licence: Where did the data come from? What licence does it use?
- What is in it: How many rows and columns? What does each row represent? What are the key variables?
- Research questions: State 1–3 questions you want to explore with this dataset. For each question, say what kind of claim you are aiming to make:
- Descriptive: What does the data show? How has X changed over time?
- Exploratory: Are there patterns or relationships worth investigating further?
- Explanatory/causal: Does X seem to influence Y? (Note any limitations in making causal claims from observational data.)
- Predictive: Can we build a simple model to predict Y?
- Context and target audience: In what context and for what target audience are you exploring your research questions (business, academic, journalistic)? What presentation formats are suitable for this audience?
2. Initial data analysis
Run an initial exploration of the dataset and include in your proposal:
- `glimpse()` or equivalent output showing structure
- Summary statistics for key variables (missing values, distributions, ranges)
- At least two plots that help you understand the data — one showing distributions, one showing a relationship or trend relevant to your research questions
- Notes on data quality issues you have already noticed (missing values, inconsistent formatting, unexpected values)
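The checklist above can be covered in a few lines of R. The following is a sketch, again assuming a data frame `d` with a numeric outcome `y` and a numeric explanatory variable `x_1`; substitute your own variable names.

```r
# Sketch of an initial exploration for the proposal.
library(dplyr)
library(ggplot2)

glimpse(d)             # structure: column types and a preview of values
summary(d)             # ranges, quartiles, and NA counts per variable
colSums(is.na(d))      # missing values per column

# One plot showing a distribution, one showing a relationship:
ggplot(d, aes(y)) + geom_histogram()     # distribution of the outcome
ggplot(d, aes(x_1, y)) + geom_point()    # relationship with x_1
```

Unexpected values often show up already in `summary()` output (e.g. impossible minima or maxima), so note them down for the data quality section as you go.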
3. Analysis plan
Outline how you plan to answer each research question:
- What variables will you use?
- What analytical approach or method seems appropriate, and why?
- What cleaning or transformation steps do you anticipate needing?
- What are the main uncertainties or risks in your plan (e.g. not enough data, confounding variables, missing values)?
You do not need to have the answers yet — the goal is to show that you have thought carefully about what is feasible with your data.
4. Code style and package usage
Outline how you plan to ensure consistent code style and coherent package usage in your code base.
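One possible approach (assuming you use the `styler` and `lintr` packages, and that your scripts live in an `R/` directory) is to agree on the tidyverse style guide and automate the checks:

```r
# Sketch: enforce a shared code style across the repository.
styler::style_dir("R/")   # reformat all scripts to the tidyverse style guide
lintr::lint_dir("R/")     # flag remaining style and usage issues for review
```

Whatever tools you choose, state them in the proposal so every group member formats code the same way from the start.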
Tracking and reporting contributions
Tracking contributions
GitHub provides a natural record of contributions through commits, pull requests, and issues. To make this record meaningful:
- Commit regularly and specifically — commit after completing a discrete task (e.g., “clean district variable”, “add EDA plot for income”), not in one large batch at the end
- Use issues or a project board to assign tasks to group members before starting work — this creates a record of who was responsible for what
- Every group member should commit directly — avoid one person committing on behalf of others, as this obscures the actual contribution record
Git history alone is not a complete record. Document decisions, reviews, and non-coding contributions (e.g., data sourcing, writing, feedback) in issues or a CONTRIBUTING.md file as you go.
Contribution statement format
Include a CONTRIBUTING.md (or equivalent section in your README.md) with a brief contribution statement. For each group member, list:
- Sections or files they were primarily responsible for
- Key tasks they completed (1–3 bullet points)
- Collaborative tasks they participated in
Example format:
## Contributions
**Alice** — data cleaning (`data/clean.R`), EDA section
- Sourced and documented the dataset licence
- Wrote data cleaning pipeline including handling of missing postcodes
- Co-authored the research questions section
**Bob** — modelling section, final report assembly
- Built and evaluated the regression model
- Produced all presentation graphics
- Reviewed and integrated team members' sections into the final report