Advanced Statistical Programming using R

Week 4: Version Control & Collaborative Coding

2026-05-07

Announcements

Reminders

Check the updated schedule: https://soda-lmu.github.io/StatProg2-2026-SoSe/#schedule
Commit to your individual Git reflection logs
Group formation this week — talk to each other, sign up on Moodle

Group formation activity

If you are already in a group, please register via Moodle
If not, please stay at the end of class, or fill in the matchmaking form on Moodle.

Examination Format

Date is now July 29 ~~Jul 30~~.

Oral exam will:

assess your statistical programming, collaboration and communication skills
assess your judgement and reasoning:
- explain decisions,
- defend choices,
- and reason about alternative options
based on group project and individual practical exercises

More details to be provided closer to the examination date.

Group Project

Learning goals

In the hands-on project, you will:

practice strategies for learning from data
implement statistical programming tasks
discuss and negotiate key decisions in data science workflows
produce a data-driven report, story, dashboard or other final output
leverage LLMs for assistance with statistical reasoning and programming
practice collaborative coding skills

Dataset requirements

Your dataset:

should contain data available under an open licence
must have variation across at least two dimensions (e.g. time and space, or group and time)
may contain more than one data table (e.g. like nycflights13)
must give every data table an identification variable (you may create one) that uniquely identifies each observation
must be no larger than 10 MB (as raw CSV files)

At least one data table should contain:

more than 50 rows,
one numeric outcome variable y (not binary 0/1),
two explanatory variables:
- one numeric variable: x_1 (can be time)
- one categorical variable: x_2

Dataset Ideas

Federal Elections in Munich
- 2025
- 2021
- 2017
- 2013

Tip

Keep a log of data sources you explore! Even datasets you don’t use can be valuable discussion examples in the Oral Exam.

Syllabus

Part 1: Statistical Programming Foundations (W2–6)

W02: Scripts, Functions & Refactoring
W03: Debugging
W04: Version Control & Remotes ~~Collaborative Coding~~
W05: Quarto websites & Collaborative Coding
W06: R Packages

Last Week

Responsible AI usage & LLM management
Asking for help & writing good questions
Debugging tools and strategies

Errors & Troubleshooting

Three condition types

	Type	What to do
🚨	Error	Must fix
⚠️	Warning	Inspect output
💬	Message	Read it

Troubleshooting steps

Read the message
Search the message
Divide and conquer
Read the docs
Restart R

Minimal Reproducible Examples

Minimal

Remove unrelated code
Limit package dependencies
Use built-in datasets or synthetic data

Reproducible

Include library() calls
Use dput() for real data
Set set.seed() if randomness is involved

Tip

Writing a reprex often finds the bug for you — making code self-contained reveals missing library() calls or stale variables.

Debugging Tools

Goal	Tool
Locate where error occurred	`traceback()`, `rlang::last_trace()`
Pause and inspect inside function	`browser()`
Debug a package function	`debugonce(fn)`, `debug(fn)`
Debug outside R (YAML, paths)	`quarto render` in terminal

This Week

Git history
Git Ignore
Git Remotes & GitHub

Git Review

Motivation, commands so far, and how git works

Why version control?

The file naming trap

analysis_final.R
analysis_final_v2.R
analysis_FINAL_submit.R
analysis_FINAL_submit_fixed.R
analysis_use_this_one.R

The “what did I change?” problem

You get a result. Two days later it’s different.
Which script produced the submitted output?
Did you change the cleaning step or the model?
What did your collaborator edit last night?

Tip

Version control replaces both problems with one system: a named, timestamped history of every change.

What version control gives you

For your own work

A complete, browsable history of every change
Roll back any file to any past state
Confidence to experiment — you can always undo
A record of why you made each change (commit messages)

For collaboration

See exactly what a collaborator changed, line by line
Merge two people’s edits without overwriting each other
Review and discuss changes before they go into the project
Reproduce any past result from the exact code that produced it

Git Workflow

working directory  →  staging area  →  repository
   (your files)       (git add)        (git commit)

Edit files as normal
Stage the changes you want to keep (git add)
Commit a named snapshot (git commit)

Commands so far

Set up & inspect

git init               # start tracking a folder
git status             # what has changed?
git log                # history of commits

Stage & commit

git add <file>         # stage a file
git add .              # stage all changes
git commit -m "msg"    # save a snapshot
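
Put together, the commands above form one short loop. The following is a minimal sketch in a throwaway folder; the file name `analysis.R` and the "Demo User" identity are placeholders chosen for illustration, not part of the course material.

```shell
set -e
# Work in a temporary folder so no real project is touched.
repo="$(mktemp -d)"
cd "$repo"

git init -q                                  # start tracking this folder
git config user.name  "Demo User"            # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'x <- 1:10' > analysis.R                # 1. edit files as normal
git status --short                           # "??" marks an untracked file
git add analysis.R                           # 2. stage the change
git commit -q -m "Add analysis script"       # 3. commit a named snapshot

git log --oneline                            # one line per commit
n_commits="$(git rev-list --count HEAD)"     # number of commits so far
```

After the commit, `git status` reports a clean working directory, which is a quick way to confirm that everything you staged was saved.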

What makes a good commit message?

Short imperative summary (≤ 50 chars)
Describes what and why, not how
Each commit captures one logical change

# ✓ good
git commit -m "Add penguin species filter"

# ✗ less helpful
git commit -m "fix"
git commit -m "changes"

Git History

Examples based on:

Chapter 2. Basic Tricks, Git Magic

Going back through the log

So far, we’ve just used Git as a diary or journal to log additions.
What if we want to look back through past versions of our repository, or restore a past version of a file?
Let’s start by looking at what RStudio lets us do.

In RStudio: view log

Tip

The RStudio Git History panel calls git log under the hood — you can also run git log --oneline in the terminal for a compact view.

In RStudio: see changes

Tip

Changes are shown as highlights: red for deletions and green for additions

In RStudio: save past versions

Tip

RStudio is displaying the result of `git checkout <hash> -- path/to/file.R`.

In RStudio: view file history

CLI: history commands

Browse the log

git log                      # full log
git log --oneline            # compact one-line view
git log --oneline --graph    # with branch graph
git log -- code/analysis.R  # commits touching one file

Inspect a specific commit

git show <hash>              # full diff for that commit
git show <hash>:code/analysis.R  # file as it was then
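
To see these commands on a real history, here is a small sketch that builds two commits and then looks back at the first one. The file contents and commit messages are made up for the example.

```shell
set -e
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

mkdir code
echo 'x <- 1' > code/analysis.R
git add .; git commit -q -m "Add analysis script"
first="$(git rev-parse --short HEAD)"        # remember the first commit's hash

echo 'x <- 2' > code/analysis.R
git commit -q -am "Update x"                 # -a stages tracked, modified files

git log --oneline -- code/analysis.R         # both commits touch this file
old_version="$(git show "$first:code/analysis.R")"
echo "$old_version"                          # the file as it was at the first commit
```

Note that `git show <hash>:<path>` only prints the old content; it does not change anything in your working directory.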

CLI: Restore removed files

Imagine you have commits A → B → C → D, where B removes a file. At D, you want to recover the missing file.

1. Checkout and stage the old file.

# bring a file back to how it was at <hash>
git checkout <hash> -- code/analysis.R

gitGraph:
  commit id: "A"
  commit id: "B: remove code/analysis.R"
  commit id: "C"
  commit id: "D (HEAD)"
  commit id: "git checkout A code/analysis.R"  type: HIGHLIGHT
  commit id: "E: add analysis back"

Staged — still need git commit, or you can un-stage the file (e.g. for more editing) with git restore --staged <file>

2. Apply patch to working directory

git diff <hash-2> <hash-1> | git apply

gitGraph:
  commit id: "A"
  commit id: "B: remove code/analysis.R"
  commit id: "C"
  commit id: "D (HEAD)"
  commit id: "git diff B A | git apply" type: HIGHLIGHT
  commit id: "E: add analysis back"

Unstaged — still need git add + git commit
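
The difference between options 1 and 2 shows up in `git status`. The sketch below rebuilds a history like A → B → C → D in a temporary folder (file names and messages are illustrative assumptions) and records the status after each option.

```shell
set -e
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

mkdir code
echo 'x <- 1' > code/analysis.R
echo 'notes'  > README.md
git add .;                 git commit -q -m "A: add analysis and README"
a="$(git rev-parse HEAD)"
git rm -q code/analysis.R; git commit -q -m "B: remove code/analysis.R"
b="$(git rev-parse HEAD)"
echo 'more notes' >> README.md
git commit -q -am "C: extend README"
git commit -q --allow-empty -m "D: placeholder commit"

# Option 1: check the file out from A. It comes back already staged.
git checkout "$a" -- code/analysis.R
status_checkout="$(git status --short --untracked-files=all)"

# Undo option 1 so that option 2 starts from the same clean state.
git restore --staged code/analysis.R
rm code/analysis.R

# Option 2: apply the patch that turns B back into A. Nothing is staged.
git diff "$b" "$a" | git apply
status_apply="$(git status --short --untracked-files=all)"

echo "$status_checkout"    # "A " in the first column: staged as a new file
echo "$status_apply"       # "??": present on disk but not yet staged
```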

CLI: Restore removed files

3. Revert the offending commit

git revert <hash>

gitGraph:
  commit id: "A"
  commit id: "B: remove code/analysis.R" type: REVERSE
  commit id: "C"
  commit id: "D (HEAD)"
  commit id: "git revert B" type: HIGHLIGHT
  commit id: "E: add analysis back"

Auto-commits — done!
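
As a sketch of the revert route, the following builds a short history in a temporary folder and reverts the commit that removed the file. The `--no-edit` flag is an addition for the example: it accepts the default "Revert ..." message so that no editor opens.

```shell
set -e
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

mkdir code
echo 'x <- 1' > code/analysis.R
git add .;                 git commit -q -m "A: add analysis"
git rm -q code/analysis.R; git commit -q -m "B: remove code/analysis.R"
b="$(git rev-parse HEAD)"
git commit -q --allow-empty -m "C: unrelated work"

# Create a new commit that undoes exactly what B did.
git revert --no-edit "$b" > /dev/null

git log --oneline                            # the newest commit is the revert
restored="$(cat code/analysis.R)"            # the file is back, already committed
```

Because the revert is itself a commit, the history still shows that the file was removed and later restored.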

If we use the ‘Save As’ button in RStudio to restore the file, which option are we in?

1: git diff <B> <A> | git apply
2: git checkout <A> -- <file>
3: git revert <B>

Git Ignore

What should NOT go in git?

Too large or auto-generated

raw data files > a few MB
rendered outputs (_site/, docs/)
compiled artefacts (.o, .so)

Machine-specific or sensitive

.Rhistory, .RData, .Rproj.user/
.DS_Store (macOS), Thumbs.db (Windows)
credentials, API keys, .env files
absolute paths baked into outputs

Warning

Once a file is committed, its content lives in git history — even if you delete it later. Never commit secrets.

`.gitignore` — project level

A .gitignore file at the repo root tells git which files and folders to skip.

Syntax

# ignore a specific file
.RData

# ignore all files with this extension
*.csv

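# ignore everything inside a folder
_site/*

# but track one file inside it
# (with a plain `_site/` pattern git skips the folder entirely,
#  so a later `!` rule could not bring a file back)
!_site/CNAME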
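
One way to check that a set of patterns does what you expect is `git check-ignore`. The sketch below writes a small `.gitignore` into a temporary repository and asks git which files it would skip. The file names are invented for the example, and the folder rule is written as `_site/*` because git cannot re-include a file whose parent folder is itself ignored.

```shell
set -e
repo="$(mktemp -d)"; cd "$repo"
git init -q

# A small .gitignore using the pattern types shown above.
cat > .gitignore <<'EOF'
.RData
*.csv
_site/*
!_site/CNAME
EOF

mkdir _site
touch .RData survey.csv analysis.R _site/index.html _site/CNAME

# Which rule ignores each of these files? (-v prints file, line and pattern)
git check-ignore -v survey.csv _site/index.html

# What would git offer to track? Ignored files are left out of this list.
visible="$(git status --short --untracked-files=all)"
echo "$visible"
```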

Typical R / Quarto .gitignore

# R session artefacts
.Rhistory
.RData
.Rproj.user/

# Quarto build output
/_site/
/.quarto/

# OS noise
.DS_Store
Thumbs.db

Tip

GitHub offers a ready-made R .gitignore when you create a new repo — always tick that box.

Global `.gitignore` — machine level

Some files you want to ignore everywhere, not just in one project.

# tell git where your global ignore file lives
# (this only sets the path; create and edit the file yourself)
git config --global core.excludesFile ~/.gitignore_global

Add OS and editor noise once, never think about it again:

# macOS
.DS_Store
.AppleDouble

# Windows
Thumbs.db
Desktop.ini

# RStudio
.Rproj.user/

Extensions: gitignore.io

Tip

For generating useful .gitignore files, try 🔗 gitignore.io

ASIDE: How does git actually work?

Git stores snapshots, not diffs.

Each commit records:

a snapshot of all tracked files
a pointer to the parent commit
your name, email, timestamp, and message

commit c3d4e5f
parent b2c3d4e
author Cynthia Huang
date   2026-05-07

Add penguin species filter
    ↓
[snapshot of all files]

You don’t need to know how git works under the hood for this unit, but if you are curious:
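
One way to look inside a commit yourself is `git cat-file -p`, which prints the raw commit object. This is a minimal sketch in a temporary repository with placeholder file contents and identity; it is meant as an illustration, not a full account of git's object model.

```shell
set -e
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"

echo 'x <- 1' > analysis.R
git add .; git commit -q -m "Add analysis script"
echo 'x <- 2' > analysis.R
git commit -q -am "Add penguin species filter"

# The raw commit: a tree (the snapshot), a parent, author, committer, message.
git cat-file -p HEAD

# The tree is the snapshot: it lists every tracked file, not just changed ones.
git ls-tree -r --name-only HEAD

# The parent recorded inside the commit is the previous commit.
parent_in_object="$(git cat-file -p HEAD | sed -n 's/^parent //p')"
parent_by_name="$(git rev-parse HEAD~1)"
```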

Git Remotes

Beyond your computer

So far, everything has lived on your machine:

working directory  →  staging area  →  local repository
   (your files)       (git add)        (git commit)

What if you want to share your work with others, or work on multiple machines?

A remote repository is a related repository hosted on a server:

working directory  →  staging area  →  local repo  →  remote repo
   (your files)       (git add)        (git commit)    (git push)
                                            ↑
                                        (git pull)

Use Git commands to compare and sync commits between the remote and local repositories.

GitHub as remote server

GitHub is an interface and cloud hosting service built on top of the Git version control system.
Git does the version control and GitHub stores the data remotely.
Makes your work accessible to others (and to yourself on other machines)
Adds collaboration tools: issues, pull requests, code review

Tip

GitLab works the same way and is permitted in this course, but practicals and instructions will use GitHub.

GitHub: Connecting local and remote?

Set up a GitHub account
Authenticate
Connect to existing repo
Pull & Push local changes

Setup & Authentication

You need a GitHub account

Sign up at https://github.com
Your username is public — choose something professional

Authentication = proving you are who you say you are

GitHub needs to verify your identity before you can connect:

HTTPS + PAT — a token you paste like a password
SSH key — a cryptographic key pair on your machine

Tip

We recommend SSH, set up by following the instructions from LMU OSC. The full setup walkthrough is in this week’s practical.

From GitHub to your machine

A fork is your own copy of someone else’s repository — on GitHub’s servers.

original repo          your fork
(soda-lmu/template) → (you/template)
       ↓
  your machine  (git clone)

Changes you make stay in your fork until you open a Pull Request to propose them back.

A clone is a local copy of a GitHub repository — on your machine.

   GitHub repo
(you/template)
       ↓
  your machine  (git clone)

Changes you make stay local until you git push them back to GitHub.

Fork vs Clone

	Fork	Clone
Lives on	GitHub	your machine
Connected to	original repo	a GitHub repo
Used for	contributing to others’ work	working locally

Tip

For the group project: one member forks the template, the rest clone the fork. Everyone pushes to the same shared fork.

Fork an existing repository

Navigate first to the repository you want to fork – e.g.:

🔗 github.com/soda-lmu/our-statprog2-project

Fork an existing repository

Give your fork a name

Fork an existing repository

Forking in progress

Fork an existing repository

Notice the connection to the original repository

Clone an existing repository

Clone to your machine – automatically sets the remote.

git clone <url>    # copy a repo to your machine

Clone an existing repository

Getting the URL

Working with Remotes

The day-to-day loop

GitHub  ──clone──▶  local             (once, to set up)
GitHub  ──pull───▶  local             (start of each session)
local   ──push───▶  GitHub            (end of each session)

git pull — get your collaborators’ latest commits
Edit files, run code
git add + git commit — save your work in logical chunks
git push — share your commits with the team
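
The loop can be tried without a GitHub account by letting a bare repository on disk stand in for the remote. Everything in this sketch (the `collab` and `me` folders, names, and file contents) is hypothetical; with GitHub, the path to `remote.git` would be your repository URL.

```shell
set -e
tmp="$(mktemp -d)"

# A bare repository plays the role of GitHub.
git init -q --bare "$tmp/remote.git"
git -C "$tmp/remote.git" symbolic-ref HEAD refs/heads/main

# A collaborator seeds the project (git warns that the clone is empty).
git clone -q "$tmp/remote.git" "$tmp/collab"
cd "$tmp/collab"
git config user.name "Collaborator"; git config user.email "collab@example.com"
echo '# Project' > README.md
git add .; git commit -q -m "Add README"
git branch -M main
git push -q -u origin main

# Once: clone to set up your own copy.
git clone -q "$tmp/remote.git" "$tmp/me"
cd "$tmp/me"
git config user.name "Me"; git config user.email "me@example.com"

# Meanwhile, the collaborator pushes another commit.
( cd "$tmp/collab" && echo 'Data notes' >> README.md \
    && git commit -q -am "Add data notes" && git push -q )

git pull -q                                  # start of session: get their commits
echo 'x <- 1' > analysis.R                   # edit files, run code
git add analysis.R
git commit -q -m "Add analysis script"       # save work in a logical chunk
git push -q                                  # end of session: share it

local_head="$(git rev-parse HEAD)"
remote_head="$(git -C "$tmp/remote.git" rev-parse main)"
```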

Tip

Pull at the start of every session, push at the end. This keeps conflicts small and your team in sync.

In RStudio: Edit

In RStudio: Commit

What CLI commands is RStudio performing for you?

In RStudio: Push

In RStudio: Rejected Push???

In RStudio: Pull THEN Push

In RStudio: Success!

Pushing an existing repository?

Let’s say you have a repository on your local machine, and you want to git push it to GitHub.

You can’t push without a remote!

So we need to set up a remote repository to connect to!

GitHub: Creating a new repo

GitHub: Add remote

GitHub shows you exactly what to run after creating an empty repo:

git remote add origin <url>
git branch -M main
git push -u origin main

Tip

-u sets origin/main as the upstream — after this, plain git push works.
git remote -v confirms the connection — you should see origin pointing to your GitHub URL.
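
The same three commands can be rehearsed locally. In this sketch an empty bare repository on disk stands in for the empty GitHub repo, so `url` is a file path here rather than an SSH or HTTPS URL; the project folder and its contents are made up for the example.

```shell
set -e
tmp="$(mktemp -d)"

# Stand-in for the empty repository you would create on GitHub.
git init -q --bare "$tmp/remote.git"
url="$tmp/remote.git"

# An existing local repository with one commit and no remote yet.
mkdir "$tmp/project"; cd "$tmp/project"
git init -q
git config user.name "Demo User"; git config user.email "demo@example.com"
echo 'x <- 1' > analysis.R
git add .; git commit -q -m "Add analysis script"

git remote add origin "$url"                 # name the remote "origin"
git branch -M main                           # make sure the branch is called main
git push -q -u origin main                   # push and remember the upstream

git remote -v                                # origin listed for fetch and push
upstream="$(git rev-parse --abbrev-ref '@{u}')"
echo "$upstream"
```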

Extension: GitHub CLI

The GitHub CLI (gh) lets you create repos without leaving the terminal.

From inside an existing local repo:

gh repo create

Follow the interactive prompts to set the name, visibility, and whether to push immediately.

Summary

Git: local version control

init, add, commit — the core loop; each commit is a snapshot with a parent pointer
Good commit messages are imperative, one logical change at a time
Use RStudio’s History panel (or git log) to browse and restore past versions
Specify files you don’t want to track with .gitignore

GitHub: beyond local

If you already have a local repo

git remote add origin <url>
git branch -M main
git push -u origin main

-u sets origin/main as the upstream — after this, plain git push works.

If you’re starting fresh

git clone <url>      # creates local repo
                     # remote is already configured
cd my-project
# ... add files ...
git add .
git commit -m "First commit"
git push

Tip

git remote -v confirms the connection — you should see origin pointing to your GitHub URL.

Pull before you push!

git pull — fetch and merge any remote changes first
Resolve any conflicts if they arise
git push — send your commits to GitHub
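
To see why the order matters, the sketch below stages a rejected push between two hypothetical team members, Alice and Bob, with a bare repository on disk standing in for GitHub. The flags `--no-rebase` and `--no-edit` are choices made for the example: they ask `git pull` to merge and to accept the default merge message without opening an editor.

```shell
set -e
tmp="$(mktemp -d)"
git init -q --bare "$tmp/remote.git"
git -C "$tmp/remote.git" symbolic-ref HEAD refs/heads/main

# Alice seeds the remote (the empty-clone warning is hidden here).
git clone -q "$tmp/remote.git" "$tmp/alice" 2>/dev/null
cd "$tmp/alice"
git config user.name "Alice"; git config user.email "alice@example.com"
echo '# Project' > README.md
git add .; git commit -q -m "Add README"
git branch -M main
git push -q -u origin main

# Bob clones the project.
git clone -q "$tmp/remote.git" "$tmp/bob"
git -C "$tmp/bob" config user.name "Bob"
git -C "$tmp/bob" config user.email "bob@example.com"

# Alice pushes a new commit...
echo 'x <- 1' > model.R
git add .; git commit -q -m "Add model script"; git push -q

# ...while Bob commits without pulling first.
cd "$tmp/bob"
echo 'y <- 2' > plots.R
git add .; git commit -q -m "Add plotting script"

# Bob's push is rejected: the remote has a commit he does not have yet.
if git push -q 2>/dev/null; then first_push="accepted"; else first_push="rejected"; fi
echo "first push: $first_push"

git pull -q --no-rebase --no-edit            # fetch and merge Alice's commit
git push -q                                  # now the push goes through
remote_head="$(git -C "$tmp/remote.git" rev-parse main)"
```

The two commits touch different files, so the merge needs no manual conflict resolution; edits to the same lines would stop at the pull step until resolved.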

Tip

Make it a habit: pull first, then push. This avoids most “rejected push” errors.
