Commit a97d0ce

Add Processing Complex Data course site
0 parents  commit a97d0ce

13 files changed

Lines changed: 1252 additions & 0 deletions

.gitignore

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Data: too large for the repo (e.g. NSD slices, ~534 MB)
data/

# Jekyll / GitHub Pages build artifacts
_site/
.jekyll-cache/
.jekyll-metadata
.sass-cache/
Gemfile.lock

# Course admin/source notes
admin/

# R
.Rproj.user/
.Rhistory
.RData
.Ruserdata
.r_libs/
*.Rproj

# Python
__pycache__/
*.py[cod]
*.egg-info/
.venv/
venv/
.ipynb_checkpoints/
.mypy_cache/
.ruff_cache/
.pytest_cache/

# Quarto
*_files/
*_cache/
.quarto/

# OS / editor
.DS_Store
Thumbs.db
.idea/
.vscode/

# Feasibility checks (local-only)
project_feasibility_checks/

LICENSE

Lines changed: 395 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# Processing Complex Data (PCD)

Course materials for **Processing Complex Data**, a Methodology and Statistics master-level course on handling, processing, and modelling complex scientific data.

The course site is built with Jekyll and hosted on GitHub Pages. The landing page is [`index.md`](index.md). The course manual is [`course_manual.md`](course_manual.md), project guidelines are in [`project_guidelines.md`](project_guidelines.md), and the six group projects live in [`projects/`](projects/).

All materials are licensed [CC-BY-4.0](LICENSE).

_config.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
title: PCD
description: Materials for the Methodology and Statistics master course <i>Processing Complex Data</i>.
theme: jekyll-theme-minimal
github:
  is_project_page: false

# Auto-rewrite [text](file.md) links to [text](file.html) on the deployed site.
# Whitelisted on GitHub Pages: https://pages.github.com/versions/
plugins:
  - jekyll-optional-front-matter
  - jekyll-relative-links

relative_links:
  enabled: true
  collections: false

# Make Jekyll render every .md file in the repo as a page, even when the
# source file has no YAML front matter.
defaults:
  - scope:
      path: ""
    values:
      layout: default

include:
  - README.md

# Don't try to render LICENSE, configs, or anything outside the source.
exclude:
  - LICENSE
  - .DS_Store
  - .mypy_cache
  - .ruff_cache
  - .sass-cache
  - .r_libs
  - admin
  - data
  - project_feasibility_checks
  - "*.lock"
  - Gemfile
  - Gemfile.lock

course_manual.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Course Manual

This page is based on the Osiris course text for **Processing Complex Data**.

## Name

Processing Complex Data

## ECTS

2.5

## Course Description

Contrary to what most introductory data science courses and statistics courses teach and use, data in science has an incredible variety of formats, sizes, and procedures. From simple tables to complex multidimensional space-time arrays, including metadata and custom storage formats, the world of data for science is vast, varied, and wildly interesting.

This course is designed to give students an introduction to core real-world data concepts, as well as hands-on experience with handling, processing, and modelling different types of complex data used in various fields of science and beyond. The course leans on student engagement and guided practical group work to create a dynamic learning environment.

## Course Goals

At the end of the course, students will be able to:

- Identify and describe a range of complex scientific data formats, such as multidimensional arrays, spatiotemporal data, and metadata-rich structures, and their associated challenges.
- Apply appropriate preprocessing techniques, such as cleaning, transformation, normalization, filtering, and feature extraction, to different types of real-world scientific datasets.
- Analyze and compare statistical modelling approaches suitable for various data modalities, evaluating their assumptions, strengths, and limitations.
- Defend and communicate statistical findings through a structured report and peer discussions, demonstrating the ability to justify methodological choices and respond to critique.

## Assessment

Assessment is based on a group project, which runs for the duration of the course. The grade for the project is the final grade for the course.

## Materials

All course materials will be made openly available under a CC-BY license. The readings will be based on books, articles, and other sources which are openly available.

- Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). *R for Data Science* (2nd ed.). O'Reilly Media. <https://r4ds.hadley.nz/>
- Several relevant open-access articles and materials.

## Theme

DIGITA_DATA_INFORMAT

## Work Form

Lecture and tutorial (hoorcollege/werkcollege)

index.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Processing Complex Data

This webpage contains all materials for the Methodology and Statistics master course **Processing Complex Data (PCD)**. The materials on this website are [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) licensed.

![cc](https://mirrors.creativecommons.org/presskit/icons/cc.svg) ![by](https://mirrors.creativecommons.org/presskit/icons/by.svg)

## About the course

Contrary to what most introductory data science and statistics courses teach, real-world scientific data come in an enormous variety of formats, sizes, structures, and procedures: from simple tables to spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, and domain-specific scientific standards. This course gives students hands-on experience with handling, processing, and modelling six families of complex data, in a hackathon-style format where each group goes deep on one data type and teaches the rest of the class.

The narrative spine of the course is *from raw traces to defensible claims*. Each group works through a single pipeline: raw source → operationalized clean object → baseline model with one sensitivity check → presentation.

## Course materials

- [Course manual](course_manual.md): official course description, learning goals, assessment, and materials.
- [Project guidelines](project_guidelines.md): workflow files, raw-data policy, decision logs, quality checks, and contribution tracking.

## Lectures

| Week | Title | Lecture |
| :--- | :---- | :------ |
| 1 | What Makes Data Complex? | TBD |
| 2 | From Complex Data to Clean Data | TBD |
| 3 | Scaling Up Modeling | TBD |
| 4 | Communicating Research | TBD |
| 5 | Presentations | |
| 6 | Short Report | |

## Group projects

There are six project variants. Each group works through the same four-week workflow (`week1_explore.qmd`, `week2_operationalize_clean.qmd`, `week3_model.qmd`, `week4_storytelling.qmd`) on its own data family, and teaches the same six dimensions to the rest of the class: **data structure, storage system, file formats, encoding, model, and key aspects**.

| Variant (data family) | Project description | Example research question |
| :-------------------- | :------------------ | :------------------------ |
| Geospatial | [`projects/geospatial.md`](projects/geospatial.md) | What is the relation between municipal land use and population composition? |
| Networks | [`projects/networks.md`](projects/networks.md) | What is the relationship between gender and cross-program relations in high school? |
| Messy web text | [`projects/messy_web_text.md`](projects/messy_web_text.md) | Do company sustainability pages differ linguistically from public-interest climate information pages? |
| Relational database | [`projects/relational_database.md`](projects/relational_database.md) | Which driver, constructor, grid, circuit, and season characteristics are associated with F1 finishing points? |
| Time series | [`projects/time_series.md`](projects/time_series.md) | How does an fMRI signal change across NSD scan sessions? |
| API data | [`projects/api_data.md`](projects/api_data.md) | Which study attributes are associated with completed versus ongoing clinical trials? |

## Assessment

Assessment is based on the group project, which runs for the full duration of the course. The project grade is the final course grade. Each group submits a short structured report and gives a final presentation; both are evaluated on the raw-to-clean-to-model pipeline, the methodological choices, and the limits of the claim.

project_guidelines.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# Project Guidelines

These guidelines apply to all project variants in **Processing Complex Data**.

- Team size: 4 students.
- Level: 1st-year master.
- Language priority: R + Quarto (`.qmd`). Use Python only when a raw format is substantially easier there.
- Difficulty target: all variants should feel comparable in scope and workload.

## Project Management

> [!NOTE]
> Remember that you need to write down how each of you contributes.

- How will you work together? In-person meetings? A Teams channel? Email?
- How will you write the report? Microsoft Office? Overleaf / LaTeX? Quarto? Google Docs? You decide.
- The projects are very short: be strict about your deadlines, and help each other out.
- Assign weekly roles: coordinator, data lead, modeling lead, and presenter/reviewer.
- Keep a short decision log inside each weekly `.qmd` under `Scope choices`.
- Record how work was divided; contribution tracking is part of the project.
- Use short weekly deadlines because the project is intentionally scoped for fast iteration.

## Exploration and Data

> [!NOTE]
> Start implementing small solutions to subquestions early.

## Research Question Operationalization

> [!NOTE]
> Keep it small and do not be afraid to make choices: the projects are short.

## Modeling and Report

> [!NOTE]
> Make things understandable for your audience: your peers.

## Raw-Data Policy

- Do not start from a pre-cleaned analysis file committed by the instructor.
- Week 1 and Week 2 should begin from raw website, repository, API, database, or scientific-format data.
- It is fine to create a clean analysis object inside Week 2 code, but that cleaning must happen inside the student workflow.
- Prefer complex formats over plain CSV when a realistic raw source exists.
- At least one variant should explicitly discuss Parquet as a columnar storage alternative when contrasting storage choices.

## Reproducible Structure

- What is a good folder structure for collaboration and sharing? Tip: get inspiration from the [ODISSEI-SODA guide to sharing research code](https://odissei-soda.nl/tutorials/share-your-reserarch-code/).
- Each variant must have exactly four executable workflow files:
  - `week1_explore.qmd`
  - `week2_operationalize_clean.qmd`
  - `week3_model.qmd`
  - `week4_storytelling.qmd`
- The weeks form one pipeline rather than four unrelated notebooks.
- `week1_explore.qmd` should download the raw data if they are not already present and then explore those raw files.
- `week2_operationalize_clean.qmd` should read the raw files and write `data/model_data.rds` for the Week 3 model.
- `week3_model.qmd` should read `data/model_data.rds`, fit the model, and write `data/model_results.rds`.
- `week4_storytelling.qmd` should read `data/model_results.rds` and turn those saved results into presentation-ready figures.
- Aim to keep executable code under about 100 non-empty lines per file. Week 2 preprocessing may run longer when the raw source requires it, but every file should stay compact and readable.
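
One way to keep an eye on the four-file rule and the budget of about 100 non-empty lines is a small helper kept outside the workflow files. The sketch below is in Python and is not part of the course materials. It assumes that "executable code" means non-empty lines inside fenced Quarto chunks opened with `{r}`, `{python}`, or similar, and that chunk option lines starting with `#|` count as code. The function names `count_code_lines` and `check_variant` are hypothetical.

```python
import re
from pathlib import Path

# The four workflow files named in the guidelines.
WORKFLOW_FILES = [
    "week1_explore.qmd",
    "week2_operationalize_clean.qmd",
    "week3_model.qmd",
    "week4_storytelling.qmd",
]

# Assumption: an executable chunk opens with a fence followed by "{",
# and closes with a bare fence.
CHUNK_OPEN = re.compile(r"^\s*```+\s*\{")
CHUNK_CLOSE = re.compile(r"^\s*```+\s*$")


def count_code_lines(qmd_text: str) -> int:
    """Count non-empty lines inside executable chunks of a Quarto document."""
    in_chunk = False
    count = 0
    for line in qmd_text.splitlines():
        if not in_chunk and CHUNK_OPEN.match(line):
            in_chunk = True
        elif in_chunk and CHUNK_CLOSE.match(line):
            in_chunk = False
        elif in_chunk and line.strip():
            count += 1
    return count


def check_variant(folder: str) -> dict:
    """Map each workflow file to its code-line count, or None if it is missing."""
    report = {}
    for name in WORKFLOW_FILES:
        path = Path(folder) / name
        if path.exists():
            report[name] = count_code_lines(path.read_text(encoding="utf-8"))
        else:
            report[name] = None
    return report
```

Calling `check_variant(".")` from a variant folder returns a dictionary that a group can compare against its own line budget. Plain fenced blocks without braces are treated as non-executed display code and are not counted.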

## Week-by-Week Scope

- Week 1: explain the origin, purpose, storage system, file format, and encoding of the data; download the raw data; and show a first exploratory view of the raw source.
- Week 2: define whether the question is about association, prediction, or causal effect, and build one analysis-ready object from the raw source that is saved for later use.
- Week 3: fit the baseline model on the saved Week 2 data, evaluate it, and save the model output that Week 4 will visualize.
- Week 4: explain the model assumptions, turn the saved model output into figures for the presentation, and state the main limitation clearly.

## Quality Checks

- Week 1: report row counts, feature counts, missingness, and one note on data provenance or statistical power.
- Week 2: test keys, parsing, impossible values, and one alternative cleaning choice.
- Week 3: report at least one fit metric and one sensitivity or robustness check.
- Week 4: include one uncertainty statement and one limitation slide or paragraph.
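
The Week 1 counts and the Week 2 key test above can be prototyped on any tabular view of the raw source. The following is a minimal sketch using only Python's standard library. It assumes the rows have already been parsed into dictionaries and that empty strings and `NA` mark missing values. The toy table and the function names `week1_summary` and `duplicated_keys` are illustrative, not course data or course code.

```python
import csv
import io


def week1_summary(rows: list[dict], missing_tokens=("", "NA")) -> dict:
    """Summarise row count, feature count, and per-column missingness."""
    columns = list(rows[0].keys()) if rows else []
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            # Assumption: None, empty strings, and "NA" all count as missing.
            if value is None or value.strip() in missing_tokens:
                missing[col] += 1
    return {"n_rows": len(rows), "n_features": len(columns), "missing": missing}


def duplicated_keys(rows: list[dict], key: str) -> list:
    """Return key values that occur more than once (a simple Week 2 key test)."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return sorted(dupes)


# Hypothetical toy table standing in for a parsed raw source.
toy = io.StringIO("id,age,city\n1,34,Utrecht\n2,NA,Leiden\n2,51,\n")
rows = list(csv.DictReader(toy))
print(week1_summary(rows))
print(duplicated_keys(rows, "id"))
```

Within the course's R + Quarto workflow the same checks would normally be written in R. The point here is only the shape of the report: counts, missingness per column, and an explicit test that the intended key is unique.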

## Documentation Minimum

- Keep source links in the project description `.md` files.
- Keep operational definitions in Week 2.
- Keep model interpretation and caveats in Week 4.
- If AI assistance was used, state where it entered the workflow.
