ProcessingComplexData
diff --git a/.gitignore b/.gitignore
Lines changed: 45 additions & 0 deletions
diff --git a/LICENSE b/LICENSE
Lines changed: 395 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 7 additions & 0 deletions
diff --git a/_config.yml b/_config.yml
Lines changed: 41 additions & 0 deletions
diff --git a/course_manual.md b/course_manual.md
Lines changed: 45 additions & 0 deletions
diff --git a/index.md b/index.md
Lines changed: 44 additions & 0 deletions
diff --git a/project_guidelines.md b/project_guidelines.md
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Data — too large for the repo (e.g. NSD slices, ~534 MB)
+data/
+
+# Jekyll / GitHub Pages build artifacts
+_site/
+.jekyll-cache/
+.jekyll-metadata
+.sass-cache/
+Gemfile.lock
+
+# Course admin/source notes
+admin/
+
+# R
+.Rproj.user/
+.Rhistory
+.RData
+.Ruserdata
+.r_libs/
+*.Rproj
+
+# Python
+__pycache__/
+*.py[cod]
+*.egg-info/
+.venv/
+venv/
+.ipynb_checkpoints/
+.mypy_cache/
+.ruff_cache/
+.pytest_cache/
+
+# Quarto
+*_files/
+*_cache/
+.quarto/
+
+# OS / editor
+.DS_Store
+Thumbs.db
+.idea/
+.vscode/
+
+# Feasibility checks (local-only)
+project_feasibility_checks/
@@ -0,0 +1,7 @@
+# Processing Complex Data (PCD)
+
+Course materials for **Processing Complex Data**, a Methodology and Statistics master-level course on handling, processing, and modelling complex scientific data.
+
+The course site is built with Jekyll and hosted on GitHub Pages. The landing page is [`index.md`](index.md). The course manual is [`course_manual.md`](course_manual.md), project guidelines are in [`project_guidelines.md`](project_guidelines.md), and the six group projects live in [`projects/`](projects/).
+
+All materials are licensed [CC-BY-4.0](LICENSE).
@@ -0,0 +1,41 @@
+title: PCD
+description: Materials for the Methodology and Statistics master course <i>Processing Complex Data</i>.
+theme: jekyll-theme-minimal
+github:
+    is_project_page: false
+
+# Auto-rewrite [text](file.md) links to [text](file.html) on the deployed site.
+# Whitelisted on GitHub Pages: https://pages.github.com/versions/
+plugins:
+  - jekyll-optional-front-matter
+  - jekyll-relative-links
+
+relative_links:
+  enabled: true
+  collections: false
+
+# Apply the default layout to every file. Together with
+# jekyll-optional-front-matter, this makes Jekyll render every .md file in
+# the repo as a page, even when the source file has no YAML front matter.
+defaults:
+  - scope:
+      path: ""
+    values:
+      layout: default
+
+include:
+  - README.md
+
+# Don't try to render LICENSE, configs, or anything outside the source.
+exclude:
+  - LICENSE
+  - .DS_Store
+  - .mypy_cache
+  - .ruff_cache
+  - .sass-cache
+  - .r_libs
+  - admin
+  - data
+  - project_feasibility_checks
+  - "*.lock"
+  - Gemfile
+  - Gemfile.lock
@@ -0,0 +1,45 @@
+# Course Manual
+
+This page is based on the Osiris course text for **Processing Complex Data**.
+
+## Name
+
+Processing Complex Data
+
+## ECTS
+
+2.5
+
+## Course Description
+
+Contrary to what most introductory data science and statistics courses teach and use, scientific data come in an incredible variety of formats, sizes, and procedures. From simple tables to complex multidimensional space-time arrays, including metadata and custom storage formats, the world of data for science is vast, varied, and wildly interesting.
+
+This course is designed to give students an introduction to core real-world data concepts, as well as hands-on experience with handling, processing, and modelling different types of complex data used in various fields of science and beyond. The course leans on student engagement and guided practical group work to create a dynamic learning environment.
+
+## Course Goals
+
+At the end of the course, students will be able to:
+
+- Identify and describe a range of complex scientific data formats, such as multidimensional arrays, spatiotemporal data, and metadata-rich structures, and their associated challenges.
+- Apply appropriate preprocessing techniques, such as cleaning, transformation, normalization, filtering, and feature extraction, to different types of real-world scientific datasets.
+- Analyze and compare statistical modelling approaches suitable for various data modalities, evaluating their assumptions, strengths, and limitations.
+- Defend and communicate statistical findings through a structured report and peer discussions, demonstrating the ability to justify methodological choices and respond to critique.
+
+## Assessment
+
+Assessment is based on a group project, which runs for the duration of the course. The grade for the project is the final grade for the course.
+
+## Materials
+
+All course materials will be made openly available under a CC-BY license. The readings will be based on books, articles, and other sources which are openly available.
+
+- Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). *R for Data Science* (2nd ed.). O'Reilly Media. <https://r4ds.hadley.nz/>
+- Several relevant open-access articles and materials.
+
+## Theme
+
+DIGITA_DATA_INFORMAT
+
+## Work Form
+
+Lecture/tutorial (hoorcollege/werkcollege)
@@ -0,0 +1,44 @@
+# Processing Complex Data
+
+This webpage contains all materials for the Methodology and Statistics master course **Processing Complex Data (PCD)**. The materials on this website are [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) licensed.
+
+![cc](https://mirrors.creativecommons.org/presskit/icons/cc.svg) ![by](https://mirrors.creativecommons.org/presskit/icons/by.svg)
+
+## About the course
+
+Contrary to what most introductory data science and statistics courses teach, real-world scientific data come in an enormous variety of formats, sizes, structures, and procedures — from simple tables to spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, and domain-specific scientific standards. This course gives students hands-on experience with handling, processing, and modelling six families of complex data, in a hackathon-style format where each group goes deep on one data type and teaches the rest of the class.
+
+The narrative spine of the course is *from raw traces to defensible claims*. Each group works through a single pipeline: raw source → operationalized clean object → baseline model with one sensitivity check → presentation.
+
+## Course materials
+
+- [Course manual](course_manual.md) — official course description, learning goals, assessment, and materials.
+- [Project guidelines](project_guidelines.md) — workflow files, raw-data policy, decision logs, quality checks, and contribution tracking.
+
+## Lectures
+
+| Week | Title | Lecture |
+| :--- | :---- | :------ |
+| 1 | What Makes Data Complex? | TBD |
+| 2 | From Complex Data to Clean Data | TBD |
+| 3 | Scaling Up Modeling | TBD |
+| 4 | Communicating Research | TBD |
+| 5 | Presentations | — |
+| 6 | Short Report | — |
+
+## Group projects
+
+There are six project variants. Each group works through the same four-week workflow (`week1_explore.qmd`, `week2_operationalize_clean.qmd`, `week3_model.qmd`, `week4_storytelling.qmd`) on its own data family, and teaches the same six dimensions to the rest of the class: **data structure, storage system, file formats, encoding, model, and key aspects**.
+
+| Variant | Project description | Example research question |
+| :------ | :------------------ | :------------------------ |
+| Geospatial | [`projects/geospatial.md`](projects/geospatial.md) | What is the relation between municipal land use and population composition? |
+| Networks | [`projects/networks.md`](projects/networks.md) | What is the relationship between gender and cross-program relations in high school? |
+| Messy web text | [`projects/messy_web_text.md`](projects/messy_web_text.md) | Do company sustainability pages differ linguistically from public-interest climate information pages? |
+| Relational database | [`projects/relational_database.md`](projects/relational_database.md) | Which driver, constructor, grid, circuit, and season characteristics are associated with F1 finishing points? |
+| Time series | [`projects/time_series.md`](projects/time_series.md) | How does an fMRI signal change across NSD scan sessions? |
+| API data | [`projects/api_data.md`](projects/api_data.md) | Which study attributes are associated with completed versus ongoing clinical trials? |
+
+## Assessment
+
+Assessment is based on the group project, which runs for the full duration of the course. The project grade is the final course grade. Each group submits a short structured report and gives a final presentation; both are evaluated on the raw-to-clean-to-model pipeline, the methodological choices, and the limits of the claim.
@@ -0,0 +1,81 @@
+# Project Guidelines
+
+These guidelines apply to all project variants in **Processing Complex Data**.
+
+- Team size: 4 students.
+- Level: 1st-year master.
+- Language priority: R + Quarto (`.qmd`). Use Python only when a raw format is substantially easier there.
+- Difficulty target: all variants should feel comparable in scope and workload.
+
+## Project Management
+
+> [!NOTE]
+> Remember that you need to write down how you each contribute.
+
+- How will you work together? In-person meetings? A Teams channel? Email?
+- How will you write the report? Microsoft Office? Overleaf / LaTeX? Quarto? Google Docs? You decide.
+- The projects are very short: be strict on your deadlines. Also help each other out.
+- Assign weekly roles: coordinator, data lead, modeling lead, and presenter/reviewer.
+- Keep a short decision log inside each weekly `.qmd` under `Scope choices`.
+- Record how work was divided; contribution tracking is part of the project.
+- Use short weekly deadlines because the project is intentionally scoped for fast iteration.
+
+## Exploration and Data
+
+> [!NOTE]
+> Start trying to implement small solutions for subquestions quickly.
+
+## Research Question Operationalization
+
+> [!NOTE]
+> Keep it small and do not be afraid to make choices; the projects are short.
+
+## Modeling and Report
+
+> [!NOTE]
+> Make things understandable for your audience: your peers.
+
+## Raw-Data Policy
+
+- Do not start from a pre-cleaned analysis file committed by the instructor.
+- Week 1 and Week 2 should begin from raw website, repository, API, database, or scientific-format data.
+- It is fine to create a clean analysis object inside Week 2 code, but that cleaning must happen inside the student workflow.
+- Prefer complex formats over plain CSV when a realistic raw source exists.
+- At least one variant should explicitly discuss Parquet as a columnar storage alternative when contrasting storage choices.
+
+## Reproducible Structure
+
+- What is a good folder structure for collaboration and sharing? Tip: get inspiration from the [ODISSEI-SODA guide to sharing research code](https://odissei-soda.nl/tutorials/share-your-reserarch-code/).
+- Each variant must have exactly four executable workflow files:
+  - `week1_explore.qmd`
+  - `week2_operationalize_clean.qmd`
+  - `week3_model.qmd`
+  - `week4_storytelling.qmd`
+- The weeks form one pipeline rather than four unrelated notebooks.
+- `week1_explore.qmd` should download the raw data if they are not already present and then explore those raw files.
+- `week2_operationalize_clean.qmd` should read the raw files and write `data/model_data.rds` for the Week 3 model.
+- `week3_model.qmd` should read `data/model_data.rds`, fit the model, and write `data/model_results.rds`.
+- `week4_storytelling.qmd` should read `data/model_results.rds` and turn those saved results into presentation-ready figures.
+- Aim to keep executable code under about 100 non-empty lines per file.
+- Week 2 is the exception: preprocessing does not need to be artificially short, as long as the file stays compact and readable.
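
The structural rules in this list (exactly four workflow files, a soft cap of about 100 non-empty code lines) can be checked mechanically. Below is a minimal sketch in Python, standard library only. It is not part of the course tooling. It assumes that "executable code" means lines inside Quarto chunks opened with three backticks followed by `{`, that chunk option lines and comments count as code, and that the Week 2 file is exempt from the cap. The names `count_code_lines`, `check_variant`, and `EXEMPT` are hypothetical.

```python
import re
from pathlib import Path

# The four workflow files named in the guidelines.
WORKFLOW_FILES = (
    "week1_explore.qmd",
    "week2_operationalize_clean.qmd",
    "week3_model.qmd",
    "week4_storytelling.qmd",
)

# Assumption: Week 2 preprocessing is exempt from the length guideline.
EXEMPT = ("week2_operationalize_clean.qmd",)

# An executable chunk opens with three or more backticks followed by "{"
# and closes with a line that contains only backticks.
CHUNK_OPEN = re.compile(r"^\s*`{3,}\s*\{")
CHUNK_CLOSE = re.compile(r"^\s*`{3,}\s*$")


def count_code_lines(qmd_text):
    """Count non-empty lines inside executable chunks of a .qmd document."""
    in_chunk = False
    count = 0
    for line in qmd_text.splitlines():
        if not in_chunk and CHUNK_OPEN.match(line):
            in_chunk = True
        elif in_chunk and CHUNK_CLOSE.match(line):
            in_chunk = False
        elif in_chunk and line.strip():
            count += 1  # includes comments and chunk option lines
    return count


def check_variant(folder, limit=100):
    """Return a list of human-readable problems for one project variant."""
    folder = Path(folder)
    problems = []
    for name in WORKFLOW_FILES:
        path = folder / name
        if not path.is_file():
            problems.append(f"missing: {name}")
            continue
        n = count_code_lines(path.read_text(encoding="utf-8"))
        if name not in EXEMPT and n > limit:
            problems.append(f"{name}: {n} code lines (limit {limit})")
    # "Exactly four" also means no additional .qmd files in the folder.
    for path in sorted(folder.glob("*.qmd")):
        if path.name not in WORKFLOW_FILES:
            problems.append(f"unexpected workflow file: {path.name}")
    return problems
```

Calling `check_variant("path/to/variant")` on a variant folder (the path is a placeholder) returns an empty list when the folder follows the structure, so it can be run before each weekly deadline.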
+
+## Week-by-Week Scope
+
+- Week 1: explain the origin, purpose, storage system, file format, and encoding of the data; download the raw data; and show the first exploratory view of the raw source.
+- Week 2: define whether the question is about association, prediction, or causal effect, and build one analysis-ready object from the raw source that is saved for later use.
+- Week 3: fit the baseline model on the saved Week 2 data, evaluate it, and save the model output that Week 4 will visualize.
+- Week 4: explain the model assumptions, turn the saved model output into figures for the presentation, and state the main limitation clearly.
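
To make the Week 3 step concrete, here is a minimal sketch of a baseline fit with one fit metric and one sensitivity check. It is written in plain Python so that it is self-contained; in the course workflow the same idea would normally live in an R chunk. The model (a one-predictor least-squares line), the metric (R squared), and the check (refitting with each observation left out) are illustrative assumptions, not the prescribed method, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variance in y explained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - ss_res / ss_tot,
    }


def leave_one_out_slopes(xs, ys):
    """Sensitivity check: refit the line with each observation removed."""
    xs, ys = list(xs), list(ys)
    return [
        fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])["slope"]
        for i in range(len(xs))
    ]


# Toy data for illustration only (not course data).
x = [0, 1, 2, 3, 4]
y = [1.0, 2.9, 5.2, 7.1, 8.8]
baseline = fit_line(x, y)
slopes = leave_one_out_slopes(x, y)
print(baseline)
print("slope range when one point is dropped:", min(slopes), max(slopes))
```

If the range of leave-one-out slopes is narrow relative to the baseline slope, the estimate does not hinge on any single observation; a wide range is itself a finding worth reporting in Week 4.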
+
+## Quality Checks
+
+- Week 1: report row counts, feature counts, missingness, and one note on data provenance or statistical power.
+- Week 2: test keys, parsing, impossible values, and one alternative cleaning choice.
+- Week 3: report at least one fit metric and one sensitivity or robustness check.
+- Week 4: include one uncertainty statement and one limitation slide or paragraph.
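
The Week 1 and Week 2 checks above can be sketched as a few small functions. This is a hedged illustration in plain Python on a list of dictionaries (one per row), not the course's reference implementation. The missing-value tokens, the column names `id` and `score`, and the valid range are assumptions made for the example.

```python
import math
from collections import Counter

# Assumption: these strings mark missing values in the raw source.
MISSING_TOKENS = {"", "NA"}


def is_missing(value):
    """Treat None, NaN, and the tokens above as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() in MISSING_TOKENS


def week1_summary(rows):
    """Week 1: row count, feature count, and missingness per column."""
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: sum(1 for row in rows if is_missing(row.get(col)))
        for col in columns
    }
    return {"n_rows": len(rows), "n_features": len(columns), "missing": missing}


def duplicate_keys(rows, key):
    """Week 2: key values that occur more than once."""
    counts = Counter(row.get(key) for row in rows)
    return [value for value, n in counts.items() if n > 1]


def out_of_range(rows, column, low, high):
    """Week 2: indices of rows whose value fails to parse as a number
    or lies outside [low, high]. Missing values are skipped."""
    bad = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if is_missing(value):
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            bad.append(i)  # parsing problem
            continue
        if not (low <= number <= high):
            bad.append(i)  # impossible value
    return bad


# Hypothetical rows for illustration only.
rows = [
    {"id": 1, "score": "10"},
    {"id": 2, "score": ""},
    {"id": 2, "score": "-3"},
    {"id": 3, "score": "abc"},
]
print(week1_summary(rows))
print(duplicate_keys(rows, "id"))
print(out_of_range(rows, "score", 0, 100))
```

Keeping each check as its own function makes it easy to rerun the same checks after an alternative cleaning choice and compare the results.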
+
+## Documentation Minimum
+
+- Keep source links in the project description `.md` files.
+- Keep operational definitions in Week 2.
+- Keep model interpretation and caveats in Week 4.
+- If AI assistance was used, state where it entered the workflow.