Project Guidelines

These guidelines apply to all project variants in Processing Complex Data.

Team size: 4 students.
Level: 1st-year master.
Language priority: R + Quarto (.qmd). Use Python only when a raw format is substantially easier there.
Difficulty target: all variants should feel comparable in scope and workload.

Project Management

Note

Remember to write down how each of you contributes.

How will you work together? In-person meetings? A Teams channel? Email?
How will you write the report? Microsoft Office? Overleaf / LaTeX? Quarto? Google Docs? You decide.
The projects are very short: be strict about your deadlines, and help each other out.
Assign weekly roles: coordinator, data lead, modeling lead, and presenter/reviewer.
Keep a short decision log inside each weekly .qmd under Scope choices.
Record how work was divided; contribution tracking is part of the project.
Use short weekly deadlines because the project is intentionally scoped for fast iteration.

Exploration and Data

Note

Start implementing small solutions for subquestions early.

Research Question Operationalization

Note

Keep it small and do not be afraid to make choices: the projects are short.

Modeling and Report

Note

Make things understandable for your audience: your peers.

Raw-Data Policy

Do not start from a pre-cleaned analysis file committed by the instructor.
Week 1 and Week 2 should begin from raw website, repository, API, database, or scientific-format data.
It is fine to create a clean analysis object inside Week 2 code, but that cleaning must happen inside the student workflow.
Prefer complex formats over plain CSV when a realistic raw source exists.
At least one variant should explicitly discuss Parquet as a columnar storage alternative when contrasting storage choices.

Reproducible Structure

What is a good folder structure for collaboration and sharing? Tip: get inspiration from the ODISSEI-SODA guide to sharing research code.
Each variant must have exactly four executable workflow files:
- week1_explore.qmd
- week2_operationalize_clean.qmd
- week3_model.qmd
- week4_storytelling.qmd
The weeks form one pipeline rather than four unrelated notebooks.
week1_explore.qmd should download the raw data if they are not already present and then explore those raw files.
week2_operationalize_clean.qmd should read the raw files and write data/model_data.rds for the Week 3 model.
week3_model.qmd should read data/model_data.rds, fit the model, and write data/model_results.rds.
week4_storytelling.qmd should read data/model_results.rds and turn those saved results into presentation-ready figures.
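The "download only if not already present" step in Week 1 can be sketched as below. This is a minimal illustration in Python, not a required implementation: the function name `ensure_raw_file` and the injected `fetch` callable are hypothetical, and in R the same idea is usually a `file.exists()` check around `download.file()`.

```python
from pathlib import Path
from typing import Callable


def ensure_raw_file(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write raw bytes to `path` only when the file is missing.

    `fetch` is a hypothetical callable that returns the raw bytes,
    for example a small wrapper around an HTTP download.
    Returns True if a download happened, False if the cached copy was kept.
    """
    if path.exists():
        # Raw data already on disk: do not download again.
        return False
    # Create data/raw (or whatever parent folder is used) on first run.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True
```

Passing the download as a callable keeps the caching logic separate from the source, so re-rendering the Week 1 file does not hit the website or API a second time.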
Aim for under about 100 non-empty lines of executable code per file.
Week 2 preprocessing may run longer when the raw source needs it, but most files should stay compact and readable.
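One way to check the line budget is to count non-empty lines inside executable chunks of a .qmd file. The sketch below is an illustration in Python under stated assumptions: the helper name `count_code_lines` is hypothetical, comment lines count as code, and Quarto chunk-option lines starting with `#|` do not.

```python
import re

# Opening fence of an executable Quarto chunk, e.g. a fence followed by {r}.
CHUNK_OPEN = re.compile(r"^```\{[A-Za-z]")
# A bare fence closes the chunk.
CHUNK_CLOSE = re.compile(r"^```\s*$")


def count_code_lines(qmd_text: str) -> int:
    """Count non-empty lines inside executable chunks of a .qmd document."""
    inside = False
    count = 0
    for line in qmd_text.splitlines():
        if not inside and CHUNK_OPEN.match(line):
            inside = True
        elif inside and CHUNK_CLOSE.match(line):
            inside = False
        elif inside and line.strip() and not line.lstrip().startswith("#|"):
            # Assumption: chunk options (#| ...) are configuration, not code.
            count += 1
    return count
```

To use it, read each weekly file as text (for example with `pathlib.Path.read_text`) and compare the result against the target of about 100.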

Week-by-Week Scope

Week 1: explain the origin, purpose, storage system, file format, and encoding of the data; download the raw data; and show the first exploratory view of the raw source.
Week 2: define whether the question is about association, prediction, or causal effect, and build one analysis-ready object from the raw source that is saved for later use.
Week 3: fit the baseline model on the saved Week 2 data, evaluate it, and save the model output that Week 4 will visualize.
Week 4: explain the model assumptions, turn the saved model output into figures for the presentation, and state the main limitation clearly.
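The Week 3 step (fit a baseline, then evaluate it) can be as small as the sketch below. It is an illustration in Python only: the one-predictor least-squares line, the RMSE metric, and the leave-one-out refit are example choices rather than the required model, and all function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(x, y, slope, intercept):
    """Root mean squared error of the fitted line on the given data."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def leave_one_out_slopes(x, y):
    """Refit with each observation dropped once; a simple robustness check."""
    slopes = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]  # expects plain lists
        ys = y[:i] + y[i + 1:]
        slopes.append(fit_line(xs, ys)[0])
    return slopes
```

If the leave-one-out slopes vary widely, a single observation is driving the result, which is worth reporting alongside the fit metric.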

Quality Checks

Week 1: report row counts, feature counts, missingness, and one note on data provenance or power.
Week 2: test keys, parsing, impossible values, and one alternative cleaning choice.
Week 3: report at least one fit metric and one sensitivity or robustness check.
Week 4: include one uncertainty statement and one limitation slide or paragraph.
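The Week 1 and Week 2 checks above can be sketched as two small functions. This is a Python illustration under assumptions: rows are held as a list of dictionaries, missing values are `None` or an empty string, and the names `profile_rows`, `check_rows`, and `valid_ranges` are hypothetical.

```python
def profile_rows(rows):
    """Week 1 style summary: row count, feature count, missingness per column."""
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    return {"n_rows": len(rows), "n_features": len(columns), "missing": missing}


def check_rows(rows, key, valid_ranges):
    """Week 2 style checks: duplicate keys and values outside plausible ranges.

    `valid_ranges` maps a column name to an inclusive (low, high) pair.
    """
    seen, duplicates, impossible = set(), [], []
    for row in rows:
        k = row.get(key)
        if k in seen:
            duplicates.append(k)
        seen.add(k)
        for col, (low, high) in valid_ranges.items():
            value = row.get(col)
            # Missing values are reported by profile_rows, not flagged here.
            if value not in (None, "") and not (low <= value <= high):
                impossible.append((k, col, value))
    return {"duplicate_keys": duplicates, "impossible_values": impossible}
```

Returning the problems rather than silently dropping them keeps the cleaning decision visible, which also makes it easier to document one alternative cleaning choice.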

Documentation Minimum

Keep source links in the project description .md files.
Keep operational definitions in Week 2.
Keep model interpretation and caveats in Week 4.
If AI assistance was used, state where it entered the workflow.
