| 1 | +# Project Guidelines |
| 2 | + |
| 3 | +These guidelines apply to all project variants in **Processing Complex Data**. |
| 4 | + |
| 5 | +- Team size: 4 students. |
| 6 | +- Level: first-year master's. |
| 7 | +- Language priority: R + Quarto (`.qmd`). Use Python only when a raw format is substantially easier there. |
| 8 | +- Difficulty target: all variants should feel comparable in scope and workload. |
| 9 | + |
| 10 | +## Project Management |
| 11 | + |
| 12 | +> [!NOTE] |
| 13 | +> Remember that you need to write down how you each contribute. |
| 14 | +
|
| 15 | +- How will you work together? In-person meetings? Teams channel? Email? |
| 16 | +- How will you write the report? Microsoft Office? Overleaf / LaTeX? Quarto? Google Docs? You decide. |
| 17 | +- The projects are very short: be strict about your deadlines, and help each other out. |
| 18 | +- Assign weekly roles: coordinator, data lead, modeling lead, and presenter/reviewer. |
| 19 | +- Keep a short decision log inside each weekly `.qmd` under `Scope choices`. |
| 20 | +- Record how work was divided; contribution tracking is part of the project. |
| 21 | +- Use short weekly deadlines because the project is intentionally scoped for fast iteration. |
| 22 | + |
| 23 | +## Exploration and Data |
| 24 | + |
| 25 | +> [!NOTE] |
| 26 | +> Start implementing small solutions for subquestions early. |
| 27 | + |
| 28 | +## Research Question Operationalization |
| 29 | + |
| 30 | +> [!NOTE] |
| 31 | +> Keep it small and do not be afraid to make choices: the projects are short. |
| 32 | + |
| 33 | +## Modeling and Report |
| 34 | + |
| 35 | +> [!NOTE] |
| 36 | +> Make things understandable for your audience: your peers. |
| 37 | + |
| 38 | +## Raw-Data Policy |
| 39 | + |
| 40 | +- Do not start from a pre-cleaned analysis file committed by the instructor. |
| 41 | +- Week 1 and Week 2 should begin from raw website, repository, API, database, or scientific-format data. |
| 42 | +- It is fine to create a clean analysis object inside Week 2 code, but that cleaning must happen inside the student workflow. |
| 43 | +- Prefer complex formats over plain CSV when a realistic raw source exists. |
| 44 | +- At least one variant should explicitly discuss Parquet as a columnar storage alternative when contrasting storage choices. |
| 45 | + |
| 46 | +## Reproducible Structure |
| 47 | + |
| 48 | +- What is a good folder structure for collaboration and sharing? Tip: get inspiration from the [ODISSEI-SODA guide to sharing research code](https://odissei-soda.nl/tutorials/share-your-reserarch-code/). |
| 49 | +- Each variant must have exactly four executable workflow files: |
| 50 | + - `week1_explore.qmd` |
| 51 | + - `week2_operationalize_clean.qmd` |
| 52 | + - `week3_model.qmd` |
| 53 | + - `week4_storytelling.qmd` |
| 54 | +- The weeks form one pipeline rather than four unrelated notebooks. |
| 55 | +- `week1_explore.qmd` should download the raw data if they are not already present and then explore those raw files. |
| 56 | +- `week2_operationalize_clean.qmd` should read the raw files and write `data/model_data.rds` for the Week 3 model. |
| 57 | +- `week3_model.qmd` should read `data/model_data.rds`, fit the model, and write `data/model_results.rds`. |
| 58 | +- `week4_storytelling.qmd` should read `data/model_results.rds` and turn those saved results into presentation-ready figures. |
| 59 | +- Aim to keep executable code under about 100 non-empty lines per file. |
| 60 | +- Week 2 preprocessing may exceed that guideline where the cleaning requires it, but most files should stay compact and readable. |
| 61 | + |
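The hand-offs between the four weekly files can be sketched as follows. This is a minimal illustration in base R, not a required implementation: the `fetch_raw()` helper, the example URL, and the `mtcars` stand-in for cleaned data are all assumptions made for the sketch. Only the `data/model_data.rds` and `data/model_results.rds` paths come from the guidelines.

```r
# Sketch of the Week 1 -> Week 4 hand-offs using base R only.
# fetch_raw(), the URL, and the mtcars stand-in are hypothetical placeholders.

data_dir <- "data"
dir.create(data_dir, showWarnings = FALSE)

# Week 1: download the raw file only if it is not already on disk.
fetch_raw <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}
# Example call (hypothetical URL, left commented out so the sketch runs offline):
# raw_file <- fetch_raw("https://example.org/raw.json", file.path(data_dir, "raw.json"))

# Week 2: clean the raw data and save one analysis-ready object.
model_data <- mtcars[, c("mpg", "wt", "hp")]  # stand-in for the cleaned raw data
saveRDS(model_data, file.path(data_dir, "model_data.rds"))

# Week 3: read the saved object, fit a baseline model, save the results.
model_data <- readRDS(file.path(data_dir, "model_data.rds"))
fit <- lm(mpg ~ wt + hp, data = model_data)
model_results <- list(
  coefficients = coef(summary(fit)),
  r_squared    = summary(fit)$r.squared
)
saveRDS(model_results, file.path(data_dir, "model_results.rds"))

# Week 4: read the saved results and plot from them; no refitting needed.
model_results <- readRDS(file.path(data_dir, "model_results.rds"))
```

Because each week reads only what the previous week saved, any single file can be re-rendered without rerunning the others, provided its input file already exists.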
| 62 | +## Week-by-Week Scope |
| 63 | + |
| 64 | +- Week 1: download the raw data; explain its origin, purpose, storage system, file format, and encoding; and show the first exploratory view of the raw source. |
| 65 | +- Week 2: define whether the question is about association, prediction, or causal effect, and build one analysis-ready object from the raw source that is saved for later use. |
| 66 | +- Week 3: fit the baseline model on the saved Week 2 data, evaluate it, and save the model output that Week 4 will visualize. |
| 67 | +- Week 4: explain the model assumptions, turn the saved model output into figures for the presentation, and state the main limitation clearly. |
| 68 | + |
| 69 | +## Quality Checks |
| 70 | + |
| 71 | +- Week 1: report row counts, feature counts, missingness, and one note on data provenance or power. |
| 72 | +- Week 2: test keys, parsing, impossible values, and one alternative cleaning choice. |
| 73 | +- Week 3: report at least one fit metric and one sensitivity or robustness check. |
| 74 | +- Week 4: include one uncertainty statement and one limitation slide or paragraph. |
| 75 | + |
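The Week 1 and Week 2 checks listed above could look roughly like this in base R. The data frame, its column names, and the age bounds are invented for illustration; each variant would substitute its own data and its own definition of an impossible value.

```r
# Stand-in data frame; replace with the variant's raw or cleaned data.
df <- data.frame(
  id    = c(1, 2, 3, 4),
  age   = c(34, NA, 51, 29),
  score = c(0.8, 0.4, NA, 0.9)
)

# Week 1: row count, feature count, and share of missing values per column.
n_rows      <- nrow(df)
n_features  <- ncol(df)
missingness <- colMeans(is.na(df))

# Week 2: keys must be unique and non-missing.
stopifnot(anyDuplicated(df$id) == 0, !anyNA(df$id))

# Week 2: flag impossible values (bounds are an assumption for this example).
# which() skips NA comparisons, so missing ages are not flagged here.
impossible_age <- which(df$age < 0 | df$age > 120)
```

Using `stopifnot()` makes the `.qmd` fail to render when a key check breaks, which is usually preferable to silently carrying bad rows into Week 3.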
| 76 | +## Documentation Minimum |
| 77 | + |
| 78 | +- Keep source links in the project description `.md` files. |
| 79 | +- Keep operational definitions in Week 2. |
| 80 | +- Keep model interpretation and caveats in Week 4. |
| 81 | +- If AI assistance was used, state where it entered the workflow. |