26 changes: 26 additions & 0 deletions backend/scripts/import_alma_json_to_d1.py
@@ -497,6 +497,32 @@ def _get_or_create_lecturer(lecturer_id_by_name: dict[str, int], name: str) -> i
JOIN study_areas AS sa ON sa.code = je.value
WHERE f."key" = '_categories_json';

-- Some programs expose study-area membership under codes that differ from the
-- seeded study_areas.code: M.Sc. Machine Learning detail pages use MACH-*
-- (seeded as ML-*), and B.Sc. Informatik Wahlpflicht modules appear as their
-- INFM module numbers. Map those aliases so cross-listed courses still link to
-- the right study area. The alias destination is scoped to its program
-- (study_areas.code is only unique per program; the B.Sc. codes PRAK/THEO/
-- TECH/INFO are deliberately generic), and the original scraped code is kept
-- as source_code.
INSERT OR IGNORE INTO course_study_area_links (course_id, study_area_id, source_code)
SELECT f.course_id, sa.id, je.value
FROM course_fields AS f
JOIN json_each(f.value) AS je
JOIN (
SELECT 'MACH-FML' AS src, 'MSC_ML_2021' AS prog, 'ML-FOUND' AS dst
UNION ALL SELECT 'MACH-DTML', 'MSC_ML_2021', 'ML-DIVERSE'
UNION ALL SELECT 'MACH-GCS', 'MSC_ML_2021', 'ML-CS'
UNION ALL SELECT 'MACH-EP', 'MSC_ML_2021', 'ML-EXP'
UNION ALL SELECT 'INFM3110', 'BSC_INFO_2021', 'PRAK'
UNION ALL SELECT 'INFM3410', 'BSC_INFO_2021', 'THEO'
UNION ALL SELECT 'INFM3310', 'BSC_INFO_2021', 'TECH'
UNION ALL SELECT 'INFM2510', 'BSC_INFO_2021', 'INFO'
) AS alias ON alias.src = je.value
JOIN study_programs AS sp ON sp.code = alias.prog
JOIN study_areas AS sa ON sa.program_id = sp.id AND sa.code = alias.dst
WHERE f."key" = '_categories_json';

INSERT OR IGNORE INTO course_curriculum_matches (course_id, module_id, match_type, confidence)
SELECT f.course_id, cm.id, 'category_code', 0.9
FROM course_fields AS f
99 changes: 99 additions & 0 deletions data_collection/CLAUDE.md
@@ -0,0 +1,99 @@
# CLAUDE.md — data_collection

Project instructions for the ALMA course-catalog scraper. Read this before
changing anything under `data_collection/`. For runnable commands see
[`QUICKSTART.md`](QUICKSTART.md); for environment setup see [`SETUP.md`](SETUP.md).

## What this is

A standalone Python scraper for the public ALMA course catalog at
`alma.uni-tuebingen.de`. It crawls the JSF catalog tree, fetches course detail
pages, and writes a JSON file. That JSON is then turned into the D1 seed by
`backend/scripts/import_alma_json_to_d1.py` (a separate step — the scraper does
not touch the database).

- `alma/scraper.py` — `AlmaScraper` (session, JSF navigation, parsing) + pure
parse helpers.
- `alma/cli.py` — argparse entry point (`python -m alma.cli`), single-period and
multi-period orchestration.
- Output: `output/<timestamp>/courses_multi_semester.json` (multi-period) or
`courses.json` (single).

## The two catalog trees (important)

ALMA exposes the same courses through two different trees of the public
`showCourseCatalog-flow`:

1. **VVZ** — "Gesamtverzeichnis Lehrveranstaltungen Informatik". A flat-ish
per-faculty listing. Includes department-wide offerings (Oberseminare,
Kolloquien, info events, Mathe-Vorkurs) that are **not** tied to any degree
module. None of those award ECTS.
2. **studiesOffered** — degree programs (B.Sc./M.Sc. ...). Each program tree is
`[Modul] <study-area> → [Veranstaltungskonto] → [Veranstaltungsgruppe] (N CP)
→ [Veranstaltung]`. It lists courses **cross-listed from other faculties**
(KOG, GTCNEURO, MEDZ, BIOINF) that count toward a study area but are absent
from the VVZ Informatik branch.

Neither tree is a superset of the other, so the scraper crawls **both**: the VVZ
branch (`INFORMATICS_BRANCH_CHAIN`) plus the degree-program branches
(`PROGRAM_BRANCH_CHAINS`: M.Sc. CS, B.Sc. Informatik, M.Sc. ML). Courses are
deduplicated by `unit_id`; `ScrapeOptions.skip_unit_ids` stops later branches
from re-fetching detail pages a previous branch already got. `--no-programs`
falls back to VVZ-only. See `cli._scrape_period_branches`.
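
The deduplication described above can be sketched as follows. This is a minimal illustration, not the code in `cli._scrape_period_branches`: `crawl_branches`, `fake_fetch_detail`, and the unit ids `u1`..`u3` are hypothetical names, and the `merged` dict stands in for whatever the scraper passes as `ScrapeOptions.skip_unit_ids`.

```python
from collections.abc import Callable


def crawl_branches(
    branches: list[tuple[str, list[str]]],
    fetch_detail: Callable[[str], dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Visit branches in order and fetch each unit_id's detail page at most once."""
    merged: dict[str, dict[str, str]] = {}
    for _branch_name, unit_ids in branches:
        for unit_id in unit_ids:
            # Ids already fetched by an earlier branch act as the skip set.
            if unit_id in merged:
                continue
            merged[unit_id] = fetch_detail(unit_id)
    return merged


# Hypothetical stand-in for the real detail-page request.
fetched: list[str] = []


def fake_fetch_detail(unit_id: str) -> dict[str, str]:
    fetched.append(unit_id)
    return {"unit_id": unit_id}


courses = crawl_branches(
    [("VVZ Informatik", ["u1", "u2"]), ("M.Sc. ML", ["u2", "u3"])],
    fake_fetch_detail,
)
print(sorted(courses))  # every unit_id appears once
print(fetched)          # "u2" was only fetched for the first branch
```

Crawling the VVZ branch first means the program branches only pay for detail pages of courses the VVZ branch did not list, which is where the cross-listed courses come from.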

The logged-in "Studienplaner mit Modulplan" (`studyPlanner-flow`) returns **403
anonymously** — do not target it. The studiesOffered tree above is the
anonymously-reachable equivalent.

## Period ids ↔ semesters

Period ids are opaque ALMA ints; the mapping is **not** chronological by number:

| id | semester | id | semester |
|----|----------|----|----------|
| 225 | SoSe 2022 | 233 | WiSe 2022/23 |
| 226 | SoSe 2023 | 234 | WiSe 2023/24 |
| 227 | SoSe 2024 | 235 | WiSe 2024/25 |
| 228 | SoSe 2025 | 236 | WiSe 2025/26 |
| 229 | SoSe 2026 | | |

`--from-semester LABEL` selects every period at or after `LABEL`
(`parse_semester_tuple` understands e.g. `"Sommer 2026"`, `"Winter 2022/23"`).
Deep-path `title:NNNN` ids differ per period, so branches are rediscovered each
period by title chain via `find_branch_permalink`.
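
To illustrate why period ids cannot be compared numerically, here is a small sketch of selecting periods "at or after" a label. It assumes a hypothetical helper `semester_key` that returns a `(year, term)` tuple; the real `parse_semester_tuple` may use a different shape and accept different spellings. The id/label pairs are copied from the table above.

```python
import re

# Period ids and labels from the table above.
PERIODS: dict[int, str] = {
    225: "SoSe 2022", 233: "WiSe 2022/23",
    226: "SoSe 2023", 234: "WiSe 2023/24",
    227: "SoSe 2024", 235: "WiSe 2024/25",
    228: "SoSe 2025", 236: "WiSe 2025/26",
    229: "SoSe 2026",
}

# Assumption: both the long ("Sommer"/"Winter") and short ("SoSe"/"WiSe") forms occur.
_LABEL_RE = re.compile(r"^(Sommer|SoSe|Winter|WiSe)\s+(\d{4})(?:/\d{2})?$")


def semester_key(label: str) -> tuple[int, int]:
    """Return (start_year, term) with term 0 = summer, 1 = winter, so tuples sort chronologically."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized semester label: {label!r}")
    term, year = match.group(1), int(match.group(2))
    return (year, 0 if term in ("Sommer", "SoSe") else 1)


def periods_from(label: str) -> list[int]:
    """Return period ids at or after `label`, in chronological order."""
    start = semester_key(label)
    selected = [
        (semester_key(period_label), period_id)
        for period_id, period_label in PERIODS.items()
        if semester_key(period_label) >= start
    ]
    return [period_id for _, period_id in sorted(selected)]


print(periods_from("Winter 2024/25"))
```

Sorting by the parsed tuple rather than by id is what keeps `235` (WiSe 2024/25) ahead of `228` (SoSe 2025).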

## Study-area attribution (how courses link to INFO-INFO etc.)

Each course detail page has a "Module / Studiengänge" table; the scraper stores
those codes as the `_categories_json` course field. The importer joins them to
`study_areas.code`. Codes mostly match directly (M.Sc. CS: `INFO-BASIS`,
`INFO-FOKUS`, `INFO-INFO`, `INFO-PRAK`, `INFO-TECH`, `INFO-THEO`), but some need
aliasing — handled in `import_alma_json_to_d1.py`:

- M.Sc. ML detail pages use `MACH-*`; seeded study areas are `ML-*`
(`MACH-FML→ML-FOUND`, `MACH-DTML→ML-DIVERSE`, `MACH-GCS→ML-CS`, `MACH-EP→ML-EXP`).
- B.Sc. Wahlpflicht appears as `INFM####` (`INFM3110→PRAK`, `INFM3410→THEO`,
`INFM3310→TECH`, `INFM2510→INFO`).

Enumeration is the hard part: once a cross-listed course is scraped and its
detail page fetched, the existing category-code join attributes it. B.Sc.
*compulsory* modules (Mathe, Teamprojekt) carry no Wahlpflicht code, so they are
enumerated but not category-linked (known gap).
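
The alias step can be pictured in plain Python as below. This is only a sketch of the idea; the importer does it in SQL. The alias table is taken from the mappings listed above (with the program codes used in the importer), while `STUDY_AREAS`, its ids, `OTHER_PROGRAM`, and `alias_links` are hypothetical.

```python
# scraped code -> (program code, seeded study-area code)
ALIASES: dict[str, tuple[str, str]] = {
    "MACH-FML": ("MSC_ML_2021", "ML-FOUND"),
    "MACH-DTML": ("MSC_ML_2021", "ML-DIVERSE"),
    "MACH-GCS": ("MSC_ML_2021", "ML-CS"),
    "MACH-EP": ("MSC_ML_2021", "ML-EXP"),
    "INFM3110": ("BSC_INFO_2021", "PRAK"),
    "INFM3410": ("BSC_INFO_2021", "THEO"),
    "INFM3310": ("BSC_INFO_2021", "TECH"),
    "INFM2510": ("BSC_INFO_2021", "INFO"),
}

# Hypothetical seeded rows: study_area_id -> (program code, study-area code).
STUDY_AREAS: dict[int, tuple[str, str]] = {
    1: ("MSC_ML_2021", "ML-FOUND"),
    2: ("BSC_INFO_2021", "THEO"),
    3: ("OTHER_PROGRAM", "THEO"),  # made-up program reusing the generic code
}


def alias_links(scraped_code: str) -> list[tuple[int, str]]:
    """Return (study_area_id, source_code) pairs produced by the alias step."""
    target = ALIASES.get(scraped_code)
    if target is None:
        return []  # not an alias; the direct code join handles these
    # Match on (program, code), not code alone, because codes repeat across programs.
    return [
        (area_id, scraped_code)
        for area_id, key in STUDY_AREAS.items()
        if key == target
    ]


print(alias_links("INFM3410"))
print(alias_links("MACH-FML"))
```

Scoping the destination to its program is what keeps `INFM3410` from linking to a `THEO` area in an unrelated program, and returning the scraped code alongside the id mirrors keeping it as `source_code`.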

## Gotchas

- **Mojibake**: ALMA text often arrives as UTF-8 bytes mis-decoded as cp1252. Use
`repair_mojibake` before comparing/printing titles; never assume clean text.
- **Politeness**: keep `polite_delay` between requests; do not parallelize.
- **Progress**: `tqdm` shows an outer "semesters" bar and an inner per-branch
detail bar. Log lines go through `tqdm.write` so they don't corrupt the bars;
`--quiet` disables both. `progress.json` is still written every course.
- Coverage is scoped to the three Informatik programs above. Adding e.g.
Medieninformatik (which is where `User Experience` lives) is a one-entry
addition to `PROGRAM_BRANCH_CHAINS` plus any needed code aliases.
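
For the mojibake point above, the usual repair is a round trip through the wrong codec. The sketch below shows that technique under the assumption that the damage is exactly "UTF-8 bytes decoded as cp1252"; `repair_mojibake_sketch` is a hypothetical name and the scraper's own `repair_mojibake` may be implemented differently.

```python
def repair_mojibake_sketch(text: str) -> str:
    """Undo UTF-8-read-as-cp1252 damage; return the input unchanged if it does not fit that pattern."""
    try:
        # Re-encode with the codec that was wrongly applied, then decode as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Clean text with non-ASCII characters (or characters outside cp1252)
        # fails the round trip, so it is left as-is.
        return text


# Build a garbled title the same way the damage happens.
clean = "Studiengänge"
garbled = clean.encode("utf-8").decode("cp1252")

print(repair_mojibake_sketch(garbled) == clean)
print(repair_mojibake_sketch(clean) == clean)
```

The round trip is a heuristic: a few byte values have no cp1252 mapping in Python, so some garbled strings cannot be re-encoded and come back unchanged.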

## Conventions

Follow the repo-wide `AGENTS.md`. Python: explicit type hints, small pure
helpers, comments explain *why*. New scraper logic should be exercised by a real
run against one period before merging (no DB write needed).
31 changes: 19 additions & 12 deletions data_collection/QUICKSTART.md
@@ -11,7 +11,7 @@

2. **Run scraper:**
```powershell
uv run python -m alma_scraper.cli --details
uv run python -m alma.cli --details
```

### Option 2: Using `pip`
@@ -29,7 +29,7 @@

3. **Run scraper:**
```powershell
python -m alma_scraper.cli --details
python -m alma.cli --details
```

## Usage
@@ -40,7 +40,7 @@ Scrape the Informatik course catalog (Gesamtverzeichnis Lehrveranstaltungen
Informatik) with course details:

```powershell
uv run python -m alma_scraper.cli --details
uv run python -m alma.cli --details
```

Each course detail includes a `categories` list — the module/study-program
@@ -51,22 +51,29 @@ Output: `output/YYYY-MM-DD_HH-MM-SS/courses.json`
### Multiple semesters

Scrape every semester from a given label up to the most recent. Per-period
the scraper switches via ALMA's Semesterauswahl dropdown and rediscovers the
Informatik branch by title chain (the deep-path IDs differ between
semesters):
the scraper switches via ALMA's Semesterauswahl dropdown and rediscovers each
branch by title chain (the deep-path IDs differ between semesters):

```powershell
uv run python -m alma_scraper.cli --details --from-semester "Sommer 2022"
uv run python -m alma.cli --details --from-semester "Sommer 2022"
```

In multi-period mode the scraper crawls the VVZ "Gesamtverzeichnis
Lehrveranstaltungen Informatik" branch **and** the degree-program branches
(M.Sc. Computer Science, B.Sc. Informatik, M.Sc. Machine Learning). The
program branches surface courses cross-listed from other faculties that count
toward a study area but are missing from the VVZ branch. Courses shared
between branches are deduplicated by `unit_id` and their detail pages are
fetched only once. Pass `--no-programs` to crawl the VVZ branch alone.

Each course in the output gets `period_id` and `period_label` fields so you
can tell semesters apart. The output file is rewritten after every period,
so an interrupted run still leaves a usable file.

If a run was interrupted, resume it without redoing the completed semesters:

```powershell
uv run python -m alma_scraper.cli --details --continue output/<timestamp>/courses_multi_semester.json
uv run python -m alma.cli --details --continue output/<timestamp>/courses_multi_semester.json
```

Fully completed periods are kept and skipped; partial or skipped ones are
@@ -75,23 +82,23 @@ redone. Output is written back to the same path.
### List available semesters

```powershell
uv run python -m alma_scraper.cli --list-periods
uv run python -m alma.cli --list-periods
```

### Quick Test (2 minutes)

Test scraping:

```powershell
uv run python -m alma_scraper.cli --details --max-runtime-seconds 120
uv run python -m alma.cli --details --max-runtime-seconds 120
```

### Full Catalog

Scrape the entire university catalog:

```powershell
uv run python -m alma_scraper.cli --full-catalog
uv run python -m alma.cli --full-catalog
```

### Watch Progress
@@ -122,4 +129,4 @@ output/
- `--pretty` - Pretty-print JSON
- `--list-periods` - Print available period IDs and labels

For full help: `uv run python -m alma_scraper.cli --help`
For full help: `uv run python -m alma.cli --help`
4 changes: 2 additions & 2 deletions data_collection/SETUP.md
@@ -9,7 +9,7 @@

2. **Run scraper:**
```powershell
uv run python -m alma_scraper.cli --details
uv run python -m alma.cli --details
```

## Option 2: Using `pip` (Virtual Environment)
@@ -27,7 +27,7 @@

3. **Run scraper:**
```powershell
python -m alma_scraper.cli --details
python -m alma.cli --details
```

## Output