feat: faculty-agnostic researcher-tree builder + integrity check #64

Open
ValentinJSchmidt wants to merge 8 commits into main from feat/prof-phd-paper-tree-schema
@ValentinJSchmidt

Collaborator

Summary

Turns the OpenAlex index builder into a script the agent can run to build or check any faculty's researcher tree, instead of being tied to hardcoded Tübingen paths and a fixed set of columns.

Part of #48 (tree schema + referential integrity). The PhD-level fields (role / advisor edge) are left for a follow-up.

What changed

  • Build any faculty: new flags --researchers-index, --chairs-index, --papers-dir point the builder at any faculty's files. Defaults still point at the bundled Tübingen data.
  • No more fixed columns: tables are now read by column name, not position. A faculty can add or reorder descriptive columns without breaking the readers; only the slug/link columns are required.
  • Integrity check: new --validate-only mode checks that every link resolves (chair→researcher, researcher→chair, paper→person, paper→chair) and exits with an error code if anything is orphaned.
  • CI guard: skills/tests/test_tree_integrity.py runs the check on the real tree and proves it fails on broken links.
  • Docs: schema doc, update-openalex-paper-index/SKILL.md, and AGENTS.md explain the rules and how to reuse the builder for a new faculty.
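To illustrate the "read by column name, not position" point above, here is a minimal sketch of a name-keyed Markdown table reader. It is not the code in this PR: the function name `parse_markdown_table`, the column names (`slug`, `name`, `chair`, `office`) and the sample rows are all assumptions made up for the example.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse the first pipe table in `text` into row dicts keyed by header name.

    Hypothetical helper for illustration; escaped pipes inside cells are not handled.
    """
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            if header is not None:
                break  # the table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe row names the columns
            continue
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as | --- | :---: |
        rows.append(dict(zip(header, cells)))
    return rows


# Two made-up layouts of the same record: B reorders columns and adds "office".
TABLE_A = """
| slug | name | chair |
| --- | --- | --- |
| jane-doe | Jane Doe | example-chair |
"""

TABLE_B = """
| name | office | chair | slug |
|---|---|---|---|
| Jane Doe | B12 | example-chair | jane-doe |
"""

row_a = parse_markdown_table(TABLE_A)[0]
row_b = parse_markdown_table(TABLE_B)[0]
# Lookups by name give the same link values regardless of column order.
print(row_a["slug"] == row_b["slug"], row_a["chair"] == row_b["chair"])
```

Because every lookup goes through the header name, an extra descriptive column only adds a key to each row dict; code that reads `slug` or `chair` is unaffected.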

How to test

  1. Install deps: pip install -e ".[dev]" (or just pip install pytest).
  2. Run the suite: python -m pytest -q → all pass (includes 5 new tree-integrity tests).
  3. Check the real tree: python scripts/update_openalex_index.py --validate-only → prints nothing, exits 0.
  4. See it catch a broken link: copy the data to a temp folder, change a paper's researchers slug to a name that doesn't exist, then run --validate-only pointed at that folder → it prints ERROR: paper '...' references missing person ... and exits 2.
  5. Different faculty: point the --*-index / --papers-dir flags at another folder of the same Markdown shape → builds/validates with no code change.
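Step 4 can be pictured with a small in-memory sketch of the four link checks (chair→researcher, researcher→chair, paper→person, paper→chair). This is an illustration under assumptions, not the builder's actual implementation: `find_orphans`, `exit_code`, the dict shapes and the slugs are hypothetical, and the message wording is only modelled on the error line quoted in step 4. The exit code 2 is taken from that step.

```python
import copy


def find_orphans(chairs, researchers, papers):
    """Return one error string per link that does not resolve (hypothetical helper).

    chairs:      chair slug -> list of researcher slugs
    researchers: researcher slug -> chair slug
    papers:      paper slug -> {"researchers": [slugs], "chair": slug}
    """
    errors = []
    for chair, members in chairs.items():
        for person in members:
            if person not in researchers:
                errors.append(f"ERROR: chair '{chair}' references missing person '{person}'")
    for person, chair in researchers.items():
        if chair not in chairs:
            errors.append(f"ERROR: researcher '{person}' references missing chair '{chair}'")
    for paper, links in papers.items():
        for person in links["researchers"]:
            if person not in researchers:
                errors.append(f"ERROR: paper '{paper}' references missing person '{person}'")
        if links["chair"] not in chairs:
            errors.append(f"ERROR: paper '{paper}' references missing chair '{links['chair']}'")
    return errors


def exit_code(errors):
    """Map the error list to a process exit status: 0 if clean, 2 if anything is orphaned."""
    return 2 if errors else 0


# Made-up tree in which every link resolves.
GOOD_TREE = {
    "chairs": {"example-chair": ["jane-doe"]},
    "researchers": {"jane-doe": "example-chair"},
    "papers": {"paper-one": {"researchers": ["jane-doe"], "chair": "example-chair"}},
}

# Same tree with one paper pointing at a person who does not exist.
BROKEN_TREE = copy.deepcopy(GOOD_TREE)
BROKEN_TREE["papers"]["paper-one"]["researchers"] = ["no-such-person"]

for tree in (GOOD_TREE, BROKEN_TREE):
    found = find_orphans(tree["chairs"], tree["researchers"], tree["papers"])
    for message in found:
        print(message)
    print("exit code:", exit_code(found))
```

Collecting every orphan before choosing the exit status means a single run reports all broken links at once rather than stopping at the first.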

🤖 Generated with Claude Code

ValentinJSchmidt and others added 8 commits June 22, 2026 23:00
Captures current PhD rosters for the three pilot chairs as the recall
benchmark for PhD discovery (#47):
- Autonomous Learning / Distributed Intelligence (Georg Martius)
- Theory of Machine Learning (Ulrike von Luxburg)
- Autonomous Vision Group (Andreas Geiger)

Rosters were captured from each chair's official team page on
2026-06-22 via automated fetch. NOT final: entries still need
human verification against the live pages before this is treated as
authoritative ground truth. status values active/incoming/associated/
former are documented in the file header; only 'active' counts as a
recall target. Postdocs, research engineers and admin staff excluded.

Refs #46

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ground-truth PhD fixture for the 3 pilot chairs drafted and
committed as WIP; pending human verification.

Refs #46

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Former PhD alumni are removed from the ground-truth roster entirely.
People listed under a chair's 'Researcher' role are now treated as
active PhDs (Martius: Kloss, Kolev, Geist), while research engineers
remain excluded. Martius roster: 20 active, 0 former.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a 'postdoc' status counted as a recall target. von Luxburg gains
5 postdocs (Bhattacharjee, Bordt, König, Thiessen, Waller) confirmed
from the live team page. Martius and Geiger have no current postdoc
section, so none added there. Also backfill profile URLs for the three
Martius Researchers (Kloss, Kolev, Geist) found on the team page.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Turn the OpenAlex index builder into a script the agent can run to build or
validate any faculty's Markdown tree, instead of relying on hardcoded Tuebingen
paths and a fixed column layout.

- add --researchers-index / --chairs-index / --papers-dir to target any faculty
- parse tables by column name, so extra faculty columns do not break readers
- add validate_references + --validate-only: checks every chair/researcher/paper
  link resolves and exits non-zero on orphans
- wire the check into CI via skills/tests/test_tree_integrity.py
- document reuse + integrity rules in the schema doc, SKILL.md, and AGENTS.md

PhD-level schema fields (role/advisor edge) are deferred to a follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adding findings/ left setuptools with two top-level dirs (skills, findings)
and no way to choose, so `pip install -e ".[dev]"` failed in CI. The repo is
not an importable package, so declare no py-modules to skip discovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>