isamplesorg · rdhyee · Jun 3, 2026 · Jun 6, 2026 · Jun 6, 2026 · Jun 6, 2026
diff --git a/.github/workflows/pipeline-tests.yml b/.github/workflows/pipeline-tests.yml
@@ -0,0 +1,36 @@
+name: Derived-parquet pipeline tests
+
+# Fast, AI-free gate for the data pipeline. Runs the fixture-based unit tests
+# (no network for data, no large parquet) whenever the pipeline code changes.
+on:
+  pull_request:
+    paths:
+      - "scripts/build_frontend_derived.py"
+      - "scripts/validate_frontend_derived.py"
+      - "tests/test_frontend_derived.py"
+      - "scripts/requirements.txt"
+      - "Makefile"
+      - ".github/workflows/pipeline-tests.yml"
+  push:
+    branches: [main]
+    paths:
+      - "scripts/build_frontend_derived.py"
+      - "scripts/validate_frontend_derived.py"
+      - "tests/test_frontend_derived.py"
+  workflow_dispatch:
+
+jobs:
+  fixture-tests:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Install deps
+        run: pip install -r scripts/requirements.txt
+      - name: Run pipeline fixture tests
+        # builds tiny synthetic wides (WKB BLOB + DuckDB GEOMETRY), runs the real
+        # builder + algebraic validator, asserts the contract. Exits non-zero on
+        # any failure -> PR is blocked.
+        run: python -m pytest tests/test_frontend_derived.py -q
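The algebraic contract this job gates on (`facet_summaries == GROUP BY sample_facets_v2`) can be illustrated in plain Python. This is only a sketch under stated assumptions: the real tests run the DuckDB builder and validator, the rows below are made up, the `facet_type` / `facet_value` / `count` column names are borrowed from the `facet_cross_filter` schema, and skipping NULL facet values is an assumption about the builder rather than something the diff states.

```python
from collections import Counter

# Hypothetical rows standing in for sample_facets_v2 (one row per sample).
sample_facets = [
    {"pid": "p1", "source": "SESAR", "material": "rock"},
    {"pid": "p2", "source": "SESAR", "material": None},
    {"pid": "p3", "source": "GEOME", "material": "rock"},
]


def group_counts(rows, facet_types):
    """Re-derive (facet_type, facet_value) -> count, skipping NULL values (assumed)."""
    counts = Counter()
    for row in rows:
        for facet_type in facet_types:
            value = row[facet_type]
            if value is not None:
                counts[(facet_type, value)] += 1
    return dict(counts)


def check_summaries(summaries, rows, facet_types):
    """Return the keys whose stored count differs from the re-derived count."""
    expected = group_counts(rows, facet_types)
    stored = {(s["facet_type"], s["facet_value"]): s["count"] for s in summaries}
    return sorted(
        key for key in expected.keys() | stored.keys()
        if expected.get(key) != stored.get(key)
    )


# Hypothetical facet_summaries rows that agree with the samples above.
facet_summaries = [
    {"facet_type": "source", "facet_value": "SESAR", "count": 2},
    {"facet_type": "source", "facet_value": "GEOME", "count": 1},
    {"facet_type": "material", "facet_value": "rock", "count": 2},
]

mismatches = check_summaries(facet_summaries, sample_facets, ["source", "material"])
```

An empty `mismatches` list means the stored summary agrees with the re-derivation; a test would assert on that and fail the job otherwise.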
diff --git a/DATA_PROVENANCE.md b/DATA_PROVENANCE.md
@@ -0,0 +1,83 @@
+# iSamples Explorer — Data Provenance
+
+How every parquet file the explorer uses is generated, from root to publish.
+*Reviewed 2026-06-02 (CC, via codebase audit). Complements `SERIALIZATIONS.md` (format/schema reference); this file is the end-to-end build chain + the automation gaps.*
+
+> **Load-bearing constraint:** the **root export cannot be regenerated.** It was produced from the iSamples Central Solr API (`central.isample.xyz`), **offline since Aug 2025**. The Zenodo-archived export is a **frozen root**. Any *new* data (e.g. concept URIs, thumbnails) therefore must come from a **per-source supplementary file merged into the base by `pid`** — the "sidecar" pattern (see Stage 3) — not from re-exporting.
+
+## Pipeline DAG
+
+```
+Source collections (SESAR · OpenContext · GEOME · Smithsonian)
+   │  iSamples Central Solr API  ── OFFLINE since Aug 2025 (cannot re-run) ──┐
+   ▼                                                                         │
+STAGE 0/1  export_client → JSONL → GeoParquet                               │ frozen
+   → isamples_export_*_geo.parquet   (Export format; ~300MB, 6.7M; Zenodo doi:10.5281/zenodo.15278211)
+   ▼
+STAGE 2  pqg/pqg/sql_converter.py  (export → base PQG; 7-stage DuckDB SQL)
+   →  narrow (…_narrow.parquet, ~844MB, 106M rows)   and   wide (…_wide.parquet, ~282MB, 20M rows)
+   ▼
+STAGE 3  sidecar/enrichment merge (LEFT JOIN by pid)        ← Eric's independently-maintained OC PQG (GCS)
+   scripts/enrich_wide_with_oc_thumbnails.py  →  isamples_202604_wide.parquet (+47K thumbnails)
+   ▼
+STAGE 4  wide → frontend derived files  (NOW SCRIPTED: scripts/build_frontend_derived.py)
+   → wide_h3 · h3_summary_res4/6/8 · samples_map_lite · sample_facets_v2 · facet_summaries · facet_cross_filter
+   → vocab_labels  (scripts/build_vocab_labels.py — built separately from SKOS TTLs)
+   → {tag}_manifest.json  (build identity: input+output sha256, argv, git SHA, DuckDB/extension versions)
+   ▼
+STAGE 5  publish to R2 (bucket isamples-ry) + Cloudflare Worker (data.isamples.org, /current/ aliases)
+   ▼
+DuckDB-WASM in the browser (explorer.qmd; parquet URLs ~L767-781)
+```
+
+## Stages (script / command per step)
+
+| Stage | Input → Output | How (file:line) | Automated? |
+|---|---|---|---|
+| **0/1 Export** | Solr API → `isamples_export_*_geo.parquet` | `export_client` `ExportClient.perform_full_download()` (`export_client.py:423-469`) → `write_geoparquet_from_json_lines()`; schema `SOURCE_COLUMNS` (`duckdb_utilities.py:9-42`, incl. `keywords: STRUCT(keyword VARCHAR)[]` — **text only, no URI**, L17) | ❌ API offline; **frozen** |
+| **2 Base PQG** | export → `*_narrow.parquet` / `*_wide.parquet` | `pqg/pqg/sql_converter.py` `convert_isamples_sql(input, output, wide=…)` (CLI `python pqg/sql_converter.py in.parquet out.parquet [--wide]`); 7 stages, decomposes nested structs → nodes+edges; site dedupe by rounded lat/lon+label | ✅ scripted (exact prod invocation not recorded — gap) |
+| **3 Sidecar merge** | base wide + Eric's OC PQG → `isamples_202604_wide.parquet` | `scripts/enrich_wide_with_oc_thumbnails.py` — `LEFT JOIN` OC `(pid, thumbnail_url)` into wide (`COALESCE`). **This is the precedent for merging ANY per-source supplement (incl. concept URIs) by pid.** Drift check: `scripts/check_oc_pqg_drift.py` (detects only; no mirror) | ⚠️ merge scripted; OC mirror + R2 upload manual |
+| **4 Frontend derived** | wide → 7 explorer files | The 6 map/facet files (`wide_h3`, `h3_summary_res4/6/8`, `samples_map_lite`, `sample_facets_v2`, `facet_summaries`, `facet_cross_filter`) ← **`scripts/build_frontend_derived.py`** (deterministic; geometry-agnostic; emits a manifest). `vocab_labels.parquet` ← `scripts/build_vocab_labels.py` (SKOS TTLs). Gated by `scripts/validate_frontend_derived.py` (algebraic + `--wide` semantic re-derivation) + `tests/test_frontend_derived.py` (fixtures, CI). | ✅ scripted; facet/map files semantic-tested; wide_h3 column-smoke-tested |
+| **5 Publish** | files → R2 + Worker | Worker `workers/data-isamples-org/src/index.js` (`wrangler deploy`); immutable cache for `isamples_\d{6}_*.parquet`; `/current/<flavor>.parquet` → 302 via `current/manifest.json`. Bucket `isamples-ry` | ⚠️ Worker scripted; **file upload + manifest update are manual** |
+
+## The sidecar/enrichment pattern (how new data gets in)
+
+Because the export is frozen, new per-source data is added by **merging a supplementary parquet keyed by `pid` into the base wide** — exactly what the thumbnail enrichment does:
+
+```sql
+-- scripts/enrich_wide_with_oc_thumbnails.py (core)
+CREATE TEMP TABLE oc_thumbs AS
+  SELECT DISTINCT pid, thumbnail_url FROM read_parquet('<eric_oc_pqg>') WHERE thumbnail_url IS NOT NULL;
+COPY (SELECT p.* REPLACE (COALESCE(oc.thumbnail_url, p.thumbnail_url) AS thumbnail_url)
+      FROM read_parquet('<base_wide>') p LEFT JOIN oc_thumbs oc ON p.pid = oc.pid)
+  TO '<out>' (FORMAT PARQUET, COMPRESSION ZSTD);
+```
+
+Eric Kansa maintains OpenContext PQG **independently** on GCS (`storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet`), so it can carry data the frozen iSamples export lacks. This is the channel for **#263** (external concept URIs): Eric's OC PQG carries them → merged into wide by pid → flows to the derived files. *(Sidecar design endorsed 2026-04-17; the spec `project_isamples_sidecar_pattern.md` lives in the Obsidian vault, not a repo — gap.)*
+
+## Stage 4 builder contract (`scripts/build_frontend_derived.py`)
+
+- **Geometry-agnostic input.** The `geometry` column may be **WKB BLOB** (e.g. `isamples_202604_wide`) or DuckDB **GEOMETRY** (e.g. `isamples_202601_wide`, the Zenodo wide). The builder detects the type at runtime — earlier ad-hoc SQL assumed BLOB and threw `BinderException` on GEOMETRY wides.
+- **Material selection (#265/#271).** `material` = the **first NON-ROOT** concept in `p__has_material_category` (the root `.../material/1.0/material` "Material" can sit at any array position). Samples tagged only at the root get `NULL` material (excluded from the facet). This is **NOT leaf/most-specific** selection — the arrays are not clean SKOS paths. `context`/`object_type` use `[1]`; their root-dropping is deferred.
+- **Determinism.** Every COPY has `ORDER BY`; `dominant_source` ties break on source name (ASC); center lat/lng rounded to 6 dp.
+- **Reproducibility & build identity.** Each run writes `{tag}_manifest.json` (input + per-output sha256, argv, git SHA, DuckDB + extension versions). DuckDB pinned in `scripts/requirements.txt`.
+- **Tested.** `tests/test_frontend_derived.py` (fixtures, CI via `.github/workflows/pipeline-tests.yml`) + `scripts/validate_frontend_derived.py` (algebraic: `facet_summaries == GROUP BY sample_facets_v2`, `facet_cross_filter == conditional GROUP BY`, `facets.pid == map_lite.pid`, pid uniqueness, H3 sums). `make test` / `make all`.
+
+## Documentation / automation gaps (remaining)
+
+- ⚠️ **The deployed `202601` derived files are NOT reproducible** from any available wide. A rebuild yields **528,983** root-material rows (pre-#271), whereas the deployed `sample_facets_v2` has **346,768**, so the live files came from a different, unrecorded Stage-4 process, *and* the data has since moved on (the wide is now `202604`). Treat a fresh `build_frontend_derived.py` run as the new source of truth, not as a bit-for-bit reproduction of the deployed files.
+- **Version skew:** the deployed derived files are `202601` while the wide they should derive from is `202604` (the popup reads `202604`). Rebuilding from `202604` resolves it (tracked in the pipeline epic).
+- **No R2 upload automation** — file upload to bucket `isamples-ry` + `current/manifest.json` update are manual `wrangler`/dashboard steps.
+- **No OC mirror script** — `check_oc_pqg_drift.py` detects GCS↔R2 drift but doesn't perform the mirror.
+- **Stage-2 prod invocation** that produced `zenodo_narrow_2025-12-12` / `zenodo_wide_2026-01-09` from the Zenodo export is still unrecorded (dedupe options unknown).
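+- **`SERIALIZATIONS.md:80`** claimed every file "can be rebuilt by a script" (since reworded). That is now true for the Stage-4 files; for Stage-2 the converter is scripted, but the production invocation is unrecorded, so an exact rebuild is still aspirational.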
+- **Sidecar spec** is in Obsidian only, not version-controlled with the code.
+
+## Key files
+- `export_client/isamples_export_client/duckdb_utilities.py` — export schema (keywords narrowing @ L17)
+- `pqg/pqg/sql_converter.py` — export→PQG engine; `pqg/docs/PQG_SPECIFICATION.md` — format spec
+- `isamplesorg.github.io/scripts/enrich_wide_with_oc_thumbnails.py` — the sidecar-merge precedent
+- `isamplesorg.github.io/scripts/build_vocab_labels.py` — builds `vocab_labels.parquet` from SKOS TTLs (the map/facet files come from `scripts/build_frontend_derived.py`)
+- `isamplesorg.github.io/scripts/check_oc_pqg_drift.py` — OC drift check
+- `isamplesorg.github.io/workers/data-isamples-org/{src/index.js,wrangler.toml}` — Worker + R2 config
+- `isamplesorg.github.io/SERIALIZATIONS.md` — format/schema reference (DAG companion to this file)
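The material selection rule in the Stage 4 builder contract (first NON-ROOT concept, root-only samples get `NULL`) can be sketched in a few lines of Python. This is a minimal illustration, not the builder's SQL: the function name `select_material` is hypothetical, the example URIs use a made-up `https://example.org` prefix, and because the document elides the root URI's prefix the sketch matches the root by its `/material/1.0/material` suffix only.

```python
# The source writes the root as ".../material/1.0/material", so the full
# prefix is unknown here; matching on the suffix is an assumption.
ROOT_SUFFIX = "/material/1.0/material"


def select_material(concepts):
    """Return the first non-root concept URI, or None if absent or root-only."""
    for uri in concepts or []:
        if not uri.endswith(ROOT_SUFFIX):
            return uri
    return None


# Made-up URIs for illustration only.
ROOT = "https://example.org/material/1.0/material"
ROCK = "https://example.org/material/1.0/rock"
MINERAL = "https://example.org/material/1.0/mineral"

picked_root_first = select_material([ROOT, ROCK, MINERAL])  # root at position 1
picked_root_last = select_material([ROCK, ROOT])            # root at position 2
picked_root_only = select_material([ROOT])                  # root-only sample
```

Note that the result depends on array order, not on vocabulary depth, which is why the contract says this is not leaf or most-specific selection.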
diff --git a/Makefile b/Makefile
@@ -0,0 +1,46 @@
+# Frontend-derived parquet pipeline — reproducible, AI-free.
+#
+#   make test       # fast fixture tests (no network, no big data) — the CI gate
+#   make wide       # download the canonical wide parquet and print its sha256 (printed only, not verified)
+#   make derived    # build the derived files from $(WIDE) into $(OUTDIR)
+#   make validate   # algebraic trust gate over the built files (non-zero exit on failure)
+#   make all        # wide -> derived -> validate
+#
+# Override on the command line, e.g.:
+#   make all WIDE_URL=https://data.isamples.org/isamples_202604_wide.parquet TAG=isamples_202606
+#
+# Requirements: python with `pip install -r scripts/requirements.txt`, plus
+# network access on first run (DuckDB pulls the h3 community extension).
+
+PY      ?= python
+WIDE_URL ?= https://data.isamples.org/isamples_202604_wide.parquet
+OUTDIR  ?= build/derived
+WIDE    ?= $(OUTDIR)/wide.parquet
+TAG     ?= isamples_dev
+BUILD   := scripts/build_frontend_derived.py
+VALIDATE := scripts/validate_frontend_derived.py
+
+.PHONY: help test wide derived validate all clean
+help:
+	@grep -E '^#   make' Makefile | sed 's/^#   /  /'
+
+# Fast, deterministic fixture tests — the gate a human (or CI) runs without any AI.
+test:
+	$(PY) -m pytest tests/test_frontend_derived.py -q
+
+wide: $(WIDE)
+$(WIDE):
+	@mkdir -p $(OUTDIR)
+	curl -fSL -o $(WIDE) "$(WIDE_URL)"
+	@echo "sha256: $$(shasum -a 256 $(WIDE) | cut -d' ' -f1)  $(WIDE)"
+
+derived: $(WIDE)
+	$(PY) $(BUILD) --wide $(WIDE) --outdir $(OUTDIR) --tag $(TAG) --skip wide_h3
+
+validate:
+	$(PY) $(VALIDATE) --dir $(OUTDIR) --tag $(TAG)
+
+all: wide derived validate
+
+clean:
+	rm -rf $(OUTDIR)
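The `wide` target prints a sha256 via `shasum` but does not compare it with anything. A hedged Python sketch of the same digest, plus a comparison against a recorded value, is shown here; the helper names `sha256_of` and `verify` are hypothetical, and where the expected digest would come from (for example a manifest entry) is an assumption, not something this Makefile does.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's digest matches a previously recorded hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A call such as `verify("build/derived/wide.parquet", recorded_digest)` would turn the printed checksum into an actual gate, with `recorded_digest` supplied by whoever pins the input.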
diff --git a/SERIALIZATIONS.md b/SERIALIZATIONS.md
@@ -77,9 +77,12 @@ vocab_labels.parquet           (58 KB, 537 SKOS concepts)
   └─► consumed by Search Explorer to render facet URIs as prefLabels
 ```
 
-Arrows indicate derivation, not containment. Every file in the left
-column can be rebuilt from its parent by a script in
-`isamples-python/` or `isamplesorg.github.io/scripts/`.
+Arrows indicate derivation, not containment. The Stage-4 frontend-derived
+files are rebuilt by `isamplesorg.github.io/scripts/build_frontend_derived.py`
+(+ `build_vocab_labels.py`); the Stage-2 narrow/wide files are rebuilt by
+`pqg/`. Note: the **currently deployed** `isamples_202601_*` files predate that
+builder — a fresh build is NOT bit-for-bit identical to them (see
+`DATA_PROVENANCE.md`, "deployed 202601 not reproducible").
 
 ## 3. Catalog
 
@@ -226,7 +229,7 @@ for the alias when you want "latest."
 ### 4.6 `isamples_202601_h3_summary_res{4,6,8}.parquet`
 
 - **Role**: Zoom-adaptive aggregates that back the Cesium progressive globe and the Python Explorer's "H3 tier" rendering mode.
-- **Headline schema** (7 cols, identical across resolutions): `h3_cell` (BIGINT), `sample_count` (INT), `center_lat`, `center_lng` (DOUBLE), `dominant_source` (VARCHAR), `source_count` (INT), `resolution` (INT).
+- **Headline schema** (7 cols, identical across resolutions): `h3_cell` (**UBIGINT**; H3 indexes are unsigned 64-bit integers, and the column keeps that type rather than a signed BIGINT), `sample_count` (INT), `center_lat`, `center_lng` (DOUBLE, rounded 6 dp), `dominant_source` (VARCHAR; ties broken by source name ASC for determinism), `source_count` (INT), `resolution` (INT).
 - **Query pattern**: fetch the right resolution for the current zoom; no join needed.
 - **DuckDB**:
   ```sql
@@ -247,8 +250,10 @@ for the alias when you want "latest."
 
 ### 4.8 `isamples_202601_sample_facets_v2.parquet`
 
-- **Role**: Cross-dimension facet filtering — one row per sample, each facet column holds a single controlled-vocabulary URI (the leaf concept the sample is tagged with at that dimension).
-- **Headline schema** (8 cols, all VARCHAR): `pid, source, material, context, object_type, label, description, place_name`. `material`/`context`/`object_type` are scalar URI strings, NOT arrays — the file's grain is one row per sample, so a sample tagged with multiple material URIs is represented by a single chosen URI (currently the first/leaf). For multi-material accuracy, JOIN back to `wide.p__has_material_category`.
+> ⚠️ **Deployed-file caveat:** the live `isamples_202601_sample_facets_v2.parquet` still contains **346,768** bare-root "Material" rows — it predates the #271 selection rule below. The rule describes the **builder contract** for the next rebuild (verified to drop the root → 0), not the file currently served.
+
+- **Role**: Cross-dimension facet filtering — one row per sample, each facet column holds a single controlled-vocabulary URI.
+- **Headline schema** (8 cols, all VARCHAR): `pid, source, material, context, object_type, label, description, place_name`. `material`/`context`/`object_type` are scalar URI strings, NOT arrays — one row per sample, so a sample tagged with multiple URIs is represented by a single chosen URI. **Selection rule:** `material` = the **first NON-ROOT** concept in the array (the broad root `.../material/1.0/material` is dropped — #265/#271); root-only samples → NULL material. This is **NOT** necessarily the leaf/most-specific concept (the arrays are not clean SKOS paths). `context`/`object_type` = the first array element (`[1]`). `place_name` is a VARCHAR cast of the wide's `VARCHAR[]` (note: `samples_map_lite` keeps `place_name` as `VARCHAR[]`). For multi-value accuracy, JOIN back to `wide.p__has_*_category`.
 - **Query pattern**: `WHERE material = '<uri>'` for exact match; `WHERE material ILIKE '%rock%'` to substring-match URI fragments.
 - **DuckDB**:
   ```sql
@@ -272,7 +277,7 @@ for the alias when you want "latest."
 ### 4.10 `isamples_202601_facet_cross_filter.parquet`
 
 - **Role**: Cross-facet counts for the single-active-filter case (QUERY_SPEC §3.3 tier 2a). Avoids recomputing when one facet dimension is active.
-- **Headline schema** (7 cols, 526 rows): `filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count`. Exactly one `filter_*` column is non-NULL per row.
+- **Headline schema** (7 cols): `filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count`. Two row kinds: **baseline** rows have **all** `filter_*` NULL (these equal `facet_summaries`); **single-dimension** rows have **exactly one** `filter_*` non-NULL. Single-dimension rows include self-dimension counts (`facet_type == filter dim`), which the explorer ignores. (Both kinds are emitted by `build_frontend_derived.py` and asserted by `validate_frontend_derived.py`.)
 - **Query pattern**: lookup by the active filter to get counts for the remaining dimensions.
 - **DuckDB**:
   ```sql