36 changes: 36 additions & 0 deletions .github/workflows/pipeline-tests.yml
@@ -0,0 +1,36 @@
name: Derived-parquet pipeline tests

# Fast, AI-free gate for the data pipeline. Runs the fixture-based unit tests
# (no network for data, no large parquet) whenever the pipeline code changes.
on:
  pull_request:
    paths:
      - "scripts/build_frontend_derived.py"
      - "scripts/validate_frontend_derived.py"
      - "tests/test_frontend_derived.py"
      - "scripts/requirements.txt"
      - "Makefile"
      - ".github/workflows/pipeline-tests.yml"
  push:
    branches: [main]
    paths:
      - "scripts/build_frontend_derived.py"
      - "scripts/validate_frontend_derived.py"
      - "tests/test_frontend_derived.py"
  workflow_dispatch:

jobs:
  fixture-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install deps
        run: pip install -r scripts/requirements.txt
      - name: Run pipeline fixture tests
        # builds tiny synthetic wides (WKB BLOB + DuckDB GEOMETRY), runs the real
        # builder + algebraic validator, asserts the contract. Exits non-zero on
        # any failure -> PR is blocked.
        run: python -m pytest tests/test_frontend_derived.py -q
83 changes: 83 additions & 0 deletions DATA_PROVENANCE.md
@@ -0,0 +1,83 @@
# iSamples Explorer — Data Provenance

How every parquet file the explorer uses is generated, from root to publish.
*Reviewed 2026-06-02 (CC, via codebase audit). Complements `SERIALIZATIONS.md` (format/schema reference); this file is the end-to-end build chain + the automation gaps.*

> **Load-bearing constraint:** the **root export cannot be regenerated.** It was produced from the iSamples Central Solr API (`central.isample.xyz`), **offline since Aug 2025**. The Zenodo-archived export is a **frozen root**. Any *new* data (e.g. concept URIs, thumbnails) therefore must come from a **per-source supplementary file merged into the base by `pid`** — the "sidecar" pattern (see Stage 3) — not from re-exporting.

## Pipeline DAG

```
Source collections (SESAR · OpenContext · GEOME · Smithsonian)
│ iSamples Central Solr API ── OFFLINE since Aug 2025 (cannot re-run) ──┐
▼ │
STAGE 0/1 export_client → JSONL → GeoParquet │ frozen
→ isamples_export_*_geo.parquet (Export format; ~300MB, 6.7M; Zenodo doi:10.5281/zenodo.15278211)
STAGE 2 pqg/pqg/sql_converter.py (export → base PQG; 7-stage DuckDB SQL)
→ narrow (…_narrow.parquet, ~844MB, 106M rows) and wide (…_wide.parquet, ~282MB, 20M rows)
STAGE 3 sidecar/enrichment merge (LEFT JOIN by pid) ← Eric's independently-maintained OC PQG (GCS)
scripts/enrich_wide_with_oc_thumbnails.py → isamples_202604_wide.parquet (+47K thumbnails)
STAGE 4 wide → frontend derived files (NOW SCRIPTED: scripts/build_frontend_derived.py)
→ wide_h3 · h3_summary_res4/6/8 · samples_map_lite · sample_facets_v2 · facet_summaries · facet_cross_filter
→ vocab_labels (scripts/build_vocab_labels.py — built separately from SKOS TTLs)
→ {tag}_manifest.json (build identity: input+output sha256, argv, git SHA, DuckDB/extension versions)
STAGE 5 publish to R2 (bucket isamples-ry) + Cloudflare Worker (data.isamples.org, /current/ aliases)
DuckDB-WASM in the browser (explorer.qmd; parquet URLs ~L767-781)
```
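
The `{tag}_manifest.json` step in the DAG records build identity as checksums. A minimal sketch of that idea follows, assuming a hypothetical field layout (`tag`, `argv`, `input`, `outputs`); the real manifest written by `build_frontend_derived.py` may be shaped differently, and the git SHA and DuckDB/extension versions are left out here to keep the example stdlib-only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large parquet never loads fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tag: str, wide: Path, outputs: list[Path], argv: list[str]) -> dict:
    """Assemble a build-identity record. Field names are hypothetical."""
    return {
        "tag": tag,
        "argv": list(argv),
        "input": {"path": wide.name, "sha256": sha256_of(wide)},
        # Sorted so the manifest itself is deterministic across runs.
        "outputs": {p.name: sha256_of(p) for p in sorted(outputs)},
    }
```

Two builds from the same input and the same builder version should then produce identical `outputs` hashes, which is what makes the manifest usable as a reproducibility check.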

## Stages (script / command per step)

| Stage | Input → Output | How (file:line) | Automated? |
|---|---|---|---|
| **0/1 Export** | Solr API → `isamples_export_*_geo.parquet` | `export_client` `ExportClient.perform_full_download()` (`export_client.py:423-469`) → `write_geoparquet_from_json_lines()`; schema `SOURCE_COLUMNS` (`duckdb_utilities.py:9-42`, incl. `keywords: STRUCT(keyword VARCHAR)[]` — **text only, no URI**, L17) | ❌ API offline; **frozen** |
| **2 Base PQG** | export → `*_narrow.parquet` / `*_wide.parquet` | `pqg/pqg/sql_converter.py` `convert_isamples_sql(input, output, wide=…)` (CLI `python pqg/sql_converter.py in.parquet out.parquet [--wide]`); 7 stages, decomposes nested structs → nodes+edges; site dedupe by rounded lat/lon+label | ✅ scripted (exact prod invocation not recorded — gap) |
| **3 Sidecar merge** | base wide + Eric's OC PQG → `isamples_202604_wide.parquet` | `scripts/enrich_wide_with_oc_thumbnails.py` — `LEFT JOIN` OC `(pid, thumbnail_url)` into wide (`COALESCE`). **This is the precedent for merging ANY per-source supplement (incl. concept URIs) by pid.** Drift check: `scripts/check_oc_pqg_drift.py` (detects only; no mirror) | ⚠️ merge scripted; OC mirror + R2 upload manual |
| **4 Frontend derived** | wide → 7 explorer files | The 6 map/facet files (`wide_h3`, `h3_summary_res4/6/8`, `samples_map_lite`, `sample_facets_v2`, `facet_summaries`, `facet_cross_filter`) ← **`scripts/build_frontend_derived.py`** (deterministic; geometry-agnostic; emits a manifest). `vocab_labels.parquet` ← `scripts/build_vocab_labels.py` (SKOS TTLs). Gated by `scripts/validate_frontend_derived.py` (algebraic + `--wide` semantic re-derivation) + `tests/test_frontend_derived.py` (fixtures, CI). | ✅ scripted; facet/map files semantic-tested; wide_h3 column-smoke-tested |
| **5 Publish** | files → R2 + Worker | Worker `workers/data-isamples-org/src/index.js` (`wrangler deploy`); immutable cache for `isamples_\d{6}_*.parquet`; `/current/<flavor>.parquet` → 302 via `current/manifest.json`. Bucket `isamples-ry` | ⚠️ Worker scripted; **file upload + manifest update are manual** |
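
The Stage 5 routing can be modelled in a few lines. This is a sketch only: the real Worker is JavaScript (`src/index.js`), the manifest shape (`flavor -> file name`) is an assumption, the `Cache-Control` value is illustrative, and the regex is one reading of the `isamples_\d{6}_*.parquet` pattern quoted in the table.

```python
import re

# Assumed interpretation of the versioned-file pattern from the table above.
VERSIONED = re.compile(r"isamples_\d{6}_.+\.parquet")


def route(path: str, manifest: dict[str, str]) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a request path, loosely modelling the Worker."""
    name = path.lstrip("/")
    if name.startswith("current/") and name.endswith(".parquet"):
        # /current/<flavor>.parquet is an alias resolved through the manifest.
        flavor = name[len("current/"):-len(".parquet")]
        target = manifest.get(flavor)
        if target is None:
            return 404, {}
        return 302, {"Location": f"/{target}"}
    if VERSIONED.fullmatch(name):
        # Versioned files never change, so they can be cached indefinitely.
        return 200, {"Cache-Control": "public, max-age=31536000, immutable"}
    return 200, {}
```

The split matters for publishing: versioned names are safe to cache forever, so "latest" has to be expressed as a redirect that only the manifest update changes.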

## The sidecar/enrichment pattern (how new data gets in)

Because the export is frozen, new per-source data is added by **merging a supplementary parquet keyed by `pid` into the base wide** — exactly what the thumbnail enrichment does:

```sql
-- scripts/enrich_wide_with_oc_thumbnails.py (core)
CREATE TEMP TABLE oc_thumbs AS
SELECT DISTINCT pid, thumbnail_url FROM read_parquet('<eric_oc_pqg>') WHERE thumbnail_url IS NOT NULL;
COPY (SELECT p.* REPLACE (COALESCE(oc.thumbnail_url, p.thumbnail_url) AS thumbnail_url)
FROM read_parquet('<base_wide>') p LEFT JOIN oc_thumbs oc ON p.pid = oc.pid)
TO '<out>' (FORMAT PARQUET, COMPRESSION ZSTD);
```

Eric Kansa maintains OpenContext PQG **independently** on GCS (`storage.googleapis.com/opencontext-parquet/oc_isamples_pqg.parquet`), so it can carry data the frozen iSamples export lacks. This is the channel for **#263** (external concept URIs): Eric's OC PQG carries them → merged into wide by pid → flows to the derived files. *(Sidecar design endorsed 2026-04-17; the spec `project_isamples_sidecar_pattern.md` lives in the Obsidian vault, not a repo — gap.)*
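
The merge semantics can be shown without DuckDB. The sketch below assumes rows are plain dicts and uses a hypothetical helper name, `merge_sidecar`; it mirrors the `LEFT JOIN` plus `COALESCE` behaviour of the SQL above, with one deliberate difference: where the SQL would fan out a base row if the sidecar held two different values for one `pid`, this version raises instead.

```python
def merge_sidecar(base_rows, sidecar_rows, key="pid", column="thumbnail_url"):
    """Merge one supplementary column into the base rows by key.

    Every base row survives (LEFT JOIN); a non-null sidecar value wins,
    otherwise the base value is kept (COALESCE).
    """
    supplement = {}
    for row in sidecar_rows:
        value = row.get(column)
        if value is None:
            continue  # same effect as WHERE thumbnail_url IS NOT NULL
        if supplement.setdefault(row[key], value) != value:
            raise ValueError(f"conflicting {column} values for {key}={row[key]!r}")

    merged = []
    for row in base_rows:
        out = dict(row)  # copy, so the base rows are not mutated
        out[column] = supplement.get(row[key], row.get(column))
        merged.append(out)
    return merged
```

Sidecar rows whose `pid` is absent from the base are dropped, so the row count of the output always equals the row count of the base wide. That is a cheap invariant to assert after any enrichment run.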

## Stage 4 builder contract (`scripts/build_frontend_derived.py`)

- **Geometry-agnostic input.** The `geometry` column may be **WKB BLOB** (e.g. `isamples_202604_wide`) or DuckDB **GEOMETRY** (e.g. `isamples_202601_wide`, the Zenodo wide). The builder detects the type at runtime — earlier ad-hoc SQL assumed BLOB and threw `BinderException` on GEOMETRY wides.
- **Material selection (#265/#271).** `material` = the **first NON-ROOT** concept in `p__has_material_category` (the root `.../material/1.0/material` "Material" can sit at any array position). Samples tagged only at the root get `NULL` material (excluded from the facet). This is **NOT leaf/most-specific** selection — the arrays are not clean SKOS paths. `context`/`object_type` use `[1]`; their root-dropping is deferred.
- **Determinism.** Every COPY has `ORDER BY`; `dominant_source` ties break on source name (ASC); center lat/lng rounded to 6 dp.
- **Reproducibility & build identity.** Each run writes `{tag}_manifest.json` (input + per-output sha256, argv, git SHA, DuckDB + extension versions). DuckDB pinned in `scripts/requirements.txt`.
- **Tested.** `tests/test_frontend_derived.py` (fixtures, CI via `.github/workflows/pipeline-tests.yml`) + `scripts/validate_frontend_derived.py` (algebraic: `facet_summaries == GROUP BY sample_facets_v2`, `facet_cross_filter == conditional GROUP BY`, `facets.pid == map_lite.pid`, pid uniqueness, H3 sums). `make test` / `make all`.
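
Two parts of this contract are easy to illustrate in plain Python: the first-non-root material rule and the algebraic check that a stored summary equals a `GROUP BY` over the sample-level rows. The sketch below is not the builder's code. The `example.org` URIs are hypothetical stand-ins for the real vocabulary, and skipping `NULL` materials in the counts is an assumption based on "excluded from the facet".

```python
from collections import Counter

# Suffix of the root "Material" concept as quoted in the contract above;
# the host prefix is elided in the source, so only the suffix is matched.
ROOT_SUFFIX = "/material/1.0/material"


def pick_material(categories):
    """Return the first non-root concept, or None if only the root is present."""
    for uri in categories or []:
        if not uri.endswith(ROOT_SUFFIX):
            return uri
    return None


def facet_counts(rows, dim):
    """Recompute a facet summary: GROUP BY dim, skipping NULL values."""
    return Counter(row[dim] for row in rows if row[dim] is not None)


# Hypothetical fixture: the root concept appears at different array positions.
BASE = "https://example.org/material/1.0/"
wide = [
    {"pid": "s1", "cats": [BASE + "material", BASE + "rock"]},
    {"pid": "s2", "cats": [BASE + "rock", BASE + "material"]},
    {"pid": "s3", "cats": [BASE + "material"]},  # root-only, so material is None
    {"pid": "s4", "cats": [BASE + "mineral"]},
]
facets = [{"pid": r["pid"], "material": pick_material(r["cats"])} for r in wide]
stored_summary = {BASE + "rock": 2, BASE + "mineral": 1}

# Algebraic gate: the stored summary must equal the recomputed GROUP BY,
# and pid must be unique in the sample-level file.
assert dict(facet_counts(facets, "material")) == stored_summary
assert len({f["pid"] for f in facets}) == len(facets)
```

The value of the algebraic form is that it needs no reference output: any derived file that disagrees with a recomputation from its sibling fails the gate.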

## Documentation / automation gaps (remaining)

- ⚠️ **The deployed `202601` derived files are NOT reproducible** from any available wide. A rebuild yields **528,983** root-material rows (pre-#271); the deployed `sample_facets_v2` has **346,768** — so the live files came from a different/unrecorded Stage-4 process, *and* the data has since rolled (wide is now `202604`). Treat a fresh `build_frontend_derived.py` run as the new source of truth, not as a bit-for-bit reproduction of the deployed files.
- **Version skew:** the deployed derived files are `202601` while the wide they should derive from is `202604` (the popup reads `202604`). Rebuilding from `202604` resolves it (tracked in the pipeline epic).
- **No R2 upload automation** — file upload to bucket `isamples-ry` + `current/manifest.json` update are manual `wrangler`/dashboard steps.
- **No OC mirror script** — `check_oc_pqg_drift.py` detects GCS↔R2 drift but doesn't perform the mirror.
- **Stage-2 prod invocation** that produced `zenodo_narrow_2025-12-12` / `zenodo_wide_2026-01-09` from the Zenodo export is still unrecorded (dedupe options unknown).
- **`SERIALIZATIONS.md:80`** claims every file "can be rebuilt by a script" — now true for the Stage-4 files; still aspirational for Stage-2.
- **Sidecar spec** is in Obsidian only, not version-controlled with the code.

## Key files
- `export_client/isamples_export_client/duckdb_utilities.py` — export schema (keywords narrowing @ L17)
- `pqg/pqg/sql_converter.py` — export→PQG engine; `pqg/docs/PQG_SPECIFICATION.md` — format spec
- `isamplesorg.github.io/scripts/enrich_wide_with_oc_thumbnails.py` — the sidecar-merge precedent
- `isamplesorg.github.io/scripts/build_vocab_labels.py` — builds `vocab_labels.parquet` from the SKOS TTLs
- `isamplesorg.github.io/scripts/check_oc_pqg_drift.py` — OC drift check
- `isamplesorg.github.io/workers/data-isamples-org/{src/index.js,wrangler.toml}` — Worker + R2 config
- `isamplesorg.github.io/SERIALIZATIONS.md` — format/schema reference (DAG companion to this file)
46 changes: 46 additions & 0 deletions Makefile
@@ -0,0 +1,46 @@
# Frontend-derived parquet pipeline — reproducible, AI-free.
#
# make test # fast fixture tests (no network, no big data) — the CI gate
# make wide # download + checksum the canonical wide parquet
# make derived # build the derived files from $(WIDE) into $(OUTDIR)
# make validate # algebraic trust gate over the built files (non-zero exit on failure)
# make all # wide -> derived -> validate
#
# Override on the command line, e.g.:
# make all WIDE_URL=https://data.isamples.org/isamples_202604_wide.parquet TAG=isamples_202606
#
# Requirements: python with `pip install -r scripts/requirements.txt`, plus
# network access on first run (DuckDB pulls the h3 community extension).

PY ?= python
WIDE_URL ?= https://data.isamples.org/isamples_202604_wide.parquet
OUTDIR ?= build/derived
WIDE ?= $(OUTDIR)/wide.parquet
TAG ?= isamples_dev
BUILD := scripts/build_frontend_derived.py
VALIDATE := scripts/validate_frontend_derived.py

.PHONY: help test wide derived validate all clean
help:
	@grep -E '^# make' Makefile | sed 's/^# / /'

# Fast, deterministic fixture tests — the gate a human (or CI) runs without any AI.
test:
	$(PY) -m pytest tests/test_frontend_derived.py -q

wide: $(WIDE)
$(WIDE):
	@mkdir -p $(OUTDIR)
	curl -fSL -o $(WIDE) "$(WIDE_URL)"
	@echo "sha256: $$(shasum -a 256 $(WIDE) | cut -d' ' -f1) $(WIDE)"

derived: $(WIDE)
	$(PY) $(BUILD) --wide $(WIDE) --outdir $(OUTDIR) --tag $(TAG) --skip wide_h3

validate:
	$(PY) $(VALIDATE) --dir $(OUTDIR) --tag $(TAG)

all: wide derived validate

clean:
	rm -rf $(OUTDIR)
19 changes: 12 additions & 7 deletions SERIALIZATIONS.md
@@ -77,9 +77,12 @@ vocab_labels.parquet (58 KB, 537 SKOS concepts)
└─► consumed by Search Explorer to render facet URIs as prefLabels
```

-Arrows indicate derivation, not containment. Every file in the left
-column can be rebuilt from its parent by a script in
-`isamples-python/` or `isamplesorg.github.io/scripts/`.
+Arrows indicate derivation, not containment. The Stage-4 frontend-derived
+files are rebuilt by `isamplesorg.github.io/scripts/build_frontend_derived.py`
+(+ `build_vocab_labels.py`); the Stage-2 narrow/wide files are rebuilt by
+`pqg/`. Note: the **currently deployed** `isamples_202601_*` files predate that
+builder — a fresh build is NOT bit-for-bit identical to them (see
+`DATA_PROVENANCE.md`, "deployed 202601 not reproducible").

## 3. Catalog

@@ -226,7 +229,7 @@ for the alias when you want "latest."
### 4.6 `isamples_202601_h3_summary_res{4,6,8}.parquet`

- **Role**: Zoom-adaptive aggregates that back the Cesium progressive globe and the Python Explorer's "H3 tier" rendering mode.
-- **Headline schema** (7 cols, identical across resolutions): `h3_cell` (BIGINT), `sample_count` (INT), `center_lat`, `center_lng` (DOUBLE), `dominant_source` (VARCHAR), `source_count` (INT), `resolution` (INT).
+- **Headline schema** (7 cols, identical across resolutions): `h3_cell` (**UBIGINT** — H3 cells are unsigned 64-bit; a signed BIGINT would go negative for high-bit cells), `sample_count` (INT), `center_lat`, `center_lng` (DOUBLE, rounded 6 dp), `dominant_source` (VARCHAR; ties broken by source name ASC for determinism), `source_count` (INT), `resolution` (INT).
- **Query pattern**: fetch the right resolution for the current zoom; no join needed.
- **DuckDB**:
```sql
@@ -247,8 +250,10 @@ for the alias when you want "latest."

### 4.8 `isamples_202601_sample_facets_v2.parquet`

-- **Role**: Cross-dimension facet filtering — one row per sample, each facet column holds a single controlled-vocabulary URI (the leaf concept the sample is tagged with at that dimension).
-- **Headline schema** (8 cols, all VARCHAR): `pid, source, material, context, object_type, label, description, place_name`. `material`/`context`/`object_type` are scalar URI strings, NOT arrays — the file's grain is one row per sample, so a sample tagged with multiple material URIs is represented by a single chosen URI (currently the first/leaf). For multi-material accuracy, JOIN back to `wide.p__has_material_category`.
+> ⚠️ **Deployed-file caveat:** the live `isamples_202601_sample_facets_v2.parquet` still contains **346,768** bare-root "Material" rows — it predates the #271 selection rule below. The rule describes the **builder contract** for the next rebuild (verified to drop the root → 0), not the file currently served.
+
+- **Role**: Cross-dimension facet filtering — one row per sample, each facet column holds a single controlled-vocabulary URI.
+- **Headline schema** (8 cols, all VARCHAR): `pid, source, material, context, object_type, label, description, place_name`. `material`/`context`/`object_type` are scalar URI strings, NOT arrays — one row per sample, so a sample tagged with multiple URIs is represented by a single chosen URI. **Selection rule:** `material` = the **first NON-ROOT** concept in the array (the broad root `.../material/1.0/material` is dropped — #265/#271); root-only samples → NULL material. This is **NOT** necessarily the leaf/most-specific concept (the arrays are not clean SKOS paths). `context`/`object_type` = the first array element (`[1]`). `place_name` is a VARCHAR cast of the wide's `VARCHAR[]` (note: `samples_map_lite` keeps `place_name` as `VARCHAR[]`). For multi-value accuracy, JOIN back to `wide.p__has_*_category`.
- **Query pattern**: `WHERE material = '<uri>'` for exact match; `WHERE material ILIKE '%rock%'` to substring-match URI fragments.
- **DuckDB**:
```sql
@@ -272,7 +277,7 @@ for the alias when you want "latest."
### 4.10 `isamples_202601_facet_cross_filter.parquet`

- **Role**: Cross-facet counts for the single-active-filter case (QUERY_SPEC §3.3 tier 2a). Avoids recomputing when one facet dimension is active.
-- **Headline schema** (7 cols, 526 rows): `filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count`. Exactly one `filter_*` column is non-NULL per row.
+- **Headline schema** (7 cols): `filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count`. Two row kinds: **baseline** rows have **all** `filter_*` NULL (these equal `facet_summaries`); **single-dimension** rows have **exactly one** `filter_*` non-NULL. Single-dimension rows include self-dimension counts (`facet_type == filter dim`), which the explorer ignores. (Both kinds are emitted by `build_frontend_derived.py` and asserted by `validate_frontend_derived.py`.)
- **Query pattern**: lookup by the active filter to get counts for the remaining dimensions.
- **DuckDB**:
```sql