
build: stop SKOS root 'Material' leaking into the material facet (#265) #271

Open
rdhyee wants to merge 2 commits into isamplesorg:main from rdhyee:fix/material-concept-selection-265

rdhyee commented Jun 5, 2026

Addresses #265 (the bogus 'material' material-type).

Cause: base_samples_sql picked p__has_material_category[1] (the first array element; DuckDB lists are 1-indexed), but source arrays (esp. SESAR) carry the full SKOS ancestry, and the broad root (.../material/1.0/material, label 'Material') can sit at position 1.

Fix: resolve the whole concept array, drop the root, take the FIRST remaining concept. Conservative — only samples whose [1] was the root change. Verified against the 2026-01 wide file:

  • 528,983 root-[1] samples change: ~318k get a real concept, ~210k are root-only → NULL (excluded from the facet)
  • 0 samples retain the root → the bogus 'Material' facet disappears
  • correctly-classified samples untouched, incl. Eric's ceramic ark:/28722/k2p55x96j (stays anthropogenicmetal, not flipped to a deeper-but-wrong array entry like 'rock')

Limits: this does not pick the most-specific concept — the arrays aren't clean SKOS paths (some OC arrays end in an unrelated 'rock'), so true leaf selection needs the SKOS hierarchy (follow-up). context/object_type roots left as-is for now.
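
The selection rule described above (drop the root, take the first remaining concept, fall back to NULL when only the root is present) can be sketched in plain Python. This is only an illustration of the rule, not the SQL in build_frontend_derived.py. The names `ROOT_SUFFIX`, `is_root`, and `first_non_root` are hypothetical, and the root is matched by the URI suffix quoted above because the full URI prefix is elided in this description.

```python
from typing import Optional, Sequence

# Assumption: the SKOS root is recognised by the suffix quoted in the PR
# description. The leading part of the URI is elided there and not guessed here.
ROOT_SUFFIX = "/material/1.0/material"


def is_root(uri: str) -> bool:
    """True if the URI looks like the broad 'Material' root concept."""
    return uri.rstrip("/").endswith(ROOT_SUFFIX)


def first_non_root(concepts: Optional[Sequence[str]]) -> Optional[str]:
    """Return the first concept that is not the root, or None.

    Array order is preserved, so a sample whose first entry was already a
    real concept keeps it; only root-first samples change.
    """
    for uri in concepts or ():
        if uri and not is_root(uri):
            return uri
    # Root-only, empty, or missing array: no usable concept (NULL in SQL).
    return None
```

With this rule, an array of [root, X] yields X, an array holding only the root yields None, and an array of [X, Y] still yields X. It deliberately does not try to find the deepest concept, which matches the limit stated above.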

⚠️ Stacks on #264 (which introduces build_frontend_derived.py) — the diff will collapse to one file once #264 merges. Also note this is a build-time change: it takes effect only when the derived parquet is rebuilt + redeployed.


— 🤖 rbotyee (RY's bot). RY skimmed; the bot did the work. Ping @rdhyee.

rdhyee and others added 2 commits June 3, 2026 07:25
… derived parquet

(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.

(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.

Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…samplesorg#265)

Andrea (isamplesorg#265) saw a bogus 'material' entry in the material-type facet.
Cause: base_samples_sql picked p__has_material_category[1], but source
arrays (esp. SESAR) carry the full SKOS ancestry and the broad root
('.../material/1.0/material', label 'Material') can sit at position 1.

Fix: resolve the whole concept array, drop the root, and take the FIRST
remaining concept. Conservative by design — only samples whose [1] was the
root change (verified 528,983 of them: ~318k get a real concept, ~210k are
root-only -> NULL/excluded). Correctly-classified samples are untouched,
incl. the ceramic ark:/28722/k2p55x96j (stays anthropogenicmetal, not a
deeper-but-wrong array entry). True leaf selection needs SKOS hierarchy —
tracked as an isamplesorg#265 follow-up. context/object_type roots left as-is for now.

Stacks on isamplesorg#264 (which introduces this build script).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rdhyee added a commit to rdhyee/isamplesorg.github.io that referenced this pull request Jun 5, 2026
…mplesorg#264, isamplesorg#271)

DATA_PROVENANCE.md answers isamplesorg#268. build_frontend_derived.py includes the
isamplesorg#271 first-non-root material selection (data-track; inert until rebuild).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rdhyee commented Jun 6, 2026

⚠️ Do not merge as-is — this build change has a performance regression found by actually running it. Mixing the new MAP cross-join (material) with the original correlated subqueries (context/object_type) blows up the planner: a full rebuild from the 202604 wide sat in "building base sample tables" for >16 min before I killed it.

Also: the script assumes geometry is WKB BLOB (ST_GeomFromWKB) — it throws BinderException on GEOMETRY-typed wides (202601 / Zenodo).
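
One way the geometry contract could be hardened, sketched under assumptions: pick the SQL expression from the column's declared type instead of always calling ST_GeomFromWKB. The helper `geometry_expr` is hypothetical and is not the fix adopted in the builder; the column name `geometry` and the idea of reading the type string from DuckDB's DESCRIBE output are also assumptions.

```python
def geometry_expr(column: str, column_type: str) -> str:
    """Return a SQL expression that yields a GEOMETRY value for `column`.

    `column_type` is the declared type as a string, for example the
    column_type value DuckDB's DESCRIBE reports (assumption about the caller).
    """
    declared = column_type.strip().upper()
    if declared == "GEOMETRY":
        # Already a GEOMETRY: use the column directly, no WKB decoding.
        return column
    if declared == "BLOB":
        # WKB bytes: decode with the spatial extension's ST_GeomFromWKB.
        return f"ST_GeomFromWKB({column})"
    # Fail early with a clear message rather than at bind time.
    raise ValueError(f"unsupported geometry column type: {column_type!r}")
```

Under this sketch, `geometry_expr("geometry", "BLOB")` gives the current behaviour, while a GEOMETRY-typed wide passes the column through unchanged.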

The material-selection logic is correct (validated: root → 0, k2p55x96j preserved), but the builder needs hardening before it can produce files. Tracked in the new pipeline epic (#273), workstream 2. Static review (incl. Codex) did not catch either — only execution did.

— 🤖 rbotyee (RY directing)


rdhyee commented Jun 6, 2026

Superseded by #274 — the material first-non-root fix is folded into the hardened, tested builder there (the standalone change here had a perf regression + geometry contract bug that only surfaced on execution). Recommend closing this in favor of #274. — 🤖 rbotyee
