build: stop SKOS root 'Material' leaking into the material facet (#265) #271
rdhyee wants to merge 2 commits
… derived parquet
(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.
(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.
Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
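The schema+count comparison described for `--validate-against` could look roughly like the sketch below. It is illustrative only: the function name `diff_against_published`, the `(columns, row_count)` summary shape, and the toy numbers are assumptions, not taken from `build_frontend_derived.py`.

```python
from typing import Dict, List, Tuple

# Hypothetical summary of one derived file: (column names, row count).
Summary = Tuple[List[str], int]


def diff_against_published(
    built: Dict[str, Summary], published: Dict[str, Summary]
) -> Dict[str, dict]:
    """Compare freshly built files with published ones by schema and row count."""
    report: Dict[str, dict] = {}
    for name, (cols, rows) in built.items():
        if name not in published:
            # Nothing to compare against for this file.
            report[name] = {"status": "missing_published"}
            continue
        pub_cols, pub_rows = published[name]
        report[name] = {
            "status": "compared",
            "schema_match": cols == pub_cols,
            "row_delta": rows - pub_rows,  # positive: the new build has more rows
        }
    return report


# Toy inputs (not real counts) showing an exact match and a small positive delta.
built = {
    "samples_map_lite": (["pid", "lat", "lon"], 100),
    "facet_summaries": (["facet", "value", "n"], 13),
}
published = {
    "samples_map_lite": (["pid", "lat", "lon"], 100),
    "facet_summaries": (["facet", "value", "n"], 10),
}
report = diff_against_published(built, published)
```

A report of this shape separates "schema-correct with a small delta" from an exact reproduction, which is the distinction the validation notes above draw.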
…samplesorg#265)
Andrea (isamplesorg#265) saw a bogus 'material' entry in the material-type facet.
Cause: base_samples_sql picked p__has_material_category[1], but source arrays (esp. SESAR) carry the full SKOS ancestry and the broad root ('.../material/1.0/material', label 'Material') can sit at position 1.
Fix: resolve the whole concept array, drop the root, and take the FIRST remaining concept. Conservative by design: only samples whose [1] was the root change (verified 528,983 of them: ~318k get a real concept, ~210k are root-only -> NULL/excluded). Correctly-classified samples are untouched, incl. the ceramic ark:/28722/k2p55x96j (stays anthropogenicmetal, not a deeper-but-wrong array entry).
True leaf selection needs the SKOS hierarchy; tracked as an isamplesorg#265 follow-up. context/object_type roots left as-is for now.
Stacks on isamplesorg#264 (which introduces this build script).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mplesorg#264, isamplesorg#271)
DATA_PROVENANCE.md answers isamplesorg#268. build_frontend_derived.py includes the isamplesorg#271 first-non-root material selection (data-track; inert until rebuild).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Also: the script assumes … The material-selection logic is correct (validated: root → 0, …
🤖 rbotyee (RY directing)
Addresses #265 (the bogus 'material' material-type).
Cause: `base_samples_sql` picked `p__has_material_category[1]`, but source arrays (esp. SESAR) carry the full SKOS ancestry and the broad root (`.../material/1.0/material`, label 'Material') can sit at position 1.
Fix: resolve the whole concept array, drop the root, take the FIRST remaining concept. Conservative: only samples whose `[1]` was the root change. Verified against the 2026-01 wide file:
- Samples whose `[1]` was the root change: ~318k get a real concept, ~210k are root-only → NULL (excluded from the facet).
- Correctly-classified samples are untouched, incl. the ceramic `ark:/28722/k2p55x96j` (stays `anthropogenicmetal`, not flipped to a deeper-but-wrong array entry like 'rock').
Limits: this does not pick the most-specific concept. The arrays aren't clean SKOS paths (some OC arrays end in an unrelated 'rock'), so true leaf selection needs the SKOS hierarchy (follow-up). `context`/`object_type` roots left as-is for now.
Stacks on #264 (which introduces `build_frontend_derived.py`); the diff will collapse to one file once #264 merges. Also note this is a build-time change: it takes effect only when the derived parquet is rebuilt + redeployed.
🤖 rbotyee (RY's bot). RY skimmed; the bot did the work. Ping @rdhyee.
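For reviewers, the first-non-root rule can be restated as a small Python sketch. This is not the SQL in `base_samples_sql`; the helper name `pick_material`, the `example.org` URIs, and matching the root by its URI suffix are all assumptions made for illustration. Note that DuckDB lists are 1-indexed, so `[1]` in the SQL corresponds to index 0 here.

```python
from typing import Optional, Sequence

# Assumption: the root is recognised by its URI suffix, since the full
# prefix is elided ('.../material/1.0/material') in the description.
ROOT_SUFFIX = "/material/1.0/material"

# Hypothetical URIs used only for this illustration.
ROOT = "https://example.org/material/1.0/material"
ROCK = "https://example.org/material/1.0/rock"
METAL = "https://example.org/material/1.0/anthropogenicmetal"


def pick_material(concepts: Optional[Sequence[str]]) -> Optional[str]:
    """Return the first concept that is not the SKOS root, else None."""
    if not concepts:
        return None
    for uri in concepts:
        if uri and not uri.endswith(ROOT_SUFFIX):
            return uri
    # Root-only arrays yield None, i.e. the sample is excluded from the facet.
    return None


root_first = pick_material([ROOT, ROCK])    # root at position 1: next concept wins
root_only = pick_material([ROOT])           # root-only: None
unchanged = pick_material([METAL, ROCK])    # non-root first entry is kept as-is
```

The third case mirrors the conservative design: a sample whose first entry is already a real concept keeps it, even if a later array entry looks more specific.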