Epic / tracking issue. Goal: a reproducible data pipeline for the explorer's derived parquet, plus tests a human can run and trust without any AI in the loop. Scope hardened by an adversarial Codex audit (findings below). AI sign-off is explicitly not the gate — the executable test suite + human review are.
Why now
Building/verifying #271 (drop the SKOS root "Material" leak) surfaced defects that document review and AI review did NOT catch — only execution did. The Stage-4 "frontend derived" pipeline (DATA_PROVENANCE.md) is ad-hoc, unpinned, and untested.
Evidence (executable, AI-free)
scripts/validate_frontend_derived.py run against live production today → exit code 1:
| Check | Result | Detail |
|---|---|---|
| facets: no root 'Material' value | FAIL | 346,768 rows still = root (want 0) |
| summaries: no root material row | FAIL | 1 |
| cross_filter: no root material | FAIL | 23 |
| ark:/28722/k2p55x96j preserved | PASS | |
| facets: non-empty | PASS | 5,980,282 rows |
Defects to fix/track
Reproducibility & provenance
Deployed 202601 facets are not reproducible: rebuild from any available wide gives 528,983 root-material rows vs deployed 346,768; exact prod invocation unrecorded.
[Codex #6] Concept-selection contract contradiction: script = "first-remaining" vs SERIALIZATIONS.md = "leaf concept".
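One way to close the "exact prod invocation unrecorded" gap is to write a small build record next to the outputs. The sketch below is only an illustration under assumptions: the function names, the record layout, and the file names in the usage note are hypothetical and do not come from the existing scripts.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large parquet files are not read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(argv: list[str], inputs: list[Path], outputs: list[Path]) -> dict:
    """Collect what a later rebuild needs: the invocation and file checksums.

    The layout of this record is an assumption, not an existing manifest format.
    """
    return {
        "argv": list(argv),
        "built_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {p.name: sha256_of(p) for p in inputs},
        "outputs": {p.name: sha256_of(p) for p in outputs},
    }


def write_record(record: dict, dest: Path) -> None:
    """Write the record as JSON with sorted keys so diffs stay readable."""
    dest.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
```

A builder could call `build_record(sys.argv, [wide_path], output_paths)` as its last step and commit or upload the resulting JSON alongside the artifacts, so a mismatch like 528,983 vs 346,768 can be traced to a specific input checksum and invocation.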
Validator is too weak (scripts/validate_frontend_derived.py)
[Codex #17] Can pass very wrong data (no-root + 1 sentinel + >1M rows + >50% populated passes even if materials collapsed/counts wrong).
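To make the weakness concrete, here is a toy model of the four gates as described in the finding (no root value, sentinel present, minimum row count, more than half populated). It is not the code of scripts/validate_frontend_derived.py; the row shape, the helper names, and the scaled-down row threshold are assumptions for illustration.

```python
SENTINEL_PID = "ark:/28722/k2p55x96j"  # sentinel PID from the evidence output
ROOT_VALUE = "Material"                # assumed label of the SKOS root value


def weak_checks(rows, min_rows):
    """Toy versions of the four gates: no root, sentinel, size, populated."""
    populated = sum(1 for r in rows if r["material"] is not None)
    return {
        "no_root": all(r["material"] != ROOT_VALUE for r in rows),
        "sentinel": any(r["pid"] == SENTINEL_PID for r in rows),
        "row_count": len(rows) > min_rows,
        "populated": populated > 0.5 * len(rows),
    }


def make_rows(n):
    """Hypothetical good data: materials alternate between two values."""
    rows = [{"pid": f"p{i}", "material": ("rock", "soil")[i % 2]} for i in range(n)]
    rows[0]["pid"] = SENTINEL_PID
    return rows


good = make_rows(10)
# A wrong rebuild: every material collapsed to a single value.
collapsed = [{**r, "material": "rock"} for r in good]

# Both datasets pass every gate, so the gates cannot tell them apart.
assert all(weak_checks(good, min_rows=5).values())
assert all(weak_checks(collapsed, min_rows=5).values())
```

None of the four gates looks at per-value counts, which is why a collapsed rebuild is indistinguishable from a correct one under this kind of check.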
Acceptance — tests assert the derived-file algebra, not spot checks
A wrong rebuild must FAIL. Required (all human-runnable, make test / pytest, non-zero exit on failure):
facet_summaries == GROUP BY sample_facets_v2; facet_cross_filter == conditional GROUP BY sample_facets_v2; sample_facets_v2.pid == samples_map_lite.pid
PID uniqueness on every pid-keyed file
Exact schema tests (types, column order, nullability, value ranges)
H3: int/hex equivalence, resolution correctness, summary counts sum to geo sample count, deterministic tie policy
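The algebra requirement can be sketched as a recomputation: derive the expected summaries and cross-filter counts from the per-sample rows and compare them with what the build emitted. The sketch below uses in-memory rows and only the standard library; the column names, the two dimensions, and the single-active-filter shape of the cross-filter are assumptions, and a real test would read the parquet files instead.

```python
from collections import Counter

# Hypothetical stand-ins for the parquet files.
FACETS = [  # sample_facets_v2: one row per sample
    {"pid": "a", "source": "S1", "material": "rock"},
    {"pid": "b", "source": "S1", "material": "soil"},
    {"pid": "c", "source": "S2", "material": "rock"},
]
MAP_LITE = [{"pid": "a"}, {"pid": "b"}, {"pid": "c"}]  # samples_map_lite
DIMENSIONS = ("source", "material")  # assumed facet dimensions


def expected_summaries(facets):
    """facet_summaries == GROUP BY (dimension, value) over sample_facets_v2."""
    return dict(Counter((dim, row[dim]) for row in facets for dim in DIMENSIONS))


def expected_cross_filter(facets):
    """Counts per (filter_dim, filter_value, facet_dim, facet_value).

    The active (filter) dimension is excluded from the counted dimensions.
    """
    counts = Counter()
    for row in facets:
        for fdim in DIMENSIONS:
            for dim in DIMENSIONS:
                if dim != fdim:
                    counts[(fdim, row[fdim], dim, row[dim])] += 1
    return dict(counts)


def check_derived(facets, map_lite, summaries, cross_filter):
    """Return failure messages; an empty list means the algebra holds."""
    failures = []
    pids = [r["pid"] for r in facets]
    map_pids = [r["pid"] for r in map_lite]
    if len(pids) != len(set(pids)):
        failures.append("sample_facets_v2.pid is not unique")
    if len(map_pids) != len(set(map_pids)):
        failures.append("samples_map_lite.pid is not unique")
    if set(pids) != set(map_pids):
        failures.append("pid sets differ between facets and map_lite")
    if dict(summaries) != expected_summaries(facets):
        failures.append("facet_summaries != GROUP BY sample_facets_v2")
    if dict(cross_filter) != expected_cross_filter(facets):
        failures.append("facet_cross_filter != conditional GROUP BY")
    return failures
```

Wrapped in a pytest function that asserts `check_derived(...) == []`, any rebuild whose counts drift from the per-sample rows fails with a non-zero exit, which is the property spot checks cannot provide.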
Reproducibility & provenance
Deployed 202601 facets are not reproducible: rebuild from any available wide gives 528,983 root-material rows vs deployed 346,768; exact prod invocation unrecorded.
… 202601 vs canonical wide 202604; stale default --tag.
Builder correctness (scripts/build_frontend_derived.py)
… ST_GeomFromWKB); 202601/Zenodo wides are GEOMETRY → BinderException. Make geometry-agnostic.
MAP cross-join mixed with correlated subqueries → planner blowup (base build >16 min, killed). [Codex #2] context/object_type still use per-row correlated lookups. Resolve all 3 concept columns by one consistent, decorrelated method.
--only/--skip not isolated: always builds samp/samp_geo + H3/spatial before honoring them.
… --only/--skip names silently succeed (typo → emits nothing, exit 0).
… IdentifiedConcept row-ids → NULL, no integrity threshold).
… ORDER BY on COPY; MODE(source) nondeterministic on ties; float AVG parallel-variance.
facet_cross_filter emits self-dimension rows the UI ignores (counts should exclude the active dimension).
Schema / contract drift
place_name unstable: VARCHAR[] → cast to VARCHAR in facets but stays array in samples_map_lite; docs say VARCHAR.
facet_cross_filter baseline rows have all filter_* NULL, contradicting SERIALIZATIONS.md ("exactly one non-null").
… COUNT(*)/UBIGINT vs docs' INT/BIGINT.
[Codex #6] Concept-selection contract contradiction: script = "first-remaining" vs SERIALIZATIONS.md = "leaf concept".
Validator is too weak (scripts/validate_frontend_derived.py)
… facet_summaries/facet_cross_filter from sample_facets_v2 → drift uncaught.
… fetchone() can mask a wrong duplicate).
… samples_map_lite, H3 summaries, vocab_labels, manifest, current aliases.
--validate-against is print-only, not a gate (skips missing files, compares only column names, never exits non-zero).
Docs
DATA_PROVENANCE.md stale (says Stage 4 ad-hoc / "no build script" while shipping it).
… SERIALIZATIONS.md.
Acceptance: tests assert the derived-file algebra, not spot checks
A wrong rebuild must FAIL. Required (all human-runnable, make test / pytest, non-zero exit on failure):
facet_summaries == GROUP BY sample_facets_v2; facet_cross_filter == conditional GROUP BY sample_facets_v2; sample_facets_v2.pid == samples_map_lite.pid
… place_name fixtures (null/empty/single/multi/quotes/serialization)
… --only fails; missing output fails; --validate-against exits non-zero on mismatch)
… 202601/202604 artifacts unless documented; current/manifest.json coherent; remote HEAD/checksum checks
Workstreams (Codex-recommended split; file as sub-issues if/when picked up)
… DATA_PROVENANCE.md, SERIALIZATIONS.md)
Related
#271 (material fix — carries the perf regression above) · #272 (OC sidecar) · #268/#264 (provenance docs) · #265/#260 (reports that surfaced this) · #131 #135 #138
Scope hardened by an adversarial Codex audit (20 defects + algebra-not-spot-checks). 🤖 rbotyee; RY directing. Process: Codex attacks, executable tests gate, human approves.