fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2
Open
poche123 wants to merge 3 commits into
Open
fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2poche123 wants to merge 3 commits into
poche123 wants to merge 3 commits into
Conversation
added 3 commits
June 2, 2026 10:08
…ion, pipeline robustness Surface column values into the grounding/generation prompts and harden the query + ingestion pipelines so the NL→SQL agents stop guessing literals and the control flow stops reporting wrong-but-empty results as confident answers. Grounding / value anchoring: - profile: store the full distinct value set for low-cardinality columns (was top-5) - retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the grounding and generation prompts; instructions to copy WHERE/CASE literals verbatim, handle tall/EAV layouts and scaled-duplicate measures - structurally detect self-referencing hierarchy columns (value containment, with a row-wise discriminator) and teach "a person's team/reports/structure" -> filter the hierarchy column Control flow / governance: - treat 0-row / degenerate aggregate results as suspicious -> clarification + confidence cap - gate auto-learning on row_count > 0 (stop poisoning the example store) - prefer a candidate that passes the firewall and returns non-empty rows over candidates[0] - exclude CTE names from AST table_refs (firewall no longer rejects valid CTE SQL) - snapshot-scope retrieval; honor and persist conversation snapshot pins - unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards - thread a re-firewalled extra_filter through the semantic-metric bind() Ingestion robustness / determinism: - header-quality scoring (skip numeric spacer rows) + stable leading-blank handling - temperature=0 for the naming/describe agents (cuts re-ingest schema churn ~94%) - rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors) - index column values into embeddings + content_tsv - warn when the cross-encoder reranker degrades to a no-op
- builder.py is a lock-step (canon-pinned) module; revert it and pass the
determinism setting via the existing extra_settings={"temperature": 0.0}
instead of a new parameter, so the lockstep gate stays green
- move the table-kinds lookup out of the query controller into TableResolver
(service layer) so the "no raw SQL in controllers" gate passes
- ruff: inline a negated return (SIM103), drop a forward-ref quote (UP037),
and apply ruff format
- tests: extend the fake TableResolvers with pins / current_snapshots /
table_kinds_by_name; seed current_snapshot_id in the retrieval integration
fixture (retrieval is now scoped to current_snapshot_id, as publish sets it)
Financial exports (Orbis/BvD) place one date-header row above a stack of sub-sections. Section-splitting orphaned that header above each sub-section's title, dropping the year columns and making the detailed financial tables (Ratios de Rentabilidad, Memo lineas, etc.) unqueryable by period. A data-first section now inherits the nearest recent period header as its column row, encoded as a backward-compatible 3-index section path ([header:data_start:end]); contiguous sections keep the legacy 2-index form. Guards against misfires: - period values are full-match only (INV-2024-0007 / Form 2020 are not years) - only a single leftmost label column may precede the inherited years - a label-headed table closes the period band (no stale-header bleed) - phantom null columns are compacted; header-row index is bounds-checked Adds unit coverage for inheritance, band-closing/no-false-inherit, path round-trip and legacy-path parsing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
End-to-end testing against two real, messy Excel workbooks (a 25k-row financial MTP table and an 8-sheet sales-structure workbook) exposed that the NL→SQL pipeline guessed SQL literals (e.g.
Year IN (2023)when values areFY23,Brand='DAPA'when DAPA lives inFamily/TA, summing a scaledFYinstead ofFY (Real)) and that the control flow reported wrong-but-empty results as confident answers. This PR makes the agents anchor on real data and hardens the query + ingestion pipelines.Result: a question set that previously answered ~0/9 now answers Section 1 (financial) 4/4 exact and Section 2 (org/sales) 4/5, with the remaining miss flagged rather than confidently wrong.
What changed
Grounding / value anchoring
profile: store the full distinct value set for low-cardinality columns (was top-5) so filter/CASE literals are copied verbatim.retrieval:value_fingerprint()on hits; render a "Column value catalogue" into the grounding and generation prompts; instructions for tall/EAV layouts and scaled-duplicate measures.Control flow / governance
row_count > 0.candidates[0].table_refs(the firewall was rejecting valid CTE SQL)./stream,:explain,:validateon the rendered prompts + firewall/scope guards (they previously ran blind).extra_filterthrough the semantic-metricbind().Ingestion robustness / determinism
temperature=0for naming/describe agents → re-ingest schema churn 302 → 17 (~94%).content_tsv; warn when the reranker degrades to a no-op.Validation (vs ground truth computed directly from the Parquet)
Known limitations / follow-ups
Total_*buckets); a structuralvalue-aggregate == sum-of-othersconfirmation would re-enable it.task lintnot run in this PR.Note
pyproject.tomlis intentionally excluded (it carried a machine-local pyfly path tweak).