Skip to content

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2

Open
poche123 wants to merge 3 commits into
mainfrom
fix/nl2sql-grounding-and-robustness
Open

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2
poche123 wants to merge 3 commits into
mainfrom
fix/nl2sql-grounding-and-robustness

Conversation

@poche123
Copy link
Copy Markdown

@poche123 poche123 commented Jun 2, 2026

Why

End-to-end testing against two real, messy Excel workbooks (a 25k-row financial MTP table and an 8-sheet sales-structure workbook) exposed that the NL→SQL pipeline guessed SQL literals (e.g. Year IN (2023) when values are FY23, Brand='DAPA' when DAPA lives in Family/TA, summing a scaled FY instead of FY (Real)) and that the control flow reported wrong-but-empty results as confident answers. This PR makes the agents anchor on real data and hardens the query + ingestion pipelines.

Result: a question set that previously answered ~0/9 now answers Section 1 (financial) 4/4 exact and Section 2 (org/sales) 4/5, with the remaining miss flagged rather than confidently wrong.

What changed

Grounding / value anchoring

  • profile: store the full distinct value set for low-cardinality columns (was top-5) so filter/CASE literals are copied verbatim.
  • retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the grounding and generation prompts; instructions for tall/EAV layouts and scaled-duplicate measures.
  • Structurally detect self-referencing hierarchy columns (value containment, with a row-wise discriminator so a duplicate name column isn't mistaken for a manager column) → "a person's team/reports/structure" filters the hierarchy column.

Control flow / governance

  • 0-row / degenerate aggregate → clarification + confidence cap (no more confident empties).
  • Auto-learn gated on row_count > 0.
  • Prefer a candidate that passes the firewall and returns non-empty rows over candidates[0].
  • Exclude CTE names from AST table_refs (the firewall was rejecting valid CTE SQL).
  • Snapshot-scope retrieval; honor + persist conversation snapshot pins.
  • Unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards (they previously ran blind).
  • Thread a re-firewalled extra_filter through the semantic-metric bind().

Ingestion robustness / determinism

  • Header-quality scoring (skip numeric spacer rows) + stable leading-blank handling.
  • temperature=0 for naming/describe agents → re-ingest schema churn 302 → 17 (~94%).
  • Rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors).
  • Index column values into embeddings + content_tsv; warn when the reranker degrades to a no-op.

Validation (vs ground truth computed directly from the Parquet)

Result
DAPA Brazil revenue FY23/24/25 ✅ 175.6 / 127.5 / 119.9M
Operating profit CAMCAR FY24 ✅ $252,519,261
Field:HQ Mexico FY23/FY26 ✅ 3.49 / 3.40
CAGR Imfinzi/Lynparza/Tagrisso ✅ 96.1 / 22.1 / 19.0%
días-en-campo, consecución, team dimensioning, structure movements ✅ 4/5 (e.g. ANA consecución 0.911 exact)

Known limitations / follow-ups

  • Two-hop questions ("the BUM of Calquence's direct reports") aren't resolved — needs nested-entity grounding (the generator stops at the first hop). Tracked as a follow-up.
  • Subtotal detection is implemented conservatively but not surfaced as an exclusion directive (name-based exclusion was net-harmful on additive Total_* buckets); a structural value-aggregate == sum-of-others confirmation would re-enable it.
  • New logic is not yet unit-tested; task lint not run in this PR.

Note

pyproject.toml is intentionally excluded (it carried a machine-local pyfly path tweak).

Jose Agustin Puente added 3 commits June 2, 2026 10:08
…ion, pipeline robustness

Surface column values into the grounding/generation prompts and harden the
query + ingestion pipelines so the NL→SQL agents stop guessing literals and the
control flow stops reporting wrong-but-empty results as confident answers.

Grounding / value anchoring:
- profile: store the full distinct value set for low-cardinality columns (was top-5)
- retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the
  grounding and generation prompts; instructions to copy WHERE/CASE literals verbatim,
  handle tall/EAV layouts and scaled-duplicate measures
- structurally detect self-referencing hierarchy columns (value containment, with a
  row-wise discriminator) and teach "a person's team/reports/structure" -> filter the
  hierarchy column

Control flow / governance:
- treat 0-row / degenerate aggregate results as suspicious -> clarification + confidence cap
- gate auto-learning on row_count > 0 (stop poisoning the example store)
- prefer a candidate that passes the firewall and returns non-empty rows over candidates[0]
- exclude CTE names from AST table_refs (firewall no longer rejects valid CTE SQL)
- snapshot-scope retrieval; honor and persist conversation snapshot pins
- unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards
- thread a re-firewalled extra_filter through the semantic-metric bind()

Ingestion robustness / determinism:
- header-quality scoring (skip numeric spacer rows) + stable leading-blank handling
- temperature=0 for the naming/describe agents (cuts re-ingest schema churn ~94%)
- rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors)
- index column values into embeddings + content_tsv
- warn when the cross-encoder reranker degrades to a no-op
- builder.py is a lock-step (canon-pinned) module; revert it and pass the
  determinism setting via the existing extra_settings={"temperature": 0.0}
  instead of a new parameter, so the lockstep gate stays green
- move the table-kinds lookup out of the query controller into TableResolver
  (service layer) so the "no raw SQL in controllers" gate passes
- ruff: inline a negated return (SIM103), drop a forward-ref quote (UP037),
  and apply ruff format
- tests: extend the fake TableResolvers with pins / current_snapshots /
  table_kinds_by_name; seed current_snapshot_id in the retrieval integration
  fixture (retrieval is now scoped to current_snapshot_id, as publish sets it)
Financial exports (Orbis/BvD) place one date-header row above a stack of
sub-sections. Section-splitting orphaned that header above each
sub-section's title, dropping the year columns and making the detailed
financial tables (Ratios de Rentabilidad, Memo lineas, etc.) unqueryable
by period.

A data-first section now inherits the nearest recent period header as its
column row, encoded as a backward-compatible 3-index section path
([header:data_start:end]); contiguous sections keep the legacy 2-index form.

Guards against misfires:
- period values are full-match only (INV-2024-0007 / Form 2020 are not years)
- only a single leftmost label column may precede the inherited years
- a label-headed table closes the period band (no stale-header bleed)
- phantom null columns are compacted; header-row index is bounds-checked

Adds unit coverage for inheritance, band-closing/no-false-inherit, path
round-trip and legacy-path parsing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant