Why
cli#103 re-vendored internal/schema/ingest.v1.json after data-ingestors#805 added the causal_language_modeling task category, and registered it in internal/push/category.go as recognize-but-not-supported (CLISupported: false + an UnsupportedNote). So the CLI now fails cleanly on such a dataset instead of mis-handling it — but it still can't push one. This ticket implements full support and flips it on.
Schema contract (from ingest.v1.json)
causal_language_modeling (Family: text):
Requires texts — a directory of files referenced by the labels CSV, each either raw .txt (plain text) or a tab-separated prompt\tcompletion pair.
Self-supervised: unlike text_classification / token_classification, it does NOT require a label column.
Distinct from the supported masked_language_modeling (which uses sequences, not texts).
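The two file forms in the contract above could be told apart roughly as follows. This is only a sketch: the names TextForm, Example, and ParseTextFile are hypothetical, and the rule used here (no tab means raw text, exactly one tab means a prompt\tcompletion pair, anything else is rejected) is an assumption, since the ticket does not say how the data-ingestors validator distinguishes the forms.

```go
package main

import (
	"fmt"
	"strings"
)

// TextForm is a hypothetical label for the two file forms the schema allows.
type TextForm int

const (
	RawText          TextForm = iota // plain .txt content
	PromptCompletion                 // one prompt<TAB>completion pair
)

// Example holds the parsed content of one file under texts/.
type Example struct {
	Form       TextForm
	Text       string // set for RawText
	Prompt     string // set for PromptCompletion
	Completion string // set for PromptCompletion
}

// ParseTextFile classifies content as raw text or a prompt\tcompletion pair.
// Assumption: the form is inferred from the tab count, and both sides of a
// pair must be non-empty. The real validator may apply different rules.
func ParseTextFile(content string) (Example, error) {
	body := strings.TrimRight(content, "\r\n")
	switch n := strings.Count(body, "\t"); n {
	case 0:
		if strings.TrimSpace(body) == "" {
			return Example{}, fmt.Errorf("empty text file")
		}
		return Example{Form: RawText, Text: body}, nil
	case 1:
		prompt, completion, _ := strings.Cut(body, "\t")
		if prompt == "" || completion == "" {
			return Example{}, fmt.Errorf("prompt\\tcompletion pair has an empty side")
		}
		return Example{Form: PromptCompletion, Prompt: prompt, Completion: completion}, nil
	default:
		return Example{}, fmt.Errorf("expected at most one tab, found %d", n)
	}
}

func main() {
	for _, content := range []string{"Once upon a time.\n", "Translate 'chat':\tcat\n"} {
		ex, err := ParseTextFile(content)
		fmt.Println(ex.Form, err)
	}
}
```

Rejecting raw text that contains tabs is the strictest reading; if raw files may legitimately contain tabs, the form would have to come from somewhere else (for example the labels CSV), which the ticket leaves open.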
Scope (internal/push)
Implement discover/build for the texts layout — the closest implemented template is the FamilyText sibling masked_language_modeling / text_classification, adapted for: the texts/ directory, no label column, and accepting both the raw-.txt and the prompt\tcompletion forms.
Wire the category's discoverFn / buildFn / allowedFlags through the CategorySpec registry (the backend#828 P2 dispatch seam) so it no longer routes to the unsupported path.
Flip CategorySpec{causal_language_modeling}.CLISupported = true and drop its UnsupportedNote.
Parity test (the cli↔ingestor contract): cover the texts layout (raw-.txt and prompt\tcompletion), mirroring the masked_language_modeling tests, so the Go discover/build can't drift from the data-ingestors validator.
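To make the wiring and the flip concrete, here is a minimal sketch of a registry-based dispatch. The field names CategorySpec, CLISupported, UnsupportedNote, discoverFn, buildFn, and allowedFlags come from the ticket; the function signatures, the map-based registry, the Dispatch helper, ErrUnsupported, and the hypothetical_future_task entry are assumptions made for illustration, not the actual contents of internal/push/category.go.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// CategorySpec uses the field names from the ticket; the signatures are assumed.
type CategorySpec struct {
	Family          string
	CLISupported    bool
	UnsupportedNote string
	discoverFn      func(root string) ([]string, error)
	buildFn         func(files []string) (string, error)
	allowedFlags    []string
}

// ErrUnsupported is a hypothetical sentinel for the recognize-but-unsupported path.
var ErrUnsupported = errors.New("category recognized but not supported by the CLI")

// discoverTexts and buildTexts are placeholders standing in for the real
// discover/build implementations of the texts layout.
func discoverTexts(root string) ([]string, error) {
	return []string{filepath.Join(root, "texts")}, nil
}

func buildTexts(files []string) (string, error) {
	return fmt.Sprintf("payload built from %d path(s)", len(files)), nil
}

var registry = map[string]CategorySpec{
	"causal_language_modeling": {
		Family:       "text",
		CLISupported: true, // flipped on; UnsupportedNote dropped
		discoverFn:   discoverTexts,
		buildFn:      buildTexts,
		allowedFlags: nil, // the real flag set is not specified in the ticket
	},
	// Hypothetical entry showing the state a category is in before the flip.
	"hypothetical_future_task": {
		Family:          "text",
		CLISupported:    false,
		UnsupportedNote: "schema is vendored but discover/build is not implemented",
	},
}

// IsCLISupported reports whether a category is registered and switched on.
func IsCLISupported(category string) bool {
	spec, ok := registry[category]
	return ok && spec.CLISupported
}

// Dispatch routes a category to its discover/build pair, or fails cleanly.
func Dispatch(category, root string) (string, error) {
	spec, ok := registry[category]
	if !ok {
		return "", fmt.Errorf("unknown category %q", category)
	}
	if !spec.CLISupported {
		return "", fmt.Errorf("%w: %s (%s)", ErrUnsupported, category, spec.UnsupportedNote)
	}
	files, err := spec.discoverFn(root)
	if err != nil {
		return "", err
	}
	return spec.buildFn(files)
}

func main() {
	out, err := Dispatch("causal_language_modeling", "dataset")
	fmt.Println(out, err)
}
```

With this shape, turning a category on is a data change in the registry (set CLISupported, attach the two functions) rather than a new branch in the push code, which is presumably what the dispatch seam is for.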
Acceptance
A causal_language_modeling dataset (raw .txt) pushes end-to-end; one with prompt\tcompletion pairs pushes; IsCLISupported("causal_language_modeling") is true; the recognize-but-unsupported error path no longer fires for it.
A dataset that supplies a label column for this category gets a clear self-supervised/no-label error (not a raw backend 400).
Parity + push tests green (go build/vet/test ./...).
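The no-label criterion above could be checked client-side along these lines, so the user sees a clear message instead of a backend 400. This is a hedged sketch: CheckNoLabelColumn and ErrLabelNotAllowed are hypothetical names, the case-insensitive match on the header is an assumption, and the file column used in the usage example is a placeholder, since the ticket does not name the CSV column that references the files.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// ErrLabelNotAllowed is a hypothetical error for the self-supervised/no-label case.
var ErrLabelNotAllowed = errors.New(
	"causal_language_modeling is self-supervised: remove the label column from the labels CSV")

// CheckNoLabelColumn reads only the header row of the labels CSV and rejects
// a label column. Assumption: the column is matched case-insensitively after
// trimming surrounding whitespace.
func CheckNoLabelColumn(csvText string) error {
	header, err := csv.NewReader(strings.NewReader(csvText)).Read()
	if err != nil {
		return fmt.Errorf("reading labels CSV header: %w", err)
	}
	for _, col := range header {
		if strings.EqualFold(strings.TrimSpace(col), "label") {
			return ErrLabelNotAllowed
		}
	}
	return nil
}

func main() {
	// "file" is a placeholder column name, not taken from the schema.
	fmt.Println(CheckNoLabelColumn("file\ntexts/a.txt\n"))
	fmt.Println(CheckNoLabelColumn("file,label\ntexts/a.txt,positive\n"))
}
```

Running the check before any upload keeps the failure local and lets the error text explain why a label is wrong for this category.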
Notes
Same disease-family as the other ingestion work under backend#828 (the category dispatch + deep-preflight axis); ships the discover/build for one more category. Sibling masked_language_modeling is the implementation template.
Work Type: Feature · Squad: DevEx / Data Science (backend#828) · Area: SDK (CLI) · Priority: P2
Follow-up to cli#103 (deferred there) · Epic: backend#828 (CLI ingestion consolidation) · Schema source: data-ingestors#805
Scope (internal/push) also covers preflight: check that texts/ is present and referenced by the CSV; reject a stray label column for this category with a clear "self-supervised, no label" message (mirrors the existing shallow-preflight class, cli#69 / #73, "dataset push: tabular pre-flight silently collapses duplicate header columns and accepts zero-data-row CSVs"); and validate the prompt\tcompletion tab format when that form is used.