cli — full causal_language_modeling push support: discover/build for the texts layout + flip CLISupported (cli#103 follow-up) #105

Description

@LukasWodka

Follow-up to cli#103 (deferred there) · Epic: backend#828 (CLI ingestion consolidation) · Schema source: data-ingestors#805

Why

cli#103 re-vendored internal/schema/ingest.v1.json after data-ingestors#805 added the causal_language_modeling task category, and registered it in internal/push/category.go as recognize-but-not-supported (CLISupported: false + an UnsupportedNote). So the CLI now fails cleanly on such a dataset instead of mis-handling it — but it still can't push one. This ticket implements full support and flips it on.

Schema contract (from ingest.v1.json)

causal_language_modeling (Family: text):

  • Requires texts — a directory of files referenced by the labels CSV, each either raw .txt (plain text) or a tab-separated prompt\tcompletion pair.
  • Self-supervised → unlike text_classification / token_classification, it does NOT require a label column.
  • Distinct from the supported masked_language_modeling (which uses sequences, not texts).
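
To make the two accepted forms concrete, here is a minimal sketch of classifying one entry as raw text or a prompt\tcompletion pair. `textRecord`, `parseTextRecord`, and `errBadPair` are hypothetical names, and the rule used (no tab means raw text, exactly one tab means a pair) is an assumption; the authoritative rule lives in ingest.v1.json and the data-ingestors validator.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// textRecord is a hypothetical in-memory form of one texts/ entry.
type textRecord struct {
	Prompt     string // empty for raw text
	Completion string // the whole text for raw entries
	IsPair     bool
}

var errBadPair = errors.New("expected exactly one tab separating a non-empty prompt and completion")

// parseTextRecord classifies one line.
// Assumption (not from the schema): no tab means raw text, exactly one tab
// means a prompt/completion pair, anything else is rejected.
func parseTextRecord(line string) (textRecord, error) {
	line = strings.TrimRight(line, "\r\n")
	switch strings.Count(line, "\t") {
	case 0:
		return textRecord{Completion: line}, nil
	case 1:
		prompt, completion, _ := strings.Cut(line, "\t")
		if prompt == "" || completion == "" {
			return textRecord{}, errBadPair
		}
		return textRecord{Prompt: prompt, Completion: completion, IsPair: true}, nil
	default:
		return textRecord{}, errBadPair
	}
}

func main() {
	for _, line := range []string{"Once upon a time.", "Translate: chat\tcat", "a\tb\tc"} {
		rec, err := parseTextRecord(line)
		fmt.Printf("%q -> %+v, err=%v\n", line, rec, err)
	}
}
```

If raw text is allowed to contain tabs, the two forms would have to be told apart some other way (for example per file rather than per line); that decision belongs to the schema, not this sketch.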

Scope (internal/push)

  • Implement discover/build for the texts layout — the closest implemented template is the FamilyText sibling masked_language_modeling / text_classification, adapted for: the texts/ directory, no label column, and accepting both the raw-.txt and the prompt\tcompletion forms.
  • Wire the category's discoverFn / buildFn / allowedFlags through the CategorySpec registry (the backend#828 P2 dispatch seam) so it no longer routes to the unsupported path.
  • Flip CategorySpec{causal_language_modeling}.CLISupported = true and drop its UnsupportedNote.
  • Preflight/validation: confirm texts/ is present and referenced by the CSV; reject a stray label column for this category with a clear "self-supervised: no label" message (mirrors the existing shallow-preflight class, cli#69 and #73, "dataset push: tabular pre-flight silently collapses duplicate header columns and accepts zero-data-row CSVs"); validate the prompt\tcompletion tab format when that form is used.
  • Parity test (the cli↔ingestor contract): cover the texts layout (raw-.txt and prompt\tcompletion), mirroring the masked_language_modeling tests, so the Go discover/build can't drift from the data-ingestors validator.
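
The registry wiring in the scope items above might look roughly like the following sketch. Only `CategorySpec`, `CLISupported`, `UnsupportedNote`, `discoverFn`, `buildFn`, and `IsCLISupported` are named in this ticket; the field types, the function signatures, the `registry` map, and `dispatch`, `discoverTexts`, and `buildTexts` are all assumptions for illustration, and the real definitions in internal/push/category.go may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// CategorySpec mirrors the fields the ticket names. The field types and
// function signatures are assumed, not taken from internal/push.
type CategorySpec struct {
	Family          string
	CLISupported    bool
	UnsupportedNote string
	discoverFn      func(root string) ([]string, error)
	buildFn         func(paths []string) error
}

// discoverTexts and buildTexts are hypothetical stand-ins for the real
// discover/build implementations of the texts layout.
func discoverTexts(root string) ([]string, error) {
	return []string{filepath.Join(root, "texts")}, nil
}

func buildTexts(paths []string) error { return nil }

// registry is a hypothetical stand-in for the CategorySpec registry.
var registry = map[string]CategorySpec{
	"causal_language_modeling": {
		Family:       "text",
		CLISupported: true, // flipped on; UnsupportedNote dropped
		discoverFn:   discoverTexts,
		buildFn:      buildTexts,
	},
}

// IsCLISupported reports whether a category is registered and enabled.
func IsCLISupported(category string) bool {
	spec, ok := registry[category]
	return ok && spec.CLISupported
}

// dispatch routes a push to the category's discover/build pair, or to the
// recognize-but-unsupported error when the category is not enabled.
func dispatch(category, root string) error {
	spec, ok := registry[category]
	if !ok {
		return fmt.Errorf("unknown task category %q", category)
	}
	if !spec.CLISupported {
		return fmt.Errorf("task category %q is recognized but not supported: %s", category, spec.UnsupportedNote)
	}
	paths, err := spec.discoverFn(root)
	if err != nil {
		return err
	}
	return spec.buildFn(paths)
}

func main() {
	fmt.Println(IsCLISupported("causal_language_modeling"))
	fmt.Println(dispatch("causal_language_modeling", "dataset"))
}
```

Keeping the supported check inside the dispatch function means a category with `CLISupported: false` can never reach its discover/build functions, even if they are wired.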

Acceptance

  • A causal_language_modeling dataset (raw .txt) pushes end-to-end; one with prompt\tcompletion pairs pushes; IsCLISupported("causal_language_modeling") is true; the recognize-but-unsupported error path no longer fires for it.
  • A dataset that supplies a label column for this category gets a clear self-supervised/no-label error (not a raw backend 400).
  • Parity + push tests green (go build/vet/test ./...).
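
The no-label criterion above could be enforced by a header-only preflight along these lines. This is a sketch under assumptions: `readHeader` and `checkNoLabelColumn` are hypothetical names, the offending column is assumed to be called `label` (matched case-insensitively), and the `text` column in the example is invented for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// readHeader returns the first row of the labels CSV.
func readHeader(r io.Reader) ([]string, error) {
	header, err := csv.NewReader(r).Read()
	if err != nil {
		return nil, fmt.Errorf("reading CSV header: %w", err)
	}
	return header, nil
}

// checkNoLabelColumn rejects a label column for a self-supervised category.
// Assumption: the column is named "label"; the real name comes from the schema.
func checkNoLabelColumn(header []string, category string) error {
	for _, col := range header {
		if strings.EqualFold(strings.TrimSpace(col), "label") {
			return fmt.Errorf("%s is self-supervised: remove the %q column, no label is expected", category, col)
		}
	}
	return nil
}

func main() {
	// "text" is a made-up column name for the file reference.
	header, err := readHeader(strings.NewReader("text,label\na.txt,pos\n"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(checkNoLabelColumn(header, "causal_language_modeling"))
}
```

Running the check locally before upload is what turns a raw backend 400 into a message the user can act on.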

Notes

Same problem family as the other ingestion work under backend#828 (the category dispatch + deep-preflight axis); this ticket ships the discover/build for one more category. Sibling masked_language_modeling is the implementation template.

Work Type: Feature · Squad: DevEx / Data Science (backend#828) · Area: SDK (CLI) · Priority: P2
