cli — full causal_language_modeling push support: discover/build for the texts layout + flip CLISupported (cli#103 follow-up)

**Follow-up to cli#103** (deferred there) · **Epic:** backend#828 (CLI ingestion consolidation) · **Schema source:** data-ingestors#805

## Why

cli#103 re-vendored `internal/schema/ingest.v1.json` after data-ingestors#805 added the **`causal_language_modeling`** task category, and registered it in `internal/push/category.go` as **recognize-but-not-supported** (`CLISupported: false` + an `UnsupportedNote`). So the CLI now *fails cleanly* on such a dataset instead of mis-handling it — but it still **can't push one**. This ticket implements full support and flips it on.

## Schema contract (from `ingest.v1.json`)

`causal_language_modeling` (Family: **text**):
- Requires **`texts`** — a directory of files referenced by the labels CSV, each either **raw `.txt`** (plain text) **or** a tab-separated **`prompt\tcompletion`** pair.
- **Self-supervised** → unlike `text_classification` / `token_classification`, it does **NOT** require a `label` column.
- Distinct from the supported `masked_language_modeling` (which uses `sequences`, not `texts`).
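
To make the `prompt\tcompletion` form concrete, here is a minimal sketch of splitting one such record. The helper name `splitPromptCompletion` is hypothetical, and the rules it enforces (exactly one tab, both fields non-empty) are assumptions; the authoritative rules live in `ingest.v1.json` and the data-ingestors validator.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitPromptCompletion is a hypothetical helper, not the CLI's actual code.
// It splits one "prompt<TAB>completion" record into its two fields.
// Assumption: exactly one tab separator, and both fields non-empty.
func splitPromptCompletion(record string) (prompt, completion string, err error) {
	// Tolerate a trailing newline (LF or CRLF) on the record.
	record = strings.TrimRight(record, "\r\n")

	parts := strings.Split(record, "\t")
	if len(parts) != 2 {
		return "", "", fmt.Errorf("expected exactly one tab separator, found %d", len(parts)-1)
	}
	if parts[0] == "" || parts[1] == "" {
		return "", "", errors.New("prompt and completion must both be non-empty")
	}
	return parts[0], parts[1], nil
}

func main() {
	samples := []string{
		"Translate to French: cat\tchat\n", // well-formed pair
		"just some raw text",               // no tab: not a pair
		"a\tb\tc",                          // too many tabs
	}
	for _, s := range samples {
		prompt, completion, err := splitPromptCompletion(s)
		if err != nil {
			fmt.Printf("%q: invalid pair: %v\n", s, err)
			continue
		}
		fmt.Printf("%q: prompt=%q completion=%q\n", s, prompt, completion)
	}
}
```

How the CLI decides whether a given file is in the raw or the pair form is not specified here, so this sketch only covers the pair form once that decision has been made.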

## Scope (`internal/push`)

- Implement **discover/build** for the `texts` layout — the closest implemented template is the FamilyText sibling `masked_language_modeling` / `text_classification`, adapted for: the `texts/` directory, **no label column**, and accepting **both** the raw-`.txt` and the `prompt\tcompletion` forms.
- Wire the category's `discoverFn` / `buildFn` / `allowedFlags` through the CategorySpec registry (the backend#828 P2 dispatch seam) so it no longer routes to the unsupported path.
- Flip `CategorySpec{causal_language_modeling}.CLISupported = true` and drop its `UnsupportedNote`.
- **Preflight/validation:** confirm `texts/` is present and referenced by the CSV; reject a stray `label` column for this category with a clear "self-supervised — no label" message (mirrors the existing shallow-preflight class, cli#69/#73); validate the `prompt\tcompletion` tab format when that form is used.
- **Parity test** (the cli↔ingestor contract): cover the `texts` layout (raw-`.txt` and `prompt\tcompletion`), mirroring the `masked_language_modeling` tests, so the Go discover/build can't drift from the data-ingestors validator.
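
The stray-`label` preflight described above could look roughly like the following. This is a sketch under assumptions: `rejectLabelColumn` is a hypothetical name, the `text` column in the sample CSV is invented for illustration, and the case-insensitive match on `label` is a choice made here rather than something the schema states.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rejectLabelColumn is a hypothetical preflight check, not the CLI's actual code.
// It returns an error if the labels CSV header contains a "label" column,
// which a self-supervised category must not have.
// Assumption: the match is case-insensitive and ignores surrounding spaces.
func rejectLabelColumn(category string, header []string) error {
	for _, col := range header {
		if strings.EqualFold(strings.TrimSpace(col), "label") {
			return fmt.Errorf(
				"%s is self-supervised and takes no label: remove the %q column from the labels CSV",
				category, col)
		}
	}
	return nil
}

func main() {
	// Illustrative CSV; the "text" column name is an assumption.
	labelsCSV := "text,label\ntexts/a.txt,positive\n"

	// Only the header row is needed for this check.
	header, err := csv.NewReader(strings.NewReader(labelsCSV)).Read()
	if err != nil {
		fmt.Println("read header:", err)
		return
	}
	if err := rejectLabelColumn("causal_language_modeling", header); err != nil {
		fmt.Println("preflight failed:", err)
	}
}
```

Failing at this point lets the user see a category-specific message locally instead of a raw backend 400.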

## Acceptance

- A `causal_language_modeling` dataset (raw `.txt`) pushes end-to-end; one with `prompt\tcompletion` pairs pushes; `IsCLISupported("causal_language_modeling")` is `true`; the recognize-but-unsupported error path no longer fires for it.
- A dataset that supplies a `label` column for this category gets a clear self-supervised/no-label error (not a raw backend 400).
- Parity and push tests pass, and `go build ./...`, `go vet ./...`, and `go test ./...` are all green.
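
As a rough illustration of the `IsCLISupported` criterion, the sketch below models the registry flip. The field names `CLISupported` and `UnsupportedNote` come from this issue; the struct shape, the map-based registry, and the lookup logic are assumptions and will differ from the real `internal/push/category.go`.

```go
package main

import "fmt"

// CategorySpec is a simplified stand-in for the real struct.
// Assumption: the real type also carries discoverFn, buildFn, allowedFlags, etc.
type CategorySpec struct {
	CLISupported    bool
	UnsupportedNote string // empty once the category is supported
}

// registry is a hypothetical map-based registry keyed by category name.
var registry = map[string]CategorySpec{
	"masked_language_modeling": {CLISupported: true},
	// After this ticket: flag flipped on, UnsupportedNote dropped.
	"causal_language_modeling": {CLISupported: true},
}

// IsCLISupported reports whether the CLI can push the given category.
// Unknown categories are treated as unsupported.
func IsCLISupported(category string) bool {
	spec, ok := registry[category]
	return ok && spec.CLISupported
}

func main() {
	for _, c := range []string{"causal_language_modeling", "not_a_category"} {
		fmt.Printf("%s supported: %v\n", c, IsCLISupported(c))
	}
}
```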

## Notes

This belongs to the same class of ingestion work as the rest of backend#828 (the category dispatch and deep-preflight axis) and ships discover/build for one more category. The sibling `masked_language_modeling` is the implementation template.

_Work Type: Feature · Squad: DevEx / Data Science (backend#828) · Area: SDK (CLI) · Priority: P2_

