Teich's public API is designed around three levels:
- high-level training prep with
prepare_data()andmask_data() - trace loading and conversion with
load_traces() - preflight helpers for validation, fitting, and previews
from teich import (
prepare_data,
mask_data,
load_traces,
detect_trace_type,
validate_tool_calls,
row_fits_context,
trace_is_complete,
preview_sft_example,
Config,
TrainingExample,
)Recommended entry point for training.
train_dataset = prepare_data(
source_or_dataset,
tokenizer,
max_length=32768,
oversized_policy="trim_followups",
tokenize=True,
chat_template_kwargs={"enable_thinking": True},
)Accepts:
- local file or folder
- Hugging Face dataset id
datasets.Dataset- list of sources
- source mix mapping with explicit ratios
Returns a trainer-friendly dataset with rendered text, Teich supervision spans, and optionally input_ids / attention_mask.
Useful options:
splitrevisiontoken/hf_tokencache_dirlocal_dirmax_examplesmax_lengthoversized_policypreserve_columnsreturn_reportvalidate_toolsstrictteich_maskingtokenizechat_template_kwargs
See Preparing Data.
Apply response-only labels to a trainer after trainer tokenization.
trainer = mask_data(
trainer,
tokenizer=tokenizer,
train_on_reasoning=True,
train_on_final_answers=True,
train_on_tools=True,
)By default, Teich supervises assistant reasoning, final answers, and tool calls. Prompt/context tokens stay -100.
Policy options:
train_on_reasoningtrain_on_final_answerstrain_on_toolstrain_on_usertrain_on_systemtrain_on_developertrain_on_tool_responsesmax_supervised_tokensaudittext_column
See Training.
Load and convert raw traces without running the full preparation pipeline.
dataset = load_traces("./output")Use this when you want to own rendering, filtering, tokenization, masking, and packing yourself.
By default, rows that end on a tool result are dropped because they are incomplete. Pass drop_incomplete_traces=False only for inspection or repair workflows.
Detect supported parsed raw trace events.
from teich import detect_trace_type
trace_type = detect_trace_type(events)Returns one of:
codexclaude_codedroidpiopenclawhermesexternal_agentNone
Factory droid CLI sessions are supported as a conversion-only source. Point prepare_data() or load_traces() at session JSONL files from ~/.factory/sessions/...; Teich reads the adjacent <session-id>.settings.json sidecar for model and token usage metadata when present.
result = validate_tool_calls(example)
result.raise_for_errors()Checks that assistant tool calls reference declared tools and include required arguments.
fits = row_fits_context(
example,
tokenizer,
max_length=32768,
chat_template_kwargs={"enable_thinking": True},
)Renders one row with the target chat template and checks whether it fits the target context window.
if not trace_is_complete(example):
...Returns False when a row ends on a tool result without a follow-up assistant turn.
Use preview_sft_example() before training or the dataset preview helper attached by mask_data().
from teich import preview_sft_example
preview = preview_sft_example(tokenizer, input_ids, labels)
print(preview)After mask_data():
print(trainer.train_dataset.preview(0, tokenizer))Previewing is the quickest way to confirm that reasoning, tool calls, and final answers are supervised while context is masked.
Config loads generation config:
from teich import Config
config = Config.from_yaml("config.yaml")TrainingExample is the typed representation used internally for converted rows.