This guide is for an ML engineer who has a training script and wants to know which TrainCheck command to run next.
TrainCheck uses a reference run to learn invariants: rules that describe normal training behavior. It then checks a target run against those invariants and reports violations.
Start with a short, known-good run. The run can come from your own training code, a previous clean run, or an official example that uses the same framework features.
traincheck-collect \
--pyscript reference.py \
--models-to-track model \
--output-dir reference_trace--models-to-track model names the Python variable that holds the model in reference.py. If your script uses a different variable name, use that name instead.
TrainCheck writes trace files and an env_dump.txt file into reference_trace/.
Use the reference trace to produce an invariant file:
traincheck-infer -f reference_trace -o invariants.jsonYou can pass multiple reference trace folders when one run does not cover enough behavior:
traincheck-infer -f reference_trace_1 reference_trace_2 -o invariants.jsonMore diverse reference traces reduce overfitting. Short runs are usually enough because training loops repeat the same operations many times.
Run the target script with the inferred invariants:
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--output-dir target_tracePassing --invariants enables selective trace collection. TrainCheck traces the APIs and variables needed by the invariant file instead of collecting a full reference-style trace.
For long target runs, sample steps during collection:
traincheck-collect \
--pyscript target.py \
--models-to-track model \
--invariants invariants.json \
--sampling-interval 10 \
--warm-up-steps 10 \
--output-dir target_traceThis traces the warm-up steps, then traces every tenth step. Sampling is a traincheck-collect option. The checker reads the collected trace; it does not control which steps were traced.
For live checking, start the online checker while traincheck-collect is still writing into target_trace/:
traincheck-onlinecheck -f target_trace -i invariants.jsonIf traincheck-onlinecheck fails with a missing watchdog package, install it in the same environment:
pip install watchdogThe easier path is offline checking. Wait for trace collection to finish, then run:
traincheck-check -f target_trace -i invariants.jsonUse offline checking first when you are learning TrainCheck or reproducing an issue locally. Use live checking when you want violations while the training job is still running.
traincheck-collect writes:
trace_*.jsonfiles for API traces.proxy_log.jsonwhen model-variable tracking is active.env_dump.txtwith the collection arguments.- an instrumented copy of the training script and execution logs.
traincheck-infer writes:
invariants.jsonby default, or the path passed with-o.
traincheck-check and traincheck-onlinecheck write:
failed.logfor violated invariants.passed.logfor triggered invariants that passed.not_triggered.logfor invariants that never ran on the trace.violations_summary.jsonfor machine-readable violation summaries.report.htmlfor a browser-readable report.
Use a reference run that should be correct. If you have a clean run of the same pipeline, use it. If you are debugging a new pipeline, start with an official example or a smaller training job that uses the same optimizer, precision mode, and distributed setup.
Use a target run for the script you want to check. It can be a modified version of the reference script, a larger training job, or a run that already looks suspicious.
Keep both runs short while you are iterating. TrainCheck needs representative behavior more than long training time.
Use a config file when the command line gets long:
traincheck-collect --use-config --config traincheck.ymlA minimal config looks like this:
pyscript: ./train.py
models_to_track:
- model
modules_to_instr:
- torch
output_dir: traincheck_traceWhen using a config file, use underscores in config keys, such as output_dir, instead of CLI-style hyphens.
Use a shell script when your training command needs environment variables or launcher arguments:
pyscript: ./train.py
shscript: ./run.sh
models_to_track:
- modelUse --copy-all-files if the training script reads local files through relative paths.
TrainCheck currently instruments PyTorch eager-mode execution. If your script uses torch.compile, pass --use-torch-compile so TrainCheck can keep compatibility behavior explicit.
Tracing adds overhead. Start with short runs, selective target tracing, and step sampling before scaling to a long job.