TrainCheck is a lightweight, invariant-based instrumentation and analysis tool for identifying silent correctness issues in PyTorch training pipelines. It infers behavioral invariants from correct reference runs (e.g., official examples or clean configurations), then checks other scripts for violations of those invariants. TrainCheck is designed to be minimally intrusive: it requires no code modifications or rewrites of training logic.
TrainCheck consists of four user-facing command-line utilities:
- traincheck-collect – Instruments a training pipeline and collects trace logs.
- traincheck-infer – Infers behavioral invariants from the collected traces.
- traincheck-onlinecheck – Checks a target trace folder while training is still running.
- traincheck-check – Checks completed traces against inferred invariants.
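As a quick sanity check before starting a workflow, the sketch below looks up the four utilities on the current PATH. It assumes the TrainCheck installation exposes them as executables under exactly these names; the helper `locate_commands` is a hypothetical name introduced here, not part of TrainCheck.

```python
import shutil
from typing import Dict, Optional, Sequence

# The four user-facing utilities named in this overview.
TRAINCHECK_COMMANDS = (
    "traincheck-collect",
    "traincheck-infer",
    "traincheck-onlinecheck",
    "traincheck-check",
)


def locate_commands(
    names: Sequence[str] = TRAINCHECK_COMMANDS,
) -> Dict[str, Optional[str]]:
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


found = locate_commands()
for name, path in found.items():
    # Report each utility without assuming any of them is installed.
    print(f"{name}: {path if path is not None else 'not found on PATH'}")
```

If any entry is reported as not found, the environment where TrainCheck was installed is probably not the active one.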
TrainCheck workflows are organized into two stages:
- Inference Stage
  - traincheck-collect collects execution traces from reference training pipelines.
  - traincheck-infer analyzes traces and produces invariants that describe correct/expected runtime behavior.
- Checking Stage
  - traincheck-collect traces the target pipeline with --invariants for selective collection.
  - traincheck-onlinecheck verifies traces while the target run is active.
  - traincheck-check verifies completed traces after collection finishes.
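The two stages can be illustrated with a deliberately tiny model of invariant inference and checking. This is a conceptual sketch only: the trace format, the single co-occurrence relation, and the function names `infer_invariants` and `check_trace` are assumptions made for illustration and do not reflect TrainCheck's actual trace schema, relations, or internals.

```python
from itertools import permutations
from typing import FrozenSet, List, Set, Tuple

Step = FrozenSet[str]        # names of the calls observed in one training step
Trace = List[Step]           # one run: an ordered list of steps
Invariant = Tuple[str, str]  # (a, b): "any step that contains a also contains b"


def infer_invariants(reference_traces: List[Trace]) -> Set[Invariant]:
    """Inference stage (toy): keep every pair (a, b) that holds in all reference steps."""
    steps = [step for trace in reference_traces for step in trace]
    events = set().union(*steps)
    invariants: Set[Invariant] = set()
    for a, b in permutations(sorted(events), 2):
        supporting = [step for step in steps if a in step]
        # Require at least one step containing a, and b present in all of them.
        if supporting and all(b in step for step in supporting):
            invariants.add((a, b))
    return invariants


def check_trace(trace: Trace, invariants: Set[Invariant]) -> List[Tuple[int, str, str]]:
    """Checking stage (toy): list (step index, a, b) wherever a occurs without b."""
    violations = []
    for index, step in enumerate(trace):
        for a, b in sorted(invariants):
            if a in step and b not in step:
                violations.append((index, a, b))
    return violations


# Reference run: every step calls zero_grad, backward, and step.
reference = [[
    frozenset({"zero_grad", "backward", "step"}),
    frozenset({"zero_grad", "backward", "step"}),
]]

# Target run: the second step silently skips zero_grad.
target = [
    frozenset({"zero_grad", "backward", "step"}),
    frozenset({"backward", "step"}),
]

invariants = infer_invariants(reference)
violations = check_trace(target, invariants)
for index, a, b in violations:
    print(f"step {index}: saw {a} without {b}")
```

In this toy target run, the second step omits zero_grad, so both inferred invariants that require it are reported for step index 1. The training script itself raises no error in such a case, which is the kind of silent issue the checking stage is meant to surface.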
In most cases, you do not need to infer invariants yourself. TrainCheck provides a high-quality set of pre-inferred invariants that work out of the box with popular libraries such as PyTorch, HuggingFace Transformers, and DeepSpeed.
You may still want to run inference in the following cases:
- When using niche features not covered by the default invariants.
- When working with custom training stacks outside the supported libraries.
- When you want to increase specificity by inferring invariants from a set of related, known-good pipelines (e.g., in industrial settings).
Start with "Use TrainCheck" for the workflow. Use these pages as command references:
- Collecting Traces with traincheck-collect: Collection modes, config files, model tracking, selective collection, and sampling.
- Inferring Invariants with traincheck-infer: Trace inputs, invariant outputs, relation filtering, and inference options.
- Checking Violations: Live checking, offline checking, reports, and integrations.