tests: unskip three tests with individual issues in train#5894
Conversation
The MPI driver script used absolute imports (from sagemaker.train.container_drivers...) which fail at runtime in the training container because sagemaker-train is not installed there. The driver scripts are copied to /opt/ml/input/data/sm_drivers/ and executed directly by the container entrypoint. Changed to sys.path-based relative imports matching the pattern used by torchrun_driver.py, which works correctly in the container environment.
Remove cross-account dependency in test_benchmark_evaluation_nova_model by replacing resources from account 052150106756 with our test account (729646638167) in us-east-1. Also removed mlflow_tracking_server_arn since no MLflow server exists in us-east-1. Test remains skipped pending us-east-1 test infrastructure migration to a dedicated test account.
There was a problem hiding this comment.
Is this failure of sagemaker.train not being available a cause of sagemaker not being installed within the training container (or the correct version of sagemaker)? I'm not sure relative paths is the correct fix here, although feel free to correct me if I'm wrong
There was a problem hiding this comment.
Good question! The driver scripts (mpi_driver.py, torchrun_driver.py, etc.) are not meant to run as part of an installed sagemaker-train package. They're uploaded as standalone files to /opt/ml/input/data/sm_drivers/ in the training container and executed directly by the entrypoint bash script. The container uses a standard AWS DLC image (e.g., pytorch-training:2.0.0-cpu-py310) which doesn't have sagemaker-train installed — and it shouldn't need to.
torchrun_driver.py already uses sys.path.insert + relative module imports and works correctly. mpi_driver.py was the only driver using absolute imports, which was inconsistent with the rest of the codebase. I'm actually mirroring what we have done and is running fine here.
There was a problem hiding this comment.
Makes sense, I see in PySDK V2 that a similar approach is being used here too, thanks for the clarification!
test_hp_contract_mpi_script-- unskip with fixtest_benchmark_evaluation_base_model_only-- unskiptest_benchmark_evaluation_nova_model-- remain skipped but eliminate dependency to cross account resourcestest_hp_contract_mpi_scriptfixProblem:
mpi_driver.pyused absolute imports (from sagemaker.train.container_drivers.distributed_drivers.mpi_utils import ...), but thesagemaker-trainpackage is not installed in the training container. The driver scripts are uploaded as standalone files to/opt/ml/input/data/sm_drivers/and executed directly, so the absolute imports fail withModuleNotFoundError: No module named 'sagemaker.train'.Fix: Changed
mpi_driver.pyto usesys.path.insert+ relative module name imports, matching the pattern already used bytorchrun_driver.py, which works correctly in the container environment.