Description
PySDK Version
- [ ] PySDK V2 (2.x)
- [x] PySDK V3 (3.x)
Describe the bug
When submitting a standalone SageMaker Training Job using ModelTrainer, the job completes successfully, while the same training job fails with the following error when executed within a HyperparameterTuner instance.
RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
(in my config I have: per_device_train_batch_size=4, gradient_accumulation_steps=4)
Note: The standalone job uses the sm_drivers channel and torchrun_driver.py, launching 4 processes across 4 GPUs. The tuning job instead takes the legacy framework container path (sagemaker_pytorch_container), which invokes python train.py directly and therefore runs on a single GPU only.
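The two launch paths can be distinguished from inside the entrypoint itself. This is a hypothetical sketch (the function name and return strings are assumptions chosen to match the log lines below): torchrun exports LOCAL_RANK/WORLD_SIZE, while the legacy container path invokes the script with a bare interpreter and leaves them unset.

```python
def detect_training_mode(env):
    # torchrun (sm_drivers path) sets LOCAL_RANK and WORLD_SIZE for each worker;
    # the legacy framework container path invokes `python train.py` without them.
    if "LOCAL_RANK" in env:
        return f"Distributed | World size: {env.get('WORLD_SIZE', '1')}"
    return "Single-process | World size: 1"

# torchrun-style environment, 4 GPUs on one node
print(detect_training_mode({"LOCAL_RANK": "0", "WORLD_SIZE": "4"}))
# legacy container path: no rank variables present
print(detect_training_mode({}))
```

In a real train.py this check would read os.environ; the dict parameter here just keeps the sketch testable in isolation.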
Claude log analysis:
In TRL 0.24.0's compute_loss:
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()
With your config: per_device_train_batch_size=4, gradient_accumulation_steps=4. In single-GPU mode, TRL's num_items_in_batch logic produces per_token_entropy with batch dim 4, but attention_mask reflects the accumulated batch 4×4=16. This mismatch doesn't occur in distributed mode because batch dimensions are correctly partitioned across 4 ranks.
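The failure mode can be illustrated without TRL or torch: under the analysis above, the elementwise product multiplies a tensor with batch dim 4 against one with batch dim 16, and since the dims are neither equal nor 1, broadcasting fails. A minimal pure-Python sketch of the NumPy/PyTorch broadcasting rule (the shapes and helper are illustrative assumptions):

```python
def broadcastable(shape_a, shape_b):
    # NumPy/PyTorch-style broadcasting: aligned trailing dimensions
    # must either match or be 1.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

# per_token_entropy with micro-batch dim 4 vs attention_mask with the
# accumulated batch 4 * 4 = 16 (seq length 10 is arbitrary here)
print(broadcastable((4, 10), (16, 10)))   # False -> RuntimeError in torch
print(broadcastable((16, 10), (16, 10)))  # True  -> shapes seen in distributed mode
```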
Quick verification: Check what model_trainer.input_data_config contains when submitted standalone vs. via tuner — the sm_drivers channel is very likely missing from the tuner's job definition, which is the proximate cause of the different execution path.
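A hedged sketch of that verification, diffing channel names between the two jobs' DescribeTrainingJob responses (the job names in the commented usage are placeholders; only the dict-handling helpers are exercised here):

```python
def channel_names(description):
    # Extract channel names from a DescribeTrainingJob response dict
    return sorted(c["ChannelName"] for c in description.get("InputDataConfig", []))

def missing_channels(standalone_desc, tuner_desc):
    # Channels present in the standalone job but absent from the tuner-spawned one
    return sorted(set(channel_names(standalone_desc)) - set(channel_names(tuner_desc)))

# Usage against real jobs (job names are placeholders):
# import boto3
# sm = boto3.client("sagemaker")
# standalone = sm.describe_training_job(TrainingJobName="my-standalone-job")
# tuned = sm.describe_training_job(TrainingJobName="my-tuner-spawned-job")
# print(missing_channels(standalone, tuned))
```

If the hypothesis holds, sm_drivers should appear in the diff for the tuner-spawned job.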
To reproduce
Launching Sagemaker Training Job:
training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB
distributed_strategy = Torchrun()
model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
    hyperparameters={"learning_rate": 5e-5},
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])
Launching HyperParameter tuning jobs:
from sagemaker.core.parameter import ContinuousParameter
from sagemaker.train.tuner import HyperparameterTuner
metric_definitions = [{
    "Name": "eval_macro_f1",
    # Note: a lazy `(.*?)` at the end of a pattern matches the empty string;
    # a concrete numeric group is needed to actually capture the metric value.
    "Regex": "eval_macro_f1: ([0-9.]+)",
}]
learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type="Logarithmic",
)
hyperparameter_ranges = {"learning_rate": learning_rate}
tuner = HyperparameterTuner(
    model_trainer=model_trainer,
    objective_metric_name="eval_macro_f1",
    metric_definitions=metric_definitions,
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=2,
    max_parallel_jobs=2,
)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])
Expected behavior
Training jobs spawned by the HyperparameterTuner terminate without errors, using the same torchrun-based distributed execution path as the standalone ModelTrainer job.
Screenshots or logs
Sagemaker Training Job:
Running Torchrun Driver
torchrun --nnodes=1 --nproc_per_node=4 train.py
Distributed training initialized: Rank 0/4, 1/4, 2/4, 3/4
Training mode: Distributed | World size: 4
HyperParameter Tuning job:
Invoking script with the following command:
/usr/local/bin/python train.py --learning_rate 5.336332469583832e-05
2026-03-09 09:52:33 - root - WARNING - No LOCAL_RANK found in environment...
2026-03-09 09:52:33 - root - INFO - Using single GPU: cuda:0
Training mode: Single-process | World size: 1
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 3.5.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version:
- Python version: 3.13
- CPU or GPU: GPU
- Custom Docker image (Y/N): N