
SageMaker HyperparameterTuner fails to start distributed training #5610

@CoolFish88

Description


PySDK Version

  • PySDK V3 (3.x)

Describe the bug
When a standalone SageMaker Training Job is submitted using ModelTrainer, the job completes successfully, but the same training job fails with the following error when launched through a HyperparameterTuner instance:

RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
(in my config I have: per_device_train_batch_size=4, gradient_accumulation_steps=4)

Note: The standalone job uses the sm_drivers channel and torchrun_driver.py, launching 4 processes across 4 GPUs. The tuning job uses the legacy framework container path (sagemaker_pytorch_container), which invokes python train.py directly — single GPU only.

Claude log analysis:

In TRL 0.24.0's compute_loss:
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()
With your config: per_device_train_batch_size=4, gradient_accumulation_steps=4. In single-GPU mode, TRL's num_items_in_batch logic produces per_token_entropy with batch dim 4, but attention_mask reflects the accumulated batch 4×4=16. This mismatch doesn't occur in distributed mode because batch dimensions are correctly partitioned across 4 ranks.
Quick verification: Check what model_trainer.input_data_config contains when submitted standalone vs. via tuner — the sm_drivers channel is very likely missing from the tuner's job definition, which is the proximate cause of the different execution path.
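One way to act on that check, sketched with made-up channel names (the real lists come from the sagemaker:DescribeTrainingJob response for each job):

```python
# Hypothetical comparison of input channels between a standalone job and a
# tuner-spawned job; dicts mimic the shape of a DescribeTrainingJob response.

def channel_names(job_desc):
    """Return the set of input channel names from a training job description."""
    return {ch["ChannelName"] for ch in job_desc.get("InputDataConfig", [])}

standalone = {"InputDataConfig": [
    {"ChannelName": "train"}, {"ChannelName": "validation"},
    {"ChannelName": "config"}, {"ChannelName": "sm_drivers"},
]}
tuned = {"InputDataConfig": [
    {"ChannelName": "train"}, {"ChannelName": "validation"},
    {"ChannelName": "config"},
]}

print(channel_names(standalone) - channel_names(tuned))  # channels the tuner dropped
```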

To reproduce
Launching the SageMaker Training Job:

training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB
distributed_strategy = Torchrun()

model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
    hyperparameters={"learning_rate": 5e-5}
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])

Launching the hyperparameter tuning jobs:

from sagemaker.core.parameter import ContinuousParameter
from sagemaker.train.tuner import HyperparameterTuner


metric_definitions = [{
    "Name": "eval_macro_f1",
    # note: a trailing lazy group like (.*?) captures the empty string,
    # so the metric regex must anchor the capture explicitly
    "Regex": "eval_macro_f1: ([0-9.]+)"}]

learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type='Logarithmic')

hyperparameter_ranges = {"learning_rate": learning_rate}

tuner = HyperparameterTuner(model_trainer=model_trainer,
                            objective_metric_name="eval_macro_f1",
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=2,
                            max_parallel_jobs=2)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])

Expected behavior
Training jobs terminate without errors when spawned by the HyperparameterTuner.

Screenshots or logs

SageMaker Training Job:

Running Torchrun Driver
torchrun --nnodes=1 --nproc_per_node=4 train.py
Distributed training initialized: Rank 0/4, 1/4, 2/4, 3/4
Training mode: Distributed | World size: 4

HyperParameter Tuning job:

Invoking script with the following command:
/usr/local/bin/python train.py --learning_rate 5.336332469583832e-05
2026-03-09 09:52:33 - root - WARNING - No LOCAL_RANK found in environment...
2026-03-09 09:52:33 - root - INFO - Using single GPU: cuda:0
Training mode: Single-process | World size: 1
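The single-process fallback in the log above is consistent with a LOCAL_RANK check along these lines (a hypothetical sketch; the actual train.py is not shown in this report). The legacy container runs `python train.py` directly, so torchrun's environment variables are never set:

```python
import os

def describe_training_mode(env=os.environ):
    """Report the training mode based on torchrun-style environment variables."""
    if "LOCAL_RANK" not in env:
        # matches the logged warning: "No LOCAL_RANK found in environment..."
        return "Single-process | World size: 1"
    world_size = int(env.get("WORLD_SIZE", "1"))
    return f"Distributed | World size: {world_size}"

print(describe_training_mode({}))  # what the tuner-spawned job sees
print(describe_training_mode({"LOCAL_RANK": "0", "WORLD_SIZE": "4"}))
```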

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.5.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
  • Framework version:
  • Python version: 3.13
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

