
SageMaker HyperparameterTuner fails to start distributed training #5610

@CoolFish88

Description


PySDK Version

  • PySDK V3 (3.x)

Describe the bug
When a standalone SageMaker Training Job is submitted using ModelTrainer, the job completes successfully, but the same training job fails with the following error when launched through a HyperparameterTuner instance:

RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
(in my config I have: per_device_train_batch_size=4, gradient_accumulation_steps=4)

Note: The standalone job uses the sm_drivers channel and torchrun_driver.py, launching 4 processes across 4 GPUs. The tuning job uses the legacy framework container path (sagemaker_pytorch_container), which invokes python train.py directly — single GPU only.

Claude log analysis:

In TRL 0.24.0's compute_loss:
entropy = torch.sum(per_token_entropy * attention_mask) / attention_mask.sum()
With your config: per_device_train_batch_size=4, gradient_accumulation_steps=4. In single-GPU mode, TRL's num_items_in_batch logic produces per_token_entropy with batch dim 4, but attention_mask reflects the accumulated batch 4×4=16. This mismatch doesn't occur in distributed mode because batch dimensions are correctly partitioned across 4 ranks.
Quick verification: Check what model_trainer.input_data_config contains when submitted standalone vs. via tuner — the sm_drivers channel is very likely missing from the tuner's job definition, which is the proximate cause of the different execution path.
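One way to act on that check, sketched with made-up channel names (the real lists come from the sagemaker:DescribeTrainingJob response for each job):

```python
# Hypothetical comparison of input channels between a standalone job and a
# tuner-spawned job; dicts mimic the shape of a DescribeTrainingJob response.

def channel_names(job_desc):
    """Return the set of input channel names from a training job description."""
    return {ch["ChannelName"] for ch in job_desc.get("InputDataConfig", [])}

standalone = {"InputDataConfig": [
    {"ChannelName": "train"}, {"ChannelName": "validation"},
    {"ChannelName": "config"}, {"ChannelName": "sm_drivers"},
]}
tuned = {"InputDataConfig": [
    {"ChannelName": "train"}, {"ChannelName": "validation"},
    {"ChannelName": "config"},
]}

print(channel_names(standalone) - channel_names(tuned))  # channels the tuner dropped
```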

To reproduce
Launching the SageMaker Training Job:

training_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04-v1.0"
training_mode = Mode.SAGEMAKER_TRAINING_JOB
distributed_strategy = Torchrun()

model_trainer = ModelTrainer(
    training_mode=training_mode,
    sagemaker_session=sagemaker_session,
    role=role,
    training_image=training_image,
    base_job_name=job_name,
    source_code=source_code,
    compute=compute,
    distributed=distributed_strategy,
    output_data_config=output,
    checkpoint_config=checkpoint_config,
    stopping_condition=stopping,
    environment={"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
    hyperparameters={"learning_rate": 5e-5}
)
model_trainer.train(wait=False, logs=True, input_data_config=[training_data, validation_data, config_data])

Launching the hyperparameter tuning jobs:

from sagemaker.core.parameter import ContinuousParameter
from sagemaker.train.tuner import HyperparameterTuner


metric_definitions = [{
    "Name": "eval_macro_f1",
    # note: a trailing lazy group like (.*?) captures the empty string,
    # so the metric regex must anchor the capture explicitly
    "Regex": "eval_macro_f1: ([0-9.]+)"}]

learning_rate = ContinuousParameter(
    min_value=1e-5,
    max_value=5e-4,
    scaling_type='Logarithmic')

hyperparameter_ranges = {"learning_rate": learning_rate}

tuner = HyperparameterTuner(model_trainer=model_trainer,
                            objective_metric_name="eval_macro_f1",
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=2,
                            max_parallel_jobs=2)
tuner.tune(wait=False, inputs=[training_data, validation_data, config_data])

Expected behavior
Training jobs terminate without errors when spawned by the HyperparameterTuner.

Screenshots or logs

SageMaker Training Job:

Running Torchrun Driver
torchrun --nnodes=1 --nproc_per_node=4 train.py
Distributed training initialized: Rank 0/4, 1/4, 2/4, 3/4
Training mode: Distributed | World size: 4

HyperParameter Tuning job:

Invoking script with the following command:
/usr/local/bin/python train.py --learning_rate 5.336332469583832e-05
2026-03-09 09:52:33 - root - WARNING - No LOCAL_RANK found in environment...
2026-03-09 09:52:33 - root - INFO - Using single GPU: cuda:0
Training mode: Single-process | World size: 1
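The single-process fallback in the log above is consistent with a LOCAL_RANK check along these lines (a hypothetical sketch; the actual train.py is not shown in this report). The legacy container runs `python train.py` directly, so torchrun's environment variables are never set:

```python
import os

def describe_training_mode(env=os.environ):
    """Report the training mode based on torchrun-style environment variables."""
    if "LOCAL_RANK" not in env:
        # matches the logged warning: "No LOCAL_RANK found in environment..."
        return "Single-process | World size: 1"
    world_size = int(env.get("WORLD_SIZE", "1"))
    return f"Distributed | World size: {world_size}"

print(describe_training_mode({}))  # what the tuner-spawned job sees
print(describe_training_mode({"LOCAL_RANK": "0", "WORLD_SIZE": "4"}))
```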

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.5.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
  • Framework version:
  • Python version: 3.13
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

