The hyp-pytorch-job template doesn't support requesting hugepages. EFA's RDMA stack uses 2Mi hugepages to reduce TLB misses during DMA transfers for inter-node NCCL communication. Without a way to request them through the CLI, users have to bypass it and write raw YAML.
HyperPod nodes already pre-allocate hugepages (e.g. ~10562Mi on ml.p4d.24xlarge), but the CLI template has no way to request them in the pod spec. The official SageMaker docs for HyperPodPyTorchJob show hugepages-2Mi: 5120Mi in the example job spec, confirming this is an expected resource to configure.
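For context, this is a sketch of the kind of resources block users currently have to hand-write in raw YAML; the hugepages-2Mi value follows the SageMaker docs example, while the container name and the GPU/EFA counts are illustrative assumptions, not values from this issue:

```yaml
# Illustrative fragment of a hand-written pod spec; only the resources
# block is the point here. Container name and device counts are examples.
containers:
  - name: pytorch-worker        # illustrative name
    resources:
      requests:
        hugepages-2Mi: 5120Mi   # value from the SageMaker docs example
      limits:
        hugepages-2Mi: 5120Mi   # Kubernetes requires hugepages limit == request
```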
Suggested config fields (integer values, interpreted as Mi):
hugepages_2mi: 5120
hugepages_2mi_limit: 5120
Which would generate:
resources:
  requests:
    hugepages-2Mi: 5120Mi
  limits:
    hugepages-2Mi: 5120Mi
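A minimal sketch of how the two proposed fields could be rendered into the resources block. The function name and the defaulting behavior (limit falling back to the request) are assumptions for illustration, not the CLI's actual implementation:

```python
# Hypothetical rendering helper: maps the proposed hugepages_2mi /
# hugepages_2mi_limit config fields onto a Kubernetes resources dict.
def render_hugepages_resources(hugepages_2mi=None, hugepages_2mi_limit=None):
    """Return a 'resources' dict with hugepages-2Mi request/limit entries.

    Integer inputs are interpreted as Mi. Kubernetes requires hugepages
    requests to equal limits, so the limit defaults to the request when
    not set explicitly (an assumed convenience, not confirmed CLI behavior).
    """
    resources = {"requests": {}, "limits": {}}
    if hugepages_2mi is not None:
        resources["requests"]["hugepages-2Mi"] = f"{hugepages_2mi}Mi"
        limit = hugepages_2mi_limit if hugepages_2mi_limit is not None else hugepages_2mi
        resources["limits"]["hugepages-2Mi"] = f"{limit}Mi"
    return resources

print(render_hugepages_resources(hugepages_2mi=5120))
```

Defaulting the limit to the request keeps single-field configs valid, since Kubernetes rejects pods where hugepages requests and limits differ.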