The hyp-pytorch-job template doesn't support requesting hugepages. EFA's RDMA stack uses 2Mi hugepages to reduce TLB misses during DMA transfers for inter-node NCCL communication. Without a way to request them through the CLI, users have to bypass it and write raw YAML.
HyperPod nodes already pre-allocate hugepages (e.g. ~10562Mi on ml.p4d.24xlarge), but the CLI template has no way to request them in the pod spec. The official SageMaker docs for HyperPodPyTorchJob show hugepages-2Mi: 5120Mi in the example job spec, confirming this is an expected resource to configure.
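For context, this is a sketch of the kind of resources block users currently have to hand-write in raw YAML; the hugepages-2Mi value follows the SageMaker docs example, while the container name and the GPU/EFA counts are illustrative assumptions, not values from this issue:

```yaml
# Illustrative fragment of a hand-written pod spec; only the resources
# block is the point here. Container name and device counts are examples.
containers:
  - name: pytorch-worker        # illustrative name
    resources:
      requests:
        hugepages-2Mi: 5120Mi   # value from the SageMaker docs example
      limits:
        hugepages-2Mi: 5120Mi   # Kubernetes requires hugepages limit == request
```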
Suggested config fields (integer values, interpreted as Mi):
hugepages_2mi: 5120
hugepages_2mi_limit: 5120
Which would generate:
resources:
  requests:
    hugepages-2Mi: 5120Mi
  limits:
    hugepages-2Mi: 5120Mi
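A minimal sketch of how the two proposed fields could be rendered into the resources block. The function name and the defaulting behavior (limit falling back to the request) are assumptions for illustration, not the CLI's actual implementation:

```python
# Hypothetical rendering helper: maps the proposed hugepages_2mi /
# hugepages_2mi_limit config fields onto a Kubernetes resources dict.
def render_hugepages_resources(hugepages_2mi=None, hugepages_2mi_limit=None):
    """Return a 'resources' dict with hugepages-2Mi request/limit entries.

    Integer inputs are interpreted as Mi. Kubernetes requires hugepages
    requests to equal limits, so the limit defaults to the request when
    not set explicitly (an assumed convenience, not confirmed CLI behavior).
    """
    resources = {"requests": {}, "limits": {}}
    if hugepages_2mi is not None:
        resources["requests"]["hugepages-2Mi"] = f"{hugepages_2mi}Mi"
        limit = hugepages_2mi_limit if hugepages_2mi_limit is not None else hugepages_2mi
        resources["limits"]["hugepages-2Mi"] = f"{limit}Mi"
    return resources

print(render_hugepages_resources(hugepages_2mi=5120))
```

Defaulting the limit to the request keeps single-field configs valid, since Kubernetes rejects pods where hugepages requests and limits differ.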