[Bug] node_count and explicit resource fields (accelerators, vcpu, memory) are incorrectly mutually exclusive #383

@FarhanTejani

Description

hyp create hyp-pytorch-job rejects configurations that specify both node_count and explicit resource fields (accelerators, vcpu, memory). The CLI raises:

❌ Either node-count OR a combination of accelerators, vcpu, memory-in-gib must be specified for instance-type ml.p4d.24xlarge

But both are needed simultaneously: node_count controls the number of replicas, while the resource fields control per-pod requests/limits. The underlying HyperPodPyTorchJob CRD supports both together.
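To make the difference concrete, here is a minimal sketch of the two validation behaviors. It is not the CLI's actual code: the function names `validate_exclusive` and `validate_independent` are hypothetical, and the first function only mimics the behavior implied by the error message above.

```python
from typing import Optional


def _resource_fields(accelerators, vcpu, memory_in_gib):
    """Return the per-pod resource fields as (name, value) pairs."""
    return [
        ("accelerators", accelerators),
        ("vcpu", vcpu),
        ("memory_in_gib", memory_in_gib),
    ]


def validate_exclusive(
    node_count: Optional[int],
    accelerators: Optional[int],
    vcpu: Optional[float],
    memory_in_gib: Optional[float],
) -> None:
    """Hypothetical check mimicking the reported behavior:
    node_count and the resource fields cannot be combined."""
    fields = _resource_fields(accelerators, vcpu, memory_in_gib)
    has_resources = any(value is not None for _, value in fields)
    if node_count is not None and has_resources:
        raise ValueError(
            "Either node-count OR a combination of accelerators, vcpu, "
            "memory-in-gib must be specified"
        )


def validate_independent(
    node_count: Optional[int],
    accelerators: Optional[int],
    vcpu: Optional[float],
    memory_in_gib: Optional[float],
) -> None:
    """Hypothetical alternative: node_count (replicas) and per-pod
    resources are validated separately, so both may be set."""
    if node_count is not None and node_count < 1:
        raise ValueError("node_count must be at least 1")
    for name, value in _resource_fields(accelerators, vcpu, memory_in_gib):
        if value is not None and value <= 0:
            raise ValueError(f"{name} must be positive")
```

With the second form, a call such as `validate_independent(node_count=2, accelerators=8, vcpu=90, memory_in_gib=1000)` passes, while the same arguments raise under `validate_exclusive`.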

Without explicit resource requests, the operator auto-calculates them, and the result can exceed what is actually available on a node after system pod overhead. Kueue then never admits the job. This can make multi-node jobs with Kueue scheduling unusable through the CLI.
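A rough illustration of why the auto-calculated request may never fit. The instance figures (8 accelerators, 96 vCPUs, 1152 GiB) are the published ml.p4d.24xlarge specs. Everything else is an assumption for illustration: the overhead numbers, the idea that the auto-calculated request equals full instance capacity, and the `fits` check, which is a simplified stand-in for an admission decision and not Kueue's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    accelerators: int
    vcpu: float
    memory_gib: float


def subtract(total: Resources, used: Resources) -> Resources:
    """Resources left after `used` is taken out of `total`."""
    return Resources(
        total.accelerators - used.accelerators,
        total.vcpu - used.vcpu,
        total.memory_gib - used.memory_gib,
    )


def fits(request: Resources, available: Resources) -> bool:
    """Simplified admission check: every dimension must fit."""
    return (
        request.accelerators <= available.accelerators
        and request.vcpu <= available.vcpu
        and request.memory_gib <= available.memory_gib
    )


# Published ml.p4d.24xlarge capacity.
capacity = Resources(accelerators=8, vcpu=96, memory_gib=1152)

# Assumed overhead from system pods (illustrative numbers only).
system_overhead = Resources(accelerators=0, vcpu=4, memory_gib=16)
available = subtract(capacity, system_overhead)

# Assumption: the auto-calculated request equals full instance capacity.
auto_request = capacity

# An explicit per-pod request that leaves headroom for system pods.
explicit_request = Resources(accelerators=8, vcpu=90, memory_gib=1100)

print(fits(auto_request, available))      # False
print(fits(explicit_request, available))  # True
```

Under these assumptions the full-capacity request can never be satisfied on any node, whereas an explicit request sized below the available amount can, which is why being able to set the resource fields alongside node_count matters.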

CLI version: v3.7.0
