HyperPod InstantStart is a training-and-inference integrated platform built on SageMaker HyperPod. It utilizes standard EKS orchestration and supports training and inference tasks with arbitrary GPU resource granularity.
HyperPod-InstantStart provides a unified interface for managing ML infrastructure, from cluster provisioning to training job orchestration and model serving.
- For training, it leverages HyperPod Training Operator (significantly simplifying distributed configuration with process-level recovery and log exception monitoring; optional), or KubeRay (as an orchestrator for the reinforcement learning framework VERL).
- For inference, it supports deployment on single or multi-node setups using arbitrary containers, such as standard vLLM/SGLang or self-built containers, while also providing standardized API exposure (e.g., OpenAI-compatible API). You can self-manage serving with intelligent routing and metrics, or use the HyperPod-managed Inference Operator for built-in routing, L1/L2 KV-cache sharing, and integrated observability.
- Additionally, it offers managed MLFlow Tracking Server for storing training metrics, enabling sharing and collaboration with fine-grained IAM permission controls.
- It also introduces Agent-Driven AI Infra: launch the Kiro Agent and build production-grade clusters, model deployments, and managed Operator configurations from scratch through minimal interaction, powered by MCP tools (wrapping project backend APIs for best-practice compliance) and project-level Agent SKILLs that orchestrate complete end-to-end workflows.
- Cluster Management: Supports EKS cluster creation, importing existing EKS clusters, cluster environment configuration, HyperPod cluster creation and scaling, EKS Node Group creation
- Model Management: Supports multiple S3 CSI configurations, as well as HuggingFace model downloads (CPU Pod)
- Inference: Hosting for vLLM, SGLang or any custom container, with support for binding Pods to different Services (no need to repeatedly destroy and create Pods during resource rebalancing)
- Training: Supports model training patterns including LlamaFactory and Torch Script
- Training History: Integration with SageMaker-managed MLFlow creation and display/sharing of training performance metrics
- (NEW)Agentic Orchestration: Provides integrated MCP server for Natural language based AI task orchestration, e.g. Cluster Management, Inference, (Coming Soon) Training & Hosting
- (NEW)SandBox Service for RL Training: Provides interactive SandBox Service within the Cluster for (Coding) RL Training
For detailed setup instructions, please refer to Feishu Doc (zh_cn), or Lark Doc (en)
- Agent-driven AI infrastructure — Just launch the Kiro Agent and, through minimal interaction, build production-grade HyperPod clusters, model deployments, and managed Operator configurations from scratch. Unlike directly letting a coding agent call AWS CLI/SDKs, the project's MCP tools wrap backend APIs to enforce AI-workload best practices (avoiding agent improvisation while cutting interaction rounds and context usage), and project-level Agent SKILLs orchestrate complete business workflows for robust, reproducible multi-step operations — all running in-container with zero local setup. AWS
agent-plugins/sagemaker-aiskills additionally supplement cluster troubleshooting (NCCL, node/GPU, performance, creation failures). See the [Agentic] section of the project manual. - One-click cluster provisioning & node management — End-to-end EKS + HyperPod creation, dependency configuration, scaling, and node lifecycle operations from a single interface.
- High-resilience distributed training — HyperPod Training Operator greatly simplifies Torch distributed setup and adds business-log anomaly monitoring with process-level restart recovery.
- RL training orchestration — Standard KubeRay acts as the orchestrator for the Verl reinforcement-learning framework. The platform can also deploy sandbox-as-a-service — isolated, on-demand execution sandboxes exposed as a service — to serve as the interactive environment for agent RL training (e.g. tool-use / code-execution rollouts).
- Flexible inference serving — Self-manage any container (e.g. OSS vLLM/SGLang) with intelligent routing and metrics, or use the HyperPod-managed Inference Operator with built-in intelligent routing, L1/L2 KV-cache sharing, and integrated observability.
- Karpenter-based autoscaling — Just-in-time GPU node provisioning, including HyperPod Spot instances, that scales capacity to match workload demand.
- Model & storage management — HuggingFace model downloads and flexible S3 storage mounting for training and inference assets.
- Training observability — Managed MLflow tracking server for storing metrics, with sharing and collaboration under fine-grained IAM permission controls.





