HyperPod-InstantStart

HyperPod InstantStart is a training-and-inference integrated platform built on SageMaker HyperPod. It utilizes standard EKS orchestration and supports training and inference tasks with arbitrary GPU resource granularity.

Overview

HyperPod-InstantStart provides a unified interface for managing ML infrastructure, from cluster provisioning to training job orchestration and model serving.

For training, it leverages HyperPod Training Operator (significantly simplifying distributed configuration with process-level recovery and log exception monitoring; optional), or KubeRay (as an orchestrator for the reinforcement learning framework VERL).
For inference, it supports deployment on single or multi-node setups using arbitrary containers, such as standard vLLM/SGLang or self-built containers, while also providing standardized API exposure (e.g., OpenAI-compatible API). You can self-manage serving with intelligent routing and metrics, or use the HyperPod-managed Inference Operator for built-in routing, L1/L2 KV-cache sharing, and integrated observability.
Additionally, it offers managed MLFlow Tracking Server for storing training metrics, enabling sharing and collaboration with fine-grained IAM permission controls.
It also introduces Agent-Driven AI Infra: launch the Kiro Agent and build production-grade clusters, model deployments, and managed Operator configurations from scratch through minimal interaction, powered by MCP tools (wrapping project backend APIs for best-practice compliance) and project-level Agent SKILLs that orchestrate complete end-to-end workflows.

Architecture

Demo Videos

Create HyperPod Cluster

Download Model from HuggingFace

Model Deployment from S3

Distributed Verl Training with KubeRay

Agentic Orchestration and AI Workloads

Key Components

Cluster Management: Supports EKS cluster creation, importing existing EKS clusters, cluster environment configuration, HyperPod cluster creation and scaling, EKS Node Group creation
Model Management: Supports multiple S3 CSI configurations, as well as HuggingFace model downloads (CPU Pod)
Inference: Hosting for vLLM, SGLang or any custom container, with support for binding Pods to different Services (no need to repeatedly destroy and create Pods during resource rebalancing)
Training: Supports model training patterns including LlamaFactory and Torch Script
Training History: Integration with SageMaker-managed MLFlow creation and display/sharing of training performance metrics
(NEW)Agentic Orchestration: Provides integrated MCP server for Natural language based AI task orchestration, e.g. Cluster Management, Inference, (Coming Soon) Training & Hosting
(NEW)SandBox Service for RL Training: Provides interactive SandBox Service within the Cluster for (Coding) RL Training

For detailed setup instructions, please refer to Feishu Doc (zh_cn), or Lark Doc (en)

Key Features

Agent-driven AI infrastructure — Just launch the Kiro Agent and, through minimal interaction, build production-grade HyperPod clusters, model deployments, and managed Operator configurations from scratch. Unlike directly letting a coding agent call AWS CLI/SDKs, the project's MCP tools wrap backend APIs to enforce AI-workload best practices (avoiding agent improvisation while cutting interaction rounds and context usage), and project-level Agent SKILLs orchestrate complete business workflows for robust, reproducible multi-step operations — all running in-container with zero local setup. AWS agent-plugins/sagemaker-ai skills additionally supplement cluster troubleshooting (NCCL, node/GPU, performance, creation failures). See the [Agentic] section of the project manual.
One-click cluster provisioning & node management — End-to-end EKS + HyperPod creation, dependency configuration, scaling, and node lifecycle operations from a single interface.
High-resilience distributed training — HyperPod Training Operator greatly simplifies Torch distributed setup and adds business-log anomaly monitoring with process-level restart recovery.
RL training orchestration — Standard KubeRay acts as the orchestrator for the Verl reinforcement-learning framework. The platform can also deploy sandbox-as-a-service — isolated, on-demand execution sandboxes exposed as a service — to serve as the interactive environment for agent RL training (e.g. tool-use / code-execution rollouts).
Flexible inference serving — Self-manage any container (e.g. OSS vLLM/SGLang) with intelligent routing and metrics, or use the HyperPod-managed Inference Operator with built-in intelligent routing, L1/L2 KV-cache sharing, and integrated observability.
Karpenter-based autoscaling — Just-in-time GPU node provisioning, including HyperPod Spot instances, that scales capacity to match workload demand.
Model & storage management — HuggingFace model downloads and flexible S3 storage mounting for training and inference assets.
Training observability — Managed MLflow tracking server for storing metrics, with sharing and collaboration under fine-grained IAM permission controls.

Name		Name	Last commit message	Last commit date
Latest commit History 292 Commits
.claude		.claude
.kiro		.kiro
cli-min		cli-min
hypd-inst-mcp		hypd-inst-mcp
resources		resources
smtj-recipes		smtj-recipes
train-recipes		train-recipes
ui-panel		ui-panel
.mcp.json		.mcp.json
Dockerfile		Dockerfile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HyperPod-InstantStart

Overview

Architecture