Skip to content

haozhx23/HyperPod-InstantStart

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

292 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HyperPod-InstantStart

English Documentation Chinese Documentation

HyperPod InstantStart is a training-and-inference integrated platform built on SageMaker HyperPod. It utilizes standard EKS orchestration and supports training and inference tasks with arbitrary GPU resource granularity.

Overview

HyperPod-InstantStart provides a unified interface for managing ML infrastructure, from cluster provisioning to training job orchestration and model serving.

  • For training, it leverages HyperPod Training Operator (significantly simplifying distributed configuration with process-level recovery and log exception monitoring; optional), or KubeRay (as an orchestrator for the reinforcement learning framework VERL).
  • For inference, it supports deployment on single or multi-node setups using arbitrary containers, such as standard vLLM/SGLang or self-built containers, while also providing standardized API exposure (e.g., OpenAI-compatible API). You can self-manage serving with intelligent routing and metrics, or use the HyperPod-managed Inference Operator for built-in routing, L1/L2 KV-cache sharing, and integrated observability.
  • Additionally, it offers managed MLFlow Tracking Server for storing training metrics, enabling sharing and collaboration with fine-grained IAM permission controls.
  • It also introduces Agent-Driven AI Infra: launch the Kiro Agent and build production-grade clusters, model deployments, and managed Operator configurations from scratch through minimal interaction, powered by MCP tools (wrapping project backend APIs for best-practice compliance) and project-level Agent SKILLs that orchestrate complete end-to-end workflows.

Architecture

Architecture Diagram

Demo Videos

Create HyperPod Cluster

hypd create

Download Model from HuggingFace

model download

Model Deployment from S3

deploy

Distributed Verl Training with KubeRay

verl

Agentic Orchestration and AI Workloads

agentic

Key Components

  • Cluster Management: Supports EKS cluster creation, importing existing EKS clusters, cluster environment configuration, HyperPod cluster creation and scaling, EKS Node Group creation
  • Model Management: Supports multiple S3 CSI configurations, as well as HuggingFace model downloads (CPU Pod)
  • Inference: Hosting for vLLM, SGLang or any custom container, with support for binding Pods to different Services (no need to repeatedly destroy and create Pods during resource rebalancing)
  • Training: Supports model training patterns including LlamaFactory and Torch Script
  • Training History: Integration with SageMaker-managed MLFlow creation and display/sharing of training performance metrics
  • (NEW)Agentic Orchestration: Provides integrated MCP server for Natural language based AI task orchestration, e.g. Cluster Management, Inference, (Coming Soon) Training & Hosting
  • (NEW)SandBox Service for RL Training: Provides interactive SandBox Service within the Cluster for (Coding) RL Training

For detailed setup instructions, please refer to Feishu Doc (zh_cn), or Lark Doc (en)

Key Features

  • Agent-driven AI infrastructure — Just launch the Kiro Agent and, through minimal interaction, build production-grade HyperPod clusters, model deployments, and managed Operator configurations from scratch. Unlike directly letting a coding agent call AWS CLI/SDKs, the project's MCP tools wrap backend APIs to enforce AI-workload best practices (avoiding agent improvisation while cutting interaction rounds and context usage), and project-level Agent SKILLs orchestrate complete business workflows for robust, reproducible multi-step operations — all running in-container with zero local setup. AWS agent-plugins/sagemaker-ai skills additionally supplement cluster troubleshooting (NCCL, node/GPU, performance, creation failures). See the [Agentic] section of the project manual.
  • One-click cluster provisioning & node management — End-to-end EKS + HyperPod creation, dependency configuration, scaling, and node lifecycle operations from a single interface.
  • High-resilience distributed training — HyperPod Training Operator greatly simplifies Torch distributed setup and adds business-log anomaly monitoring with process-level restart recovery.
  • RL training orchestration — Standard KubeRay acts as the orchestrator for the Verl reinforcement-learning framework. The platform can also deploy sandbox-as-a-service — isolated, on-demand execution sandboxes exposed as a service — to serve as the interactive environment for agent RL training (e.g. tool-use / code-execution rollouts).
  • Flexible inference serving — Self-manage any container (e.g. OSS vLLM/SGLang) with intelligent routing and metrics, or use the HyperPod-managed Inference Operator with built-in intelligent routing, L1/L2 KV-cache sharing, and integrated observability.
  • Karpenter-based autoscaling — Just-in-time GPU node provisioning, including HyperPod Spot instances, that scales capacity to match workload demand.
  • Model & storage management — HuggingFace model downloads and flexible S3 storage mounting for training and inference assets.
  • Training observability — Managed MLflow tracking server for storing metrics, with sharing and collaboration under fine-grained IAM permission controls.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors