GPU Image Builder

High‑performance NVIDIA GPU images for training, inference, and multi‑node workloads

A modular, production-grade system for building GPU-accelerated VM images on Google Cloud using HashiCorp Packer, with an option to use the AWS plugin for EC2 builds. Designed for training, inference, and multi-node distributed workloads on NVIDIA GPUs.

Why This Project Exists

Every time you spin up a new GPU node, you end up doing the same thing: installing drivers, CUDA, and PyTorch, tuning sysctl knobs, fighting dependency conflicts, and burning hours and cloud credits before any real work starts. (I know, because I spent hours on exactly that for this project.)

This project was born out of that frustration. Instead of repeating the same setup ritual on every node, it bakes everything into a single, reproducible image. Boot it up and you're ready to train or serve immediately.

It also became a way for me to explore and document the NVIDIA ecosystem: driver versions, DCGM, Fabric Manager, UCX/RDMA, NCCL tuning, and all the performance tweaks that are scattered across dozens of docs but rarely packaged together.

Architecture

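The build is layered. The base, framework, performance, and monitoring layers are always included; the training, inference, and multi-node layers are optional and can each be enabled independently: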

┌──────────────────────────────────────────────────┐
│  Multi-node Layer (NCCL, RDMA, SHARP)            │  enable_multinode = true
├──────────────────────────────────────────────────┤
│  Training Layer (PyTorch, DeepSpeed, DALI, UCX)  │  enable_training = true
├──────────────────────────────────────────────────┤
│  Inference Layer (PyTorch, vLLM, TensorRT-LLM)   │  enable_inference = true
├──────────────────────────────────────────────────┤
│  Monitoring Layer (DCGM 4, Fabric Manager)       │  always included
├──────────────────────────────────────────────────┤
│  Performance Layer (NUMA, hugepages, governors)  │  always included
├──────────────────────────────────────────────────┤
│  Framework Layer (Python, uv, /opt/ml venv)      │  always included
├──────────────────────────────────────────────────┤
│  Base Layer (Driver 590, CUDA 13.1, CTK)         │  always included
└──────────────────────────────────────────────────┘

Features

Base Layer

NVIDIA driver 590+ (open kernel modules, compute-only)
CUDA Toolkit 13.1
NVIDIA Container Toolkit (Docker + containerd)
DCGM + Fabric Manager (NVSwitch/HGX)
NVIDIA Persistence Daemon
Swap disabled, hugepages, CPU governor tuning, NUMA tools
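
One way to confirm the base layer after boot is to compare the installed driver against the minimum version the image targets. The sketch below is not part of the repository: `driver_major`, `require_driver`, and `installed_driver` are hypothetical helper names, and the `nvidia-smi` query only runs when you call `installed_driver` on a GPU instance.

```shell
#!/bin/sh
# Hypothetical post-boot check; helper names are illustrative only.

# Print the major component of a driver version string, e.g. 590.48.01 -> 590.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if version $1 has a major component of at least $2 (default 590).
require_driver() {
  [ "$(driver_major "$1")" -ge "${2:-590}" ]
}

# Query the driver version of the first GPU (requires nvidia-smi on the instance).
installed_driver() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}
```

On a booted instance, `require_driver "$(installed_driver)" && echo "driver OK"` would report whether the driver meets the 590 baseline.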

Framework Layer

Python 3 + uv package manager
Shared ML virtual environment at /opt/ml

Performance Tuning

GPU persistence mode (nvidia-smi -pm 1)
NUMA awareness + CPU pinning (numactl, hwloc)
Hugepages (2048 pages)
irqbalance disabled
Swap disabled, vm.swappiness=0
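
The settings above can be spot-checked on a running instance. This is a minimal sketch, assuming the values listed in this section (2048 hugepages, no swap, `vm.swappiness=0`); `meminfo_value` and `check_tuning` are hypothetical helpers, not scripts shipped in the repository.

```shell
#!/bin/sh
# Hypothetical verification helpers; not part of the repository.

# Print the numeric value for KEY from a meminfo-style file ("KEY:  value [kB]").
meminfo_value() {
  awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Compare the running system with the settings listed above.
# Optional arguments override the files that are read, which helps with testing.
check_tuning() (
  meminfo="${1:-/proc/meminfo}"
  swappiness="${2:-/proc/sys/vm/swappiness}"
  status=0
  if [ "$(meminfo_value HugePages_Total "$meminfo")" != "2048" ]; then
    echo "hugepages: expected 2048 pages"; status=1
  fi
  if [ "$(meminfo_value SwapTotal "$meminfo")" != "0" ]; then
    echo "swap: still enabled"; status=1
  fi
  if [ "$(cat "$swappiness")" != "0" ]; then
    echo "vm.swappiness: expected 0"; status=1
  fi
  exit "$status"
)
```

Running `check_tuning && echo "tuning OK"` prints one line per mismatch. GPU persistence mode can be checked separately with `nvidia-smi --query-gpu=persistence_mode --format=csv,noheader`.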

Monitoring

DCGM 4 (matched to CUDA major version)
DCGM Exporter for Prometheus
Fabric Manager for NVLink/NVSwitch GPUs
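
To see what the exporter publishes, you can read its Prometheus text output directly. The sketch below assumes DCGM Exporter is listening on its usual default port 9400 and exposes the `DCGM_FI_DEV_GPU_UTIL` metric; the port and both helper names (`gpu_util_from_metrics`, `scrape_gpu_util`) are assumptions, not something this repository configures explicitly.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting DCGM Exporter output.

# Read Prometheus text format on stdin and print one line per GPU
# with its utilization, taken from the DCGM_FI_DEV_GPU_UTIL samples.
gpu_util_from_metrics() {
  awk '
    /^DCGM_FI_DEV_GPU_UTIL[{]/ {
      gpu = "?"
      # Extract the value of the gpu="..." label.
      if (match($0, /gpu="[^"]*"/)) gpu = substr($0, RSTART + 5, RLENGTH - 6)
      # The sample value is the last whitespace-separated field.
      print "gpu " gpu ": " $NF "%"
    }'
}

# Scrape a local exporter (the port is an assumed default).
scrape_gpu_util() {
  curl -s "http://localhost:${1:-9400}/metrics" | gpu_util_from_metrics
}
```

On an instance built from the image, `scrape_gpu_util` would list utilization per GPU; pass a different port as the first argument if the exporter is configured differently.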

Training Layer (`enable_training = true`)

PyTorch (CUDA-enabled) + Transformers + Accelerate
DeepSpeed
Megatron-LM dependencies (ninja, einops)
NVIDIA DALI
UCX + RDMA + InfiniBand
Network sysctl tuning (BBR, large buffers)

Inference Layer (`enable_inference = true`)

PyTorch (CUDA-enabled) + Transformers + Accelerate
Triton, xFormers, Flash-Attention
CuPy, cuPyNumeric, NVIDIA Warp
vLLM
TensorRT-LLM
Quantization (bitsandbytes, AutoAWQ, Optimum)

Repository Structure

.
├── build.pkr.hcl                        # Build pipeline — provisioner ordering
├── node.pkr.hcl                         # GCP source definition (googlecompute)
├── variable.pkr.hcl                     # Variable definitions with defaults
├── local.pkr.hcl                        # Local values
├── plugins.pkr.hcl                      # Packer plugin requirements
├── example.pkrvars.hcl.example          # Example variable values (copy and customize)
├── script/
│   ├── base.sh                          # NVIDIA driver, CUDA, CTK, DCGM, system tuning
│   ├── framework_setup.sh               # Python, uv, /opt/ml venv
│   ├── performance-script.sh            # GPU persistence, NUMA, hugepages, irqbalance
│   ├── monitoring-setup.sh              # DCGM 4, Fabric Manager, DCGM Exporter
│   ├── training_optimized_script.sh     # PyTorch + DeepSpeed + DALI + UCX/RDMA
│   ├── inference_optimized_setup.sh     # PyTorch + vLLM + TensorRT-LLM + quantization
│   └── multinode_performance_tuning.sh  # Multi-node networking optimizations
└── README.md

Prerequisites

HashiCorp Packer >= 1.9
Google Compute Packer plugin (installed via packer init)
Optionally, the AWS Packer plugin can be used for EC2 builds by adding an amazon-ebs source to node.pkr.hcl
GCP project with Compute Engine API enabled (or AWS account with EC2 access)
GCP authentication (gcloud auth application-default login) or AWS credentials
Ubuntu 22.04 LTS base image
NVIDIA GPU instance type (e.g. A100, H100, L4 on GCP; p4d, p5 on AWS)

Quick Start

1. Initialize Packer plugins

packer init .

2. Configure variables

cp example.pkrvars.hcl.example my.pkrvars.hcl
# Edit my.pkrvars.hcl with your GCP project, zone, machine type, etc.
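
For orientation, here is a sketch of what a filled-in variable file might look like, based on the Variables table in this README. It is written as a small shell helper so it can be generated in one step. Every value is a placeholder: the project ID, zone, image name, and SSH username are hypothetical, and `write_pkrvars` is not part of the repository.

```shell
#!/bin/sh
# Write a starter variable file. Every value below is a placeholder
# (hypothetical project, zone, and names); replace them with your own.
write_pkrvars() {
  cat > "${1:-my.pkrvars.hcl}" <<'EOF'
image_name        = "gpu-base-example"
image_description = "GPU image built with Packer"
project_id        = "my-gcp-project"
image_family      = "ubuntu-2204-lts"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
machine_type      = "a2-highgpu-1g"
ssh_username      = "packer"

# Optional layers (all default to false)
enable_training  = false
enable_inference = false
enable_multinode = false
EOF
}
```

Calling `write_pkrvars my.pkrvars.hcl` produces a file you can then edit by hand, as an alternative to copying the example file.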

3. Validate the template

packer validate -var-file=my.pkrvars.hcl .

4. Build a general-purpose GPU image

packer build -var-file=my.pkrvars.hcl .

With all `enable_*` flags left at their default of `false`, this builds only the always-included layers: base, framework, performance, and monitoring.

5. Build a training-optimized image

Set enable_training = true in your .pkrvars.hcl file, then:

packer build -var-file=my.pkrvars.hcl .

Adds PyTorch, DeepSpeed, DALI, UCX/RDMA, and network sysctl tuning.

6. Build an inference-optimized image

Set enable_inference = true in your .pkrvars.hcl file, then:

packer build -var-file=my.pkrvars.hcl .

Adds PyTorch, vLLM, TensorRT-LLM, and quantization libraries.

7. Build with multi-node optimizations

Set enable_multinode = true in your .pkrvars.hcl file for NCCL/RDMA/SHARP tuning, then run the same packer build command as above.
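
Instead of editing the variable file for each variant, the flags can also be passed on the command line: Packer applies `-var` and `-var-file` options in the order given, so a later `-var` overrides the file. The wrapper below is a hypothetical sketch (`build_cmd` is not part of the repository) that only prints the command, so you can review it before running it.

```shell
#!/bin/sh
# Hypothetical wrapper: prints the packer command for a named variant.
# A -var flag given after -var-file overrides the value set in the file.
build_cmd() {
  var_file="${2:-my.pkrvars.hcl}"
  case "$1" in
    base)      flags="" ;;
    training)  flags="-var enable_training=true" ;;
    inference) flags="-var enable_inference=true" ;;
    multinode) flags="-var enable_multinode=true" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
  echo "packer build -var-file=$var_file${flags:+ $flags} ."
}
```

For example, `build_cmd training` prints the training build command, and `build_cmd training | sh` runs it from the repository root.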

Variables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| `image_name` | string | none | Name of the resulting image |
| `image_description` | string | none | Description of the image |
| `project_id` | string | none | GCP project ID |
| `image_family` | string | none | Source image family (e.g. `ubuntu-2204-lts`) |
| `image_project_id` | list(string) | none | Project(s) to search for the source image |
| `zone` | string | none | GCP zone for the build instance |
| `machine_type` | string | none | GCP machine type (e.g. `a2-highgpu-1g`) |
| `ssh_username` | string | none | SSH username for Packer |
| `disk_size` | number | `100` | Boot disk size in GB |
| `driver_version` | string | `590.48.01` | NVIDIA driver version |
| `cuda_version` | string | `13.1` | CUDA toolkit version |
| `pytorch_version` | string | `2.5.1` | PyTorch version |
| `enable_training` | bool | `false` | Enable training layer |
| `enable_inference` | bool | `false` | Enable inference layer |
| `enable_multinode` | bool | `false` | Enable multi-node optimizations |

Layer Details

| Layer | Script | Components | When |
| --- | --- | --- | --- |
| Base | `base.sh` | NVIDIA driver (open kernel modules), CUDA toolkit, Container Toolkit, DCGM, Persistence Daemon, swap/hugepages/NUMA tuning | Always |
| Framework | `framework_setup.sh` | Python 3, uv, `/opt/ml` shared venv | Always |
| Performance | `performance-script.sh` | GPU persistence mode, swap off, hugepages, NUMA/hwloc tools, irqbalance disabled | Always |
| Monitoring | `monitoring-setup.sh` | DCGM 4 (CUDA-version matched), DCGM Exporter, Fabric Manager | Always |
| Training | `training_optimized_script.sh` | PyTorch, Transformers, DeepSpeed, DALI, UCX/RDMA, network tuning | `enable_training` |
| Inference | `inference_optimized_setup.sh` | PyTorch, Triton, xFormers, Flash-Attention, vLLM, TensorRT-LLM, quantization | `enable_inference` |
| Multi-node | `multinode_performance_tuning.sh` | NCCL tuning, RDMA, SHARP | `enable_multinode` |
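
The mapping from flags to scripts can be summarized as a small function. This is an illustration of the table, not the actual build logic: `scripts_for` is a hypothetical helper, and the order shown is an assumption, since `build.pkr.hcl` defines the real provisioner ordering.

```shell
#!/bin/sh
# Illustrative only: lists which scripts a given flag combination would run.
# The order is an assumption; build.pkr.hcl defines the real provisioner order.
scripts_for() {
  # Usage: scripts_for ENABLE_TRAINING ENABLE_INFERENCE ENABLE_MULTINODE
  echo base.sh
  echo framework_setup.sh
  echo performance-script.sh
  echo monitoring-setup.sh
  [ "$1" = "true" ] && echo training_optimized_script.sh
  [ "$2" = "true" ] && echo inference_optimized_setup.sh
  [ "$3" = "true" ] && echo multinode_performance_tuning.sh
  return 0
}
```

For instance, `scripts_for true false true` lists the four always-included scripts followed by the training and multi-node scripts.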

Using AWS Instead of GCP

The provisioning scripts are cloud-agnostic. To build on AWS instead of GCP:

Add the AWS plugin to the required_plugins block in plugins.pkr.hcl:

amazon = {
  source  = "github.com/hashicorp/amazon"
  version = "~> 1"
}

Add an amazon-ebs source in node.pkr.hcl targeting a GPU instance type (e.g. p4d.24xlarge, p5.48xlarge)
Update the sources list in build.pkr.hcl to reference the new source

All shell scripts under script/ will work without modification.

License

MIT License. Contributions are welcome.
