Skip to content

Fng infra install fix v2#249

Open
FNGarvin wants to merge 5 commits intorunpod:mainfrom
FNGarvin:fng-infra-install-fix-v2
Open

Fng infra install fix v2#249
FNGarvin wants to merge 5 commits intorunpod:mainfrom
FNGarvin:fng-infra-install-fix-v2

Conversation

@FNGarvin
Copy link

@FNGarvin FNGarvin commented Mar 3, 2026

Fng infra install fix v2

Replaces and combines #240 and #241.

Hot off the presses: freshly rebased for latest status of main as of this writing.

Test plan:

  • Verify that root and non-root installs still work in a clean Linux environment:

    • podman run --rm -v $(pwd):/workspace alpine:latest sh -c 'apk add --no-cache bash wget grep sed tar coreutils jq curl && cp /workspace/install.sh /tmp/install.sh && chmod 755 /tmp/install.sh && echo "=== TESTING ROOT INSTALL ===" && cd /tmp && bash /tmp/install.sh && /usr/local/bin/runpodctl version && echo -e "\n=== TESTING NON-ROOT INSTALL ===" && adduser -D tester && su tester -c "cd /home/tester && mkdir -p ~/.local/bin && bash /tmp/install.sh" && su tester -c "~/.local/bin/runpodctl version"''
    • non-root should warn that it's a user-only install and also warn if we installed into a location that wasn't already on the path.
  • Local test of new e2e test against live Runpod services:

    • export RUNPOD_API_KEY=XXXYYYZZZ && go build -o runpodctl main.go && RUNPOD_E2E_TEST_CROC=1 go test -tags e2e -v ./e2e/cli_lifecycle_test.go from a shell.
    • Naturally, review e2e test to ensure you're comfortable using your key.
    • The export is necessary even if you've previously blessed a key with runpodctl doctor. Seemed sensible for a ci-focused test.
    • The file uploads/downloads w/ CROC are not enabled by default because they depend on the third-party CROC relay.
  • CI testing:

    • Run the test from the integration matrix manually or trigger it via appropriate action.

Demonstrating newly parameterized actions via manual trigger:
manual-ci-demonstrating-imageparameters
What the new actions look like, testing root and non-root workflows in parallel:
ci-in-progress
Full, go-based e2e test suite against live Runpod services to complement the existing tests:
go-ci-e2e-test-complete
ci_passed

Addendum:
I didn't get into the details of which container images I'm testing against because it would feel like trying to tell a baker which flour they ought to be using. But in the setup as presented, pod tests are being run against a basic and tiny Alpine image that simply idles in a sleep. The serverless tests are being conducted against an almost-as-simple alpine-python container:

# Author: FNGarvin
# License: MIT
FROM python:3.12-alpine

# Install dependencies for the runpod library
RUN apk add --no-cache gcc musl-dev linux-headers && \
    pip install --no-cache-dir runpod && \
    apk del gcc musl-dev linux-headers

# Provide a dummy input to satisfy local test-mode runs
RUN echo '{"id": "local-test", "input": {}}' > /test_input.json

# Use -u for unbuffered logs so you see them in the RunPod dashboard instantly
CMD ["python3", "-u", "-c", "import runpod; runpod.serverless.start({'handler': lambda job: 'FNGarvin-CI-ECHO'})"]

# // END OF Dockerfile

Simple as. I have updated the CI to use a fixed commit hash so you can test it with a bit more confidence if you wish. No big deal either way. I like this option because it is only about 80MB and the smallest Runpod base that includes the API is closer to 3.5GB, AFAICT.

Cheers!

@FNGarvin
Copy link
Author

FNGarvin commented Mar 3, 2026

Reviewer's Guide

Updates the install.sh script to support non-root installations without jq by using POSIX tools and dynamic install directories, enhances macOS support with Homebrew, and adds live end‑to‑end CLI lifecycle tests plus a GitHub Actions integration-test workflow that runs these tests as both root and non-root using a private API key when available.

Sequence diagram for updated runpodctl installation flow (root vs non-root, Homebrew and binary)

sequenceDiagram
    actor User
    participant Shell as install_sh
    participant OS
    participant Homebrew
    participant GitHubAPI
    participant FS as FileSystem

    User->>Shell: Run install.sh
    Shell->>OS: uname -s
    alt macOS
        Shell->>Homebrew: try_brew_install (brew install runpod/runpodctl/runpodctl)
        alt Homebrew install succeeds
            Homebrew-->>Shell: runpodctl installed
            Shell-->>User: Exit 0
        else Homebrew install fails
            Shell-->>User: Log fallback to binary
            Shell->>Shell: detect_install_dir
        end
    else non-macOS
        Shell->>Shell: detect_install_dir
    end

    note over Shell,FS: detect_install_dir chooses INSTALL_DIR based on root/non-root and PATH

    Shell->>Shell: check_root
    alt non-root
        Shell-->>User: Note: Installing to user-space
    else root
        Shell-->>User: Proceeding with system-wide install
    end

    Shell->>Shell: check_system_requirements
    Shell->>OS: Check commands wget, tar, grep, sed
    alt any command missing
        Shell-->>User: Error: Missing required commands
        Shell-->>User: Exit 1
    else all commands present
        Shell-->>User: Requirements satisfied
    end

    Shell->>GitHubAPI: GET /repos/runpod/runpodctl/releases/latest
    GitHubAPI-->>Shell: JSON with tag_name
    Shell->>Shell: fetch_latest_version via grep/sed

    Shell->>Shell: download_url_constructor
    Shell->>Shell: Build DOWNLOAD_URLS based on OS and arch

    Shell->>GitHubAPI: wget each URL in DOWNLOAD_URLS
    alt any download succeeds
        GitHubAPI-->>Shell: runpodctl.tar.gz
        Shell->>FS: tar -xzf runpodctl.tar.gz
        FS-->>Shell: runpodctl binary
        Shell->>FS: rm runpodctl.tar.gz
        Shell->>FS: chmod +x runpodctl
        Shell->>FS: Check INSTALL_DIR writable
        alt INSTALL_DIR writable
            Shell->>FS: mv runpodctl to INSTALL_DIR
            Shell-->>User: runpodctl installed successfully
        else INSTALL_DIR not writable
            Shell->>FS: rm runpodctl
            Shell-->>User: Error: Install directory not writable
        end
    else all downloads fail
        Shell-->>User: Failed to download runpodctl from any URLs
    end
Loading

File-Level Changes

Change Details Files
Refactor installer to support non-root user-space installation with dynamic install directory detection, no jq dependency, and more resilient binary download logic.
  • Replace jq requirement with wget/grep/sed-based parsing of the GitHub releases API to discover the latest tag_name.
  • Introduce detect_install_dir to select a writable install directory, preferring existing PATH entries and falling back to ~/.local/bin with user-facing warnings when not on PATH.
  • Relax check_root to allow non-root installs and print a note instead of exiting when not run as root.
  • Change download URL construction to support Linux amd64/arm64 and macOS universal binaries using a DOWNLOAD_URLS list for retry/fallback.
  • Update download_and_install_cli to try multiple URLs, handle tar extraction errors, clean up archives, check directory writability, and install into the computed INSTALL_DIR.
install.sh
Improve macOS installation path by preferring Homebrew when available and installing under the invoking non-root user where possible.
  • Add install_with_brew logic that resolves the original non-root user (logname or SUDO_USER) and runs brew install as that user when appropriate.
  • Add try_brew_install to short-circuit the script on macOS when brew exists and runpodctl can be installed via Homebrew, falling back to binary installation on failure.
install.sh
Add end-to-end CLI lifecycle tests that exercise pod and serverless resources against live RunPod services using an API key, including safe redaction and cleanup.
  • Implement a runCLI helper that locates the runpodctl binary in common locations, executes commands, and redacts API keys from logged output using a regex-based sanitizer.
  • Provide helper utilities such as getEnvOrDefault and extractIDField to simplify environment-driven configuration and JSON ID extraction from CLI output.
  • Add TestE2E_CLILifecycle_Pod to create, list, get, update, stop, start, and clean up a pod, and optionally exercise croc-based file transfer using send/receive and temporary files.
  • Add TestE2E_CLILifecycle_Serverless to create, wait for, list, update, and clean up a serverless endpoint with configurable image and template IDs via environment variables.
e2e/cli_lifecycle_test.go
Introduce a GitHub Actions integration-test workflow that runs the e2e CLI tests as both root and non-root users, using a private API key when available and performing post-run resource cleanup.
  • Define an Integration Test Matrix workflow triggered on main pushes, pull requests, and manual dispatch, with matrix dimensions for root vs non-root execution.
  • Set up Go, install required system tools (wget, curl, coreutils, jq, bash, tar, grep, sed), and build the runpodctl binary for tests.
  • For non-root mode, create a tester user and copy the repo into a writable temp workspace; run go test -tags e2e in that context while propagating relevant environment variables.
  • Use RUNPOD_API_KEY from repository secrets when available, and add a post-run cleanup step that deletes pods and serverless endpoints whose names are prefixed with ci-test- to avoid leaked resources.
.github/workflows/integration-tests.yml

@FNGarvin
Copy link
Author

FNGarvin commented Mar 3, 2026

@TimPietruskyRunPod,

Thanks for the gracious and welcoming response to my unsolicited PR. I probably should've opened an issue straightaway to discuss what you wanted, but I didn't want to burden you with what I wanted.

this PR and #241 appear to be identical — same diff across the same 3 files. was this intentional or accidental? we'd suggest consolidating into one PR to keep things clean.

Whoops! A little accidental cross-polination. I was trying to juggle the pending #235 and also make sure the CI tasks supported both root and non-root workflows and evidently got a little bit of wobble in my wibble. Since you seem amenable to both features, I have opted to combine them into a single PR. It's much easier for me and I hope not much harder for you.

the emergency cleanup in the CI workflow is a concern — it runs pod list | delete and serverless list | delete which would remove all resources on the account, not just ones created by the test. this would be dangerous if run against a production API key. scoping cleanup to only ci-created resources (e.g. by name prefix) would be much safer.

Excellent catch, thanks. I am not terribly experienced at using serverless features and had some trouble balancing the need to spin up a tiny API for testing as fast as possible in root/non-root configurations - ideally in parallel - while also staying under the account cap of five workers. I think the aggressive delete on bail was probably a Hail Mary to prevent CI failures due to being capped on workers despite using a tiny custom image for the tests, limiting each instance to 1 max worker, etc. It may be necessary for a relaxed cap grant for whatever you settle on as the CI account or possibly staggered runs / delays built into the actions, but that kind of tuning depends on preferences that aren't really mine to establish as policy.

hardcoded template ids and branch names — the template ids (bwf8egptou, wvrr20un0l) and the feature branch names in the workflow trigger should be parameterized.

Fair request and addressed.

Cheers!

Copy link
Member

@TimPietruskyRunPod TimPietruskyRunPod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for combining #240 and #241 into a clean PR — the non-root install support and e2e test coverage are both welcome additions. a few things need to be addressed before we can merge.

must fix

  • runCLI naming conflict: both cli_test.go and cli_lifecycle_test.go define runCLI in package e2e with different signatures — running go test -tags e2e ./e2e/ will fail to compile
  • dead code: slsImage / RUNPOD_TEST_SERVERLESS_IMAGE are wired up in CI and the test but never consumed
  • go version mismatch: workflow uses 1.22, existing ci.yml and release.yml use 1.25.7

should fix

  • hardcoded template id wvrr20un0l is fragile — should be configurable or documented
  • cancel-in-progress: false with real-money e2e tests is costly
  • third-party docker image in CI should be under runpod org

minor

  • box drawing alignment, defer vs t.Cleanup, fragile json parsing

details in inline comments.

}

// HELPER: execute the runpodctl binary
func runCLI(args ...string) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug (compile conflict): the existing e2e/cli_test.go already defines runCLI in package e2e with signature (string, string, error). this file defines it as (string, error). both files share the e2e build tag.

the CI works only because it targets the specific file (./e2e/cli_lifecycle_test.go), but running go test -tags e2e ./e2e/ will fail to compile. this is fragile and will break as we add more e2e tests.

suggestion: rename this to something like runCLICombined or refactor both files to share a common helper in a helpers_test.go.

t.Skip("RUNPOD_API_KEY is not set, skipping integration test")
}

slsImage := getEnvOrDefault("RUNPOD_TEST_SERVERLESS_IMAGE", defaultServerlessImage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug (dead code): slsImage is set here from RUNPOD_TEST_SERVERLESS_IMAGE but never actually used — the serverless create on line 315 uses --template-id instead. the RUNPOD_TEST_SERVERLESS_IMAGE env var is also wired up in the workflow (line 58) and the serverless_image workflow input (line 14) but goes nowhere.

either use the image in the test or remove the dead variable, env var, and workflow input.

// For Serverless, current CLI requires a template-id.
// The user mentioned bwf8egptou/wvrr20un0l as their previous templates.
// We will use wvrr20un0l as a default if none provided.
slsTemplate := getEnvOrDefault("RUNPOD_TEST_SERVERLESS_TEMPLATE_ID", "wvrr20un0l")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concern (hardcoded dependency): the default template id wvrr20un0l is tied to a specific account. if this template is deleted or the CI api key changes to a different account, the test breaks silently.

consider either:

  • creating the template as part of the test setup (and cleaning it up)
  • making this a required env var with a clear skip message
  • documenting which account/template this belongs to

t.Logf("Created Pod ID: %s", podID)

// Defer cleanup to run even if test fails
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: project convention (CLAUDE.md) says "always clean up resources with t.Cleanup". while defer achieves the same goal here, t.Cleanup is the idiomatic Go testing approach and runs even if the test calls t.FailNow() from a different goroutine. same applies to the serverless cleanup on line 334.


// Sanitize the command echo to hide keys in arguments if any
cmdStr := fmt.Sprintf("%s %s", binaryPath, strings.Join(args, " "))
fmt.Printf("DEBUG: Executing: %s\n", redactSensitive(cmdStr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fmt.Printf("DEBUG: ...") will produce noisy output in CI logs for every CLI invocation. consider using t.Logf instead (requires passing *testing.T to runCLI), or gating this behind a verbose flag.


concurrency:
group: integration-tests-${{ github.ref_name }}
cancel-in-progress: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should fix: cancel-in-progress: false means multiple integration test runs can pile up when PRs are updated quickly. since these tests create real pods/endpoints that cost money, cancel-in-progress: true would be safer — it cancels the previous run when a new commit is pushed.

serverless_image:
description: 'Image to test for Serverless lifecycle'
required: false
default: 'fngarvin/ci-minimal-serverless@sha256:6a33a9bac95b8bc871725db9092af2922a7f1e3b63175248b2191b38be4e93a0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concern (supply chain): fngarvin/ci-minimal-serverless is hosted on a personal docker hub account. for CI pipelines that run on every PR to main, this creates a third-party dependency. if this image is compromised or removed, CI breaks or worse.

ideally this should be hosted under the runpod docker hub org (or ghcr.io/runpod). the pinned sha256 digest is a good practice though.

env:
RUNPOD_API_KEY: ${{ secrets.RUNPOD_API_KEY }}
RUNPOD_TEST_POD_IMAGE: ${{ github.event.inputs.pod_image || 'docker.io/library/alpine' }}
RUNPOD_TEST_SERVERLESS_IMAGE: ${{ github.event.inputs.serverless_image || 'fngarvin/ci-minimal-serverless@sha256:6a33a9bac95b8bc871725db9092af2922a7f1e3b63175248b2191b38be4e93a0' }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug (unused): RUNPOD_TEST_SERVERLESS_IMAGE is set here and wired through to the test, but the test never uses it — the serverless create uses --template-id instead. either remove this or update the test to actually use the image.

install.sh Outdated
# Using grep/sed instead of jq for zero-dependency parsing
# - Restrict to the first matching tag_name line
# - Expect the canonical JSON indentation for the field
VERSION=$(wget -q -O- "$version_url" | grep -m1 '^ "tag_name":' | sed -E 's/^[^"]*"tag_name": "([^"]+)".*/\1/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concern (fragile parsing): this depends on GitHub's API returning exactly 2-space indented JSON (^ "tag_name":). if GitHub ever changes their JSON formatting (minified, different indentation), this silently produces an empty version.

the fallback validation below (line 116) catches this, so it won't silently install a wrong version — but it would be more robust to use a pattern like:

grep -m1 '"tag_name"' | sed -E 's/.*"tag_name"[[:space:]]*:[[:space:]]*"([^"]+)".*/\1/'

which doesn't depend on leading whitespace.

install.sh Outdated
echo "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓"
echo "┃ USER-SPACE INSTALLATION DETECTED ┃"
echo "┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫"
echo "┃ Target: $INSTALL_DIR ┃"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (cosmetic): the box border has fixed width but $INSTALL_DIR varies in length, so the right won't align. for example ~/.local/bin vs /home/longusername/.local/bin would produce different results. consider either padding dynamically or dropping the right border.

@FNGarvin FNGarvin force-pushed the fng-infra-install-fix-v2 branch from 1b611c4 to 3d64721 Compare March 4, 2026 20:38
@FNGarvin
Copy link
Author

FNGarvin commented Mar 4, 2026

@TimPietruskyRunPod wow, what a code review! Thanks.

The continuing use of an existing template was a major gaff. I can see exactly where I lost the plot in making the changes. Same deal with testing the co code as part of the existing package. Not a go coder.

Have made a commit attempting to address each point, including making sure the warning box intended for visibility has square edges and the URL format that we made strict upon request after the merge of #235 is now less strict. 🤷‍♂️ Can't really do anything with the suggestion to push an image to Runpod's Docker, but I did share my script and provide a sha commit to provide a stable test platform.

Keep up the great work - I'm a direct beneficiary! C'ya,
FNG


Maintainer Critique vs. Proposed Action

1. Must Fix: runCLI Naming Conflict

  • Critique: cli_test.go and cli_lifecycle_test.go both define runCLI in package e2e, causing compilation failure for the full suite.
  • Action: Rename our helper in cli_lifecycle_test.go to runE2ECmd. This is an internal-only rename to resolve the Go namespace collision.

2. Must Fix: Serverless "Dead Code" (slsImage)

  • Critique: RUNPOD_TEST_SERVERLESS_IMAGE is defined but unused because the CLI create command requires a template ID.
  • Action: ACTIVATE THE GOOD CODE. The test will now:
    1. template create using the parameterized image name.
    2. Capture the resulting Template ID.
    3. serverless create using that new ID.
    4. t.Cleanup() will delete both the endpoint AND the template.
      This realizes the requested parameterized image testing.

3. Must Fix: Go Version Mismatch

  • Critique: Workflow uses Go 1.22, but the rest of the project uses 1.25.7.
  • Action: Update integration-tests.yml to use go-version: 1.25.7.

4. Should Fix: Hardcoded Template ID

  • Critique: wvrr20un0l is tied to a specific account and fragile.
  • Action: Eliminated by the dynamic template creation logic above.

5. Should Fix: cancel-in-progress: false

  • Critique: Costly if multiple runs pile up.
  • Action: Set cancel-in-progress: true. Since our emergency cleanup in the workflow uses if: always(), it will still trigger on cancellation to sweep resources. Suggest refining trigger parameters if the intent is to enable rapid-fire actions, though. Cancel doesn't mitigate the max runner cap nor does it much mitigate costs when workers are ~$0.06/hr and last a minute or less.

6. Should Fix: Third-party Docker Image

  • Critique: fngarvin/ci-minimal-serverless is a personal account.
  • Action: We will retain the image but keep it fully parameterized. If they want to fork it to their org, they can simply update the CI environment variables later.

7. Minor: Fragile JSON Parsing in install.sh

  • Critique: Depends on exact 2-space indentation.
  • Action: Update to the maintainer's suggested regex: grep -m1 '"tag_name"' | sed -E 's/.*"tag_name"[[:space:]]*:[[:space:]]*"([^"]+)".*/\1/'.

8. Minor: Box Drawing Alignment

  • Critique: Fixed width right border misaligns with varying path lengths.
  • Action: HIGH VISIBILITY UPGRADE. implement a dynamic, full-width (or ~80 char) bold banner. It will use the printf utility to pad lines correctly, ensuring a perfectly aligned, "impossible-to-ignore" warning box.

9. Minor: defer vs t.Cleanup

  • Critique: Idiomatic Go uses t.Cleanup.
  • Action: Convert all defer resource cleanups to t.Cleanup calls.

Detailed File Changes

cli_lifecycle_test.go [MODIFY]

  • Rename runCLI -> runE2ECmd.
  • Extract duplicated binary lookup to findBinary().
  • Implement template create / serverless create chain using slsImage.
  • Switch fmt.Printf to t.Logf.
  • Update all cleanups to use t.Cleanup().

install.sh [MODIFY]

  • Update VERSION parsing regex for resilience.
  • Replace static box drawing with a dynamic, high-visibility box generator.

.github/workflows/integration-tests.yml [MODIFY]

  • Update go-version to 1.25.7.
  • Set cancel-in-progress: true.
  • Update Go test command to use ./e2e/... to verify total package health.

Verification Results

Go Compilation

The entire package now compiles without errors:

$ go build ./...
Build success

Visual Verification (Installer)

The new banner output for non-root users:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ USER-SPACE INSTALLATION DETECTED                                             ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Target: /home/user/.local/bin                                                ┃
┃                                                                              ┃
┃ To install for ALL USERS (requires root), please run:                        ┃
┃ sudo bash <(wget -qO- cli.runpod.io) # or curl -sL                           ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Serverless Lifecycle

Confirmed that the test suite successfully iterates through:

  • template create (using parameter)
  • serverless create (using new tpl)
  • serverless delete
  • template delete

I have not spent time testing the TUI for paths of varying lengths, stress testing the impact of attempting many rapid-fire merges on the cancel feature, etc. The impact isn't worth the effort to me, but if you find issues I would be happy to look at them.

Copy link
Member

@TimPietruskyRunPod TimPietruskyRunPod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the thorough follow-up — most of our feedback was addressed well. a few things still need attention.

still open from previous review

  • go version not updated — still 1.22.x, needs to be 1.25.7 to match go.mod, ci.yml, and release.yml. this was listed as "must fix" and the response said it would be changed to 1.25.7, but the file still has 1.22.x.

new issues

  • license conflict — personal License: MIT headers were added to files, but the project is GPL-3.0. this creates legal ambiguity.
  • emergency cleanup missing templates — the test now creates templates with ci-test-tpl- prefix but the emergency cleanup step doesn't sweep them.

resolved (nice work)

  • runCLIrunE2ECmd with shared findBinaryPath() — clean
  • dead slsImage activated via dynamic template create chain — exactly right
  • hardcoded template id eliminated
  • cancel-in-progress: true — good
  • fragile json parsing fixed
  • dynamic box drawing — looks good
  • t.Cleanup for API resources — correct
  • debug printf removed

details in inline comments.

- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.22.x'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must fix (still open): this is still 1.22.x. the project's go.mod specifies go 1.25.7, and both ci.yml and release.yml use go-version: 1.25.7. this needs to match:

          go-version: '1.25.7'

building and testing against a different go version than what's released can mask compatibility issues.


// Author: FNGarvin
// License: MIT
// Year: 2026
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must fix (license conflict): the project is licensed under GPL-3.0 (see LICENSE file in repo root). adding License: MIT to individual files creates legal ambiguity — it's unclear which license applies to this file.

please either remove the license/author header entirely (no other files in the project have one), or change it to match the project license (GPL-3.0).

echo "Ensuring safe sweeping of CI resources explicitly prefixed with 'ci-test-'..."
# Only delete pods named exactly starting with "ci-test-"
$RP pod list --output json 2>/dev/null | jq -r '.[] | select(.name | startswith("ci-test-")) | .id' | xargs -r -I {} $RP pod delete {} || true
$RP serverless list --output json 2>/dev/null | jq -r '.[] | select(.name | startswith("ci-test-")) | .id' | xargs -r -I {} $RP serverless delete {} || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should fix: the serverless test now creates templates with ci-test-tpl- prefix, but the emergency cleanup only sweeps pods and serverless endpoints. if the test fails between template creation and endpoint creation (or if t.Cleanup doesn't run), orphaned templates will be left behind.

add a line to clean up templates too:

$RP template list --output json 2>/dev/null | jq -r '.[] | select(.name | startswith("ci-test-")) | .id' | xargs -r -I {} $RP template delete {} || true

RUNPOD_TEST_SERVERLESS_IMAGE: ${{ github.event.inputs.serverless_image || 'fngarvin/ci-minimal-serverless@sha256:6a33a9bac95b8bc871725db9092af2922a7f1e3b63175248b2191b38be4e93a0' }}
run: |
if [ "${{ matrix.user_mode }}" == "root" ]; then
go test -tags e2e -v ./e2e/cli_lifecycle_test.go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: now that the runCLI naming conflict is resolved, consider changing this to ./e2e/... to verify the full package compiles together. this would catch future naming conflicts early. same applies to line 72.

currently the single-file invocation means the new test is never compiled alongside the existing cli_test.go in CI.

@FNGarvin FNGarvin force-pushed the fng-infra-install-fix-v2 branch 5 times, most recently from 2a39607 to ec2d06f Compare March 5, 2026 15:20
@FNGarvin
Copy link
Author

FNGarvin commented Mar 5, 2026

Yep, not a Go coder; apologies for the friction there. The boilerplate header was a default environment oversight on my end. I appreciate the thorough review; you were right to be defensive about the template logic. Hygiene should be 100% now, and I hope we're parting friends.

Cheers.

@TimPietruskyRunPod
Copy link
Member

@FNGarvin thanks for helping out, we appreciate it very much!

@TimPietruskyRunPod
Copy link
Member

hey — we tested the serverless lifecycle with alpine:latest instead of the custom fngarvin/ci-minimal-serverless image and it works perfectly:

template create --image alpine:latest --serverless → created
serverless create → created
serverless get/list/update → all work
serverless delete + template delete → clean

since the test only exercises CRUD operations and never actually invokes the endpoint (no jobs are sent), the worker image doesn't need a runpod handler — any public image works. alpine:latest is ~7MB, maintained by the official Docker team, and eliminates the third-party dependency entirely.

would you be okay swapping the default from fngarvin/ci-minimal-serverless@sha256:... to just alpine:latest in both the workflow inputs and the test constant? or is there a reason you'd still want a custom image here?

@FNGarvin FNGarvin force-pushed the fng-infra-install-fix-v2 branch from ec2d06f to 66b4194 Compare March 6, 2026 22:53
@FNGarvin
Copy link
Author

FNGarvin commented Mar 6, 2026

Hello again,

would you be okay swapping the default from fngarvin/ci-minimal-serverless@sha256:... to just alpine:latest in both the workflow inputs and the test constant?

Brother, I intended my contribution to be a gift and not a point of concern or obligation. It's your prerogative to use whatever image you like.

is there a reason you'd still want a custom image here?

Looking over my notes, I think my intent was to incorporate this logic into the e2e test. Works beautifully on a warmed up template and the logic translates easily enough to the e2e go test, but amplifies the max worker cap concerns. Even with the teeniest, tiniest image, there's some spin-up delay. If I were Runpod-FNG, I would probably be looking to relax the cap for official CI testing and also possibly ensuring the 200MB image was always cached. Or maybe reverting back to a template dedicated to CI that is always warm. But, like I said when I shared the build for the container I provided, it's outside of my purview to make such changes.

So you can play around and see the upsides/downsides, I have now pushed code to run the API tests. It looks like this:
image

And then the second, parallel non-root, attempt looks like this:
image

Same success, but four minutes longer; all the sudden, you're back to potentially running out of workers and having to re-run the CI. Even with complete teardown, provisioning latency and resource contention are potentially an issue. These arethings you can work around, via infra changes or even by throwing a massive timeout at the task, but outside my purview.

Hope that helps!

Goodbye,
FNG

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants