Wire per-job shared volume and blob-mode flags into engine chart #383

Open
asafyehezkel wants to merge 7 commits into master from add-per-job-shared-volume

asafyehezkel commented Jun 3, 2026

Contributor

Helm side of the hybrid Redis->storage data plane (pairs with engine PR tensorleap/engine#2332).

  • Mount the per-job shared volume (k3d single-node hostPath) into the engine job pod and the orchestrator. The orchestrator uses it to prune orphaned per-job dirs left by hard-killed jobs.
  • Add PAYLOAD_STORE to engine-cm. The default is redis; blob offloads heavy queue payloads to the volume, and Redis carries only slim references.
  • Add Phase 5 blob-mode tunables: job_shared_max_concurrent_jobs, job_shared_disk_fraction, job_shared_disk_min_free_gb (used only when payload_store=blob).
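
As a rough illustration of the first bullet, a hostPath-backed volume mounted into a pod might look like the fragment below. This is a sketch under stated assumptions, not the chart's actual template: the volume name, container name, image, and host directory are hypothetical, and treating /job-shared (the path mentioned in the commit notes) as the in-container mount path is also an assumption.

```yaml
# Illustrative pod-spec fragment only; not taken from the chart.
spec:
  containers:
    - name: engine-job                 # hypothetical container name
      image: example/engine:latest     # placeholder image
      volumeMounts:
        - name: job-shared             # hypothetical volume name
          mountPath: /job-shared       # assumed to be the mount path
  volumes:
    - name: job-shared
      hostPath:
        path: /var/lib/job-shared      # hypothetical directory on the k3d node
        type: DirectoryOrCreate        # create the directory if it is missing
```

A hostPath volume is only shared between pods scheduled on the same node, which fits the single-node k3d scope stated in the description.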

Defaults preserve current behavior (payload_store=redis). On-prem chart only; cloud (engine/helm-chart, EFS) is a follow-up.
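
For orientation, the settings described above could surface roughly as follows. This is a sketch, not the chart's real files: the flat layout of the values keys and the exact shape of the engine-cm ConfigMap are assumptions, while the key names and the defaults (redis, 2 / 0.6 / 2) come from the description and commit notes.

```yaml
# Hypothetical values.yaml excerpt; the key nesting is an assumption.
payload_store: redis                 # "redis" (default) or "blob"
job_shared_max_concurrent_jobs: 2    # only read when payload_store is blob
job_shared_disk_fraction: 0.6        # only read when payload_store is blob
job_shared_disk_min_free_gb: 2       # only read when payload_store is blob
---
# Assumed rendering into the engine ConfigMap; values must be strings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: engine-cm
data:
  PAYLOAD_STORE: "redis"
  JOB_SHARED_MAX_CONCURRENT_JOBS: "2"
  JOB_SHARED_DISK_FRACTION: "0.6"
  JOB_SHARED_DISK_MIN_FREE_GB: "2"
```

With these defaults nothing changes for existing installs; switching payload_store to blob is what activates the three job_shared_* tunables.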

Asaf Yehezkel added 7 commits June 4, 2026 12:23
Expose JOB_SHARED_MAX_CONCURRENT_JOBS / JOB_SHARED_DISK_FRACTION /
JOB_SHARED_DISK_MIN_FREE_GB via engine-cm so the disk-bound queue-cap
budget and the disk-free guard floor can be tuned per node (bigger nodes
run more concurrent jobs). Defaults (2 / 0.6 / 2) match the in-code
defaults; only used when payload_store=blob.

The orchestrator now prunes per-job shared dirs leaked by hard-killed
jobs, so it needs the shared-volume base mounted (k3d single-node
hostPath at /job-shared).

REDIS_BLOB_POD_GB / REDIS_BLOB_MAXMEMORY_MB (defaults 1 GiB / 768 MiB)
set the cluster default for per-job Redis in blob mode. A per-job
pod_memory_override (via node-server's engine-redis-settings) still wins.
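
A minimal sketch of how the two Redis defaults above might be expressed, assuming they are carried in engine-cm like the other flags and that each value is a plain number in the unit its name suggests. Neither assumption is confirmed by the commit notes.

```yaml
# Illustrative only; the ConfigMap name and value encoding are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: engine-cm                    # assumed home for these keys
data:
  REDIS_BLOB_POD_GB: "1"             # per-job Redis pod memory, 1 GiB default
  REDIS_BLOB_MAXMEMORY_MB: "768"     # Redis maxmemory, 768 MiB default
```

Keeping maxmemory below the pod size leaves headroom inside the pod for memory that does not count toward Redis's maxmemory limit.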
asafyehezkel force-pushed the add-per-job-shared-volume branch from 6e7c421 to f2e9348 on June 4, 2026 09:27