Skip to content

fix: job run scheduler termination column#4905

Merged
williamvega merged 26 commits into
masterfrom
fix/job-run-scheduler-termination-col
Jun 2, 2026
Merged

fix: job run scheduler termination column#4905
williamvega merged 26 commits into
masterfrom
fix/job-run-scheduler-termination-col

Conversation

@williamvega

@williamvega williamvega commented May 12, 2026

Copy link
Copy Markdown
Collaborator

What type of PR is this?

fix

What this PR does / why we need it

This PR fixes the original problem in issue #4839, while further work is needed to include cancellation as a scheduler termination reason and populate the column with other reasons besides preemption.

The preemption reason stored in lookout's job_run error column was getting overwritten by other events. We now store the preemption reason and other reasons the scheduler may cancel a job in its own column, scheduler_termination_reason.

The first event termination reason we are storing in this column is from the JobPreemptedEvent, and later prs should include the JobCancelledEvent reason as suggested in the issue.

Fair-share preemption:
image

Manual preemption:
image

Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
…tion

Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
@williamvega williamvega force-pushed the fix/job-run-scheduler-termination-col branch 3 times, most recently from 1d02fb8 to 6909d85 Compare May 13, 2026 19:51
Signed-off-by: William Vega <williamvega1006@gmail.com>
@williamvega williamvega force-pushed the fix/job-run-scheduler-termination-col branch from 6909d85 to e5447ce Compare May 13, 2026 21:15
@williamvega williamvega changed the title fix: job run scheduler termination col fix: job run scheduler termination column May 13, 2026
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
…lliamvega:armadaproject/armada into fix/job-run-scheduler-termination-col

Signed-off-by: William Vega <williamvega1006@gmail.com>
@williamvega williamvega marked this pull request as ready for review May 14, 2026 14:52
@greptile-apps

greptile-apps Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces a dedicated scheduler_termination_reason JSONB column on job_run to preserve the preemption reason without it being overwritten by subsequent events. It also adds a new PreemptingJobId field on JobRunPreempted to track fair-share preemptions, exposes a new GET /api/v1/jobRunSchedulerTerminationReason endpoint, and renders the reason in the UI run-details sidebar.

  • DB + ingester: A migration adds scheduler_termination_reason JSONB to job_run; the ingester populates it from JobRunPreempted events via BuildTerminationReason, using COALESCE to avoid overwriting on subsequent updates.
  • Proto changes: preempting_job_id is added to both armadaevents.JobRunPreempted (field 8) and api.JobPreemptedEvent (field 10, with field 7 reserved); the scheduler now passes the preempting job ID into the event.
  • API + UI: A new lookout REST endpoint retrieves the column, and JobRunDetails.tsx renders it as a pretty-printed JSON accordion when present.

Confidence Score: 5/5

Safe to merge; the migration is purely additive, the ingester changes are well-guarded with nil checks, and the new endpoint follows established patterns.

The DB migration is non-destructive (nullable column), the ingester correctly uses COALESCE so existing rows are unaffected, proto field numbering is handled cleanly (field 7 reserved, new field at 10), and the scheduler changes correctly pass the preempting job ID for fair-share preemptions while passing empty string for manual and reconciliation paths. Test coverage for the new code paths is solid.

No files require special attention.

Important Files Changed

Filename Overview
internal/lookoutingester/instructions/instructions.go Adds SchedulerTerminationReason population in handleJobRunPreempted; correctly guards nil map before calling BuildTerminationReason; exports BuildTerminationReason for reuse.
internal/lookoutingester/lookoutdb/insertion.go Correctly adds scheduler_termination_reason to both temp-table batch copy and scalar update paths; COALESCE preserves existing value when new one is NULL; conflation logic updated.
internal/lookout/repository/getjobrunschedulerterminationreason.go New repository implementation queries scheduler_termination_reason with NULL guard; correctly maps pgx.ErrNoRows to ErrNotFound.
internal/lookout/application.go New handler wired correctly; returns HTTP 400 for all errors (including ErrNotFound), consistent with existing pattern but conflates not-found with server errors.
internal/scheduler/scheduler.go Passes preempting job ID into createEventsForPreemptedJob; preemptingJobId() helper correctly handles nil; reconciliation and API-preemption paths pass empty string.
internal/lookoutui/src/services/lookout/useGetJobRunSchedulerTerminationReason.ts Follows the established pattern of useGetJobRunDebugMessage; fakeDataEnabled branch returns empty string instead of a stub value, so the accordion is invisible in fake-data mode.
pkg/armadaevents/events.proto Adds preempting_job_id as field 8 on JobRunPreempted; non-breaking additive change.
pkg/api/event.proto Reserves field 7 and adds preempting_job_id at field 10 on JobPreemptedEvent; Rust client correctly updated to tag 10.
internal/lookout/schema/migrations/034_add_scheduler_termination_reason.sql Single-line migration adds scheduler_termination_reason JSONB column (nullable); non-destructive and safe to apply to existing tables.
internal/lookoutui/src/pages/jobs/components/sidebar/JobRunDetails.tsx Renders scheduler termination reason as a pretty-printed JSON accordion; correctly skips rendering when data is empty or absent.

Sequence Diagram

sequenceDiagram
    participant Scheduler
    participant Pulsar
    participant Ingester
    participant DB as job_run (Postgres)
    participant API as Lookout API
    participant UI

    Scheduler->>Pulsar: "JobRunPreempted{PreemptedRunId, PreemptingJobId, Reason}"
    Pulsar->>Ingester: consume event
    Ingester->>Ingester: handleJobRunPreempted() BuildTerminationReason
    Ingester->>DB: "UPDATE job_run SET scheduler_termination_reason = COALESCE(new, existing)"
    UI->>API: POST /api/v1/jobRunSchedulerTerminationReason
    API->>DB: "SELECT scheduler_termination_reason WHERE run_id=$1"
    DB-->>API: JSON object
    API-->>UI: schedulerTerminationReason string
    UI->>UI: JSON.parse then pretty-print in accordion
Loading

Reviews (10): Last reviewed commit: "Merge branch 'master' into fix/job-run-s..." | Re-trigger Greptile

Comment thread pkg/api/event.proto Outdated
Comment on lines +483 to 487
var terminationReason map[string]any
if event.Reason != "" || terminationReasonArgs != nil {
terminationReason = BuildTerminationReason(event.Reason, terminationReasonArgs)
}
jobRun := model.UpdateJobRunInstruction{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty reason stored when only preemptingJobId is present

If an inbound JobRunPreempted event carries a non-empty PreemptingJobId but an empty Reason, the condition event.Reason != "" || terminationReasonArgs != nil is true, so BuildTerminationReason("", args) is called and {"reason": "", "args": {"preemptingJobId": "..."}} is persisted. The empty string for reason would be rendered verbatim in the UI sidebar. In practice this scenario shouldn't occur today because all scheduler code paths that set PreemptingJobId also set a reason, but it may be worth guarding or documenting.

Comment thread client/rust/src/gen/api.rs Outdated
Signed-off-by: William Vega <williamvega1006@gmail.com>
Comment thread client/rust/src/gen/api.rs
@williamvega williamvega merged commit 9a81877 into master Jun 2, 2026
32 checks passed
@williamvega williamvega deleted the fix/job-run-scheduler-termination-col branch June 2, 2026 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants