fix: job run scheduler termination column#4905
Conversation
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
…tion Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
1d02fb8 to
6909d85
Compare
Signed-off-by: William Vega <williamvega1006@gmail.com>
6909d85 to
e5447ce
Compare
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
…lliamvega:armadaproject/armada into fix/job-run-scheduler-termination-col Signed-off-by: William Vega <williamvega1006@gmail.com>
Greptile SummaryThis PR introduces a dedicated
Confidence Score: 5/5Safe to merge; the migration is purely additive, the ingester changes are well-guarded with nil checks, and the new endpoint follows established patterns. The DB migration is non-destructive (nullable column), the ingester correctly uses COALESCE so existing rows are unaffected, proto field numbering is handled cleanly (field 7 reserved, new field at 10), and the scheduler changes correctly pass the preempting job ID for fair-share preemptions while passing empty string for manual and reconciliation paths. Test coverage for the new code paths is solid. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Scheduler
participant Pulsar
participant Ingester
participant DB as job_run (Postgres)
participant API as Lookout API
participant UI
Scheduler->>Pulsar: "JobRunPreempted{PreemptedRunId, PreemptingJobId, Reason}"
Pulsar->>Ingester: consume event
Ingester->>Ingester: handleJobRunPreempted() BuildTerminationReason
Ingester->>DB: "UPDATE job_run SET scheduler_termination_reason = COALESCE(new, existing)"
UI->>API: POST /api/v1/jobRunSchedulerTerminationReason
API->>DB: "SELECT scheduler_termination_reason WHERE run_id=$1"
DB-->>API: JSON object
API-->>UI: schedulerTerminationReason string
UI->>UI: JSON.parse then pretty-print in accordion
Reviews (10): Last reviewed commit: "Merge branch 'master' into fix/job-run-s..." | Re-trigger Greptile |
| var terminationReason map[string]any | ||
| if event.Reason != "" || terminationReasonArgs != nil { | ||
| terminationReason = BuildTerminationReason(event.Reason, terminationReasonArgs) | ||
| } | ||
| jobRun := model.UpdateJobRunInstruction{ |
There was a problem hiding this comment.
Empty
reason stored when only preemptingJobId is present
If an inbound JobRunPreempted event carries a non-empty PreemptingJobId but an empty Reason, the condition event.Reason != "" || terminationReasonArgs != nil is true, so BuildTerminationReason("", args) is called and {"reason": "", "args": {"preemptingJobId": "..."}} is persisted. The empty string for reason would be rendered verbatim in the UI sidebar. In practice this scenario shouldn't occur today because all scheduler code paths that set PreemptingJobId also set a reason, but it may be worth guarding or documenting.
Signed-off-by: William Vega <williamvega1006@gmail.com>
Signed-off-by: William Vega <williamvega1006@gmail.com>
What type of PR is this?
fix
What this PR does / why we need it
This PR fixes the original problem in issue #4839, while further work is needed to include cancellation as a scheduler termination reason and populate the column with other reasons besides preemption.
The preemption reason stored in lookout's job_run error column was getting overwritten by other events. We now store the preemption reason and other reasons the scheduler may cancel a job in its own column, scheduler_termination_reason.
The first event termination reason we are storing in this column is from the JobPreemptedEvent, and later prs should include the JobCancelledEvent reason as suggested in the issue.
Fair-share preemption:

Manual preemption:
