-
Notifications
You must be signed in to change notification settings - Fork 910
Enhanced worker crash handling with integrated crash telemetry #5412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- Added enhanced crash handling logic with feature flag control - Implemented dual-mode operation for Plan v7 vs Plan v8+ scenarios - Added forced completion for Plan v8+ worker crashes - Enhanced logging for crash detection and completion analysis - Added notifyServerOfWorkerCrash variable for clear intent - Maintained backward compatibility with original logic - Added comprehensive trace logging for debugging
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
src/Agent.Listener/JobDispatcher.cs
Outdated
| detailInfo = string.Join(Environment.NewLine, workerOutput); | ||
| Trace.Info($"Return code {returnCode} indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result."); | ||
| await LogWorkerProcessUnhandledException(message, detailInfo, agentCertManager.SkipServerCertificateValidation); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have sendEvent to server method here as well, as here only we are logging if worker is terminated due to unhandled exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I did not do that because the completion reporting where after the below code and I kept it as it is.
// stop renew lock
lockRenewalTokenSource.Cancel();
// renew job request should never blows up.
await renewJobRequest;
I asked from copilot regarding can we keep reporting after we are sending telemetry and it is saying that current order is correct and given the below reason, not sure whether it will be correct to keep before or not.
Reason:
- Background renewal task: The renewJobRequest task is running continuously in the background to keep the job lock alive
- Race condition prevention: Without cancellation, renewal might continue even after job completion
|
@raujaiswal I've opened a new pull request, #5423, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sanjuyadav24
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add design doc, other documentation link in work item with anything related to this issue
Also please check, I think we should also add a catch condition the actual block of worker which threw unhandled exception and process crashed
…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Context
#AB2333253
Description
Problem (in Brief):
There were instances where the worker crashed, yet the pipeline continued running until it eventually hit the job‑level timeout. While the worker ideally should never crash, any unhandled exception causing a crash results in unnecessary resource consumption and prolonged pipeline execution—raising concerns about product reliability and efficiency.
How this problem gets noticed?
We identified this issue through an IcM where a regression was detected. Sharing the retrospective below for reference:
https://portal.microsofticm.com/imp/v5/retrospectives/Internal/1250792
Root Cause (Worker Crash – Brief Summary)
In the current Plan v8+ flow, the worker is responsible for reporting job status (Succeeded/Failed/Completed) back to the server. However, due to an unhandled exception, the worker process was crashing before sending the completion event. As a result, the server never received a signal to stop the job, and the pipeline continued running until the job-level timeout, leading to unnecessary resource consumption.
To address this, we implemented improvements focusing on two areas:
Preventing Worker Crashes
The worker ideally should never crash. Wherever crashes were caused by unhandled exceptions, we identified those scenarios and added the required try/catch handling to ensure the worker exits gracefully.
Fallback Handling via Listener
In cases where the worker still crashes unexpectedly, the listener now takes responsibility for notifying the server to stop the job. We also added telemetry signals to track and diagnose worker crashes more effectively.
Repro Steps:
Risk Assessment (Low / Medium / High)
Assess the risk level and justify your assessment. For example: code path sensitivity, usage scope, or backward compatibility concerns.
Unit Tests Added or Updated (Yes / No)
Indicate whether unit tests were added or modified to reflect the changes.
Additional Testing Performed
List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).
Change Behind Feature Flag (Yes / No)
Can this change be behine feature flag, if not why?
Tech Design / Approach
Design Doc (deep dive in Plan v7, plan v8+): https://microsoftapc.sharepoint.com/:w:/r/teams/ADOTasksandAgents/_layouts/15/Doc.aspx?sourcedoc=%7BDFED2317-D226-4EC9-A6AB-1E3E53D91934%7D&file=worker%20crash%20handling.docx&action=default&mobileredirect=true
Documentation Changes Required (Yes/No)
Indicate whether related documentation needs to be updated.
Logging Added/Updated (Yes/No)
Telemetry Added/Updated (Yes/No)
Kusto Query:
Rollback Scenario and Process (Yes/No)
Dependency Impact Assessed and Regression Tested (Yes/No)