Skip to content

fix: workflow cancel panic#257

Merged
jason-lynch merged 1 commit intomainfrom
fix/PLAT-403/cancel-panic
Jan 27, 2026
Merged

fix: workflow cancel panic#257
jason-lynch merged 1 commit intomainfrom
fix/PLAT-403/cancel-panic

Conversation

@jason-lynch
Copy link
Member

@jason-lynch jason-lynch commented Jan 25, 2026

Summary

Changes the post-query pending event sort to use a microsecond-precision timestamp rather than the event's time.Time. The time.Time caused problems because it serializes to a string with second-precision by default, so events that occur close together could end up with equal timestamps. The microsecond precision reduces the chance of incorrect sorting to a tolerable level.

I did look at using CreateRevision, but I learned that CreateRevision can still be equal if the events were recorded within the same transaction. CreateRevision worked in testing, but there's a non-zero chance that it could have broken in the future.

Testing

make test-workflows-backend
make test-e2e

PLAT-403

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

The pull request relocates event sorting logic from application-level (in-memory sorting in etcd.go) to database-level (etcd client-side sorting in pending_event/store.go), removing the sortPendingEvents function and consolidating the sorting operation into a single clientv3 query parameter.

Changes

Cohort / File(s) Summary
Event sorting refactoring
server/internal/workflows/backend/etcd/etcd.go
Removed slices import, removed sortPendingEvents function, and removed calls to sorting in GetWorkflowTask and CompleteWorkflowTask methods
Event query sorting
server/internal/workflows/backend/etcd/pending_event/store.go
Added clientv3.WithSort(clientv3.SortByCreateRevision, clientv3.SortAscend) option to GetPrefixOp call in GetByInstanceExecution method

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • rshoemaker

Poem

🐰 Hop, hop! The sorting has moved with glee,
From memory to etcd, now sorted by spree,
No more manual shuffle, let the database play,
Cleaner code hops along the workflow way! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description provides a clear summary, testing commands, and issue reference, but lacks explicit sections for Changes and Checklist items as specified in the template. Add a 'Changes' section with bulleted list of modifications and complete the 'Checklist' with checkboxes confirming tests, docs, and changelog updates per the template.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: workflow cancel panic' accurately and concisely describes the main change—fixing a workflow cancellation panic by switching from post-query timestamp sorting to query-time CreateRevision sorting.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jason-lynch
Copy link
Member Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@server/internal/workflows/backend/etcd/pending_event/store.go`:
- Around line 51-53: GetByInstanceExecution currently relies only on
clientv3.SortByCreateRevision which can produce nondeterministic ordering when
multiple events share the same CreateRevision; update the fetch to guarantee
deterministic ordering by applying a secondary tie-breaker: either request a
stable secondary sort by key (e.g., include SortByKey/SortAscend if the client
supports composing sorts) or, more robustly, perform a client-side stable sort
of the returned KVs by (CreateRevision, Key) before converting to Value. Locate
Store.GetByInstanceExecution (uses InstanceExecutionPrefix and
storage.NewGetPrefixOp) and modify the retrieval flow so the final ordering is
stable by CreateRevision then Key (or implement the stable-sorting logic in the
GetMultipleOp processing path that wraps NewGetPrefixOp).

@jason-lynch jason-lynch force-pushed the test/rerun-flaky-tests branch from d333f55 to 146650c Compare January 25, 2026 19:12
@jason-lynch jason-lynch force-pushed the fix/PLAT-403/cancel-panic branch from 1fc7985 to 7264b7e Compare January 25, 2026 19:13
Changes the post-query pending event sort to use a microsecond-precision
timestamp rather than the event's `time.Time`. The `time.Time` caused
problems because it serializes to a string with second-precision by
default, so events that occur close together could end up with equal
timestamps. The microsecond precision reduces the chance of incorrect
sorting to a tolerable level.

I did look at using CreateRevision, but I learned that CreateRevision
can still be equal if the events were recorded within the same
transaction. CreateRevision worked in testing, but there's a non-zero
chance that it could have broken in the future.

PLAT-403
@jason-lynch jason-lynch force-pushed the fix/PLAT-403/cancel-panic branch from 7264b7e to 9b24f17 Compare January 26, 2026 14:44
Copy link
Contributor

@rshoemaker rshoemaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved.

Base automatically changed from test/rerun-flaky-tests to main January 27, 2026 13:47
@jason-lynch jason-lynch merged commit 1a1b813 into main Jan 27, 2026
3 checks passed
@jason-lynch jason-lynch deleted the fix/PLAT-403/cancel-panic branch January 27, 2026 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants