-
Notifications
You must be signed in to change notification settings - Fork 151
fix: move actor metrics to separate workflow #4044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: move actor metrics to separate workflow #4044
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
🚅 Deployed to the rivet-pr-4044 environment in rivet-frontend
|
| ctx.listen_n::<Main>(256).await? | ||
| } else { | ||
| // Jitter sleep to prevent stampeding herds | ||
| let jitter = { rand::thread_rng().gen_range(0..METRICS_INTERVAL_MS / 10) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not deterministic, not a big deal but should not be done in a wf. maybe determine jitter once at beginning of wf?
Merge activity
|
91e86f9 to
58af8e5
Compare
2146e7b to
7e99948
Compare
Code ReviewThis PR extracts actor metrics collection into a separate workflow, improving separation of concerns and maintainability. The refactoring looks solid overall with a few observations: ✅ Strengths
🔍 Observations & Suggestions1. KV Size Calculation Logic (engine/packages/pegboard/src/workflows/actor/metrics.rs:199-316)The chunked scanning approach is good for handling large KV stores, but there's a potential issue in the loop logic:
let mut destroy = false; // Instead of true2. State Duplication (engine/packages/pegboard/src/workflows/actor/metrics.rs:20-32)There are two similar state structs -
Recommendation: Consider whether both are needed or if one can be removed. 3. Transaction Timeout Handling (engine/packages/pegboard/src/workflows/actor/metrics.rs:263-265)The timeout check happens inside the transaction stream loop, which is good. However, the warning logs the actor_id using structured logging correctly per CLAUDE.md guidelines. 4. Version Rollback (Cargo.lock changes)All package versions rolled back from 2.0.42 to 2.0.38. This is unusual for a feature PR - typically versions increment forward. This might indicate:
Recommendation: Verify this is intentional and not a merge artifact. 5. Removed UDB Keys Command (engine/packages/engine/src/commands/udb_keys.rs)The entire Question: Were the other features of 6. Missing Awake Duration TrackingThe metrics workflow now tracks awake duration separately, but I noticed the old code in Verification Needed: Ensure the new approach correctly accounts for all awake time, especially edge cases like:
7. Config Schema UpdatesNew config fields added (engine/artifacts/config-schema.json:408-456):
These seem unrelated to the metrics workflow changes. Are they from another PR or intentionally bundled? 🔒 Security & Performance
📝 Style & ConventionsCode follows CLAUDE.md guidelines:
🧪 TestingConsider adding tests for:
SummaryThis is a well-structured refactoring that improves code organization. The main concern is the Recommendation: Approve with fix for the destroy flag initialization. |
Code Review: Move Actor Metrics to Separate WorkflowThis PR refactors actor metrics collection by extracting it into a dedicated workflow. Overall, this is a well-structured refactoring that improves separation of concerns. Here are my findings: ✅ Strengths
🔍 Issues & Concerns1. Logic Bug in Destroy Signal Handling (metrics.rs:67)let mut destroy = true; // Should be false
for sig in &sigs {
match sig {
Main::Pause(sig) => { /* ... */ }
Main::Resume(sig) => { /* ... */ }
Main::Destroy(sig) => {
// ...
destroy = true;
break;
}
}
}Issue: 2. Inconsistent Awake Duration Calculation (metrics.rs:94-101)// Timeout was reached, record duration up till now
if sigs.is_empty() {
let now = util::timestamp::now();
if let Some(last_recorded_awake_ts) = state.last_recorded_awake_ts {
new_awake_duration += now - last_recorded_awake_ts;
}
state.last_recorded_awake_ts = Some(now);
}Issue: When a timeout occurs, the code records awake duration and resets 3. Redundant KV Metrics Recording (metrics.rs:111)ctx.activity(RecordKvMetricsInput {}).await?;Issue: KV metrics are recorded every interval regardless of whether the actor is paused or running. This seems wasteful - KV metrics should likely only be recorded when the actor is active and potentially writing data, not during every pause cycle. 4. Missing Error Context (metrics.rs:221-238)The estimate/scan logic doesn't handle estimation failures gracefully. If Suggestion: Consider wrapping the estimate in a 5. Race Condition in Signal Timing (mod.rs:1398-1401)ctx.v(2)
.signal(metrics::Pause { ts: util::timestamp::now() })
.to_workflow_id(metrics_workflow_id)
.send()
.await?;Issue: The timestamp is captured at signal send time, not at the actual pause event. If there's latency in signal delivery, this could lead to inaccurate duration calculations. Consider passing the timestamp from the caller who knows when the actual event occurred. 6. Potential Integer Overflow (metrics.rs:72, 85, 98)Multiple places add timestamps without overflow checking: new_awake_duration += sig.ts - last_recorded_awake_ts;While unlikely in practice, very long-running actors with precision timing could theoretically overflow 7. Removed Total Actors Metric (setup.rs:118)The PR removes the - namespace::keys::metric::inc(
- &tx.with_subspace(namespace::keys::subspace()),
- input.namespace_id,
- namespace::keys::metric::Metric::TotalActors(input.name.clone()),
- 1,
- );Question: Is this intentional? This seems like a useful metric to track. If it's being moved elsewhere, that should be documented. 📝 Code Style & Best Practices
🧪 Testing ConsiderationsMissing Test Coverage: I don't see any tests added for the new metrics workflow. Consider adding tests for:
🔒 Security & Performance
📊 Config Schema ChangesThe config schema changes look reasonable - new fields are properly optional and documented as experimental. Recommendations
VerdictThe refactoring is architecturally sound and improves code organization, but there's at least one critical bug that needs fixing before merge. With the suggested fixes, this would be a solid improvement to the codebase. Review generated by Claude Code |

No description provided.