fix: move actor metrics to separate workflow #4044

MasterPtato · 2026-01-26T21:23:31Z

No description provided.

MasterPtato · 2026-01-26T21:23:41Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

fix: move actor metrics to separate workflow #4044 : 2 dependent PRs (#4014 , #4047 ) 👈 (View in Graphite)
chore: clean up pools, remove unused crates #4013
feat: add actor and kv metrics #3952
fix: remove attributes from last metric push metric #4043
fix(pb): always check actor queue on init #3982
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

railway-app · 2026-01-26T21:23:41Z

🚅 Deployed to the rivet-pr-4044 environment in rivet-frontend

Service	Status	Web	Updated (UTC)
mcp-hub	✅ Success (View Logs)	Web	Jan 27, 2026 at 3:17 am
website	🕒 Building (View Logs)	Web	Jan 27, 2026 at 3:17 am
frontend-cloud	🕒 Building (View Logs)	Web	Jan 27, 2026 at 3:17 am
frontend-inspector	🕒 Building (View Logs)	Web	Jan 27, 2026 at 3:17 am

NathanFlurry · 2026-01-27T01:42:12Z

engine/packages/pegboard/src/workflows/actor/metrics.rs

+				ctx.listen_n::<Main>(256).await?
+			} else {
+				// Jitter sleep to prevent stampeding herds
+				let jitter = { rand::thread_rng().gen_range(0..METRICS_INTERVAL_MS / 10) };


this is not deterministic, not a big deal but should not be done in a wf. maybe determine jitter once at beginning of wf?

graphite-app · 2026-01-27T02:49:39Z

Merge activity

Jan 27, 2:49 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 27, 2:50 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3982.
Jan 27, 2:50 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #3982.
Jan 27, 2:59 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 27, 3:00 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #4043.
Jan 27, 3:00 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #4043.
Jan 27, 3:12 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 27, 3:12 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #4043.
Jan 27, 3:12 AM UTC: The Graphite merge queue removed this pull request due to downstack failures on PR #4043.
Jan 27, 3:17 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
Jan 27, 3:18 AM UTC: CI is running for this pull request on a draft pull request (#4048) due to your merge queue CI optimization settings.
Jan 27, 3:19 AM UTC: Merged by the Graphite merge queue via draft PR: #4048.

claude · 2026-01-27T03:18:53Z

Code Review

This PR extracts actor metrics collection into a separate workflow, improving separation of concerns and maintainability. The refactoring looks solid overall with a few observations:

✅ Strengths

Improved Separation of Concerns: Moving metrics to a dedicated workflow (engine/packages/pegboard/src/workflows/actor/metrics.rs:45-124) makes the main actor workflow cleaner and more maintainable.
Signal-Based Lifecycle Management: The use of Pause/Resume/Destroy signals (engine/packages/pegboard/src/workflows/actor/metrics.rs:318-337) provides clean coordination between workflows.
Proper State Management: The metrics workflow maintains its own state tracking awake duration and KV storage size separately from the main actor workflow.
Added UDB CLI Tool: New size command (engine/packages/engine/src/commands/udb/cli.rs:353-408) is useful for debugging storage issues. The implementation correctly handles both estimated and exact size calculations.

🔍 Observations & Suggestions

1. KV Size Calculation Logic (engine/packages/pegboard/src/workflows/actor/metrics.rs:199-316)

The chunked scanning approach is good for handling large KV stores, but there's a potential issue in the loop logic:

Line 67: destroy is initialized to true but should probably be false. When sigs is empty (timeout case), destroy remains true which would incorrectly break the loop.
Recommendation: Initialize destroy = false and only set it to true when a Destroy signal is received.

let mut destroy = false;  // Instead of true

2. State Duplication (engine/packages/pegboard/src/workflows/actor/metrics.rs:20-32)

There are two similar state structs - State and LifecycleState - both tracking last_kv_storage_size. This duplication could lead to inconsistencies:

State is set once in init_state activity
LifecycleState is used in the loop but also has last_kv_storage_size

Recommendation: Consider whether both are needed or if one can be removed.

3. Transaction Timeout Handling (engine/packages/pegboard/src/workflows/actor/metrics.rs:263-265)

The timeout check happens inside the transaction stream loop, which is good. However, the warning logs the actor_id using structured logging correctly per CLAUDE.md guidelines.

4. Version Rollback (Cargo.lock changes)

All package versions rolled back from 2.0.42 to 2.0.38. This is unusual for a feature PR - typically versions increment forward. This might indicate:

The PR is based on an older branch
A manual version reset occurred

Recommendation: Verify this is intentional and not a merge artifact.

5. Removed UDB Keys Command (engine/packages/engine/src/commands/udb_keys.rs)

The entire udb_keys command module was deleted (200 lines). The new size subcommand in the udb CLI provides similar functionality, but some features like ParseConflictLogs were removed.

Question: Were the other features of udb_keys (conflict log parsing) intentionally removed, or should they be migrated?

6. Missing Awake Duration Tracking

The metrics workflow now tracks awake duration separately, but I noticed the old code in set_sleeping (engine/packages/pegboard/src/workflows/actor/runtime.rs:1172-1625) was recording awake duration at sleep time. The new approach records it periodically + at pause/destroy time, which should be equivalent but uses a different mechanism.

Verification Needed: Ensure the new approach correctly accounts for all awake time, especially edge cases like:

Actor destroyed while awake (handled at line 84-86 ✓)
Multiple rapid pause/resume cycles
Backfilled metrics workflows starting in paused state

7. Config Schema Updates

New config fields added (engine/artifacts/config-schema.json:408-456):

force_shutdown_duration
default_metadata_poll_interval
min_metadata_poll_interval

These seem unrelated to the metrics workflow changes. Are they from another PR or intentionally bundled?

🔒 Security & Performance

No security concerns identified - metrics collection is internal and doesn't expose sensitive data
Performance: Chunked KV scanning with timeout protection is well-designed
Transaction handling: Proper use of Snapshot isolation for reads

📝 Style & Conventions

Code follows CLAUDE.md guidelines:

✅ Hard tabs used consistently
✅ Structured logging with field syntax (?actor_id)
✅ Lowercase log messages
✅ Proper error handling with Result<> returns

🧪 Testing

Consider adding tests for:

Metrics workflow state transitions (paused → resumed → destroyed)
KV size calculation edge cases (empty stores, stores < 3MB, > 3MB)
Awake duration accumulation across multiple pause/resume cycles

Summary

This is a well-structured refactoring that improves code organization. The main concern is the destroy initialization bug (line 67) that should be fixed. Other observations are minor and mostly for verification/documentation purposes.

Recommendation: Approve with fix for the destroy flag initialization.

pkg-pr-new · 2026-01-27T03:18:57Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4044

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4044

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4044

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4044

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4044

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4044

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4044

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4044

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4044

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4044

commit: 58af8e5

claude · 2026-01-27T03:19:11Z

Code Review: Move Actor Metrics to Separate Workflow

This PR refactors actor metrics collection by extracting it into a dedicated workflow. Overall, this is a well-structured refactoring that improves separation of concerns. Here are my findings:

✅ Strengths

Good Separation of Concerns: Extracting metrics into a separate workflow (pegboard_actor_metrics) improves modularity and makes the main actor workflow cleaner and more focused.
Proper Signal-Based Communication: The use of Pause, Resume, and Destroy signals provides clean lifecycle management between the parent and metrics workflows.
Chunked KV Scanning: The record_kv_metrics function properly handles large datasets by scanning in chunks with timeout protection (EARLY_TXN_TIMEOUT), preventing long-running transactions.
Smart Estimation Strategy: The code uses FDB's size estimation for datasets >3MB and falls back to manual scanning for smaller datasets, following FDB's recommendations.

🔍 Issues & Concerns

1. Logic Bug in Destroy Signal Handling (metrics.rs:67)

let mut destroy = true;  // Should be false
for sig in &sigs {
    match sig {
        Main::Pause(sig) => { /* ... */ }
        Main::Resume(sig) => { /* ... */ }
        Main::Destroy(sig) => {
            // ...
            destroy = true;
            break;
        }
    }
}

Issue: destroy is initialized to true on line 67, but should be false. This means if no signals are received in a batch, the workflow will incorrectly terminate. The logic should only set destroy = true when explicitly receiving a Destroy signal.

2. Inconsistent Awake Duration Calculation (metrics.rs:94-101)

// Timeout was reached, record duration up till now
if sigs.is_empty() {
    let now = util::timestamp::now();
    if let Some(last_recorded_awake_ts) = state.last_recorded_awake_ts {
        new_awake_duration += now - last_recorded_awake_ts;
    }
    state.last_recorded_awake_ts = Some(now);
}

Issue: When a timeout occurs, the code records awake duration and resets last_recorded_awake_ts to now. However, if the timeout happens while paused, this will incorrectly record time. The code should check if !state.paused before recording duration.

3. Redundant KV Metrics Recording (metrics.rs:111)

ctx.activity(RecordKvMetricsInput {}).await?;

Issue: KV metrics are recorded every interval regardless of whether the actor is paused or running. This seems wasteful - KV metrics should likely only be recorded when the actor is active and potentially writing data, not during every pause cycle.

4. Missing Error Context (metrics.rs:221-238)

The estimate/scan logic doesn't handle estimation failures gracefully. If estimate_kv_size fails, the error propagates up immediately rather than falling back to manual scanning.

Suggestion: Consider wrapping the estimate in a match or logging the error and falling through to manual scanning.

5. Race Condition in Signal Timing (mod.rs:1398-1401)

ctx.v(2)
    .signal(metrics::Pause { ts: util::timestamp::now() })
    .to_workflow_id(metrics_workflow_id)
    .send()
    .await?;

Issue: The timestamp is captured at signal send time, not at the actual pause event. If there's latency in signal delivery, this could lead to inaccurate duration calculations. Consider passing the timestamp from the caller who knows when the actual event occurred.

6. Potential Integer Overflow (metrics.rs:72, 85, 98)

Multiple places add timestamps without overflow checking:

new_awake_duration += sig.ts - last_recorded_awake_ts;

While unlikely in practice, very long-running actors with precision timing could theoretically overflow i64. Consider using checked_add or saturating arithmetic.

7. Removed Total Actors Metric (setup.rs:118)

The PR removes the TotalActors metric increment without explanation:

- namespace::keys::metric::inc(
-     &tx.with_subspace(namespace::keys::subspace()),
-     input.namespace_id,
-     namespace::keys::metric::Metric::TotalActors(input.name.clone()),
-     1,
- );

Question: Is this intentional? This seems like a useful metric to track. If it's being moved elsewhere, that should be documented.

📝 Code Style & Best Practices

Comment Quality (metrics.rs:211): Good adherence to CLAUDE.md - complete sentence structure in comments.
Logging: Good use of structured logging with ?actor_id following the project's tracing patterns.
Transaction Naming: Proper use of custom_instrument with descriptive span names.

🧪 Testing Considerations

Missing Test Coverage: I don't see any tests added for the new metrics workflow. Consider adding tests for:

Pause/Resume signal handling and duration calculation
Destroy signal termination
KV metrics estimation vs. manual scanning
Edge cases (rapid pause/resume cycles, very large KV stores)

🔒 Security & Performance

Transaction Timeout Protection: Good use of EARLY_TXN_TIMEOUT to prevent runaway transactions.
Jitter for Stampede Prevention: Good use of jitter (line 60) to prevent thundering herd.
Snapshot Isolation: Proper use of Snapshot isolation level for read-only metrics queries.

📊 Config Schema Changes

The config schema changes look reasonable - new fields are properly optional and documented as experimental.

Recommendations

Fix the destroy = true initialization bug (critical)
Add !state.paused check before recording timeout durations (correctness)
Consider skipping KV metrics when paused (performance)
Add test coverage for the metrics workflow (quality)
Document why TotalActors metric was removed (clarity)
Consider passing actual event timestamps instead of util::timestamp::now() (accuracy)

Verdict

The refactoring is architecturally sound and improves code organization, but there's at least one critical bug that needs fixing before merge. With the suggested fixes, this would be a solid improvement to the codebase.

Review generated by Claude Code

This was referenced Jan 26, 2026

feat: add actor and kv metrics #3952

Closed

fix(pb): always check actor queue on init #3982

Closed

chore: clean up pools, remove unused crates #4013

Closed

railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4044 January 26, 2026 21:23 Destroyed

MasterPtato mentioned this pull request Jan 26, 2026

DO NOT MERGE: add metrics exporter for ee #4014

Closed

MasterPtato mentioned this pull request Jan 26, 2026

fix: remove attributes from last metric push metric #4043

Closed

MasterPtato marked this pull request as ready for review January 26, 2026 21:28

NathanFlurry approved these changes Jan 27, 2026

View reviewed changes

fix: move actor metrics to separate workflow

58af8e5

MasterPtato force-pushed the 01-23-fix_move_actor_metrics_to_separate_workflow branch from 91e86f9 to 58af8e5 Compare January 27, 2026 03:16

MasterPtato force-pushed the 01-22-chore_clean_up_pools_remove_unused_crates branch from 2146e7b to 7e99948 Compare January 27, 2026 03:16

railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4044 January 27, 2026 03:16 Destroyed

MasterPtato mentioned this pull request Jan 27, 2026

chore: remove ch provision users #4047

Closed

graphite-app bot pushed a commit that referenced this pull request Jan 27, 2026

fix: move actor metrics to separate workflow (#4044)

dcccd89

graphite-app bot closed this Jan 27, 2026

graphite-app bot deleted the 01-23-fix_move_actor_metrics_to_separate_workflow branch January 27, 2026 03:19

This was referenced Jan 27, 2026

fix: fix active req caculation, add hws metrics #4053

Closed

chore(guard): clean up guard #4057

Closed

fix(gateway): configure cors for error responses #4058

Closed

fix(guard): cache custom serve routes #4059

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: move actor metrics to separate workflow #4044

fix: move actor metrics to separate workflow #4044

Uh oh!

MasterPtato commented Jan 26, 2026

Uh oh!

MasterPtato commented Jan 26, 2026 •

edited

Loading

Uh oh!

railway-app bot commented Jan 26, 2026 •

edited

Loading

Uh oh!

NathanFlurry Jan 27, 2026

Uh oh!

graphite-app bot commented Jan 27, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 27, 2026

Uh oh!

pkg-pr-new bot commented Jan 27, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: move actor metrics to separate workflow #4044

fix: move actor metrics to separate workflow #4044

Uh oh!

Conversation

MasterPtato commented Jan 26, 2026

Uh oh!

MasterPtato commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

railway-app bot commented Jan 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanFlurry Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

graphite-app bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

claude bot commented Jan 27, 2026

Code Review

✅ Strengths

🔍 Observations & Suggestions

1. KV Size Calculation Logic (engine/packages/pegboard/src/workflows/actor/metrics.rs:199-316)

2. State Duplication (engine/packages/pegboard/src/workflows/actor/metrics.rs:20-32)

3. Transaction Timeout Handling (engine/packages/pegboard/src/workflows/actor/metrics.rs:263-265)

4. Version Rollback (Cargo.lock changes)

5. Removed UDB Keys Command (engine/packages/engine/src/commands/udb_keys.rs)

6. Missing Awake Duration Tracking

7. Config Schema Updates

🔒 Security & Performance

📝 Style & Conventions

🧪 Testing

Summary

Uh oh!

pkg-pr-new bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Jan 27, 2026

Code Review: Move Actor Metrics to Separate Workflow

✅ Strengths

🔍 Issues & Concerns

1. Logic Bug in Destroy Signal Handling (metrics.rs:67)

2. Inconsistent Awake Duration Calculation (metrics.rs:94-101)

3. Redundant KV Metrics Recording (metrics.rs:111)

4. Missing Error Context (metrics.rs:221-238)

5. Race Condition in Signal Timing (mod.rs:1398-1401)

6. Potential Integer Overflow (metrics.rs:72, 85, 98)

7. Removed Total Actors Metric (setup.rs:118)

📝 Code Style & Best Practices

🧪 Testing Considerations

🔒 Security & Performance

📊 Config Schema Changes

Recommendations

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MasterPtato commented Jan 26, 2026 •

edited

Loading

railway-app bot commented Jan 26, 2026 •

edited

Loading

graphite-app bot commented Jan 27, 2026 •

edited

Loading

pkg-pr-new bot commented Jan 27, 2026 •

edited

Loading