Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

Copy link
Contributor Author

MasterPtato commented Jan 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@railway-app
Copy link

railway-app bot commented Jan 26, 2026

🚅 Deployed to the rivet-pr-4044 environment in rivet-frontend

Service Status Web Updated (UTC)
mcp-hub ✅ Success (View Logs) Web Jan 27, 2026 at 3:17 am
website 🕒 Building (View Logs) Web Jan 27, 2026 at 3:17 am
frontend-cloud 🕒 Building (View Logs) Web Jan 27, 2026 at 3:17 am
frontend-inspector 🕒 Building (View Logs) Web Jan 27, 2026 at 3:17 am

@MasterPtato MasterPtato marked this pull request as ready for review January 26, 2026 21:28
ctx.listen_n::<Main>(256).await?
} else {
// Jitter sleep to prevent stampeding herds
let jitter = { rand::thread_rng().gen_range(0..METRICS_INTERVAL_MS / 10) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not deterministic, not a big deal but should not be done in a wf. maybe determine jitter once at beginning of wf?

@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 27, 2026

Merge activity

@MasterPtato MasterPtato force-pushed the 01-23-fix_move_actor_metrics_to_separate_workflow branch from 91e86f9 to 58af8e5 Compare January 27, 2026 03:16
@MasterPtato MasterPtato force-pushed the 01-22-chore_clean_up_pools_remove_unused_crates branch from 2146e7b to 7e99948 Compare January 27, 2026 03:16
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4044 January 27, 2026 03:16 Destroyed
@claude
Copy link

claude bot commented Jan 27, 2026

Code Review

This PR extracts actor metrics collection into a separate workflow, improving separation of concerns and maintainability. The refactoring looks solid overall with a few observations:

✅ Strengths

  1. Improved Separation of Concerns: Moving metrics to a dedicated workflow (engine/packages/pegboard/src/workflows/actor/metrics.rs:45-124) makes the main actor workflow cleaner and more maintainable.

  2. Signal-Based Lifecycle Management: The use of Pause/Resume/Destroy signals (engine/packages/pegboard/src/workflows/actor/metrics.rs:318-337) provides clean coordination between workflows.

  3. Proper State Management: The metrics workflow maintains its own state tracking awake duration and KV storage size separately from the main actor workflow.

  4. Added UDB CLI Tool: New size command (engine/packages/engine/src/commands/udb/cli.rs:353-408) is useful for debugging storage issues. The implementation correctly handles both estimated and exact size calculations.

🔍 Observations & Suggestions

1. KV Size Calculation Logic (engine/packages/pegboard/src/workflows/actor/metrics.rs:199-316)

The chunked scanning approach is good for handling large KV stores, but there's a potential issue in the loop logic:

  • Line 67: destroy is initialized to true but should probably be false. When sigs is empty (timeout case), destroy remains true which would incorrectly break the loop.
  • Recommendation: Initialize destroy = false and only set it to true when a Destroy signal is received.
let mut destroy = false;  // Instead of true

2. State Duplication (engine/packages/pegboard/src/workflows/actor/metrics.rs:20-32)

There are two similar state structs - State and LifecycleState - both tracking last_kv_storage_size. This duplication could lead to inconsistencies:

  • State is set once in init_state activity
  • LifecycleState is used in the loop but also has last_kv_storage_size

Recommendation: Consider whether both are needed or if one can be removed.

3. Transaction Timeout Handling (engine/packages/pegboard/src/workflows/actor/metrics.rs:263-265)

The timeout check happens inside the transaction stream loop, which is good. However, the warning logs the actor_id using structured logging correctly per CLAUDE.md guidelines.

4. Version Rollback (Cargo.lock changes)

All package versions rolled back from 2.0.42 to 2.0.38. This is unusual for a feature PR - typically versions increment forward. This might indicate:

  • The PR is based on an older branch
  • A manual version reset occurred

Recommendation: Verify this is intentional and not a merge artifact.

5. Removed UDB Keys Command (engine/packages/engine/src/commands/udb_keys.rs)

The entire udb_keys command module was deleted (200 lines). The new size subcommand in the udb CLI provides similar functionality, but some features like ParseConflictLogs were removed.

Question: Were the other features of udb_keys (conflict log parsing) intentionally removed, or should they be migrated?

6. Missing Awake Duration Tracking

The metrics workflow now tracks awake duration separately, but I noticed the old code in set_sleeping (engine/packages/pegboard/src/workflows/actor/runtime.rs:1172-1625) was recording awake duration at sleep time. The new approach records it periodically + at pause/destroy time, which should be equivalent but uses a different mechanism.

Verification Needed: Ensure the new approach correctly accounts for all awake time, especially edge cases like:

  • Actor destroyed while awake (handled at line 84-86 ✓)
  • Multiple rapid pause/resume cycles
  • Backfilled metrics workflows starting in paused state

7. Config Schema Updates

New config fields added (engine/artifacts/config-schema.json:408-456):

  • force_shutdown_duration
  • default_metadata_poll_interval
  • min_metadata_poll_interval

These seem unrelated to the metrics workflow changes. Are they from another PR or intentionally bundled?

🔒 Security & Performance

  • No security concerns identified - metrics collection is internal and doesn't expose sensitive data
  • Performance: Chunked KV scanning with timeout protection is well-designed
  • Transaction handling: Proper use of Snapshot isolation for reads

📝 Style & Conventions

Code follows CLAUDE.md guidelines:

  • ✅ Hard tabs used consistently
  • ✅ Structured logging with field syntax (?actor_id)
  • ✅ Lowercase log messages
  • ✅ Proper error handling with Result<> returns

🧪 Testing

Consider adding tests for:

  1. Metrics workflow state transitions (paused → resumed → destroyed)
  2. KV size calculation edge cases (empty stores, stores < 3MB, > 3MB)
  3. Awake duration accumulation across multiple pause/resume cycles

Summary

This is a well-structured refactoring that improves code organization. The main concern is the destroy initialization bug (line 67) that should be fixed. Other observations are minor and mostly for verification/documentation purposes.

Recommendation: Approve with fix for the destroy flag initialization.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 27, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4044

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4044

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4044

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4044

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4044

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4044

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4044

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4044

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4044

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4044

commit: 58af8e5

@claude
Copy link

claude bot commented Jan 27, 2026

Code Review: Move Actor Metrics to Separate Workflow

This PR refactors actor metrics collection by extracting it into a dedicated workflow. Overall, this is a well-structured refactoring that improves separation of concerns. Here are my findings:

✅ Strengths

  1. Good Separation of Concerns: Extracting metrics into a separate workflow (pegboard_actor_metrics) improves modularity and makes the main actor workflow cleaner and more focused.

  2. Proper Signal-Based Communication: The use of Pause, Resume, and Destroy signals provides clean lifecycle management between the parent and metrics workflows.

  3. Chunked KV Scanning: The record_kv_metrics function properly handles large datasets by scanning in chunks with timeout protection (EARLY_TXN_TIMEOUT), preventing long-running transactions.

  4. Smart Estimation Strategy: The code uses FDB's size estimation for datasets >3MB and falls back to manual scanning for smaller datasets, following FDB's recommendations.

🔍 Issues & Concerns

1. Logic Bug in Destroy Signal Handling (metrics.rs:67)

let mut destroy = true;  // Should be false
for sig in &sigs {
    match sig {
        Main::Pause(sig) => { /* ... */ }
        Main::Resume(sig) => { /* ... */ }
        Main::Destroy(sig) => {
            // ...
            destroy = true;
            break;
        }
    }
}

Issue: destroy is initialized to true on line 67, but should be false. This means if no signals are received in a batch, the workflow will incorrectly terminate. The logic should only set destroy = true when explicitly receiving a Destroy signal.

2. Inconsistent Awake Duration Calculation (metrics.rs:94-101)

// Timeout was reached, record duration up till now
if sigs.is_empty() {
    let now = util::timestamp::now();
    if let Some(last_recorded_awake_ts) = state.last_recorded_awake_ts {
        new_awake_duration += now - last_recorded_awake_ts;
    }
    state.last_recorded_awake_ts = Some(now);
}

Issue: When a timeout occurs, the code records awake duration and resets last_recorded_awake_ts to now. However, if the timeout happens while paused, this will incorrectly record time. The code should check if !state.paused before recording duration.

3. Redundant KV Metrics Recording (metrics.rs:111)

ctx.activity(RecordKvMetricsInput {}).await?;

Issue: KV metrics are recorded every interval regardless of whether the actor is paused or running. This seems wasteful - KV metrics should likely only be recorded when the actor is active and potentially writing data, not during every pause cycle.

4. Missing Error Context (metrics.rs:221-238)

The estimate/scan logic doesn't handle estimation failures gracefully. If estimate_kv_size fails, the error propagates up immediately rather than falling back to manual scanning.

Suggestion: Consider wrapping the estimate in a match or logging the error and falling through to manual scanning.

5. Race Condition in Signal Timing (mod.rs:1398-1401)

ctx.v(2)
    .signal(metrics::Pause { ts: util::timestamp::now() })
    .to_workflow_id(metrics_workflow_id)
    .send()
    .await?;

Issue: The timestamp is captured at signal send time, not at the actual pause event. If there's latency in signal delivery, this could lead to inaccurate duration calculations. Consider passing the timestamp from the caller who knows when the actual event occurred.

6. Potential Integer Overflow (metrics.rs:72, 85, 98)

Multiple places add timestamps without overflow checking:

new_awake_duration += sig.ts - last_recorded_awake_ts;

While unlikely in practice, very long-running actors with precision timing could theoretically overflow i64. Consider using checked_add or saturating arithmetic.

7. Removed Total Actors Metric (setup.rs:118)

The PR removes the TotalActors metric increment without explanation:

- namespace::keys::metric::inc(
-     &tx.with_subspace(namespace::keys::subspace()),
-     input.namespace_id,
-     namespace::keys::metric::Metric::TotalActors(input.name.clone()),
-     1,
- );

Question: Is this intentional? This seems like a useful metric to track. If it's being moved elsewhere, that should be documented.

📝 Code Style & Best Practices

  1. Comment Quality (metrics.rs:211): Good adherence to CLAUDE.md - complete sentence structure in comments.

  2. Logging: Good use of structured logging with ?actor_id following the project's tracing patterns.

  3. Transaction Naming: Proper use of custom_instrument with descriptive span names.

🧪 Testing Considerations

Missing Test Coverage: I don't see any tests added for the new metrics workflow. Consider adding tests for:

  • Pause/Resume signal handling and duration calculation
  • Destroy signal termination
  • KV metrics estimation vs. manual scanning
  • Edge cases (rapid pause/resume cycles, very large KV stores)

🔒 Security & Performance

  1. Transaction Timeout Protection: Good use of EARLY_TXN_TIMEOUT to prevent runaway transactions.

  2. Jitter for Stampede Prevention: Good use of jitter (line 60) to prevent thundering herd.

  3. Snapshot Isolation: Proper use of Snapshot isolation level for read-only metrics queries.

📊 Config Schema Changes

The config schema changes look reasonable - new fields are properly optional and documented as experimental.

Recommendations

  1. Fix the destroy = true initialization bug (critical)
  2. Add !state.paused check before recording timeout durations (correctness)
  3. Consider skipping KV metrics when paused (performance)
  4. Add test coverage for the metrics workflow (quality)
  5. Document why TotalActors metric was removed (clarity)
  6. Consider passing actual event timestamps instead of util::timestamp::now() (accuracy)

Verdict

The refactoring is architecturally sound and improves code organization, but there's at least one critical bug that needs fixing before merge. With the suggested fixes, this would be a solid improvement to the codebase.


Review generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants