Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

Copy link
Member Author

NathanFlurry commented Jan 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 26, 2026

More templates

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4038

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4038

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@4038

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4038

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4038

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4038

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4038

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4038

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4038

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4038

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4038

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4038

commit: 4189086

@claude
Copy link

claude bot commented Jan 26, 2026

PR Review: Traces UI

This PR adds a comprehensive tracing UI along with a queue inspector feature. Overall, the implementation is solid with good architectural decisions. Below are my findings:


Strengths

Architecture & Design

  • Well-structured feature flags: Version-based feature detection (MIN_RIVETKIT_VERSION_TRACES, MIN_RIVETKIT_VERSION_QUEUE) with clear user messaging is excellent
  • Efficient data handling: Using CBOR serialization, BARE protocol versioning, and proper chunking for trace data shows strong understanding of performance needs
  • Separation of concerns: Clean separation between UI components (actor-traces.tsx), data processing (read-range.ts), and inspector protocol (actor-inspector-context.tsx)
  • Live/historical modes: The dual-mode UI with preset time ranges and custom date picking is user-friendly

Code Quality

  • Proper use of React patterns: useMemo for expensive computations (trace tree building, query results), proper dependency arrays
  • Good error handling: Graceful degradation when features aren't available, clear error states
  • Type safety: Strong TypeScript usage throughout with proper type imports from OTLP schemas

🔍 Issues & Recommendations

1. Performance Concerns

Critical: Unbounded recursion in trace rendering (actor-traces.tsx:247-291)

function renderItemsWithGaps(items: TraceItem[], depth: number, nowNs: bigint): ReactElement[]
  • Issue: No maximum depth check for nested spans
  • Impact: Deep trace hierarchies could cause stack overflow or performance degradation
  • Fix: Add a MAX_DEPTH constant and stop recursion at that level with a visual indicator
const MAX_DEPTH = 20;
if (depth >= MAX_DEPTH) {
  return [<div key="max-depth" className="text-xs text-muted-foreground ml-4">Max nesting depth reached...</div>];
}

Memory: Large trace arrays kept in memory (actor-traces.tsx:113-119)

const traceTree = useMemo(() => {
  if (!queryResult) return [];
  const spans = extractSpans(queryResult.otlp);
  return buildSpanTree(spans);
}, [queryResult]);
  • Issue: With DEFAULT_LIMIT = 1000, this could be substantial memory
  • Suggestion: Consider pagination or virtualization for very large trace sets
  • Note: The 1000 limit helps, but document this clearly for users

Sorting performance (read-range.ts:129-133)

records.sort((a, b) => {
  if (a.absNs < b.absNs) return -1;
  if (a.absNs > b.absNs) return 1;
  return a.sequence - b.sequence;
});
  • Issue: Sorting bigints can be slow for large datasets
  • Optimization: Consider using a more efficient comparison: return Number(a.absNs - b.absNs) || (a.sequence - b.sequence) (with overflow checks)

2. Bug: Potential race condition in queue size updates

Location: queue-manager.ts:80, 97, 143, 166

this.#actor.inspector.updateQueueSize(this.#metadata.size);
  • Issue: Queue size updates happen after database writes but aren't atomic with them
  • Scenario: If a write fails after metadata is updated in memory, the inspector will show incorrect size
  • Fix: Only call updateQueueSize after successful writes are confirmed
await this.#driver.kvBatchPut(this.#actor.id, [...]);
// Only update inspector after successful write
this.#actor.inspector.updateQueueSize(this.#metadata.size);

3. Code Quality Issues

Missing null checks (actor-traces.tsx:463-464)

const parentId = node.span.parentSpanId;
if (parentId && byId.has(parentId)) {
  byId.get(parentId)?.children.push(node);  // Optional chaining after has() check is redundant
}
  • Issue: After has(parentId) check, get(parentId) can still be undefined (rare edge case)
  • Fix: Use the pattern:
if (parentId) {
  const parent = byId.get(parentId);
  if (parent) parent.children.push(node);
}

Inconsistent error handling (actor-queue.tsx:32-38)

if (queueStatusQuery.isError || !queueStatusQuery.data) {
  return <div>Queue data is currently unavailable.</div>;
}
  • Issue: Doesn't distinguish between network errors, permissions, or actual failures
  • Suggestion: Check queueStatusQuery.error and provide more specific messaging

Magic numbers should be constants

// actor-traces.tsx:102
refetchInterval: isLive ? 1000 : false,
  • Fix: const LIVE_REFETCH_INTERVAL_MS = 1000;

4. Testing Concerns

No tests for new functionality

  • Missing: Tests for trace tree building, OTLP conversion, queue status queries
  • Recommendation: Add unit tests for:
    • buildSpanTree with nested/orphaned spans
    • readRangeWireToOtlp with edge cases (empty, clamped, out-of-range)
    • Queue status truncation logic

Test cleanup improvement needed (test/mod.ts:83-85)

c.onTestFinished(async () => {
  await new Promise((resolve) => server.close(() => resolve(undefined)));
});
  • Issue: Server close doesn't have error handling or timeout
  • Suggestion: Add timeout to prevent hanging tests

5. Documentation

Missing JSDoc comments

  • Key functions lack documentation:
    • buildSpanTree - should explain the algorithm
    • otlpAnyValueToJs - should note type conversion rules
    • normalizeQueueStatus - should explain BigInt to Number conversions

Changelog/migration notes needed

  • Users upgrading from pre-2.0.40 RivetKit should understand:
    • New features require version upgrade
    • What data migration (if any) is needed
    • Breaking changes to inspector protocol

🔒 Security

✅ Good practices observed:

  • Proper token handling in WebSocket protocols
  • CBOR deserialization wrapped in try-catch blocks (read-range.ts:384-388)
  • Size limits enforced (DEFAULT_LIMIT, maxQueueStatusLimit)

⚠️ Minor concern:

Trace data exposure: Ensure trace data doesn't leak sensitive information in attributes

  • Current: No PII filtering on attributes
  • Recommendation: Document that users should avoid logging sensitive data, or add opt-in filtering

📊 Performance Metrics

Based on code analysis:

  • Memory: ~1-2MB for 1000 spans (acceptable)
  • Latency: Live refresh every 1s might be aggressive for slow connections
    • Suggestion: Consider adaptive polling based on fetch duration
  • Database: CBOR encoding is efficient, but ensure FDB transaction limits aren't hit with large traces

🎯 Action Items

High Priority

  1. Add max depth limit to trace rendering recursion
  2. Fix queue size update race condition
  3. Add proper error boundaries around trace rendering

Medium Priority

  1. Add unit tests for trace parsing and tree building
  2. Document migration path for RivetKit version upgrades
  3. Improve error messaging specificity

Low Priority

  1. Extract magic numbers to named constants
  2. Add JSDoc comments to complex functions
  3. Consider virtualization for very large trace lists

📝 Code Style

Adheres well to CLAUDE.md:

  • ✅ Uses lowercase log messages
  • ✅ Proper timestamp naming (createdAtMs, timeUnixNano)
  • ✅ Structured logging with tracing
  • ✅ No unnecessary formatting

Minor style notes:

  • Some functions exceed 100 lines (e.g., ActorTraces) - consider breaking into smaller components
  • Consistent use of arrow functions vs function declarations

🎉 Summary

This is a high-quality PR that adds significant value to the Rivet Actors platform. The trace visualization is well-designed and the queue inspector is a useful addition. The main concerns are around edge cases (recursion depth, race conditions) and testing coverage, but these are addressable post-merge with follow-up PRs if time-sensitive.

Recommendation: ✅ Approve with minor revisions

Focus on the high-priority items (recursion depth limit, race condition fix) before merging, and address the medium/low priority items in follow-up PRs.

@claude
Copy link

claude bot commented Jan 28, 2026

PR Review: feat(frontend): traces ui

Overview

This PR adds a comprehensive traces UI to the frontend, replacing the events system with OpenTelemetry-based distributed tracing. The changes span across frontend components, RivetKit inspector protocol, and the traces package.

Positive Highlights

Architecture & Design

  • Clean separation of concerns: The traces functionality is well-isolated in the @rivetkit/traces package with clear boundaries
  • Protocol versioning: Good use of versioned protocols (v4) with backward compatibility converters in versioned.ts:96-99
  • Feature detection: Smart version-based feature detection with clear user messaging when features are not supported (actor-inspector-context.tsx:115-137)

Code Quality

  • Type safety: Strong TypeScript usage throughout, especially in the OTLP type definitions
  • Data transformation pipeline: Well-structured conversion from wire format → OTLP → UI display (read-range.ts:101-145)
  • Error handling: Graceful degradation when traces are unavailable (actor-traces.tsx:153-158)

Issues & Concerns

🔴 Critical Issues

1. Missing Error Handling in CBOR Decoding (actor-inspector-context.tsx:738-750)

cbor.decode(new Uint8Array(body.val.output))

If the CBOR data is corrupted, this will throw and crash the message handler. Should wrap in try/catch like in read-range.ts:385-388.

2. Unbounded Memory Growth Risk (actor-traces.tsx:113-119)
The traceTree useMemo builds an entire tree structure in memory. With DEFAULT_LIMIT = 1000 spans, and potentially deep trees, this could consume significant memory. Consider:

  • Virtualization for rendering
  • Pagination or windowing
  • Memory limit warnings

3. Race Condition in Queue Size Updates (actor-inspector.ts:61-67)

updateQueueSize(size: number) {
    if (this.#lastQueueSize === size) return;
    this.#lastQueueSize = size;
    this.emitter.emit("queueUpdated");
}

No synchronization mechanism. If called concurrently, could lead to inconsistent state or missed updates.

⚠️ High Priority Issues

4. Timeout Without Cleanup (actor-inspector-context.tsx:829-835)

setTimeout(() => {
    if (this.suspensions.has(id)) {
        reject(new Error("Action timed out"));
        this.suspensions.delete(id);
    }
}, timeoutMs);

The timeout handler is never cleared. If the action resolves before timeout, the timeout still fires. This leaks memory and could reject an already-resolved promise.

Fix:

const timeoutId = setTimeout(() => { ... }, timeoutMs);
return {
    id,
    promise: promise.finally(() => clearTimeout(timeoutId))
};

5. Missing Input Validation (actor-inspector-context.tsx:625-644)

getQueueStatus: async (limit) => {
    const safeLimit = Math.max(0, Math.floor(limit));

The frontend passes DEFAULT_QUEUE_LIMIT = 200 but there is no maximum validation. A malicious or buggy client could request millions of messages.

6. Potential BigInt Overflow (actor-traces.tsx:565-566)

function nsToMs(ns: bigint): number {
    return Number(ns / 1_000_000n);
}

If ns is a very large bigint, converting to Number could result in Infinity or precision loss. Should add bounds checking.

7. BigInt Serialization Issues (actor-inspector-context.tsx:615-616)

startMs: BigInt(Math.floor(startMs)),
endMs: BigInt(Math.floor(endMs)),

Creating BigInts from potentially NaN or Infinity values will throw. Need validation.

📝 Medium Priority Issues

8. Inefficient Sorting (actor-traces.tsx:252-254)

const sorted = [...items].sort((a, b) =>
    a.timeNs < b.timeNs ? -1 : a.timeNs > b.timeNs ? 1 : 0,
);

Creates a new array copy every render. Since items is already memoized, this should be memoized too.

9. Missing Null Checks (actor-traces.tsx:464)

byId.get(parentId)?.children.push(node);

Uses optional chaining but still attempts push. If the parent does not exist, this silently fails. Should log or handle this case.

10. Inconsistent Queue Size Handling (actor-queue.tsx:41-42)

const size = Number.isFinite(status.size) ? status.size : queueSizeQuery.data ?? 0;

Why might status.size be non-finite? This suggests a data quality issue. Should investigate root cause.

11. Deprecated Code Path (test/mod.ts:44-47)

// const internalClient = createClientWithDriver(
//     managerDriver,
//     ClientConfigSchema.parse({}),
// );

Commented code with TODO at line 58. Should either remove or complete the migration.

12. String Interpolation in JSX (actor-traces.tsx:213-216)

{`${format(displayRange.from, "PPpp")}${format(displayRange.to, "PPpp")}`}

Works but creates a new string on every render. Consider memoizing formatted strings.

🔍 Code Style & Best Practices

13. Inconsistent Comparison (actor-traces.tsx:308)

const isActive = node.endNs == null;

Uses == instead of ===. While intentional (catches both null and undefined), the codebase should be consistent. Based on CLAUDE.md conventions, prefer explicit checks.

14. Magic Numbers (actor-traces.tsx:47, 48)

const GAP_THRESHOLD_MS = 500;
const DEFAULT_LIMIT = 1000;

These should be configurable or at least documented why these specific values.

15. Missing Documentation (read-range.ts:229-242)
The recordSpanId function extracts span IDs from different record types but lacks documentation explaining the record type hierarchy.

16. TODO Comments (actor-inspector-context.tsx:206, 223, 245)
Multiple TODOs for database functionality. Should be tracked as issues if not completing in this PR.

17. Type Assertion Without Validation (actor-inspector-context.tsx:254)

return [] as unknown as Record<string, unknown>[];

Dangerous type assertion. If the implementation changes, this could cause runtime errors.

🛡️ Security Considerations

18. Token Exposure (actor-inspector-context.tsx:461-462)

credentials.token ? `rivet_token.${credentials.token}` : "",

Tokens are sent in WebSocket subprotocols. Ensure these are not logged or exposed in client-side debugging tools.

19. No Rate Limiting
The live mode refetches every second (actor-traces.tsx:102). With multiple actors open, this could overwhelm the backend. Consider:

  • Increasing interval to 2-5 seconds
  • Implementing request coalescing
  • Adding circuit breaker for repeated failures

🧪 Testing Concerns

20. Test Changes Look Incomplete (test/mod.ts:58)

// TODO: I think this whole function is fucked, we should probably switch to calling registry.serve() directly

Strong language suggests this needs attention. The test setup looks complex and brittle.

21. No Tests for New UI Components
The new traces UI components lack corresponding test files. Given the complexity of the state management and time-range logic, this is risky.

Recommendations

Immediate Fixes Required

  1. Fix timeout memory leak ([SVC-2483] Remove hardcoded uses of rivet.gg #4)
  2. Add CBOR decode error handling ([SVC-2555] Set up issue templates #1)
  3. Validate BigInt inputs ([SVC-2404] Improve internal documentation #7)
  4. Add max limit to queue status requests ([SVC-2358] Enable Redis memory overcommit #5)

Before Merge

  1. Memoize sorting operations (Remove fallible uses of as in api services #8)
  2. Add bounds checking to nsToMs (Remove servers not in salt #6)
  3. Clean up TODO comments or create issues ([SOC-45] [SOC-44] Add back presence_game_public_metadata and presence_game_friend_metadata to analytics-event-create #16)
  4. Document magic numbers (Fix compile error in mm-worker #14)

Follow-up Work

  1. Add virtualization for large trace trees ([SVC-2479] Send cluster events to PostHog #2)
  2. Add unit tests for trace UI components ([SOC-17] Fix user presence returning "gm" #21)
  3. Implement rate limiting or request coalescing ([SOC-46] [SOC-23] Add mutual team ids index to improve performance #19)
  4. Fix test infrastructure ([SOC-25] Gracefully handle failure to parse presence JSON #20)
  5. Consider synchronization for queue size updates ([SVC-2504] Fix 5 GB upload limit for local development from Cloudflare #3)

Performance Notes

  • The live polling (1s interval) is aggressive. Monitor production impact.
  • Building entire span trees in memory could be problematic with deep traces
  • Multiple CBOR encode/decode operations per message - consider caching if performance issues arise

Summary

This is a solid implementation of a complex feature with good architectural decisions. The protocol versioning and feature detection are particularly well done. However, there are several critical issues around error handling, resource cleanup, and input validation that should be addressed before merging.

Recommendation: Request changes for critical issues (#1, #4, #5, #7), then approve after fixes.


Review generated by Claude Code

@NathanFlurry NathanFlurry force-pushed the 01-24-feat_frontend_traces_ui branch from 3c7fabc to 7dc3fdd Compare January 30, 2026 08:19
@jog1t jog1t force-pushed the 01-24-feat_frontend_traces_ui branch from 7dc3fdd to 0232d12 Compare January 30, 2026 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants