Skip to content

feat(csharp): add IStatementOperationObserver and fail-open implementations (PECO-3022)#461

Open
jadewang-db wants to merge 7 commits into
mainfrom
stack/pr-phase2-observer-interface
Open

feat(csharp): add IStatementOperationObserver and fail-open implementations (PECO-3022)#461
jadewang-db wants to merge 7 commits into
mainfrom
stack/pr-phase2-observer-interface

Conversation

@jadewang-db
Copy link
Copy Markdown
Collaborator

@jadewang-db jadewang-db commented May 19, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Adds the IStatementOperationObserver interface that decouples statement classes from telemetry's data model, plus three implementations. Pure new code with no consumers yet — wired up in subsequent stack phases.

  • IStatementOperationObserver — 8 methods shaped around the statement's lifecycle (OnExecuteStarted, OnExecuteSucceeded, OnPollCompleted, OnFirstBatchReady, OnConsumed, OnChunksDownloaded, OnError, OnFinalized). Contract is documented as fail-open: implementations MUST NOT throw, must be thread-safe, and OnFinalized is terminal + idempotent. The interface speaks the statement's language, not the telemetry proto's — future tracing/audit observers can plug in without touching statement code.
  • TelemetryObserver — translates observer calls to mutations on a StatementTelemetryContext. On OnFinalized builds the OssSqlDriverTelemetryLog and enqueues to the session's TelemetryClient. Uses a private Safe(Action) helper to concentrate the try/catch in exactly one place; method bodies are clean one-liners. Idempotency via Interlocked.CompareExchange on an _emitted flag.
  • NullObserver — singleton no-op. Used as the default field value so statement code never needs null-checks at callsites.
  • SafeObserver — belt-and-suspenders decorator that wraps any inner observer with per-method try/catch. Optional; first-party observers already honor the contract internally.

Files touched (all new)

  • csharp/src/Telemetry/IStatementOperationObserver.cs
  • csharp/src/Telemetry/TelemetryObserver.cs
  • csharp/src/Telemetry/NullObserver.cs
  • csharp/src/Telemetry/SafeObserver.cs
  • csharp/test/Unit/Telemetry/TelemetryObserverTests.cs
  • csharp/test/Unit/Telemetry/NullObserverTests.cs
  • csharp/test/Unit/Telemetry/SafeObserverTests.cs

Test plan

  • NullObserver_AllMethods_AreNoOps, NullObserver_IsSingleton.
  • TelemetryObserver_OnExecuteStarted_PopulatesContext, _OnExecuteSucceeded_RecordsStatementId, _OnFinalized_EnqueuesExactlyOnce, _OnFinalized_CalledTwice_EnqueuesOnce, _OnError_RecordsErrorAndFinalizes, _AllMethods_NeverThrow_WhenContextCorrupted, _OnChunksDownloaded_MergesIntoChunkDetails.
  • SafeObserver_PropagatesNormalCallsToInner, _SwallowsExceptionsFromInner_LogsAtTrace.

Related

Part of PECO-3022.

Jade Wang added 7 commits May 22, 2026 20:06
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[]
  conversion in InitializeTelemetry with try/catch. Pre-refactor the same
  conversion lived inside ConnectionTelemetry.Create's outer try/catch so a
  malformed session GUID degraded to NoOp telemetry; moving it to the
  transport boundary lost that guarantee.
- Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed
  string.Empty for assemblyVersion expecting a throw, but
  CreateSystemConfiguration coalesces empty string. The catch block is
  never reached. The CanonicalConstant_HasExpectedLiteralValue test
  already pins both branches by construction.
- Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to
  ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton.
- Low: document the test's implicit assumption that the empty-host throws
  inside HttpClientFactory/TelemetryClientManager so future defensive
  handling there doesn't silently turn this into a non-test.
- Low: add Create_EmptySessionId_MapsToNullInContext to pin the
  string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133.

Co-authored-by: Isaac
…on\n\nTask ID: task-2.1-observer-interface-and-null-impl
…ryContext + enqueue\n\nTask ID: task-2.2-telemetry-observer-impl
…server\n\nTask ID: task-2.3-safe-observer-decorator
@jadewang-db jadewang-db force-pushed the stack/pr-phase2-observer-interface branch from b0d9f02 to 63365ea Compare May 22, 2026 20:07
@jadewang-db jadewang-db marked this pull request as ready for review May 22, 2026 20:27
Copy link
Copy Markdown
Collaborator Author

@jadewang-db jadewang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review of phase 2 (observer interface + 3 implementations + tests).

Summary. Clean, well-documented, fail-open design. Interlocked.CompareExchange correctly guards exactly-once finalize; tests cover the contract dimensions that matter (concurrency, null inputs, throwing dependencies, no-ambient-Activity). Pure new code with no consumers yet, so blast radius is contained.

Findings, all medium-or-lower — none are blockers, but the comment/code mismatch in TelemetryObserver.OnError is worth fixing in this PR because it will mislead the next reader.

  1. TelemetryObserver.OnError — comment contradicts the code.
  2. TelemetryObserver.Safe — uses ex.Message directly; should use SafeMessage(ex) for parity with OnError and SafeObserver.
  3. Safe() / SafeMessage() duplicated across TelemetryObserver and SafeObserver with subtly different suppression-event payloads — converge or extract.
  4. Contract says "any further calls" after OnFinalized are no-ops; today only further OnFinalized is gated, other methods still mutate _ctx. Inert in current callers, but a strict reading of the contract would gate all methods.

Nothing critical. Approving in spirit once #1 is addressed.

// have explicit consent: the proto's DriverErrorInfo.error_message field is
// pending LPP review (see ConnectionTelemetry.EmitOperationTelemetry).
_ctx.ErrorName = ex.GetType().Name;
_ctx.ErrorMessage = SafeMessage(ex);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Comment contradicts the code.

Lines 164–166 say "do not capture ex.Message here unless we have explicit consent" — but line 168 then captures it: _ctx.ErrorMessage = SafeMessage(ex);.

The in-memory capture matches existing behavior in ConnectionTelemetry.cs:207, and BuildTelemetryLog does not serialize ErrorMessage into the proto today, so there is no actual leak. But the comment is misleading and the future risk is real: if anyone wires ErrorMessage into DriverErrorInfo.error_message later (once LPP review lands), they will reasonably assume the capture site already considered the privacy implications — when in fact the comment told them the opposite.

Pick one: either (a) drop the _ctx.ErrorMessage = SafeMessage(ex); line and rely on ErrorName only, matching what the comment says, or (b) update the comment to reflect that .Message is captured in-memory today, the proto field is gated separately in BuildTelemetryLog, and SafeMessage is used to neutralize exceptions that throw from their .Message property. Option (b) is consistent with ConnectionTelemetry and is probably what you want.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — went with option (b): updated the comment at TelemetryObserver.cs:165-170 to accurately describe what the code does. The new comment notes that .Message is captured in-memory only, that BuildTelemetryLog does not serialize ErrorMessage into the proto today (LPP review pending), and that SafeMessage guards against .Message throwing. Matches the existing pattern in ConnectionTelemetry.cs:207.

tags: new ActivityTagsCollection
{
{ "error.type", ex.GetType().Name },
{ "error.message", ex.Message }
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] ex.Message here, but SafeMessage(ex) in OnError (and in SafeObserver).

The rest of this file — and SafeObserver.Safe — explicitly guard against exceptions whose .Message property throws. This site doesn't. The outer catch {} on line 234 saves us from a crash, but it also swallows the suppression event for that specific failure mode, so we lose telemetry visibility into observer failures exactly when an upstream exception is misbehaving.

Use SafeMessage(ex) for consistency.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — TelemetryObserver.Safe() now uses SafeMessage(ex) instead of ex.Message directly. Suppression telemetry is now preserved even when the inner exception has a throwing .Message property.

{
{ "error.type", ex.GetType().Name },
{ "error.message", SafeMessage(ex) },
{ "observer.suppressed.source", "SafeObserver" },
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low] Diverging suppression-event payloads between the two Safe() helpers.

SafeObserver.Safe tags the suppression event with observer.suppressed.source = "SafeObserver" (line 122). TelemetryObserver.Safe (TelemetryObserver.cs:227) emits the same event name (telemetry.observer.suppressed) but without a source tag.

This matters in two ways:

  • A trace consumer can't tell which layer suppressed an exception when both decorators are in play (e.g. a third-party observer wrapped in SafeObserver wrapped over TelemetryObserver).
  • These two helpers — plus SafeMessage — are essentially duplicated. Converging them into a single internal helper (e.g. TelemetryObserverSafety.Run(Action, string source)) would make the inconsistency impossible by construction. Optional, but worth doing while the surface is small.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — chose to converge rather than extract a shared helper (lighter for two small classes). Added observer.suppressed.source = "TelemetryObserver" to TelemetryObserver.Safe() so a trace consumer can tell the two suppression layers apart when both decorators are in play.

/// been invoked once, the observer's record is considered complete and any further
/// calls — including additional <see cref="OnFinalized"/> calls — <b>MUST be no-ops
/// (idempotent)</b>. This protects against the common case where both an error path
/// and a dispose path attempt to finalize the same statement.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low] Contract vs. implementation: post-OnFinalized non-OnFinalized calls.

This contract says: "any further calls — including additional OnFinalized calls — MUST be no-ops (idempotent)." That language covers all methods, not just OnFinalized.

TelemetryObserver only gates further OnFinalized via _emitted. After finalize, a stray OnPollCompleted / OnError / OnChunksDownloaded will still mutate _ctx. The mutations are inert today because the log has already been built and no one reads the context post-finalize — but a strict reading of the contract would have all observer methods check _emitted first and return early.

Two equally reasonable resolutions:

  • Loosen the interface doc: "OnFinalized is the only terminal call that must be idempotent; other methods after finalize are silently dropped but their effect on observer state is implementation-defined."
  • Tighten the implementation: add an if (Volatile.Read(ref _emitted) != 0) return; short-circuit at the top of every Safe(...) body in TelemetryObserver.

The contract-shipped-as-written is what later reviewers will hold us to, so picking one of these and aligning is worth doing now while there are no external consumers.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — tightened the implementation (the safer interpretation per the review comment). Added a private IsFinalized() helper (Volatile.Read(ref _emitted) != 0) and a gate at the top of every non-finalize lifecycle method body in TelemetryObserver. After OnFinalized() fires, all other lifecycle methods become true no-ops with no further mutation of _ctx.


// The throwing client must have been invoked exactly once (idempotent finalize)
// and the observer must have swallowed its exception.
Assert.Equal(1, throwing.EnqueueCallCount);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] Add a test for the post-finalize behavior that pins down whichever interpretation you pick in the contract-vs-impl issue (see comment on IStatementOperationObserver.cs:47).

E.g., something like OnFinalized_ThenLifecycleCalls_AreNoOps that asserts later OnPollCompleted / OnError calls do not enqueue a second log and do not affect any observable post-finalize state. Today this is implicitly true (because the second Enqueue is gated by _emitted), but having the assertion in the test suite locks it in.

Low priority.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Done — added TelemetryObserver_OnFinalized_ThenLifecycleCalls_AreNoOps. Captures the context fingerprint right after OnFinalized, then calls every non-terminal lifecycle method with values that would visibly mutate state if the new IsFinalized() gate were missing, and asserts both the enqueue count and every captured field are unchanged. Locks in the contract interpretation chosen in the previous fix.

CurtHagenlocher pushed a commit to CurtHagenlocher/databricks that referenced this pull request May 24, 2026
)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/455/files) to
review incremental changes.
-
[**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/455/files)]
-
[stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)]
-
[stack/pr-phase2-observer-interface](adbc-drivers#461)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)]
-
[stack/pr-phase3-databricks-statement-observer](adbc-drivers#462)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)]
-
[stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)]
-
[stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)]

---------
## Summary of the design

This design closes the gap where the SEA (REST) transport in the C# ADBC
Databricks driver emits **zero** client telemetry — no session events,
no per-statement operation events, no error events, no chunk metrics.
SEA traffic is currently invisible to the `eng_lumberjack`
driver-telemetry pipeline.

The design:
- Introduces a small **observer interface**
(`IStatementOperationObserver`) between statement classes and the
telemetry implementation. The interface is shaped around the statement's
lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than
telemetry's data model. Contract: fail-open — implementations must not
throw.
- **Refactors `ConnectionTelemetry.Create`** to accept `string
sessionId` (dropping its Thrift `TSessionHandle` coupling) and a
`DriverMode mode` parameter. Both transports use the same factory.
- Wires observer callbacks at the SEA hookpoints in
`StatementExecutionConnection` and `StatementExecutionStatement`.
- **Reuses all existing telemetry infrastructure as-is**:
`TelemetryClient`, `TelemetryClientManager`,
`CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`,
`FeatureFlagCache`, `TelemetrySessionContext`,
`StatementTelemetryContext`.

## Key decisions and alternatives considered

- **Observer interface over static helper or instance emitter.** The
gap-fix doc proposed a static `TelemetryHelper`. This design supersedes
that with an observer-shaped interface because: (a) it gives one-line
callsites with no parameter threading, (b) it decouples statement
classes from telemetry types so future tracing/audit observers can plug
in without touching statement code, (c) it's trivially mockable for
tests. The fail-open contract pushes all try/catch into the
implementations exactly once — callsites stay clean.
- **Composition, not inheritance.** Considered (and rejected) three
inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2
chain — semantically wrong), shared base above both (blocked
structurally — `HiveServer2Connection`'s parent is in the Apache package
we don't own), and interface + extension methods (functionally identical
to the static helper). C# single inheritance + unowned Apache base
classes make pure inheritance impractical.
- **Decorator at AdbcConnection boundary rejected.** A wrapper around
the whole connection cannot see deep telemetry signals (chunk timing,
poll count, first-batch latency) — those live inside statement classes.
Wrong granularity.
- **Refactor `Create` signature rather than add overload.** Changes the
canonical `ConnectionTelemetry.Create` to take `string sessionId`.
Thrift converts at the call boundary
(`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak
from the telemetry API permanently.
- **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The
cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops
instrumentation, retry_count) are owned by the separate gap-fix
workstream and will land independently. The only cross-cutting change
pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift`
in `BuildDriverConnectionParams` — without it SEA records would be
silently mislabeled as Thrift.

## Areas needing specific review focus

1. **Observer interface shape & fail-open contract** (`§5.1`) — the
observer methods, naming, and the requirement that implementations never
throw. Specifically: are the 8 methods the right cut, or is finer
granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`)
preferred?
2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces
`TSessionHandle?` with `string sessionId` and adds a `DriverMode mode`
parameter. This touches a stable API used by the existing Thrift path;
the Thrift call site must convert at the boundary.
3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a
typed `ResultFormat`. The mapping table from wire `disposition` ×
manifest state → proto `ExecutionResult.Format` is a judgment call;
please review.
4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design
assumes the `CloudFetchDownloader → ChunkMetrics →
CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream
lands first or concurrently. If gap-fix is delayed, SEA ships with
`ChunkMetrics.Empty` and backfills later. Is that acceptable, or should
we sequence differently?
5. **Open questions** (`§17`): polling-granularity semantics for
`PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and
whether SEA's `operation_type` should always be
`EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants.

## Related

- Builds on the architecture in
[`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md)
- Supersedes the `TelemetryHelper` static-helper proposal in
[`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md)
for the new SEA code; Thrift-side gap-fix work continues independently
- Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022)

This pull request and its description were written by Isaac.

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant