Skip to content

refactor(csharp): make ConnectionTelemetry.Create protocol-agnostic (PECO-3022)#460

Open
jadewang-db wants to merge 4 commits into
mainfrom
stack/pr-phase1-connection-telemetry-create-refactor
Open

refactor(csharp): make ConnectionTelemetry.Create protocol-agnostic (PECO-3022)#460
jadewang-db wants to merge 4 commits into
mainfrom
stack/pr-phase1-connection-telemetry-create-refactor

Conversation

@jadewang-db
Copy link
Copy Markdown
Collaborator

@jadewang-db jadewang-db commented May 19, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Refactors ConnectionTelemetry.Create so it can serve both transports cleanly — the precondition for wiring SEA telemetry in later phases of this stack:

  • Replaces TSessionHandle? sessionHandle with string sessionId. TSessionHandle is a Thrift-only type; SEA only has a string _sessionId. The Thrift call site converts at the boundary (new Guid(SessionHandle.SessionId.Guid).ToString()), keeping the conversion local to Thrift code.
  • Adds DriverMode mode parameter. Removes the hardcoded DriverMode.Types.Type.Thrift from BuildDriverConnectionParams/SafeBuildDriverConnectionParams (lines 458, 642 pre-refactor). Without this, SEA records would still be labeled DRIVER_MODE_THRIFT.
  • Aligns the driver_name constant (ADBC Databricks Driver) used by both transports so lumberjack dashboards see a single literal regardless of protocol.

Files touched

  • csharp/src/Telemetry/ConnectionTelemetry.csCreate signature change, mode threading, hardcoded Thrift removed.
  • csharp/src/DatabricksConnection.cs — Thrift caller converts TSessionHandle → string and passes DriverMode.Thrift.
  • csharp/test/Unit/Telemetry/ConnectionTelemetryCreateSignatureTests.cs (new) — covers Create_AcceptsStringSessionId, Create_ThriftMode_SetsDriverModeThrift, Create_SeaMode_SetsDriverModeSea.

Test plan

  • All existing telemetry unit tests pass unchanged.
  • Existing Thrift integration tests continue to emit driver_connection_params.mode = DRIVER_MODE_THRIFT.
  • New unit tests cover string sessionId + both DriverMode values.

Related

Part of PECO-3022.

Copy link
Copy Markdown
Collaborator Author

@jadewang-db jadewang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #460ConnectionTelemetry.Create protocol-agnostic refactor

Summary

The refactor itself is mechanically clean — string sessionId + DriverMode mode thread through correctly, hardcoded Thrift literals are removed, and the boundary conversion lives where it should (in the Thrift caller). The XML docs added to Create explain the why of both parameters well.

One High correctness issue: moving the TSessionHandle → string Guid conversion out of ConnectionTelemetry.Create's outer try/catch and into InitializeTelemetry opens a path where a malformed SessionId.Guid byte array would propagate to connection-open instead of degrading to NoOp telemetry. Details inline.

A few Low/Medium test-quality nits — including a test whose name promises a fallback path it doesn't actually exercise, and a misleading test name (ReturnsNullConnectionTelemetry vs. the singleton). Also inline.

Positives worth keeping

  • Per-test distinct hosts to side-step the TelemetryClientManager per-host singleton cache — exactly the kind of subtle isolation note future readers benefit from.
  • Both DriverMode enum values (Thrift + Sea) are covered, including the new SEA path that doesn't even have a caller in this PR yet — that's the right place to pin behavior.
  • Comments reference the prior line numbers (ConnectionTelemetry.cs:642) so reviewers can verify the literal really moved. Nice forensic detail.

Scope / commit hygiene

Commit d3190ae titled [gap-fill][gap-1] Align driver_name string across SEA and Thrift only adds the regression tests — the actual alignment was done in an earlier commit on main. Suggest retitling to add regression tests pinning driver_name literal so the diff matches the title; otherwise a future bisect on "when did the alignment land" will land here and find no source change.

Overall: approve with one High issue to address before merge.

Comment thread csharp/src/DatabricksConnection.cs Outdated
// ConnectionTelemetry.Create stays transport-agnostic. SEA will pass its
// server-assigned session id string directly.
string sessionId = SessionHandle?.SessionId?.Guid != null
? new Guid(SessionHandle.SessionId.Guid).ToString()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] Fail-open contract regression.

The new Guid(SessionHandle.SessionId.Guid) call now lives outside any try/catch. Pre-refactor, this exact conversion was on line ~120 inside ConnectionTelemetry.Create's outer try/catch (the catch at ConnectionTelemetry.cs:150), so any throw degraded to NoOpConnectionTelemetry.Instance and the connection still opened.

new Guid(byte[]) throws ArgumentException when the array is not exactly 16 bytes. The null-guard above only protects against Guid == null, not against a wrong-length array. While the Thrift server should always send a 16-byte payload, the explicit Never throws contract documented two lines above (ConnectionTelemetry.cs:62) is now reachable from InitializeTelemetry callers — a malformed session GUID would fail connection-open, not just disable telemetry.

Fix: wrap the conversion locally:

string sessionId = string.Empty;
try
{
    if (SessionHandle?.SessionId?.Guid != null)
    {
        sessionId = new Guid(SessionHandle.SessionId.Guid).ToString();
    }
}
catch
{
    // Preserve fail-open: malformed session GUID disables telemetry, not the connection.
}

Alternatively, move the conversion back inside Create by accepting byte[]? rawSessionGuid — but then SEA's caller has to construct a fake byte array, which defeats the whole refactor. The local try/catch is the cheaper fix.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — wrapped the byte[] -> Guid conversion in a local try/catch (commit 9b31881). A malformed session GUID now degrades to sessionId = string.Empty (which Create maps to SessionId = null), preserving the pre-refactor fail-open guarantee.

// canonical literal so SEA/Thrift parity holds on partial-init failures.
// We exercise the wrapper directly with a degenerate assembly version — any
// throw inside the wrapper falls back to the literal-populated proto.
var config = ConnectionTelemetry.SafeBuildSystemConfiguration(string.Empty, activity: null);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Test name promises a fallback that isn't actually exercised.

The test passes assemblyVersion: string.Empty expecting that to trip the catch block in SafeBuildSystemConfiguration. But CreateSystemConfiguration at ConnectionTelemetry.cs:512 does:

DriverVersion = assemblyVersion ?? string.Empty,

Empty string is coalesced, not validated — it flows through happily. So _FallbackPath is actually exercising the same code path as _HappyPath above it, just with a different DriverVersion value. The catch block in SafeBuildSystemConfiguration is never reached.

Two ways to fix:

  1. Rename to reflect reality: drop _FallbackPath — there's nothing in the catch worth covering separately from the happy path.
  2. Actually trigger the fallback — harder with static methods, but BuildSystemConfiguration can throw via the TrySet(() => config.OsArch = RuntimeInformation.OSArchitecture.ToString()); chain on some constrained runtimes. A cleaner approach would be to add an internal overload that takes a fault-injection delegate, or to assert the literal directly on the catch fallback proto using reflection-based invocation of the catch path.

Option 1 is fine — the intent (pin the literal) is already covered by CanonicalConstant_HasExpectedLiteralValue. The fallback path can be tested separately if/when a real fault-injection seam is added.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — went with option 1: removed the misleading _FallbackPath test and consolidated the happy path into a single SafeBuildSystemConfiguration_ReturnsCanonicalDriverName (commit 9b31881). Added a comment explaining why we don't separately cover the catch block: there's no fault-injection seam for the static helper, and CanonicalConstant_HasExpectedLiteralValue already pins the literal that both the happy-path and fallback branches reference.

}

[Fact]
public void Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low] Test name says ReturnsNull... but the method returns the NoOp singleton, not null.

The assertion two lines down — Assert.Same(NoOpConnectionTelemetry.Instance, telemetry); — confirms the actual contract: a NoOp instance, never null. Suggest renaming to Create_ThrowingHttpClient_ReturnsNoOpConnectionTelemetry. The current name will confuse the next reader who greps for Null thinking we have a null-return path.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed to Create_ThrowingHttpClient_ReturnsNoOpConnectionTelemetry (commit 9b31881).

Assert.Same(NoOpConnectionTelemetry.Instance, telemetry);
Assert.Null(telemetry.Session);
}
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low] Brittle test — depends on the current implementation of HttpClientFactory / TelemetryClientManager throwing on empty host.

The comment is honest about the mechanism (new Uri("https://") throws), but that's two layers down from the unit under test. If HttpClientFactory.CreateTelemetryHttpClient or TelemetryClientManager.GetOrCreateClient later gains an early defensive check that returns gracefully on empty host, this test would still pass — but it would no longer be testing what its name suggests. It'd be a silent non-test.

Suggestion (non-blocking): add a one-line Assert.True(string.IsNullOrEmpty(host), "This test relies on empty host tripping the inner URI/client construction") near the top, or pull the fault-injection seam up (an internal overload of Create that accepts the HttpClient) so the throw point is explicit in the test rather than implicit in transitive dependencies.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added an ASSUMPTION: paragraph to the test's comment block that documents the implicit dependency on HttpClientFactory / TelemetryClientManager throwing on empty host, with explicit guidance for the future case where they get defensive handling and this test silently becomes a non-test (commit 9b31881). Left the underlying mechanism as-is for now — a real fault-injection seam can wait until a second consumer needs it.

}
}

[Fact]
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Low/Suggestion] Coverage gap: no test covering the string.Empty → SessionId = null mapping.

ConnectionTelemetry.cs:133 now does:

SessionId = !string.IsNullOrEmpty(sessionId) ? sessionId : null,

This intentionally maps empty string to null SessionContext.SessionId so callers that don't yet have a session (e.g., InitializeTelemetry called before OpenSession returns a handle) get null in the telemetry payload, not an empty literal. That's good behavior — but no test pins it. Suggest adding:

[Fact]
public async Task Create_EmptySessionId_MapsToNullInContext()
{
    var telemetry = ConnectionTelemetry.Create(
        ..., sessionId: string.Empty, mode: DriverModeType.Thrift, ...);
    try { Assert.Null(telemetry.Session?.SessionId); }
    finally { await telemetry.DisposeAsync(); }
}

Without this, a refactor that drops the !string.IsNullOrEmpty guard (and starts emitting empty-string SessionId to lumberjack) would not be caught by any test.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added Create_EmptySessionId_MapsToNullInContext (commit 9b31881). It calls Create with sessionId: string.Empty and asserts telemetry.Session.SessionId is null, pinning the mapping at ConnectionTelemetry.cs:133. Test passes locally (5/5 in ConnectionTelemetryCreateSignatureTests).

@jadewang-db jadewang-db marked this pull request as ready for review May 21, 2026 21:13
Jade Wang added 4 commits May 22, 2026 20:06
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[]
  conversion in InitializeTelemetry with try/catch. Pre-refactor the same
  conversion lived inside ConnectionTelemetry.Create's outer try/catch so a
  malformed session GUID degraded to NoOp telemetry; moving it to the
  transport boundary lost that guarantee.
- Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed
  string.Empty for assemblyVersion expecting a throw, but
  CreateSystemConfiguration coalesces empty string. The catch block is
  never reached. The CanonicalConstant_HasExpectedLiteralValue test
  already pins both branches by construction.
- Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to
  ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton.
- Low: document the test's implicit assumption that the empty-host throws
  inside HttpClientFactory/TelemetryClientManager so future defensive
  handling there doesn't silently turn this into a non-test.
- Low: add Create_EmptySessionId_MapsToNullInContext to pin the
  string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133.

Co-authored-by: Isaac
@jadewang-db jadewang-db force-pushed the stack/pr-phase1-connection-telemetry-create-refactor branch from 9b31881 to 2454e7c Compare May 22, 2026 20:07
CurtHagenlocher pushed a commit to CurtHagenlocher/databricks that referenced this pull request May 24, 2026
)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/455/files) to
review incremental changes.
-
[**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/455/files)]
-
[stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)]
-
[stack/pr-phase2-observer-interface](adbc-drivers#461)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)]
-
[stack/pr-phase3-databricks-statement-observer](adbc-drivers#462)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)]
-
[stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)]
-
[stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)]

---------
## Summary of the design

This design closes the gap where the SEA (REST) transport in the C# ADBC
Databricks driver emits **zero** client telemetry — no session events,
no per-statement operation events, no error events, no chunk metrics.
SEA traffic is currently invisible to the `eng_lumberjack`
driver-telemetry pipeline.

The design:
- Introduces a small **observer interface**
(`IStatementOperationObserver`) between statement classes and the
telemetry implementation. The interface is shaped around the statement's
lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than
telemetry's data model. Contract: fail-open — implementations must not
throw.
- **Refactors `ConnectionTelemetry.Create`** to accept `string
sessionId` (dropping its Thrift `TSessionHandle` coupling) and a
`DriverMode mode` parameter. Both transports use the same factory.
- Wires observer callbacks at the SEA hookpoints in
`StatementExecutionConnection` and `StatementExecutionStatement`.
- **Reuses all existing telemetry infrastructure as-is**:
`TelemetryClient`, `TelemetryClientManager`,
`CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`,
`FeatureFlagCache`, `TelemetrySessionContext`,
`StatementTelemetryContext`.

## Key decisions and alternatives considered

- **Observer interface over static helper or instance emitter.** The
gap-fix doc proposed a static `TelemetryHelper`. This design supersedes
that with an observer-shaped interface because: (a) it gives one-line
callsites with no parameter threading, (b) it decouples statement
classes from telemetry types so future tracing/audit observers can plug
in without touching statement code, (c) it's trivially mockable for
tests. The fail-open contract pushes all try/catch into the
implementations exactly once — callsites stay clean.
- **Composition, not inheritance.** Considered (and rejected) three
inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2
chain — semantically wrong), shared base above both (blocked
structurally — `HiveServer2Connection`'s parent is in the Apache package
we don't own), and interface + extension methods (functionally identical
to the static helper). C# single inheritance + unowned Apache base
classes make pure inheritance impractical.
- **Decorator at AdbcConnection boundary rejected.** A wrapper around
the whole connection cannot see deep telemetry signals (chunk timing,
poll count, first-batch latency) — those live inside statement classes.
Wrong granularity.
- **Refactor `Create` signature rather than add overload.** Changes the
canonical `ConnectionTelemetry.Create` to take `string sessionId`.
Thrift converts at the call boundary
(`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak
from the telemetry API permanently.
- **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The
cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops
instrumentation, retry_count) are owned by the separate gap-fix
workstream and will land independently. The only cross-cutting change
pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift`
in `BuildDriverConnectionParams` — without it SEA records would be
silently mislabeled as Thrift.

## Areas needing specific review focus

1. **Observer interface shape & fail-open contract** (`§5.1`) — the
observer methods, naming, and the requirement that implementations never
throw. Specifically: are the 8 methods the right cut, or is finer
granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`)
preferred?
2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces
`TSessionHandle?` with `string sessionId` and adds a `DriverMode mode`
parameter. This touches a stable API used by the existing Thrift path;
the Thrift call site must convert at the boundary.
3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a
typed `ResultFormat`. The mapping table from wire `disposition` ×
manifest state → proto `ExecutionResult.Format` is a judgment call;
please review.
4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design
assumes the `CloudFetchDownloader → ChunkMetrics →
CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream
lands first or concurrently. If gap-fix is delayed, SEA ships with
`ChunkMetrics.Empty` and backfills later. Is that acceptable, or should
we sequence differently?
5. **Open questions** (`§17`): polling-granularity semantics for
`PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and
whether SEA's `operation_type` should always be
`EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants.

## Related

- Builds on the architecture in
[`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md)
- Supersedes the `TelemetryHelper` static-helper proposal in
[`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md)
for the new SEA code; Thrift-side gap-fix work continues independently
- Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022)

This pull request and its description were written by Isaac.

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant