refactor(csharp): make ConnectionTelemetry.Create protocol-agnostic (PECO-3022)#460
refactor(csharp): make ConnectionTelemetry.Create protocol-agnostic (PECO-3022)#460jadewang-db wants to merge 4 commits into
Conversation
jadewang-db
left a comment
There was a problem hiding this comment.
Review of PR #460 — ConnectionTelemetry.Create protocol-agnostic refactor
Summary
The refactor itself is mechanically clean — string sessionId + DriverMode mode thread through correctly, hardcoded Thrift literals are removed, and the boundary conversion lives where it should (in the Thrift caller). The XML docs added to Create explain the why of both parameters well.
One High correctness issue: moving the TSessionHandle → string Guid conversion out of ConnectionTelemetry.Create's outer try/catch and into InitializeTelemetry opens a path where a malformed SessionId.Guid byte array would propagate to connection-open instead of degrading to NoOp telemetry. Details inline.
A few Low/Medium test-quality nits — including a test whose name promises a fallback path it doesn't actually exercise, and a misleading test name (ReturnsNullConnectionTelemetry vs. the singleton). Also inline.
Positives worth keeping
- Per-test distinct hosts to side-step the
TelemetryClientManagerper-host singleton cache — exactly the kind of subtle isolation note future readers benefit from. - Both
DriverModeenum values (Thrift + Sea) are covered, including the new SEA path that doesn't even have a caller in this PR yet — that's the right place to pin behavior. - Comments reference the prior line numbers (
ConnectionTelemetry.cs:642) so reviewers can verify the literal really moved. Nice forensic detail.
Scope / commit hygiene
Commit d3190ae titled [gap-fill][gap-1] Align driver_name string across SEA and Thrift only adds the regression tests — the actual alignment was done in an earlier commit on main. Suggest retitling to add regression tests pinning driver_name literal so the diff matches the title; otherwise a future bisect on "when did the alignment land" will land here and find no source change.
Overall: approve with one High issue to address before merge.
| // ConnectionTelemetry.Create stays transport-agnostic. SEA will pass its | ||
| // server-assigned session id string directly. | ||
| string sessionId = SessionHandle?.SessionId?.Guid != null | ||
| ? new Guid(SessionHandle.SessionId.Guid).ToString() |
There was a problem hiding this comment.
[High] Fail-open contract regression.
The new Guid(SessionHandle.SessionId.Guid) call now lives outside any try/catch. Pre-refactor, this exact conversion was on line ~120 inside ConnectionTelemetry.Create's outer try/catch (the catch at ConnectionTelemetry.cs:150), so any throw degraded to NoOpConnectionTelemetry.Instance and the connection still opened.
new Guid(byte[]) throws ArgumentException when the array is not exactly 16 bytes. The null-guard above only protects against Guid == null, not against a wrong-length array. While the Thrift server should always send a 16-byte payload, the explicit Never throws contract documented two lines above (ConnectionTelemetry.cs:62) is now reachable from InitializeTelemetry callers — a malformed session GUID would fail connection-open, not just disable telemetry.
Fix: wrap the conversion locally:
string sessionId = string.Empty;
try
{
if (SessionHandle?.SessionId?.Guid != null)
{
sessionId = new Guid(SessionHandle.SessionId.Guid).ToString();
}
}
catch
{
// Preserve fail-open: malformed session GUID disables telemetry, not the connection.
}Alternatively, move the conversion back inside Create by accepting byte[]? rawSessionGuid — but then SEA's caller has to construct a fake byte array, which defeats the whole refactor. The local try/catch is the cheaper fix.
There was a problem hiding this comment.
Done — wrapped the byte[] -> Guid conversion in a local try/catch (commit 9b31881). A malformed session GUID now degrades to sessionId = string.Empty (which Create maps to SessionId = null), preserving the pre-refactor fail-open guarantee.
| // canonical literal so SEA/Thrift parity holds on partial-init failures. | ||
| // We exercise the wrapper directly with a degenerate assembly version — any | ||
| // throw inside the wrapper falls back to the literal-populated proto. | ||
| var config = ConnectionTelemetry.SafeBuildSystemConfiguration(string.Empty, activity: null); |
There was a problem hiding this comment.
[Medium] Test name promises a fallback that isn't actually exercised.
The test passes assemblyVersion: string.Empty expecting that to trip the catch block in SafeBuildSystemConfiguration. But CreateSystemConfiguration at ConnectionTelemetry.cs:512 does:
DriverVersion = assemblyVersion ?? string.Empty,Empty string is coalesced, not validated — it flows through happily. So _FallbackPath is actually exercising the same code path as _HappyPath above it, just with a different DriverVersion value. The catch block in SafeBuildSystemConfiguration is never reached.
Two ways to fix:
- Rename to reflect reality: drop
_FallbackPath— there's nothing in the catch worth covering separately from the happy path. - Actually trigger the fallback — harder with static methods, but
BuildSystemConfigurationcan throw via theTrySet(() => config.OsArch = RuntimeInformation.OSArchitecture.ToString());chain on some constrained runtimes. A cleaner approach would be to add aninternaloverload that takes a fault-injection delegate, or to assert the literal directly on thecatchfallback proto using reflection-based invocation of the catch path.
Option 1 is fine — the intent (pin the literal) is already covered by CanonicalConstant_HasExpectedLiteralValue. The fallback path can be tested separately if/when a real fault-injection seam is added.
There was a problem hiding this comment.
Done — went with option 1: removed the misleading _FallbackPath test and consolidated the happy path into a single SafeBuildSystemConfiguration_ReturnsCanonicalDriverName (commit 9b31881). Added a comment explaining why we don't separately cover the catch block: there's no fault-injection seam for the static helper, and CanonicalConstant_HasExpectedLiteralValue already pins the literal that both the happy-path and fallback branches reference.
| } | ||
|
|
||
| [Fact] | ||
| public void Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry() |
There was a problem hiding this comment.
[Low] Test name says ReturnsNull... but the method returns the NoOp singleton, not null.
The assertion two lines down — Assert.Same(NoOpConnectionTelemetry.Instance, telemetry); — confirms the actual contract: a NoOp instance, never null. Suggest renaming to Create_ThrowingHttpClient_ReturnsNoOpConnectionTelemetry. The current name will confuse the next reader who greps for Null thinking we have a null-return path.
There was a problem hiding this comment.
Done — renamed to Create_ThrowingHttpClient_ReturnsNoOpConnectionTelemetry (commit 9b31881).
| Assert.Same(NoOpConnectionTelemetry.Instance, telemetry); | ||
| Assert.Null(telemetry.Session); | ||
| } | ||
| } |
There was a problem hiding this comment.
[Low] Brittle test — depends on the current implementation of HttpClientFactory / TelemetryClientManager throwing on empty host.
The comment is honest about the mechanism (new Uri("https://") throws), but that's two layers down from the unit under test. If HttpClientFactory.CreateTelemetryHttpClient or TelemetryClientManager.GetOrCreateClient later gains an early defensive check that returns gracefully on empty host, this test would still pass — but it would no longer be testing what its name suggests. It'd be a silent non-test.
Suggestion (non-blocking): add a one-line Assert.True(string.IsNullOrEmpty(host), "This test relies on empty host tripping the inner URI/client construction") near the top, or pull the fault-injection seam up (an internal overload of Create that accepts the HttpClient) so the throw point is explicit in the test rather than implicit in transitive dependencies.
There was a problem hiding this comment.
Done — added an ASSUMPTION: paragraph to the test's comment block that documents the implicit dependency on HttpClientFactory / TelemetryClientManager throwing on empty host, with explicit guidance for the future case where they get defensive handling and this test silently becomes a non-test (commit 9b31881). Left the underlying mechanism as-is for now — a real fault-injection seam can wait until a second consumer needs it.
| } | ||
| } | ||
|
|
||
| [Fact] |
There was a problem hiding this comment.
[Low/Suggestion] Coverage gap: no test covering the string.Empty → SessionId = null mapping.
ConnectionTelemetry.cs:133 now does:
SessionId = !string.IsNullOrEmpty(sessionId) ? sessionId : null,This intentionally maps empty string to null SessionContext.SessionId so callers that don't yet have a session (e.g., InitializeTelemetry called before OpenSession returns a handle) get null in the telemetry payload, not an empty literal. That's good behavior — but no test pins it. Suggest adding:
[Fact]
public async Task Create_EmptySessionId_MapsToNullInContext()
{
var telemetry = ConnectionTelemetry.Create(
..., sessionId: string.Empty, mode: DriverModeType.Thrift, ...);
try { Assert.Null(telemetry.Session?.SessionId); }
finally { await telemetry.DisposeAsync(); }
}Without this, a refactor that drops the !string.IsNullOrEmpty guard (and starts emitting empty-string SessionId to lumberjack) would not be caught by any test.
There was a problem hiding this comment.
Done — added Create_EmptySessionId_MapsToNullInContext (commit 9b31881). It calls Create with sessionId: string.Empty and asserts telemetry.Session.SessionId is null, pinning the mapping at ConnectionTelemetry.cs:133. Test passes locally (5/5 in ConnectionTelemetryCreateSignatureTests).
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[] conversion in InitializeTelemetry with try/catch. Pre-refactor the same conversion lived inside ConnectionTelemetry.Create's outer try/catch so a malformed session GUID degraded to NoOp telemetry; moving it to the transport boundary lost that guarantee. - Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed string.Empty for assemblyVersion expecting a throw, but CreateSystemConfiguration coalesces empty string. The catch block is never reached. The CanonicalConstant_HasExpectedLiteralValue test already pins both branches by construction. - Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton. - Low: document the test's implicit assumption that the empty-host throws inside HttpClientFactory/TelemetryClientManager so future defensive handling there doesn't silently turn this into a non-test. - Low: add Create_EmptySessionId_MapsToNullInContext to pin the string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133. Co-authored-by: Isaac
9b31881 to
2454e7c
Compare
) ## 🥞 Stacked PR Use this [link](https://github.com/adbc-drivers/databricks/pull/455/files) to review incremental changes. - [**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455) [[Files changed](https://github.com/adbc-drivers/databricks/pull/455/files)] - [stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460) [[Files changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)] - [stack/pr-phase2-observer-interface](adbc-drivers#461) [[Files changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)] - [stack/pr-phase3-databricks-statement-observer](adbc-drivers#462) [[Files changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)] - [stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463) [[Files changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)] - [stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464) [[Files changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)] --------- ## Summary of the design This design closes the gap where the SEA (REST) transport in the C# ADBC Databricks driver emits **zero** client telemetry — no session events, no per-statement operation events, no error events, no chunk metrics. SEA traffic is currently invisible to the `eng_lumberjack` driver-telemetry pipeline. The design: - Introduces a small **observer interface** (`IStatementOperationObserver`) between statement classes and the telemetry implementation. The interface is shaped around the statement's lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than telemetry's data model. Contract: fail-open — implementations must not throw. - **Refactors `ConnectionTelemetry.Create`** to accept `string sessionId` (dropping its Thrift `TSessionHandle` coupling) and a `DriverMode mode` parameter. Both transports use the same factory. - Wires observer callbacks at the SEA hookpoints in `StatementExecutionConnection` and `StatementExecutionStatement`. - **Reuses all existing telemetry infrastructure as-is**: `TelemetryClient`, `TelemetryClientManager`, `CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`, `FeatureFlagCache`, `TelemetrySessionContext`, `StatementTelemetryContext`. ## Key decisions and alternatives considered - **Observer interface over static helper or instance emitter.** The gap-fix doc proposed a static `TelemetryHelper`. This design supersedes that with an observer-shaped interface because: (a) it gives one-line callsites with no parameter threading, (b) it decouples statement classes from telemetry types so future tracing/audit observers can plug in without touching statement code, (c) it's trivially mockable for tests. The fail-open contract pushes all try/catch into the implementations exactly once — callsites stay clean. - **Composition, not inheritance.** Considered (and rejected) three inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2 chain — semantically wrong), shared base above both (blocked structurally — `HiveServer2Connection`'s parent is in the Apache package we don't own), and interface + extension methods (functionally identical to the static helper). C# single inheritance + unowned Apache base classes make pure inheritance impractical. - **Decorator at AdbcConnection boundary rejected.** A wrapper around the whole connection cannot see deep telemetry signals (chunk timing, poll count, first-batch latency) — those live inside statement classes. Wrong granularity. - **Refactor `Create` signature rather than add overload.** Changes the canonical `ConnectionTelemetry.Create` to take `string sessionId`. Thrift converts at the call boundary (`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak from the telemetry API permanently. - **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops instrumentation, retry_count) are owned by the separate gap-fix workstream and will land independently. The only cross-cutting change pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift` in `BuildDriverConnectionParams` — without it SEA records would be silently mislabeled as Thrift. ## Areas needing specific review focus 1. **Observer interface shape & fail-open contract** (`§5.1`) — the observer methods, naming, and the requirement that implementations never throw. Specifically: are the 8 methods the right cut, or is finer granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`) preferred? 2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces `TSessionHandle?` with `string sessionId` and adds a `DriverMode mode` parameter. This touches a stable API used by the existing Thrift path; the Thrift call site must convert at the boundary. 3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a typed `ResultFormat`. The mapping table from wire `disposition` × manifest state → proto `ExecutionResult.Format` is a judgment call; please review. 4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design assumes the `CloudFetchDownloader → ChunkMetrics → CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream lands first or concurrently. If gap-fix is delayed, SEA ships with `ChunkMetrics.Empty` and backfills later. Is that acceptable, or should we sequence differently? 5. **Open questions** (`§17`): polling-granularity semantics for `PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and whether SEA's `operation_type` should always be `EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants. ## Related - Builds on the architecture in [`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md) - Supersedes the `TelemetryHelper` static-helper proposal in [`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md) for the new SEA code; Thrift-side gap-fix work continues independently - Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022) This pull request and its description were written by Isaac. --------- Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
🥞 Stacked PR
Use this link to review incremental changes.
Summary
Refactors
ConnectionTelemetry.Createso it can serve both transports cleanly — the precondition for wiring SEA telemetry in later phases of this stack:TSessionHandle? sessionHandlewithstring sessionId.TSessionHandleis a Thrift-only type; SEA only has astring _sessionId. The Thrift call site converts at the boundary (new Guid(SessionHandle.SessionId.Guid).ToString()), keeping the conversion local to Thrift code.DriverMode modeparameter. Removes the hardcodedDriverMode.Types.Type.ThriftfromBuildDriverConnectionParams/SafeBuildDriverConnectionParams(lines 458, 642 pre-refactor). Without this, SEA records would still be labeledDRIVER_MODE_THRIFT.driver_nameconstant (ADBC Databricks Driver) used by both transports so lumberjack dashboards see a single literal regardless of protocol.Files touched
csharp/src/Telemetry/ConnectionTelemetry.cs—Createsignature change, mode threading, hardcoded Thrift removed.csharp/src/DatabricksConnection.cs— Thrift caller convertsTSessionHandle → stringand passesDriverMode.Thrift.csharp/test/Unit/Telemetry/ConnectionTelemetryCreateSignatureTests.cs(new) — coversCreate_AcceptsStringSessionId,Create_ThriftMode_SetsDriverModeThrift,Create_SeaMode_SetsDriverModeSea.Test plan
driver_connection_params.mode = DRIVER_MODE_THRIFT.Related
docs/designs/PECO-3022-sea-telemetry-integration-design.md§5.2 + §10.Part of PECO-3022.