fix(csharp): emit valid JSON for Thrift MAP values containing quotes#458
fix(csharp): emit valid JSON for Thrift MAP values containing quotes#458eric-wang-1990 wants to merge 3 commits into
Conversation
The Thrift server's flag=false serialization of MAP values containing
double-quote characters produces malformed JSON — the inner quotes are
not escaped, so the result looks like {"key":"val "quote""} which is
invalid JSON. SEA already serialized this correctly via the client-side
ComplexTypeSerializingStream wrap.
Mirror JDBC's pattern (DatabricksThriftServiceClient.java) on the C#
Thrift path: always request native Arrow complex types from the server
(ComplexTypesAsArrow=true), then serialize to JSON strings client-side
via System.Text.Json when EnableComplexDatatypeSupport=false. The
existing ComplexTypeSerializingStream now flattens its exposed schema
so complex columns report StringType — matching the converted batches
regardless of whether the inner schema came from SEA's manifest (already
StringType) or Thrift's IPC (native).
Updated ComplexTypesValueTests's PECO-3014 skips to fire on both
protocols (no longer SEA-only) since Thrift now uses the same
client-side serializer.
Closes PECO-3032
Co-authored-by: Isaac
|
Is this specific to the Databricks back-end implementation or does it affect Spark in general? (i.e. does this also need to be fixed in the Spark code of the HiveServer2 project?) |
Same fix applied to ComplexTypesNativeArrowE2ETests on PR #457: blanket Remove(uri) worked for the local config (which had both 'uri' and 'hostName') but broke CI configs that only have 'uri'. Co-authored-by: Isaac
The conditional Remove(uri) was working around a buggy local test config (both 'uri' and 'hostName' set). The fix belongs in the local config, not in test code. Co-authored-by: Isaac
Nope, I think the OpenSource Spark use a different Thrift protocol and will not support pass those complex type as native arrow. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes malformed JSON produced by Thrift server-side serialization for complex MAP values containing quotes by always requesting native Arrow complex types and serializing them client-side when complex support is disabled.
Changes:
- Always request native complex types from Thrift (
ComplexTypesAsArrow=true) and wrap result streams withComplexTypeSerializingStreamwhenEnableComplexDatatypeSupport=false. - Adjust
ComplexTypeSerializingStreamto expose a schema with complex columns flattened toStringTypeto match serialized output. - Add E2E coverage for JSON escaping and update existing complex-type value tests/skips accordingly.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| csharp/src/DatabricksStatement.cs | Always request Arrow complex types and conditionally wrap streams to serialize complex values client-side. |
| csharp/src/ComplexTypeSerializingStream.cs | Flattens complex columns to StringType in the exposed schema to match produced batches. |
| csharp/test/E2E/ComplexTypesValueTests.cs | Broadens PECO-3014-related skips to cover both protocols under the unified serializer behavior. |
| csharp/test/E2E/ThriftMapJsonEscapingE2ETests.cs | New E2E test ensuring complex MAP JSON containing quotes is valid/parseable on both protocols. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| protected override async System.Threading.Tasks.Task ValidateTestArrayData(string projection, string value) | ||
| { | ||
| Skip.If(TestConfiguration.Protocol == "rest", "SEA returns array elements in different format for NUMERIC/DOUBLE/DATE/TIMESTAMP/INTERVAL (PECO-3014)"); | ||
| Skip.If(true, "Array element format differs from upstream expectation for NUMERIC/DOUBLE/DATE/TIMESTAMP/INTERVAL (PECO-3014)"); | ||
| await base.ValidateTestArrayData(projection, value); | ||
| } |
| protected override async System.Threading.Tasks.Task ValidateTestMapData(string projection, string value) | ||
| { | ||
| Skip.If(TestConfiguration.Protocol == "rest", "SEA returns map values in different format (PECO-3014)"); | ||
| Skip.If(true, "Map value format differs from upstream expectation (PECO-3014)"); | ||
| await base.ValidateTestMapData(projection, value); | ||
| } |
| private async Task<(AdbcConnection conn, IArrowArrayStream stream)> ExecuteAsync(string sql, string protocol) | ||
| { | ||
| var properties = TestEnvironment.GetDriverParameters(TestConfiguration); | ||
| properties[DatabricksParameters.Protocol] = protocol; | ||
|
|
||
| AdbcDriver driver = new DatabricksDriver(); | ||
| AdbcDatabase database = driver.Open(properties); | ||
| AdbcConnection connection = database.Connect(properties); | ||
|
|
||
| var statement = connection.CreateStatement(); | ||
| statement.SqlQuery = sql; | ||
| QueryResult result = await statement.ExecuteQueryAsync(); | ||
| return (connection, result.Stream!); | ||
| } |
| private async Task<(AdbcConnection conn, IArrowArrayStream stream)> ExecuteAsync(string sql, string protocol) | ||
| { | ||
| var properties = TestEnvironment.GetDriverParameters(TestConfiguration); | ||
| properties[DatabricksParameters.Protocol] = protocol; |
Summary
Thrift's server-side serialization of MAP values containing double-quote characters produces malformed JSON. The inner quotes are not escaped, yielding:
SEA already serialized this correctly via the client-side
ComplexTypeSerializingStreamwrap. This PR mirrors JDBC's pattern (DatabricksThriftServiceClient.java) on the C# Thrift path: always request native Arrow complex types from the server (ComplexTypesAsArrow=true), then serialize to JSON strings client-side viaSystem.Text.JsonwhenEnableComplexDatatypeSupport=false.Changes
DatabricksStatement.cs: setComplexTypesAsArrow = trueunconditionally on Thrift requests; addedMaybeWrapComplexTypesthat wraps the result stream withComplexTypeSerializingStreamwhenEnableComplexDatatypeSupport=false.ComplexTypeSerializingStream.cs: the constructor now flattens complex columns in the exposed schema toStringType. This is a no-op for SEA (whose inner schema already hasStringTypefor complex columns via the manifest) and correct for Thrift (whose inner IPC schema has native types withComplexTypesAsArrow=true).ComplexTypesValueTests.cs: the PECO-3014 element-format skips forValidateTestArrayData/ValidateTestMapDatanow fire on both protocols rather than SEA only — Thrift now uses the same client-side serializer as SEA, so the format-divergence concern is uniform.ThriftMapJsonEscapingE2ETests.cs(new): runsSELECT MAP('key', 'val "quote"')on boththriftandrestprotocols and asserts the returned JSON is parseable bySystem.Text.Json.JsonDocument.Parsewith the right value.Reproduction (before the fix)
Test plan
databricks-test-config-sea.json).dotnet buildclean (netstandard2.0 + net8.0).Trade-off note
Thrift's complex-type JSON output now uses
System.Text.Jsonformatting (matching SEA), not the server's emitted format. This may differ forNUMERIC/DOUBLE/DATE/TIMESTAMP/INTERVALarray/map elements — tracked separately as PECO-3014. Existing test skips updated accordingly.Closes PECO-3032
This pull request was AI-assisted by Isaac.