[SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2 by aokolnychyi · Pull Request #56619 · apache/spark

aokolnychyi · 2026-06-19T18:40:18Z

What changes were proposed in this pull request?

This PR reworks column ID validation for nested fields in DSv2.

Why are the changes needed?

The original implementation detected dropped-and-re-added columns by comparing top-level Column.id() strings in a dedicated validateColumnIds pass, but this approach had no visibility into nested struct fields, array elements, or map keys/values. To work around this limitation, connectors had to encode nested field IDs into the top-level ID string (as demonstrated by ComposedColumnIdTableCatalog), placing an unreasonable burden on connector authors and making the feature fragile by design.

Does this PR introduce any user-facing change?

Yes but it targets unreleased functionality and must be cherry picked to 4.2.

How was this patch tested?

Existing and new tests.

Was this patch authored or co-authored using generative AI tooling?

Claude Code v2.1.183

aokolnychyi · 2026-06-19T18:40:41Z

This is related to #55376.

aokolnychyi · 2026-06-19T18:45:29Z

@gengliangwang @juliuszsompolski @andreaschat-db @longvu-db, what do you folks think?

aokolnychyi · 2026-06-19T20:03:49Z

@huaxingao, this is the PR I was talking about on the dev list.

aokolnychyi · 2026-06-19T20:40:13Z

This is not yet complete as it doesn't handle stripping column IDs on CTAS/RTAS.

aokolnychyi · 2026-06-20T01:12:31Z

+        .build();
  }

-  /**


Switching newly added create methods to a builder. Hasn't been released yet.

aokolnychyi · 2026-06-20T01:14:12Z

+   * @param name the name of the column
+   * @param dataType the data type of the column
+   * @return a new builder
+   * @since 4.2.0


Intentional to be cherry-picked into 4.2.

aokolnychyi · 2026-06-20T01:18:57Z

+      return this;
+    }
+
+    public Builder metadataInJSON(String metadataInJSON) {


Probably call it metadata.

gengliangwang

2 blocking, 2 non-blocking, 1 nit.
Clean recursive generalization of the column-ID check, but two blockers: leftover references to the removed v1 machinery will fail tests, and the field-ID error message names the wrong (or no) field.

Design / architecture (2)

Incomplete removal of the column-ID-v1 machinery: DSv2IncrementallyConstructedQueryTests (run by the modified DataSourceV2DataFrameSuite and by DataSourceV2DataFrameConnectSuite) still asserts the removed INCOMPATIBLE_TABLE_CHANGE_AFTER_ANALYSIS.COLUMN_ID_MISMATCH at lines 324/414/488 — the path now throws COLUMNS_MISMATCH, so those checkErrors fail. ComposedColumnIdTableCatalog (now unreferenced anywhere; it is the workaround this PR eliminates) and NullTableIdAndNullColumnIdInMemoryTableCatalog still cite the removed validateColumnIds in comments. Migrate the trait's assertions to COLUMNS_MISMATCH and delete ComposedColumnIdTableCatalog. (likely overlaps the known "not yet complete" work)
CatalogV2Util.scala:676: FIELD_ID may ride into user-facing df.schema metadata — see inline (question).

Correctness (2)

SchemaUtils.scala:453: field-ID error names the parent (or empty) path, not the field whose ID changed — masked by tests asserting the wrong output — see inline.
SchemaUtils.scala:544: removeFieldIds struct branch always reallocates, contradicting its own Scaladoc — see inline.

Nits: 1 minor item (see inline comments).

PR description suggestions

Document: the new mechanism (field IDs stored per-field in StructField metadata and validated within validateSchemaCompatibility), not only the old approach's limitation — the "What changes" section is currently one line.
Document: the removal of the dedicated validateColumnIds pass and the COLUMN_ID_MISMATCH error subclass; field-ID changes now report as COLUMNS_MISMATCH.

gengliangwang · 2026-06-22T19:03:37Z

            case Some(otherField) =>
+              if (checkFieldIds) {
+                for (id <- field.id; otherId <- otherField.id if id != otherId) {
+                  errors += s"${colPath.fullyQuoted} field ID has changed from $id to $otherId"


This reports the parent struct's path, not the field whose ID changed. The check runs one level above the recursion that appends field.name, so unlike the sibling errors (formatField for removed/added, and the type/nullability messages emitted inside the recursion) it omits the field name: a top-level column change becomes " field ID has changed from 1 to 99" (empty path), and a nested change names the container — two changed sibling fields then produce indistinguishable messages.

Suggested change

errors += s"${colPath.fullyQuoted} field ID has changed from $id to $otherId"

errors += s"${(colPath :+ field.name).fullyQuoted} field ID has changed from $id to $otherId"

The six V2TableUtilSuite expectations (lines 569/587/602/621-622/639/656) assert the current output and will need updating; the explanatory comment at line 586 also carries a non-ASCII em-dash — switch it to ASCII while you're there.

gengliangwang · 2026-06-22T19:03:37Z

+   */
+  def removeFieldIds(dataType: DataType): DataType = dataType match {
+    case s: StructType =>
+      StructType(s.fields.map { field =>


This branch always rebuilds via StructType(s.fields.map(...)), so it never returns the original s — but the Scaladoc promises "the original instance unchanged when there is nothing to strip," which only the ArrayType/MapType branches honor. InMemoryBaseTable.assignFieldIds (line 945) shows the struct short-circuit: track whether any field changed and return s if none did. Either apply that here or relax the comment.

gengliangwang · 2026-06-22T19:03:37Z

      f = encodeDefaultValue(default, f)
    }
+    Option(col.id()).foreach { id =>
+      f = f.withId(id)


Encoding the top-level ID here also rides into the user-facing schema: DataSourceV2Relation.create builds its output from table.columns.asSchema (which calls this) via toAttributes, and FIELD_ID isn't in INTERNAL_METADATA_KEYS, so it looks like it ends up in df.schema field metadata for every column / nested field of an ID-tracking connector. Is that intended? If not, adding FIELD_ID to INTERNAL_METADATA_KEYS (or stripping it at the relation boundary) would keep it internal. I traced the encode path but didn't confirm end-user visibility — could you check?

gengliangwang · 2026-06-22T19:03:37Z

-   * IDs are preserved across type changes, keeping the same column ID through type
-   * widening and nested field additions. [[TypeChangeResetsColIdTableCatalog]] overrides
-   * this behavior for testing scenarios where type changes should produce a new ID.
+   * Existing IDs are preserved: Column → StructType → Column round-trip encodes them in


Non-ASCII arrows (U+2192); use ASCII (CLAUDE.md / scalastyle):

Suggested change

* Existing IDs are preserved: Column → StructType → Column round-trip encodes them in

* Existing IDs are preserved: Column -> StructType -> Column round-trip encodes them in

aokolnychyi force-pushed the spark-57544 branch 2 times, most recently from 0b3f26f to 30864e4 Compare June 20, 2026 01:11

aokolnychyi commented Jun 20, 2026

View reviewed changes

[SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2

1a82d94

aokolnychyi force-pushed the spark-57544 branch from 30864e4 to 1a82d94 Compare June 20, 2026 01:13

aokolnychyi commented Jun 20, 2026

View reviewed changes

gengliangwang reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2#56619

[SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2#56619
aokolnychyi wants to merge 1 commit into
apache:masterfrom
aokolnychyi:spark-57544

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi Jun 20, 2026

Uh oh!

aokolnychyi Jun 20, 2026

Uh oh!

aokolnychyi Jun 20, 2026

Uh oh!

gengliangwang left a comment

Uh oh!

gengliangwang Jun 22, 2026

Uh oh!

gengliangwang Jun 22, 2026

Uh oh!

gengliangwang Jun 22, 2026

Uh oh!

gengliangwang Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	errors += s"${colPath.fullyQuoted} field ID has changed from $id to $otherId"
	errors += s"${(colPath :+ field.name).fullyQuoted} field ID has changed from $id to $otherId"

	* Existing IDs are preserved: Column → StructType → Column round-trip encodes them in
	* Existing IDs are preserved: Column -> StructType -> Column round-trip encodes them in

Conversation

aokolnychyi commented Jun 19, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi commented Jun 19, 2026

Uh oh!

aokolnychyi Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

aokolnychyi Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

aokolnychyi Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang left a comment

Choose a reason for hiding this comment

Design / architecture (2)

Correctness (2)

PR description suggestions

Uh oh!

gengliangwang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants