-
Notifications
You must be signed in to change notification settings - Fork 29.2k
[SPARK-57454][SQL] Add type coercion and widening rules for nanosecond-precision timestamp types #56638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-57454][SQL] Add type coercion and widening rules for nanosecond-precision timestamp types #56638
Changes from all commits
76a39cc
5d894c5
aec31dd
4a5e17b
fbfe68d
789297e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,7 @@ | |
|
|
||
| package org.apache.spark.sql.catalyst.analysis | ||
|
|
||
| import org.apache.spark.SparkException | ||
| import org.apache.spark.sql.catalyst.analysis.TypeCoercion.PromoteStrings.conf | ||
| import org.apache.spark.sql.catalyst.expressions.{ | ||
| Alias, | ||
|
|
@@ -82,6 +83,8 @@ import org.apache.spark.sql.types.{ | |
| StringType, | ||
| StringTypeExpression, | ||
| StructType, | ||
| TimestampLTZNanosType, | ||
| TimestampNTZNanosType, | ||
| TimestampNTZType, | ||
| TimestampType, | ||
| TimestampTypeExpression, | ||
|
|
@@ -244,14 +247,58 @@ abstract class TypeCoercionHelper { | |
| (d1, d2) match { | ||
| case (_, _: TimeType) => None | ||
| case (_: TimeType, _) => None | ||
| case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => | ||
| Some(TimestampType) | ||
|
|
||
| case (_: TimestampType, _: TimestampNTZType) | (_: TimestampNTZType, _: TimestampType) => | ||
| Some(TimestampType) | ||
|
|
||
| case (_: TimestampNTZType, _: DateType) | (_: DateType, _: TimestampNTZType) => | ||
| Some(TimestampNTZType) | ||
| // The remaining datetime types (DATE and the micro/nanos TIMESTAMP_LTZ / TIMESTAMP_NTZ | ||
| // families) widen along two independent axes: | ||
| // - time-zone family: the result is LTZ if either input is LTZ-family, otherwise NTZ. This | ||
| // mirrors the microsecond precedent where TIMESTAMP + TIMESTAMP_NTZ widens to TIMESTAMP. | ||
| // DATE is family-neutral and adopts the family of the other side. | ||
| // - precision: the maximum of the two precisions, where the micro types and DATE count as 6 | ||
| // and the nanos types contribute their own precision p in [7, 9]. | ||
| // The (family, precision) pair then maps back to a concrete type: precision 6 yields the | ||
| // micro type, precision in [7, 9] yields the nanos type. | ||
| // | ||
| // Note: this common-type resolution is intentionally more permissive than the nanosecond | ||
| // conversion rules in Cast.canUpCast / Cast.canANSIStoreAssign, which keep cross-family and | ||
| // DATE <-> nanos casts explicit-CAST-only while the nanos types are unreleased (SPARK-57323 | ||
| // etc.). Coercion here mirrors the microsecond precedent so that UNION / CASE / coalesce / | ||
| // IN / comparison resolve a common type the same way they do for the micro families; the | ||
| // stricter explicit-only stance is deliberately scoped to up-cast and store assignment, not | ||
| // to common-type resolution. | ||
| case _ => | ||
| // Fractional-seconds precision of the microsecond timestamp types; the nanos types carry | ||
| // 7-9. DATE has no time component and is treated as the micro precision so that | ||
| // DATE <-> micro widens to the micro type and DATE <-> nanos to the nanos type. | ||
| val MicrosPrecision = 6 | ||
| def isLtz(d: DatetimeType): Boolean = | ||
| d.isInstanceOf[TimestampType] || d.isInstanceOf[TimestampLTZNanosType] | ||
| def isNtz(d: DatetimeType): Boolean = | ||
| d.isInstanceOf[TimestampNTZType] || d.isInstanceOf[TimestampNTZNanosType] | ||
| def precisionOf(d: DatetimeType): Int = d match { | ||
| case t: TimestampLTZNanosType => t.precision | ||
| case t: TimestampNTZNanosType => t.precision | ||
| case _ => MicrosPrecision // DateType / TimestampType / TimestampNTZType | ||
| } | ||
| // Beyond TimeType (handled above), the only datetime types are DATE and the micro/nanos | ||
| // timestamp families. Guard so that a future DatetimeType subtype fails fast here instead | ||
| // of being silently mis-widened (treated as a family-neutral precision-6 type and folded | ||
| // into DATE) when it should be wired in explicitly. | ||
| def isWidenable(d: DatetimeType): Boolean = | ||
| isLtz(d) || isNtz(d) || d.isInstanceOf[DateType] | ||
| if (!isWidenable(d1) || !isWidenable(d2)) { | ||
| throw SparkException.internalError( | ||
| s"Unexpected datetime types in findWiderDateTimeType: $d1, $d2") | ||
| } else if (!isLtz(d1) && !isNtz(d1) && !isLtz(d2) && !isNtz(d2)) { | ||
| // Both sides are DATE; callers short-circuit equal types, so this is just defensive. | ||
| Some(DateType) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to confirm - this is unreachable via current callers?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Confirmed - unreachable via the current callers. |
||
| } else { | ||
| val p = math.max(precisionOf(d1), precisionOf(d2)) | ||
| if (isLtz(d1) || isLtz(d2)) { | ||
| Some(if (p <= MicrosPrecision) TimestampType else TimestampLTZNanosType(p)) | ||
| } else { | ||
| Some(if (p <= MicrosPrecision) TimestampNTZType else TimestampNTZNanosType(p)) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should we future-proof the TimeType guards here? The arms match _: TimeType specifically rather than _: AnyTimeType. If another AnyTimeType subtype is ever added, it would fall through to case _ and hit the internalError (acceptable fail-fast), but matching _: AnyTimeType there would be marginally more intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good eye -
TimeTypeis currently the onlyAnyTimeTypesubtype, so the two match the same set today. I'd lean toward keeping_: TimeTypethough, for two reasons: (1) these two arms are pre-existing (this PR only adds thecase _ =>arm), so broadening them is orthogonal to nanos widening; and (2) it would partly defeat the fail-fast guard just below - the intent is that a genuinely newDatetimeType(a futureAnyTimeTypesubtype included) tripsinternalErrorand gets triaged explicitly, rather than silently resolving toNone, which may not be the right common-type semantics for it (it could warrant its own precision-widening rule, like the timestamp families do). Happy to switch if you feel strongly, but my preference is the minimal pre-existing form.