[SPARK-57526][SQL] Add the timestamp_nanos function to create nanosecond-precision timestamps from numeric nanoseconds #56616
MaxGekk wants to merge 4 commits into
…econd-precision timestamps from numeric nanoseconds

### What changes were proposed in this pull request?

This PR adds a new built-in function `timestamp_nanos(expr)` that interprets `expr` as the number of nanoseconds since `1970-01-01 00:00:00 UTC` and returns a nanosecond-precision `TIMESTAMP_LTZ(9)`. Concretely:

- Adds a `NanosToTimestamp` expression in `datetimeExpressions.scala`. It declares a single `DECIMAL` input type with `ImplicitCastInputTypes`, so integral arguments are coerced to their natural decimal automatically while `DECIMAL` arguments are accepted as-is.
- Maps the nanosecond count `N` to the internal `(epochMicros, nanosWithinMicro)` pair with floor semantics (`epochMicros = floorDiv(N, 1000)`, `nanosWithinMicro = floorMod(N, 1000)`, always in `[0, 999]`), computed via `BigInteger` in both the interpreted (`eval`) and codegen (`doGenCode`) paths. `longValueExact` throws `ArithmeticException` when the value is outside the representable timestamp range.
- A `DECIMAL` input (rather than `BIGINT`) is required to reach the full `[0001, 9999]` calendar range: nanoseconds for year 9999 (~2.5e20) overflow a 64-bit `BIGINT`, the same reason the inverse `unix_nanos` returns `DECIMAL(21, 0)`.
- Registers `timestamp_nanos` in `FunctionRegistry` and adds the Scala `functions.timestamp_nanos`.
- Adds catalyst unit tests (interpreted + codegen, full-range and round-trip with `unix_nanos`, overflow), Scala/SQL end-to-end tests, and SQL golden-file coverage.

Scope notes: the PySpark API (classic and Spark Connect Python) and R are out of scope here and tracked as follow-ups; `timestamp_nanos` is recorded in the PySpark function-parity allowlist in the meantime. The Scala Spark Connect client picks up `timestamp_nanos` automatically because `functions.scala` lives in the shared `sql/api` module.

### Why are the changes needed?

Part of the [SPARK-56822](https://issues.apache.org/jira/browse/SPARK-56822) umbrella (timestamps with nanosecond precision). Spark has `timestamp_seconds` / `timestamp_millis` / `timestamp_micros` but no nanosecond counterpart, which is the natural inverse of `unix_nanos`.

### Does this PR introduce _any_ user-facing change?

Yes. A new `timestamp_nanos(expr)` function is available in SQL and the Scala API (including the Scala Spark Connect client). It returns `TIMESTAMP_LTZ(9)`. This is a change only within the unreleased nanosecond-timestamp preview.

Example:

```sql
SELECT timestamp_nanos(1230219000123456789);
-- 2008-12-25 07:30:00.123456789
```

### How was this patch tested?

- `build/sbt 'catalyst/testOnly org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite'`
- `build/sbt 'sql/testOnly org.apache.spark.sql.TimestampNanosFunctionsAnsiOnSuite org.apache.spark.sql.TimestampNanosFunctionsAnsiOffSuite'`
- `build/sbt 'sql/testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite org.apache.spark.sql.ExpressionsSchemaSuite'`
- `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt 'sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z "nanos"'`
- `./dev/scalastyle`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor
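The floor-semantics mapping described in this commit message can be illustrated with a small standalone sketch. It is not the `NanosToTimestamp` code itself: the object and method names (`NanosSplitSketch`, `split`) are hypothetical, and the use of `BigInteger.mod` for the remainder is an assumption that is merely consistent with the `n.subtract(rem).divide(thousand).longValueExact()` line quoted in the review.

```scala
import java.math.BigInteger

// Hypothetical illustration only; not the actual Spark expression.
object NanosSplitSketch {
  private val Thousand = BigInteger.valueOf(1000L)

  /** Splits a nanosecond count into (epochMicros, nanosWithinMicro) with floor semantics. */
  def split(nanos: BigInteger): (Long, Int) = {
    // BigInteger.mod returns a non-negative result for a positive modulus,
    // so it behaves like floorMod and the remainder is always in [0, 999].
    val rem = nanos.mod(Thousand)
    // Subtracting the floor remainder first makes the division exact (floorDiv).
    // longValueExact throws ArithmeticException if the quotient does not fit in a Long.
    val micros = nanos.subtract(rem).divide(Thousand).longValueExact()
    (micros, rem.intValue())
  }
}
```

Under these assumptions, a count of `-1` nanosecond maps to `(-1, 999)` rather than `(0, -1)`, which is the point of flooring: the sub-microsecond part stays non-negative for pre-epoch instants.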
… analysis

`NanosToTimestamp` declared `inputTypes = Seq(DecimalType)` with `ImplicitCastInputTypes`, which silently coerced FLOAT/DOUBLE/STRING to DECIMAL(14,7)/(30,15)/(38,18). Those targets hold far fewer integer digits than a realistic nanosecond count, so a finite FLOAT/DOUBLE argument overflowed the coerced decimal and yielded NULL (ANSI off) or an overflow error (ANSI on) instead of a timestamp -- contrary to the documented "accepted and floored" behavior.

Switch to `ExpectsInputTypes` with `Seq(TypeCollection(IntegralType, DecimalType))` so only integral and DECIMAL nanosecond counts are accepted; FLOAT/DOUBLE/STRING now fail at analysis with a clear DATATYPE_MISMATCH, matching the "count of time units" semantics of timestamp_micros/millis. The interpreted and codegen paths widen an integral argument to BigInteger directly and keep the DECIMAL floor path unchanged.

Add catalyst coverage for the integral path and the FLOAT/DOUBLE/STRING rejection, a SQL rejection case, and regenerate the golden files.

Co-authored-by: Isaac
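The digit-count argument in this commit message can be checked with a short sketch. It only models how many digits a `DECIMAL(precision, scale)` leaves to the left of the decimal point; the names `DecimalCapacitySketch`, `integerDigits`, and `fits` are hypothetical and do not correspond to Spark APIs.

```scala
// Hypothetical illustration only; it models digit capacity, not Spark's cast rules.
object DecimalCapacitySketch {
  /** Digits available left of the decimal point in DECIMAL(precision, scale). */
  def integerDigits(precision: Int, scale: Int): Int = precision - scale

  /** True if the integral value `n` has no more integer digits than DECIMAL(precision, scale) allows. */
  def fits(n: BigInt, precision: Int, scale: Int): Boolean =
    n.abs.toString.length <= integerDigits(precision, scale)
}
```

Taking the example count `1230219000123456789` (19 digits) from the PR description, `DECIMAL(14,7)` offers 7 integer digits and `DECIMAL(30,15)` offers 15, so neither coerced target can hold it, whereas `DECIMAL(21, 0)` can.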
…ow and add negative tests

`NanosToTimestamp` let `BigInteger.longValueExact()` throw a raw `java.lang.ArithmeticException` when `epochMicros` overflows a 64-bit long. Surface it instead as a proper Spark error condition: add `QueryExecutionErrors.timestampNanosOverflowError`, which raises a `SparkArithmeticException` with the `DATETIME_OVERFLOW` condition (SQLSTATE 22008), and catch/rethrow in both the interpreted and codegen paths.

Strengthen the negative coverage: the catalyst FLOAT/DOUBLE/STRING rejection now asserts the `UNEXPECTED_INPUT_TYPE` `DataTypeMismatch` (not just `isFailure`), the overflow test asserts the `DATETIME_OVERFLOW` condition via `checkErrorInExpression`, and a SQL golden case exercises the runtime overflow end-to-end. Regenerate the golden files.

Co-authored-by: Isaac
    checkAnswer(sqlRes, Row(instant))
    assert(sqlRes.schema.head.dataType === TimestampLTZNanosType(9))

    // A BIGINT argument is implicitly cast to DECIMAL, so the integral literal works directly.
Nit: This comment seems inaccurate; the expression uses ExpectsInputTypes (not ImplicitCastInputTypes), so a BIGINT is not cast to DECIMAL — it goes through the dedicated IntegralType path (BigInteger.valueOf(... longValue())).
Good catch, fixed in e81da36. The comment was left over from the original ImplicitCastInputTypes + Seq(DecimalType) design; updated it to describe the dedicated IntegralType path (widened to BigInteger, no DECIMAL cast).
    val micros = try {
      n.subtract(rem).divide(thousand).longValueExact()
    } catch {
      case _: ArithmeticException => throw QueryExecutionErrors.timestampNanosOverflowError(n)
One question here for my curiosity: Overflow guard only catches epochMicros not fitting in a 64-bit long, not the documented calendar range. This is consistent with timestamp_micros (which also does no calendar-range validation); so I'm wondering - is it intentional?
Inputs whose epochMicros fits in a long but represents a year > 9999 (or < 0001) — up to ~year 292471 — silently produce an out-of-range TimestampNanosVal, since fromParts validates only nanosWithinMicro.
Intentional. It matches the sibling timestamp_micros/timestamp_millis/timestamp_seconds, which likewise guard only the 64-bit boundary (Math.multiplyExact) and do not validate the [0001, 9999] calendar range, so an epochMicros that fits in a long but lands past year 9999 (up to the long-micros maximum, ~year 294247) yields an out-of-range value rather than an error. I added an inline comment in e81da36 documenting this so the behavior is explicit. I kept it consistent with the micro constructors rather than introducing calendar-range validation here; happy to add that in a follow-up if we'd prefer the stricter behavior across all of them.
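The "~year 294247" figure in the reply above can be reproduced with `java.time`. This is a sketch for checking the arithmetic, not Spark code; `MicrosRangeSketch` and `yearOfEpochMicros` are hypothetical names.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.temporal.ChronoUnit

// Hypothetical illustration only; uses java.time rather than Spark's internal utilities.
object MicrosRangeSketch {
  /** Calendar year (UTC) of the instant `micros` microseconds after the epoch. */
  def yearOfEpochMicros(micros: Long): Int =
    Instant.EPOCH.plus(micros, ChronoUnit.MICROS).atZone(ZoneOffset.UTC).getYear
}
```

Calling `MicrosRangeSketch.yearOfEpochMicros(Long.MaxValue)` gives 294247, far beyond the documented upper bound of year 9999, which is why a guard on the 64-bit boundary alone does not enforce the calendar range.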
    checkEvaluation(NanosToTimestamp(Literal(-1L)), nanosVal(-1L, 999))
    checkEvaluation(NanosToTimestamp(Literal(1000)), nanosVal(1L, 0))
Nit about integral-width coverage: the catalyst test exercises Int (Literal(1000)) and Long, which is enough to cover the (long) $c codegen cast, but a TINYINT/SMALLINT case would fully nail the IntegralType branch.
Added TINYINT (Literal(2.toByte)) and SMALLINT (Literal(1000.toShort)) cases in e81da36 so every integral width exercises the (long) codegen cast.
- Fix a stale test comment that still claimed a BIGINT argument is implicitly cast to DECIMAL; after the switch to ExpectsInputTypes it goes through the dedicated IntegralType path (widened to BigInteger), so the comment is updated to match.
- Document that, like timestamp_micros/millis/seconds, NanosToTimestamp does not validate the [0001, 9999] calendar range: only the 64-bit epochMicros boundary is guarded (counts up to ~year 294247 are accepted), which is intentional for consistency with the microsecond constructors.
- Extend the catalyst IntegralType coverage with TINYINT (Byte) and SMALLINT (Short) literals so every integral width exercises the (long) codegen cast.
@stevomitric @uros-b Could you take a look at the PR, please?
### What changes were proposed in this pull request?

Adds a built-in `timestamp_nanos(expr)` function. It reads `expr` as a count of nanoseconds since `1970-01-01 00:00:00 UTC` and returns a nanosecond-precision `TIMESTAMP_LTZ(9)`, the natural inverse of `unix_nanos`.

The argument is an integral or `DECIMAL` count. `DECIMAL` is what lets it reach the whole `[0001, 9999]` calendar range, since year-9999 nanoseconds (~2.5e20) overflow a 64-bit `BIGINT`, the same reason `unix_nanos` returns `DECIMAL(21, 0)`. `FLOAT`/`DOUBLE`/`STRING` are rejected at analysis (a fractional or string nanosecond count isn't meaningful), and a count outside the representable range fails with the `DATETIME_OVERFLOW` error condition.

Implementation: a new `NanosToTimestamp` expression in `datetimeExpressions.scala` (interpreted + codegen), registered in `FunctionRegistry`, and exposed as `functions.timestamp_nanos` in the shared `sql/api` module so the Scala Spark Connect client picks it up automatically. PySpark and R are out of scope and tracked as follow-ups; `timestamp_nanos` is on the PySpark function-parity allowlist meanwhile.

Follow-up: the peer `timestamp_seconds`/`timestamp_millis`/`timestamp_micros` still throw a raw `ArithmeticException` on overflow; migrating them to `DATETIME_OVERFLOW` is tracked in SPARK-57577.

### Why are the changes needed?

Part of the SPARK-56822 umbrella (nanosecond-precision timestamps). Spark has `timestamp_seconds`/`timestamp_millis`/`timestamp_micros` but no nanosecond counterpart.

### Does this PR introduce any user-facing change?

Yes: a new `timestamp_nanos(expr)` function in SQL and the Scala API (including the Scala Spark Connect client), returning `TIMESTAMP_LTZ(9)`. This is a change only within the unreleased nanosecond-timestamp preview.

### How was this patch tested?

- `build/sbt 'catalyst/testOnly org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite'`
- `build/sbt 'sql/testOnly org.apache.spark.sql.TimestampNanosFunctionsAnsiOnSuite org.apache.spark.sql.TimestampNanosFunctionsAnsiOffSuite'`
- `build/sbt 'sql/testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite org.apache.spark.sql.ExpressionsSchemaSuite'`
- `SPARK_GENERATE_GOLDEN_FILES=1 build/sbt 'sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z "nanos"'`
- `./dev/scalastyle`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor
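The range argument in the description (year-9999 nanoseconds overflow a 64-bit `BIGINT` but fit `DECIMAL(21, 0)`) can be verified with a standalone sketch. The names `NanosRangeSketch`, `epochNanos`, `endOf9999`, and `nanosAtEndOf9999` are hypothetical, and the arithmetic uses `BigInt` and `java.time` rather than any Spark API.

```scala
import java.time.Instant

// Hypothetical illustration only; checks the arithmetic behind the DECIMAL(21, 0) choice.
object NanosRangeSketch {
  /** Nanoseconds since the epoch for `instant`, computed in arbitrary precision. */
  def epochNanos(instant: Instant): BigInt =
    BigInt(instant.getEpochSecond) * 1000000000L + instant.getNano

  // Last representable nanosecond of year 9999 in UTC.
  val endOf9999: Instant = Instant.parse("9999-12-31T23:59:59.999999999Z")
  val nanosAtEndOf9999: BigInt = epochNanos(endOf9999)
}
```

Here `nanosAtEndOf9999` is `253402300799999999999`, a 21-digit number of roughly 2.5e20, while `Long.MaxValue` is about 9.2e18. `nanosAtEndOf9999.isValidLong` is therefore false, and 21 digits is exactly what `DECIMAL(21, 0)` provides.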