[SPARK-55299][PS] Infer the correct unit for calculated timedeltas by fangchenli · Pull Request #56624 · apache/spark

fangchenli · 2026-06-19T22:25:08Z

What changes were proposed in this pull request?

This PR adds _with_inferred_unit to TimedeltaOps.sub/rsub to set the
result dtype from the operand units (the numpy timedelta64 dtype for
series and indexes, or pd.Timedelta.unit for scalars).

Why are the changes needed?

To match pandas 3 behavior.

Does this PR introduce any user-facing change?

Yes. On pandas 3, the dtype of a timedelta produced by subtraction now
follows the finer resolution of the operands (capped at microseconds) instead
of always timedelta64[us]. Behavior on pandas < 3.0.0 is unchanged.

How was this patch tested?

Unit tests added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

### What changes were proposed in this pull request?

This PR makes pandas-on-Spark report the correct resolution (unit) for a timedelta that is produced by subtraction, instead of always reporting microseconds.

From pandas 3.0.0, subtracting timedeltas promotes the result to the finer resolution of the operands (e.g. `timedelta64[s] - timedelta64[s]` is `timedelta64[s]`, and `timedelta64[s] - timedelta64[ms]` is `timedelta64[ms]`). The computed Spark column is always a `DayTimeIntervalType` (microsecond), so the result was previously restored as `timedelta64[us]`, losing the operand unit.

A new `_with_inferred_unit` helper overrides the result field dtype using `numpy.promote_types`, capped at microseconds since `DayTimeIntervalType` cannot represent finer resolutions. It is wired into `TimedeltaOps.sub`/`rsub`, the only paths that produce a calculated timedelta.

This is scoped to pandas 3.0.0+, matching the existing gating in `TimedeltaOps.restore` and `spark_type_to_pandas_dtype`. Before pandas 3.0.0, pandas-on-Spark represents timedelta as nanosecond resolution everywhere, so there is no unit to infer.

### Why are the changes needed?

To match pandas behavior on pandas 3. Previously a calculated timedelta always reported `timedelta64[us]` even when pandas would report a coarser unit.

### Does this PR introduce _any_ user-facing change?

Yes. On pandas 3.0.0+, the dtype of a timedelta produced by subtraction now follows the finer resolution of the operands (capped at microseconds) instead of always being `timedelta64[us]`. Behavior on pandas < 3.0.0 is unchanged.

### How was this patch tested?

Added `test_sub_unit` covering second/millisecond/microsecond and mixed-unit subtraction, `datetime.timedelta` scalars, and the `TimedeltaIndex` path. Verified `test_timedelta_ops` passes under both pandas 3.0.3 and pandas 2.3.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
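The promotion rule described above can be illustrated with plain numpy. This is a minimal sketch, not the code in this PR: the function name `infer_result_dtype` is hypothetical, and it assumes both operands are already numpy `timedelta64` dtypes.

```python
import numpy as np

# Finest resolution that DayTimeIntervalType can represent, per the description.
MICROSECONDS = np.dtype("timedelta64[us]")


def infer_result_dtype(left: np.dtype, right: np.dtype) -> np.dtype:
    """Hypothetical helper: promote two timedelta64 dtypes, capped at microseconds."""
    # numpy promotes two timedelta64 dtypes to the finer of the two units.
    promoted = np.promote_types(left, right)
    # Promoting against microseconds yields microseconds only when `promoted`
    # is microseconds or coarser; anything finer (e.g. ns) must be capped.
    if np.promote_types(promoted, MICROSECONDS) != MICROSECONDS:
        return MICROSECONDS
    return promoted


seconds = np.dtype("timedelta64[s]")
millis = np.dtype("timedelta64[ms]")
nanos = np.dtype("timedelta64[ns]")

same_unit = infer_result_dtype(seconds, seconds)   # stays at seconds
mixed_unit = infer_result_dtype(seconds, millis)   # finer unit wins: milliseconds
capped = infer_result_dtype(nanos, seconds)        # nanoseconds capped at microseconds
```

The cap is expressed through a second `numpy.promote_types` call rather than a hand-written unit ordering, so the sketch does not need its own table of units.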

… unit inference

### What changes were proposed in this pull request?

Guards the timedelta unit inference added in the parent change against object-backed interval columns. pandas-on-Spark can store a timedelta column as `InternalField(dtype=object, DayTimeIntervalType)` (see `InternalFrame.prepare_pandas_frame`), and on pandas 3 subtracting such a series reached `_with_inferred_unit`, which called `numpy.datetime_data` on the `object` dtype and raised `TypeError: cannot get datetime metadata from non-datetime type`. `unit_of` now checks `numpy.issubdtype(dtype, numpy.timedelta64)` and falls back to microseconds (the resolution of the underlying `DayTimeIntervalType`) for object-backed columns.

### Why are the changes needed?

To avoid a regression: subtraction on object-backed timedelta columns must keep working on pandas 3.

### Does this PR introduce _any_ user-facing change?

No, beyond fixing the regression above; the result matches the previous `timedelta64[us]` behavior for object-backed columns.

### How was this patch tested?

Extended `test_sub_unit` with object-backed timedelta subtraction (against a scalar and another series). Verified `test_timedelta_ops` passes under both pandas 3.0.3 and pandas 2.3.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
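The guard described in this commit can be sketched in isolation as follows. This is an illustration under assumptions, not the PR's `unit_of`: the name `unit_of_sketch` and its signature are hypothetical, and only the two numpy calls named in the commit message are taken from the source.

```python
import numpy as np


def unit_of_sketch(dtype: np.dtype) -> str:
    """Hypothetical stand-in for the guarded unit lookup described above."""
    # Only timedelta64 dtypes carry datetime metadata; calling
    # numpy.datetime_data on an object dtype raises TypeError.
    if np.issubdtype(dtype, np.timedelta64):
        unit, _count = np.datetime_data(dtype)
        return unit
    # Object-backed columns fall back to microseconds, the resolution
    # of the underlying DayTimeIntervalType.
    return "us"


typed_unit = unit_of_sketch(np.dtype("timedelta64[s]"))   # unit read from the dtype
object_unit = unit_of_sketch(np.dtype(object))            # fallback, no exception
```

Checking the dtype before reading its metadata keeps the object-backed path on the previous `timedelta64[us]` behavior instead of raising.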

Address review feedback: reduce inline comments and switch the dtype format strings from %-formatting to f-strings. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The right-hand operand reaching `_with_inferred_unit` is already narrowed by the isinstance guards in `sub`/`rsub` to either an `IndexOpsMixin` or a `datetime.timedelta`, so type it as `Union[IndexOpsMixin, timedelta]` instead of `Any`. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`pd.Timedelta` is a `datetime.timedelta` subclass, so it is accepted by the `isinstance(right, timedelta)` checks in `sub`/`rsub`, but unlike a plain `datetime.timedelta` it carries its own resolution. `unit_of` now reads `pd.Timedelta.unit` before falling back to microseconds, so e.g. `timedelta64[s] - pd.Timedelta(1, unit="s")` stays seconds instead of being promoted to microseconds. Extended `test_sub_unit` with `pd.Timedelta(..., unit="s"/"ms")` cases for both sub and rsub. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
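The scalar case can be sketched without importing pandas. The function `scalar_unit_sketch` and the class `FakeUnitTimedelta` below are hypothetical: the class only imitates the one property the commit relies on, namely that `pd.Timedelta` subclasses `datetime.timedelta` and exposes a `unit` attribute.

```python
from datetime import timedelta

# Units that DayTimeIntervalType can represent without loss, per the PR description.
SUPPORTED_UNITS = ("s", "ms", "us")


def scalar_unit_sketch(value: timedelta) -> str:
    """Hypothetical helper: resolution of a timedelta scalar, capped at microseconds."""
    # A plain datetime.timedelta has no `unit` attribute, so it falls back to
    # microseconds. A finer unit such as "ns" is capped at microseconds as well.
    unit = getattr(value, "unit", None)
    return unit if unit in SUPPORTED_UNITS else "us"


class FakeUnitTimedelta(timedelta):
    """Illustrative stand-in for a timedelta subclass that carries a unit."""

    unit = "s"


plain = scalar_unit_sketch(timedelta(seconds=1))             # no unit attribute
with_unit = scalar_unit_sketch(FakeUnitTimedelta(seconds=1))  # unit read from the scalar
```

Because the stand-in passes `isinstance(value, timedelta)`, it takes the same branch that `pd.Timedelta` is described as taking in `sub`/`rsub`.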

Add coverage for the previously-untested branch where DayTimeIntervalType cannot represent nanoseconds: nanosecond operands (a timedelta64[ns] series and a pd.Timedelta with unit="ns") are capped at microseconds rather than matching pandas' nanosecond result. Asserts the dtype directly since this intentionally diverges from pandas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

uros-b

@fangchenli Please update the component tag. The current PR title contains [PYTHON], but it seems that [PS] would be more appropriate here, given that Pandas-on-Spark changes conventionally use [PS].


fangchenli and others added 6 commits June 19, 2026 07:39

[SPARK-55299][PS][FOLLOWUP] Trim comments and use f-strings

579aa1c


fangchenli changed the title ~~[Spark-55299][PYTHON] Infer the correct unit for calculated timedeltas~~ [SPARK-55299][PYTHON] Infer the correct unit for calculated timedeltas Jun 19, 2026

uros-b reviewed Jun 20, 2026


Comment thread python/pyspark/pandas/data_type_ops/timedelta_ops.py

fangchenli changed the title ~~[SPARK-55299][PYTHON] Infer the correct unit for calculated timedeltas~~ [SPARK-55299][PS] Infer the correct unit for calculated timedeltas Jun 20, 2026

uros-b reviewed Jun 20, 2026


Comment thread python/pyspark/pandas/data_type_ops/timedelta_ops.py Outdated

fangchenli and others added 2 commits June 21, 2026 23:21

Merge remote-tracking branch 'upstream/master' into SPARK-55299-timed…

755ed05

…elta-unit

[SPARK-55299][PS] Address review: early-return common case and reword…

a67f830

… comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fangchenli wants to merge 8 commits into apache:master from fangchenli:SPARK-55299-timedelta-unit
