Skip to content

[SPARK-55299][PS] Infer the correct unit for calculated timedeltas#56624

Open
fangchenli wants to merge 8 commits into
apache:masterfrom
fangchenli:SPARK-55299-timedelta-unit
Open

[SPARK-55299][PS] Infer the correct unit for calculated timedeltas#56624
fangchenli wants to merge 8 commits into
apache:masterfrom
fangchenli:SPARK-55299-timedelta-unit

Conversation

@fangchenli

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adds _with_inferred_unit to TimedeltaOps.sub/rsub to set the
result dtype from the operand units (numpy timedelta64 dtype, or
pd.Timedelta.unit for scalars)

Why are the changes needed?

To match pandas 3 behavior.

Does this PR introduce any user-facing change?

Yes. On pandas 3, the dtype of a timedelta produced by subtraction now
follows the finer resolution of the operands (capped at microseconds) instead
of always timedelta64[us]. Behavior on pandas < 3.0.0 is unchanged.

How was this patch tested?

Unitetests added.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

fangchenli and others added 6 commits June 19, 2026 07:39
### What changes were proposed in this pull request?

This PR makes pandas-on-Spark report the correct resolution (unit) for a
timedelta that is produced by subtraction, instead of always reporting
microseconds.

From pandas 3.0.0, subtracting timedeltas promotes the result to the finer
resolution of the operands (e.g. `timedelta64[s] - timedelta64[s]` is
`timedelta64[s]`, and `timedelta64[s] - timedelta64[ms]` is
`timedelta64[ms]`). The computed Spark column is always a
`DayTimeIntervalType` (microsecond), so the result was previously restored as
`timedelta64[us]`, losing the operand unit. A new `_with_inferred_unit` helper
overrides the result field dtype using `numpy.promote_types`, capped at
microseconds since `DayTimeIntervalType` cannot represent finer resolutions.
It is wired into `TimedeltaOps.sub`/`rsub`, the only paths that produce a
calculated timedelta.

This is scoped to pandas 3.0.0+, matching the existing gating in
`TimedeltaOps.restore` and `spark_type_to_pandas_dtype`. Before pandas 3.0.0,
pandas-on-Spark represents timedelta as nanosecond resolution everywhere, so
there is no unit to infer.

### Why are the changes needed?

To match pandas behavior on pandas 3. Previously a calculated timedelta always
reported `timedelta64[us]` even when pandas would report a coarser unit.

### Does this PR introduce _any_ user-facing change?

Yes. On pandas 3.0.0+, the dtype of a timedelta produced by subtraction now
follows the finer resolution of the operands (capped at microseconds) instead
of always being `timedelta64[us]`. Behavior on pandas < 3.0.0 is unchanged.

### How was this patch tested?

Added `test_sub_unit` covering second/millisecond/microsecond and mixed-unit
subtraction, `datetime.timedelta` scalars, and the `TimedeltaIndex` path.
Verified `test_timedelta_ops` passes under both pandas 3.0.3 and pandas 2.3.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… unit inference

### What changes were proposed in this pull request?

Guards the timedelta unit inference added in the parent change against
object-backed interval columns. pandas-on-Spark can store a timedelta column
as `InternalField(dtype=object, DayTimeIntervalType)` (see
`InternalFrame.prepare_pandas_frame`), and on pandas 3 subtracting such a
series reached `_with_inferred_unit`, which called `numpy.datetime_data` on the
`object` dtype and raised `TypeError: cannot get datetime metadata from
non-datetime type`. `unit_of` now checks `numpy.issubdtype(dtype,
numpy.timedelta64)` and falls back to microseconds (the resolution of the
underlying `DayTimeIntervalType`) for object-backed columns.

### Why are the changes needed?

To avoid a regression: subtraction on object-backed timedelta columns must
keep working on pandas 3.

### Does this PR introduce _any_ user-facing change?

No, beyond fixing the regression above; the result matches the previous
`timedelta64[us]` behavior for object-backed columns.

### How was this patch tested?

Extended `test_sub_unit` with object-backed timedelta subtraction (against a
scalar and another series). Verified `test_timedelta_ops` passes under both
pandas 3.0.3 and pandas 2.3.3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback: reduce inline comments and switch the dtype format
strings from %-formatting to f-strings. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The right-hand operand reaching `_with_inferred_unit` is already narrowed by
the isinstance guards in `sub`/`rsub` to either an `IndexOpsMixin` or a
`datetime.timedelta`, so type it as `Union[IndexOpsMixin, timedelta]` instead
of `Any`. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`pd.Timedelta` is a `datetime.timedelta` subclass, so it is accepted by the
`isinstance(right, timedelta)` checks in `sub`/`rsub`, but unlike a plain
`datetime.timedelta` it carries its own resolution. `unit_of` now reads
`pd.Timedelta.unit` before falling back to microseconds, so e.g.
`timedelta64[s] - pd.Timedelta(1, unit="s")` stays seconds instead of being
promoted to microseconds.

Extended `test_sub_unit` with `pd.Timedelta(..., unit="s"/"ms")` cases for both
sub and rsub.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add coverage for the previously-untested branch where DayTimeIntervalType
cannot represent nanoseconds: nanosecond operands (a timedelta64[ns] series and
a pd.Timedelta with unit="ns") are capped at microseconds rather than matching
pandas' nanosecond result. Asserts the dtype directly since this intentionally
diverges from pandas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fangchenli fangchenli changed the title [Spark-55299][PYTHON] Infer the correct unit for calculated timedeltas [SPARK-55299][PYTHON] Infer the correct unit for calculated timedeltas Jun 19, 2026

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fangchenli Please update the component tag. The current PR title contains [PYTHON], but it seems that [PS] would be more appropriate here, given that Pandas-on-Spark changes conventionally use [PS].

Comment thread python/pyspark/pandas/data_type_ops/timedelta_ops.py
@fangchenli fangchenli changed the title [SPARK-55299][PYTHON] Infer the correct unit for calculated timedeltas [SPARK-55299][PS] Infer the correct unit for calculated timedeltas Jun 20, 2026
Comment thread python/pyspark/pandas/data_type_ops/timedelta_ops.py Outdated
fangchenli and others added 2 commits June 21, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants