[SPARK-55299][PS] Infer the correct unit for calculated timedeltas#56624
Open
fangchenli wants to merge 8 commits into
Open
[SPARK-55299][PS] Infer the correct unit for calculated timedeltas#56624fangchenli wants to merge 8 commits into
fangchenli wants to merge 8 commits into
Conversation
### What changes were proposed in this pull request? This PR makes pandas-on-Spark report the correct resolution (unit) for a timedelta that is produced by subtraction, instead of always reporting microseconds. From pandas 3.0.0, subtracting timedeltas promotes the result to the finer resolution of the operands (e.g. `timedelta64[s] - timedelta64[s]` is `timedelta64[s]`, and `timedelta64[s] - timedelta64[ms]` is `timedelta64[ms]`). The computed Spark column is always a `DayTimeIntervalType` (microsecond), so the result was previously restored as `timedelta64[us]`, losing the operand unit. A new `_with_inferred_unit` helper overrides the result field dtype using `numpy.promote_types`, capped at microseconds since `DayTimeIntervalType` cannot represent finer resolutions. It is wired into `TimedeltaOps.sub`/`rsub`, the only paths that produce a calculated timedelta. This is scoped to pandas 3.0.0+, matching the existing gating in `TimedeltaOps.restore` and `spark_type_to_pandas_dtype`. Before pandas 3.0.0, pandas-on-Spark represents timedelta as nanosecond resolution everywhere, so there is no unit to infer. ### Why are the changes needed? To match pandas behavior on pandas 3. Previously a calculated timedelta always reported `timedelta64[us]` even when pandas would report a coarser unit. ### Does this PR introduce _any_ user-facing change? Yes. On pandas 3.0.0+, the dtype of a timedelta produced by subtraction now follows the finer resolution of the operands (capped at microseconds) instead of always being `timedelta64[us]`. Behavior on pandas < 3.0.0 is unchanged. ### How was this patch tested? Added `test_sub_unit` covering second/millisecond/microsecond and mixed-unit subtraction, `datetime.timedelta` scalars, and the `TimedeltaIndex` path. Verified `test_timedelta_ops` passes under both pandas 3.0.3 and pandas 2.3.3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… unit inference ### What changes were proposed in this pull request? Guards the timedelta unit inference added in the parent change against object-backed interval columns. pandas-on-Spark can store a timedelta column as `InternalField(dtype=object, DayTimeIntervalType)` (see `InternalFrame.prepare_pandas_frame`), and on pandas 3 subtracting such a series reached `_with_inferred_unit`, which called `numpy.datetime_data` on the `object` dtype and raised `TypeError: cannot get datetime metadata from non-datetime type`. `unit_of` now checks `numpy.issubdtype(dtype, numpy.timedelta64)` and falls back to microseconds (the resolution of the underlying `DayTimeIntervalType`) for object-backed columns. ### Why are the changes needed? To avoid a regression: subtraction on object-backed timedelta columns must keep working on pandas 3. ### Does this PR introduce _any_ user-facing change? No, beyond fixing the regression above; the result matches the previous `timedelta64[us]` behavior for object-backed columns. ### How was this patch tested? Extended `test_sub_unit` with object-backed timedelta subtraction (against a scalar and another series). Verified `test_timedelta_ops` passes under both pandas 3.0.3 and pandas 2.3.3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback: reduce inline comments and switch the dtype format strings from %-formatting to f-strings. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The right-hand operand reaching `_with_inferred_unit` is already narrowed by the isinstance guards in `sub`/`rsub` to either an `IndexOpsMixin` or a `datetime.timedelta`, so type it as `Union[IndexOpsMixin, timedelta]` instead of `Any`. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`pd.Timedelta` is a `datetime.timedelta` subclass, so it is accepted by the `isinstance(right, timedelta)` checks in `sub`/`rsub`, but unlike a plain `datetime.timedelta` it carries its own resolution. `unit_of` now reads `pd.Timedelta.unit` before falling back to microseconds, so e.g. `timedelta64[s] - pd.Timedelta(1, unit="s")` stays seconds instead of being promoted to microseconds. Extended `test_sub_unit` with `pd.Timedelta(..., unit="s"/"ms")` cases for both sub and rsub. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add coverage for the previously-untested branch where DayTimeIntervalType cannot represent nanoseconds: nanosecond operands (a timedelta64[ns] series and a pd.Timedelta with unit="ns") are capped at microseconds rather than matching pandas' nanosecond result. Asserts the dtype directly since this intentionally diverges from pandas. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
uros-b
reviewed
Jun 20, 2026
uros-b
left a comment
Member
There was a problem hiding this comment.
@fangchenli Please update the component tag. The current PR title contains [PYTHON], but it seems that [PS] would be more appropriate here, given that Pandas-on-Spark changes conventionally use [PS].
uros-b
reviewed
Jun 20, 2026
uros-b
reviewed
Jun 20, 2026
… comment Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR adds
_with_inferred_unittoTimedeltaOps.sub/rsubto set theresult dtype from the operand units (numpy
timedelta64dtype, orpd.Timedelta.unitfor scalars)Why are the changes needed?
To match pandas 3 behavior.
Does this PR introduce any user-facing change?
Yes. On pandas 3, the dtype of a timedelta produced by subtraction now
follows the finer resolution of the operands (capped at microseconds) instead
of always timedelta64[us]. Behavior on pandas < 3.0.0 is unchanged.
How was this patch tested?
Unitetests added.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.8