
[SPARK-57574][PANDAS] Support the TIME data type in pandas API on Spark #56635

Open
marcuslin123 wants to merge 6 commits into apache:master from marcuslin123:SPARK-57574-time-type-pandas


@marcuslin123 marcuslin123 commented Jun 20, 2026


What changes were proposed in this pull request?

Add support for TimeType columns in pandas API on Spark (pyspark.pandas):

  • Map datetime.time to TimeType in the dtype translation layer (typehints.py)
  • Map TimeType to np.dtype("object") for pandas representation
  • Create TimeOps class for column operations (comparisons supported, arithmetic rejected)
  • Register TimeOps in the dispatch system (base.py)
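The "comparisons supported, arithmetic rejected" behaviour matches what plain Python already does for datetime.time. The sketch below uses only the standard library to illustrate those semantics; it does not call TimeOps or pyspark.pandas, and the helper name arithmetic_is_rejected is made up for this illustration.

```python
import datetime

early = datetime.time(8, 0)
late = datetime.time(14, 0)

# Ordering comparisons are defined between naive datetime.time values.
early_before_late = early < late
late_after_noon = late >= datetime.time(12, 0)


def arithmetic_is_rejected(op) -> bool:
    """Hypothetical helper: return True if calling op() raises TypeError."""
    try:
        op()
    except TypeError:
        return True
    return False


# datetime.time defines no +, -, or * operators, so each of these raises TypeError.
add_rejected = arithmetic_is_rejected(lambda: early + late)
sub_rejected = arithmetic_is_rejected(lambda: late - early)
mul_rejected = arithmetic_is_rejected(lambda: early * 2)
```

Rejecting arithmetic on the Spark side keeps a TIME column consistent with the Python values it is built from.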

Why are the changes needed?

pyspark.pandas does not handle TimeType — the Spark-to-pandas dtype machinery treats datetime.time as a generic object with no explicit mapping. Without these changes, creating a pandas-on-Spark DataFrame with datetime.time values fails, and column operations on TIME columns crash with TypeError.

The underlying Arrow conversion already supports TIME (SPARK-53263 / SPARK-53305), so this wires up the remaining pyspark.pandas layer.

Does this PR introduce any user-facing change?

Yes. Users can now work with TimeType columns in pyspark.pandas:

import pyspark.pandas as ps
import datetime

df = ps.DataFrame({"shift_start": [datetime.time(8, 0), datetime.time(14, 0)]})
df["shift_start"].dtype  # returns object
afternoon = df[df["shift_start"] > datetime.time(12, 0)]  # comparisons work

Previously this would fail with a TypeError or produce incorrect results.

How was this patch tested?

  • Added datetime.time mapping to test_typedef.py
  • Added new test_time_ops.py covering arithmetic rejection and comparison operations
  • All tests pass locally: python/run-tests --testnames pyspark.pandas.tests.test_typedef and python/run-tests --testnames pyspark.pandas.tests.data_type_ops.test_time_ops

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (used as an assistive tool for implementation guidance)

@MaxGekk MaxGekk left a comment

Member


3 blocking, 2 non-blocking, 0 nits.
The implementation faithfully mirrors the verified DateOps analogue and looks correct. The blockers are all test wiring/coverage: the new tests don't run in CI, there's no Spark Connect parity test, and the custom astype is untested.

Design / architecture (3)

  • dev/sparktestsupport/modules.py (~line 905): test_time_ops is not registered, so CI never collects or runs it (discovery uses explicit goal lists, not globbing). Add it next to test_date_ops/test_datetime_ops. [blocking]
  • new file python/pyspark/pandas/tests/connect/data_type_ops/test_parity_time_ops.py: no Spark Connect parity test. Every peer data_type_ops test has a ~10-line test_parity_* subclass registered under the connect module (modules.py:1340-1341); without one, TimeOps is untested under Spark Connect. [blocking]
  • python/docs/source/tutorial/pandas_on_spark/types.rst (~line 190): add the datetime.time → TimeType row next to the existing datetime.date → DateType. [non-blocking]

Correctness (2)

  • time_ops.py:62: custom astype has no test — see inline. [blocking]
  • test_time_ops.py:27: coverage gaps vs the DateOps suite (eq/ne, isnull, value round-trip, mixed-type TypeError) — see inline. [non-blocking]

_sanitize_list_like(right)
return column_op(PySparkColumn.__gt__)(left, right)

def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:

Member


astype is the only custom (non-inherited) logic in TimeOps — categorical / bool / string / other branches — but the suite has no test_astype. test_date_ops.py:190 tests astype(str), astype(bool), and a categorical cast; please mirror it.

The string branch is the one to watch: null_str=str(None) plus Spark CAST(TIME AS STRING) is exactly where pandas-vs-Spark formatting can diverge for sub-second precision (pandas str(time(.., 500000)) gives "...:00.500000" vs Spark "...:00.5"). A test_astype would confirm or refute this.
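For reference, the Python side of that comparison can be checked with the standard library alone. In the sketch below, trim_fraction is a hypothetical helper that only imitates the trimmed form Spark is suspected of producing; the actual CAST(TIME AS STRING) output is not verified here and still needs the test_astype case.

```python
import datetime


def trim_fraction(text: str) -> str:
    """Hypothetical helper: drop trailing zeros from a fractional-seconds suffix."""
    if "." not in text:
        return text
    return text.rstrip("0").rstrip(".")


half_second = datetime.time(8, 0, 0, 500000)

# Python pads a non-zero microsecond field to six digits.
python_text = str(half_second)

# Assumed shape of the trimmed form; not produced by Spark in this sketch.
trimmed_text = trim_fraction(python_text)

# With zero microseconds, Python omits the fraction entirely.
whole_second_text = str(datetime.time(8, 0, 0))
```

If the two strings differ for half_second, astype(str) results would not match pandas for any value with sub-second precision.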

from pyspark.pandas.tests.data_type_ops.testing_utils import OpsTestBase


class TimeOpsTestsMixin:

Member


This suite covers arithmetic rejection and the four ordering comparisons, but is missing cases the peer DateOpsTestsMixin has:

  • test_eq / test_ne — eq/ne are inherited and reachable for TimeType but never exercised here.
  • test_isnull.
  • test_from_to_pandas — nothing asserts the spark→pandas round-trip of actual TIME values (the new TimeType → object mapping); the comparison tests only assert boolean results.
  • The peer comparison tests also assert that a pandas-Series RHS raises TypeError (e.g. self.assertRaises(TypeError, lambda: psdf["this"] == pdf["this"])); worth adding here too.

bool BooleanType
datetime.datetime TimestampType
datetime.date DateType
datetime.time TimeType

Member


We should probably make it properly supported in PySpark itself first.
