[SPARK-57574][PANDAS] Support the TIME data type in pandas API on Spark#56635
[SPARK-57574][PANDAS] Support the TIME data type in pandas API on Spark#56635marcuslin123 wants to merge 6 commits into
Conversation
MaxGekk
left a comment
There was a problem hiding this comment.
3 blocking, 2 non-blocking, 0 nits.
The implementation faithfully mirrors the verified DateOps analogue and looks correct. The blockers are all test wiring/coverage: the new tests don't run in CI, there's no Spark Connect parity test, and the custom astype is untested.
Design / architecture (3)
dev/sparktestsupport/modules.py(~line 905):test_time_opsis not registered, so CI never collects or runs it (discovery uses explicit goal lists, not globbing). Add it next totest_date_ops/test_datetime_ops. [blocking]- new file
python/pyspark/pandas/tests/connect/data_type_ops/test_parity_time_ops.py: no Spark Connect parity test. Every peerdata_type_opstest has a ~10-linetest_parity_*subclass registered under the connect module (modules.py:1340-1341); without one,TimeOpsis untested under Spark Connect. [blocking] python/docs/source/tutorial/pandas_on_spark/types.rst(~line 190): add thedatetime.time → TimeTyperow next to the existingdatetime.date → DateType. [non-blocking]
Correctness (2)
- time_ops.py:62: custom
astypehas no test — see inline. [blocking] - test_time_ops.py:27: coverage gaps vs the DateOps suite (eq/ne, isnull, value round-trip, mixed-type TypeError) — see inline. [non-blocking]
| _sanitize_list_like(right) | ||
| return column_op(PySparkColumn.__gt__)(left, right) | ||
|
|
||
| def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike: |
There was a problem hiding this comment.
astype is the only custom (non-inherited) logic in TimeOps — categorical / bool / string / other branches — but the suite has no test_astype. test_date_ops.py:190 tests astype(str), astype(bool), and a categorical cast; please mirror it.
The string branch is the one to watch: null_str=str(None) plus Spark CAST(TIME AS STRING) is exactly where pandas-vs-Spark formatting can diverge for sub-second precision (pandas str(time(.., 500000)) → "...:00.500000" vs Spark "...:00.5"). A test_astype would confirm or refute this.
| from pyspark.pandas.tests.data_type_ops.testing_utils import OpsTestBase | ||
|
|
||
|
|
||
| class TimeOpsTestsMixin: |
There was a problem hiding this comment.
This suite covers arithmetic rejection and the four ordering comparisons, but is missing cases the peer DateOpsTestsMixin has:
test_eq/test_ne— eq/ne are inherited and reachable forTimeTypebut never exercised here.test_isnull.test_from_to_pandas— nothing asserts the spark→pandas round-trip of actual TIME values (the newTimeType → objectmapping); the comparison tests only assert boolean results.- The peer comparison tests also assert that a pandas-Series RHS raises
TypeError(e.g.self.assertRaises(TypeError, lambda: psdf["this"] == pdf["this"])); worth adding here too.
…s, TypeError assertions, docs update
…, not arithmetic)
| bool BooleanType | ||
| datetime.datetime TimestampType | ||
| datetime.date DateType | ||
| datetime.time TimeType |
There was a problem hiding this comment.
We should probably make it properly supported in PySpark itself first.
What changes were proposed in this pull request?
Add support for
TimeTypecolumns in pandas API on Spark (pyspark.pandas):datetime.timetoTimeTypein the dtype translation layer (typehints.py)TimeTypetonp.dtype("object")for pandas representationTimeOpsclass for column operations (comparisons supported, arithmetic rejected)TimeOpsin the dispatch system (base.py)Why are the changes needed?
pyspark.pandasdoes not handleTimeType— the Spark-to-pandas dtype machinery treatsdatetime.timeas a generic object with no explicit mapping. Without these changes, creating a pandas-on-Spark DataFrame withdatetime.timevalues fails, and column operations on TIME columns crash withTypeError.The underlying Arrow conversion already supports TIME (SPARK-53263 / SPARK-53305), so this wires up the remaining pyspark.pandas layer.
Does this PR introduce any user-facing change?
Yes. Users can now work with
TimeTypecolumns inpyspark.pandas:Previously this would fail with a TypeError or produce incorrect results.
How was this patch tested?
datetime.timemapping totest_typedef.pytest_time_ops.pycovering arithmetic rejection and comparison operationspython/run-tests --testnames pyspark.pandas.tests.test_typedefandpython/run-tests --testnames pyspark.pandas.tests.data_type_ops.test_time_opsWas this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (used as an assistive tool for implementation guidance)