fix(caching): include shape/dtype/schema in fingerprint hashing by elijahbenizzy · Pull Request #1617 · apache/hamilton

elijahbenizzy · 2026-05-30T02:08:37Z

Summary

The numpy, pandas, and polars fingerprint hashes in hamilton/caching/fingerprinting.py previously hashed only the underlying values, dropping shape, dtype, column names, and Series names. This caused cache key collisions between semantically distinct inputs:

np.array([1,2,3,4,5,6]) vs np.array([[1,2,3],[4,5,6]]) (same .tobytes(), different shapes)
np.array([1,2,3], dtype=np.int32) vs np.array([1,0,2,0,3,0], dtype=np.int16) (same .tobytes() on little-endian platforms, different dtypes)
pd.DataFrame({"customer_revenue": [100,200]}) vs pd.DataFrame({"product_cost": [100,200]}) (same row values, different column names)
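pl.DataFrame({"customer_revenue": [100,200]}) vs pl.DataFrame({"product_cost": [100,200]}) (same row values, different column names)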

With caching enabled, the cache silently returned the first input's cached result for the second input, producing incorrect outputs with no warning.
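
The two numpy collisions above can be reproduced with a few lines. This is a minimal sketch, not the code from hamilton/caching/fingerprinting.py: `values_only_hash` is a hypothetical stand-in for a values-only fingerprint, and the use of SHA-256 is an assumption. The dtypes are pinned to little-endian so the byte layouts match on any platform.

```python
import hashlib

import numpy as np


def values_only_hash(arr: np.ndarray) -> str:
    """Hypothetical values-only fingerprint: hashes the raw bytes and nothing else."""
    return hashlib.sha256(arr.tobytes()).hexdigest()


# Same six values, different shapes: (6,) vs (2, 3).
flat = np.array([1, 2, 3, 4, 5, 6], dtype="<i8")
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype="<i8")
shape_collision = values_only_hash(flat) == values_only_hash(grid)

# Same twelve bytes, different dtypes: three int32 values vs six int16 values.
a32 = np.array([1, 2, 3], dtype="<i4")
a16 = np.array([1, 0, 2, 0, 3, 0], dtype="<i2")
dtype_collision = values_only_hash(a32) == values_only_hash(a16)

print(shape_collision, dtype_collision)
```

Both comparisons come out equal, which is the failure mode described above: two different inputs map to one cache key.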

What this PR does

hash_numpy_array: prepends shape and dtype to the byte buffer.
hash_polars_dataframe: includes obj.schema alongside row hashes.
hash_pandas_obj: includes (column, dtype) pairs (DataFrame) or (name, dtype) (Series/Index) alongside row hashes.
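
As an illustration of the numpy change, the sketch below prepends shape and dtype to the byte buffer before hashing. It is written under assumptions: `schema_aware_numpy_hash` is a hypothetical name, SHA-256 and the NUL separators are illustrative choices, and the actual `hash_numpy_array` in this PR may encode the metadata differently.

```python
import hashlib

import numpy as np


def schema_aware_numpy_hash(arr: np.ndarray) -> str:
    """Hypothetical fingerprint that covers shape and dtype as well as values."""
    h = hashlib.sha256()
    # Metadata first, each field followed by a separator so that fields
    # cannot run into one another.
    h.update(repr(arr.shape).encode())
    h.update(b"\x00")
    h.update(arr.dtype.str.encode())
    h.update(b"\x00")
    # Then the raw values in C order.
    h.update(arr.tobytes())
    return h.hexdigest()


flat = np.array([1, 2, 3, 4, 5, 6], dtype="<i8")
grid = np.array([[1, 2, 3], [4, 5, 6]], dtype="<i8")
a32 = np.array([1, 2, 3], dtype="<i4")
a16 = np.array([1, 0, 2, 0, 3, 0], dtype="<i2")

shapes_differ = schema_aware_numpy_hash(flat) != schema_aware_numpy_hash(grid)
dtypes_differ = schema_aware_numpy_hash(a32) != schema_aware_numpy_hash(a16)
deterministic = schema_aware_numpy_hash(flat) == schema_aware_numpy_hash(flat.copy())
```

With the metadata included, the two colliding pairs get distinct fingerprints while identical inputs still hash the same.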

Test plan

8 new regression tests in tests/caching/test_fingerprinting.py covering each collision case + identical-input determinism.
Updated test_hash_pandas and test_hash_numpy golden hashes for the new schema-aware computation.
All 117 tests/caching/ tests pass locally.
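
A regression check for the pandas column-name case might look roughly like the following. This is a sketch under assumptions, not one of the tests in tests/caching/test_fingerprinting.py: `schema_aware_pandas_hash` is a hypothetical helper, and combining `pd.util.hash_pandas_object` row hashes with (column, dtype) pairs through SHA-256 is only one possible way to realize the description above.

```python
import hashlib

import pandas as pd


def schema_aware_pandas_hash(df: pd.DataFrame) -> str:
    """Hypothetical fingerprint: (column, dtype) pairs plus per-row hashes."""
    h = hashlib.sha256()
    for column, dtype in df.dtypes.items():
        h.update(f"{column!r}:{dtype}".encode())
        h.update(b"\x00")
    # hash_pandas_object returns one uint64 per row; column names are not included.
    row_hashes = pd.util.hash_pandas_object(df)
    h.update(row_hashes.to_numpy().tobytes())
    return h.hexdigest()


revenue = pd.DataFrame({"customer_revenue": [100, 200]})
cost = pd.DataFrame({"product_cost": [100, 200]})

# Row hashes alone cannot tell the two frames apart.
rows_collide = pd.util.hash_pandas_object(revenue).equals(
    pd.util.hash_pandas_object(cost)
)
# Adding the schema separates them, and identical input stays deterministic.
names_differ = schema_aware_pandas_hash(revenue) != schema_aware_pandas_hash(cost)
deterministic = schema_aware_pandas_hash(revenue) == schema_aware_pandas_hash(revenue.copy())
```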

elijahbenizzy · 2026-05-30T18:04:18Z

Fixed in #1616

elijahbenizzy closed this May 30, 2026

elijahbenizzy wants to merge 1 commit into main from fix/caching-fingerprint-collisions
