fix(caching): include shape/dtype/schema in fingerprint hashing #1617

Closed
elijahbenizzy wants to merge 1 commit into main from fix/caching-fingerprint-collisions

@elijahbenizzy
Contributor

Summary

The numpy, pandas, and polars fingerprint hashes in hamilton/caching/fingerprinting.py previously hashed only the underlying values, dropping shape, dtype, column names, and Series names. This caused cache key collisions between semantically distinct inputs:

  • np.array([1,2,3,4,5,6]) vs np.array([[1,2,3],[4,5,6]]) (same .tobytes(), different shapes)
  • np.array([1,2,3], np.int32) vs np.array([1,0,2,0,3,0], np.int16) (same bytes on little-endian hardware, different dtypes)
  • pd.DataFrame({"customer_revenue": [100,200]}) vs pd.DataFrame({"product_cost": [100,200]}) (same row values, different column names)
  • Same pattern for polars

With caching enabled, the cache silently returned the first input's result for the second input, producing incorrect outputs with no warning.
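
The numpy collisions above can be reproduced with a minimal sketch. `values_only_fingerprint` is a hypothetical stand-in for a value-only hash, not the actual Hamilton function, and the explicit little-endian dtypes (`<i4`, `<i2`) are an assumption made so the byte layout is the same on any machine.

```python
import hashlib

import numpy as np


def values_only_fingerprint(arr: np.ndarray) -> str:
    """Hypothetical value-only hash: ignores shape and dtype."""
    return hashlib.sha256(arr.tobytes()).hexdigest()


flat = np.array([1, 2, 3, 4, 5, 6])
grid = np.array([[1, 2, 3], [4, 5, 6]])

# Same raw buffer, different shapes: the hashes collide.
shape_collision = values_only_fingerprint(flat) == values_only_fingerprint(grid)

wide = np.array([1, 2, 3], dtype="<i4")
narrow = np.array([1, 0, 2, 0, 3, 0], dtype="<i2")

# Same raw buffer, different dtypes: the hashes collide again.
dtype_collision = values_only_fingerprint(wide) == values_only_fingerprint(narrow)

print(shape_collision, dtype_collision)
```

Both comparisons come out equal because `.tobytes()` carries no information about how the buffer is meant to be interpreted.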

What this PR does

  • hash_numpy_array: prepends shape and dtype to the byte buffer.
  • hash_polars_dataframe: includes obj.schema alongside row hashes.
  • hash_pandas_obj: includes (column, dtype) pairs (DataFrame) or (name, dtype) (Series/Index) alongside row hashes.
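
As a rough illustration of the numpy change, the sketch below feeds shape and dtype into the hash ahead of the raw bytes. `fingerprint_array`, the choice of SHA-256, and the `|` separator are assumptions for illustration; the PR's `hash_numpy_array` may encode the metadata differently.

```python
import hashlib

import numpy as np


def fingerprint_array(arr: np.ndarray) -> str:
    """Hypothetical helper: hash a shape/dtype header, then the raw bytes."""
    h = hashlib.sha256()
    # Metadata header, e.g. b"(2, 3)|<i8|". The separators keep the fields apart.
    h.update(repr(arr.shape).encode())
    h.update(b"|")
    h.update(arr.dtype.str.encode())
    h.update(b"|")
    # tobytes() returns the values in C order regardless of memory layout.
    h.update(arr.tobytes())
    return h.hexdigest()


flat = np.array([1, 2, 3, 4, 5, 6])
grid = np.array([[1, 2, 3], [4, 5, 6]])
wide = np.array([1, 2, 3], dtype="<i4")
narrow = np.array([1, 0, 2, 0, 3, 0], dtype="<i2")

# The formerly colliding pairs now get distinct fingerprints.
shapes_differ = fingerprint_array(flat) != fingerprint_array(grid)
dtypes_differ = fingerprint_array(wide) != fingerprint_array(narrow)
# Identical inputs still hash identically.
deterministic = fingerprint_array(flat) == fingerprint_array(flat.copy())
```

Note that `dtype.str` does not capture field names of structured dtypes, so a production version would likely need a richer dtype description.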

Test plan

  • 8 new regression tests in tests/caching/test_fingerprinting.py covering each collision case + identical-input determinism.
  • Updated test_hash_pandas and test_hash_numpy golden hashes for the new schema-aware computation.
  • All 117 tests/caching/ tests pass locally.

The numpy, pandas, and polars fingerprint hashes in hamilton/caching/fingerprinting.py previously hashed only the underlying values, dropping shape, dtype, column names, and Series names. This caused cache key collisions between semantically distinct inputs:

  * np.array([1,2,3,4,5,6])     vs np.array([[1,2,3],[4,5,6]])
    (same .tobytes(), different shapes)
  * np.array([1,2,3], np.int32) vs np.array([1,0,2,0,3,0], np.int16)
    (same .tobytes() on little-endian hardware, different dtypes)
  * pd.DataFrame({"customer_revenue": [100,200]})
    vs pd.DataFrame({"product_cost": [100,200]})
    (same row values, different column names)
  * pl.DataFrame({"customer_revenue": [100,200]})
    vs pl.DataFrame({"product_cost": [100,200]})
    (same row values, different column names)

With caching enabled, the cache returned the prior result for the second input silently -- producing incorrect outputs with no warning.

The fix incorporates the missing metadata into the hash:
- hash_numpy_array: prepends shape and dtype to the byte buffer.
- hash_polars_dataframe: includes obj.schema alongside row hashes.
- hash_pandas_obj: includes column-name/dtype pairs (DataFrame) or name/dtype (Series/Index) alongside row hashes.
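
For the pandas case, a minimal sketch of the same idea, assuming the row hashes come from `pd.util.hash_pandas_object` (which covers values and index but not column names). `fingerprint_frame` is a hypothetical name and the exact encoding of the (column, dtype) pairs is an assumption, not the PR's code.

```python
import hashlib

import pandas as pd


def fingerprint_frame(df: pd.DataFrame) -> str:
    """Hypothetical helper: hash (column, dtype) pairs alongside row hashes."""
    h = hashlib.sha256()
    # Schema first: one "name:dtype;" entry per column, in column order.
    for column, dtype in df.dtypes.items():
        h.update(f"{column!r}:{dtype};".encode())
    # One uint64 hash per row, computed from the values and the index.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    h.update(row_hashes.to_numpy().tobytes())
    return h.hexdigest()


revenue = pd.DataFrame({"customer_revenue": [100, 200]})
cost = pd.DataFrame({"product_cost": [100, 200]})

# Row hashes alone are identical, since column names are not part of them.
rows_collide = bool(
    (pd.util.hash_pandas_object(revenue).to_numpy()
     == pd.util.hash_pandas_object(cost).to_numpy()).all()
)
# With the schema mixed in, the two frames are distinguished.
frames_differ = fingerprint_frame(revenue) != fingerprint_frame(cost)
```

A Series or Index variant would substitute the object's `name` and `dtype` for the column pairs.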

Adds focused regression tests for each collision case plus identical-input determinism sanity checks. Existing golden hashes for test_hash_pandas / test_hash_numpy are updated to reflect the new schema-aware computation.
@elijahbenizzy
Contributor Author

Fixed in #1616
