@cavedave
Contributor

Phash Implementation: Perceptual Hashing for Image Similarity Detection

Overview

This PR implements perceptual hashing (phash) functionality for AIL, enabling automatic detection of visually similar images. The implementation follows the same architectural pattern as DomHash and HHHash, creating Phash objects that correlate with Images and other similar Phash objects.

Key Features

  • Perceptual hashing: Uses imagehash library to calculate 64-bit perceptual hashes for images
  • Automatic correlation: Creates Phash ↔ Image and Phash ↔ Phash correlations
  • Similarity detection: Finds similar images using Hamming distance (configurable threshold, default: 8)
  • UI integration: Adds Phash object browser and correlation graph visualization
  • Comprehensive testing: 100% coverage for Phash objects, 92%+ for modules

Why Phash Was Chosen

Perceptual hashing (pHash) was selected for image similarity detection in AIL for several reasons:

  • Robustness: Detects visually similar images even after compression, resizing, or minor modifications
  • Efficiency: 64-bit hash enables fast comparison using Hamming distance
  • Proven algorithm: Uses well-established DCT-based approach implemented in imagehash library
  • Consistency: Follows same pattern as existing DomHash/HHHash implementations in AIL
  • Scalability: Lightweight hash values enable efficient storage and comparison

For detailed analysis, see: Image Analysis Document


Files Changed

New Files Created

  1. bin/lib/objects/Phashs.py (122 lines)

    • Phash class: Represents a perceptual hash value
    • Phashs collection class: Manages Phash objects
    • Module-level create() function: Idempotent Phash object creation
  2. bin/modules/ImagePhash.py (71 lines)

    • Processes images from Image queue
    • Calculates phash, stores in Image metadata, creates Phash objects
    • Creates Phash ↔ Image correlations
  3. bin/modules/PhashCorrelation.py (121 lines)

    • Processes Phash objects from queue
    • Finds similar phashes using Hamming distance
    • Creates Phash ↔ Phash correlations
  4. var/www/blueprints/objects_phash.py (74 lines)

    • Flask blueprint for Phash object routes
    • /objects/phashes - List view
    • /objects/phash/post - Form handling
    • /objects/phash/range/json - Chart data
  5. var/www/templates/objects/phash/PhashDaterange.html (164 lines)

    • Jinja2 template for Phash object browser UI
    • Date range filtering and object listing
  6. tests/test_objects_phashes.py (339 lines)

    • Comprehensive unit tests for Phash objects
    • 27 tests covering all methods and edge cases
    • 100% code coverage

Modified Files

  1. bin/lib/objects/Images.py

    • Added calculate_phash(): Calculates phash using imagehash.phash()
    • Added get_phash(): Lazy-loads phash (calculates if not cached)
    • Added set_phash(): Stores phash in Image metadata
    • Added compare_phash(): Calculates Hamming distance between two phashes
    • Added find_similar_images_by_phash(): Utility function for finding similar images
  2. bin/lib/objects/Screenshots.py

    • Added calculate_phash(), get_phash(), set_phash() (same as Images)
  3. bin/lib/correlations_engine.py

    • Added "phash": ["image", "phash"] to CORRELATION_TYPES_BY_OBJ
    • Enables Phash ↔ Image and Phash ↔ Phash correlations
  4. bin/lib/objects/ail_objects.py

    • Registered Phash in OBJECTS_CLASS dictionary
  5. bin/lib/ail_core.py

    • Added 'phash' to AIL_OBJECTS set
    • Added 'phash' to AIL_OBJECTS_CORRELATIONS_DEFAULT set
  6. configs/modules.cfg

    • Added [ImagePhash] section with queue configuration
    • Added [PhashCorrelation] section with queue configuration
  7. bin/LAUNCH.sh

    • Added ImagePhash module to launch sequence
    • Added PhashCorrelation module to launch sequence
  8. configs/core.cfg.sample

    • Added phash_max_hamming_distance = 8 configuration option
  9. var/www/Flask_server.py

    • Imported objects_phash blueprint
    • Registered objects_phash blueprint with Flask app
  10. tests/test_modules.py

    • Added TestModuleImagePhash test class (8 tests)
    • Added TestModulePhashCorrelation test class (3 tests)
    • Added TestModulePhashCorrelationFindSimilar test class (6 tests)
    • Added TestModulePhashCorrelationCompute test class (8 tests)
    • Total: 25 module tests

Architecture: Following DomHash/HHHash Pattern

Phash follows the same object pattern as DomHash and HHHash:

  • Object Class: Phash extends AbstractDaterangeObject
  • Collection Class: Phashs extends AbstractDaterangeObjects
  • Hash Value = Object ID: Phash value becomes the object's unique identifier
  • Correlation via add(): Creates Phash ↔ Image correlation automatically
  • Module-based Creation: Phash objects are created by a dedicated module (ImagePhash), unlike DomHash/HHHash, which are created inline

Key Difference: Phash adds similarity matching (Hamming distance) which DomHash/HHHash don't need (they use exact matching).


Correlations

Phash ↔ Image Correlation

  • Created automatically when ImagePhash module processes an image
  • Uses phash_obj.add(date, image) method
  • Bidirectional: Phash object shows correlated Images, Image shows correlated Phash

Phash ↔ Phash Correlation

  • Created by PhashCorrelation module
  • Finds similar phashes using Hamming distance ≤ phash_max_hamming_distance
  • Uses phash_obj.add_correlation('phash', '', similar_phash_id)
  • Enables graph visualization of similar images

Algorithm Details

Perceptual Hashing

  • Library: imagehash (external dependency)
  • Algorithm: DCT-based perceptual hash
  • Output: 64-bit hash as 16-character hex string (e.g., c6073f39b0949d4b)

Hamming Distance

  • Library: imagehash built-in subtraction operator
  • Range: 0-64 (0 = identical hashes, 64 = all bits differ)
  • Default Threshold: 8 (configurable via phash_max_hamming_distance)
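
As a rough illustration of the comparison described above, the Hamming distance between two 16-character hex phashes can be computed by XOR-ing them as integers and counting the set bits. This is only a sketch: the PR relies on imagehash's subtraction operator, and the helper name `hamming_distance` and the second hash value are made up for the example.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded 64-bit hashes."""
    if len(hash_a) != 16 or len(hash_b) != 16:
        raise ValueError("expected 16-character hex strings (64 bits)")
    # XOR leaves a 1 bit wherever the two hashes disagree
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# The example hash from this description, and a hypothetical near-duplicate
# whose last hex digit was changed from b to 8
original = "c6073f39b0949d4b"
modified = "c6073f39b0949d48"

distance = hamming_distance(original, modified)
is_similar = distance <= 8  # 8 is the default phash_max_hamming_distance
```

With the default threshold, these two hashes would be treated as similar.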

Testing

Test Coverage

  • Phash Objects: 100% coverage (27 tests)
  • ImagePhash Module: 92% coverage (8 tests)
  • PhashCorrelation Module: 96% coverage (17 tests)
  • Total: 52 tests

Running Tests

# Run all phash-related tests
python3 -m nose2 tests.test_objects_phashes tests.test_modules.TestModuleImagePhash tests.test_modules.TestModulePhashCorrelation -v

# With coverage
python3 -m nose2 tests.test_objects_phashes tests.test_modules.TestModuleImagePhash tests.test_modules.TestModulePhashCorrelation --coverage bin/lib/objects/Phashs.py bin/modules/ImagePhash.py bin/modules/PhashCorrelation.py --with-coverage --coverage-report term-missing

Configuration

Required Configuration

Add to configs/core.cfg:

[Images]
phash_max_hamming_distance = 8
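
For illustration, the option above could be read with Python's standard `configparser` as sketched below. AIL has its own configuration loader, so this is not the code path used in the PR; the function name `load_max_hamming_distance` and the clamping to 0-64 are assumptions of this sketch.

```python
import configparser

DEFAULT_MAX_DISTANCE = 8  # default threshold stated in this PR


def load_max_hamming_distance(cfg_text: str) -> int:
    """Read phash_max_hamming_distance from the [Images] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    value = parser.getint(
        "Images", "phash_max_hamming_distance", fallback=DEFAULT_MAX_DISTANCE
    )
    # Assumption of this sketch: keep the value inside the valid 0-64 range
    return max(0, min(64, value))


threshold = load_max_hamming_distance("[Images]\nphash_max_hamming_distance = 8\n")
```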

Module Configuration

Already added to configs/modules.cfg:

[ImagePhash]
subscribe = Image
publish = PhashCorrelation

[PhashCorrelation]
subscribe = PhashCorrelation

Dependencies

Optional Dependencies

  • imagehash - Perceptual hashing library
  • PIL (Pillow) - Image processing

Both are handled gracefully if missing: the phash functions return None instead of raising.
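
A minimal sketch of how such optional-dependency handling can look, assuming a standalone `calculate_phash(path)` helper (the real methods in the PR live on the object classes and may differ). `imagehash.phash()` and `PIL.Image.open()` are the actual library calls; everything else is illustrative.

```python
try:
    import imagehash
    from PIL import Image
except ImportError:
    # Either library is missing: disable phash calculation
    imagehash = None
    Image = None


def calculate_phash(path):
    """Return the phash of an image file as a hex string, or None."""
    if imagehash is None or Image is None:
        return None
    try:
        with Image.open(path) as img:
            # str() of an ImageHash gives its hex representation
            return str(imagehash.phash(img))
    except OSError:
        # Missing, unreadable, or unidentifiable image file
        return None


result = calculate_phash("/nonexistent/path/image.png")
```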


Performance Considerations

  • Phash calculation: Performed once per image, cached in metadata
  • Similarity search: O(n) scan of all Phash objects (acceptable for current scale)
  • Future optimization: Could add indexing or approximate nearest neighbor search for large datasets

Note: Performance analysis still needs to be carried out on a large volume of data. Initial tests on small datasets (<100 files) and the library documentation indicate that both operations (phash calculation and Hamming distance) are fast.
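
To make the O(n) cost concrete, the similarity search amounts to something like the linear scan below: one Hamming comparison per stored phash for every lookup. This is a simplified sketch over a plain list of hex strings, not the PhashCorrelation code; `find_similar` and the sample hashes other than the first are hypothetical.

```python
def find_similar(target, known_hashes, max_distance=8):
    """Linear scan: compare target against every known hex phash."""
    target_bits = int(target, 16)
    matches = []
    for candidate in known_hashes:
        if candidate == target:
            continue  # skip the hash itself
        distance = bin(target_bits ^ int(candidate, 16)).count("1")
        if distance <= max_distance:
            matches.append((candidate, distance))
    # Closest matches first
    return sorted(matches, key=lambda match: match[1])


known = ["c6073f39b0949d48", "0000000000000000", "c6073f39b0949d4b"]
similar = find_similar("c6073f39b0949d4b", known)
```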


Known Limitations

  1. Old images: Images imported before phash implementation won't have phash until:

    • ImagePhash module processes them (if re-queued)
    • Manual reprocessing via backfill script
    • This is expected behavior
  2. Correlation display: "Direct Correlations" shows phash: 0 for images because:

    • Phash correlations are stored as Phash ↔ Image (not Image ↔ Phash in direct view)
    • Graph view correctly shows phash correlations
    • This matches DomHash/HHHash pattern

Breaking Changes

None. This is a new feature addition with no breaking changes to existing functionality.


Screenshots

Screenshot from 2026-01-27 13-23-00

Checklist

  • Code follows existing patterns (DomHash/HHHash)
  • All tests passing
  • High test coverage (100% Phash, 92%+ modules)
  • Error handling implemented

@Terrtia self-requested a review January 28, 2026 11:06
@adulau
Member

adulau commented Jan 29, 2026

Thanks a lot for the pull-request.

We are currently discussing this aspect:

  • Similarity search: O(n) scan of all Phash objects (acceptable for current scale)

Could we imagine that we do BITOPT directly and structure it in a tree to make faster lookup especially when we have a lot of images?

Maybe we could already keep your version as-is (after a review @Terrtia ) and update the processing model later to have an efficient storage structure to match the binary value in a tree-like structure (or other).
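
One well-known tree-like structure for this kind of lookup is a BK-tree, which uses the triangle inequality of the Hamming distance to skip whole subtrees. The sketch below is only an in-memory illustration of that idea. It is not the structure proposed for AIL, it does not cover persistent storage, and all names in it are hypothetical.

```python
class BKTree:
    """In-memory BK-tree over 64-bit integers using Hamming distance."""

    def __init__(self):
        self.root = None  # each node is [value, {edge_distance: child_node}]

    @staticmethod
    def distance(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # value already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query: int, max_distance: int):
        """Return (value, distance) pairs within max_distance of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= max_distance:
                results.append((value, d))
            # Triangle inequality: only edges in [d - max, d + max] can match
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return results


tree = BKTree()
values = [0xC6073F39B0949D4B, 0xC6073F39B0949D48, 0x0,
          0xFFFFFFFFFFFFFFFF, 0xC6073F39B0949C4B]
for v in values:
    tree.add(v)
found = sorted(tree.search(0xC6073F39B0949D4B, 8))
```

How much this helps in practice depends on the distribution of hashes and on the threshold, so it would need the same large-volume testing as the linear scan.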

"gtracker": ["domain", "item"],
"hhhash": ["domain"],
"image": ["barcode", "chat", "chat-subchannel", "chat-thread", "message", "ocr", "qrcode", "user-account"], # TODO subchannel + threads ????
"image": ["barcode", "chat", "chat-subchannel", "chat-thread", "message", "ocr", "phash", "qrcode", "user-account", "image", "screenshot"], # TODO subchannel + threads ????
Member

the image object shouldn't correlate with another image or screenshot.

"phash": ["image", "phash"], is used to correlate phash with images and screenshots

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the new code

"pgp": ["chat", "domain", "item", "message", "ocr"],
"qrcode": ["chat", "cve", "cryptocurrency", "decoded", "domain", "image", "message", "screenshot"], # "chat-subchannel", "chat-thread" ?????
"screenshot": ["barcode", "domain", "item", "qrcode"],
"screenshot": ["barcode", "domain", "item", "qrcode", "image"],
Member

a screenshot object shouldn't correlate with an image.

@@ -0,0 +1,121 @@
#!/usr/bin/env python3
Member

imagehash is missing from the requirements.

description = description.replace("`", ' ')
return description

def calculate_phash(self):
Member

All pHash-related functions should be moved to lib/objects/Phashs.
This would make the code clearer, easier to maintain, and help avoid import issues.

Since pHash is now implemented as a dedicated object, it can be retrieved by the correlation engine using:
self.get_correlation('phash').get('phash')

model = get_default_image_description_model()
return self._get_field(f'desc:{model}')

def calculate_phash(self):
Member

All pHash-related functions should be moved to lib/objects/Phashs.
This would make the code clearer, easier to maintain, and help avoid import issues.

Since pHash is now implemented as a dedicated object, it can be retrieved by the correlation engine using:
self.get_correlation('phash').get('phash')

Contributor Author

I did this in commit 6556837

@@ -0,0 +1,70 @@
#!/usr/bin/env python3
Member

I would prefer if the ImagePhash and PhashCorrelation modules were merged into a single module to reduce overhead.

The combined module should first check whether the image already has an existing pHash correlation using:
image.exists_correlation('phash')

If no pHash exists, it should:

  • Create a new pHash object
  • Add a new correlation between the image and the created pHash
  • Compare it with existing pHashes
  • Add a direct correlation between the two pHashes if the Hamming distance is below the threshold
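
The flow suggested above could be sketched roughly as follows, using an in-memory stand-in instead of AIL's objects and correlation engine. `InMemoryStore`, `process_image`, and the sample identifiers are hypothetical. The comparison uses `<=` to match the "Hamming distance ≤ phash_max_hamming_distance" rule in the PR description.

```python
def hamming(hash_a: str, hash_b: str) -> int:
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


class InMemoryStore:
    """Hypothetical stand-in for AIL's correlation storage."""

    def __init__(self):
        self.image_to_phash = {}   # image id -> phash (image/phash correlation)
        self.phash_links = set()   # frozenset pairs (phash/phash correlation)


def process_image(store, image_id, phash, max_distance=8):
    """Single-module flow: returns False if the image was already handled."""
    # 1. Skip images that already have a phash correlation
    if image_id in store.image_to_phash:
        return False
    # 2. Record the image/phash correlation
    known = set(store.image_to_phash.values())
    store.image_to_phash[image_id] = phash
    # 3. Compare with the phashes known before this image and link close ones
    for other in known:
        if other != phash and hamming(phash, other) <= max_distance:
            store.phash_links.add(frozenset((phash, other)))
    return True


store = InMemoryStore()
first = process_image(store, "img1", "c6073f39b0949d4b")
second = process_image(store, "img2", "c6073f39b0949d48")
repeat = process_image(store, "img1", "c6073f39b0949d4b")
unrelated = process_image(store, "img3", "0000000000000000")
```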

Contributor Author

I agree with this. This is a better design

@adulau
Member

adulau commented Jan 29, 2026

@cavedave We tested a rewrite with Copilot for the tree-like structure. If you want to review it, feel free.

#319

This sounds sane but some tests from your side would be awesome.

@cavedave
Contributor Author

@adulau @Terrtia I agree with these suggestions. It will take me a few days to go through these fixes

- **Consistency**: Follows same pattern as existing DomHash/HHHash implementations in AIL
- **Scalability**: Lightweight hash values enable efficient storage and comparison

For detailed analysis, see: [Image Analysis Document](file:///home/david-curran/Nextcloud/Hoplite/docs/ImageAnalysisDec.docx)
Member

Just for your information, maybe you want to put the document in the repo rather than reference the local Nextcloud directory.

Contributor Author

Is there a good location to share this document with you? I am not sure it should be public, so I am not sure it should be in the repo.

Member

So we can keep it in the HOPLITE repo and remove the reference in this markdown. Thank you!
