Add perceptual hashing (phash) support for image similarity detection #318
base: master
Thanks a lot for the pull-request. We are currently discussing this aspect:
Could we imagine that we do [...] Maybe we could already keep your version
| "gtracker": ["domain", "item"], | ||
| "hhhash": ["domain"], | ||
| "image": ["barcode", "chat", "chat-subchannel", "chat-thread", "message", "ocr", "qrcode", "user-account"], # TODO subchannel + threads ???? | ||
| "image": ["barcode", "chat", "chat-subchannel", "chat-thread", "message", "ocr", "phash", "qrcode", "user-account", "image", "screenshot"], # TODO subchannel + threads ???? |
the image object shouldn't correlate with another image or screenshot.
"phash": ["image", "phash"], is used to correlate phash with images and screenshots
This is in the new code
| "pgp": ["chat", "domain", "item", "message", "ocr"], | ||
| "qrcode": ["chat", "cve", "cryptocurrency", "decoded", "domain", "image", "message", "screenshot"], # "chat-subchannel", "chat-thread" ????? | ||
| "screenshot": ["barcode", "domain", "item", "qrcode"], | ||
| "screenshot": ["barcode", "domain", "item", "qrcode", "image"], |
a screenshot object shouldn't correlate with an image.
| @@ -0,0 +1,121 @@ | |||
| #!/usr/bin/env python3 | |||
imagehash is missing from the requirements.
bin/lib/objects/Images.py
Outdated
| description = description.replace("`", ' ') | ||
| return description | ||
|
|
||
| def calculate_phash(self): |
All pHash-related functions should be moved to lib/objects/Phashs.
This would make the code clearer, easier to maintain, and help avoid import issues.
Since pHash is now implemented as a dedicated object, it can be retrieved by the correlation engine using:
self.get_correlation('phash').get('phash')
bin/lib/objects/Screenshots.py
Outdated
| model = get_default_image_description_model() | ||
| return self._get_field(f'desc:{model}') | ||
|
|
||
| def calculate_phash(self): |
All pHash-related functions should be moved to lib/objects/Phashs.
This would make the code clearer, easier to maintain, and help avoid import issues.
Since pHash is now implemented as a dedicated object, it can be retrieved by the correlation engine using:
self.get_correlation('phash').get('phash')
I did this in commit 6556837
| @@ -0,0 +1,70 @@ | |||
| #!/usr/bin/env python3 | |||
I would prefer if the ImagePhash and PhashCorrelation modules were merged into a single module to reduce overhead.
The combined module should first check whether the image already has an existing pHash correlation using:
image.exists_correlation('phash')
If no pHash exists, it should:
- Create a new pHash object
- Add a new correlation between the image and the created pHash
- Compare it with existing pHashes
- Add a direct correlation between the two pHashes if the Hamming distance is below the threshold
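The steps requested in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `PhashPipelineSketch`, its two dictionaries, and `process()` are hypothetical in-memory stand-ins, not AIL's objects or correlation engine, and treating the threshold as inclusive is a guess based on the option name `phash_max_hamming_distance`.

```python
from typing import Dict, Set

# Mirrors the phash_max_hamming_distance = 8 option from the PR description.
MAX_HAMMING_DISTANCE = 8


def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count the differing bits between two hex-encoded 64-bit hashes."""
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")


class PhashPipelineSketch:
    """Hypothetical in-memory stand-in for AIL's object and correlation storage."""

    def __init__(self) -> None:
        self.image_to_phash: Dict[str, str] = {}       # image id -> pHash id
        self.phash_to_phash: Dict[str, Set[str]] = {}  # pHash id -> similar pHash ids

    def exists_correlation(self, image_id: str) -> bool:
        return image_id in self.image_to_phash

    def process(self, image_id: str, phash_hex: str) -> Set[str]:
        """Run the review steps; return the similar pHashes that were linked."""
        # Skip images that already carry a pHash correlation.
        if self.exists_correlation(image_id):
            return set()
        # Step 1: create the pHash object (no-op if this value is already known).
        self.phash_to_phash.setdefault(phash_hex, set())
        # Step 2: correlate the image with its pHash.
        self.image_to_phash[image_id] = phash_hex
        # Steps 3 and 4: compare with every other pHash and link the close ones.
        # Treating the threshold as inclusive is an assumption.
        similar = {
            other for other in self.phash_to_phash
            if other != phash_hex
            and hamming_distance(phash_hex, other) <= MAX_HAMMING_DISTANCE
        }
        for other in similar:
            self.phash_to_phash[phash_hex].add(other)
            self.phash_to_phash[other].add(phash_hex)
        return similar
```

For example, processing a first image with the hash `c6073f39b0949d4b` and a second image whose hash differs in a single bit (`c6073f39b0949d4a`, a made-up value) would link the two pHashes, while processing the second image again would be skipped by the `exists_correlation` check.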
I agree with this. This is a better design
…or retrieval; tests pass
| - **Consistency**: Follows same pattern as existing DomHash/HHHash implementations in AIL | ||
| - **Scalability**: Lightweight hash values enable efficient storage and comparison | ||
|
|
||
| For detailed analysis, see: [Image Analysis Document](file:///home/david-curran/Nextcloud/Hoplite/docs/ImageAnalysisDec.docx) |
Just for your information, you may want to put the document in the repo rather than reference your local Nextcloud directory.
Is there a good location to share this document with you? I am not sure it should be public, so I am not sure it should be in the repo.
So we can keep it in the HOPLITE repo and remove the reference in this markdown. Thank you!
Phash Implementation: Perceptual Hashing for Image Similarity Detection
Overview
This PR implements perceptual hashing (phash) functionality for AIL, enabling automatic detection of visually similar images. The implementation follows the same architectural pattern as DomHash and HHHash, creating Phash objects that correlate with Images and other similar Phash objects.
Key Features
- imagehash library to calculate 64-bit perceptual hashes for images
Why Phash Was Chosen
Perceptual hashing (pHash) was selected for image similarity detection in AIL for several reasons:
- imagehash library
For detailed analysis, see: Image Analysis Document
Files Changed
New Files Created
- bin/lib/objects/Phashs.py (122 lines)
  - Phash class: Represents a perceptual hash value
  - Phashs collection class: Manages Phash objects
  - create() function: Idempotent Phash object creation
- bin/modules/ImagePhash.py (71 lines)
- bin/modules/PhashCorrelation.py (121 lines)
- var/www/blueprints/objects_phash.py (74 lines)
  - /objects/phashes - List view
  - /objects/phash/post - Form handling
  - /objects/phash/range/json - Chart data
- var/www/templates/objects/phash/PhashDaterange.html (164 lines)
- tests/test_objects_phashes.py (339 lines)
Modified Files
- bin/lib/objects/Images.py
  - calculate_phash(): Calculates phash using imagehash.phash()
  - get_phash(): Lazy-loads phash (calculates if not cached)
  - set_phash(): Stores phash in Image metadata
  - compare_phash(): Calculates Hamming distance between two phashes
  - find_similar_images_by_phash(): Utility function for finding similar images
- bin/lib/objects/Screenshots.py
  - calculate_phash(), get_phash(), set_phash() (same as Images)
- bin/lib/correlations_engine.py
  - "phash": ["image", "phash"] to CORRELATION_TYPES_BY_OBJ
- bin/lib/objects/ail_objects.py
  - Phash in OBJECTS_CLASS dictionary
- bin/lib/ail_core.py
  - 'phash' to AIL_OBJECTS set
  - 'phash' to AIL_OBJECTS_CORRELATIONS_DEFAULT set
- configs/modules.cfg
  - [ImagePhash] section with queue configuration
  - [PhashCorrelation] section with queue configuration
- bin/LAUNCH.sh
  - ImagePhash module to launch sequence
  - PhashCorrelation module to launch sequence
- configs/core.cfg.sample
  - phash_max_hamming_distance = 8 configuration option
- var/www/Flask_server.py
  - objects_phash blueprint
  - objects_phash blueprint with Flask app
- tests/test_modules.py
  - TestModuleImagePhash test class (8 tests)
  - TestModulePhashCorrelation test class (3 tests)
  - TestModulePhashCorrelationFindSimilar test class (6 tests)
  - TestModulePhashCorrelationCompute test class (8 tests)
Architecture: Following DomHash/HHHash Pattern
Phash follows the same object pattern as DomHash and HHHash:
- Phash extends AbstractDaterangeObject
- Phashs extends AbstractDaterangeObjects
- add(): Creates Phash ↔ Image correlation automatically
- ImagePhash module (like Phash uses modules, unlike DomHash/HHHash which are inline)
Key Difference: Phash adds similarity matching (Hamming distance) which DomHash/HHHash don't need (they use exact matching).
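As a rough illustration of the object pattern described above, the sketch below shows an idempotent `create()` and an `add()` that records a sighting and links the source image in one call. Everything here is an assumption made for illustration: `DaterangeObjectSketch` is a hypothetical stand-in and not AIL's `AbstractDaterangeObject`, and the `first_seen`/`last_seen` fields, the `YYYYMMDD` date strings, and the `add()` signature are guesses rather than the real interface.

```python
from typing import Dict, Optional, Set, Tuple


class DaterangeObjectSketch:
    """Hypothetical stand-in for a date-range object base class (not AIL's API)."""

    def __init__(self, obj_type: str, obj_id: str) -> None:
        self.type = obj_type
        self.id = obj_id
        self.first_seen: Optional[str] = None
        self.last_seen: Optional[str] = None
        self.correlations: Set[Tuple[str, str]] = set()

    def add(self, date: str, obj_type: str, obj_id: str) -> None:
        """Record a sighting on `date` (YYYYMMDD) and link the source object."""
        # YYYYMMDD strings sort chronologically, so string comparison is enough.
        if self.first_seen is None or date < self.first_seen:
            self.first_seen = date
        if self.last_seen is None or date > self.last_seen:
            self.last_seen = date
        self.correlations.add((obj_type, obj_id))


class PhashSketch(DaterangeObjectSketch):
    """A perceptual hash value identified by its hex string."""

    def __init__(self, phash_hex: str) -> None:
        super().__init__("phash", phash_hex)


# Hypothetical registry standing in for persistent storage.
_PHASHES: Dict[str, PhashSketch] = {}


def create(phash_hex: str) -> PhashSketch:
    """Idempotent creation: the same hash value always maps to one object."""
    if phash_hex not in _PHASHES:
        _PHASHES[phash_hex] = PhashSketch(phash_hex)
    return _PHASHES[phash_hex]
```

Calling `create()` twice with the same hash returns the same object, so two images that produce an identical hash end up correlated with a single Phash.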
Correlations
Phash ↔ Image Correlation
- ImagePhash module processes an image
- phash_obj.add(date, image) method
Phash ↔ Phash Correlation
- PhashCorrelation module
- phash_max_hamming_distance
- phash_obj.add_correlation('phash', '', similar_phash_id)
Algorithm Details
Perceptual Hashing
- imagehash (external dependency)
- c6073f39b0949d4b)
Hamming Distance
- imagehash built-in subtraction operator
- phash_max_hamming_distance)
Testing
Test Coverage
Running Tests
Configuration
Required Configuration
Add to configs/core.cfg:
Module Configuration
Already added to configs/modules.cfg:
Dependencies
Optional Dependencies
- imagehash - Perceptual hashing library
- PIL (Pillow) - Image processing
Both are gracefully handled if missing (functions return None).
Performance Considerations
Note: Performance analysis still needs to be performed on a large volume of data. Initial tests on small datasets (<100 files) and the library documentation indicate that both operations (phash calculation and Hamming distance) are fast.
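One way to start measuring the comparison cost on larger volumes is a small timing script like the one below. It is a sketch under assumptions, not part of the PR: `find_similar` and `time_scan` are hypothetical helpers, hashes are modelled as plain 64-bit integers rather than imagehash objects, and uniformly random values do not reflect the distribution of real image hashes, so the timings only indicate how a linear scan scales.

```python
import random
import time
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two 64-bit integers."""
    return bin(a ^ b).count("1")


def find_similar(target: int, candidates: List[int], max_distance: int = 8) -> List[int]:
    """Linear scan: return every candidate within max_distance bits of target."""
    return [c for c in candidates if hamming_distance(target, c) <= max_distance]


def time_scan(n_hashes: int, seed: int = 0) -> float:
    """Time one linear scan over n_hashes random 64-bit values (synthetic data)."""
    rng = random.Random(seed)
    candidates = [rng.getrandbits(64) for _ in range(n_hashes)]
    target = rng.getrandbits(64)
    start = time.perf_counter()
    find_similar(target, candidates)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Each new hash is compared against every stored hash, so cost grows linearly.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} hashes: {time_scan(n) * 1000:.2f} ms")
```

Because every new hash is compared with all stored hashes, the total work grows quadratically with the number of images, which is the part worth checking on a realistic dataset.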
Known Limitations
- Old images: Images imported before phash implementation won't have phash until:
- Correlation display: "Direct Correlations" shows phash: 0 for images because:
Breaking Changes
None. This is a new feature addition with no breaking changes to existing functionality.
Screenshots
Checklist