Implement perceptual hashing with BK-tree indexing for O(log n) similarity search #319

Draft
Copilot wants to merge 4 commits into master from copilot/implement-phash-with-bk-tree

Copilot AI commented Jan 29, 2026

Problem

AIL needs perceptual hash (phash) support for image similarity detection. A naive implementation enumerates every stored phash on each similarity query, an O(n) scan that does not scale beyond roughly 10k images.

Solution

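Implement a BK-tree (Burkhard-Keller tree) index in KVRocks with Hamming distance as the metric. Pruning via the triangle inequality lets a query skip subtrees that cannot contain a match, reducing typical similarity searches from O(n) toward O(log n); the gain shrinks as the distance threshold grows, and the worst case remains O(n).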

Tree structure:

phash:bktree:root                      → first phash inserted
phash:bktree:{phash}:children          → hash map: hamming_distance → child_phash

Search pruning:

# Only explore child if: |distance(node, query) - distance(node, child)| ≤ threshold
# Skips ~90%+ of tree for typical queries
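
The key layout and pruning rule above can be sketched in a few lines. This is a minimal illustration, not the code in Phashs.py: a plain dict stands in for KVRocks, phashes are assumed to be 16-character hex strings, and the names `bktree_add`, `bktree_search`, and `children_key` are hypothetical.

```python
from collections import deque

ROOT_KEY = "phash:bktree:root"


def children_key(phash):
    # Mirrors the phash:bktree:{phash}:children key described above.
    return f"phash:bktree:{phash}:children"


def hamming(a, b):
    # Number of differing bits between two hex-encoded hashes.
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def bktree_add(store, phash):
    """Insert a phash; `store` is a dict standing in for KVRocks."""
    root = store.get(ROOT_KEY)
    if root is None:
        store[ROOT_KEY] = phash  # the first phash inserted becomes the root
        return
    node = root
    while True:
        dist = hamming(node, phash)
        if dist == 0:
            return  # an identical hash is already indexed
        # Fields are ints here; a real hash map would store them as strings.
        children = store.setdefault(children_key(node), {})
        if dist not in children:
            children[dist] = phash  # free slot at this distance
            return
        node = children[dist]  # slot taken: descend into that child


def bktree_search(store, query, threshold):
    """Return (phash, distance) pairs within `threshold` of `query`."""
    root = store.get(ROOT_KEY)
    if root is None:
        return []
    matches = []
    pending = deque([root])
    while pending:
        node = pending.popleft()
        dist = hamming(node, query)
        if dist <= threshold:
            matches.append((node, dist))
        for edge, child in store.get(children_key(node), {}).items():
            # Triangle inequality: skip subtrees that cannot hold a match.
            if abs(dist - edge) <= threshold:
                pending.append(child)
    return matches
```

With the default threshold of 8 on 64-bit hashes, a node at distance 20 from the query only needs its children on edges 12 through 28 visited; every other subtree is skipped without computing a single distance inside it.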

Implementation

Core Objects

  • Phashs.py: Phash object class, BK-tree operations (add, search, rebuild), Hamming distance calculation
  • Uses imagehash library for DCT-based 64-bit phash computation
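
For reference, the Hamming distance between two 64-bit phashes is simply the number of bit positions in which they differ. A minimal sketch, assuming phashes are stored as equal-length hex strings (the signature here is an assumption, not necessarily the one used in Phashs.py):

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(phash_a) != len(phash_b):
        raise ValueError("phashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")
```

When working with imagehash objects directly, subtracting one ImageHash from another yields the same count.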

Processing Pipeline

Image → ImagePhash module → calculates phash → creates Phash object + Image correlation
                          → PhashCorrelation queue → BK-tree search → creates Phash↔Phash correlations

Integration Points

  • correlations_engine.py: Added phash: [image, phash] correlation types
  • ail_objects.py: Registered Phash in OBJECTS_CLASS
  • ail_core.py: Added phash to AIL_OBJECTS sets
  • modules.cfg: ImagePhash subscribes to Image queue, publishes to PhashCorrelation queue

Configuration

[Images]
# Hamming distance threshold for similarity (0-64 range)
phash_max_hamming_distance = 8
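
One caveat when reading this setting: Python's configparser does not strip inline comments by default, so a value followed by `# ...` on the same line only parses as an integer if the loader enables `inline_comment_prefixes`. The sketch below shows a standalone reader with range validation; `load_phash_threshold` is a hypothetical helper and does not reflect how AIL's own config loader is written.

```python
import configparser


def load_phash_threshold(cfg_text: str, default: int = 8) -> int:
    """Read phash_max_hamming_distance from [Images] and validate it."""
    # inline_comment_prefixes lets "8  # comment" parse as the integer 8.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    value = parser.getint("Images", "phash_max_hamming_distance", fallback=default)
    # A 64-bit phash cannot differ in more than 64 bit positions.
    if not 0 <= value <= 64:
        raise ValueError(f"phash_max_hamming_distance out of range: {value}")
    return value
```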

UI

  • Flask blueprint at /objects/phashes with daterange filtering
  • Template follows DomHash/HHHash patterns

Maintenance

  • rebuild_phash_index.py: Rebuilds BK-tree from all existing phash objects (for migrations or corruption recovery)
  • Graceful fallback to linear search if BK-tree operations fail
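
The fallback behaviour could look roughly like the following. This is a hedged sketch: `find_similar`, `index_search`, and `all_phashes` are illustrative names, and the real module may catch narrower exceptions or obtain its phash list differently.

```python
import logging

logger = logging.getLogger(__name__)


def hamming(a, b):
    # Number of differing bits between two hex-encoded hashes.
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def find_similar(query, threshold, index_search, all_phashes):
    """Try the index first; fall back to an O(n) scan if it fails."""
    try:
        return index_search(query, threshold)
    except Exception:
        # Log the failure so a broken index does not go unnoticed.
        logger.exception("BK-tree search failed, falling back to linear scan")
    matches = []
    for phash in all_phashes:
        dist = hamming(phash, query)
        if dist <= threshold:
            matches.append((phash, dist))
    return matches
```

The linear path returns the same matches as a working index, only more slowly, so a fallback that fires repeatedly is a signal to run the rebuild script.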

Dependencies

  • imagehash>=4.3.0 added to requirements.txt

Testing

27 tests covering phash object operations, Hamming distance edge cases, BK-tree insertion/search with various thresholds, and index rebuilding.

Original prompt

Phash Implementation with Efficient BK-Tree Indexing

Overview

This PR implements perceptual hashing (phash) functionality for AIL with an efficient BK-tree indexing structure to enable fast similarity detection without full enumeration. This builds upon the work in PR #318 by @cavedave and addresses the performance concerns about enumerating all phash objects.

Key Improvements Over PR #318

1. BK-Tree Indexing for Efficient Search

  • Implements a Burkhard-Keller tree structure in KVRocks for O(log n) similarity searches
  • Avoids O(n) full enumeration of all phash objects
  • Uses triangle inequality to prune search branches that cannot contain matches
  • Provides 100-10000x performance improvement for large datasets (10k-100k images)

2. All Original Features from PR #318

  • Perceptual hashing using imagehash library (64-bit DCT-based phash)
  • Automatic correlation: Phash ↔ Image and Phash ↔ Phash
  • Similarity detection using Hamming distance (configurable threshold, default: 8)
  • UI integration: Phash object browser and correlation graph visualization
  • Comprehensive testing: 100% coverage for Phash objects, 92%+ for modules

Implementation Details

BK-Tree Index Structure

The BK-tree is stored in KVRocks using the following keys:

phash:bktree:root                           # Root node (first phash)
phash:bktree:{phash_value}:children        # Hash map: distance -> child_phash

How it works:

  1. Each phash is inserted into the tree based on Hamming distance from parent nodes
  2. Each child is stored in its parent's hash map under a field equal to its Hamming distance from the parent
  3. Search uses the triangle inequality: if |d(node, query) - d(node, child)| > max_distance, skip that child's subtree
  4. This prunes most of the search space without checking every phash
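
The pruning rule follows from the triangle inequality: every phash x stored under an edge at distance k from a node satisfies d(node, x) = k, so d(query, x) ≥ |d(node, query) - k|. If that lower bound already exceeds max_distance, nothing under the edge can match. Put differently, only edges inside a fixed window around d(node, query) need visiting. A small worked sketch (the helper name `edges_to_explore` is hypothetical):

```python
def edges_to_explore(dist_to_query: int, max_distance: int, hash_bits: int = 64) -> range:
    """Child-edge distances that may still lead to a match."""
    # An edge of 0 never exists: distance 0 means the same hash.
    low = max(1, dist_to_query - max_distance)
    # No two hashes can differ in more bits than the hash has.
    high = min(hash_bits, dist_to_query + max_distance)
    return range(low, high + 1)
```

For a node at distance 10 from the query and max_distance 8, only edges 2 through 18 are followed; children stored at any other distance are skipped together with their whole subtrees.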

Performance:


Files to Create/Modify

New Files (from PR #318)

  1. bin/lib/objects/Phashs.py - Enhanced with BK-tree functions:

    • Phash class: Represents a perceptual hash value
    • Phashs collection class: Manages Phash objects
    • add_to_bktree_index(): Insert phash into BK-tree
    • search_bktree_index(): Fast similarity search using BK-tree
    • hamming_distance(): Calculate Hamming distance between phashes
    • rebuild_bktree_index(): Rebuild index from all existing phashes
  2. bin/modules/ImagePhash.py (71 lines)

    • Processes images from Image queue
    • Calculates phash, stores in Image metadata, creates Phash objects
    • Creates Phash ↔ Image correlations
  3. bin/modules/PhashCorrelation.py - Enhanced with BK-tree search:

    • Uses search_bktree_index() for efficient similarity search
    • Falls back to linear search if BK-tree fails (graceful degradation)
    • Creates Phash ↔ Phash correlations
  4. bin/tools/rebuild_phash_index.py (NEW - not in original PR)

    • Maintenance script to rebuild BK-tree index
    • Useful after importing old data or index corruption
  5. var/www/blueprints/objects_phash.py (74 lines)

    • Flask blueprint for Phash object routes
    • /objects/phashes - List view
    • /objects/phash/post - Form handling
    • /objects/phash/range/json - Chart data
  6. var/www/templates/objects/phash/PhashDaterange.html (164 lines)

    • Jinja2 template for Phash object browser UI
    • Date range filtering and object listing
  7. tests/test_objects_phashes.py (339 lines) - Enhanced with BK-tree tests:

    • Original 27 tests for Phash objects
    • Additional tests for BK-tree indexing functions
    • Tests for hamming_distance calculation
    • Tests for search_bktree_index with various thresholds
    • 100% code coverage

Modified Files (from PR #318)

  1. bin/lib/objects/Images.py

    • Move phash calculation functions to Phashs.py (as suggested in PR review)
    • Keep references/wrappers for backward compatibility
  2. bin/lib/objects/Screenshots.py

    • Move phash calculation functions to Phashs.py (as suggested in PR review)
    • Keep references/wrappers for backward compatibility
  3. bin/lib/correlations_engine.py

    • Add "phash": ["image", "phash"] to CORRELATION_TYPES_BY_OBJ
    • Remove incorrect image↔image and screenshot↔image correlations (per PR review feedback)
  4. bin/lib/objects/ail_objects.py

    • Register Phash in OBJECTS_CLASS dictionary
  5. bin/lib/ail_core.py

    • Add 'phash' to AIL_OBJECTS set
    • Add 'phash' to AIL_OBJECTS_CORRELATIONS_DEFAULT set
  6. configs/modules.cfg

    • Add [ImagePhash] section with queue configuration
    • Add [PhashCorrelation] section with queue configuration
  7. bin/LAUNCH.sh

    • Add ImagePhash module to launch sequence
    • Add PhashCorrelation module to launch sequence
  8. configs/core.cfg.sample

    • Add `phash_max_hamming_distance = 8...

This pull request was created from Copilot chat.


Copilot AI changed the title from "[WIP] Implement phash functionality with efficient BK-tree indexing" to "Implement perceptual hashing with BK-tree indexing for O(log n) similarity search" Jan 29, 2026
Copilot AI requested a review from adulau January 29, 2026 13:11