Labels: help wanted
Description
Problem
AVX2 Slim Teddy is up to 6x slower than SSSE3 in benchmarks with high false-positive rates.
Benchmarks
| Benchmark | SSSE3 (main) | AVX2 (PR #73) | Regression |
|---|---|---|---|
| AhoCorasickLargeInput 64KB | 106µs | 640µs | +502% |
| AhoCorasickManyPatterns 10 | 63ns | 169ns | +168% |
Analysis
Direct SIMD benchmark (NO verification loop)
- AVX2: 18 GB/s (15,699 MB/s)
- SSSE3: 9.4 GB/s (5,348 MB/s)
- AVX2 is 2x faster ✓
Integrated benchmark (WITH verification loop)
- SSSE3: 87-106µs
- AVX2: 500-640µs
- SSSE3 is 6x faster ✗
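A back-of-the-envelope cost model shows where the time goes. Assume the integrated time is the kernel's pure scan time plus a fixed cost per candidate. The per-candidate cost can then be backed out from the numbers above and the 3940 candidates counted in the false-positive analysis. This is a Go sketch (the language is an assumption), and `perCandidateCost` is a hypothetical helper, not part of the project.

```go
package main

import "fmt"

// perCandidateCost backs out the average cost of one candidate (verification
// plus re-entering the SIMD scan) under an assumed additive model:
//
//	total = inputBytes/throughput + candidates*perCandidate
//
// Times are in microseconds and throughput in bytes per microsecond.
func perCandidateCost(totalMicros, inputBytes, bytesPerMicro float64, candidates int) float64 {
	scan := inputBytes / bytesPerMicro // time the kernel alone would need
	return (totalMicros - scan) / float64(candidates)
}

func main() {
	const input = 64 * 1024 // 64KB haystack
	const candidates = 3940 // false-positive candidates from the analysis

	// With MB = 10^6 bytes, 1 MB/s equals 1 byte/µs, so MB/s figures plug in directly.
	avx2 := perCandidateCost(640, input, 15699, candidates)
	ssse3 := perCandidateCost(106, input, 5348, candidates)
	fmt.Printf("AVX2:  %.0f ns per candidate\n", avx2*1000)
	fmt.Printf("SSSE3: %.0f ns per candidate\n", ssse3*1000)
}
```

The calculation uses the upper ends of the measured ranges. Under these assumptions, the pure scan of 64KB takes only about 4µs (AVX2) and 12µs (SSSE3), so nearly all of the integrated time is per-candidate cost. That works out to roughly 160ns per candidate for AVX2 and about 24ns for SSSE3.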
False positive analysis
Pattern: `error|warning|critical|fatal|debug|info|trace|notice|alert|emergency|panic|exception|failure|timeout|refused`
2-byte fingerprint prefixes in 64KB of English text:
- `in`: 1576 occurrences
- `no`: 788 occurrences
- `al`: 788 occurrences
- `ex`: 788 occurrences
- Total: 3940 false-positive candidates
- Average: 15.6 bytes between candidates
Hypotheses
- Per-call overhead: AVX2 has a higher setup/teardown cost per call
  - 256-bit register save/restore
  - `VZEROUPPER` before `RET`
- Restart penalty: after each false positive, the search restarts
  - With 3940 false positives, `findSIMD()` is called ~4000 times
  - Each call reinitializes `prev0 = 0xFF`
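- AMD EPYC specifics: CI uses an AMD EPYC 7763
  - 256-bit AVX2 split into two 128-bit µops (this was true of Zen 1; Zen 2 and later, including the Zen 3-based EPYC 7763, have native 256-bit datapaths, which makes this an unlikely cause)
  - More severe penalties for loads that cross cache lines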
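A scalar model of the find, verify, restart loop illustrates the restart hypothesis. `findNext` is a hypothetical stand-in for the SIMD kernel, not the project's `findSIMD()`, and it counts how many times it is entered. Every candidate adds one kernel call, so any fixed per-call setup (loading masks, resetting `prev0`, `VZEROUPPER` on exit) is paid roughly once per candidate. The model assumes all patterns are at least 2 bytes long.

```go
package main

import (
	"bytes"
	"fmt"
)

// findNext is a hypothetical scalar stand-in for the SIMD kernel: it scans
// hay from start and returns the first candidate position, or -1. It bumps
// *calls so the number of kernel entries can be observed; a real kernel
// would pay its per-call setup each time it is entered.
func findNext(hay []byte, start int, set map[[2]byte]bool, calls *int) int {
	*calls++
	for i := start; i+1 < len(hay); i++ {
		if set[[2]byte{hay[i], hay[i+1]}] {
			return i
		}
	}
	return -1
}

// searchAll models the integrated loop: find a candidate, verify it,
// then restart the kernel just past it. Patterns must be >= 2 bytes.
func searchAll(hay []byte, patterns []string) (matches, calls int) {
	set := make(map[[2]byte]bool)
	for _, p := range patterns {
		set[[2]byte{p[0], p[1]}] = true
	}
	for pos := 0; ; {
		i := findNext(hay, pos, set, &calls)
		if i < 0 {
			return matches, calls
		}
		for _, p := range patterns {
			if bytes.HasPrefix(hay[i:], []byte(p)) {
				matches++
				break
			}
		}
		pos = i + 1 // restart: the next call starts from scratch
	}
}

func main() {
	m, c := searchAll([]byte("info in no error"), []string{"info", "notice", "error"})
	fmt.Println("matches:", m, "kernel calls:", c) // prints "matches: 2 kernel calls: 5"
}
```

On `info in no error`, the loop finds 4 candidates and 2 matches but enters the kernel 5 times: once per candidate, plus a final call that runs off the end of the input.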
Current workaround
Keep SSSE3 for the integrated Teddy prefilter. The AVX2 functions remain available for direct use in specialized scenarios.
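The kernel could also be chosen automatically from the observed candidate density. This is a hypothetical heuristic, not existing code. `chooseKernel` returns AVX2 only when the average distance between candidates seen so far is at least `minGap` bytes. `minGap` is an untuned assumption; 4096 mirrors the >4KB threshold raised under "Questions to investigate".

```go
package main

import "fmt"

// kernel names a prefilter implementation (hypothetical).
type kernel int

const (
	kernelSSSE3 kernel = iota
	kernelAVX2
)

func (k kernel) String() string {
	if k == kernelAVX2 {
		return "AVX2"
	}
	return "SSSE3"
}

// chooseKernel is a hypothetical density heuristic: use AVX2 only when the
// average run of bytes between candidates seen so far is at least minGap.
// minGap is an untuned assumption, not a measured threshold.
func chooseKernel(bytesScanned, candidatesSeen, minGap int) kernel {
	// k candidates split the scanned range into k+1 gaps; this also avoids
	// dividing by zero when nothing has matched yet.
	avgGap := bytesScanned / (candidatesSeen + 1)
	if avgGap >= minGap {
		return kernelAVX2
	}
	return kernelSSSE3
}

func main() {
	fmt.Println(chooseKernel(64*1024, 3940, 4096)) // dense candidates: prints SSSE3
	fmt.Println(chooseKernel(64*1024, 0, 4096))    // no candidates yet: prints AVX2
}
```

For the 64KB benchmark input with 3940 candidates, the computed gap is far below 4096, so the heuristic stays on SSSE3, matching the current workaround.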
Questions to investigate
- Is there a bug in the AVX2 assembly that causes the slowdown?
- Would a size-based threshold help (e.g., using AVX2 only for uninterrupted chunks larger than 4KB)?
- Can we reduce per-call overhead by restructuring the code?
- Profile with `perf` to identify hotspots
- Test on different CPUs (Intel vs AMD)
- Compare with Rust aho-corasick (uses compile-time dispatch)
References
- Rust regex PR #456: goodbye simd crate, hello std::arch
- Intel AVX-SSE transition penalties documentation
- AMD Zen 3 architecture manual