perf: AVX2 Slim Teddy shows 6x regression in high false-positive workloads #74

@kolkov

Problem

AVX2 Slim Teddy is up to 6x slower than SSSE3 in benchmarks with high false-positive rates.

Benchmarks

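| Benchmark                  | SSSE3 (main) | AVX2 (PR #73) | Regression |
|----------------------------|--------------|---------------|------------|
| AhoCorasickLargeInput 64KB | 106µs        | 640µs         | +502%      |
| AhoCorasickManyPatterns 10 | 63ns         | 169ns         | +168%      |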

Analysis

Direct SIMD benchmark (NO verification loop)

  • AVX2: 18 GB/s (15,699 MB/s)
  • SSSE3: 9.4 GB/s (5,348 MB/s)
  • AVX2 is 2x faster

Integrated benchmark (WITH verification loop)

  • SSSE3: 87-106µs
  • AVX2: 500-640µs
  • SSSE3 is 6x faster
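
A back-of-the-envelope check on the two result sets above, as a minimal Go sketch. It assumes 64KB means 65,536 bytes, 1 GB means 1e9 bytes, and uses the GB/s figures from the direct benchmark; `scanMicros` and `nonScanShare` are hypothetical helper names, not part of the library. Under those assumptions raw scanning of 64KB takes roughly 4µs (AVX2) and 7µs (SSSE3), so most of the integrated time in both cases appears to come from somewhere other than the SIMD scan itself.

```go
package main

import "fmt"

// scanMicros returns the time in microseconds needed to stream nBytes at the
// given throughput. Assumption: 1 GB = 1e9 bytes.
func scanMicros(nBytes int, gbPerSec float64) float64 {
	return float64(nBytes) / (gbPerSec * 1e9) * 1e6
}

// nonScanShare returns the fraction of totalMicros that is not explained by
// raw scanning alone (verification, call overhead, restarts, and so on).
func nonScanShare(totalMicros, scan float64) float64 {
	return (totalMicros - scan) / totalMicros
}

func main() {
	const input = 64 * 1024 // assumed size of the 64KB benchmark input

	avx2Scan := scanMicros(input, 18)   // direct AVX2 figure from the issue
	ssse3Scan := scanMicros(input, 9.4) // direct SSSE3 figure from the issue

	// Integrated timings are the upper ends of the ranges reported above.
	fmt.Printf("AVX2:  scan %.1fµs, non-scan share %.3f\n", avx2Scan, nonScanShare(640, avx2Scan))
	fmt.Printf("SSSE3: scan %.1fµs, non-scan share %.3f\n", ssse3Scan, nonScanShare(106, ssse3Scan))
}
```

If this estimate holds, the regression is unlikely to be explained by the vector loop's throughput alone, which is consistent with the direct benchmark showing AVX2 ahead.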

False positive analysis

Pattern: error|warning|critical|fatal|debug|info|trace|notice|alert|emergency|panic|exception|failure|timeout|refused

2-byte fingerprint prefixes in 64KB English text:

in: 1576 occurrences
no: 788 occurrences  
al: 788 occurrences
ex: 788 occurrences
Total: 3940 false positive candidates
Average: about 16.6 bytes between candidates (65,536 bytes / 3,940 candidates)
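
The counts above can be reproduced for any input with a few lines of Go. This is a minimal sketch, not the code used to produce the numbers in this issue; `countFingerprints` and `averageGap` are hypothetical helper names, and the sample string is made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// countFingerprints counts non-overlapping occurrences of each 2-byte prefix
// in text and returns the per-prefix counts plus their sum.
func countFingerprints(text []byte, prefixes []string) (map[string]int, int) {
	counts := make(map[string]int, len(prefixes))
	total := 0
	for _, p := range prefixes {
		n := bytes.Count(text, []byte(p))
		counts[p] = n
		total += n
	}
	return counts, total
}

// averageGap returns the mean number of input bytes per candidate,
// or 0 when there are no candidates.
func averageGap(inputLen, candidates int) float64 {
	if candidates == 0 {
		return 0
	}
	return float64(inputLen) / float64(candidates)
}

func main() {
	// Illustrative sample only; the issue's figures come from 64KB of English text.
	sample := []byte("an info line in an example: no alert")
	counts, total := countFingerprints(sample, []string{"in", "no", "al", "ex"})
	fmt.Println(counts, total)
	fmt.Printf("average gap: %.1f bytes\n", averageGap(len(sample), total))
}
```

Running the same helpers over the real benchmark input would show whether the candidate density matches the estimate above.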

Hypotheses

  1. Per-call overhead: AVX2 has higher setup/teardown cost per call

    • 256-bit register save/restore
    • VZEROUPPER before RET
  2. Restart penalty: After each false positive, search restarts

    • With 3940 false positives, findSIMD() called ~4000 times
    • Each call reinitializes prev0 = 0xFF
  3. AMD EPYC specifics: CI uses AMD EPYC 7763

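    • 256-bit AVX2 ops were split into two 128-bit µops on Zen 1, but the EPYC 7763 is Zen 3, which executes them at native 256-bit width, so this needs checking on the CI hardware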
    • More severe cache line crossing penalties
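
The restart pattern in hypothesis 2 can be sketched in plain Go to make the call count concrete. This is an illustrative model only: `findCandidate` is a hypothetical scalar stand-in for `findSIMD()`, and the verification step is simplified to a prefix check. It does not reproduce the library's actual loop.

```go
package main

import (
	"bytes"
	"fmt"
)

// findCandidate is a hypothetical stand-in for findSIMD(): it returns the
// offset of the earliest fingerprint hit at or after start, or -1 if none.
func findCandidate(hay []byte, start int, fingerprints [][]byte) int {
	best := -1
	for _, fp := range fingerprints {
		if i := bytes.Index(hay[start:], fp); i >= 0 && (best < 0 || start+i < best) {
			best = start + i
		}
	}
	return best
}

// searchWithRestarts finds a candidate, verifies it against the full
// patterns, and on a false positive restarts the scan one byte later.
// It returns the match offset (or -1) and how many times the candidate
// finder was called.
func searchWithRestarts(hay []byte, fingerprints, patterns [][]byte) (match, calls int) {
	pos := 0
	for pos < len(hay) {
		calls++
		c := findCandidate(hay, pos, fingerprints)
		if c < 0 {
			return -1, calls
		}
		for _, p := range patterns {
			if bytes.HasPrefix(hay[c:], p) {
				return c, calls
			}
		}
		pos = c + 1 // false positive: every restart pays the per-call cost again
	}
	return -1, calls
}

func main() {
	fps := [][]byte{[]byte("in"), []byte("no"), []byte("al"), []byte("ex")}
	pats := [][]byte{[]byte("info"), []byte("notice"), []byte("alert"), []byte("exception")}
	match, calls := searchWithRestarts([]byte("in no al ex info"), fps, pats)
	fmt.Println(match, calls)
}
```

In this model every false positive costs one extra call, so any fixed per-call cost is multiplied by the candidate count. As a rough bound rather than a measurement: if the whole gap between 640µs and 106µs were attributed to the ~3940 restarts, it would come to about 135ns of extra cost per call.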

Current workaround

Keep SSSE3 for integrated Teddy prefilter. AVX2 functions remain available for direct use in specialized scenarios.

Questions to investigate

  • Is there a bug in AVX2 assembly causing slowdown?
  • Would a size-based threshold help? (e.g., only AVX2 for >4KB uninterrupted chunks)
  • Can we reduce per-call overhead by restructuring the code?
  • Profile with perf to identify hotspots
  • Test on different CPUs (Intel vs AMD)
  • Compare with Rust aho-corasick (uses compile-time dispatch)
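
One way the size-based threshold question could be prototyped is sketched below. All names are hypothetical, the 4KB value is taken straight from the example in the question and would need benchmarking, and the `hasAVX2` flag is assumed to come from CPU feature detection done elsewhere.

```go
package main

import "fmt"

// avx2MinChunk is a hypothetical threshold (the 4KB example from the
// question); the right value would have to be found by benchmarking.
const avx2MinChunk = 4 * 1024

type kernel int

const (
	kernelSSSE3 kernel = iota
	kernelAVX2
)

func (k kernel) String() string {
	if k == kernelAVX2 {
		return "AVX2"
	}
	return "SSSE3"
}

// expectedRun estimates the average number of uninterrupted bytes between
// candidates seen so far. With no candidates, the whole span is one run.
func expectedRun(bytesScanned, candidates int) int {
	if candidates == 0 {
		return bytesScanned
	}
	return bytesScanned / candidates
}

// chooseKernel picks AVX2 only when the CPU supports it and the expected
// uninterrupted run is at least avx2MinChunk; otherwise it stays on SSSE3.
func chooseKernel(hasAVX2 bool, run int) kernel {
	if hasAVX2 && run >= avx2MinChunk {
		return kernelAVX2
	}
	return kernelSSSE3
}

func main() {
	// Candidate density from the false positive analysis in this issue.
	fmt.Println(chooseKernel(true, expectedRun(64*1024, 3940)))
	// A haystack with no candidates at all.
	fmt.Println(chooseKernel(true, expectedRun(64*1024, 0)))
}
```

With the candidate density reported above, this heuristic would stay on SSSE3, which matches the current workaround; whether an adaptive choice is worth its own bookkeeping cost is an open question.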

References

  • Rust regex PR #456: goodbye simd crate, hello std::arch
  • Intel AVX-SSE transition penalties documentation
  • AMD Zen 3 architecture manual
