Custom Self-Attention and Late Interaction for Basket Model: Capturing Complementarity and Purchase Intent in Retail
Vincent Auriau1, 2, Michaël Teboul1, Martin Možina3 and Emmanuel Malherbe1
1 Artefact Research Center, 2 MICS - CentraleSupélec, 3 Fortenova Group
In large-scale retail environments involving thousands of products, understanding how products are purchased, and in particular whether and how they interact, is of great importance. Such insights, like product complementarity and substitution, are crucial for assortment optimization, promotion planning, and store layout design. While methods that model products with embeddings have demonstrated strong performance in learning meaningful representations, efficiently modeling a basket of products as a whole remains a challenge. In this work, we leverage self-attention, a core operation of the Transformer architecture, well known in language modeling for its ability to enrich the representation of tokens given their context. We propose an architecture, training procedure, and scoring function adapted to the structure of baskets of items, all of which remain fairly simple and interpretable. The resulting model achieves state-of-the-art performance on the basket completion task for several datasets, including a large-scale one from a private retail actor. We provide a detailed analysis of the different architecture components and show how they can be interpreted to better understand products in terms of popularity, clusters, and interactions between them. These insights can be efficiently leveraged in an industrial context, such as within a dashboard we developed for category managers.
git clone --recurse-submodules git@github.com:artefactory/saber.git
The different datasets can be downloaded using the following links and placed in the folder "/datasets".
- Python >= 3.10
- NumPy
- TensorFlow
- pyreadr
- choice-learn
pip install -r requirements.txt
python experiments/training.py
python experiments/evaluate.py

| Configuration | MRR | HR@50 | NDCG |
|---|---|---|---|
| Full Architecture | 0.0632 | 27.9 | 0.190 |
| Components Ablation | |||
| w/o Res-FFN | 0.0628 | 27.8 | 0.190 |
| w/o self-Attention | 0.0608 | 27.0 | 0.188 |
| w/o Popularity bias | 0.0510 | 22.8 | 0.174 |
| w/ Value Matrix | | 26.8 | 0.186 |
| Model Capacity | |||
| 4 Heads - 1 Layer | 0.0617 | 27.4 | 0.188 |
| 1 Head - 2 Layers | 0.0628 | 27.6 | 0.188 |
| 1 Head - 4 Layers | 0.0429 | 19.2 | 0.163 |
| Mapping Strategy | |||
| w/o Weight Tying | 0.0585 | 26.1 | 0.181 |
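
The three columns above are standard ranking metrics for basket completion. As a rough illustration of how they can be computed when exactly one item is held out per basket, here is a minimal sketch. It is not the code in `experiments/evaluate.py`: the function names are hypothetical, and it assumes that HR@k is reported as a percentage and that NDCG is computed without truncation, neither of which is stated in this README.

```python
import math


def rank_of_target(scores, target):
    """Return the 1-based rank of item `target` under decreasing score.

    `scores` is a list with one score per candidate item. Ties are resolved
    in favor of the target (an assumption made for this sketch).
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def basket_completion_metrics(ranks, k=50):
    """Aggregate MRR, HR@k and NDCG over the ranks of the held-out items.

    With a single relevant item per basket, the ideal DCG is 1, so NDCG
    reduces to 1 / log2(rank + 1).
    """
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    # Assumption: hit rate expressed in percent.
    hit_rate = 100.0 * sum(1 for r in ranks if r <= k) / n
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks) / n
    return {"mrr": mrr, "hr": hit_rate, "ndcg": ndcg}


if __name__ == "__main__":
    # Three toy baskets whose held-out items were ranked 1st, 2nd and 4th.
    print(basket_completion_metrics([1, 2, 4], k=3))
```

Working from ranks rather than full score vectors keeps the aggregation independent of the model that produced the scores.
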

| | Prod2Vec | AleaCarta | AttRec | BERT4Rec | SABER |
|---|---|---|---|---|---|
| Embedding distance | cosine | cosine | cosine | cosine | |
| Uses self-attention | No | No | Yes | Yes | Yes |
| w/ value matrix | - | - | No | Yes | No |
| multi head/layers | No | No | No | Yes | No |
| w/ price effect | No | Yes | No | No | Yes |
| Ties input & output embeddings | No | Yes | Yes | Yes | Yes |
| Designed for variable sized sets | Yes | Yes | No | No | Yes |
| Order invariant | Yes | Yes | No | No | Yes |
| Nb tokens hidden | 1 | 1 | 1 | N | 1 |
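
To make the SABER column of the comparison more concrete, the sketch below puts together a single-head, single-layer self-attention step without a value matrix, output scores that reuse the input embedding table (weight tying), an additive popularity bias, and no positional encoding, so the result does not depend on item order. This is only one possible reading of those table rows, not the repository's implementation: the function `score_candidates`, the projection matrices `W_q` and `W_k`, the mean aggregation over basket items, and the dot-product scoring are all assumptions made for illustration, and the price effect is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def score_candidates(basket, E, W_q, W_k, popularity):
    """Score every catalog item as a completion of `basket`.

    basket:     list of item indices already in the basket
    E:          (num_items, d) embedding table, used for input and output
    W_q, W_k:   (d, d) query and key projections (hypothetical names)
    popularity: (num_items,) additive per-item bias
    """
    X = E[basket]                                  # (n, d) basket item embeddings
    Q, K = X @ W_q, X @ W_k
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))     # (n, n) attention weights
    H = A @ X                                      # no value matrix: mix raw embeddings
    logits = H @ E.T                               # (n, num_items): tied output embeddings
    # Assumption: per-item scores are averaged over the basket before adding the bias.
    return logits.mean(axis=0) + popularity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(6, 4))
    W_q, W_k = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    print(score_candidates([0, 2, 5], E, W_q, W_k, np.zeros(6)))
```

Because nothing in the computation depends on the position of an item in `basket`, shuffling the basket leaves the scores unchanged up to floating-point error, and baskets of any size are handled by the same function.
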



