An advanced, high-performance PyTorch implementation of the Titans architecture (Google Research, Jan 2025). This repo provides the tools to build models with Infinite Context using Neural Long-Term Memory.
- Core Concept
- Mathematical Foundation
- Architecture Comparison
- Variants Detailed
- Advanced Configuration
- Performance & Parallel Scan
- Project Structure
## Core Concept

Traditional Transformers rely on a fixed-size short-term memory: attention. As the sequence grows, the cost becomes quadratic ($O(T^2)$), and older information is eventually truncated.

Titans solve this by adding a Neural Memory branch. This branch is a deep MLP that acts as an associative store. For every new token, the model (see the sketch after this list):
- Reads from memory to get context.
- Computes the "surprise" (loss) of the new token.
- Updates its own weights via one step of gradient descent to "learn" the token.
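A minimal PyTorch sketch of this read/surprise/update loop, assuming a toy MLP memory and stand-in key/value projections (`memory`, `W_K`, `W_V`, and `theta` are illustrative names, not this repo's API):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small MLP memory and key/value projections
# (illustrative only, not this repo's actual modules).
memory = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64))
W_K = nn.Linear(64, 64, bias=False)
W_V = nn.Linear(64, 64, bias=False)
theta = 0.1  # inner-loop learning rate

def memory_step(x_t: torch.Tensor) -> torch.Tensor:
    read = memory(x_t)                       # 1. read from memory for context
    k_t, v_t = W_K(x_t), W_V(x_t)
    loss = (memory(k_t) - v_t).pow(2).sum()  # 2. "surprise" of the new token
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():                    # 3. one gradient step on the
        for p, g in zip(memory.parameters(), grads):  # memory's own weights
            p -= theta * g
    return read

out = memory_step(torch.randn(1, 64))
```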
## Mathematical Foundation

The Neural Memory update follows these core equations from the paper. With key/value projections $\mathbf{k}_t = x_t W_K$ and $\mathbf{v}_t = x_t W_V$, the memory $\mathcal{M}$ is trained at test time on the associative loss

$$\ell(\mathcal{M}_{t-1}; x_t) = \left\| \mathcal{M}_{t-1}(\mathbf{k}_t) - \mathbf{v}_t \right\|_2^2$$

Its weights are then updated through a momentum-based "surprise" term ($\eta$) and a forgetting gate ($\alpha$):

$$S_t = \eta_t\, S_{t-1} - \theta_t\, \nabla \ell(\mathcal{M}_{t-1}; x_t)$$

$$\mathcal{M}_t = (1 - \alpha_t)\, \mathcal{M}_{t-1} + S_t$$
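Read literally, the two update equations are a short recurrence. A runnable toy with constant gates (shapes and values are illustrative, not the repo's internals):

```python
import torch

d = 8
M = torch.zeros(d, d)               # memory weights M_{t-1} (toy view)
S = torch.zeros(d, d)               # surprise / momentum state S_{t-1}
eta, theta, alpha = 0.9, 0.1, 0.01  # constant stand-ins for η_t, θ_t, α_t

grad = torch.randn(d, d)            # stand-in for ∇ℓ(M_{t-1}; x_t)
S = eta * S - theta * grad          # S_t = η_t S_{t-1} − θ_t ∇ℓ(M_{t-1}; x_t)
M = (1 - alpha) * M + S             # M_t = (1 − α_t) M_{t-1} + S_t
```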
## Architecture Comparison

| Feature | Transformers | RNN / LSTM | Mamba / SSM | Titans (Ours) |
|---|---|---|---|---|
| Context Length | Fixed window | Infinite (but lossy) | Infinite | Infinite (High Fidelity) |
| Logic | Matching | Compression | Linear Dynamics | Test-Time Learning |
| Scaling | $O(T^2)$ | $O(T)$ | $O(T)$ | $O(T)$ (chunk-parallel) |
| Stability | Very High | Low | High | Very High |
## Variants Detailed

### MAC (Memory as Context)

The gold standard for long-context, RAG-style tasks.

- Workflow: Retrieve Memory -> Prepend to Attention -> Full Attention.
- Best for: Coding assistants, legal document analysis.
### MAG (Memory as Gate)

- Workflow: Attention and Memory branches run in parallel; their outputs are gated via a SiLU-based mechanism.
- Best for: Creative writing and reasoning where short-term and long-term context must blend.
### MAL (Memory as Layer)

- Workflow: The sequence is passed through Neural Memory, followed by a Sliding Window Attention layer.
- Best for: General-purpose LLMs seeking a balance between speed and precision.
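A minimal sketch contrasting the three data flows, using stand-in `attn` and `memory` modules (hypothetical simplifications, not this repo's API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)  # stand-in attention branch
memory = nn.Linear(d, d)  # stand-in for the Neural Memory branch

def mac(x):
    # Memory as Context: retrieved memory tokens are prepended, then full attention.
    ctx = memory(x)                  # stand-in for memory retrieval
    seq = torch.cat([ctx, x], dim=1)
    out, _ = attn(seq, seq, seq)
    return out[:, ctx.size(1):]      # keep outputs for the original tokens

def mag(x):
    # Memory as Gate: branches run in parallel, fused with a SiLU-based gate.
    a, _ = attn(x, x, x)
    return a * F.silu(memory(x))

def mal(x):
    # Memory as Layer: memory first, then attention (sliding-window in the real model).
    m = memory(x)
    out, _ = attn(m, m, m)
    return out

y = mac(torch.randn(2, 16, d))       # (batch=2, seq=16, d_model=64)
```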
## Advanced Configuration

Our `TitansConfig` allows for granular control over the memory dynamics:

```python
from titans.utils import TitansConfig
cfg = TitansConfig(
variant="MAC",
d_model=512,
n_layers=12,
mem_layers=2, # Depth of the internal Neural Memory MLP
n_persistent=16, # Constant tokens that stay in memory
chunk_size=64, # Parallelization chunk size (Inner-loop)
use_momentum=True, # Enable η surprise flow
use_decay=True # Enable α forgetting gate
)
```
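A quick usage sketch (the `TitansModel` name and import path below are assumptions for illustration, not confirmed API of this repo):

```python
import torch
# Hypothetical: `titans.models.TitansModel` is an assumed entry point.
from titans.models import TitansModel

model = TitansModel(cfg)                       # cfg from the snippet above
input_ids = torch.randint(0, 32000, (1, 256))  # toy batch of token ids
logits = model(input_ids)
```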
## Performance & Parallel Scan

In version 0.3.0, we implemented a Binary Tree Associative Scan.

Why it matters: Standard RNN-like updates must run token-by-token (one after another). Our associative scan allows the GPU to process entire chunks of a sequence at once by using the associative property of the linear recurrence, reducing the sequential depth from $O(T)$ to $O(\log T)$.
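As a minimal sketch of the idea, here is a log-depth doubling scan over the scalar recurrence $h_t = a_t h_{t-1} + b_t$ (illustrative only, not the repo's kernel):

```python
import torch

def associative_scan(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """All prefix states h_1..h_T of h_t = a_t * h_{t-1} + b_t (with h_0 = 0),
    computed in O(log T) sequential steps instead of O(T)."""
    a, b = a.clone(), b.clone()
    step, T = 1, a.size(0)
    while step < T:
        # Combine each element with the composite `step` positions earlier:
        # (A, B) followed by (a, b)  ->  (a*A, a*B + b)
        a_prev, b_prev = a[:-step], b[:-step]
        a[step:], b[step:] = a[step:] * a_prev, a[step:] * b_prev + b[step:]
        step *= 2
    return b  # with h_0 = 0, the accumulated offset equals the state

# Sanity check against the sequential, token-by-token recurrence:
T = 8
a, b = torch.rand(T), torch.randn(T)
h, ref = torch.zeros(()), []
for t in range(T):
    h = a[t] * h + b[t]
    ref.append(h)
torch.testing.assert_close(associative_scan(a, b), torch.stack(ref))
```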
## Project Structure

```
titans-memory/
├── titans/
│ ├── memory/ # Neural & Persistent Memory cores
│ ├── models/ # MAC, MAG, MAL, LMM variants
│ ├── ops/ # Parallel Associative Scan & Attention
│ └── utils/
│ ├── hf.py # HuggingFace Transformers wrapper
│ ├── training.py # DDP & Optimizer helpers
│ └── config.py # Unified TitansConfig
├── tests/ # Full test suite (51+ tests)
├── scripts/ # Weight conversion & local scripts
├── examples/ # Quickstart & Training demos
├── pyproject.toml # Build system & Dependencies
└── README.md
```
Developed with precision by the Neuranox team.