Add comprehensive n-step returns analysis documentation#4
Draft
Add comprehensive n-step returns analysis documentation#4
Conversation
Co-authored-by: davepl <[email protected]>
Co-authored-by: davepl <[email protected]>
Co-authored-by: davepl <[email protected]>
Co-authored-by: davepl <[email protected]>
Copilot
AI
changed the title
[WIP] How high would we conceivably want to push n_step, and what are the tradeoffs/benefits in making it larger?
Add comprehensive n-step returns analysis documentation
Oct 2, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
This PR adds comprehensive documentation analyzing n-step returns in the Tempest AI reinforcement learning system, directly answering the question: "How high would we conceivably want to push n_step, and what are the tradeoffs/benefits in making it larger?"
Key Findings
Current Configuration is Optimal ✅
The existing
n_step=7configuration is already in the optimal range (5-10) for Tempest AI:batch_size=16,384effectively mitigates the 7× variance increaseMaximum Viable Limits
Beyond n=15: Variance and expert contamination costs exceed credit assignment benefits.
Implementation Verification
Code review confirmed the n-step implementation is mathematically correct:
All integration points verified:
Documentation Structure
Quick Start (10 minutes)
Comprehensive Analysis (45 minutes)
Advanced Topics (15 minutes, optional)
Total: 8 files, 2,531 lines, ~70 minutes to read everything
Key Tradeoffs Explained
Benefits of Higher N-Step
Costs of Higher N-Step
The Balance
At n=7 with your configuration:
Recommendation
🎯 Keep n_step=7 for the long training run. No changes needed.
The current configuration is already optimal. Focus on starting the training run rather than hyperparameter optimization.
Optional future experiment: After completing this run, could test n=10 after 6M frames (when expert_ratio=10%) on a checkpoint. Expected improvement: +3-5% performance with low risk.
Comparison to Literature
Your configuration is more aggressive than Rainbow (n=3) but aligns with state-of-the-art systems. The exceptionally large batch size (32-256× larger than typical) justifies using n=7 instead of n=3.
Changes Made
This PR is documentation-only - it provides deep analysis and guidance without modifying any functional code.
Original prompt
How high would we conceivably want to push n_step, and what are the tradeoffs/benefits in making it larger?
rd accounting is correct**
The code at lines 323-329 in
AI: correctly tracks which rewards came from expert vs DQN actions:
This is used for metrics only - it correctly tells you which actions earned which rewards.
However, the replay buffer stores ALL transitions together without marking their source:
(state, expert_action, reward, next_state, done)(state, dqn_action, reward, next_state, done)Both get mixed in the same buffer and sampled randomly during training.
Why This Is a Problem:
Expert transitions teach wrong Q-values: When the DQN samples an expert transition, it learns:
Policy mismatch: The expert's policy π_expert ≠ DQN's policy π_DQN
Off-policy corruption: Even though DQN is theoretically off-policy, training heavily on expert data biases the learned Q-function toward expert behavior patterns
Visualization:
When training samples randomly, ~95% early on are expert transitions (due to expert_ratio=0.95), so the Q-network learns mostly from expert demonstrations.
🔧 Potential Solutions
You have several options to fix this:
Option 1: Filter training samples (Recommended for testing)
AI: field to replay buffer
Option 2: Weighted sampling (Most balanced)
Option 3: Separate buffers (Clean architecture)
Option 4: Trust the diversity bonus (Current approach)
💡 My Recommendation
For testing n-step and diversity bonus, I'd suggest Option 1 (filter out expert transitions during training) because:
Want me to implement Option 1 so you can test pure DQN learning with your new n-step/diversity features?
User: How does the diversity bonus work? What does it watch for and do?
AI: Let me explain how the diversity bonus works by showing you the actual implementation:
AI:
Now let me show you where it gets applied in the socket server:
AI:
Perfect! Now let me explain how it works:
How the Diversity Bonus Works
Core Concept: Reward Trying New Things
The diversity bonus gives extra reward when the agent tries an action it hasn't tried before in a similar situation. This encourages counterfactual exploration - asking "what if I do X instead of Y?"
Step-by-Step Process:
1. State Clustering (Line 1139)
(5.5, 10.2, 3.1, ...)2. Action Fingerprinting (Lines 1145-1146)
(2, 0.5)= "Fire + Zap with ...✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.