[fix] Improve async training pipeline: ServerAdapter weight sync, robustness, and packaging#11
Open
dafu-wu wants to merge 1 commit into AgentR1:main from
Conversation
…ustness, and packaging

Core improvements:

- async_main: Fix the Rollouter/Trainer lifecycle so the Trainer is not cancelled when the Rollouter finishes first; shut down the DataPool after the Trainer completes; propagate critical environment variables (PYTHONPATH, CUDA_HOME, etc.) to Ray workers
- async_rollouter: Use sys.executable instead of a hardcoded 'python'; log gateway stdout/stderr to files; use the FQDN hostname for multi-node gateway access; defer DataPool shutdown to async_main; gracefully handle validation failures
- async_trainer: Fix the need_reference_policy(config) call; remove the deprecated num_examine param; add debug logging for reward-computation diagnostics
- blackbox_agent_flow: Add HTTP POST retry with exponential backoff; handle transient connection errors during agent runs (reward=0 fallback); support _run_agent returning a (num_turns, reward) tuple
- gsm8k_agent/gsm8k_agent_flow: Return a (turns_used, reward) tuple from solve() for direct reward passing to the Gateway
- data_pool: Add get_last_step() for auto-reward computation; fix max_queue_size display when None
- training_backend: Add debug logging for reward status in VerlBackend.convert()
- detach_workers: Add ServerAdapter support (skip the NCCL broadcast in vLLM HTTP server mode); add extract_actor_weights() and receive_and_update_weights() for Ray object-store weight transfer; handle DTensor extraction
- gateway: Add auto-reward computation for black-box agents via RewardLoopWorker; improve vLLM proxy error handling
- param_sync: Add a ServerAdapter weight-sync path via the Ray object store as an alternative to NCCL; auto-detect ServerAdapter rollouts
- reward_loop: Fix import compatibility (reward_manager vs. reward_loop module); add safe defaults for reward_loop_source and the reward_manager config
- main_agent_ppo: Fix the need_reference_policy call; remove num_examine; set reward_fn as attributes to match the RayPPOTrainer API
- pyproject.toml: Add PEP 621 project metadata, build-system config, and setuptools package discovery for pip-installable packaging
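The HTTP POST retry with exponential backoff described for blackbox_agent_flow can be sketched as below. This is a minimal illustration, not the PR's actual code: the helper name `retry_with_backoff`, the retry limits, and the exception set are all assumptions; the real code wraps an HTTP POST to the Gateway.

```python
import time
import random

def retry_with_backoff(fn, max_retries=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff.

    Hypothetical helper illustrating the retry strategy in the PR;
    delays double each attempt (capped at max_delay) plus small jitter.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: let the caller decide (e.g. reward=0)
            # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_delay,
            # with jitter to avoid synchronized retry storms.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

When the final attempt still fails, the exception propagates and the caller can fall back to `reward=0`, matching the PR's handling of transient connection errors during agent runs.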
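Propagating environment variables to Ray workers, as async_main now does for PYTHONPATH and CUDA_HOME, is typically done through Ray's `runtime_env` with its `env_vars` field. The sketch below builds that dict from the local environment; the exact variable list and the helper name are assumptions, not the PR's implementation.

```python
import os

# Variables worth forwarding to workers; the precise list is an assumption
# based on the PR description (PYTHONPATH, CUDA_HOME, etc.).
CRITICAL_ENV_VARS = ("PYTHONPATH", "CUDA_HOME", "LD_LIBRARY_PATH")

def build_runtime_env(keys=CRITICAL_ENV_VARS):
    """Collect locally-set env vars into a Ray runtime_env dict.

    Usage (sketch): ray.init(runtime_env=build_runtime_env())
    Only variables actually set in the driver's environment are forwarded.
    """
    env_vars = {k: os.environ[k] for k in keys if k in os.environ}
    return {"env_vars": env_vars}
```

Without this, Ray workers inherit a minimal environment and can fail to import the project or locate the CUDA toolchain even though the driver process runs fine.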
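Supporting _run_agent (and solve()) returning either a bare reward or a (num_turns, reward) tuple usually takes a small normalization shim at the call site. The helper below is a hypothetical sketch of that backward-compatible handling; the name `normalize_agent_result` does not appear in the PR.

```python
def normalize_agent_result(result):
    """Accept either `reward` or `(num_turns, reward)` from an agent run.

    Returns a (num_turns, reward) pair; num_turns is None when the agent
    only reported a reward. Hypothetical shim illustrating the PR's
    tuple-or-scalar return-value support.
    """
    if isinstance(result, tuple) and len(result) == 2:
        num_turns, reward = result
        return int(num_turns), float(reward)
    return None, float(result)
```

This keeps older agents that return a plain reward working while letting updated agents like gsm8k_agent pass turn counts through to the Gateway.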
Summary

Significantly improves the async training pipeline, with a focus on ServerAdapter weight sync, robustness, and packaging.

Key Changes

Note

Includes some [DEBUG] logging for reward diagnostics; this can be cleaned up in a follow-up.