[fix] Improve async training pipeline: ServerAdapter weight sync, robustness, and packaging#11
Open
dafu-wu wants to merge 1 commit into AgentR1:main from
Conversation
…ustness, and packaging

Core improvements:

- async_main: Fix the Rollouter/Trainer lifecycle so the Trainer is not cancelled when the Rollouter finishes first; shut down the DataPool after the Trainer completes; propagate critical environment variables (PYTHONPATH, CUDA_HOME, etc.) to Ray workers
- async_rollouter: Use sys.executable instead of a hardcoded 'python'; log gateway stdout/stderr to files; use the FQDN hostname for multi-node gateway access; defer DataPool shutdown to async_main; gracefully handle validation failures
- async_trainer: Fix the need_reference_policy(config) call; remove the deprecated num_examine param; add debug logging for reward-computation diagnostics
- blackbox_agent_flow: Add HTTP POST retry with exponential backoff; handle transient connection errors during agent runs (reward=0 fallback); support _run_agent returning a (num_turns, reward) tuple
- gsm8k_agent/gsm8k_agent_flow: Return a (turns_used, reward) tuple from solve() for direct reward passing to the Gateway
- data_pool: Add get_last_step() for auto-reward computation; fix max_queue_size display when None
- training_backend: Add debug logging for reward status in VerlBackend.convert()
- detach_workers: Add ServerAdapter support (skip the NCCL broadcast in vLLM HTTP server mode); add extract_actor_weights() and receive_and_update_weights() for Ray object-store weight transfer; handle DTensor extraction
- gateway: Add auto-reward computation for black-box agents via RewardLoopWorker; improve vLLM proxy error handling
- param_sync: Add a ServerAdapter weight-sync path via the Ray object store as an alternative to NCCL; auto-detect ServerAdapter rollouts
- reward_loop: Fix import compatibility (reward_manager vs. reward_loop module); add safe defaults for reward_loop_source and the reward_manager config
- main_agent_ppo: Fix the need_reference_policy call; remove num_examine; set reward_fn as attributes to match the RayPPOTrainer API
- pyproject.toml: Add PEP 621 project metadata, build-system config, and setuptools package discovery for pip-installable packaging
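The HTTP POST retry with exponential backoff described for blackbox_agent_flow can be sketched as below. This is a minimal illustration, not the PR's actual code: the helper name `retry_with_backoff`, the retry limits, and the exception set are all assumptions; the real code wraps an HTTP POST to the Gateway.

```python
import time
import random

def retry_with_backoff(fn, max_retries=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff.

    Hypothetical helper illustrating the retry strategy in the PR;
    delays double each attempt (capped at max_delay) plus small jitter.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: let the caller decide (e.g. reward=0)
            # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_delay,
            # with jitter to avoid synchronized retry storms.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

When the final attempt still fails, the exception propagates and the caller can fall back to `reward=0`, matching the PR's handling of transient connection errors during agent runs.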
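Propagating environment variables to Ray workers, as async_main now does for PYTHONPATH and CUDA_HOME, is typically done through Ray's `runtime_env` with its `env_vars` field. The sketch below builds that dict from the local environment; the exact variable list and the helper name are assumptions, not the PR's implementation.

```python
import os

# Variables worth forwarding to workers; the precise list is an assumption
# based on the PR description (PYTHONPATH, CUDA_HOME, etc.).
CRITICAL_ENV_VARS = ("PYTHONPATH", "CUDA_HOME", "LD_LIBRARY_PATH")

def build_runtime_env(keys=CRITICAL_ENV_VARS):
    """Collect locally-set env vars into a Ray runtime_env dict.

    Usage (sketch): ray.init(runtime_env=build_runtime_env())
    Only variables actually set in the driver's environment are forwarded.
    """
    env_vars = {k: os.environ[k] for k in keys if k in os.environ}
    return {"env_vars": env_vars}
```

Without this, Ray workers inherit a minimal environment and can fail to import the project or locate the CUDA toolchain even though the driver process runs fine.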
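Supporting _run_agent (and solve()) returning either a bare reward or a (num_turns, reward) tuple usually takes a small normalization shim at the call site. The helper below is a hypothetical sketch of that backward-compatible handling; the name `normalize_agent_result` does not appear in the PR.

```python
def normalize_agent_result(result):
    """Accept either `reward` or `(num_turns, reward)` from an agent run.

    Returns a (num_turns, reward) pair; num_turns is None when the agent
    only reported a reward. Hypothetical shim illustrating the PR's
    tuple-or-scalar return-value support.
    """
    if isinstance(result, tuple) and len(result) == 2:
        num_turns, reward = result
        return int(num_turns), float(reward)
    return None, float(result)
```

This keeps older agents that return a plain reward working while letting updated agents like gsm8k_agent pass turn counts through to the Gateway.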
Summary

Significantly improves the async training pipeline, with a focus on ServerAdapter weight sync, robustness, and packaging.

Key Changes

Note

Includes some [DEBUG] logging for reward diagnostics; this can be cleaned up in a follow-up.