[fix] Improve async training pipeline: ServerAdapter weight sync, robustness, and packaging #11

Open
dafu-wu wants to merge 1 commit into AgentR1:main from dafu-wu:main

Conversation


@dafu-wu dafu-wu commented Apr 2, 2026

Summary

This PR significantly improves the async training pipeline, focusing on:

  1. ServerAdapter weight sync — new weight transfer via Ray object store for vLLM HTTP server mode
  2. Multi-node robustness — retry logic, graceful degradation, proper lifecycle management
  3. Packaging — PEP 621 metadata and import compatibility fixes
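The ServerAdapter weight-sync idea in item 1 — the trainer publishing weights through the Ray object store instead of an NCCL broadcast — can be illustrated with a minimal runnable sketch. To keep it self-contained, a plain dict stands in for Ray's object store (`ray.put`/`ray.get`), and the function names `extract_actor_weights` and `receive_and_update_weights` are taken from the change list below; everything else (the handle format, the tensor handling) is illustrative, not the PR's actual API:

```python
# Hypothetical sketch of the ServerAdapter weight-sync path. A plain dict
# stands in for the Ray object store so the example runs without Ray or
# torch installed; the real code would use ray.put / ray.get.
object_store = {}

def put(obj):
    """Mimic ray.put: store the object and return a handle."""
    ref = f"ref-{len(object_store)}"
    object_store[ref] = obj
    return ref

def get(ref):
    """Mimic ray.get: resolve a handle back to the stored object."""
    return object_store[ref]

def extract_actor_weights(state_dict):
    """Trainer side: materialize weights for transfer.

    In the real path, sharded DTensors would be gathered into full
    tensors here; this sketch just copies plain lists.
    """
    return {name: list(values) for name, values in state_dict.items()}

def receive_and_update_weights(model, ref):
    """Server side: fetch weights from the object store and load them."""
    weights = get(ref)
    model.update(weights)  # stand-in for loading into the vLLM server
    return model

# Trainer publishes weights; the vLLM HTTP server pulls them --
# no NCCL process group is required between the two.
actor_state = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
ref = put(extract_actor_weights(actor_state))
server_model = {}
receive_and_update_weights(server_model, ref)
```

The key design point this mirrors is that an object-store handoff only needs the trainer and the rollout server to share a Ray cluster, whereas NCCL broadcast requires both sides to join a collective group.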

Key Changes

  • Fix Rollouter/Trainer lifecycle: Rollouter finishing first no longer cancels Trainer
  • Add ServerAdapter detection and Ray object store weight sync path (alternative to NCCL)
  • Add HTTP POST retry with exponential backoff for transient errors
  • GSM8KAgent returns (turns_used, reward) for direct reward passing
  • Gateway auto-computes reward for black-box agents via RewardLoopWorker
  • Use FQDN hostname for gateway URL (multi-node accessibility)
  • Propagate critical env vars to Ray workers
  • Add PEP 621 project metadata for pip-installable packaging
  • Fix import compatibility for reward_loop module
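The HTTP POST retry with exponential backoff mentioned above can be sketched as follows. This is a minimal illustration, not the PR's actual helper: `post_with_retry` is a hypothetical name, and the callable-injection style is chosen here just to make the backoff logic easy to test:

```python
import time

def post_with_retry(do_post, max_retries=4, base_delay=0.5):
    """Retry a POST-like callable with exponential backoff.

    `do_post` is any zero-arg callable that raises ConnectionError on
    transient failures. Delays grow as base_delay * 2**attempt
    (0.5s, 1s, 2s, ...); the last failure is re-raised.
    """
    for attempt in range(max_retries):
        try:
            return do_post()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Example: a flaky endpoint that fails twice, then succeeds.
calls = {"n": 0}

def flaky_post():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return {"status": 200}

result = post_with_retry(flaky_post, base_delay=0.01)
```

The PR also describes a `reward=0` fallback when retries are exhausted during an agent run, which would correspond to catching the final re-raised error at the call site.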

Note

Includes some [DEBUG] logging for reward diagnostics — can be cleaned up in follow-up.
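The env-var propagation item from the key changes can be sketched with Ray's `runtime_env` mechanism, whose `env_vars` key forwards environment variables to workers. The variable list below is illustrative (the PR names PYTHONPATH and CUDA_HOME among the "critical" ones), and the `build_runtime_env` helper is hypothetical:

```python
import os

# Illustrative set of env vars the trainer needs Ray workers to inherit.
CRITICAL_ENV_VARS = ["PYTHONPATH", "CUDA_HOME", "LD_LIBRARY_PATH"]

def build_runtime_env():
    """Collect critical env vars into a Ray runtime_env dict.

    Only variables actually set in the current process are forwarded.
    The result would be passed as ray.init(runtime_env=...), which is
    Ray's supported mechanism for setting env vars on workers.
    """
    env_vars = {k: os.environ[k] for k in CRITICAL_ENV_VARS if k in os.environ}
    return {"env_vars": env_vars}

os.environ["PYTHONPATH"] = "/opt/agentr1/src"  # example value only
runtime_env = build_runtime_env()
```

Without this, Ray workers start with a minimal environment and can fail to import project modules or locate CUDA, which matches the multi-node failures the PR is addressing.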

…ustness, and packaging

Core improvements:
- async_main: Fix Rollouter/Trainer lifecycle so Trainer is not cancelled when
  Rollouter finishes first; shutdown DataPool after Trainer completes; propagate
  critical env vars (PYTHONPATH, CUDA_HOME, etc.) to Ray workers
- async_rollouter: Use sys.executable instead of hardcoded 'python'; log gateway
  stdout/stderr to files; use FQDN hostname for multi-node gateway access; defer
  DataPool shutdown to async_main; gracefully handle validation failures
- async_trainer: Fix need_reference_policy(config) call; remove deprecated
  num_examine param; add debug logging for reward computation diagnostics
- blackbox_agent_flow: Add HTTP POST retry with exponential backoff; handle
  transient connection errors during agent runs (reward=0 fallback); support
  _run_agent returning (num_turns, reward) tuple
- gsm8k_agent/gsm8k_agent_flow: Return (turns_used, reward) tuple from solve()
  for direct reward passing to Gateway
- data_pool: Add get_last_step() for auto-reward computation; fix max_queue_size
  display when None
- training_backend: Add debug logging for reward status in VerlBackend.convert()
- detach_workers: Add ServerAdapter support — skip NCCL broadcast for vLLM HTTP
  server mode; add extract_actor_weights() and receive_and_update_weights() for
  Ray object store weight transfer; handle DTensor extraction
- gateway: Add auto-reward computation for black-box agents via RewardLoopWorker;
  improve vLLM proxy error handling
- param_sync: Add ServerAdapter weight sync path via Ray object store as
  alternative to NCCL; auto-detect ServerAdapter rollouts
- reward_loop: Fix import compatibility (reward_manager vs reward_loop module);
  add safe defaults for reward_loop_source and reward_manager config
- main_agent_ppo: Fix need_reference_policy call; remove num_examine; set
  reward_fn as attributes to match RayPPOTrainer API
- pyproject.toml: Add PEP 621 project metadata, build-system config, and
  setuptools package discovery for pip-installable packaging
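The PEP 621 packaging item above would look roughly like the fragment below. This is a sketch: the project name, version, and package pattern are placeholders, not the values actually committed in the PR.

```toml
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "agent-r1"         # placeholder; actual name set in the PR
version = "0.1.0"         # placeholder
requires-python = ">=3.9"

[tool.setuptools.packages.find]
include = ["agent_r1*"]   # placeholder package-discovery pattern
```

With metadata in this form, `pip install .` (or `pip install -e .` for development) works directly from the repository root.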
