support determinisitc RL training with sglang#1644
support determinisitc RL training with sglang#1644YanhuiDua wants to merge 2 commits intoInternLM:mainfrom
Conversation
|
@claude review |
| data_path = os.environ["DATA_PATH"] | ||
| eval_data_path = os.environ["EVAL_DATA_PATH"] | ||
| enable_evaluate = True if eval_data_path != "" else False | ||
| enbale_partial_rollout = int(os.environ.get("ENBALE_PARTIAL_ROLLOUT", "0")) |
There was a problem hiding this comment.
Claude: Typo: enbale_partial_rollout should be enable_partial_rollout. The env var name ENBALE_PARTIAL_ROLLOUT has the same typo.
I see this typo is inherited from other config files (rl_qwen3_8B_grpo.py, rl_qwen25_7B_dapo.py), so it's a pre-existing issue. However, since this is a rename/cleanup of this config file, it would be a good opportunity to fix it here and in the env var name.
| stops: Annotated[list[str], Parameter(help="List of stop sequences.")] = [] | ||
| stop_token_ids: Annotated[list[int], Parameter(help="List of stop token IDs.")] = [] | ||
| skip_special_tokens: Annotated[bool, Parameter(help="Whether to skip special tokens.")] = True | ||
| sampling_seed: Annotated[int, Parameter(help="The seed for random number generator in sampling.")] = 0 |
There was a problem hiding this comment.
Claude: A default sampling_seed of 0 can be ambiguous — some backends treat 0 as "use random seed" while others treat it as an actual seed value. Consider using None as the default with int | None type, and only pass it to the backend when XTUNER_DETERMINISTIC is enabled. This would make the non-deterministic path clearer and avoid accidentally seeding when determinism isn't intended.
| if self.rollout_controller: | ||
| response_future = [] | ||
| for sample in group_data_items: | ||
| for i, sample in enumerate(group_data_items): |
There was a problem hiding this comment.
Claude: copy.deepcopy(sample_params) is now called unconditionally for every sample in the loop, whereas before it was only called in the partial rollout branch. This is a performance regression for the non-deterministic path — deep-copying a Pydantic BaseModel on every iteration is unnecessary overhead when the params aren't being mutated.
Consider guarding the deepcopy:
| for i, sample in enumerate(group_data_items): | |
| update_sample_params = copy.deepcopy(sample_params) if XTUNER_DETERMINISTIC else sample_params |
| for sample in group_data_items: | ||
| for i, sample in enumerate(group_data_items): | ||
| sample.data.extra_info["root_id"] = sample.uid.root_id | ||
| sample.data.extra_info["action_id"] = sample.uid.action_id |
There was a problem hiding this comment.
Claude: self.rollout_cfg is accessed here to get random_seed, but rollout_cfg can be None (see the __init__ signature at line 44: rollout_cfg=None). If XTUNER_DETERMINISTIC is True and rollout_cfg is None, this will raise AttributeError. Add a guard or assert that rollout_cfg is not None when deterministic mode is enabled.
|
|
||
| self.server_func = launch_server | ||
| self.endpoints["health_generate"] = "health_generate" | ||
| self.endpoints["health_generate"] = "health" |
There was a problem hiding this comment.
Claude: Changing the health endpoint from "health_generate" to "health" is a behavioral change that isn't related to deterministic inference. The "health_generate" endpoint checks if the server can actually handle generate requests, while "health" may only confirm the HTTP server is up. This could mask readiness issues where the server is alive but not ready to serve inference.
Was this change intentional? If so, please document the reason (e.g., SGLang version compatibility).
| sglang_server_args.enable_deterministic_inference = enable_deterministic_inference | ||
|
|
||
| if XTUNER_DETERMINISTIC: | ||
| sglang_server_args.enable_deterministic_inference = True |
There was a problem hiding this comment.
Claude: rl_on_policy_target is not a recognized attribute of SGLang's ServerArgs as of the latest public SGLang release. This will raise an AttributeError (or be silently ignored if SGLang uses a permissive config model) on standard SGLang installations.
Is this targeting a custom/internal fork of SGLang? If so, please add a comment noting the required SGLang version or fork, so users don't hit runtime errors with standard SGLang.
| train_log_info = {} | ||
| eval_log_info = {} |
There was a problem hiding this comment.
Claude: Initializing train_log_info = {} and eval_log_info = {} before the if not self._debug_rollout block fixes a potential UnboundLocalError when _debug_rollout is True. Good fix — but this is a pre-existing bug fix unrelated to deterministic training. Per the project's PR standards, consider splitting this into a separate [Fix] commit/PR to keep this PR focused on the deterministic feature.
|
|
||
| def _sort_rollout_outputs(self, data_groups, multimodal_train_infos=None): |
There was a problem hiding this comment.
Claude: The _sort_rollout_outputs method lacks a docstring (required for public methods per project standards). Also, the multimodal_train_infos parameter should have a type annotation:
def _sort_rollout_outputs(self, data_groups: list[list[RLDataFlowItem]], multimodal_train_infos: list | None = None):| # examples of usage: | ||
| # qwen3_8B_grpo_gsm8k training: | ||
| # bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen3_8B_grpo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH | ||
| # qwen2.5_7B_dapo_math training: | ||
| # bash examples/v1/scripts/run_rl.sh examples/v1/config/rl_qwen25_7B_dapo.py "sglang" $MODEL_PATH $DATA_PATH $EVAL_DATA_PATH |
There was a problem hiding this comment.
Claude: The usage comments reference run_rl.sh but this script is run_rl_deterministic.sh. Also, the usage shows 5 positional args (CONFIG_PATH, "sglang", MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) but the script only accepts 4 (CONFIG_PATH, MODEL_PATH, DATA_PATH, EVAL_DATA_PATH) — it doesn't take an inference backend argument. These comments appear to be copy-pasted from run_rl.sh without updating.
| export WORK_DIR="${DIR}/work_dirs/${model_dir_name}_${data_dir_name}_${infer_backend_lower}" | ||
| else | ||
| export WORK_DIR=$WORK_DIR | ||
| fi |
There was a problem hiding this comment.
Claude: infer_backend_lower is used here but never defined in this script. This will produce a work directory path ending with an empty string (e.g., work_dirs/model_data_). Looks like it was copied from run_rl.sh where the inference backend variable exists.
|
Claude: ## Summary This PR adds deterministic RL training support with SGLang by introducing seed propagation, deterministic ID generation, and output sorting. It also renames the test config to fix a typo ( IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES |
Uh oh!
There was an error while loading. Please reload this page.