
fix rl mem leak #1646

Open

RangiLyu wants to merge 2 commits into InternLM:main from RangiLyu:lcq/rl-mem

Conversation

RangiLyu (Collaborator) commented Apr 1, 2026

No description provided.

- Introduced new environment variables for RL memory monitoring: XTUNER_RL_MEM_INTERVAL, XTUNER_RL_OBJECT_LIMIT, and XTUNER_RL_OBJECT_TOP_K.
- Updated run_rl.sh and run_rl_submit.sh to utilize these new variables for configuring memory monitoring.
- Enhanced rl_monitor_actor_memory function to accept additional parameters for object limit and top K objects to monitor.
- Added a new summarize_group_payload function in replay_buffer.py to provide detailed statistics on grouped data items.
- Implemented memory reference management improvements in controller.py and replay_buffer.py to optimize memory usage during training.

These changes aim to improve the flexibility and efficiency of memory monitoring in RL training workflows.
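For reference, the new variables could be read on the Python side along these lines; the defaults and integer parsing below are assumptions for illustration, not taken from the diff:

```python
import os

# Variable names come from this PR; the default values here are illustrative only.
mem_interval = int(os.environ.get("XTUNER_RL_MEM_INTERVAL", "30"))    # seconds between snapshots
object_limit = int(os.environ.get("XTUNER_RL_OBJECT_LIMIT", "1000"))  # max objects to track
object_top_k = int(os.environ.get("XTUNER_RL_OBJECT_TOP_K", "20"))    # largest objects to report
```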
RangiLyu (Collaborator, Author) commented Apr 1, 2026

@claude review

Comment on lines +232 to +233
return libc.malloc_trim(0)
except Exception as e:

Claude: Nit: _TRIM_MEMORY_WARNED is a module-level global mutated without any locking. Since trim_memory is called from BaseMLLMTokenizeFn.__call__ which runs in data-loading workers (potentially multi-threaded), there's a small race window where multiple threads see False and all log the warning. Not a correctness issue, but if you want to be tidy a threading.Lock or just removing the "warn once" logic (always warn or never warn) would be cleaner.
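A tidier version of the warn-once logic could take a lock around the check-and-set. A minimal sketch, assuming the surrounding function looks roughly like this (only the two quoted lines come from the diff; the logger and module globals are assumed):

```python
import ctypes
import ctypes.util
import logging
import threading

logger = logging.getLogger(__name__)

_TRIM_MEMORY_WARNED = False
_TRIM_MEMORY_LOCK = threading.Lock()


def trim_memory() -> int:
    """Ask glibc to return freed heap pages to the OS via malloc_trim(0)."""
    global _TRIM_MEMORY_WARNED
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        return libc.malloc_trim(0)
    except Exception as e:
        # Check-and-set under the lock so at most one thread logs the warning.
        with _TRIM_MEMORY_LOCK:
            if not _TRIM_MEMORY_WARNED:
                _TRIM_MEMORY_WARNED = True
                logger.warning("malloc_trim unavailable: %s", e)
        return 0
```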

"judged_observations": 0,
"has_multimodal_prompt": False,
}
if not grouped_dataitem:

Claude: Nit: summarize_group_payload is a public (module-level) function. Per project standards it should have a docstring with Google Style format, parameter types, and return type.

Comment on lines 29 to +35
def __init__(self, workers: list[TrainingWorker]) -> None:
self.workers = workers

def _collect_object_refs(self, obj, refs: list[ObjectRef]):
if isinstance(obj, ObjectRef):
refs.append(obj)
return

Claude: Warning: _collect_object_refs recurses into lists/tuples but doesn't handle dict values. If pixel_values or rollout_routed_experts ever hold refs inside a dict (e.g., via extra_info), those would be missed.

Not necessarily a problem today since pixel_values is torch.FloatTensor | None and rollout_routed_experts is torch.Tensor | None, but the recursive traversal pattern suggests it's intended to be generic. If so, consider also handling dicts. If not, the method name/docstring should clarify it only handles lists/tuples of ObjectRef.
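If the traversal is meant to be generic, extending it to dict values is a one-branch change. A standalone sketch, with an `is_ref` predicate standing in for `isinstance(obj, ObjectRef)` so the snippet runs without Ray:

```python
from typing import Any, Callable


def collect_object_refs(obj: Any, is_ref: Callable[[Any], bool], refs: list) -> None:
    """Recursively collect ref-like objects from nested lists, tuples, and dicts."""
    if is_ref(obj):
        refs.append(obj)
        return
    if isinstance(obj, (list, tuple)):
        for item in obj:
            collect_object_refs(item, is_ref, refs)
    elif isinstance(obj, dict):
        # Keys are assumed to be plain strings; only values can hold refs.
        for value in obj.values():
            collect_object_refs(value, is_ref, refs)
```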

Comment on lines +387 to +392
def _strip_rollout_payload_for_rerun(self, replay_meta: ReplayMeta, new_state: RolloutState):
"""Keep prompt refs only and drop rollout outputs that will not be reused."""
old_obs_refs = [ref for ref in replay_meta.observation_refs if ref is not None]
if old_obs_refs:
ray.internal.free(old_obs_refs, local_only=False)
replay_meta.observation_refs = [ray.put(RLEnvDataItem()) for _ in replay_meta.observation_ids]

Claude: Warning: Each call to _strip_rollout_payload_for_rerun creates N new Ray objects via ray.put(RLEnvDataItem()) — one per observation ID. For actions with many observations, or when many actions are stripped (expired/aborted batches), this could accumulate significant overhead in the object store.

Consider caching a single empty ref and reusing it:

empty_ref = ray.put(RLEnvDataItem())
replay_meta.observation_refs = [empty_ref for _ in replay_meta.observation_ids]

Since RLEnvDataItem() is always identical, a single shared object ref avoids N redundant copies in the object store.

Comment on lines 889 to 894
for action_id in self._root2actions[root_id]:
if action_id == current_action_id:
continue
new_replay_meta = self._actions.pop(action_id, None)
if new_replay_meta:
self._clear_meta_for_actions(new_replay_meta)

Claude: Nit: _clear_meta_for_actions now pops from self._actions internally (line 866 — good, makes it a proper "single source of truth" for cleanup). However, this loop still does self._actions.pop(action_id, None) before calling _clear_meta_for_actions(new_replay_meta), resulting in a redundant double-pop on self._actions for the same key.

Not a bug (the second pop returns None harmlessly), but it's confusing for readers since _clear_meta_for_actions is documented as "the single source of truth for deleting an action." Consider letting _clear_meta_for_actions own the pop:

for action_id in self._root2actions[root_id]:
    if action_id == current_action_id:
        continue
    new_replay_meta = self._actions.get(action_id)
    if new_replay_meta:
        self._clear_meta_for_actions(new_replay_meta)


claude bot commented Apr 1, 2026

Claude: ## Summary

This PR fixes Ray object reference memory leaks in RL training by:

  1. Freeing Ray object refs (action_ref, observation_refs) when actions are cleared or stripped
  2. Stripping rollout payloads (keeping only prompts) for expired/aborted actions via _strip_rollout_payload_for_rerun
  3. Freeing pixel_values and rollout_routed_experts refs after ray.get in controller.fit
  4. Adding trim_memory (glibc malloc_trim) to reclaim heap pages during data loading
  5. Fixing PIL Image.open resource leaks by using context managers
  6. Expanding memory monitoring with Ray object store stats

Issues

Warning

  • replay_buffer.py:387-392: _strip_rollout_payload_for_rerun creates N new ray.put(RLEnvDataItem()) objects per call. A single cached empty ref would avoid redundant object store entries.
  • controller.py:29-35: _collect_object_refs recurses into lists/tuples but not dicts. If this is meant to be generic, consider handling dicts too; if not, the scope should be documented.
  • misc.py:232-233: the _TRIM_MEMORY_WARNED global is not thread-safe; a minor race is possible in multi-threaded data loading.

Nit

  • replay_buffer.py:78: summarize_group_payload is a public function missing a docstring.
  • replay_buffer.py:889-894: the _clear_meta_for_root loop still pops from self._actions before calling _clear_meta_for_actions, which now also pops; the result is a redundant double-pop.

Verdict

COMMENT — The core memory leak fixes (ref freeing in replay buffer cleanup paths, controller ref freeing, PIL context managers) look correct and well-targeted. The monitoring enhancements are a nice addition. The items above are minor improvements, nothing blocking.
