LR scheduler progress can be inconsistent with dynamic batching in Megatron actor training #442

@56546256576885

Summary

There appears to be a mismatch between how the LR scheduler budget is computed and how it is consumed in the Megatron actor training path when dynamic batching is enabled.

The scheduler budget appears to be computed from nominal/static batch semantics, for example based on rollout_batch_size, ppo_epochs, per_device_train_batch_size, and gradient_accumulation_steps.

However, with use_dynamic_batching_in_train: true, the actor train batch is first split into token-bounded dynamic micro-batches. These micro-batches are then grouped by gradient_accumulation_steps, so the number of scheduler-consuming optimizer updates can scale roughly as ceil(num_dynamic_micro_batches / gradient_accumulation_steps), rather than being determined only by the nominal sample-level batch formula.
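
To make the two counts concrete, here is a minimal sketch of how they can diverge. It is not the actual code path: `pack_by_tokens`, `nominal_updates`, and `actual_updates` are hypothetical helpers, the packing is a simple greedy pass, the nominal formula is my assumption about the static semantics, and a single data-parallel rank is assumed.

```python
import math


def pack_by_tokens(seq_lens, max_tokens):
    """Greedily pack sequences into micro-batches whose token sum stays <= max_tokens.

    A sequence longer than max_tokens still gets a micro-batch of its own.
    """
    batches, current, used = [], [], 0
    for n in seq_lens:
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


def nominal_updates(num_samples, per_device_train_batch_size, gradient_accumulation_steps):
    """Optimizer updates expected under static, sample-level batching (assumed formula)."""
    samples_per_update = per_device_train_batch_size * gradient_accumulation_steps
    return math.ceil(num_samples / samples_per_update)


def actual_updates(seq_lens, max_tokens, gradient_accumulation_steps):
    """Optimizer updates after token-bounded packing and gradient-accumulation grouping."""
    num_micro_batches = len(pack_by_tokens(seq_lens, max_tokens))
    return math.ceil(num_micro_batches / gradient_accumulation_steps)


# Illustrative numbers only: 32 samples of 6000 tokens each.
seq_lens = [6000] * 32
print(nominal_updates(len(seq_lens), 4, 4))   # static budget
print(actual_updates(seq_lens, 8192, 4))      # tight token cap
print(actual_updates(seq_lens, 32768, 4))     # loose token cap
```

With these made-up inputs, the tight token cap yields one sequence per micro-batch and four times as many optimizer updates as the static formula, while the loose cap happens to match it. Only the token cap changed between the two calls.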

As a result, changing dynamic batching behavior may change how fast the LR scheduler consumes its step budget, even when the intended logical training batch semantics are unchanged.

Potential issue

The code seems to mix two notions of training progress:

  • scheduler budget: nominal/static sample-level batch progress
  • scheduler consumption: actual optimizer update count after token-level dynamic batching and gradient-accumulation grouping

This can cause the LR schedule to progress faster than intended when dynamic batching produces more actual optimizer updates than the nominal static-batch formula expects.

In one long-context agentic RL run, I observed the saved scheduler state progressing far beyond the configured decay horizon: lr_decay_steps = 2400, while num_steps = 8429, which is about 3.5 times the horizon.

This example is only meant as a symptom. The underlying concern is more general: it may be unexpected for LR scheduler progress to depend on token-level dynamic batching decisions if the logical training batch configuration is unchanged.

Expected behavior

Changing max_tokens_per_microbatch_in_train should affect memory usage and internal batching, but it should not silently accelerate LR decay relative to the nominal training budget.

The LR scheduler should either explicitly follow the actual optimizer update count after dynamic batching, or follow the nominal/static training progress implied by the user configuration.

If the second behavior is intended, then scheduler progress may need to be normalized when dynamic batching changes the actual number of optimizer updates.

Possible direction

One possible fix direction is to compute, for each actor train batch:

  • nominal_steps: expected optimizer update count under static batch semantics
  • actual_steps: actual scheduler-consuming optimizer update count after dynamic batching
  • scheduler_step_increment = nominal_steps / actual_steps

Then the scheduler can advance by normalized progress rather than one full step per actual optimizer update.

The exact implementation does not have to use this API literally. The main point is to decouple LR scheduler progress from token-level dynamic batching when the logical training progress is unchanged.

Questions

Could you clarify the intended LR scheduler semantics for Megatron actor training with dynamic batching?

  1. Should the scheduler follow the actual optimizer update count after dynamic batching?
  2. Or should it follow the nominal/static training progress from the configured batch semantics?
  3. If the latter is intended, would normalizing scheduler progress under dynamic batching be an acceptable fix direction?
