
Problem reproducing the mix_chord algorithm #425

@23557-l

Description

Hello. I've recently been using your framework to try to reproduce the performance of qwen2.5-math-7b-base on the LUFFY dataset. The YAML is below; after it I've added two small sketches showing how I read the batch-size arithmetic and the mu schedule:

project: "mix_chord"
name: "mix_chord_math_qwen2.5-math_luffy"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,/r-contentsecurity/share/datas_yl/yanlong/trinity_ckpt}
algorithm:
  algorithm_type: mix_chord
  repeat_times: 8 # or 16 for better performance in math-related tasks
  kl_loss_fn_args:
    kl_coef: 0.00
  sample_strategy_args:
    expert_data_ratio: 0.111
  policy_loss_fn_args: # feel free to change, we encourage you to try out different hyperparameters
    mu_warmup_steps: 0  # 0 for chord-mu and chord-phi
    mu_decay_steps: 0 # 200 for chord-mu and 0 for chord-phi
    mu_peak: 0.5 # 0.9 for chord-mu and 0.1 for chord-phi
    mu_valley: 0.1 # 0.05 for chord-mu and 0.1 for chord-phi
    enable_phi_function: true # false for chord-mu and true for chord-phi
    clip_range: 0.28
    sft_loss_agg_mode: "token-mean"
    use_dynamic_bsz: true
    ppo_mini_batch_size: 576 # 576 = 512 + 64 (usual + expert); if you set repeat_times = 16, it should be 64 * 16 + 64
    ppo_micro_batch_size_per_gpu: 8
    ngpus_trainer: 8
    train_batch_size_expert: 64
    train_batch_size_usual: 512 # 64 batch_size * 8 repeat_times
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,/r-contentsecurity/share/datas_yl/yanlong/checkpoints/Qwen2.5-Math-7B}
  max_response_tokens: 3072
  max_model_len: 4096
cluster:
  node_num: 1
  gpu_per_node: 16
buffer:
  total_epochs: 1
  batch_size: 64
  train_batch_size: 576
  explorer_input:
    taskset:
      name: math_aligned
      storage_type: file
      path: ${oc.env:TRINITY_TASKSET_PATH, /mnt/nas/yanlong/data-important/ContinualRL/datasets/data/chord/luffy/rl}
      split: 'train'
      format:
        prompt_key: 'question'
        response_key: 'answer'
        system_prompt: "Your task is to follow a systematic, thorough reasoning process before providing the final solution. This involves analyzing, summarizing, exploring, reassessing, and refining your thought process through multiple iterations. Structure your response into two sections: Thought and Solution. In the Thought section, present your reasoning using the format: \"<think>\n {thoughts} </think>\n\". Each thought should include detailed analysis, brainstorming, verification, and refinement of ideas. After \"</think>\n,\" in the Solution section, provide the final, logical, and accurate answer, clearly derived from the exploration in the Thought section. If applicable, include the answer in \\boxed{} for closed-form results like multiple choices or mathematical solutions."
      rollout_args:
        temperature: 1.0
        logprobs: 0
      workflow_args:
        with_think: false
    eval_tasksets:
    - name: AIME2024
      storage_type: file
      path: ${oc.env:TRINITY_TASKSET_PATH, /mnt/nas/yanlong/datasets/aime_2024}  # e.g. path to AIME2024
      split: 'test'
      repeat_times: 8
      format:
        prompt_key: 'Problem'
        response_key: 'Answer'
        system_prompt: "Your task is to follow a systematic, thorough reasoning process before providing the final solution. This involves analyzing, summarizing, exploring, reassessing, and refining your thought process through multiple iterations. Structure your response into two sections: Thought and Solution. In the Thought section, present your reasoning using the format: \"<think>\n {thoughts} </think>\n\". Each thought should include detailed analysis, brainstorming, verification, and refinement of ideas. After \"</think>\n,\" in the Solution section, provide the final, logical, and accurate answer, clearly derived from the exploration in the Thought section. If applicable, include the answer in \\boxed{} for closed-form results like multiple choices or mathematical solutions."
      rollout_args:
        temperature: 1.0
        top_p: 0.7
    default_workflow_type: 'math_boxed_workflow'
  trainer_input:
    experience_buffer:
      name: math_buffer
      storage_type: queue
      path: 'sqlite:////mnt/nas/yanlong/data-important/cache/mix_chord_math_qwen2.5-math.db'
    auxiliary_buffers:
      sft_dataset:
        total_epochs: 1
        name: SFT_data
        storage_type: file
        schema_type: sft
        path: ${oc.env:TRINITY_SFT_DATASET_PATH, /mnt/nas/yanlong/data-important/ContinualRL/datasets/data/chord/luffy/sft}
        split: 'train'
        format:
          prompt_type: messages
          messages_key: 'messages'
explorer:
  eval_interval: 10
  runner_per_model: 8
  rollout_model:
    engine_num: 8
    tensor_parallel_size: 1
    enable_prefix_caching: true
    enforce_eager: false
    dtype: bfloat16
    seed: 42
synchronizer:
  sync_method: 'nccl'
  sync_interval: 1
  sync_timeout: 1200
trainer:
  save_interval: 50
  trainer_config:
    actor_rollout_ref:
      model:
        use_remove_padding: true
      actor:
        use_dynamic_bsz: true
        ppo_max_token_len_per_gpu: 25600
        ulysses_sequence_parallel_size: 2
        optim:
          lr: 1e-6 # or 5e-6, larger lr with warm up can result in better performance for SFT training.
      ref:
        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}
monitor:
  monitor_type: wandb
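
For reference, here is how I understand the batch-size arithmetic in this config. This is just a sanity-check sketch in Python; the variable names mirror the YAML keys above and none of it is Trinity-RFT API:

batch_size = 64                 # buffer.batch_size (tasks per explorer step)
repeat_times = 8                # algorithm.repeat_times (rollouts per task)
train_batch_size_expert = 64    # SFT (expert) samples mixed into each step

train_batch_size_usual = batch_size * repeat_times                      # 64 * 8 = 512
ppo_mini_batch_size = train_batch_size_usual + train_batch_size_expert  # 512 + 64 = 576
expert_data_ratio = train_batch_size_expert / ppo_mini_batch_size       # 64 / 576 ~ 0.111

assert ppo_mini_batch_size == 576            # matches buffer.train_batch_size
assert round(expert_data_ratio, 3) == 0.111  # matches sample_strategy_args

# Token budget: prompt + response must fit into max_model_len.
max_response_tokens, max_model_len = 3072, 4096
assert max_response_tokens < max_model_len   # leaves 1024 tokens for the prompt

And here is how I read the mu_* knobs. This is purely illustrative, not the actual CHORD implementation; my understanding from the paper is that mu weights the SFT loss against the RL loss, roughly loss = mu * sft_loss + (1 - mu) * rl_loss, with mu warmed up to mu_peak and then decayed to mu_valley:

def mu_schedule(step, warmup, decay, peak, valley):
    # Illustrative only: linear warmup to peak, then linear decay to valley.
    if warmup > 0 and step < warmup:
        return peak * step / warmup
    if decay > 0 and step < warmup + decay:
        return peak + (valley - peak) * (step - warmup) / decay
    return valley

# With mu_warmup_steps = 0 and mu_decay_steps = 0 as above, this reading is
# constant at mu_valley = 0.1, so mu_peak = 0.5 would never take effect.
print([mu_schedule(s, 0, 0, 0.5, 0.1) for s in range(3)])  # [0.1, 0.1, 0.1]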

However, the results don't look ideal. The training reward and the AIME eval accuracy are shown below:

[Two screenshots: training reward curve and AIME eval accuracy curve]

Is there a problem with any of these settings?
