Add verl-FL-fep #31

Open
heavyrain-lzy wants to merge 8 commits into flagos-ai:main from heavyrain-lzy:add_verl-fep
@heavyrain-lzy

Add verl-FL-fep

@JosephNew JosephNew left a comment

Contributor

Review: the test plan needs to be expanded

✅ CPU unit tests: OK

All 5 test commands correspond to files that exist in the repository:

tests/plugin/test_platform_abstraction.py
tests/plugin/test_device_on_cpu.py
tests/plugin/test_engine_registry_on_cpu.py
tests/plugin/test_fl_env_manager_on_cpu.py
tests/special_sanity/check_device_api_usage.py
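
The existence check described above can be reproduced with a short loop. This is a minimal sketch, assuming it is run from the repository root; the variable names (`files`, `found`, `missing`, `total`) are illustrative and not part of the FEP.

```shell
# Sketch: confirm that each file named in the CPU test plan exists.
# Assumes the current directory is the repository root.
files="
tests/plugin/test_platform_abstraction.py
tests/plugin/test_device_on_cpu.py
tests/plugin/test_engine_registry_on_cpu.py
tests/plugin/test_fl_env_manager_on_cpu.py
tests/special_sanity/check_device_api_usage.py
"

found=0
missing=0
total=0
for f in $files; do
    total=$((total + 1))
    if [ -f "$f" ]; then
        found=$((found + 1))
    else
        # Report the gap instead of aborting, so all files are checked.
        echo "missing: $f"
        missing=$((missing + 1))
    fi
done
echo "$found of $total test files present"
```

Counting instead of exiting on the first missing file means one run reports every gap in the plan.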

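❌ E2E training: missing

The test plan table states an expected result for CUDA E2E GRPO training but gives no concrete command. The script examples/grpo_trainer/run_qwen3-0.6b_fl.sh exists in the repository, but the FEP does not reference it, and it contains hard-coded paths (/share/project/lizhiyu/...), so a third party cannot run it as-is. Required:

  • Reference the script explicitly in the test plan
  • Explain how to replace the model path, data path, and FlagCX path
  • State the expected training convergence metrics (e.g., the reward/accuracy value to reach)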
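
One common way to make such paths replaceable is to read them from environment variables with fallback defaults. The following is only a sketch of that pattern: the variable names (`MODEL_PATH`, `DATA_DIR`, `FLAGCX_PATH`) and the default locations are hypothetical and are not taken from run_qwen3-0.6b_fl.sh.

```shell
# Sketch only: variable names and default locations are hypothetical,
# not the ones used by run_qwen3-0.6b_fl.sh.

# Each path can be overridden from the environment; the fallback is a placeholder.
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen3-0.6B}"
DATA_DIR="${DATA_DIR:-$HOME/data}"
FLAGCX_PATH="${FLAGCX_PATH:-$HOME/FlagCX}"

# Warn about paths that do not exist so a third party knows what to set.
missing=0
for var in MODEL_PATH DATA_DIR FLAGCX_PATH; do
    eval "value=\${$var}"
    if [ ! -e "$value" ]; then
        echo "warning: $var=$value does not exist" >&2
        missing=$((missing + 1))
    fi
done

echo "resolved: model=$MODEL_PATH data=$DATA_DIR flagcx=$FLAGCX_PATH"
```

A third party could then run, for example, `MODEL_PATH=/my/model DATA_DIR=/my/data bash <script>` without editing the script itself.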

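❌ MetaX platform: no commands

The test table lists expected results for MetaX but contains no test command or verification step. Please add concrete test commands and the environment requirements (MACA version, etc.).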

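❌ MUSA heterogeneous training: no commands

This is the core feature of v0.2.0 (mixed NVIDIA + MUSA distributed training), yet the test plan contains no executable verification steps at all. Please add the launch commands for the heterogeneous deployment, the FlagCX configuration, and the expected communication/training results.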

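Suggestion

Follow the granularity of the test plan in FEP #4 (KernelGenBench): each verification item should include at least a test command, the environment requirements, and the expected result.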

@heavyrain-lzy

heavyrain-lzy commented Jun 9, 2026

Author

Hi @JosephNew, the E2E training and MetaX platform E2E tests have been added and are ready for testing. @physics31415926 has added the heterogeneous tests.

@JosephNew

Contributor

@heavyrain-lzy Thanks for the update. The three gaps identified in the 6/3 review have all been addressed:

  • NVIDIA E2E — Docker image + run script + expected output ✅
  • MetaX E2E — Full pipeline from image pull → FlagCX/FlagGems/vllm-plugin-FL install → path customization → run ✅
  • MUSA heterogeneous training — NVIDIA + MUSA cross-node with Ray + FlagCX, including pre-verification via torchrun ✅

The test plans are now detailed enough for execution. Hardware requirements (MetaX C500, dual-node InfiniBand for MUSA) are noted — testing will proceed as those resources become available.

3 participants