diff --git a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_examples.md b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_examples.md
index d1f4ae005..22385bcd7 100644
--- a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_examples.md
+++ b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_examples.md
@@ -12,7 +12,7 @@ Before running these examples, ensure you have:
 2. Verified the environment inside the container (see [Verify the Environment](ascend_docker_usage.md#verify-the-environment)).
 3. Downloaded the model weights to a directory accessible from inside the container.
 
-The repository currently includes runnable Ascend examples in `examples/ascend_examples`, including `qwen3_8b_rlvr_deepspeed.yaml`, `qwen3_4B_dpo_deepspeed.yaml`, `run_rlvr_pipeline.sh`, and `run_dpo_pipeline.sh`.
+The repository currently includes a runnable Ascend RLVR example in `examples/ascend_examples`, including `qwen3_8b_rlvr_deepspeed.yaml` and `run_rlvr_pipeline.sh`.
 
 ## Key Differences from GPU
 
diff --git a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_faq.md b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_faq.md
index 8540180a7..ea49f57a9 100644
--- a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_faq.md
+++ b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_faq.md
@@ -54,7 +54,29 @@ These commands are automatically added to `/root/.bashrc` during the Docker imag
 - **Atlas 900 A2 PODc** → Use `roll:ascend-a2` (`ascend910b1`)
 - **Atlas 900 A3 PODc** → Use `roll:ascend-a3` (`ascend910_9391`)
 
-The current repository does not include `Dockerfile.A2` or `Dockerfile.A3`. If you maintain a custom image, ensure its SOC version matches the target hardware.
+The current repository includes `docker/Dockerfile.A2` and `docker/Dockerfile.A3` for building custom images. If you maintain a custom image, ensure its SOC version matches the target hardware.
+
+### Disable FRACTAL_NZ Mode
+
+**Symptom:** Enabling NZ optimization mode during reinforcement learning is likely to cause precision issues. vLLM-Ascend includes a check for this, and if NZ mode is enabled, it may raise the following error: `ValueError: FRACTAL_NZ mode is enabled. This may cause model parameter precision issues in the RL scenarios.`
+
+**Solution:** Before running the startup script, add the following environment variable to disable NZ mode:
+
+```bash
+export VLLM_ASCEND_ENABLE_NZ=0
+```
+
+### HCCL Parameter Plane Port Binding Failure
+
+**Symptom:** When the current rank or process establishes a communication operator on the parameter plane, binding the device-side NIC port fails because the port is already occupied. The error may look like: `The IP address XXXX and port XXXX have already been bound`.
+
+**Solution:**
+
+1. HCCL uses the device-side NIC port and binds to port 16666 by default. Therefore, if multiple processes run on the same device and all call HCCL communication operator APIs, the port may already be bound by another process, causing the failure.
+2. First check whether running multiple processes on the same device is expected for your workload. If it is expected, enable multi-process scenarios by configuring the `HCCL_NPU_SOCKET_PORT_RANGE` environment variable, for example:
+   ```bash
+   export HCCL_NPU_SOCKET_PORT_RANGE="auto"
+   ```
 
 ## Dependency Conflicts
 
diff --git a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_rlvr.md b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_rlvr.md
index c0ee2b59e..d51d57bea 100644
--- a/docs_roll/docs/User Guides/Hardware Support/ascend_npu_rlvr.md
+++ b/docs_roll/docs/User Guides/Hardware Support/ascend_npu_rlvr.md
@@ -40,7 +40,7 @@ docker pull roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:cann851-a3-py
 docker tag roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:cann851-a3-py311-torch280-vllm0130 roll:ascend-a3
 ```
 
-The current repository does not include `Dockerfile.A2` or `Dockerfile.A3`. If you maintain a custom image, keep the dependency versions aligned with the pre-built image.
+The current repository includes `docker/Dockerfile.A2` and `docker/Dockerfile.A3` for building custom images. If you maintain a custom image, keep the dependency versions aligned with the pre-built image.
 
 ### 1.3 Start the Container
 
diff --git a/docs_roll/docs/User Guides/Hardware Support/ascend_usage.md b/docs_roll/docs/User Guides/Hardware Support/ascend_usage.md
index 6f4c4a8d2..aac1dd8b4 100644
--- a/docs_roll/docs/User Guides/Hardware Support/ascend_usage.md
+++ b/docs_roll/docs/User Guides/Hardware Support/ascend_usage.md
@@ -126,9 +126,8 @@ python examples/start_agentic_pipeline.py \
 | --------------- | ------------------------------------------------------------ | ---------------- | ----------------- | ----------------- |
 | Agentic | examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_sokoban.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
 | Agentic-Rollout | examples/qwen2.5-0.5B-agentic/run_agentic_rollout_sokoban.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
-| DPO | examples/qwen2.5-3B-dpo_megatron/run_dpo_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
-| RLVR | examples/qwen2.5-7B-rlvr_megatron/run_rlvr_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
+| RLVR | examples/ascend_examples/run_rlvr_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A2/A3 PODc |
 
 ## Disclaimer
 
-The Ascend support provided in ROLL is intended as a reference example. For production use, please consult official channels.
\ No newline at end of file
+The Ascend support provided in ROLL is intended as a reference example. For production use, please consult official channels.
diff --git a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_examples.md b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_examples.md
index 401f0d836..e77c1531e 100644
--- a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_examples.md
+++ b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_examples.md
@@ -9,10 +9,11 @@
 运行本样例前,请确保:
 
 1. 已拉取与硬件匹配的预构建昇腾镜像(参见 [Docker 使用指南](ascend_docker_usage.md))。
-2. 已在容器内验证环境(参见 [验证环境](ascend_docker_usage.md#verify-the-environment))。
+2. 已在容器内验证环境(参见 [验证环境](ascend_docker_usage.md#验证环境))。
 3. 已将模型权重下载到容器可访问的目录。
 
-当前仓库在 `examples/ascend_examples` 中提供可直接运行的昇腾示例,包括 `qwen3_8b_rlvr_deepspeed.yaml`、`qwen3_4B_dpo_deepspeed.yaml`、`run_rlvr_pipeline.sh` 和 `run_dpo_pipeline.sh`。
+当前仓库在 `examples/ascend_examples` 中提供可直接运行的昇腾 RLVR 示例,包括 `qwen3_8b_rlvr_deepspeed.yaml` 和 `run_rlvr_pipeline.sh`。
+
 
 ## GPU 与 NPU 的关键差异
 
diff --git a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_faq.md b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_faq.md
index 188736b10..adc9ea8cf 100644
--- a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_faq.md
+++ b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_npu_faq.md
@@ -54,7 +54,29 @@ source /usr/local/Ascend/nnal/atb/set_env.sh
 - **Atlas 900 A2 PODc** → 使用 `roll:ascend-a2`(`ascend910b1`)
 - **Atlas 900 A3 PODc** → 使用 `roll:ascend-a3`(`ascend910_9391`)
 
-当前仓库不包含 `Dockerfile.A2` 或 `Dockerfile.A3`。如果维护自定义镜像,请确保 SOC 版本与目标硬件匹配。
+当前仓库包含用于构建自定义镜像的 `docker/Dockerfile.A2` 和 `docker/Dockerfile.A3`。如果维护自定义镜像,请确保 SOC 版本与目标硬件匹配。
+
+### 禁用 FRACTAL_NZ模式
+
+**现象:** 在强化学习中开启NZ优化模式很有可能导致精度问题,vllm_ascend中存在该校验,若开启会出现 `ValueError: FRACTAL_NZ mode is enabled. This may cause model parameter precision issues in the RL scenarios.`错误
+
+**解决方案:** 启动脚本前,添加环境变量,禁用NZ:
+   ```bash
+   export VLLM_ASCEND_ENABLE_NZ=0
+   ```
+
+### HCCL参数面端口绑定失败
+
+**现象:** 当前rank或进程在通信算子参数面建链时绑定device侧网卡端口失败,端口被占用,出现 `The IP address XXXX and port XXXX have already been bound`错误
+
+**解决方案:**
+
+1. HCCL使用device侧网卡的端口时默认需绑定16666端口,因此若有多个进程执行在同一个device上,且均会调用HCCL的通信算子接口,那么就会出现端口已被其他进程绑定导致失败的问题。
+2. 此时可先从业务上排查多个进程跑在同一个device上是否符合任务预期,若符合任务预期结果,可通过配置HCCL_NPU_SOCKET_PORT_RANGE环境变量使能多进程场景,如:
+   ```bash
+   export HCCL_NPU_SOCKET_PORT_RANGE="auto"
+   ```
+
 
 ## 依赖冲突
 
diff --git a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_usage.md b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_usage.md
index c2026c36e..39f7d110e 100644
--- a/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_usage.md
+++ b/docs_roll/i18n/zh-Hans/docusaurus-plugin-content-docs/current/User Guides/Hardware Support/ascend_usage.md
@@ -130,8 +130,7 @@ python examples/start_agentic_pipeline.py \
 | ---- | ---- | -------- | -------- | ---- |
 | Agentic | examples/qwen2.5-0.5B-agentic/run_agentic_pipeline_sokoban.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
 | Agentic-Rollout | examples/qwen2.5-0.5B-agentic/run_agentic_rollout_sokoban.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
-| DPO | examples/qwen2.5-3B-dpo_megatron/run_dpo_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
-| RLVR | examples/qwen2.5-7B-rlvr_megatron/run_rlvr_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A3 PODc |
+| RLVR | examples/ascend_examples/run_rlvr_pipeline.sh | DeepSpeed | vLLM | Atlas 900 A2/A3 PODc |
 
 ## 声明
 
diff --git a/examples/ascend_examples/run_rlvr_pipeline.sh b/examples/ascend_examples/run_rlvr_pipeline.sh
index ab4207cf5..770bd9340 100644
--- a/examples/ascend_examples/run_rlvr_pipeline.sh
+++ b/examples/ascend_examples/run_rlvr_pipeline.sh
@@ -1,5 +1,8 @@
 #!/bin/bash
 set +x
 
+export HCCL_NPU_SOCKET_PORT_RANGE="auto"
+export VLLM_ASCEND_ENABLE_NZ=0
+
 CONFIG_PATH=$(basename $(dirname $0))
 python examples/start_rlvr_pipeline.py --config_path $CONFIG_PATH --config_name qwen3_8b_rlvr_deepspeed
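
Reviewer note: the two exports added to `run_rlvr_pipeline.sh` overwrite any value the caller has already set. A minimal sketch of an alternative, assuming it is acceptable for the script to respect pre-set values, is shown below. The variable names and defaults come from this diff; the warning text and the overall approach are illustrative suggestions, not part of the patch.

```shell
#!/bin/bash
# Sketch only: apply the defaults from this diff without clobbering
# values the caller exported before invoking the script.
: "${HCCL_NPU_SOCKET_PORT_RANGE:=auto}"
: "${VLLM_ASCEND_ENABLE_NZ:=0}"
export HCCL_NPU_SOCKET_PORT_RANGE VLLM_ASCEND_ENABLE_NZ

# The FAQ text in this diff says NZ mode may cause precision issues in RL,
# so warn (hypothetical message) if the caller has left it enabled.
if [ "$VLLM_ASCEND_ENABLE_NZ" != "0" ]; then
  echo "warning: VLLM_ASCEND_ENABLE_NZ=$VLLM_ASCEND_ENABLE_NZ; the FAQ recommends 0 for RL runs" >&2
fi
```

With this form, `HCCL_NPU_SOCKET_PORT_RANGE=<custom value> bash examples/ascend_examples/run_rlvr_pipeline.sh` would keep the caller's value, while a plain invocation behaves the same as the hard-coded exports.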