[Megatron] Precision Loss When Qwen3-Next HF -> mcore -> HF #7102
Conversation
Summary of Changes (Gemini Code Assist)
This pull request addresses a critical bug in the Qwen3-Next model's conversion process between Hugging Face and Megatron-Core, where a subtle difference in RMSNorm implementations led to significant precision loss and potential model instability. By integrating a custom RMSNorm layer that precisely mirrors the Hugging Face implementation, the PR ensures accurate weight mapping and numerical stability, preventing corrupted outputs during inference.
Code Review
This pull request effectively addresses a precision loss issue during weight conversion for the Qwen3-Next model by introducing a custom Qwen3NextRMSNorm implementation that aligns with the Hugging Face version. The changes are well-contained and logical, correctly removing the need for weight offsets during conversion. I've made a couple of minor suggestions to improve code clarity and robustness. Overall, this is a solid fix.
def __init__(self, config: TransformerConfig, submodules: SelfAttentionSubmodules, layer_number: int, **kwargs):
    assert config.context_parallel_size == 1, 'Qwen3Next currently does not support context parallel.'
    assert _Qwen3NextGatedDeltaNet is not object, 'please update the `transformers` version.'
    _Qwen3NextGatedDeltaNet.__init__(self, config, layer_number)
The assertion `assert _Qwen3NextGatedDeltaNet is not object, 'please update the transformers version.'` was removed. This check provided a clear error message if `Qwen3NextGatedDeltaNet` failed to import, guiding the user to update their `transformers` installation. Without it, a missing dependency will cause a less intuitive `TypeError` inside `_Qwen3NextGatedDeltaNet.__init__`. It would be beneficial to restore this assertion for better error handling and developer experience. You could add it back before this line.
if layer_type == 'linear_attention':
    layer_spec.submodules.input_layernorm = TENorm
    layer_spec.submodules.self_attention.submodules.linear_qkv = TEColumnParallelLinear
    layer_spec.submodules.input_layernorm = layer_norm_impl
thanks a lot
It is an awesome framework, and I enjoy using it. Thank you for your continuous efforts.
Hi, I've tested it and found another bug: the `linear_qkv` in `full_attention` still contains a `layer_norm`. I will submit a new PR based on this one.
Thank you for your time and effort.
PR type
PR information
This PR fixes a precision loss issue (on the order of $10^{-3}$) in the weight conversion process for the Qwen3-Next model between Hugging Face (HF) and Megatron-Core (MCore) formats. During the round-trip conversion (HF -> MCore -> HF), we observed a considerable discrepancy in the LayerNorm weights. A drift of this magnitude can be enough to cause numerical instability during inference, leading to corrupted outputs (e.g., the model generating repetitive `!` in some corner cases).
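For reference, here is a minimal sketch of how such a round-trip drift can be measured. The checkpoint paths and the filter on parameter names containing "norm" are placeholders for illustration, not part of this PR.

```python
# Hypothetical round-trip check: compare the original HF checkpoint with the
# HF checkpoint re-exported from MCore and report the largest drift on the norm weights.
import torch
from transformers import AutoModelForCausalLM

orig = AutoModelForCausalLM.from_pretrained("path/to/qwen3-next-hf", torch_dtype=torch.bfloat16)
back = AutoModelForCausalLM.from_pretrained("path/to/qwen3-next-hf-roundtrip", torch_dtype=torch.bfloat16)

back_params = dict(back.named_parameters())
max_diff = 0.0
for name, param in orig.named_parameters():
    if "norm" in name:  # the drift described above was observed on the LayerNorm/RMSNorm weights
        max_diff = max(max_diff, (param - back_params[name]).abs().max().item())
print(f"max |orig - roundtrip| over norm weights: {max_diff}")
```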
Root Cause
The issue stems from a subtle implementation difference between the standard RMSNorm in Megatron-Core and the native RMSNorm implementation used in Qwen3-Next; bridging the two previously required applying an offset to the norm weights during conversion.
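As a rough illustration of why such an offset-based mapping is lossy, consider the toy example below. It assumes the HF Qwen3-Next norm stores a zero-centered weight that is applied as `1 + weight`, while MCore's standard RMSNorm applies `weight` directly, so the converter has to add 1 in one direction and subtract it in the other; this convention is inferred from the "weight offsets" mentioned in the review above, not copied from the actual code.

```python
# Toy example (not the actual conversion code): offsetting the norm weight by 1
# and back in bf16 loses precision, because bf16 keeps only ~2-3 significant
# decimal digits for values near 1.0.
import torch

w_hf = torch.full((4096,), 1e-3, dtype=torch.bfloat16)  # zero-centered HF weight (assumption)
w_mcore = w_hf + 1.0                                     # HF -> MCore: apply the offset
w_back = w_mcore - 1.0                                   # MCore -> HF: remove the offset

print((w_hf - w_back).abs().max())  # ~1e-3 drift, the same order as reported above
```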
Result
When MCore's default RMSNorm is used to load the Qwen3-Next weights, the transformation is not strictly bijective, leading to the observed loss.
Solution
I have replaced the generic Megatron RMSNorm in the MCore model definition for Qwen3-Next with the model's native RMSNorm implementation. This ensures that the weight mapping and the mathematical operations during inference are exactly aligned with the original Hugging Face implementation.
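For a concrete picture, here is a minimal sketch of an HF-aligned RMSNorm under the same zero-centered-weight assumption as above; the class name, signature, and defaults are illustrative, not the exact code added by this PR.

```python
# Sketch of an RMSNorm that applies (1 + weight), so HF weights can be copied
# verbatim in both directions and no offset is needed during conversion.
import torch
from torch import nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden_size))  # stored as an offset from 1
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return (1.0 + self.weight) * hidden_states.to(input_dtype)
```

With a norm of this form inside the MCore model definition, the HF norm weights can be mapped one-to-one, which is what makes the round-trip conversion lossless.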