docs(ultra3): update quantization recipe and precision table by yueshen2016 · Pull Request #257 · NVIDIA-NeMo/Nemotron

yueshen2016 · 2026-06-04T22:41:51Z

Summary

Replace the generic NVFP4 description in docs/nemotron/ultra3/quantization.md with the actual hybrid recipe used for the released Nemotron 3 Ultra checkpoint, and align the precision settings table with super-nvfp4-max-calib.yaml.

Changes

- Hybrid checkpoint description: clarifies that the released checkpoint uses NVFP4 W4A4 for routed experts, FP8 per-tensor for shared experts and Mamba mixer.{in,out}_proj, an FP8 KV cache, and BF16 everywhere else (attention, latent MoE, MTP, lm_head, embeddings, Mamba conv1d/SSM).
- Precision table: updated to follow the recipe layer by layer. The old table marked attention GEMMs and Mamba GEMMs as NVFP4, which is not what the recipe applies.
- Commands section: replaces the Recipe Execution section. The quantize command now passes the YAML recipe path via --export-quant-cfg instead of the built-in mamba_moe_nvfp4_conservative shorthand (the released recipe is a YAML file, not a built-in name).
- Memory-reduction figure: tightened from ~4× to the measured ~3.3× (1.1 TB BF16 → 331 GB hybrid).
- Parameter count: tightened to ≈560B (inferred from the BF16 checkpoint size).
- Removed the "Quantization Configurations" table listing four built-in shorthand names: the released recipe is a YAML file, so that table is no longer the authoritative list.
- Removed the "Parallelism Configuration" subsection inside Infrastructure, which was redundant with the --pp/--tp/--ep flags shown in each command.
- Kept the Infrastructure components table and the Reference section unchanged.
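
The two tightened figures can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope check, not the method used to produce the numbers in the doc. It assumes decimal units (1 TB = 1000 GB), which the PR does not state, and relies only on the fact that BF16 stores each parameter in 2 bytes. The function names are illustrative.

```python
# Back-of-envelope check of the memory-reduction and parameter-count figures.
# Assumption: decimal units (1 TB = 1000 GB); the PR does not say which
# convention the checkpoint sizes use.

BF16_CHECKPOINT_GB = 1100.0    # "1.1 TB BF16" quoted in the PR
HYBRID_CHECKPOINT_GB = 331.0   # "331 GB hybrid" quoted in the PR
BYTES_PER_BF16_PARAM = 2       # BF16 is a 16-bit format


def reduction_ratio(before_gb: float, after_gb: float) -> float:
    """Return how many times smaller the quantized checkpoint is."""
    return before_gb / after_gb


def params_in_billions(bf16_size_gb: float) -> float:
    """Estimate the parameter count (in billions) from a BF16 checkpoint size."""
    total_bytes = bf16_size_gb * 1e9
    return total_bytes / BYTES_PER_BF16_PARAM / 1e9


ratio = reduction_ratio(BF16_CHECKPOINT_GB, HYBRID_CHECKPOINT_GB)
params_b = params_in_billions(BF16_CHECKPOINT_GB)

print(f"memory reduction: {ratio:.2f}x")
print(f"parameter estimate from rounded size: {params_b:.0f}B")
```

With the rounded 1.1 TB input the estimate comes out near 550B. Because 1.1 TB is itself rounded to two significant figures, that is compatible with the ≈560B stated above, which presumably comes from the unrounded checkpoint size.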

Test plan

- Verified that the new precision table matches the recipe YAML's enable list (*mlp.experts* → NVFP4, *mlp.shared_experts* and *mixer.{in,out}_proj* → FP8, everything else disabled).
- Verified that the commands match examples/quantization/{quantize,ptq_generate,export}.py in Megatron-Bridge.
- Verified the super-nvfp4-max-calib.yaml recipe path, and that the PTQ + export + sanity-prompt round trip works end-to-end on this checkpoint.
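
The first check in the test plan can be illustrated with a small pattern-matching sketch. This is not the recipe loader itself: it uses Python's standard `fnmatch` module as a stand-in for whatever wildcard matching the quantization tooling applies, it assumes first-match-wins ordering, and the module names in `EXAMPLE_MODULES` are hypothetical. Only the three wildcard patterns and their target precisions come from the PR.

```python
from fnmatch import fnmatchcase

# Patterns from the PR's test plan. The brace shorthand mixer.{in,out}_proj
# is expanded by hand because fnmatch has no brace expansion.
PRECISION_RULES = [
    ("*mlp.experts*", "NVFP4"),
    ("*mlp.shared_experts*", "FP8"),
    ("*mixer.in_proj*", "FP8"),
    ("*mixer.out_proj*", "FP8"),
]

# Layers with quantization disabled stay in BF16, per the PR description.
DEFAULT_PRECISION = "BF16"


def precision_for(module_name: str) -> str:
    """Return the precision of the first matching rule (assumed ordering)."""
    for pattern, precision in PRECISION_RULES:
        if fnmatchcase(module_name, pattern):
            return precision
    return DEFAULT_PRECISION


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "decoder.layers.3.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.3.mlp.shared_experts.linear_fc1",
    "decoder.layers.0.mixer.in_proj",
    "decoder.layers.0.mixer.out_proj",
    "decoder.layers.0.mixer.conv1d",
    "decoder.layers.5.self_attention.linear_qkv",
    "output_layer",
]

for name in EXAMPLE_MODULES:
    print(f"{name:60s} -> {precision_for(name)}")
```

Note that `*mlp.experts*` does not match names containing `mlp.shared_experts`, so the routed-expert and shared-expert rules do not overlap in this sketch. The FP8 KV cache is a separate setting and is not modeled here.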

🤖 Generated with Claude Code

Replace the generic NVFP4 description with the actual hybrid recipe used for the Nemotron 3 Ultra release: NVFP4 W4A4 for routed experts, FP8 per-tensor for shared experts and Mamba mixer linears, FP8 KV cache, BF16 for attention / latent MoE / lm_head / MTP / embeddings.

- Rename "NVFP4 Checkpoint" → "NVFP4 + FP8 Hybrid Checkpoint"
- Update the precision table to match super-nvfp4-max-calib.yaml
- Drop the obsolete "Quantization Configurations" four-name table (the released recipe is a YAML file, not a built-in shorthand name)
- Switch the Quantize command to point at the recipe YAML via --export-quant-cfg
- Tighten memory-reduction claim to the measured ratio (~3.3×, 1.1 TB → 331 GB) and parameter count (≈560B)
- Rename "Recipe Execution" → "Commands" and drop the Direct-Script-Execution preamble (redundant with Megatron-Bridge README)
- Drop the "Parallelism Configuration" subsection (already captured by the --pp/--tp/--ep flags shown in each command)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>

yueshen2016 force-pushed the update-ultra3-quantization-doc branch from 26329a8 to 4d1b27b Compare June 4, 2026 22:45

marcromeyn approved these changes Jun 5, 2026

marcromeyn merged commit 41bd567 into NVIDIA-NeMo:main Jun 5, 2026
4 checks passed

marcromeyn merged 1 commit into NVIDIA-NeMo:main from yueshen2016:update-ultra3-quantization-doc