docs(ultra3): update quantization recipe and precision table#257

Merged
marcromeyn merged 1 commit into
NVIDIA-NeMo:mainfrom
yueshen2016:update-ultra3-quantization-doc
Jun 5, 2026

@yueshen2016

Summary

Replace the generic NVFP4 description in docs/nemotron/ultra3/quantization.md with the actual hybrid recipe used for the released Nemotron 3 Ultra checkpoint, and align the precision settings table with super-nvfp4-max-calib.yaml.

Changes

  • Hybrid checkpoint description: clarify that the released checkpoint is NVFP4 W4A4 for routed experts, FP8 per-tensor for shared experts and Mamba mixer.{in,out}_proj, FP8 KV cache, BF16 everywhere else (attention, latent MoE, MTP, lm_head, embeddings, Mamba conv1d/SSM).
  • Precision table: updated to reflect the recipe layer-by-layer; old table marked attention GEMMs and Mamba GEMMs as NVFP4, which is not what the recipe actually applies.
  • Commands section: replaces the Recipe Execution section. The quantize command now passes the YAML recipe path via --export-quant-cfg instead of the built-in mamba_moe_nvfp4_conservative shorthand, because the released recipe is a YAML file, not a built-in name.
  • Memory-reduction figure: tightened from ~4× to the measured ~3.3× (1.1 TB BF16 → 331 GB hybrid).
  • Parameter count: tightened to ≈560B (inferred from the BF16 checkpoint size).
  • Removed the "Quantization Configurations" table listing four built-in shorthand names — the released recipe is a YAML file and the table is no longer the authoritative list.
  • Removed the "Parallelism Configuration" subsection inside Infrastructure — redundant with the --pp/--tp/--ep flags shown in each command.
  • Kept Infrastructure components table and Reference section unchanged.
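
As a rough sanity check on the memory-reduction and parameter-count figures above, the arithmetic can be sketched in a few lines of Python. This assumes "TB" and "GB" are decimal units and that the BF16 checkpoint is dominated by 2-byte weights; both are assumptions, not statements from the doc. With the rounded 1.1 TB input the parameter estimate comes out near 550B, which is within rounding of the ≈560B quoted in the doc.

```python
# Back-of-the-envelope check of the figures quoted in the Changes list.
# Assumptions (not from the doc): decimal units (1 TB = 10**12 bytes,
# 1 GB = 10**9 bytes) and a BF16 checkpoint made up almost entirely of
# 2-byte weights.
BF16_BYTES_PER_PARAM = 2

bf16_checkpoint_bytes = 1.1e12    # "1.1 TB BF16" from the PR description
hybrid_checkpoint_bytes = 331e9   # "331 GB hybrid" from the PR description

# Ratio of checkpoint sizes, the "~3.3x" memory-reduction figure.
reduction = bf16_checkpoint_bytes / hybrid_checkpoint_bytes

# Parameter count inferred from the BF16 checkpoint size.
approx_params = bf16_checkpoint_bytes / BF16_BYTES_PER_PARAM

# Average storage cost per parameter in the hybrid checkpoint.
avg_bits_per_param = hybrid_checkpoint_bytes * 8 / approx_params

print(f"memory reduction: ~{reduction:.1f}x")
print(f"parameters: ~{approx_params / 1e9:.0f}B")
print(f"average bits per parameter: ~{avg_bits_per_param:.1f}")
```

The average lands a little above 4 bits per parameter, which is what one would expect when most weights are NVFP4 and the rest stay in FP8 or BF16.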

Test plan

  • Verified the new precision table matches the recipe YAML's enable list (*mlp.experts* → NVFP4, *mlp.shared_experts* and *mixer.{in,out}_proj* → FP8, everything else disabled).
  • Verified the commands match examples/quantization/{quantize,ptq_generate,export}.py in Megatron-Bridge.
  • Verified the super-nvfp4-max-calib.yaml recipe path, and that the PTQ + export + sanity-prompt round trip works end-to-end on this checkpoint.
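
The enable-list check in the first bullet can be illustrated with a minimal sketch that maps module names to precisions using the three patterns quoted above. It assumes shell-style glob semantics via Python's `fnmatch`, expands the `{in,out}` brace by hand because `fnmatch` does not support braces, and uses made-up module names; it is not the matcher the quantization tooling itself uses, and it does not model the FP8 KV cache setting.

```python
from fnmatch import fnmatchcase

# Patterns taken from the test plan; the {in,out} brace is expanded by hand.
# The precision labels are descriptive strings, not recipe syntax.
RULES = [
    ("*mlp.experts*", "NVFP4 W4A4"),
    ("*mlp.shared_experts*", "FP8 per-tensor"),
    ("*mixer.in_proj*", "FP8 per-tensor"),
    ("*mixer.out_proj*", "FP8 per-tensor"),
]
DEFAULT_PRECISION = "BF16"  # "everything else disabled" stays unquantized


def precision_for(module_name: str) -> str:
    """Return the precision of the first matching rule, else the default."""
    for pattern, precision in RULES:
        if fnmatchcase(module_name, pattern):
            return precision
    return DEFAULT_PRECISION


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "decoder.layers.0.mlp.experts.local_experts.3.linear_fc1",
    "decoder.layers.0.mlp.shared_experts.linear_fc1",
    "decoder.layers.1.mixer.in_proj",
    "decoder.layers.1.mixer.conv1d",
    "decoder.layers.2.self_attention.linear_qkv",
    "output_layer",
]

for name in EXAMPLE_MODULES:
    print(f"{name:60s} -> {precision_for(name)}")
```

Note that `*mlp.experts*` does not capture shared experts, since the string `mlp.experts` does not occur inside `mlp.shared_experts`; that is why the two expert types can carry different precisions.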

🤖 Generated with Claude Code

Replace the generic NVFP4 description with the actual hybrid recipe used
for the Nemotron 3 Ultra release: NVFP4 W4A4 for routed experts, FP8
per-tensor for shared experts and Mamba mixer linears, FP8 KV cache,
BF16 for attention / latent MoE / lm_head / MTP / embeddings.

- Rename "NVFP4 Checkpoint" → "NVFP4 + FP8 Hybrid Checkpoint"
- Update the precision table to match super-nvfp4-max-calib.yaml
- Drop the obsolete "Quantization Configurations" four-name table (the
  released recipe is a YAML file, not a built-in shorthand name)
- Switch the Quantize command to point at the recipe YAML via
  --export-quant-cfg
- Tighten memory-reduction claim to the measured ratio (~3.3×, 1.1 TB
  → 331 GB) and parameter count (≈560B)
- Rename "Recipe Execution" → "Commands" and drop the
  Direct-Script-Execution preamble (redundant with Megatron-Bridge README)
- Drop the "Parallelism Configuration" subsection (already captured by
  the --pp/--tp/--ep flags shown in each command)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the update-ultra3-quantization-doc branch from 26329a8 to 4d1b27b Compare June 4, 2026 22:45
@marcromeyn marcromeyn merged commit 41bd567 into NVIDIA-NeMo:main Jun 5, 2026
4 checks passed
2 participants