DFlash layers.0.input_layernorm.weight became NaN #115

We are training a standalone DFlash drafter for a Qwen3.5 9B target model. After about 240 steps, `layers.0.input_layernorm.weight` became NaN, and then the DFlash hidden states became NaN as well. Could you help us understand what could have led to this? Thanks.
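Because a weight went NaN before the hidden states did, one thing worth ruling out is a non-finite gradient reaching the optimizer. The sketch below is a generic pure-Python illustration, not DFlash or DeepSpeed code: it shows how global-norm clipping (we use 1.0) can turn a single `inf` gradient into NaN, since the clip coefficient becomes 0 and `inf * 0` is NaN. The helper names `clip_by_global_norm` and `first_non_finite` are hypothetical, and whether our actual clipping path behaves this way is an assumption we have not verified.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale grads so their global L2 norm is at most max_norm (generic sketch)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = max_norm / (total_norm + eps)
    if coef > 1.0:  # never scale gradients up
        coef = 1.0
    return [g * coef for g in grads], total_norm

def first_non_finite(named_values):
    """Return the first name whose values contain NaN or inf, else None."""
    for name, values in named_values.items():
        if any(not math.isfinite(v) for v in values):
            return name
    return None

# Finite gradients are rescaled as expected.
finite_clipped, finite_norm = clip_by_global_norm([3.0, 4.0])

# One overflowed gradient: the norm is inf, the coefficient is 0.0,
# and inf * 0.0 evaluates to NaN for that entry.
overflow_clipped, overflow_norm = clip_by_global_norm([3.0, float("inf")])

# Hypothetical per-parameter dump, scanned in order to find the first bad tensor.
suspect = first_non_finite({
    "layers.0.input_layernorm.weight": overflow_clipped,
    "layers.1.input_layernorm.weight": finite_clipped,
})
```

If this is what happens, logging the pre-clip gradient norm at every step and scanning gradients with something like `first_non_finite` before the optimizer step should show an `inf` or NaN norm at or just before step 240.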


DFlash config:
- target model: Qwen3.5 9B
- DFlash hidden size: 4096
- DFlash layers: 5
- target context layers: `[1, 8, 15, 22, 29]`
- block size: 16
- blocks per sequence: 512
- mask token id: 248063
- loss gamma: 7
- DFlash gradient checkpointing: enabled
- attention backend: `flash_attention_2`
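As a quick sanity check on the config above, the arithmetic below works out how many draft positions the block settings imply per sequence and how the target context layers are spaced. It only uses numbers from this issue. Reading "blocks per sequence" as blocks that together span the whole sequence is our assumption, not something confirmed by the DFlash code.

```python
block_size = 16
blocks_per_sequence = 512
max_length = 8192  # from the training setup
target_context_layers = [1, 8, 15, 22, 29]

# Draft positions per sequence if every block is fully populated.
draft_positions = block_size * blocks_per_sequence

# True when the blocks would exactly cover a max-length sequence (assumed interpretation).
covers_full_sequence = draft_positions == max_length

# Spacing between consecutive target layers used as context.
layer_strides = [b - a for a, b in zip(target_context_layers, target_context_layers[1:])]

print(draft_positions, covers_full_sequence, layer_strides)
```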

Training setup:
- 2 nodes x 8 A100
- bf16
- DeepSpeed ZeRO
- micro batch size per GPU: 64
- gradient accumulation: 1
- lr: `6e-4`
- lr_min: `1e-6`
- warmup steps: 2000
- scheduler: cosine-like decay
- gradient clipping: 1.0
- optimizer: AdamW, betas `(0.9, 0.98)`
- max length: 8192
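To put step 240 in context, here is a minimal sketch of the learning-rate schedule and the effective batch size implied by the settings above. It assumes linear warmup from 0 and a plain cosine decay to `lr_min`. `total_steps` is a hypothetical placeholder, since the total step count is not listed here, and the function name `lr_at` is ours rather than the trainer's.

```python
import math

def lr_at(step, peak_lr=6e-4, min_lr=1e-6, warmup_steps=2000, total_steps=100_000):
    """Linear warmup (assumed to start at 0), then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate around the step where the NaN appeared (still early in warmup).
lr_at_240 = lr_at(240)

# Effective batch: 2 nodes x 8 GPUs, micro batch 64, gradient accumulation 1.
num_gpus = 2 * 8
sequences_per_step = num_gpus * 64 * 1

# Upper bound on tokens per optimizer step if every sequence hits max length 8192.
max_tokens_per_step = sequences_per_step * 8192

print(lr_at_240, sequences_per_step, max_tokens_per_step)
```

Under these assumptions the learning rate at step 240 is only about 12% of the peak, so the NaN appears well before the schedule reaches `6e-4`, with up to 1024 sequences per optimizer step.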




