DFlash layers.0.input_layernorm.weight became NaN #115

@airlsyn

We are training a standalone DFlash drafter for a Qwen3.5 9B target model. After roughly 240 steps, layers.0.input_layernorm.weight became NaN, and then the DFlash hidden states became NaN as well. Could you help us figure out what might have led to this? Thanks.

DFlash config:

  • target model: Qwen3.5 9B
  • DFlash hidden size: 4096
  • DFlash layers: 5
  • target context layers: [1, 8, 15, 22, 29]
  • block size: 16
  • blocks per sequence: 512
  • mask token id: 248063
  • loss gamma: 7
  • DFlash gradient checkpointing: enabled
  • attention backend: flash_attention_2
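
As a quick sanity check on the config above, the block settings can be related to the sequence length. This is only a sketch: it assumes the blocks tile a sequence end to end without overlap, which the issue does not state, and the variable names are illustrative rather than taken from the DFlash code.

```python
# Values copied from the DFlash config in this issue.
block_size = 16
blocks_per_sequence = 512
max_length = 8192  # "max length" from the training setup in this issue

# Assumption: blocks are laid end to end with no overlap or padding.
tokens_covered = block_size * blocks_per_sequence

print(tokens_covered)                # 8192
print(tokens_covered == max_length)  # True
```

Under that assumption, 512 blocks of 16 tokens cover exactly one 8192-token sequence.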

Training setup:

  • 2 nodes x 8 A100
  • bf16
  • DeepSpeed ZeRO
  • micro batch size per GPU: 64
  • gradient accumulation: 1
  • lr: 6e-4
  • lr_min: 1e-6
  • warmup steps: 2000
  • scheduler: cosine-like decay
  • gradient clipping: 1.0
  • optimizer: AdamW, betas (0.9, 0.98)
  • max length: 8192
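
To put step ~240 in context, the schedule and batch settings above can be worked out numerically. This is a rough sketch, not the actual training code: it assumes linear warmup from zero followed by a plain cosine decay to lr_min, and `total_steps` is a hypothetical value because the issue does not give the run length.

```python
import math

# Values copied from the training setup in this issue.
PEAK_LR = 6e-4
MIN_LR = 1e-6
WARMUP_STEPS = 2000
NODES, GPUS_PER_NODE = 2, 8
MICRO_BATCH, GRAD_ACCUM = 64, 1
MAX_LENGTH = 8192


def lr_at(step, total_steps):
    """Learning rate at `step`, assuming linear warmup then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


# Sequences per optimizer step across all GPUs.
global_batch = NODES * GPUS_PER_NODE * MICRO_BATCH * GRAD_ACCUM
# Upper bound on tokens per step, reached only if every sequence is full length.
max_tokens_per_step = global_batch * MAX_LENGTH

TOTAL_STEPS = 100_000  # hypothetical, not given in the issue

print(f"lr at step 240: {lr_at(240, TOTAL_STEPS):.2e}")   # about 7.2e-05
print(f"global batch: {global_batch} sequences")           # 1024
print(f"max tokens per step: {max_tokens_per_step}")       # 8388608
```

If the warmup is linear as assumed, the NaN appeared while the learning rate was still at about 12% of its 6e-4 peak, with 1024 sequences per optimizer step.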
