
RFE: Add support for more floating point low-precision ML data types (bfloat16, fp8, nvfp4) #930

@mtavenrath

Description

Problem Statement
Currently, the WebNN API supports standard data types like float32, float16, int8, and uint8. However, the machine learning hardware ecosystem has rapidly migrated toward specialized, low-precision floating-point formats to handle the extreme memory bandwidth and compute demands of modern Large Language Models (LLMs) and diffusion models.

Developers bringing state-of-the-art models to the web are currently forced into a strict dilemma:

  1. Upcast to keep precision: Converting native bfloat16 or fp8 models to float32 or float16 preserves the model's accuracy and dynamic range, but it immediately bloats the memory footprint and exacerbates memory-bandwidth bottlenecks, which can be fatal for client-side inference.
  2. Quantize to integer to save memory: Converting to int8 or int4 achieves the desired performance and footprint, but sacrifices precision. Integer formats space values uniformly, which crushes the activation outliers prevalent in LLMs and degrades reasoning quality and perplexity.

Without native WebNN support for modern low-precision floating-point formats, the API forces developers to choose between performance and accuracy, losing out on the hardware-accelerated "best of both worlds" that these new formats were designed to provide.

The Case for Each Format

  1. bfloat16 (Brain Floating Point)

    • Context: Widely used as the default format for training modern AI models because its 8-bit exponent matches float32, preventing gradient underflow/overflow.
    • Issue: Porting a natively bfloat16-trained model to WebNN currently requires an offline conversion to float32 (doubling the model size) or to float16 (which requires careful rescaling to avoid overflowing float16's much narrower exponent range).
  2. fp8 (OCP 8-bit Floating Point - E4M3 & E5M2)

    • Context: Now standardized by the Open Compute Project and natively supported by current-generation hardware (NVIDIA Hopper/Blackwell, AMD Ryzen AI and RDNA4, Intel Xe2 (Lunar Lake, Arc B-series)).
    • Issue: fp8 acts as a middle ground between the two extremes above: it provides the memory footprint of int8 while retaining a floating-point exponent's dynamic range, letting it absorb activation outliers with far less quantization error.
  3. nvfp4 (4-bit Microscaled Floating Point)

    • Context: Introduced with NVIDIA's Blackwell architecture, this format represents the bleeding edge of sub-byte quantization. It utilizes a two-level microscaling block approach (sharing an FP8 scale across 16 values) to drastically reduce quantization error compared to plain 4-bit integer quantization.
    • Opportunity: Hardware-accelerated 4-bit floating-point inference in the browser would be a major leap for client-side LLM execution, cutting memory usage by ~3.5x compared to FP16 while maintaining near-baseline accuracy.
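To make the bfloat16 case concrete: bfloat16 is simply the top 16 bits of a float32, so conversion is a bit-level truncation (or rounding) that keeps float32's full 8-bit exponent. A minimal sketch using plain truncation (production converters usually round to nearest even):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits
    (simple truncation; real converters typically round-to-nearest-even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits16: int) -> float:
    """Expand bfloat16 bits back to float32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

# bfloat16 shares float32's 8-bit exponent, so a value like 3.0e38
# round-trips as a finite number, while float16 (max ~65504) would
# overflow to infinity.
big = bf16_bits_to_f32(f32_to_bf16_bits(3.0e38))
```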
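The fp8 dynamic-range argument can also be shown directly. A small decoder for the OCP E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7; no infinities) demonstrates that a single byte spans magnitudes from roughly 0.002 (subnormal) up to 448 — a range that a uniformly spaced int8 cannot cover:

```python
def e4m3_to_float(byte: int) -> float:
    """Decode one OCP FP8 E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")   # E4M3 trades infinities for extra range; only this NaN pattern
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6   # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Largest finite value: 0x7E -> 448.0; smallest subnormal: 0x01 -> 2**-9
```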
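The two-level microscaling idea behind nvfp4 can be sketched in a few lines: each block of 16 values stores FP4 (E2M1) codes plus one shared scale. This simplified sketch (the function names are illustrative) keeps the shared scale as a plain float, whereas the real format quantizes it to FP8 E4M3 and applies a second, per-tensor scale:

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize a 16-value block to shared-scale FP4 codes.
    Simplified: the shared scale stays a Python float; real NVFP4
    stores it as an FP8 (E4M3) value, plus a per-tensor scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block) or 1.0   # avoid div-by-zero on all-zero blocks
    scale = amax / 6.0                          # map the block's max onto FP4's max (6.0)
    def nearest(v):
        sign = -1.0 if v < 0 else 1.0
        return sign * min(E2M1_VALUES, key=lambda e: abs(e - abs(v)))
    codes = [nearest(v / scale) for v in block]
    return scale, codes

def dequantize_block_fp4(scale, codes):
    return [scale * c for c in codes]
```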

Proposed Solution
Update the MLOperandDataType enum and underlying buffer compatibility tables to include:

  • bfloat16
  • fp8 (potentially differentiating between E4M3 and E5M2 variants)
  • nvfp4 (or a generic hardware-agnostic equivalent for 4-bit block-scaled floating point)
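As a rough sketch of what the enum extension might look like (the value names here are illustrative, not proposals of record; the fp8 variants could equally be a single value with a separate variant attribute):

```webidl
enum MLOperandDataType {
  // …existing values such as "float32", "float16", "int8", "uint8"…
  "bfloat16",
  "float8e4m3",
  "float8e5m2",
  "nvfp4"
};
```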

Use Cases & Benefits

  • Quicker Model Porting: Developers can load original weights directly into the graph without complex offline upcasting/re-quantization pipelines.
  • Higher Quantization Accuracy: Floating-point quantization (fp8, nvfp4) retains the dynamic range of activations much better than standard integer quantization, reducing perplexity degradation.
  • Reduced Memory Pressure: Keeping tensors in their ultra-low precision formats slashes both memory footprint and memory bandwidth bottlenecks during inference on edge devices.
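The memory-pressure point is easy to quantify. Counting 4 bits per weight plus one 8-bit FP8 scale per 16-value block, nvfp4 costs roughly 4.5 bits per parameter (ignoring the small per-tensor scale), which is where the ~3.5x saving over FP16 comes from. A back-of-the-envelope sketch for an illustrative 7B-parameter model:

```python
def footprint_gib(n_params: float, bits_per_param: float) -> float:
    """Model weight footprint in GiB for a given bit width per parameter."""
    return n_params * bits_per_param / 8 / 2**30

N = 7e9                                  # e.g. a 7B-parameter LLM (illustrative)
fp16_gib = footprint_gib(N, 16)          # ~13.0 GiB
nvfp4_gib = footprint_gib(N, 4.5)        # 4 data bits + 8/16 shared-scale bits
ratio = fp16_gib / nvfp4_gib             # 16 / 4.5 ~= 3.56x
```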
