
RFE: Add support for more floating point low-precision ML data types (bfloat16, fp8, nvfp4) #930

@mtavenrath

Description

Problem Statement
Currently, the WebNN API supports standard data types like float32, float16, int8, and uint8. However, the machine learning hardware ecosystem has rapidly migrated toward specialized, low-precision floating-point formats to handle the extreme memory bandwidth and compute demands of modern Large Language Models (LLMs) and diffusion models.

Developers bringing state-of-the-art models to the web are currently forced into a strict dilemma:

  1. Upcast to keep precision: Converting native bfloat16 or fp8 models to float32 or float16 preserves the model's accuracy and dynamic range, but it immediately bloats the memory footprint and exacerbates memory-bandwidth bottlenecks, which can be fatal for client-side inference.
  2. Quantize to integer to save memory: Converting to int8 or int4 achieves the desired performance and footprint, but sacrifices precision. Integer formats space values uniformly, which crushes the activation outliers prevalent in LLMs and degrades reasoning quality and perplexity.

Without native WebNN support for modern low-precision floating-point formats, the API forces developers to choose between performance and accuracy, losing out on the hardware-accelerated "best of both worlds" that these new formats were designed to provide.

The Case for Each Format

  1. bfloat16 (Brain Floating Point)

    • Context: Widely used as the default format for training modern AI models because its 8-bit exponent matches float32, preventing gradient underflow/overflow.
    • Issue: Porting a natively bfloat16-trained model to WebNN currently requires an offline conversion to float32 (doubling the model size) or to float16 (which requires careful rescaling to avoid overflowing float16's much narrower exponent range).
  2. fp8 (OCP 8-bit Floating Point - E4M3 & E5M2)

    • Context: Now standardized by the Open Compute Project and natively supported by current-generation hardware (NVIDIA Hopper/Blackwell, AMD Ryzen AI and RDNA4, Intel Xe2 (Lunar Lake, Arc B-series)).
    • Issue: fp8 acts as a middle ground between the two extremes above: it provides the memory footprint of int8 while retaining a floating-point exponent's dynamic range, letting it absorb activation outliers with far less quantization error.
  3. nvfp4 (4-bit Microscaled Floating Point)

    • Context: Introduced with NVIDIA's Blackwell architecture, this format represents the bleeding edge of sub-byte quantization. It utilizes a two-level microscaling block approach (sharing an FP8 scale across 16 values) to drastically reduce quantization error compared to plain 4-bit integer quantization.
    • Opportunity: Hardware-accelerated 4-bit floating-point inference in the browser would be a major leap for client-side LLM execution, cutting memory usage by ~3.5x compared to FP16 while maintaining near-baseline accuracy.
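To make the bfloat16 case concrete: bfloat16 is simply the top 16 bits of a float32, so conversion is a bit-level truncation (or rounding) that keeps float32's full 8-bit exponent. A minimal sketch using plain truncation (production converters usually round to nearest even):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits
    (simple truncation; real converters typically round-to-nearest-even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits16: int) -> float:
    """Expand bfloat16 bits back to float32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

# bfloat16 shares float32's 8-bit exponent, so a value like 3.0e38
# round-trips as a finite number, while float16 (max ~65504) would
# overflow to infinity.
big = bf16_bits_to_f32(f32_to_bf16_bits(3.0e38))
```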
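The fp8 dynamic-range argument can also be shown directly. A small decoder for the OCP E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7; no infinities) demonstrates that a single byte spans magnitudes from roughly 0.002 (subnormal) up to 448 — a range that a uniformly spaced int8 cannot cover:

```python
def e4m3_to_float(byte: int) -> float:
    """Decode one OCP FP8 E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")   # E4M3 trades infinities for extra range; only this NaN pattern
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6   # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Largest finite value: 0x7E -> 448.0; smallest subnormal: 0x01 -> 2**-9
```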
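The two-level microscaling idea behind nvfp4 can be sketched in a few lines: each block of 16 values stores FP4 (E2M1) codes plus one shared scale. This simplified sketch (the function names are illustrative) keeps the shared scale as a plain float, whereas the real format quantizes it to FP8 E4M3 and applies a second, per-tensor scale:

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize a 16-value block to shared-scale FP4 codes.
    Simplified: the shared scale stays a Python float; real NVFP4
    stores it as an FP8 (E4M3) value, plus a per-tensor scale."""
    assert len(block) == 16
    amax = max(abs(v) for v in block) or 1.0   # avoid div-by-zero on all-zero blocks
    scale = amax / 6.0                          # map the block's max onto FP4's max (6.0)
    def nearest(v):
        sign = -1.0 if v < 0 else 1.0
        return sign * min(E2M1_VALUES, key=lambda e: abs(e - abs(v)))
    codes = [nearest(v / scale) for v in block]
    return scale, codes

def dequantize_block_fp4(scale, codes):
    return [scale * c for c in codes]
```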

Proposed Solution
Update the MLOperandDataType enum and underlying buffer compatibility tables to include:

  • bfloat16
  • fp8 (potentially differentiating between E4M3 and E5M2 variants)
  • nvfp4 (or a generic hardware-agnostic equivalent for 4-bit block-scaled floating point)
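As a rough sketch of what the enum extension might look like (the value names here are illustrative, not proposals of record; the fp8 variants could equally be a single value with a separate variant attribute):

```webidl
enum MLOperandDataType {
  // …existing values such as "float32", "float16", "int8", "uint8"…
  "bfloat16",
  "float8e4m3",
  "float8e5m2",
  "nvfp4"
};
```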

Use Cases & Benefits

  • Quicker Model Porting: Developers can load original weights directly into the graph without complex offline upcasting/re-quantization pipelines.
  • Higher Quantization Accuracy: Floating-point quantization (fp8, nvfp4) retains the dynamic range of activations much better than standard integer quantization, reducing perplexity degradation.
  • Reduced Memory Pressure: Keeping tensors in their ultra-low precision formats slashes both memory footprint and memory bandwidth bottlenecks during inference on edge devices.
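The memory-pressure point is easy to quantify. Counting 4 bits per weight plus one 8-bit FP8 scale per 16-value block, nvfp4 costs roughly 4.5 bits per parameter (ignoring the small per-tensor scale), which is where the ~3.5x saving over FP16 comes from. A back-of-the-envelope sketch for an illustrative 7B-parameter model:

```python
def footprint_gib(n_params: float, bits_per_param: float) -> float:
    """Model weight footprint in GiB for a given bit width per parameter."""
    return n_params * bits_per_param / 8 / 2**30

N = 7e9                                  # e.g. a 7B-parameter LLM (illustrative)
fp16_gib = footprint_gib(N, 16)          # ~13.0 GiB
nvfp4_gib = footprint_gib(N, 4.5)        # 4 data bits + 8/16 shared-scale bits
ratio = fp16_gib / nvfp4_gib             # 16 / 4.5 ~= 3.56x
```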
