Problem Statement
Currently, the WebNN API supports standard data types like float32, float16, int8, and uint8. However, the machine learning hardware ecosystem has rapidly migrated toward specialized, low-precision floating-point formats to handle the extreme memory bandwidth and compute demands of modern Large Language Models (LLMs) and diffusion models.
Developers bringing state-of-the-art models to the web currently face a stark dilemma:
- Upcast to keep precision: Converting native bfloat16 or fp8 models to float32 or float16, respectively, preserves the model's accuracy and dynamic range, but immediately bloats the memory footprint and exacerbates memory bandwidth bottlenecks, which can be fatal for client-side inference.
- Quantize to integer to save memory: Converting to int8 or int4 achieves the desired performance and footprint, but sacrifices precision: integers space values uniformly, which crushes the activation outliers prevalent in LLMs and degrades reasoning quality and perplexity.
Without native WebNN support for modern low-precision floating-point formats, the API forces developers to choose between performance and accuracy, losing out on the hardware-accelerated "best of both worlds" that these new formats were designed to provide.
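The integer half of this dilemma is easy to demonstrate in a few lines: a single outlier forces the shared int8 scale so wide that typical activation values collapse to zero. (The values below are made-up illustrations, not measurements from a real model.)

```javascript
// One outlier (100) among typical activations (~0.05) forces the int8 step
// size up to 100 / 127 ~= 0.79, so every typical value rounds to zero.
const acts = [0.03, 0.05, -0.04, 100];          // hypothetical activation values
const step = Math.max(...acts.map(Math.abs)) / 127;
const deqActs = acts.map((v) => Math.round(v / step) * step);
console.log(deqActs);                           // the small values all collapse to 0
```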
The Case for Each Format
- bfloat16 (Brain Floating Point)
  - Context: Widely used as the default format for training modern AI models because its 8-bit exponent matches float32, preventing gradient underflow/overflow.
  - Issue: Porting a natively bfloat16-trained model to WebNN currently requires an offline conversion to float32 (doubling the model size) or float16 (whose narrower exponent range requires careful scaling to avoid overflow).
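The relationship is cheap to sketch in software: bfloat16 is simply the top 16 bits of a float32, which is why it keeps float32's full exponent range. A minimal JavaScript sketch (not tied to any WebNN API):

```javascript
// bfloat16 is the top 16 bits of a float32: same sign and 8-bit exponent,
// mantissa truncated from 23 bits to 7.
const buf = new DataView(new ArrayBuffer(4));

function f32ToBf16Bits(x) {
  buf.setFloat32(0, x);
  const bits = buf.getUint32(0);
  // round-to-nearest-even on the 16 bits being dropped
  const bias = 0x7fff + ((bits >>> 16) & 1);
  return ((bits + bias) >>> 16) & 0xffff;
}

function bf16BitsToF32(b) {
  buf.setUint32(0, b << 16);  // widen by zero-filling the low 16 mantissa bits
  return buf.getFloat32(0);
}

// Same exponent range as float32, so large magnitudes stay finite;
// float16 (max ~65504) would overflow to Infinity here:
console.log(bf16BitsToF32(f32ToBf16Bits(3e38)));
```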
- fp8 (OCP 8-bit Floating Point - E4M3 & E5M2)
  - Context: Now standardized by the Open Compute Project and natively supported by current-generation hardware (NVIDIA Hopper/Blackwell, AMD Ryzen AI and RDNA4, Intel Xe2 (Lunar Lake, Arc B-series)).
  - Issue: fp8 is the natural middle ground between the two extremes above: it matches the memory footprint of int8 while retaining the dynamic range of a floating-point exponent, letting it absorb activation outliers with far less precision loss than a uniform integer grid.
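To make the "dynamic range in one byte" argument concrete, here is a small decoder for the OCP E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7); the helper name is ours, but the bit layout follows the OCP spec:

```javascript
// Decode an OCP fp8 E4M3 byte. E4M3 trades the Infinity encodings for extra
// range: the only non-finite value is exponent=1111, mantissa=111 (NaN).
function decodeE4M3(byte) {
  const sign = byte & 0x80 ? -1 : 1;
  const exp = (byte >> 3) & 0xf;
  const man = byte & 0x7;
  if (exp === 0xf && man === 0x7) return NaN;
  if (exp === 0) return sign * man * 2 ** -9;       // subnormal: no implicit leading 1
  return sign * (1 + man / 8) * 2 ** (exp - 7);
}

console.log(decodeE4M3(0b0_1111_110)); // 448, the largest finite E4M3 value
console.log(decodeE4M3(0b0_0000_001)); // 0.001953125, the smallest subnormal
// That is a ~229,000:1 dynamic range from a single byte, versus int8's
// fixed 127:1 ratio at any one scale.
```

The sibling E5M2 variant trades a mantissa bit for a fifth exponent bit (bias 15) and keeps IEEE-style Inf/NaN encodings, giving even more range at lower precision.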
- nvfp4 (4-bit Microscaled Floating Point)
  - Context: Introduced with NVIDIA's Blackwell architecture, this format represents the bleeding edge of sub-byte quantization. It uses a two-level microscaling scheme (an fp8 scale shared across each block of 16 values, on top of a per-tensor scale) to drastically reduce quantization error compared to pure integer 4-bit formats.
  - Issue: Hardware-accelerated 4-bit floating-point inference in the browser would be a major leap for client-side LLM execution, cutting memory usage by roughly 3.5x versus fp16 (16 fp4 values plus one fp8 scale average out to 4.5 bits per value) while maintaining near-baseline accuracy.
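The microscaling idea itself fits in a short sketch. This is an illustrative simplification, not NVIDIA's exact recipe: real nvfp4 stores the per-block scale as fp8 E4M3 (plus a per-tensor fp32 scale), while this sketch keeps the scale at full precision.

```javascript
// The magnitudes representable by fp4 (E2M1): 0, 0.5, 1, 1.5, 2, 3, 4, 6.
const FP4_GRID = [0, 0.5, 1, 1.5, 2, 3, 4, 6];

// Quantize one 16-value block with a shared scale, then dequantize back.
function quantizeBlockFp4(block) {
  const amax = Math.max(...block.map(Math.abs)) || 1;
  const scale = amax / 6;                 // map the block's max onto fp4's max (6.0)
  const deq = block.map((v) => {
    // snap |v| / scale to the nearest fp4 grid point
    const mag = FP4_GRID.reduce((best, g) =>
      Math.abs(g - Math.abs(v) / scale) < Math.abs(best - Math.abs(v) / scale) ? g : best);
    return (v >= 0 ? mag : -mag) * scale;
  });
  return { scale, deq };
}

const block = [...Array(15).keys()].map((i) => 0.01 * i).concat([5]); // one outlier
const { scale, deq } = quantizeBlockFp4(block);
console.log(scale, deq[15]); // the outlier round-trips (up to float error)
```

Because each block of 16 chooses its own scale, an outlier only distorts its own 16 neighbors rather than the whole tensor; at 16 four-bit values plus one 8-bit scale per block, storage averages 4.5 bits per value, which is where the ~3.5x saving over fp16 comes from.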
Proposed Solution
Update the MLOperandDataType enum and the underlying buffer compatibility tables to include:
- bfloat16
- fp8 (potentially differentiating between the E4M3 and E5M2 variants)
- nvfp4 (or a generic, hardware-agnostic equivalent for 4-bit block-scaled floating point)
Use Cases & Benefits
- Quicker Model Porting: Developers can load original weights directly into the graph without complex offline upcasting/re-quantization pipelines.
- Higher Quantization Accuracy: Floating-point quantization (fp8, nvfp4) retains the dynamic range of activations much better than standard integer quantization, reducing perplexity degradation.
- Reduced Memory Pressure: Keeping tensors in their ultra-low precision formats slashes both memory footprint and memory bandwidth bottlenecks during inference on edge devices.