Understanding MXFP4 Quantization

As LLMs continue to grow in size, quantized versions of these models provide a practical way to experiment with and post-train them locally. MXFP4 (Microscaling 4-bit Floating Point) quantization was recently used in the open-weight gpt-oss release. In this post, we'll dive deep into how MXFP4 works and explore its compression benefits through practical examples.

What is MXFP4 Quantization?

MXFP4 is a quantization format that compresses neural network parameters using three key principles:

  • 4 bits per element: Each quantized value uses only 4 bits
  • Shared scaling factors: Elements within a block share a common scale factor
  • Block-wise quantization: Elements are grouped into fixed-size blocks, typically 32 per block (other sizes are explored later in this post)

This approach balances compression efficiency with precision by recognizing that nearby parameters in neural networks often have similar magnitudes.
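
Concretely, you can picture the quantized tensor as a sequence of blocks, each carrying one shared scale plus the 4-bit codes of its elements. Here is a rough sketch of such a layout (illustrative only, not any particular library's internal format; a 32-element block size is assumed):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MXFP4Block:
    scale_exp: int   # shared power-of-two scale for the block, stored as an exponent
    codes: bytes     # 32 elements x 4 bits, packed two per byte -> 16 bytes

@dataclass
class MXFP4Tensor:
    shape: Tuple[int, ...]      # logical shape of the original tensor
    blocks: List[MXFP4Block]    # one block per 32 consecutive elements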

MXFP4 Format Structure

Each MXFP4 value uses 4 bits arranged as:

  • 1 bit: Sign
  • 2 bits: Exponent
  • 1 bit: Mantissa

Here’s the complete mapping of 4-bit codes to values:

Binary   Sign   Exp   Mantissa   Value
0000     0      00    0           0.0
0001     0      00    1           0.5
0010     0      01    0           1.0
0011     0      01    1           1.5
0100     0      10    0           2.0
0101     0      10    1           3.0
0110     0      11    0           4.0
0111     0      11    1           6.0
1000     1      00    0          -0.0
1001     1      00    1          -0.5
1010     1      01    0          -1.0
1011     1      01    1          -1.5
1100     1      10    0          -2.0
1101     1      10    1          -3.0
1110     1      11    0          -4.0
1111     1      11    1          -6.0

The MXFP4 format can represent values in the range [-6, 6] before applying the shared scale factor.
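
For reference, the table can be captured as a plain lookup table indexed by the 4-bit code; decode_mxfp4 below is a hypothetical helper for illustration, not a library function:

# 4-bit code -> E2M1 value, matching the table above (index = code).
MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_mxfp4(code):
    # Map a 4-bit code (0-15) to its value, before the shared scale is applied.
    return MXFP4_VALUES[code & 0xF]

print(decode_mxfp4(0b0111))   # 6.0
print(decode_mxfp4(0b1010))   # -1.0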

Step-by-Step Example: Quantizing a BF16 Tensor

Let’s walk through quantizing a simple 4-element bf16 tensor:

# Original bf16 values
original = [2.5, -1.25, 0.75, 4.0]

Step 1: Calculate the Shared Scale Factor

import math

# Find the maximum absolute value
max_abs = max(abs(x) for x in original)  # max_abs = 4.0

# Calculate scale factor to fit in MXFP4 range [-6, 6]
scale_factor = max_abs / 6.0  # scale_factor = 4.0/6.0 ≈ 0.6667

# For efficiency, use power-of-2 scaling
scale_exp = math.ceil(math.log2(max_abs / 6.0))  # scale_exp = 0
scale_factor = 2 ** scale_exp  # scale_factor = 1.0

Step 2: Scale the Values

scaled_values = [x / scale_factor for x in original]
# scaled_values = [2.5, -1.25, 0.75, 4.0]

Step 3: Quantize to MXFP4

Find the nearest MXFP4 representation for each scaled value. Values that land exactly halfway between two representable values (like 2.5, -1.25, and 0.75 here) round to the code with an even mantissa bit, i.e. round-to-nearest-even:

# Quantization mapping:
# 2.5   → 2.0  (code: 0b0100)
# -1.25 → -1.0 (code: 0b1010)
# 0.75  → 1.0  (code: 0b0010)
# 4.0   → 4.0  (code: 0b0110)

mxfp4_values = [2.0, -1.0, 1.0, 4.0]
mxfp4_codes = [0b0100, 0b1010, 0b0010, 0b0110]
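
As a sketch, this nearest-value search (with the even-mantissa tie-break) can be written as a brute-force scan over the 16 codes; encode_mxfp4 is a hypothetical helper for illustration, not a library function:

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def encode_mxfp4(x):
    # Pick the code whose value is closest to x; on a tie, prefer the code
    # whose mantissa bit (lowest bit) is 0, i.e. round-to-nearest-even.
    return min(range(16), key=lambda c: (abs(MXFP4_VALUES[c] - x), c & 1))

codes = [encode_mxfp4(v) for v in [2.5, -1.25, 0.75, 4.0]]
print([format(c, '04b') for c in codes])   # ['0100', '1010', '0010', '0110']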

Step 4: Storage Format

# Metadata
scale_factor = 1.0
scale_exponent = 0  # Since scale_factor = 2^0

# Quantized 4-bit values (packed into 2 bytes)
packed_data = 0b0100101000100110  # 01001010 00100110
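
A minimal sketch of that packing step, assuming the first code of each pair lands in the high nibble (real kernels define their own bit order):

def pack_codes(codes):
    # Pack 4-bit codes two per byte, first code in the high nibble (assumed order).
    out = bytearray()
    for i in range(0, len(codes), 2):
        hi = codes[i] & 0xF
        lo = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0
        out.append((hi << 4) | lo)
    return bytes(out)

packed = pack_codes([0b0100, 0b1010, 0b0010, 0b0110])
print(packed.hex())   # '4a26' -> bits 01001010 00100110, as above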

Step 5: Dequantization

# Unpack 4-bit codes
codes = [0b0100, 0b1010, 0b0010, 0b0110]

# Map codes to MXFP4 values
mxfp4_values = [2.0, -1.0, 1.0, 4.0]

# Apply scale factor
dequantized = [val * scale_factor for val in mxfp4_values]
# dequantized = [2.0, -1.0, 1.0, 4.0]
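
Going the other way in one pass, unpacking and applying the shared scale can be sketched with the same assumed nibble order as the packing example above:

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def unpack_and_dequantize(packed, scale_factor):
    # Split each byte into two 4-bit codes (high nibble first), then scale.
    values = []
    for byte in packed:
        for code in (byte >> 4, byte & 0xF):
            values.append(MXFP4_VALUES[code] * scale_factor)
    return values

print(unpack_and_dequantize(bytes.fromhex('4a26'), 1.0))   # [2.0, -1.0, 1.0, 4.0]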

Results Comparison

Original:     [2.5, -1.25,  0.75, 4.0]
Dequantized:  [2.0, -1.0,   1.0,  4.0]
Error:        [0.5, -0.25, -0.25, 0.0]   (original − dequantized)

Storage Efficiency Analysis

Small Tensor (4 elements)

  • Original bf16: 4 × 16 bits = 64 bits
  • MXFP4: 4 × 4 bits + 16 bits (scale) = 32 bits
  • Compression ratio: 2:1

Large Tensor (1 Million elements)

The compression ratio improves noticeably with larger tensors because the per-block scale overhead becomes negligible. The analysis below uses 32-element blocks and, as in the example above, a 16-bit scale per block (the OCP MX specification stores the shared scale as an 8-bit exponent, which would compress slightly better):

  • Total elements: 1,000,000
  • Number of blocks: 1,000,000 ÷ 32 = 31,250 blocks

Original bf16: 1,000,000 × 16 bits = 16,000,000 bits

MXFP4:

  • Quantized values: 1,000,000 × 4 bits = 4,000,000 bits
  • Scale factors: 31,250 × 16 bits = 500,000 bits
  • Total: 4,500,000 bits

Compression ratio: 16,000,000 ÷ 4,500,000 ≈ 3.56:1

Block Size Impact

Block Size   Num Blocks   Scale Overhead (bits)   Total MXFP4 Bits   Compression Ratio
16           62,500       1,000,000               5,000,000          3.20:1
32           31,250       500,000                 4,500,000          3.56:1
64           15,625       250,000                 4,250,000          3.76:1
128          7,813        125,008                 4,125,008          3.88:1
256          3,907        62,512                  4,062,512          3.94:1
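
The table above can be reproduced with a small helper. This is just a sketch of the arithmetic, keeping the 16-bit-per-block scale assumption used throughout this example and rounding the block count up:

import math

def mxfp4_compression_ratio(n_elements, block_size, scale_bits=16, orig_bits=16):
    # Bits needed by MXFP4: 4 bits per element plus one scale per block.
    n_blocks = math.ceil(n_elements / block_size)
    mxfp4_bits = n_elements * 4 + n_blocks * scale_bits
    return n_elements * orig_bits / mxfp4_bits

for bs in (16, 32, 64, 128, 256):
    print(f"{bs:>4}: {mxfp4_compression_ratio(1_000_000, bs):.2f}:1")
# Prints 3.20:1, 3.56:1, 3.76:1, 3.88:1, 3.94:1 -- matching the table above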

Key Insights

Theoretical Maximum

As block size approaches infinity, the compression ratio approaches:

Limit = original_bits_per_element / quantized_bits_per_element = 16 / 4 = 4:1

Block Size Trade-off

  • Larger blocks: Better compression ratio
  • Smaller blocks: Better quantization accuracy (values in each block are more similar)

Real-World Impact

For a 1M element tensor:

  • Original bf16: 2.0 MB
  • MXFP4 (32-elem blocks): 0.56 MB
  • Memory saved: 1.44 MB (72% reduction)

MXFP4 Analysis Charts

Explore how MXFP4 quantization behaves across different scenarios:

Continuous to Discrete Mapping

Shows how continuous input values (blue line) map to discrete MXFP4 values (red line). The step function illustrates the quantization process.

Quantization Error

Shows the error (difference) between original and quantized values. Notice larger errors in sparse regions of the MXFP4 value space.
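
If you want to regenerate curves like these yourself, here is a rough sketch (it reuses the value table and the nearest-value search from earlier, with no shared scale applied):

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def quantize(x):
    # Nearest representable MXFP4 value; ties go to the even-mantissa code.
    return MXFP4_VALUES[min(range(16), key=lambda c: (abs(MXFP4_VALUES[c] - x), c & 1))]

xs = [i / 100 for i in range(-600, 601)]   # continuous inputs in [-6, 6]
ys = [quantize(x) for x in xs]             # the discrete step function
errs = [x - y for x, y in zip(xs, ys)]     # quantization error per input
print(max(abs(e) for e in errs))           # 1.0, hit in the sparse region between 4 and 6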

Interactive Visualizer

Try the interactive visualizer below to understand how MXFP4 quantization works:

🔬 MXFP4 Quantization Visualizer

[Interactive widget: enter BF16 values to see their 4-bit MXFP4 codes, the shared scale factor, and the resulting compression ratio.]

Try different values in the visualizer above to see how the quantization adapts!

This post is licensed under CC BY 4.0 by the author.