Understanding MXFP4 Quantization
As LLMs continue to grow in size, quantized versions of these models provide a practical way to experiment with and post-train them locally. MXFP4 (Microscaling 4-bit Floating Point) quantization was recently used in the open-source gpt-oss releases. In this post, we’ll dive deep into how MXFP4 works and explore its compression benefits through practical examples.
What is MXFP4 Quantization?
MXFP4 is a quantization format that compresses neural network parameters using three key principles:
- 4 bits per element: Each quantized value uses only 4 bits
- Shared scaling factors: Elements within a block share a common scale factor
- Block-wise quantization: Elements are grouped into fixed-size blocks, typically 32 elements (this post also explores other block sizes)
This approach balances compression efficiency with precision by recognizing that nearby parameters in neural networks often have similar magnitudes.
MXFP4 Format Structure
Each MXFP4 value uses 4 bits arranged as:
- 1 bit: Sign
- 2 bits: Exponent
- 1 bit: Mantissa
Here’s the complete mapping of 4-bit codes to values:
Binary | Sign | Exp | Mantissa | Value |
---|---|---|---|---|
0000 | 0 | 00 | 0 | 0.0 |
0001 | 0 | 00 | 1 | 0.5 |
0010 | 0 | 01 | 0 | 1.0 |
0011 | 0 | 01 | 1 | 1.5 |
0100 | 0 | 10 | 0 | 2.0 |
0101 | 0 | 10 | 1 | 3.0 |
0110 | 0 | 11 | 0 | 4.0 |
0111 | 0 | 11 | 1 | 6.0 |
1000 | 1 | 00 | 0 | -0.0 |
1001 | 1 | 00 | 1 | -0.5 |
1010 | 1 | 01 | 0 | -1.0 |
1011 | 1 | 01 | 1 | -1.5 |
1100 | 1 | 10 | 0 | -2.0 |
1101 | 1 | 10 | 1 | -3.0 |
1110 | 1 | 11 | 0 | -4.0 |
1111 | 1 | 11 | 1 | -6.0 |
The MXFP4 format can represent values in the range [-6, 6] before applying the shared scale factor.
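These values follow directly from the E2M1 bit layout: with an exponent bias of 1, a nonzero exponent field gives 2^(exp-1) × (1 + mantissa/2), and an exponent field of 00 is subnormal, giving 0.0 or 0.5. Here is a minimal decoding sketch (the function name is just for illustration):

```python
def decode_mxfp4(code):
    """Decode a 4-bit E2M1 code into its real value."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11  # 2-bit exponent, bias 1
    man = code & 0b1          # 1-bit mantissa
    if exp == 0:
        magnitude = man * 0.5                    # subnormal: 0.0 or 0.5
    else:
        magnitude = 2 ** (exp - 1) * (1 + man * 0.5)
    return sign * magnitude

# Reproduces the table: [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, -0.0, -0.5, ...]
print([decode_mxfp4(code) for code in range(16)])
```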
Step-by-Step Example: Quantizing a BF16 Tensor
Let’s walk through quantizing a simple 4-element bf16 tensor:
```python
# Original bf16 values
original = [2.5, -1.25, 0.75, 4.0]
```
Step 1: Calculate the Shared Scale Factor
```python
import math

# Find the maximum absolute value
max_abs = max(abs(x) for x in original)          # max_abs = 4.0

# Naive scale factor to fit the MXFP4 range [-6, 6]
scale_factor = max_abs / 6.0                     # scale_factor = 4.0/6.0 ≈ 0.6667

# For efficiency, round the scale up to a power of 2
scale_exp = math.ceil(math.log2(max_abs / 6.0))  # scale_exp = 0
scale_factor = 2 ** scale_exp                    # scale_factor = 1.0
```
Step 2: Scale the Values
```python
scaled_values = [x / scale_factor for x in original]
# scaled_values = [2.5, -1.25, 0.75, 4.0]
```
Step 3: Quantize to MXFP4
Find the nearest MXFP4 representation for each scaled value:
```python
# Nearest MXFP4 value for each scaled input:
#   2.5  →  2.0  (code 0b0100)
#  -1.25 → -1.0  (code 0b1010)
#   0.75 →  1.0  (code 0b0010)
#   4.0  →  4.0  (code 0b0110)
mxfp4_values = [2.0, -1.0, 1.0, 4.0]
mxfp4_codes = [0b0100, 0b1010, 0b0010, 0b0110]
```
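In code, round-to-nearest against the eight representable magnitudes looks like the sketch below; ties are broken toward an even mantissa bit, which matches the mapping above (`quantize_to_mxfp4` and `MXFP4_VALUES` are illustrative names, not library APIs):

```python
# Representable MXFP4 magnitudes; the list index equals the low 3 bits of the code
MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_to_mxfp4(x):
    """Return (quantized value, 4-bit code) for one already-scaled input."""
    sign_bit = 0b1000 if x < 0 else 0b0000
    mag = abs(x)
    # Nearest magnitude; on ties prefer an even mantissa bit (i & 1 == 0)
    idx = min(range(len(MXFP4_VALUES)),
              key=lambda i: (abs(MXFP4_VALUES[i] - mag), i & 1))
    value = -MXFP4_VALUES[idx] if sign_bit else MXFP4_VALUES[idx]
    return value, sign_bit | idx

print([quantize_to_mxfp4(v) for v in [2.5, -1.25, 0.75, 4.0]])
# [(2.0, 4), (-1.0, 10), (1.0, 2), (4.0, 6)]  i.e. codes 0b0100, 0b1010, 0b0010, 0b0110
```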
Step 4: Storage Format
```python
# Metadata (one shared scale per block)
scale_factor = 1.0
scale_exponent = 0  # since scale_factor = 2**0

# Quantized 4-bit values (packed into 2 bytes)
packed_data = 0b0100101000100110  # 0b01001010, 0b00100110
```
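In practice the 4-bit codes are packed two per byte; which code takes the high nibble is an implementation choice. Here is a minimal packing sketch that puts the first code in the high nibble, matching the bit string above (`pack_codes` is an illustrative helper, not a library function):

```python
def pack_codes(codes):
    """Pack 4-bit codes two per byte, first code in the high nibble."""
    packed = bytearray()
    for i in range(0, len(codes), 2):
        hi = codes[i] & 0xF
        lo = codes[i + 1] & 0xF if i + 1 < len(codes) else 0
        packed.append((hi << 4) | lo)
    return bytes(packed)

packed = pack_codes([0b0100, 0b1010, 0b0010, 0b0110])
print(packed.hex())  # '4a26' -> bytes 0b01001010, 0b00100110
```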
Step 5: Dequantization
```python
# Unpack the 4-bit codes
codes = [0b0100, 0b1010, 0b0010, 0b0110]

# Map codes back to MXFP4 values
mxfp4_values = [2.0, -1.0, 1.0, 4.0]

# Apply the shared scale factor
dequantized = [val * scale_factor for val in mxfp4_values]
# dequantized = [2.0, -1.0, 1.0, 4.0]
```
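These steps can also be put together as a small helper that reverses the nibble packing and reuses the `decode_mxfp4` sketch from earlier (again, illustrative names only):

```python
def unpack_and_dequantize(packed, scale_factor, count):
    """Unpack 4-bit codes (high nibble first) and apply the block scale."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0xF)
    return [decode_mxfp4(c) * scale_factor for c in codes[:count]]

print(unpack_and_dequantize(bytes.fromhex("4a26"), 1.0, 4))
# [2.0, -1.0, 1.0, 4.0]
```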
Results Comparison
```
Original:     [2.5, -1.25,  0.75, 4.0]
Dequantized:  [2.0, -1.0,   1.0,  4.0]
Error:        [0.5, -0.25, -0.25, 0.0]
```
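The per-element error (original minus dequantized) can be checked directly, reusing `original` and `dequantized` from the snippets above:

```python
errors = [o - d for o, d in zip(original, dequantized)]
print(errors)  # [0.5, -0.25, -0.25, 0.0]
```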
Storage Efficiency Analysis
Small Tensor (4 elements)
- Original bf16: 4 × 16 bits = 64 bits
- MXFP4: 4 × 4 bits + 16 bits (scale) = 32 bits
- Compression ratio: 2:1

(Throughout this analysis the per-block scale is stored as a 16-bit value; the OCP Microscaling spec encodes the shared scale as an 8-bit E8M0 exponent, which would improve these ratios slightly.)
Large Tensor (1 Million elements)
The compression ratio improves with larger tensors because each scale factor is amortized over a full block of elements rather than just 4. Using 32-element blocks:
- Total elements: 1,000,000
- Number of blocks: 1,000,000 ÷ 32 = 31,250 blocks
Original bf16: 1,000,000 × 16 bits = 16,000,000 bits
MXFP4:
- Quantized values: 1,000,000 × 4 bits = 4,000,000 bits
- Scale factors: 31,250 × 16 bits = 500,000 bits
- Total: 4,500,000 bits
Compression ratio: 16,000,000 ÷ 4,500,000 ≈ 3.56:1
Block Size Impact
Block Size | Num Blocks | Scale Overhead (bits) | Total MXFP4 Bits | Compression Ratio |
---|---|---|---|---|
16 | 62,500 | 1,000,000 | 5,000,000 | 3.20:1 |
32 | 31,250 | 500,000 | 4,500,000 | 3.56:1 |
64 | 15,625 | 250,000 | 4,250,000 | 3.76:1 |
128 | 7,813 | 125,008 | 4,125,008 | 3.88:1 |
256 | 3,907 | 62,512 | 4,062,512 | 3.94:1 |
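Under the same assumptions (a 16-bit scale per block, block count rounded up), the table can be reproduced with a few lines of arithmetic; `mxfp4_bits` is an illustrative helper:

```python
import math

def mxfp4_bits(num_elements, block_size, elem_bits=4, scale_bits=16):
    """Total storage in bits: 4-bit codes plus one scale per block."""
    num_blocks = math.ceil(num_elements / block_size)
    return num_elements * elem_bits + num_blocks * scale_bits

n = 1_000_000
original_bits = n * 16  # bf16
for block_size in (16, 32, 64, 128, 256):
    total = mxfp4_bits(n, block_size)
    print(f"{block_size:>3}: {total:,} bits, {original_bits / total:.2f}:1")
# e.g. 32: 4,500,000 bits, 3.56:1 — matching the table above
```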
Key Insights
Theoretical Maximum
As block size approaches infinity, the compression ratio approaches:
```
Limit = original_bits_per_element / quantized_bits_per_element = 16 / 4 = 4:1
```
Block Size Trade-off
- Larger blocks: Better compression ratio
- Smaller blocks: Better quantization accuracy, because values within each block are more similar in magnitude (illustrated in the sketch after this list)
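A quick way to see the accuracy side of this trade-off is to quantize a vector whose magnitudes span a wide range: with one scale over all eight values the small entries collapse to ±0.0, while one scale per four values preserves them. This sketch reuses the `quantize_to_mxfp4` helper from Step 3; the helper names and tiny block sizes are purely illustrative:

```python
import math

def quantize_block(block):
    """Quantize one block with a shared power-of-2 scale (as in Step 1)."""
    max_abs = max(abs(x) for x in block)
    scale = 2.0 ** math.ceil(math.log2(max_abs / 6.0)) if max_abs > 0 else 1.0
    return [quantize_to_mxfp4(x / scale)[0] * scale for x in block]

def quantize_blockwise(values, block_size):
    """Split the tensor into blocks and quantize each with its own scale."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out

# Magnitudes differ by roughly two orders of magnitude
data = [100.0, -80.0, 60.0, 75.0, 0.3, -0.2, 0.15, 0.25]
print(quantize_blockwise(data, block_size=8))  # small values collapse to ±0.0
print(quantize_blockwise(data, block_size=4))  # small values keep some resolution
```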
Real-World Impact
For a 1M element tensor (verified in the snippet after this list):
- Original bf16: 2.0 MB
- MXFP4 (32-elem blocks): 0.56 MB
- Memory saved: 1.44 MB (72% reduction)
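The same arithmetic in code form (treating 1 MB as 10^6 bytes):

```python
n = 1_000_000
bf16_mb = n * 2 / 1e6                        # 2 bytes per element -> 2.0 MB
mxfp4_bits_total = n * 4 + (n // 32) * 16    # 4-bit values + 16-bit block scales
mxfp4_mb = mxfp4_bits_total / 8 / 1e6        # -> 0.5625 MB
print(f"{bf16_mb} MB -> {mxfp4_mb:.2f} MB ({1 - mxfp4_mb / bf16_mb:.0%} saved)")
# 2.0 MB -> 0.56 MB (72% saved)
```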
MXFP4 Analysis Charts
Explore how MXFP4 quantization behaves across different scenarios:
Continuous to Discrete Mapping
Shows how continuous input values (blue line) map to discrete MXFP4 values (red line). The step function illustrates the quantization process.
Quantization Error
Shows the error (difference) between original and quantized values. Notice larger errors in sparse regions of the MXFP4 value space.
Interactive Visualizer
Try the interactive visualizer below to understand how MXFP4 quantization works:
[Interactive MXFP4 quantization visualizer: enter bf16 input values to see the corresponding MXFP4 output]
Try different values in the visualizer above to see how the quantization adapts!