Understanding MXFP4 Quantization

As LLMs continue to grow in size, quantized versions of these models provide a practical way to experiment with and post-train them locally. MXFP4 (Microscaling 4-bit Floating Point) quantization was recently used in the open-weight gpt-oss release. In this post, we'll dive deep into how MXFP4 works and explore its compression benefits through practical examples.

What is MXFP4 Quantization?

MXFP4 is a quantization format that compresses neural network parameters using three key principles:

  • 4 bits per element: Each quantized value uses only 4 bits
  • Shared scaling factors: Elements within a block share a common scale factor
  • Block-wise quantization: Elements are grouped into fixed-size blocks, typically 32 per block (other sizes are explored later in this post)

This approach balances compression efficiency with precision by recognizing that nearby parameters in neural networks often have similar magnitudes.
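
Concretely, you can picture the quantized tensor as a sequence of blocks, each carrying one shared scale plus the 4-bit codes of its elements. Here is a rough sketch of such a layout (illustrative only, not any particular library's internal format; a 32-element block size is assumed):

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MXFP4Block:
    scale_exp: int   # shared power-of-two scale for the block, stored as an exponent
    codes: bytes     # 32 elements x 4 bits, packed two per byte -> 16 bytes

@dataclass
class MXFP4Tensor:
    shape: Tuple[int, ...]      # logical shape of the original tensor
    blocks: List[MXFP4Block]    # one block per 32 consecutive elements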

MXFP4 Format Structure

Each MXFP4 value uses 4 bits arranged as:

  • 1 bit: Sign
  • 2 bits: Exponent
  • 1 bit: Mantissa

Here’s the complete mapping of 4-bit codes to values:

Binary   Sign   Exp   Mantissa   Value
0000     0      00    0           0.0
0001     0      00    1           0.5
0010     0      01    0           1.0
0011     0      01    1           1.5
0100     0      10    0           2.0
0101     0      10    1           3.0
0110     0      11    0           4.0
0111     0      11    1           6.0
1000     1      00    0          -0.0
1001     1      00    1          -0.5
1010     1      01    0          -1.0
1011     1      01    1          -1.5
1100     1      10    0          -2.0
1101     1      10    1          -3.0
1110     1      11    0          -4.0
1111     1      11    1          -6.0

The MXFP4 format can represent values in the range [-6, 6] before applying the shared scale factor.
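
For reference, the table can be captured as a plain lookup table indexed by the 4-bit code; decode_mxfp4 below is a hypothetical helper for illustration, not a library function:

# 4-bit code -> E2M1 value, matching the table above (index = code).
MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_mxfp4(code):
    # Map a 4-bit code (0-15) to its value, before the shared scale is applied.
    return MXFP4_VALUES[code & 0xF]

print(decode_mxfp4(0b0111))   # 6.0
print(decode_mxfp4(0b1010))   # -1.0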

Step-by-Step Example: Quantizing a BF16 Tensor

Let’s walk through quantizing a simple 4-element bf16 tensor:

# Original bf16 values
original = [2.5, -1.25, 0.75, 4.0]

Step 1: Calculate the Shared Scale Factor

import math

# Find the maximum absolute value
max_abs = max(abs(x) for x in original)  # max_abs = 4.0

# Calculate scale factor to fit in MXFP4 range [-6, 6]
scale_factor = max_abs / 6.0  # scale_factor = 4.0/6.0 ≈ 0.6667

# For efficiency, use power-of-2 scaling
scale_exp = math.ceil(math.log2(max_abs / 6.0))  # scale_exp = 0
scale_factor = 2 ** scale_exp  # scale_factor = 1.0

Step 2: Scale the Values

scaled_values = [x / scale_factor for x in original]
# scaled_values = [2.5, -1.25, 0.75, 4.0]

Step 3: Quantize to MXFP4

Find the nearest MXFP4 representation for each scaled value. Values that land exactly halfway between two representable values (like 2.5, -1.25, and 0.75 here) round to the code with an even mantissa bit, i.e. round-to-nearest-even:

# Quantization mapping:
# 2.5   → 2.0  (code: 0b0100)
# -1.25 → -1.0 (code: 0b1010)
# 0.75  → 1.0  (code: 0b0010)
# 4.0   → 4.0  (code: 0b0110)

mxfp4_values = [2.0, -1.0, 1.0, 4.0]
mxfp4_codes = [0b0100, 0b1010, 0b0010, 0b0110]
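
As a sketch, this nearest-value search (with the even-mantissa tie-break) can be written as a brute-force scan over the 16 codes; encode_mxfp4 is a hypothetical helper for illustration, not a library function:

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def encode_mxfp4(x):
    # Pick the code whose value is closest to x; on a tie, prefer the code
    # whose mantissa bit (lowest bit) is 0, i.e. round-to-nearest-even.
    return min(range(16), key=lambda c: (abs(MXFP4_VALUES[c] - x), c & 1))

codes = [encode_mxfp4(v) for v in [2.5, -1.25, 0.75, 4.0]]
print([format(c, '04b') for c in codes])   # ['0100', '1010', '0010', '0110']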

Step 4: Storage Format

# Metadata
scale_factor = 1.0
scale_exponent = 0  # Since scale_factor = 2^0

# Quantized 4-bit values (packed into 2 bytes)
packed_data = 0b0100101000100110  # 01001010 00100110
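
A minimal sketch of that packing step, assuming the first code of each pair lands in the high nibble (real kernels define their own bit order):

def pack_codes(codes):
    # Pack 4-bit codes two per byte, first code in the high nibble (assumed order).
    out = bytearray()
    for i in range(0, len(codes), 2):
        hi = codes[i] & 0xF
        lo = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0
        out.append((hi << 4) | lo)
    return bytes(out)

packed = pack_codes([0b0100, 0b1010, 0b0010, 0b0110])
print(packed.hex())   # '4a26' -> bits 01001010 00100110, as above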

Step 5: Dequantization

# Unpack 4-bit codes
codes = [0b0100, 0b1010, 0b0010, 0b0110]

# Map codes to MXFP4 values
mxfp4_values = [2.0, -1.0, 1.0, 4.0]

# Apply scale factor
dequantized = [val * scale_factor for val in mxfp4_values]
# dequantized = [2.0, -1.0, 1.0, 4.0]
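
Going the other way in one pass, unpacking and applying the shared scale can be sketched with the same assumed nibble order as the packing example above:

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def unpack_and_dequantize(packed, scale_factor):
    # Split each byte into two 4-bit codes (high nibble first), then scale.
    values = []
    for byte in packed:
        for code in (byte >> 4, byte & 0xF):
            values.append(MXFP4_VALUES[code] * scale_factor)
    return values

print(unpack_and_dequantize(bytes.fromhex('4a26'), 1.0))   # [2.0, -1.0, 1.0, 4.0]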

Results Comparison

Original:     [2.5, -1.25,  0.75, 4.0]
Dequantized:  [2.0, -1.0,   1.0,  4.0]
Error:        [0.5, -0.25, -0.25, 0.0]   (original − dequantized)

Storage Efficiency Analysis

Small Tensor (4 elements)

  • Original bf16: 4 × 16 bits = 64 bits
  • MXFP4: 4 × 4 bits + 16 bits (scale) = 32 bits
  • Compression ratio: 2:1

Large Tensor (1 Million elements)

The compression ratio improves noticeably with larger tensors because the per-block scale overhead becomes negligible. The analysis below uses 32-element blocks and, as in the example above, a 16-bit scale per block (the OCP MX specification stores the shared scale as an 8-bit exponent, which would compress slightly better):

  • Total elements: 1,000,000
  • Number of blocks: 1,000,000 ÷ 32 = 31,250 blocks

Original bf16: 1,000,000 × 16 bits = 16,000,000 bits

MXFP4:

  • Quantized values: 1,000,000 × 4 bits = 4,000,000 bits
  • Scale factors: 31,250 × 16 bits = 500,000 bits
  • Total: 4,500,000 bits

Compression ratio: 16,000,000 ÷ 4,500,000 ≈ 3.56:1

Block Size Impact

Block Size   Num Blocks   Scale Overhead (bits)   Total MXFP4 Bits   Compression Ratio
16           62,500       1,000,000               5,000,000          3.20:1
32           31,250       500,000                 4,500,000          3.56:1
64           15,625       250,000                 4,250,000          3.76:1
128          7,813        125,008                 4,125,008          3.88:1
256          3,907        62,512                  4,062,512          3.94:1
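
The table above can be reproduced with a small helper. This is just a sketch of the arithmetic, keeping the 16-bit-per-block scale assumption used throughout this example and rounding the block count up:

import math

def mxfp4_compression_ratio(n_elements, block_size, scale_bits=16, orig_bits=16):
    # Bits needed by MXFP4: 4 bits per element plus one scale per block.
    n_blocks = math.ceil(n_elements / block_size)
    mxfp4_bits = n_elements * 4 + n_blocks * scale_bits
    return n_elements * orig_bits / mxfp4_bits

for bs in (16, 32, 64, 128, 256):
    print(f"{bs:>4}: {mxfp4_compression_ratio(1_000_000, bs):.2f}:1")
# Prints 3.20:1, 3.56:1, 3.76:1, 3.88:1, 3.94:1 -- matching the table above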

Key Insights

Theoretical Maximum

As block size approaches infinity, the compression ratio approaches:

Limit = original_bits_per_element / quantized_bits_per_element = 16 / 4 = 4:1

Block Size Trade-off

  • Larger blocks: Better compression ratio
  • Smaller blocks: Better quantization accuracy (values in each block are more similar)

Real-World Impact

For a 1M element tensor:

  • Original bf16: 2.0 MB
  • MXFP4 (32-elem blocks): 0.56 MB
  • Memory saved: 1.44 MB (72% reduction)

MXFP4 Analysis Charts

Explore how MXFP4 quantization behaves across different scenarios:

Continuous to Discrete Mapping

Shows how continuous input values (blue line) map to discrete MXFP4 values (red line). The step function illustrates the quantization process.

Quantization Error

Shows the error (difference) between original and quantized values. Notice larger errors in sparse regions of the MXFP4 value space.
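
If you want to regenerate curves like these yourself, here is a rough sketch (it reuses the value table and the nearest-value search from earlier, with no shared scale applied):

MXFP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def quantize(x):
    # Nearest representable MXFP4 value; ties go to the even-mantissa code.
    return MXFP4_VALUES[min(range(16), key=lambda c: (abs(MXFP4_VALUES[c] - x), c & 1))]

xs = [i / 100 for i in range(-600, 601)]   # continuous inputs in [-6, 6]
ys = [quantize(x) for x in xs]             # the discrete step function
errs = [x - y for x, y in zip(xs, ys)]     # quantization error per input
print(max(abs(e) for e in errs))           # 1.0, hit in the sparse region between 4 and 6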

Interactive Visualizer

Try the interactive visualizer below to understand how MXFP4 quantization works:

🔬 MXFP4 Quantization Visualizer

[Interactive widget: enter BF16 values to see their 4-bit MXFP4 codes, the shared scale factor, and the resulting compression ratio.]

Try different values in the visualizer above to see how the quantization adapts!

This post is licensed under CC BY 4.0 by the author.