GPT-OSS from Scratch - Inference with the Huggingface Model
Testing out the Huggingface version of gpt-oss-20b locally on consumer hardware
Regularization with a duplication-penalty term in the loss
How Triton Compiler Works Under the Hood!
Enough MLIR to be dangerous - how Triton uses MLIR passes to progressively lower IR
Improving the model to reduce duplicates and speed up training
Exploring a simple transformer model for sequence modelling in recommender systems
What happens when triton.compile is called in the frontend?
The missing tutorial on how a Triton program gets converted to CUDA kernels under the hood
Benchmarking our own GPT-2 model against the Huggingface GPT-2 model
Writing GPT-2 from scratch and loading weights from the pre-trained Huggingface model