Progressive CUDA GEMM Optimization: From Memory-Bound to Swizzling

A step-by-step guide to optimizing FP32 CUDA GEMM kernels. Learn how to overcome warp-level memory coalescing bottlenecks, eliminate 32-way shared memory bank conflicts using memory padding, and implement zero-waste XOR address swizzling.
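The padding and swizzling tricks mentioned above can be checked with pure index arithmetic, no GPU required. A minimal sketch (assuming 32 four-byte shared memory banks, the common configuration): when 32 threads read one column of a 32x32 FP32 tile, the naive layout hits a single bank, padding the row stride to 33 spreads the accesses, and a per-row XOR swizzle achieves the same spread with no wasted storage. The function name is illustrative.

```python
# Bank indices touched when 32 threads each read one element of a "column"
# of a 32x32 FP32 tile held in shared memory (32 four-byte banks assumed).
def banks_for_column(col, stride, swizzle=False):
    banks = []
    for row in range(32):
        c = (col ^ row) if swizzle else col   # XOR swizzle permutes columns per row
        addr = row * stride + c               # word index into the tile
        banks.append(addr % 32)               # bank = word index mod 32
    return banks

naive    = banks_for_column(col=0, stride=32)                # unpadded tile
padded   = banks_for_column(col=0, stride=33)                # +1 word of padding per row
swizzled = banks_for_column(col=0, stride=32, swizzle=True)  # XOR, zero padding

print(len(set(naive)), len(set(padded)), len(set(swizzled)))  # → 1 32 32
```

The naive column access is a 32-way conflict (one distinct bank); both fixes reach all 32 banks, but only the swizzle does so without enlarging the tile.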

Deep Dive into Triton GEMM Optimization: From Naive Tiling to Hopper TMA

A step-by-step guide to optimizing GEMM in Triton, covering tiling, autotuning, L2 cache optimizations, and Hopper TMA.
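The L2 cache optimization referenced here is the grouped program ordering from Triton's matmul tutorial: remap the linear program id so that a group of row-tiles is launched together, letting their shared column-tiles of B stay resident in L2. A pure-Python sketch of that index remapping (function name is illustrative; the formulas follow the tutorial's `GROUP_SIZE_M` scheme):

```python
def grouped_pid(pid, num_pid_m, num_pid_n, group_size_m):
    # Remap a linear program id so that group_size_m row-tiles are launched
    # together, improving L2 reuse of the shared B-matrix column tiles.
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    group_size = min(num_pid_m - first_pid_m, group_size_m)  # last group may be ragged
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size
    pid_n = (pid % num_pid_in_group) // group_size
    return pid_m, pid_n

# The remapping is a bijection over the tile grid: every (pid_m, pid_n)
# pair is produced exactly once, just in a cache-friendlier order.
tiles = [grouped_pid(p, 4, 4, 2) for p in range(16)]
assert sorted(tiles) == [(m, n) for m in range(4) for n in range(4)]
```

Inside a Triton kernel the same arithmetic runs on `tl.program_id(axis=0)`; here it is plain Python so the traversal order can be inspected directly.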

Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers

A rigorous breakdown of FLOPs in Llama-style architectures, deriving the relationship between linear projections and quadratic attention overhead, with insights into sample packing efficiency.
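The linear-vs-quadratic split can be sketched with the common PaLM-style accounting: roughly 6N training FLOPs per token for the parameter matmuls (forward + backward), plus a 12·L·d·T term for the attention score/value products that the 6N figure omits (MFU conventionally counts only the former; HFU also counts the extra work, including any recomputation, which this sketch ignores). Function name and the 7B-like shapes below are illustrative, not official specs.

```python
def flops_per_token(n_params, n_layers, d_model, seq_len):
    # PaLM-style training estimate: 6N FLOPs/token for parameter matmuls
    # (fwd + bwd), plus 12 * L * d * T for the quadratic attention
    # score/value products, which the 6N term omits.
    linear = 6 * n_params
    attention = 12 * n_layers * d_model * seq_len
    return linear, attention

# Llama-7B-like shapes (illustrative numbers)
linear, attn = flops_per_token(6.7e9, 32, 4096, 4096)
print(attn / (linear + attn))  # attention's share grows linearly with seq_len
```

Doubling the sequence length doubles the attention term while the linear term is unchanged, which is exactly why MFU and HFU diverge at long context.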