Progressive CUDA GEMM Optimization: From Memory-Bound to Swizzling
A step-by-step guide to optimizing FP32 CUDA GEMM kernels. Learn how to fix uncoalesced warp-level global memory accesses, eliminate 32-way shared memory bank conflicts with padding, and implement XOR address swizzling, which removes the same conflicts without wasting any shared memory.
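
To make the bank-conflict vocabulary concrete, here is a minimal host-side sketch that counts how many lanes of a warp land in the same shared memory bank when they read one column of a tile. It assumes a 32x32 FP32 tile and the usual layout of 32 banks, each 4 bytes wide, so a float's bank is its word index modulo 32. The names `naive_index`, `padded_index`, `swizzled_index`, and `column_conflict_degree` are hypothetical and chosen for illustration. It is plain C++ so it can be checked without a GPU; the same index functions could be marked `__device__` and used inside a kernel.

```cpp
// Assumptions: 32 banks of 4 bytes each, a 32x32 float tile, one warp of 32 lanes.
constexpr int kBanks = 32;
constexpr int kTile = 32;

// Row-major tile with no padding: every element of a column shares one bank.
inline int naive_index(int row, int col) { return row * kTile + col; }

// One extra float per row (stride 33): rows shift by one bank each,
// at the cost of 32 unused floats per tile.
inline int padded_index(int row, int col) { return row * (kTile + 1) + col; }

// XOR swizzle: the stride stays 32 (no wasted space), but the column
// is permuted per row so a logical column spreads across all banks.
inline int swizzled_index(int row, int col) { return row * kTile + (col ^ row); }

// Simulates lane i reading element (row = i, col) and returns the largest
// number of lanes that hit the same bank (1 means conflict-free).
template <typename IndexFn>
int column_conflict_degree(IndexFn index_fn, int col) {
    int hits[kBanks] = {};
    for (int lane = 0; lane < 32; ++lane) {
        ++hits[index_fn(lane, col) % kBanks];
    }
    int worst = 0;
    for (int b = 0; b < kBanks; ++b) {
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

Calling `column_conflict_degree(naive_index, 0)` reports a 32-way conflict, while the padded and swizzled layouts both report 1. The swizzled layout gets there with the original 32-float row stride, because XOR with a fixed column is a one-to-one mapping on the values 0 to 31.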