Progressive CUDA GEMM Optimization: From Memory-Bound to Swizzling
Loss Reduction in Distributed Training
Computing Global Gradient Norm in Distributed Training: TP, DP_Shard, DP_Replicate, EP, and PP
Demystifying FlashAttention: Forward, Backward, and Triton Implementation
The Devil in the Details: Engineering Tricks for SOTA Video Models
Deep Dive into Triton GEMM Optimization: From Naive Tiling to Hopper TMA
Roofline Analysis of LLMs on H200: Performance Modeling and Recomputation Strategies
From DDPM to Flow Matching: The Evolution of Generative Trajectories
From DiT to Hunyuan: The Evolution of adaLN-Zero in Generative Models
Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers
Visualizing 3D Attention: Bridging the Gap Between 1D Sequences and 3D Space