2026 (12)

March (4)

Progressive CUDA GEMM Optimization: From Memory-Bound to Swizzling

Loss Reduction in Distributed Training

Computing Global Gradient Norm in Distributed Training: TP, DP_Shard, DP_Replicate, EP, and PP

Demystifying FlashAttention: Forward, Backward, and Triton Implementation

February (5)

The Devil in the Details: Engineering Tricks for SOTA Video Models

Deep Dive into Triton GEMM Optimization: From Naive Tiling to Hopper TMA

Roofline Analysis of LLMs on H200: Performance Modeling and Recomputation Strategies

From DDPM to Flow Matching: The Evolution of Generative Trajectories

From DiT to Hunyuan: The Evolution of adaLN-Zero in Generative Models

January (3)

Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers

Visualizing 3D Attention: Bridging the Gap Between 1D Sequences and 3D Space

GPU & Network Constants