Hi, I’m Yunsheng Ni 👋

Engineering notes on System Optimization & AI Algorithms.

Progressive CUDA GEMM Optimization: From Memory-Bound to Swizzling

A step-by-step guide to optimizing FP32 CUDA GEMM kernels. Learn how to overcome warp-level memory coalescing bottlenecks, eliminate 32-way shared memory bank conflicts using memory padding, and implement zero-waste XOR address swizzling.
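The index math behind XOR swizzling can be sketched in a few lines. This is an illustrative model of the addressing scheme only (not the CUDA kernel from the post), assuming a 32-column FP32 tile where the column index doubles as the bank index:

```python
# Hypothetical sketch of XOR address swizzling for a 32x32 FP32 shared-memory
# tile: XORing each element's column with its row spreads a column access
# (one element per row) across all 32 banks, with no padding waste.

NUM_BANKS = 32

def swizzle(row: int, col: int) -> int:
    """Return the swizzled column, which is the bank index for a 32-wide FP32 tile."""
    return col ^ (row % NUM_BANKS)

# Without swizzling, reading column 0 of every row hits bank 0 (a 32-way
# conflict); with the XOR, the 32 accesses land on 32 distinct banks.
banks = {swizzle(r, 0) for r in range(32)}
```

Because XOR is its own inverse, the mapping is a permutation within each row, so no shared-memory capacity is wasted.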

Loss Reduction in Distributed Training

An analysis of how Data Parallelism (DP) and Context Parallelism (CP) affect loss reduction, and how to maintain mathematical equivalence between distributed and single-device training for LLM and Video DiT models.
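The core equivalence issue can be shown with a toy reduction. This sketch (illustrative, not code from the post) assumes each rank reports its loss sum and valid-token count; averaging per-rank means breaks equivalence whenever token counts differ across ranks:

```python
# Naive DP reduction averages each rank's mean loss, which only matches
# single-device training when every rank holds the same number of valid
# tokens. Token-weighted reduction restores exact equivalence.

def global_loss(rank_loss_sums, rank_token_counts):
    """Token-weighted reduction: total loss over total tokens."""
    return sum(rank_loss_sums) / sum(rank_token_counts)

# Rank 0: 10 tokens, summed loss 20.0; rank 1: 30 tokens, summed loss 30.0.
naive = (20.0 / 10 + 30.0 / 30) / 2          # mean of per-rank means
exact = global_loss([20.0, 30.0], [10, 30])  # what a single device computes
```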

Computing Global Gradient Norm in Distributed Training: TP, DP_Shard, DP_Replicate, EP, and PP

A mathematical and engineering guide to calculating the exact global gradient norm across complex hybrid parallel training topologies. This post details the required hierarchical synchronization sequence across TP, DP, EP, and PP process groups to compute the norm without materializing full tensors, preventing double-counting and OOM errors.
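The mathematical core is simple once double-counting is ruled out. A minimal sketch, assuming each rank contributes the squared norm of its *unique* shard (replicated parameters counted once), with a single square root at the end:

```python
import math

# Each rank reduces the squared norm of its unique gradient shard; the
# squared norms are summed across sharded process groups, and one square
# root is taken at the end - no full tensor is ever materialized.

def shard_sq_norm(shard):
    return sum(g * g for g in shard)

def global_grad_norm(unique_shards):
    """unique_shards: per-rank gradient values with no overlap across ranks."""
    return math.sqrt(sum(shard_sq_norm(s) for s in unique_shards))

# Two TP shards of the gradient [3, 4] give norm 5, same as unsharded.
```

The engineering difficulty the post addresses is ensuring "no overlap": replicated groups must contribute once, which dictates the order of the hierarchical all-reduces.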

Demystifying FlashAttention: Forward, Backward, and Triton Implementation

A breakdown of FlashAttention’s forward and backward passes, including Online Softmax, LogSumExp materialization, gradient recomputation, and core Triton implementations.
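The normalization trick FlashAttention builds on can be shown in isolation. A minimal sketch of Online Softmax: one streaming pass maintains a running maximum `m` and a rescaled denominator `l`, so the full score vector never needs to be held twice:

```python
import math

# Online Softmax: whenever a new maximum appears, the accumulated
# denominator is rescaled by exp(m_old - m_new), keeping everything
# numerically stable in a single pass over the scores.

def online_softmax(scores):
    m, l = float("-inf"), 0.0
    for s in scores:
        m_new = max(m, s)
        l = l * math.exp(m - m_new) + math.exp(s - m_new)
        m = m_new
    return [math.exp(s - m) / l for s in scores]
```

In FlashAttention the same rescaling is applied blockwise to the output accumulator, and `m + log(l)` is the LogSumExp that gets materialized for the backward pass.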

The Devil in the Details: Engineering Tricks for SOTA Video Models

Theory is clean, but training is messy. This note covers 5 essential engineering tricks—from Timestep Shifting to 3D RoPE—that stabilize training and boost performance.
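One of the tricks named above fits in a single line. This is the published timestep-shifting map used in SD3-style flow models (shown here from its known form, not the post's code), where a shift factor `s > 1` biases sampled timesteps toward the noisier end for higher resolutions:

```python
# Timestep shifting: t' = s*t / (1 + (s - 1)*t). The map fixes the
# endpoints t = 0 and t = 1 and, for s > 1, pushes intermediate
# timesteps toward the high-noise region.

def shift_timestep(t: float, s: float) -> float:
    return s * t / (1.0 + (s - 1.0) * t)
```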

Deep Dive into Triton GEMM Optimization: From Naive Tiling to Hopper TMA

A step-by-step guide to optimizing GEMM in Triton, covering Tiling, Autotuning, L2 Cache Optimizations, and Hopper TMA.
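The L2 optimization is purely an index remapping, so it can be sketched outside Triton. This follows the grouped-ordering scheme from the official Triton matmul tutorial: consecutive program ids are kept within a group of `group_m` tile rows, so the B-tiles they share are more likely to still be resident in L2:

```python
# Grouped ordering: remap a linear program id to (pid_m, pid_n) so that
# launches proceed group-by-group down the M dimension instead of
# row-by-row across the full N dimension.

def grouped_pid(pid, num_pid_m, num_pid_n, group_m):
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last group may be shorter than group_m.
    group_size_m = min(num_pid_m - first_pid_m, group_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size_m
    pid_n = (pid % num_pid_in_group) // group_size_m
    return pid_m, pid_n
```

The remapping is a bijection over the tile grid, so every output tile is still computed exactly once.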

Roofline Analysis of LLMs on H200: Performance Modeling and Recomputation Strategies

A quantitative Roofline analysis of LLMs on NVIDIA H200. We derive compute-bound thresholds, analyze the 1:10 communication bottleneck, and propose optimal strategies for activation recomputation and operator fusion to maximize hardware efficiency.
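The ridge-point arithmetic behind the analysis fits in a few lines. The H200 numbers below are rounded public spec-sheet values (~989 dense BF16 TFLOP/s, ~4.8 TB/s HBM bandwidth), used here only to illustrate the threshold calculation:

```python
# Roofline ridge point: the arithmetic intensity (FLOPs per byte of HBM
# traffic) above which a kernel is compute-bound rather than memory-bound.

PEAK_FLOPS = 989e12   # dense BF16 FLOP/s, no sparsity
PEAK_BW = 4.8e12      # HBM bytes/s

ridge_point = PEAK_FLOPS / PEAK_BW  # ~206 FLOPs/byte on these numbers

def is_compute_bound(flops: float, bytes_moved: float) -> bool:
    return flops / bytes_moved >= ridge_point
```

Any operator below the ridge point (most elementwise ops and short-sequence attention) is a fusion or recomputation candidate, which is the connection the post develops.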

From DDPM to Flow Matching: The Evolution of Generative Trajectories

A technical note on the shift from noise prediction (DDPM) to velocity prediction (Flow Matching), and how CFG acts as a vector field modifier.
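Both ideas reduce to one-line formulas under the common rectified-flow convention `x_t = (1 - t) * x0 + t * x1` (a sketch of the standard definitions, not code from the post):

```python
# Flow Matching trains the model to predict the straight-line velocity
# x1 - x0; CFG then modifies the predicted vector field by extrapolating
# from the unconditional prediction toward the conditional one.

def fm_target(x0, x1):
    """Velocity target along the straight path from x0 (noise) to x1 (data)."""
    return [b - a for a, b in zip(x0, x1)]

def cfg_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance as a vector-field modifier (w = guidance scale)."""
    return [vu + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]
```

At `w = 1` guidance is a no-op; `w > 1` pushes the field further in the conditional direction, which is the "vector field modifier" view the note develops.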

From DiT to Hunyuan: The Evolution of adaLN-Zero in Generative Models

From DiT to Hunyuan Video, adaLN-Zero remains the gold standard for conditioning. Here’s how this zero-initialized module works and why it persists in the era of Flow Matching.
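The "zero" in adaLN-Zero is the whole trick, and it can be shown with scalars. A dependency-free sketch (the real module is a zero-initialized linear layer producing per-channel scale, shift, and gate from the conditioning embedding):

```python
# adaLN-Zero in miniature: because the conditioning projection starts at
# zero, scale = shift = gate = 0 at initialization, so every residual
# block begins as the identity function and training starts stably.

def adaln_zero_block(x, cond, weight=0.0, bias=0.0,
                     sublayer=lambda h: [2 * v for v in h]):
    # Stand-in for the zero-initialized projection (weight=0, bias=0 at init).
    scale = shift = gate = weight * cond + bias
    h = [(1 + scale) * v + shift for v in x]                   # modulate input
    return [xi + gate * hi for xi, hi in zip(x, sublayer(h))]  # gated residual
```

With the defaults (the initialized state) the block returns `x` unchanged; as `weight` is learned, the gate opens and the sublayer's contribution grows from zero.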

Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers

A rigorous breakdown of FLOPs in Llama-style architectures, deriving the relationship between linear projections and quadratic attention overhead, with insights into sample packing efficiency.
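The headline accounting can be sketched with the standard PaLM-style formula (shown as a back-of-envelope model, not the post's full derivation): training FLOPs per token are roughly `6N` for the linear projections plus `12 * L * d * s` for the quadratic attention term.

```python
# Training FLOPs per token ~ 6*N (linear parts, fwd + bwd) plus
# 12 * n_layers * d_model * seq_len for the QK^T and PV matmuls;
# MFU is achieved throughput over the hardware peak.

def train_flops_per_token(n_params, n_layers, d_model, seq_len):
    return 6 * n_params + 12 * n_layers * d_model * seq_len

def mfu(tokens_per_sec, flops_per_token, peak_flops):
    return tokens_per_sec * flops_per_token / peak_flops
```

Note that the attention term grows linearly in `seq_len` *per token* (quadratic per sequence), which is why sample packing at long context shifts the linear/quadratic balance the post analyzes.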