Loss Reduction in Distributed Training

An analysis of how Data Parallelism (DP) and Context Parallelism (CP) affect loss reduction, and how to maintain mathematical equivalence between distributed and single-device training for LLM and Video DiT models.
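The core equivalence issue the post refers to can be seen with a few lines of arithmetic. Below is a minimal sketch (pure Python, simulating two DP ranks rather than using a real process group; the numbers and variable names are illustrative, not taken from the post): when ranks hold different numbers of valid tokens, averaging per-rank mean losses diverges from the single-device token-weighted mean, while reducing sums and token counts separately restores equivalence.

```python
# Per-rank (sum_of_token_losses, num_valid_tokens). Unequal counts arise
# from padding, sequence packing, or variable-length samples.
ranks = [(12.0, 4), (2.0, 1)]

# Single-device reference: mean over all tokens in the global batch.
global_loss = sum(s for s, _ in ranks) / sum(n for _, n in ranks)  # 14/5 = 2.8

# Naive DP reduction: average of per-rank means — weights each rank
# equally regardless of how many tokens it processed.
naive = sum(s / n for s, n in ranks) / len(ranks)  # (3.0 + 2.0)/2 = 2.5

# Equivalent DP reduction: all-reduce(SUM) the loss sums and the token
# counts separately, then divide once.
correct = sum(s for s, _ in ranks) / sum(n for _, n in ranks)
assert abs(correct - global_loss) < 1e-12
```

In a real PyTorch setup the two `sum(...)` calls would each be a `dist.all_reduce` with `ReduceOp.SUM`; the same argument applies per CP group, where each rank sees only a slice of every sequence.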

Computing Global Gradient Norm in Distributed Training: TP, DP_Shard, DP_Replicate, EP, and PP

A mathematical and engineering guide to calculating the exact global gradient norm across complex hybrid parallel training topologies. This post details the required hierarchical synchronization sequence across TP, DP, EP, and PP process groups to compute the norm without materializing full tensors, preventing double-counting and OOM errors.
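The reduction pattern behind this can be sketched in a few lines. This is a single-process simulation (no `torch.distributed`; the group layout, rank-0 convention, and names are illustrative assumptions, not the post's implementation): each rank computes the squared L2 norm of its local shards, replicated parameters are counted on only one rank to avoid double-counting, partial sums are combined as an all-reduce(SUM) would combine them, and a single square root is taken at the end — so no full gradient tensor is ever materialized.

```python
import math

# Sharded param: each rank owns a disjoint slice of the gradient.
shards = {0: [1.0, 2.0], 1: [3.0, 4.0]}
# Replicated param (e.g. a norm layer replicated across TP ranks):
# identical gradient on every rank.
replicated = [5.0]

# Reference: L2 norm with every unique element counted exactly once.
ref = math.sqrt(sum(g * g for s in shards.values() for g in s)
                + sum(g * g for g in replicated))

# Per-rank partial sums of squares. Replicated params contribute on
# rank 0 only (one common convention for avoiding double-counting).
partials = []
for rank, local in shards.items():
    sq = sum(g * g for g in local)
    if rank == 0:
        sq += sum(g * g for g in replicated)
    partials.append(sq)

# Simulated all_reduce(SUM) across ranks, then one final sqrt.
global_norm = math.sqrt(sum(partials))
assert abs(global_norm - ref) < 1e-12
```

In a hybrid topology the same idea is applied hierarchically: reduce squared partials within TP, then across DP-shard ranks, then across PP stages, with EP-local expert gradients counted only inside their expert group.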