Computing Global Gradient Norm in Distributed Training: TP, DP_Shard, DP_Replicate, EP, and PP

A mathematical and engineering guide to calculating the exact global gradient norm across complex hybrid parallel training topologies. This post details the required hierarchical synchronization sequence across TP, DP, EP, and PP process groups to compute the norm without materializing full tensors, preventing double-counting and OOM errors.