Loss Reduction in Distributed Training

An analysis of how Data Parallelism (DP) and Context Parallelism (CP) affect loss reduction, and how to maintain mathematical equivalence between distributed and single-device training for LLM and Video DiT models.
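As a quick illustration of the core issue, the sketch below (plain Python, hypothetical numbers) shows why reducing the loss across data-parallel ranks must be token-weighted: when ranks hold different numbers of tokens, averaging per-rank mean losses does not match the loss a single device would compute over all tokens, whereas dividing the all-reduced loss sum by the all-reduced token count does.

```python
# Minimal sketch with hypothetical values; real training would use collective ops
# (e.g. all-reduce) across DP ranks instead of Python lists.

# Per-rank sums of token losses and token counts (hypothetical numbers).
rank_loss_sums = [12.0, 30.0]   # sum of per-token losses on each DP rank
rank_token_counts = [4, 12]     # number of (non-padded) tokens on each rank

# Single-device reference: mean over all tokens together.
single_device_loss = sum(rank_loss_sums) / sum(rank_token_counts)   # 42 / 16 = 2.625

# Naive reduction: average the per-rank mean losses (what plain DP gradient
# averaging implicitly does when each rank computes its own mean).
naive_loss = sum(s / n for s, n in zip(rank_loss_sums, rank_token_counts)) / len(rank_loss_sums)
# (12/4 + 30/12) / 2 = (3.0 + 2.5) / 2 = 2.75  != 2.625

# Token-weighted reduction: sum loss and token counts across ranks, then divide.
# This recovers the single-device result exactly.
weighted_loss = sum(rank_loss_sums) / sum(rank_token_counts)

print(single_device_loss, naive_loss, weighted_loss)
```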