Roofline Analysis of LLMs on H200: Performance Modeling and Recomputation Strategies

A quantitative Roofline analysis of LLMs on the NVIDIA H200. We derive compute-bound arithmetic-intensity thresholds, analyze the 1:10 compute-to-communication bottleneck, and propose strategies for activation recomputation and operator fusion that maximize hardware efficiency.
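As a taste of the roofline arithmetic involved, the sketch below computes the H200's ridge point (the arithmetic intensity above which a kernel becomes compute-bound) and classifies a square GEMM. The spec values are assumptions based on commonly quoted H200 SXM figures (~989 dense BF16 TFLOP/s, ~4.8 TB/s HBM3e), not taken from the article.

```python
# Illustrative roofline ridge-point calculation for an H200-class GPU.
# Assumed specs (not from the article): ~989 TFLOP/s dense BF16,
# ~4.8 TB/s HBM3e bandwidth.
PEAK_FLOPS = 989e12   # dense BF16 FLOP/s
PEAK_BW = 4.8e12      # HBM bytes/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs executed per byte of HBM traffic."""
    return flops / bytes_moved

# Above this intensity a kernel is compute-bound; below it, memory-bound.
ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point: {ridge:.0f} FLOPs/byte")

# Example: an (M,K) x (K,N) GEMM in BF16 (2 bytes/element) moves roughly
# 2*(M*K + K*N + M*N) bytes and performs 2*M*K*N FLOPs.
M = N = K = 4096
ai = arithmetic_intensity(2 * M * K * N, 2 * (M * K + K * N + M * N))
print(f"4096^3 GEMM intensity: {ai:.0f} FLOPs/byte "
      f"({'compute' if ai > ridge else 'memory'}-bound)")
```

For a square GEMM the intensity simplifies to M/3 FLOPs per byte, so at M = 4096 the kernel sits well above the ~206 FLOPs/byte ridge, i.e. firmly compute-bound.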

Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers

A rigorous breakdown of FLOPs in Llama-style architectures, deriving the relationship between the parameter-proportional linear projections and the quadratic attention overhead, with insights into sample-packing efficiency.
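The linear-versus-attention split can be sketched with the standard per-token FLOP accounting: the linear projections cost roughly 2 FLOPs per parameter per token, while the QK^T and attention-times-V matmuls add a term that grows linearly with sequence length per token (quadratic per sequence). The configuration below approximates a Llama-2-7B-style model; the numbers are illustrative assumptions, not figures from the article.

```python
# Hedged sketch: forward-pass FLOPs per token for a Llama-style decoder,
# split into the parameter-proportional linear term and the
# sequence-length-dependent attention term.
def flops_per_token(n_params: float, n_layers: int, d_model: int, seq_len: int):
    linear = 2 * n_params  # one multiply-accumulate per parameter ~= 2 FLOPs
    # QK^T and attn @ V each cost ~2 * seq_len * d_model FLOPs
    # per token per layer (summed over heads).
    attention = 4 * n_layers * seq_len * d_model
    return linear, attention

# Assumed config, roughly Llama-2-7B: 6.7B params, 32 layers, d_model 4096.
linear, attn = flops_per_token(n_params=6.7e9, n_layers=32,
                               d_model=4096, seq_len=4096)
print(f"attention share of forward FLOPs: {attn / (linear + attn):.1%}")
```

At a 4096-token context the attention term is a modest fraction of the total, which is why shorter packed samples waste so little compute relative to padding, and why the share climbs quickly as context length grows.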