Beyond Theoretical FLOPs: Analyzing MFU, HFU, and Attention Overhead in Transformers
A rigorous breakdown of FLOPs in Llama-style architectures, deriving the relationship between linear-projection FLOPs and the quadratic overhead of attention, with insights into sample packing efficiency.
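As a preview of the kind of accounting that follows, here is a minimal sketch (not the article's exact derivation) that splits per-layer forward FLOPs into linear-projection terms and the quadratic attention matmuls. It assumes a Llama-style decoder layer with hidden size `d` and MLP width `d_ff`, the standard 2-FLOPs-per-MAC convention, the full s×s score matrix (no causal-mask discount), and no GQA; the function name and the example dimensions are illustrative only.

```python
# Minimal sketch: per-layer forward FLOPs for a Llama-style decoder layer.
# Assumptions (hypothetical, not from the article): 2 FLOPs per
# multiply-accumulate, scores counted over the full s x s matrix,
# all projections d x d (no GQA), SwiGLU MLP with three matrices.

def per_layer_flops(s: int, d: int, d_ff: int) -> dict:
    """Split forward FLOPs into linear (O(s)) and attention (O(s^2)) terms."""
    qkvo = 4 * 2 * s * d * d       # Q, K, V, O projections: four (s,d)@(d,d) matmuls
    mlp = 3 * 2 * s * d * d_ff     # SwiGLU: gate, up, and down projections
    linear = qkvo + mlp            # grows linearly in sequence length s
    attn = 2 * 2 * s * s * d       # QK^T and scores@V: two s^2-sized matmuls
    return {
        "linear": linear,
        "attention": attn,
        "attention_fraction": attn / (linear + attn),
    }

# Example with Llama-2-7B-like dimensions (d=4096, d_ff=11008):
for s in (2048, 8192, 32768):
    f = per_layer_flops(s, 4096, 11008)
    print(f"s={s:>6}: attention share = {f['attention_fraction']:.1%}")
```

Under these assumptions the attention share reduces to 4s / (8d + 6·d_ff + 4s), so it grows with sequence length relative to the fixed hidden dimensions, which is exactly why long contexts and sample packing change the FLOPs picture discussed below.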