How does Flash Attention optimise memory usage?

Flash Attention reduces memory usage during transformer training by recomputing attention weights on the fly instead of storing them. It processes attention in smaller blocks (tiles), keeping intermediate values in fast on-chip SRAM rather than slower GPU high-bandwidth memory, which also yields significant speedups. This matters because the attention matrix grows quadratically with sequence length, making long sequences prohibitively expensive to materialise. As one developer discovered when designing a 2-billion-parameter model, head dimensions that don't align with Flash Attention's requirements can cost around 20% in performance. Understanding these constraints helps us design architectures that actually run efficiently.
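The block-wise recomputation idea can be sketched in plain NumPy. This is a minimal illustration of the online-softmax trick, not the real implementation: actual Flash Attention is a fused GPU kernel, and the function name `blockwise_attention` and the block size here are illustrative. The key point is that only one (n × block) score tile exists at a time, so the full (n × n) attention matrix is never stored.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Tiled attention with an online softmax, in the spirit of
    Flash Attention: scores are computed one key-block at a time,
    so the full attention matrix is never materialised."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax normaliser per row
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # one (n, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)              # rescale old accumulators
        P = np.exp(S - m_new[:, None])         # unnormalised tile softmax
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]
```

Because the running max `m` and normaliser `l` are updated incrementally, each tile's partial results can be folded in and then discarded, which is exactly why memory no longer scales with the square of the sequence length.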