Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Bibliographic Details
Main Authors: Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10589623/
Description
Summary: The computational complexity required to train a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer with long sequences, it is crucial to reduce the number of operations for the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements in the attention score matrix. However, since the critical elements in the attention score matrix vary across model tasks and datasets, considering the critical elements dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures the task- and input-dependent sparsity pattern of the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then transferred from the standard training phase to a sparse training phase, where it is applied according to the skewness and distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations required for multi-head attention, achieving up to $2.84\times$ training speedup, $6.87\times$ memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
ISSN: 2169-3536
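
To make the two-phase idea described in the abstract concrete, below is a minimal, illustrative Python/PyTorch sketch (not the authors' implementation): attention scores are observed for a small number of standard (dense) training steps, their skewness is reported as a rough indicator of how concentrated, and hence sparsifiable, the scores are, and a fixed top-k mask built from the averaged scores is then reused in the subsequent sparse phase. The keep ratio, the skewness measure, the number of warm-up steps, and all function names are assumptions made for illustration only.

import torch

def attention_scores(q, k):
    # Dense scaled dot-product attention scores: (batch, heads, seq, seq).
    d = q.size(-1)
    return torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)

def skewness(scores):
    # Sample skewness of the score distribution per head; higher values mean
    # the probability mass is concentrated on a few elements (more sparsifiable).
    x = scores.flatten(start_dim=-2)                      # (batch, heads, seq*seq)
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True)
    return (((x - mu) / (sigma + 1e-8)) ** 3).mean(dim=-1)

def build_mask(avg_scores, keep_ratio=0.125):
    # Keep only the top-k scores per query row; keep_ratio is an assumed value.
    k = max(1, int(keep_ratio * avg_scores.size(-1)))
    idx = avg_scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(avg_scores, dtype=torch.bool)
    return mask.scatter_(-1, idx, True)                   # (heads, seq, seq)

# Phase 1: a small number of standard (dense) training steps to observe scores.
torch.manual_seed(0)
batch, heads, seq, dim = 2, 4, 128, 64
running = torch.zeros(heads, seq, seq)
warmup_steps = 5
for step in range(warmup_steps):
    q = torch.randn(batch, heads, seq, dim)               # stand-ins for real projections
    k = torch.randn(batch, heads, seq, dim)
    scores = attention_scores(q, k)
    running += scores.mean(dim=0)                         # accumulate over the batch
    print(f"step {step}: mean skewness = {skewness(scores).mean():.2f}")

mask = build_mask(running / warmup_steps)

# Phase 2: sparse training reuses the fixed mask on new attention scores.
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
sparse_scores = attention_scores(q, k).masked_fill(~mask, 0.0)
print("kept fraction of attention elements:", mask.float().mean().item())

In this sketch the mask simply zeroes non-critical attention elements after the dense computation; the speedup and memory reduction reported in the paper would come from skipping those elements entirely with sparse kernels, which is beyond the scope of this illustration.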