Attention (machine learning)

From Wikipedia, the free encyclopedia

In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.

Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.

Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.

History

1950s–1960s Psychology and biology of attention. Cocktail party effect[1] — focusing on content by filtering out background noise. Filter model of attention,[2] partial report paradigm, and saccade control.[3]
1980s Sigma-pi units,[4] higher-order neural networks.[5]
1990s Fast weight controllers and dynamic links between neurons, anticipating key-value mechanisms in attention.[6][7][8][9]
1998 Early computational tools resembling attention were introduced: the bilateral filter in image processing[10] and PageRank in graph theory[11]. Both used pairwise affinity matrices to propagate relevance across elements, anticipating the structure of attention mechanisms.
2005 Non-local means extended affinity-based filtering in image denoising, using Gaussian similarity kernels as fixed attention-like weights.[12]
2014 seq2seq with RNN + Attention.[13] Attention was introduced to enhance RNN encoder-decoder translation, particularly for long sentences. See Overview section.

Attentional Neural Networks introduced a learned feature selection mechanism using top-down cognitive modulation, showing how attention weights can highlight relevant inputs.[14]

2015 Attention was extended to vision for image captioning tasks.[15][16] Meanwhile, Infinite Feature Selection (Inf-FS) introduced a mathematical framework for ranking feature importance via a fully connected pairwise affinity matrix. This matrix models the relationships between input features and is central to the Inf-FS formulation. When the path length is limited to one, the Inf-FS ranking reduces to a form equivalent to the self-attention matrix (without softmax).[17][18][19] Inf-FS thereby anticipated attention’s structural form from a feature selection perspective. Later works such as AFS[20] (AAAI 2019) and Sequential Attention[21] (ICLR 2023) confirmed this link by using attention weights as feature selectors.
2016 Self-attention was integrated into RNN-based models to capture intra-sequence dependencies.[22][23]

Self-attention was explored in decomposable attention models for natural language inference[24] and structured self-attentive sentence embeddings[25].

2017 The Transformer architecture introduced in the research paper Attention is All You Need[26] formalized scaled dot-product self-attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^T / \sqrt{d_k}\right)V$.

Relation Networks[27] and Set Transformers[28] applied attention to unordered sets and relational reasoning, generalizing pairwise interaction models.

2018 Non-local neural networks[29] extended attention to computer vision by capturing long-range dependencies in space and time. Graph Attention Networks[30] applied attention mechanisms to graph-structured data.
2019–2020 Efficient Transformers, including Reformer[31], Linformer[32], and Performer[33], introduced scalable approximations of attention for long sequences.
2021 Hopfield networks were reinterpreted as associative memory-based attention systems[34], and Vision Transformers (ViTs) achieved competitive results in image classification[35].
2022+ Transformers were adopted across scientific domains, including AlphaFold for protein folding[36], CLIP for vision-language pretraining[37], and attention-based dense segmentation models like CCNet[38] and DANet[39].

Additional surveys of the attention mechanism in deep learning are provided by Niu et al.[40] and Soydaner.[41]

The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, T5 and generative pre-trained transformers (GPT).[26]

Overview


The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1. orange) to an Encoder-Decoder.[citation needed]

Fig 1. Encoder-decoder with attention.[42] Numerical subscripts (100, 300, 500, 9k, 10k) indicate vector sizes while lettered subscripts i and i − 1 indicate time steps. Pinkish regions in H matrix and w vector are zero values. See Legend for details.
Legend
Label Description
100 Max. sentence length
300 Embedding size (word dimension)
500 Length of hidden vector
9k, 10k Dictionary size of input & output languages respectively.
x, Y 9k and 10k 1-hot dictionary vectors. x → x implemented as a lookup table rather than vector multiplication. Y is the 1-hot maximizer of the linear Decoder layer D; that is, it takes the argmax of D's linear layer output.
x 300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec.
h 500-long encoder hidden vector. At each point in time, this vector summarizes all the preceding words before it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it.
s 500-long decoder hidden state vector.
E 500 neuron recurrent neural network encoder. 500 outputs. Input count is 800 (300 from source embedding + 500 from recurrent connections). The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly.
D 2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary).[43] The linear layer alone has 5 million (500 × 10k) weights – ~10 times more weights than the recurrent layer.
score 100-long alignment score
w 100-long vector attention weight. These are "soft" weights which change during the forward pass, in contrast to "hard" neuronal weights that change during the learning phase.
A Attention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w.
H 500×100. 100 hidden vectors h concatenated into a matrix
c 500-long context vector = H * w. c is a linear combination of h vectors weighted by w.
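
The following NumPy sketch mirrors the data flow of the attention block A in Fig 1, using the legend's dimensions (100 positions, 500-long hidden vectors); the dot-product scoring is an illustrative choice, since A could equally be a small feed-forward network.

```python
import numpy as np

# Illustrative sketch of the attention block A in Fig 1 (not the exact trained system).
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 100))   # 100 encoder hidden vectors h, one per column
s_prev = rng.normal(size=(500,))  # decoder hidden state s_{i-1}

score = H.T @ s_prev              # 100-long alignment score (dot-product scoring assumed)

w = np.exp(score - score.max())   # softmax turns scores into "soft" weights
w /= w.sum()                      # 100-long attention weight vector, sums to 1

c = H @ w                         # context vector c = H * w, a blend of the hidden vectors
print(c.shape, round(w.sum(), 6)) # (500,) 1.0
```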

Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1.

Figure 2. The diagram shows the attention forward pass calculating correlations of the word "that" with other words in "See that girl run." Given the right weights from training, the network should be able to identify "girl" as a highly correlated word. Some things to note:
  • This example focuses on the attention of a single word "that". In practice, the attention of each word is calculated in parallel to speed up calculations. Simply changing the lowercase "x" vector to the uppercase "X" matrix will yield the formula for this (a short sketch of the batched form follows these notes).
  • Softmax scaling, $qK^T / \sqrt{100}$ with $q = x\,Q_w$ and $K = X\,K_w$, prevents a high variance in $qK^T$ that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
  • Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which runs contrary to the standard math notation of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form $c^T = (X V_w)^T \, \mathrm{softmax}\!\left( (X K_w)\,(x Q_w)^T / \sqrt{100} \right)$.
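
As a rough illustration of the note about batching, the sketch below (with made-up small dimensions and random weights) computes attention for the single word "that" and then for the whole sentence at once by swapping the vector x for the matrix X; the names Q_w, K_w, V_w follow the legend above.

```python
import numpy as np

# Hedged sketch: single-query vs. whole-sentence scaled dot-product attention.
rng = np.random.default_rng(1)
d_model, d_k, n = 300, 100, 4          # embedding size, key size, sentence length
X = rng.normal(size=(n, d_model))      # embeddings of "See that girl run."
Q_w, K_w, V_w = (rng.normal(size=(d_model, d_k)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Single word: q is the query for "that" (row 1 of X).
q = X[1] @ Q_w
K, V = X @ K_w, X @ V_w
c_single = softmax(q @ K.T / np.sqrt(d_k)) @ V          # one context vector

# Whole sentence: replace the vector q by the matrix X @ Q_w.
C = softmax((X @ Q_w) @ K.T / np.sqrt(d_k), axis=-1) @ V
assert np.allclose(C[1], c_single)                      # row for "that" matches the single-word case
```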

Interpreting attention weights


In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced.

Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.

In the I love you example, the second word love is aligned with the third word aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix:

I love you
je 0.94 0.02 0.04
t' 0.11 0.01 0.88
aime 0.03 0.95 0.02

Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.
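
A small numerical sketch of this point, reusing the soft weights for the decoder step that produces t' and some made-up 3-dimensional encoder hidden vectors (all values are hypothetical):

```python
import numpy as np

# "Soft" vs. "hard" attention over the I-love-you hidden vectors (hypothetical numbers).
H = np.array([[0.1, 0.3, 0.5],    # hidden vector for "I"
              [0.7, 0.2, 0.9],    # hidden vector for "love"
              [0.4, 0.8, 0.6]])   # hidden vector for "you"
w_soft = np.array([0.11, 0.01, 0.88])    # soft weights at the decoder step for t'

c_soft = w_soft @ H                      # weighted sum over all hidden vectors
c_hard = H[np.argmax(w_soft)]            # hard attention: keep only "the best one"
print(c_soft)                            # mostly "you", with a little "I" mixed in
print(c_hard)                            # exactly the "you" vector, context discarded
```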

Variants

Comparison of the data flow in CNN, RNN, and self-attention

Many variants of attention implement soft weights, such as

  • fast weight programmers, or fast weight controllers (1992).[6] A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention"[44] (see the sketch immediately after this list).
  • Bahdanau-style attention,[13] also referred to as additive attention,
  • Luong-style attention,[45] which is known as multiplicative attention,
  • Early attention mechanisms similar to modern self-attention were proposed using recurrent neural networks. However, the highly parallelizable self-attention was introduced in 2017 and successfully used in the Transformer model,
  • positional attention and factorized positional attention.[46]
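
The sketch below illustrates the fast-weight / linearized-self-attention view from the first item: keys and values are summarized into an outer-product "fast" weight matrix that the query then reads out. The feature map and sizes are assumptions for the demo, not the original 1992 formulation.

```python
import numpy as np

# Hedged sketch of linearized (fast-weight) self-attention.
rng = np.random.default_rng(2)
n, d = 5, 8
K = rng.normal(size=(n, d))                 # keys
V = rng.normal(size=(n, d))                 # values
q = rng.normal(size=(d,))                   # one query

phi = lambda z: np.maximum(z, 0.0) + 1.0    # simple positive feature map (assumption)

W_fast = sum(np.outer(V[j], phi(K[j])) for j in range(n))   # "slow" net writes fast weights
z_norm = sum(phi(K[j]) for j in range(n))                   # running normalizer
out = (W_fast @ phi(q)) / (z_norm @ phi(q))                 # query reads the fast weights
print(out.shape)                                            # (8,)
```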

For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention,[47] channel attention,[48] or combinations.[49][50]
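
For the channel-attention case, a minimal sketch in the squeeze-and-excitation style is shown below; the reduction ratio, weight shapes, and random inputs are illustrative assumptions rather than any particular published configuration.

```python
import numpy as np

# Hedged sketch of channel attention over a CNN feature map (channels, height, width).
rng = np.random.default_rng(3)
C, H_, W_ = 16, 8, 8
feat = rng.normal(size=(C, H_, W_))

r = 4                                        # channel reduction ratio (assumption)
W1 = rng.normal(size=(C, C // r))
W2 = rng.normal(size=(C // r, C))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

s = feat.mean(axis=(1, 2))                   # "squeeze": global average pool per channel
a = sigmoid(np.maximum(s @ W1, 0.0) @ W2)    # "excitation": per-channel weights in (0, 1)
out = feat * a[:, None, None]                # re-weight each channel of the feature map
print(out.shape)                             # (16, 8, 8)
```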

These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in the Overview section above.

1. Encoder-decoder dot product: Both encoder & decoder are needed to calculate attention.[45]
2. Encoder-decoder QKV: Both encoder & decoder are needed to calculate attention.[51]
3. Encoder-only dot product: Decoder is not used to calculate attention. With only 1 input into corr, W is an auto-correlation of dot products. wij = xi xj.[52]
4. Encoder-only QKV: Decoder is not used to calculate attention.[53]
5. Pytorch tutorial: A fully-connected layer is used to calculate attention instead of dot product correlation.[54]
Legend
Label Description
Variables X, H, S, T Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column.
S, T S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2.
X, H H, encoder hidden state; X, input word embeddings.
W Attention coefficients
Qw, Kw, Vw, FC Weight matrices for query, key, value respectively. FC is a fully-connected weight matrix.
⊕, ⊗ ⊕, vector concatenation; ⊗, matrix multiplication.
corr Column-wise softmax(matrix of all combinations of dot products). The dot products are xi * xj in variant #3, hi * sj in variant 1, column i ( Kw * H ) * column j ( Qw * S ) in variant 2, and column i ( Kw * X ) * column j ( Qw * X ) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, then the dot products are normalized by √d, where d is the height of the QKV matrices.

Optimizations


Flash attention


The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.[55]
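
The block-wise idea can be sketched in NumPy as below: keys and values are processed in tiles while running softmax statistics are carried along, so the full attention matrix is never materialized. This is only a schematic of the algorithmic idea; the real FlashAttention kernels operate on GPU on-chip memory with fused operations.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Schematic tiled attention with an online softmax (illustrative only)."""
    d = Q.shape[-1]
    out = np.zeros((Q.shape[0], V.shape[-1]))
    m = np.full(Q.shape[0], -np.inf)                  # running max of scores per query
    l = np.zeros(Q.shape[0])                          # running softmax normalizer per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                     # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])                # tile's softmax numerator
        scale = np.exp(m - m_new)                     # rescale previously accumulated results
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(4)
Q, K, V = (rng.normal(size=(128, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), reference)   # matches ordinary attention
```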

FlexAttention


FlexAttention[56] is an attention kernel developed by Meta that allows users to modify attention scores prior to softmax and dynamically chooses the optimal attention algorithm.
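
The idea of user-defined score modification can be sketched generically as below; this is plain NumPy for illustration and deliberately does not reproduce the actual flex_attention API, whose hook signature and compilation behaviour differ.

```python
import numpy as np

# Generic "modify scores before softmax" sketch (not the PyTorch FlexAttention API).
def attention_with_score_mod(Q, K, V, score_mod):
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    qi, ki = np.indices(S.shape)
    S = score_mod(S, qi, ki)                          # user hook on the raw scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

# Example hook: causal mask plus a hypothetical relative-position penalty.
causal_with_bias = lambda s, qi, ki: np.where(ki <= qi, s - 0.1 * np.abs(qi - ki), -np.inf)

rng = np.random.default_rng(5)
Q, K, V = (rng.normal(size=(6, 16)) for _ in range(3))
print(attention_with_score_mod(Q, K, V, causal_with_bias).shape)   # (6, 16)
```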

Applications


Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves context understanding in tasks like question answering and summarization. In vision, visual attention helps models focus on relevant image regions, enhancing object detection and image captioning.

Attention maps as explanations for vision transformers


From the original paper on vision transformers (ViT), visualizing attention scores as a heat map (called saliency maps or attention maps) has become an important and routine way to inspect the decision making process of ViT models.[57] One can compute the attention maps with respect to any attention head at any layer, while the deeper layers tend to show more semantically meaningful visualization. Attention rollout is a recursive algorithm to combine attention scores across all layers, by computing the dot product of successive attention maps.[58]
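
A compact sketch of attention rollout is given below; adding the identity matrix to model residual connections follows the common convention from the rollout paper, and the layer attention matrices here are random stand-ins for real ViT attention maps.

```python
import numpy as np

# Hedged sketch of attention rollout across layers.
def attention_rollout(attn_per_layer):
    n = attn_per_layer[0].shape[-1]
    rollout = np.eye(n)
    for A in attn_per_layer:                 # A: (n_tokens, n_tokens), rows sum to 1
        A_res = 0.5 * (A + np.eye(n))        # mix attention with the residual path
        A_res /= A_res.sum(axis=-1, keepdims=True)
        rollout = A_res @ rollout            # propagate attention through successive layers
    return rollout

rng = np.random.default_rng(6)
layers = []
for _ in range(4):                           # four fake attention layers over 10 tokens
    logits = rng.normal(size=(10, 10))
    P = np.exp(logits)
    layers.append(P / P.sum(axis=-1, keepdims=True))
print(attention_rollout(layers)[0])          # how strongly token 0 (e.g. [CLS]) draws on each token
```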

Because vision transformers are typically trained in a self-supervised manner, attention maps are generally not class-sensitive. When a classification head is attached to the ViT backbone, class-discriminative attention maps (CDAM) combine attention maps with gradients with respect to the class ([CLS]) token.[59] Some class-sensitive interpretability methods originally developed for convolutional neural networks can also be applied to ViTs, such as GradCAM, which back-propagates the gradients to the outputs of the final attention layer.[60]

Using attention as the basis of explanation for transformers in language and vision is not without debate. While some pioneering papers analyzed and framed attention scores as explanations,[61][62] higher attention scores do not always correlate with greater impact on model predictions.[63]

Mathematical representation


Standard scaled dot-product attention


For matrices $Q \in \mathbb{R}^{m \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, the scaled dot-product, or QKV attention, is defined as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \in \mathbb{R}^{m \times d_v}$

where $T$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $Q$ contains $m$ queries, while matrices $K, V$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $V$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_v$ output matrix are confined to the convex hull of the points in $\mathbb{R}^{d_v}$ given by the rows of $V$.
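
A direct NumPy transcription of this definition, with arbitrary small shapes for illustration:

```python
import numpy as np

# Scaled dot-product (QKV) attention as defined above.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (m, n) query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # rows are convex combinations of V's rows

rng = np.random.default_rng(7)
Q = rng.normal(size=(3, 8))     # m = 3 queries, d_k = 8
K = rng.normal(size=(5, 8))     # n = 5 keys
V = rng.normal(size=(5, 4))     # n = 5 values, d_v = 4
print(attention(Q, K, V).shape) # (3, 4)
```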

To understand the permutation invariance and permutation equivariance properties of QKV attention,[64] let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$ be permutation matrices; and $D \in \mathbb{R}^{m \times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:

$\mathrm{softmax}(A D B) = A\,\mathrm{softmax}(D)\,B$

By noting that the transpose of a permutation matrix is also its inverse, it follows that:

$\mathrm{Attention}(A Q,\, B K,\, B V) = A\,\mathrm{Attention}(Q, K, V)$

which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $Q$); and invariant to re-ordering of the key-value pairs in $K, V$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:

$X \mapsto \mathrm{Attention}(X T_q,\, X T_k,\, X T_v)$

is permutation equivariant with respect to re-ordering the rows of the input matrix $X$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
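
These properties can be spot-checked numerically; the sketch below repeats the attention() helper from the previous block so it runs on its own, and uses random permutation matrices.

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(8)
m, n, d = 4, 6, 8
Q, K, V = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
A = np.eye(m)[rng.permutation(m)]    # permutes the queries
B = np.eye(n)[rng.permutation(n)]    # permutes the key-value pairs

# Equivariant in the queries, invariant in the key-value pairs:
assert np.allclose(attention(A @ Q, B @ K, B @ V), A @ attention(Q, K, V))
```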

Masked attention


When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have $n$ rows, a masked attention variant is used:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}} + M\right) V$

where the mask, $M \in \mathbb{R}^{n \times n}$, is a strictly upper triangular matrix, with zeros on and below the diagonal and $-\infty$ in every element above the diagonal. The softmax output, also in $\mathbb{R}^{n \times n}$, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all $1 \le i < j \le n$, row $i$ of the attention output is independent of row $j$ of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
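
A sketch of the masked variant, with the mask built exactly as described (zeros on and below the diagonal, $-\infty$ above it), plus a check that later tokens cannot influence earlier outputs:

```python
import numpy as np

# Causally masked QKV attention (illustrative shapes).
def masked_attention(Q, K, V):
    n, d_k = Q.shape
    M = np.triu(np.full((n, n), -np.inf), k=1)          # strictly upper triangular mask
    scores = Q @ K.T / np.sqrt(d_k) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # lower-triangular softmax output
    return weights @ V

rng = np.random.default_rng(9)
X = rng.normal(size=(5, 8))
out = masked_attention(X, X, X)

X2 = X.copy()
X2[4] += 1.0                                            # perturb only the last token
assert np.allclose(masked_attention(X2, X2, X2)[:4], out[:4])   # earlier rows unchanged
```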

Multi-head attention

Decoder multiheaded cross-attention

Multi-head attention is defined as:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$

where each head is computed with QKV attention as:

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$

and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter matrices.

The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices $A$, $B$:

$\mathrm{MultiHead}(A Q,\, B K,\, B V) = A\,\mathrm{MultiHead}(Q, K, V)$

from which we also see that multi-head self-attention:

$X \mapsto \mathrm{MultiHead}(X T_q,\, X T_k,\, X T_v)$

is equivariant with respect to re-ordering of the rows of input matrix $X$.
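
A sketch of these formulas with random parameter matrices; the number of heads and dimensions are arbitrary demo values:

```python
import numpy as np

# Multi-head QKV attention: one (Wq, Wk, Wv) triple per head, then concatenate and project.
def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        s = (Q @ Wq) @ (K @ Wk).T / np.sqrt(Wq.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        heads.append((w / w.sum(axis=-1, keepdims=True)) @ (V @ Wv))
    return np.concatenate(heads, axis=-1) @ WO          # Concat(head_1, ..., head_h) W^O

rng = np.random.default_rng(10)
h, d_model, d_head, m, n = 4, 32, 8, 3, 5
Q = rng.normal(size=(m, d_model))
K, V = rng.normal(size=(n, d_model)), rng.normal(size=(n, d_model))
WQ, WK, WV = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
WO = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(Q, K, V, WQ, WK, WV, WO).shape)   # (3, 32)
```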

Bahdanau (additive) attention


In Bahdanau-style attention, the alignment score between a query $q$ and a key $k_j$ is computed by a small feed-forward network,

$e_j = v_a^T \tanh(W_Q\, q + W_K\, k_j),$

and the attention output is the softmax of these scores applied to the value vectors, where $W_Q$ and $W_K$ are learnable weight matrices (and $v_a$ a learnable vector).[13]
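
A minimal sketch of this additive scoring for a single decoder query, with illustrative sizes:

```python
import numpy as np

# Bahdanau-style additive attention for one query (illustrative sketch).
rng = np.random.default_rng(11)
d_q, d_k, d_a, n = 16, 16, 32, 6
q = rng.normal(size=(d_q,))            # decoder query (e.g. the previous decoder state)
K = rng.normal(size=(n, d_k))          # encoder hidden states used as keys
V = K                                  # values: often the same encoder states

W_Q = rng.normal(size=(d_q, d_a))
W_K = rng.normal(size=(d_k, d_a))
v_a = rng.normal(size=(d_a,))

e = np.tanh(q @ W_Q + K @ W_K) @ v_a   # additive alignment scores, one per source position
w = np.exp(e - e.max()); w /= w.sum()  # softmax over source positions
context = w @ V                        # weighted sum of values
print(context.shape)                   # (16,)
```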

Luong attention (general)


The Luong "general" (multiplicative) attention is

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(Q\, W_a\, K^T\right) V$

where $W_a$ is a learnable weight matrix.[45]
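
The corresponding sketch for the multiplicative form, again with arbitrary demo shapes:

```python
import numpy as np

# Luong "general" attention: a learned bilinear score between queries and keys.
rng = np.random.default_rng(12)
m, n, d = 3, 5, 16
Q, K, V = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
W_a = rng.normal(size=(d, d))          # learnable weight matrix

scores = Q @ W_a @ K.T                 # (m, n) bilinear scores
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)     # softmax over the keys
print((w @ V).shape)                   # (3, 16)
```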

Self-attention


Self-attention is essentially the same as cross-attention, except that the query, key, and value vectors are all derived from the same sequence rather than from separate encoder and decoder states. Both the encoder and the decoder can use self-attention, but with subtle differences.

For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_0, h_1, \ldots$. These can then be applied to a dot-product attention mechanism, to obtain

$h_0' = \mathrm{Attention}(h_0 W^Q,\, H W^K,\, H W^V)$
$h_1' = \mathrm{Attention}(h_1 W^Q,\, H W^K,\, H W^V)$
$\cdots$

or more succinctly, $H' = \mathrm{Attention}(H W^Q,\, H W^K,\, H W^V)$. This can be applied repeatedly, to obtain a multilayered encoder. This is the "encoder self-attention", sometimes called the "all-to-all attention", as the vector at every position can attend to every other.
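
A sketch of stacking such layers on top of an embedding sequence H; the layer count, sizes, and the omission of residual connections and feed-forward blocks are simplifications:

```python
import numpy as np

# Repeatedly applying all-to-all self-attention to build a multilayered encoder (sketch).
def self_attention_layer(H, Wq, Wk, Wv):
    s = (H @ Wq) @ (H @ Wk).T / np.sqrt(Wq.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ (H @ Wv)   # every position attends to every other

rng = np.random.default_rng(13)
n, d = 6, 16
H = rng.normal(size=(n, d))                  # embedding-layer output, one row per word
for _ in range(3):                           # three stacked self-attention layers
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    H = self_attention_layer(H, Wq, Wk, Wv)
print(H.shape)                               # (6, 16)
```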

Masking

Decoder self-attention with causal masking, detailed diagram

For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process, the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij} = 0$ for all $i < j$, called "causal masking". This attention mechanism is the "causally masked self-attention".


References

  1. ^ Cherry, E. Colin (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears". The Journal of the Acoustical Society of America. 25 (5): 975–979. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3.
  2. ^ Broadbent, Donald E. (1958). Perception and Communication. Pergamon Press.
  3. ^ Kowler, Eileen (1995). "The control of saccadic eye movements". Reviews of Oculomotor Research. 5: 1–70.
  4. ^ Rumelhart, David E.; Hinton, G. E.; Mcclelland, James L. (1987-07-29). "A General Framework for Parallel Distributed Processing" (PDF). In Rumelhart, David E.; Hinton, G. E.; PDP Research Group (eds.). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. Cambridge, Massachusetts: MIT Press. ISBN 978-0-262-68053-0.
  5. ^ Giles, C. Lee (1988). "Learning and synthesizing time series by the back propagation algorithm". IEEE Transactions on Acoustics, Speech, and Signal Processing. 36 (6): 939–945. doi:10.1109/29.1647.
  6. ^ a b Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
  7. ^ von der Malsburg, Christoph (1981). "The correlation theory of brain function". Internal Report 81–2, Max-Planck-Institute for Biophysical Chemistry.
  8. ^ Feldman, Jerome A. (1982). "Dynamic connections in neural networks". Biological Cybernetics. 46 (1): 27–39. doi:10.1007/BF00335349. PMID 6307398.
  9. ^ Hinton, Geoffrey E. (1989). "Connectionist learning procedures". Artificial Intelligence. 40 (1–3): 185–234. doi:10.1016/0004-3702(89)90049-0.
  10. ^ Tomasi, Carlo (1998). Bilateral filtering for gray and color images. ICCV.
  11. ^ Page, Larry (1998). The PageRank Citation Ranking: Bringing Order to the Web (Technical report). Stanford InfoLab.
  12. ^ Buades, Antoni (2005). A non-local algorithm for image denoising. CVPR.
  13. ^ a b c Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  14. ^ Wang, Qian (2014). Attentional Neural Network: Feature Selection Using Cognitive Feedback. NeurIPS.
  15. ^ Xu, Kelvin; Ba, Jimmy; Kiros, Ryan (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044.
  16. ^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator". 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3156–3164. doi:10.1109/CVPR.2015.7298935. ISBN 978-1-4673-6964-0.
  17. ^ Roffo, Giorgio (2015). Infinite Feature Selection. ICCV.
  18. ^ Roffo, Giorgio (2017). Infinite Latent Feature Selection. ICCV.
  19. ^ Roffo, Giorgio (2020). "Infinite Feature Selection". IEEE Transactions on Pattern Analysis and Machine Intelligence.
  20. ^ Gui, Ning (2019). AFS: An Attention-based Mechanism for Supervised Feature Selection. AAAI.
  21. ^ Anonymous (2023). Sequential Attention for Feature Selection. ICLR.
  22. ^ Cheng, Jianpeng (2016). "Long Short-Term Memory-Networks for Machine Reading". arXiv:1601.06733 [cs.CL].
  23. ^ Paulus, Romain (2017). "A Deep Reinforced Model for Abstractive Summarization". arXiv:1705.04304 [cs.CL].
  24. ^ Parikh, Ankur (2016). A Decomposable Attention Model for Natural Language Inference. EMNLP.
  25. ^ Lin, Zhouhan (2017). A Structured Self-Attentive Sentence Embedding. ICLR.
  26. ^ a b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017). "Attention is All You Need". arXiv:1706.03762 [cs.CL].
  27. ^ Santoro, Adam (2017). Relation Networks for Relational Reasoning. ICLR.
  28. ^ Lee, Juho (2019). Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. ICML.
  29. ^ Wang, Xiaolong (2018). Non-Local Neural Networks. CVPR.
  30. ^ Veličković, Petar (2018). Graph Attention Networks. ICLR.
  31. ^ Kitaev, Nikita (2020). Reformer: The Efficient Transformer. ICLR.
  32. ^ Wang, Sinong (2020). Linformer: Self-Attention with Linear Complexity. ICLR.
  33. ^ Choromanski, Krzysztof (2020). Rethinking Attention with Performers. ICLR.
  34. ^ Ramsauer, Hubert (2021). Hopfield Networks is All You Need. NeurIPS.
  35. ^ Dosovitskiy, Alexey (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR.
  36. ^ Jumper, John (2021). "Highly accurate protein structure prediction with AlphaFold". Nature.
  37. ^ Radford, Alec (2021). Learning Transferable Visual Models from Natural Language Supervision. ICML.
  38. ^ Huang, Xiangyu (2019). CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV.
  39. ^ Fu, Jing (2019). Dual Attention Network for Scene Segmentation. CVPR.
  40. ^ Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
  41. ^ Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. arXiv:2204.13154. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
  42. ^ Britz, Denny; Goldie, Anna; Luong, Minh-Thang; Le, Quoc (2017-03-21). "Massive Exploration of Neural Machine Translation Architectures". arXiv:1703.03906 [cs.CL].
  43. ^ "Pytorch.org seq2seq tutorial". Retrieved December 2, 2021.
  44. ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  45. ^ a b c Luong, Minh-Thang (2015-09-20). "Effective Approaches to Attention-Based Neural Machine Translation". arXiv:1508.04025v5 [cs.CL].
  46. ^ "Learning Positional Attention for Sequential Recommendation". catalyzex.com.
  47. ^ Zhu, Xizhou; Cheng, Dazhi; Zhang, Zheng; Lin, Stephen; Dai, Jifeng (2019). "An Empirical Study of Spatial Attention Mechanisms in Deep Networks". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6687–6696. arXiv:1904.05873. doi:10.1109/ICCV.2019.00679. ISBN 978-1-7281-4803-8. S2CID 118673006.
  48. ^ Hu, Jie; Shen, Li; Sun, Gang (2018). "Squeeze-and-Excitation Networks". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7132–7141. arXiv:1709.01507. doi:10.1109/CVPR.2018.00745. ISBN 978-1-5386-6420-9. S2CID 206597034.
  49. ^ Woo, Sanghyun; Park, Jongchan; Lee, Joon-Young; Kweon, In So (2018-07-18). "CBAM: Convolutional Block Attention Module". arXiv:1807.06521 [cs.CV].
  50. ^ Georgescu, Mariana-Iuliana; Ionescu, Radu Tudor; Miron, Andreea-Iuliana; Savencu, Olivian; Ristea, Nicolae-Catalin; Verga, Nicolae; Khan, Fahad Shahbaz (2022-10-12). "Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution". arXiv:2204.04218 [eess.IV].
  51. ^ Neil Rhodes (2021). CS 152 NN—27: Attention: Keys, Queries, & Values. Event occurs at 06:30. Retrieved 2021-12-22.
  52. ^ Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 05:30. Retrieved 2021-12-22.
  53. ^ Alfredo Canziani & Yann Lecun (2021). NYU Deep Learning course, Spring 2020. Event occurs at 20:15. Retrieved 2021-12-22.
  54. ^ Robertson, Sean. "NLP From Scratch: Translation With a Sequence To Sequence Network and Attention". pytorch.org. Retrieved 2021-12-22.
  55. ^ Mittal, Aayush (2024-07-17). "Flash Attention: Revolutionizing Transformer Efficiency". Unite.AI. Retrieved 2024-11-16.
  56. ^ "FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention – PyTorch".
  57. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg (2021-06-03), An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929, retrieved 2025-07-21
  58. ^ Abnar, Samira; Zuidema, Willem (2020-05-31), Quantifying Attention Flow in Transformers, arXiv:2005.00928, retrieved 2025-07-21
  59. ^ Brocki, Lennart; Binda, Jakub; Chung, Neo Christopher (2024-10-25), Class-Discriminative Attention Maps for Vision Transformers, arXiv:2312.02364, retrieved 2025-07-21
  60. ^ Gildenblat, Jacob (2025-07-21), jacobgil/pytorch-grad-cam, retrieved 2025-07-21
  61. ^ Mullenbach, James; Wiegreffe, Sarah; Duke, Jon; Sun, Jimeng; Eisenstein, Jacob (2018-04-16), Explainable Prediction of Medical Codes from Clinical Text, arXiv:1802.05695, retrieved 2025-07-21
  62. ^ Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19), Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473
  63. ^ Serrano, Sofia; Smith, Noah A. (2019-06-09), Is Attention Interpretable?, arXiv:1906.03731, retrieved 2025-07-21
  64. ^ Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam R; Choi, Seungjin; Teh, Yee Whye (2018). "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". arXiv:1810.00825 [cs.LG].