Re-attention

Re-attention addresses attention collapse: in deeper Vision Transformers the attention maps grow increasingly similar across blocks, so stacking more layers stops improving the features and training hits a plateau. Instead of using each head's attention map directly, re-attention mixes the maps across heads with a learnable matrix before weighting the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

$$\text{Re-Attention}(Q, K, V) = \text{Norm}\left(\Theta^{T}\,\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\right)V$$

where $\Theta \in \mathbb{R}^{H \times H}$ is learnable, acts along the head dimension ($H$ heads), and $\text{Norm}$ normalizes the mixed attention maps.
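A minimal PyTorch sketch of the idea. The module name, the identity initialization of $\Theta$, and the choice of BatchNorm over the head dimension for $\text{Norm}$ are assumptions for illustration, not taken from the note:

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Sketch of re-attention: a learnable H x H matrix Theta mixes the
    per-head attention maps, which are then normalized and used to weight V."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5          # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)
        # Theta initialized at the identity so training starts near vanilla
        # attention (initialization choice is an assumption).
        self.theta = nn.Parameter(torch.eye(num_heads))
        # Stand-in for Norm(.): BatchNorm treating heads as channels (assumption).
        self.norm = nn.BatchNorm2d(num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        H = self.num_heads
        # Split into per-head Q, K, V, each of shape (B, H, N, d).
        qkv = self.qkv(x).reshape(B, N, 3, H, C // H).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Standard scaled dot-product attention maps, shape (B, H, N, N).
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        # Head mixing: out_h = sum_g Theta[g, h] * attn_g, i.e. Theta^T
        # applied along the head dimension.
        attn = torch.einsum("gh,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)
        # Weight the values and merge heads back into one embedding.
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, `ReAttention(dim=384, num_heads=6)` drops in where a standard multi-head self-attention block would sit; with $\Theta$ at the identity, the block behaves like vanilla attention up to the normalization.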

paper: *DeepViT: Towards Deeper Vision Transformer* (Zhou et al., 2021)
