Re-attention

Re-attention addresses attention collapse: in deeper Vision Transformers the attention maps grow increasingly similar across blocks, so stacking more layers stops improving the features and training hits a plateau. Instead of using each head's attention map directly, re-attention mixes the maps across heads with a learnable matrix before weighting the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

$$\text{Re-Attention}(Q, K, V) = \text{Norm}\left(\Theta^{T}\,\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\right)V$$

where $\Theta \in \mathbb{R}^{H \times H}$ is learnable, acts along the head dimension ($H$ heads), and $\text{Norm}$ normalizes the mixed attention maps.
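A minimal PyTorch sketch of the idea. The module name, the identity initialization of $\Theta$, and the choice of BatchNorm over the head dimension for $\text{Norm}$ are assumptions for illustration, not taken from the note:

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Sketch of re-attention: a learnable H x H matrix Theta mixes the
    per-head attention maps, which are then normalized and used to weight V."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5          # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)
        # Theta initialized at the identity so training starts near vanilla
        # attention (initialization choice is an assumption).
        self.theta = nn.Parameter(torch.eye(num_heads))
        # Stand-in for Norm(.): BatchNorm treating heads as channels (assumption).
        self.norm = nn.BatchNorm2d(num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        H = self.num_heads
        # Split into per-head Q, K, V, each of shape (B, H, N, d).
        qkv = self.qkv(x).reshape(B, N, 3, H, C // H).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Standard scaled dot-product attention maps, shape (B, H, N, N).
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        # Head mixing: out_h = sum_g Theta[g, h] * attn_g, i.e. Theta^T
        # applied along the head dimension.
        attn = torch.einsum("gh,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)
        # Weight the values and merge heads back into one embedding.
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, `ReAttention(dim=384, num_heads=6)` drops in where a standard multi-head self-attention block would sit; with $\Theta$ at the identity, the block behaves like vanilla attention up to the normalization.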

paper: *DeepViT: Towards Deeper Vision Transformer* (Zhou et al., 2021)
