Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

Query -> input × W_Q (a learned projection of the input embedding)

Key and Value -> input × W_K and input × W_V (learned projections of the same input)

softmax -> produces the weight placed on each Value (the Query is compared against every Key: does this input contain what the Query is looking for?)
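
A minimal NumPy sketch of the formula above (the projection matrices W_q, W_k, W_v stand in for the learned embeddings and are random here, just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted sum of the values

# Toy input: 3 tokens, model dim 4. W_q, W_k, W_v are illustrative
# random stand-ins for the learned projections mentioned above.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (3, 4): one output vector per token
```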

Attention model

Why softmax and not ReLU? ReLU would zero out every negative score and leave the positive ones unnormalized; softmax normalizes the attention scores into a distribution, so each query's weights sum to 1.
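
A quick numeric check of the note above: ReLU just zeros the negative scores and leaves the rest unnormalized, while softmax turns each query's scores into a distribution over the keys:

```python
import numpy as np

scores = np.array([2.0, 1.0, -1.0])  # raw attention scores for one query

relu = np.maximum(scores, 0.0)       # [2., 1., 0.]: zeros the negative, no normalization
soft = np.exp(scores) / np.exp(scores).sum()
print(soft, soft.sum())              # ~[0.71 0.26 0.04], sums to 1.0
```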

Attention paper

ResNet residual: the Transformer wraps each sublayer (attention, feed-forward) in a residual (skip) connection, as in ResNet.
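
A minimal sketch of the residual connection this note refers to (the sublayer here is a stand-in for attention):

```python
import numpy as np

def sublayer(x):
    # Stand-in for self-attention (or any learned transform).
    return 0.1 * x

def residual_block(x, f):
    # Residual (skip) connection: output = x + f(x). The sublayer only
    # has to learn a correction to the identity, and gradients flow
    # straight through the skip path in deep stacks.
    return x + f(x)

x = np.ones((3, 4))
print(residual_block(x, sublayer))  # every entry is 1.1: input preserved plus correction
```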

Example query/key attention pattern (1 = attend, 0 = masked):

|    | Q1 | Q2 | Q3 |
| -- | -- | -- | -- |
| K1 | 1  | 0  | 0  |
| K2 | 1  | 1  | 0  |
| K3 | 1  | 1  | 1  |
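
A sketch of how a 0/1 pattern like the table is applied as a mask (assuming the usual convention of queries on rows and keys on columns; blocked pairs are set to -inf before the softmax so they receive exactly zero weight):

```python
import numpy as np

# scores[i, j]: raw score of query i against key j (e.g. QK^T / sqrt(d))
scores = np.random.default_rng(0).normal(size=(3, 3))

# Lower-triangular 0/1 mask with the same pattern as the table above
# (1 = attend, 0 = masked out).
mask = np.tril(np.ones((3, 3)))

masked = np.where(mask == 1, scores, -np.inf)   # blocked pairs -> -inf
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)     # softmax rows sum to 1
print(weights.round(2))                         # zero exactly where mask is 0
```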

Self-attention knowledge tracing paper

Overview video
