Attention

Attention(Q, K, V) = softmax(\frac{QK^{T}}{\sqrt{d}})\cdot V

Query -> input embedding Ɨ query weight matrix (a learned projection of the input)

Key and Value -> input embedding Ɨ key/value weight matrices (separate learned projections of the same input)

softmax -> weights over the Values (each Query is compared against each Key to ask: does this input contain what the Query is looking for?)

Why softmax rather than ReLU? ReLU would zero out many scores and leave the rest unnormalized; softmax normalizes the attention scores into positive weights that sum to 1.
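A minimal NumPy sketch of the formula above (the helper names and matrix shapes are illustrative assumptions; d is the key dimension used in the 1/√d scaling):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability, then normalize so each row sums to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # how strongly each query matches each key
    weights = softmax(scores, axis=-1)  # normalized weights over the keys/values
    return weights @ V                  # weighted sum of the values

# Toy example: 3 queries and 3 keys/values with dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4)
```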

Attention paper

For self-attention knowledge tracing (SAKT): an attention block plus a ResNet-style residual connection (output + x); see the sketch after the mask table below.

The triangular mask applied to the attention weights (1 = position may be attended, 0 = masked out):

|    | Q1 | Q2 | Q3 |
|----|----|----|----|
| K1 | 1  | 0  | 0  |
| K2 | 1  | 1  | 0  |
| K3 | 1  | 1  | 1  |
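A hedged sketch of how such a triangular mask could be combined with the residual connection (output + x) in an SAKT-style block; the projection matrices `W_q`, `W_k`, `W_v`, the function names, and the exact mask orientation are assumptions for illustration, not the paper's code:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    # mask[i, j] = 1 where query i may attend to key j, 0 where it may not.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask == 1, scores, -1e9)  # masked positions get ~zero weight after softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sakt_block(x, W_q, W_k, W_v):
    # Project the input sequence into queries, keys and values, attend with a
    # triangular mask, then add the residual connection (output + x).
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    n = x.shape[0]
    mask = np.tril(np.ones((n, n)))  # 1s on and below the diagonal, like the table above
    return masked_attention(Q, K, V, mask) + x

# Toy example: a sequence of 3 interactions with embedding size 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(sakt_block(x, W_q, W_k, W_v).shape)  # (3, 4)
```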

Self-attention knowledge tracing paper

Overview video
