Attention

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

Query -> input × W_Q (a learned projection of the input embedding)

Key and Value -> input × W_K and input × W_V (learned projections of the same input)

softmax -> produces the weight placed on each Value (the Query is compared against every Key: does this input contain what the Query is looking for?)
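
A minimal NumPy sketch of the formula above (the projection matrices W_q, W_k, W_v stand in for the learned embeddings and are random here, just for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted sum of the values

# Toy input: 3 tokens, model dim 4. W_q, W_k, W_v are illustrative
# random stand-ins for the learned projections mentioned above.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (3, 4): one output vector per token
```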

Attention model

Why softmax and not ReLU? ReLU would zero out every negative score and leave the positive ones unnormalized; softmax normalizes the attention scores into a distribution, so each query's weights sum to 1.
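
A quick numeric check of the note above: ReLU just zeros the negative scores and leaves the rest unnormalized, while softmax turns each query's scores into a distribution over the keys:

```python
import numpy as np

scores = np.array([2.0, 1.0, -1.0])  # raw attention scores for one query

relu = np.maximum(scores, 0.0)       # [2., 1., 0.]: zeros the negative, no normalization
soft = np.exp(scores) / np.exp(scores).sum()
print(soft, soft.sum())              # ~[0.71 0.26 0.04], sums to 1.0
```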

Attention paper

ResNet residual: the Transformer wraps each sublayer (attention, feed-forward) in a residual (skip) connection, as in ResNet.
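
A minimal sketch of the residual connection this note refers to (the sublayer here is a stand-in for attention):

```python
import numpy as np

def sublayer(x):
    # Stand-in for self-attention (or any learned transform).
    return 0.1 * x

def residual_block(x, f):
    # Residual (skip) connection: output = x + f(x). The sublayer only
    # has to learn a correction to the identity, and gradients flow
    # straight through the skip path in deep stacks.
    return x + f(x)

x = np.ones((3, 4))
print(residual_block(x, sublayer))  # every entry is 1.1: input preserved plus correction
```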

Example query/key attention pattern (1 = attend, 0 = masked):

|    | Q1 | Q2 | Q3 |
| -- | -- | -- | -- |
| K1 | 1  | 0  | 0  |
| K2 | 1  | 1  | 0  |
| K3 | 1  | 1  | 1  |
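
A sketch of how a 0/1 pattern like the table is applied as a mask (assuming the usual convention of queries on rows and keys on columns; blocked pairs are set to -inf before the softmax so they receive exactly zero weight):

```python
import numpy as np

# scores[i, j]: raw score of query i against key j (e.g. QK^T / sqrt(d))
scores = np.random.default_rng(0).normal(size=(3, 3))

# Lower-triangular 0/1 mask with the same pattern as the table above
# (1 = attend, 0 = masked out).
mask = np.tril(np.ones((3, 3)))

masked = np.where(mask == 1, scores, -np.inf)   # blocked pairs -> -inf
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)     # softmax rows sum to 1
print(weights.round(2))                         # zero exactly where mask is 0
```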

Self-attention knowledge tracing paper

Overview video
