Attention Matrix Explorer

See the full attention computation laid out step by step: Q, K, V projections, raw QKᵀ scores, scaling, causal masking, and final softmax weights.

Input


Attention Configuration


Attention Matrices — click any cell to highlight the query → key connection

Raw QKᵀ Scores

Dot product of Query × Keyᵀ for every token pair. Positive values raise the eventual attention weight (attract); negative values lower it (repel).

Scaled QKᵀ

Divided by √d_k so that dot products do not grow with the key dimension. Large scores saturate softmax and make its gradients vanish.
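
The first two panels can be reproduced by hand. The following is a minimal sketch in plain Python, assuming a made-up 3-token example with d_k = 4; the Q and K values and the helper names `raw_scores` and `scale_scores` are illustrative and not taken from the explorer itself.

```python
import math

# Hypothetical query and key vectors for 3 tokens, d_k = 4 (illustrative values).
Q = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
K = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [-1.0, 0.0, -1.0, 0.0]]


def raw_scores(Q, K):
    """Return Q @ K^T: entry [i][j] is the dot product of query i and key j."""
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
            for q_row in Q]


def scale_scores(scores, d_k):
    """Divide every score by sqrt(d_k)."""
    factor = math.sqrt(d_k)
    return [[s / factor for s in row] for row in scores]


raw = raw_scores(Q, K)
scaled = scale_scores(raw, d_k=len(Q[0]))

for row in scaled:
    print(row)
```

With d_k = 4 the scale factor is 1/2, so every raw score is simply halved.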

After Causal Mask

Upper triangle set to −∞, which softmax turns into a weight of exactly 0, so tokens cannot attend to future tokens.
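
As a rough illustration of the masking step, the sketch below replaces every entry above the diagonal with −∞. The function name `apply_causal_mask` and the input values are assumptions made for this example.

```python
import math


def apply_causal_mask(scores):
    """Set entries with column j > row i to -inf (future tokens)."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]


# Hypothetical scaled scores for 3 tokens.
scaled = [[1.0, 0.0, -1.0],
          [0.0, 2.0, 0.0],
          [1.0, 1.0, -1.0]]

masked = apply_causal_mask(scaled)

for row in masked:
    print(row)
```

The diagonal is kept, so each token can always attend to itself and every row has at least one finite score.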

Softmax Weights

Final attention probabilities. Each row sums to 1.0. Darker = lower weight, brighter = higher attention.
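
A minimal sketch of the row-wise softmax, assuming masked scores like those described above. The helper name `softmax_row` and the input values are illustrative. It relies on each row having at least one finite score, which the causal mask guarantees because the diagonal is never masked.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)                             # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in row]    # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical masked scores for 3 tokens.
masked = [[1.0, -math.inf, -math.inf],
          [0.0, 2.0, -math.inf],
          [1.0, 1.0, -1.0]]

weights = [softmax_row(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Masked positions come out as exactly 0, and the first token, which can only see itself, gets a weight of 1.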
Legend
High value
Zero
Negative
Masked (−∞)

Q, K, V Vectors — activations across dimensions for each token

How to Read This

Attention at a glance: Each cell in the matrix shows how much token i (row) attends to token j (column). Click any cell to highlight which query token is looking at which key token. Toggle the causal mask to see how, without it, tokens would attend to future positions. Switch to multi-head mode to see how different heads learn different relationship patterns.
Scale factor experiment: Drag the 1/√d_k slider from 0 to 2. At 0, every score is multiplied by zero, so softmax becomes exactly uniform. At high values, scores are amplified and softmax becomes sharply peaked. The default (1.0) applies the standard 1/√d_k scaling and keeps gradients healthy.
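
The slider experiment can be approximated offline. This sketch assumes the slider value acts as a multiplier on 1/√d_k, which is an interpretation of the UI rather than something the page states. The names `weights_at` and `raw_row` and the score values are hypothetical.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def weights_at(raw_row, d_k, multiplier):
    """Softmax of raw scores scaled by multiplier / sqrt(d_k) (assumed slider semantics)."""
    scale = multiplier / math.sqrt(d_k)
    return softmax_row([s * scale for s in raw_row])


# Hypothetical raw scores for one query against three keys, with d_k = 4.
raw_row = [2.0, 4.0, -2.0]

for multiplier in (0.0, 1.0, 2.0):
    w = weights_at(raw_row, d_k=4, multiplier=multiplier)
    print(multiplier, [round(x, 3) for x in w])
```

Under this assumption the largest weight grows as the multiplier increases, starting from a uniform distribution at 0.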