Attention Matrix Explorer

See the full attention computation laid out step by step: Q, K, V projections, raw QK^T scores, scaling, causal masking, and final softmax weights.

Week 2 · Day 9 Week 2 · Day 10

Input

Sentence to analyze

Tokens

Enter text above to see tokens

Attention Configuration

Causal mask

Heads

Single

Scale 1/√d_k

1.00

Attention Matrices — click any cell to highlight the query → key connection

Raw QK^T Scores

Dot product of each query vector with each key vector (Q × K^T), giving one score per token pair. Values can be positive (attract) or negative (repel).
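As a rough sketch of what this panel computes, the snippet below builds the raw score matrix from toy Q and K vectors. The values (3 tokens, d_k = 2) and the helper name raw_scores are made up for illustration and are not taken from the tool.

```python
# Hypothetical toy values: 3 tokens, d_k = 2 (not taken from the tool).
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
K = [[1.0, 2.0],
     [-1.0, 0.5],
     [0.0, -3.0]]

def raw_scores(Q, K):
    """Return S where S[i][j] = dot(Q[i], K[j]), i.e. Q x K^T."""
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
            for q_row in Q]

scores = raw_scores(Q, K)
for row in scores:
    print(row)
```

Row i holds the scores for query token i against every key token, so the matrix is always tokens × tokens regardless of d_k.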

Scaled QK^T

Scores divided by √d_k so that dot products do not grow with the key dimension. Large scores saturate softmax, which leaves vanishingly small gradients.
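One way to see why the divisor is √d_k: if query and key components are independent with zero mean and unit variance (an assumption made here for illustration), the dot product has standard deviation of about √d_k. The sketch below estimates this empirically; dot_std is a hypothetical helper name.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def dot_std(d_k, trials=2000):
    """Std of q.k for random unit-variance vectors, raw and after /sqrt(d_k)."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pstdev(dots)
    return raw, raw / math.sqrt(d_k)

results = {d_k: dot_std(d_k) for d_k in (4, 64, 256)}
for d_k, (raw, scaled) in results.items():
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

The raw spread should grow with d_k while the scaled spread stays close to 1, which keeps the softmax inputs in a similar range whatever the head size.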

After Causal Mask

Upper-triangle entries (future positions) are set to −∞, so they receive zero weight after softmax and tokens cannot attend to future tokens.
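A minimal sketch of the masking step, assuming the usual convention that row i is the query position and column j is the key position. The score values and the name causal_mask are illustrative only.

```python
NEG_INF = float("-inf")

def causal_mask(scores):
    """Replace entries above the diagonal (key index j > query index i) with -inf."""
    return [[s if j <= i else NEG_INF for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

# Hypothetical scaled scores for 3 tokens.
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [-1.0, 0.7, 0.2]]

masked = causal_mask(scores)
for row in masked:
    print(row)
```

The diagonal is kept, so every token can still attend to itself, and the last row is left unchanged because the final token has no future positions to hide.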

Softmax Weights

Final attention probabilities. Each row sums to 1.0, and masked positions get exactly 0. Darker = lower weight, brighter = higher weight.
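The row-wise softmax can be sketched as follows. The input matrix is a made-up example that already has −∞ in the future positions; subtracting the row maximum before exponentiating is a common numerical-stability trick and not necessarily what the tool does internally.

```python
import math

NEG_INF = float("-inf")

def softmax_row(row):
    """Softmax over one row; exp(-inf) evaluates to 0.0, so masked cells get zero weight."""
    m = max(row)  # finite as long as at least one cell in the row is unmasked
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical masked scores for 3 tokens.
masked = [[0.5, NEG_INF, NEG_INF],
          [0.1, 0.4, NEG_INF],
          [-1.0, 0.7, 0.2]]

weights = [softmax_row(row) for row in masked]
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

The first token can only see itself, so its row collapses to a single weight of 1.0.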

Legend

High value

Zero

Negative

Masked (−∞)

Q, K, V Vectors — activations across dimensions for each token

How to Read This

Attention at a glance

Each cell in the matrix shows how much token i (row) attends to token j (column). Click any cell to highlight which query token is looking at which key token. Toggle the causal mask to see why future-token leakage is dangerous. Switch to multi-head mode to see how different heads learn different relationship patterns.

Scale factor experiment

Drag the 1/√d_k slider from 0 to 2. At 0, every score is multiplied by zero, so softmax becomes uniform. At high values, differences between scores are amplified and softmax becomes sharply peaked. The sweet spot (1.0) is the standard 1/√d_k scaling, which keeps gradients healthy.
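The experiment can be approximated offline with the sketch below. It assumes the slider value acts as a plain multiplier on scores that were already divided by √d_k; that reading of the slider, and the example row, are assumptions rather than details confirmed by the tool.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical row of scores, already divided by sqrt(d_k).
scaled_scores = [2.0, 1.0, 0.0, -1.0]

peaks = {}
for multiplier in (0.0, 1.0, 2.0):
    weights = softmax([multiplier * s for s in scaled_scores])
    peaks[multiplier] = max(weights)  # largest weight in the row
    print(multiplier, [round(w, 3) for w in weights])
```

The largest weight starts at 1/4 (uniform over four tokens) when the multiplier is 0 and moves toward 1 as the multiplier increases, which matches the behavior described for the slider.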