▶
Attention at a glance
Each cell in the matrix shows how much token i (row) attends to token j (column). Click any cell to highlight which query token is looking at which key token. Toggle the causal mask to see why future-token leakage is dangerous. Switch to multi-head mode to see how different heads learn different relationship patterns.
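For readers who want to see the numbers behind the matrix, here is a minimal sketch in plain Python. The function names (`softmax`, `attention_weights`) and the toy vectors are illustrative assumptions, not code taken from the widget. It computes the weight in cell (i, j) as the softmax over row i of the scaled dot products, and shows how the causal mask zeroes every cell where j > i.

```python
import math

def softmax(row):
    # Subtract the row max for numerical stability; -inf entries map to 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K, causal=False):
    """Cell (i, j) is how much query token i attends to key token j."""
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                # Causal mask: token i may not look at a future token j.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    return weights

# Toy query/key vectors for three tokens (made-up values, illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full = attention_weights(Q, K)                  # every token sees every token
masked = attention_weights(Q, K, causal=True)   # upper triangle is zeroed

for row in masked:
    print([round(w, 3) for w in row])
```

With the mask on, the first token can only attend to itself, so its row collapses to a single weight of 1. Each row still sums to 1 in both modes.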
⚙
Scale factor experiment
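Drag the slider, which multiplies the 1/√dk scale factor, from 0 to 2. At 0, every score is multiplied by zero, so softmax becomes uniform. At high values, scores are amplified and softmax becomes sharply peaked, pushing it toward saturation where gradients shrink. The sweet spot (1.0, the standard 1/√dk scaling) keeps gradients healthy.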
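The same experiment can be reproduced offline with a short sketch. It assumes the slider value acts as a multiplier on 1/√dk, so 1.0 is the standard scaling. The helper name `scaled_softmax`, the raw scores, and `d_k = 16` are made up for illustration.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_softmax(raw_scores, d_k, multiplier):
    # Assumption: `multiplier` plays the role of the slider value,
    # so multiplier = 1.0 gives the standard 1/sqrt(d_k) scaling.
    scale = multiplier / math.sqrt(d_k)
    return softmax([s * scale for s in raw_scores])

raw_scores = [8.0, 4.0, 0.0, -4.0]  # made-up query-key dot products
d_k = 16

results = {}
for multiplier in (0.0, 1.0, 2.0):
    probs = scaled_softmax(raw_scores, d_k, multiplier)
    results[multiplier] = probs
    print(multiplier, [round(p, 3) for p in probs])
```

At a multiplier of 0 all four probabilities are equal. As the multiplier grows, more of the probability mass moves to the largest score.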