Two-way conversion between cross-entropy loss and perplexity. Understand what your model's loss really means.
Reference loss and perplexity values (perplexity = exp(loss), with loss in nats) for training checkpoints of a character-level TinyShakespeare model (vocab ~65), followed by two subword-model reference points:
| Checkpoint | Loss | Perplexity | What it means |
|---|---|---|---|
| Untrained random | 4.17 | ~65 | Uniformly guessing among all 65 characters. No learning yet. |
| Char-level · 500 steps | 2.5 | ~12 | Basic character frequencies and common bigrams learned. |
| Char-level · 1k steps | 2.0 | ~7.4 | Common patterns like "th", "he" emerge. Output looks like scrambled English. |
| Char-level · 5k steps | 1.5 | ~4.5 | Whole words appear. From 65 options, the model narrows to ~4–5 candidates. |
| Char-level · 20k steps | 1.0 | ~2.7 | Grammatically plausible text. Model is confident about most characters. |
| GPT-2 (WikiText-103) | 3.4 | ~30 | Subword model on a much harder dataset — apples-to-oranges with char-level, but a common reference point. |
| GPT-3 (Common Crawl) | 3.0 | ~20 | Massive model on web-scale data. A perplexity around 20 is a strong result for open-ended text, though figures vary with tokenizer and dataset. |
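The two-way conversion behind the table can be sketched in a few lines. This is a minimal sketch that assumes the loss is a mean cross-entropy in nats (natural log), which is what most training logs report; the function names `loss_to_perplexity` and `perplexity_to_loss` are illustrative, not part of any library.

```python
import math


def loss_to_perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss (assumed to be in nats) to perplexity."""
    return math.exp(loss)


def perplexity_to_loss(perplexity: float) -> float:
    """Convert perplexity back to a mean cross-entropy loss in nats."""
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return math.log(perplexity)


# Loss column of the table above, in row order.
table_losses = [4.17, 2.5, 2.0, 1.5, 1.0, 3.4, 3.0]
for loss in table_losses:
    print(f"loss {loss:.2f} -> perplexity {loss_to_perplexity(loss):.1f}")
```

If a framework reports loss in bits rather than nats, the conversion would be `2 ** loss` instead of `math.exp(loss)`.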
Perplexity directly answers: "How many equally likely tokens is the model effectively choosing between at each step?" This is called the effective branching factor.
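One way to see the branching-factor reading is to compute exp(entropy) for a single next-token distribution. The sketch below is illustrative only: `perplexity_of` is a hypothetical helper, and the `peaked` distribution is made up. It relies on the fact that when a model's predictions match the data distribution, its cross-entropy equals that distribution's entropy.

```python
import math


def perplexity_of(probs):
    """Return exp(entropy) of a probability distribution (entropy in nats)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


# Uniform guess over 65 characters: the effective number of choices is 65.
uniform_65 = [1 / 65] * 65
print(perplexity_of(uniform_65))

# A made-up peaked distribution over 4 tokens: fewer effective choices than 4.
peaked = [0.7, 0.1, 0.1, 0.1]
print(perplexity_of(peaked))
```

The uniform case recovers the vocabulary size, while the peaked case lands between 1 and 4, which is the sense in which a lower perplexity means the model has narrowed its candidates.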
A common source of confusion: the loss value you see in training logs may differ depending on how it's averaged. If you use `reduction='sum'` instead of `reduction='mean'` in `CrossEntropyLoss`, your loss will be batch_size × seq_len times larger. Always check which reduction mode you're using.
As a sanity check, an untrained model's loss should be close to ln(vocab_size) (~4.2 for 65 chars, ~10.8 for 50k tokens).
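To make the averaging issue concrete, here is a small sketch that mimics the two reduction modes in plain Python rather than calling PyTorch. It assumes an untrained model that assigns uniform probability to every token and that every position counts toward the loss (no padding or ignored tokens); the shapes and names are illustrative.

```python
import math

batch_size, seq_len, vocab_size = 4, 8, 65

# Assumption: an untrained model gives each token probability 1 / vocab_size,
# so that is the probability assigned to the correct token at every position.
p_correct = [1 / vocab_size] * (batch_size * seq_len)

# Per-token cross-entropy is -ln(probability of the correct token).
token_losses = [-math.log(p) for p in p_correct]

loss_sum = sum(token_losses)               # analogous to reduction='sum'
loss_mean = loss_sum / len(token_losses)   # analogous to reduction='mean'

print(f"mean loss: {loss_mean:.3f}  (ln(vocab_size) = {math.log(vocab_size):.3f})")
print(f"sum / mean: {loss_sum / loss_mean:.0f}  (batch_size * seq_len = {batch_size * seq_len})")

# Only the per-token mean is meaningful as an exponent for perplexity.
print(f"perplexity from mean loss: {math.exp(loss_mean):.1f}")
```

If your logs show a summed loss, divide by the number of tokens that contributed to it before converting to perplexity.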