Loss & Perplexity Converter — LLM From Scratch

Converter

Cross-Entropy Loss

Loss value: 3.00

Perplexity

Perplexity (e^loss): 20.09

Quality scale (ticks at 0, 3, 6, 9, 12): Random, Poor, Fair, Good, Very Good

Fair: The model has learned common patterns but still makes frequent mistakes.
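
The conversion this tool performs can be sketched in a few lines of Python. This is a minimal sketch, assuming the loss is a mean per-token cross-entropy measured in nats (natural log). The function names loss_to_perplexity and perplexity_to_loss are illustrative and are not taken from the page.

```python
import math


def loss_to_perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


def perplexity_to_loss(perplexity: float) -> float:
    """Convert a perplexity back to a mean per-token loss (in nats)."""
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return math.log(perplexity)


print(f"{loss_to_perplexity(3.0):.2f}")    # 20.09
print(f"{perplexity_to_loss(20.09):.2f}")  # 3.00
```

If your framework reports loss in bits (log base 2), the equivalent conversion would be 2 ** loss rather than exp(loss).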

Checkpoint Reference

Benchmark perplexity values for common training checkpoints on a character-level TinyShakespeare model (vocab ~65):

Checkpoint	Loss	Perplexity	What it means
Untrained random	4.17	~65	Uniformly guessing among all 65 characters. No learning yet.
Char-level · 500 steps	2.5	~12	Basic character frequencies and common bigrams learned.
Char-level · 1k steps	2.0	~7.4	Common patterns like "th", "he" emerge. Output looks like scrambled English.
Char-level · 5k steps	1.5	~4.5	Whole words appear. From 65 options, the model narrows to ~4–5 candidates.
Char-level · 20k steps	1.0	~2.7	Grammatically plausible text. Model is confident about most characters.
GPT-2 (WikiText-103)	3.4	~30	Subword model on a much harder dataset — apples-to-oranges with char-level, but a common reference point.
GPT-3 (Common Crawl)	3.0	~20	Massive model on web-scale data. A perplexity in the low 20s was considered very strong for open-ended text when GPT-3 was released; treat it as a rough reference only.
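
The Perplexity column is just exp(Loss), so the table can be re-derived from its Loss column. A small sketch, assuming the losses are in nats; the dictionary name checkpoints and its labels are only illustrative:

```python
import math

# Loss values copied from the table above (nats per token).
checkpoints = {
    "Untrained random": math.log(65),  # uniform guess over 65 characters
    "Char-level, 500 steps": 2.5,
    "Char-level, 1k steps": 2.0,
    "Char-level, 5k steps": 1.5,
    "Char-level, 20k steps": 1.0,
    "GPT-2 (WikiText-103)": 3.4,
    "GPT-3 (Common Crawl)": 3.0,
}

# Perplexity is the exponential of the mean per-token loss.
perplexities = {name: math.exp(loss) for name, loss in checkpoints.items()}

for name, ppl in perplexities.items():
    print(f"{name:<24} loss={checkpoints[name]:.2f}  perplexity={ppl:.1f}")
```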

Important: Perplexity is only comparable within the same tokenizer and dataset. A character-level perplexity of 4.5 is a solid result for a small model; a subword perplexity of 4.5 on open-ended text would be far better than any of the subword reference points above. Always note your vocab size and tokenization scheme when reporting perplexity.

Effective Branching Factor

Perplexity directly answers: "How many tokens is the model choosing between at each step?" This is called the effective branching factor.

Effective branching factor: 20.09

At each position, the model is as uncertain as if it were choosing from ~20 equally likely tokens.

With a vocabulary of 65 possible tokens, your model has narrowed the choice from all 65 down to about 20 plausible candidates per position. A random model would spread probability evenly across all 65 — a perplexity equal to the vocab size means no learning.
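
One way to see why perplexity reads as a branching factor: if the model always gave the correct token probability 1/k, as if guessing uniformly among k candidates, the perplexity would come out as exactly k. A minimal sketch of that idealized case (the helper name perplexity_of_uniform_guess is hypothetical):

```python
import math


def perplexity_of_uniform_guess(num_candidates: int) -> float:
    """Perplexity when the correct token always gets probability 1/num_candidates."""
    prob_correct = 1.0 / num_candidates
    loss = -math.log(prob_correct)  # per-token cross-entropy in nats
    return math.exp(loss)


vocab_size = 65
untrained = perplexity_of_uniform_guess(vocab_size)  # about 65: no learning
narrowed = perplexity_of_uniform_guess(20)           # about 20: 20 plausible candidates

print(f"untrained: {untrained:.2f}, narrowed: {narrowed:.2f}")
```

A real model does not spread probability uniformly over a fixed candidate set, so the branching factor is an equivalence in average uncertainty, not a literal count.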

Context window matters: A perplexity of 20 doesn't mean the model considers 20 tokens in isolation. It means the model's average uncertainty across the entire sequence is equivalent to choosing from 20 tokens. On easy positions (common words in predictable contexts), it may be near-certain; on hard positions (rare words, surprising transitions), it may have dozens of candidates.

Batch Loss vs Per-Token Loss

A common source of confusion: the loss value you see in training logs may differ depending on how it's averaged.

Per-token loss (what this converter uses) is the cross-entropy averaged over every individual token prediction. The batch loss that many frameworks report is the mean over all tokens in the batch. Averaging those batch means gives the same per-token value, as long as every batch contributes the same number of tokens.

Batch loss can also be the sum of the per-token losses rather than the mean. If your training code uses reduction='sum' instead of reduction='mean' in CrossEntropyLoss, the reported loss will be batch_size × seq_len times larger than the per-token loss (assuming every position in the batch is counted). Always check which reduction mode you're using.

Rule of thumb: If your initial loss is in the thousands, you're probably summing instead of averaging. Typical initial loss should be close to ln(vocab_size) (~4.2 for 65 chars, ~10.8 for 50k tokens).
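
The gap between the two reductions can be illustrated without any framework. The sketch below uses plain Python lists to stand in for per-token losses; the batch shape (32 × 128) is a hypothetical choice, and the comments only draw an analogy to the reduction modes mentioned above.

```python
import math

vocab_size = 65
batch_size, seq_len = 32, 128  # hypothetical batch shape

# An untrained model guessing uniformly pays ln(vocab_size) on every token.
per_token_loss = math.log(vocab_size)
token_losses = [per_token_loss] * (batch_size * seq_len)

mean_loss = sum(token_losses) / len(token_losses)  # analogous to reduction='mean'
sum_loss = sum(token_losses)                       # analogous to reduction='sum'

print(f"mean: {mean_loss:.2f}")   # close to ln(65), about 4.17
print(f"sum:  {sum_loss:.1f}")    # in the thousands
print(f"ratio: {sum_loss / mean_loss:.0f}")  # batch_size * seq_len

# Divide a summed loss by the token count before converting to perplexity.
recovered = sum_loss / (batch_size * seq_len)
perplexity = math.exp(recovered)
```

Feeding the summed value straight into exp() would overflow or give a meaningless perplexity, which is why the per-token normalization comes first.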

Formula Derivation

Cross-Entropy Loss for a single prediction:
H(p, q) = -Σ p(x) · log(q(x))

Since the true distribution p is one-hot (the correct token has probability 1):
Loss = -log(q(correct_token))

Averaged over all positions in the sequence:
L = -(1/N) · Σ log(q(correct_token_i))

Perplexity = exp(L)
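
The steps above can be checked numerically. This is a sketch with made-up numbers: the 4-token vocabulary, the distribution q, and the list probs_of_correct are all hypothetical.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log(q(x)); terms with p(x) == 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical 4-token vocabulary; q is the model's predicted distribution.
q = [0.1, 0.6, 0.2, 0.1]
correct_token = 1
p = [1.0 if i == correct_token else 0.0 for i in range(len(q))]  # one-hot target

full_sum = cross_entropy(p, q)           # the general formula
shortcut = -math.log(q[correct_token])   # the one-hot simplification

# Average over a short sequence: probability given to the correct token at each position.
probs_of_correct = [0.6, 0.3, 0.9]
L = -sum(math.log(pr) for pr in probs_of_correct) / len(probs_of_correct)
perplexity = math.exp(L)

print(f"full_sum={full_sum:.4f} shortcut={shortcut:.4f}")
print(f"L={L:.4f} perplexity={perplexity:.4f}")
```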

Why exponential? Loss is an average of negative log-probabilities. Exponentiating converts it back to probability space. A loss of 2.0 means the geometric mean of the probabilities the model assigns to the correct tokens is e^(−2.0) ≈ 0.135. Perplexity = 1 / (geometric mean of the correct-token probabilities).
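
The average that exp(−L) recovers is a geometric mean, not an arithmetic one, and the two can differ a lot when a few positions are hard. A small sketch with hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities assigned to the correct token at four positions:
# three easy positions and one hard one.
probs_of_correct = [0.9, 0.9, 0.9, 0.01]
n = len(probs_of_correct)

loss = -sum(math.log(p) for p in probs_of_correct) / n
perplexity = math.exp(loss)

geometric_mean = math.exp(sum(math.log(p) for p in probs_of_correct) / n)
arithmetic_mean = sum(probs_of_correct) / n

print(f"loss={loss:.3f} perplexity={perplexity:.3f}")
print(f"geometric mean={geometric_mean:.3f} (equals exp(-loss))")
print(f"arithmetic mean={arithmetic_mean:.3f} (does not)")
```

The single hard position pulls the geometric mean well below the arithmetic mean, which is why one badly mispredicted token can raise perplexity noticeably.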

Inverse conversion: Loss = ln(Perplexity)