LLM Parameter Counter

Break down the parameter count of a GPT-style model layer by layer. See exactly where every parameter lives.

Week 1 · Day 5 Week 2 · Day 7 Week 6 · Day 2

Configuration

Comparison mode (side-by-side): Off

Base model

Vocab Size: 50257
d_model (embedding dim): 768
n_layers (transformer blocks): 12
n_heads (attention heads): 12
d_ff (FFN inner dim): 3072
max_seq_len (context): 1024
Weight tying: On (lm_head skipped, shared with embeddings)

Parameter Breakdown

Base Model

Total Parameters

FP32 Memory (4 bytes per parameter)

BF16 Memory (2 bytes per parameter)

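Memory estimate notes

Training requires additional memory for gradients (about the same size as the parameters), optimizer states (Adam: 2× the parameters, for the momentum and variance estimates), and activations. Parameters, gradients, and Adam states together already come to 4× the parameter memory when all are stored at the same precision; activations account for the rest. Rule of thumb: budget 4–6× the parameter memory for full training.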
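The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not what the tool itself runs: the function name `training_memory_bytes` is hypothetical, every tensor is assumed to be stored at the same precision, and activation memory is left out because it depends on batch size and sequence length.

```python
def training_memory_bytes(n_params: int, bytes_per_param: int = 4) -> dict:
    """Estimate training memory, excluding activations.

    Assumes parameters, gradients, and both Adam states are all stored
    at the same precision (bytes_per_param). Real mixed-precision setups
    often keep optimizer states in FP32, which changes the totals.
    """
    params = n_params * bytes_per_param   # model weights
    gradients = params                    # one gradient per parameter
    adam_states = 2 * params              # momentum + variance
    return {
        "params": params,
        "gradients": gradients,
        "adam_states": adam_states,
        "total_excluding_activations": params + gradients + adam_states,
    }


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Illustrative parameter count at roughly GPT-2 small scale (an assumption).
    estimate = training_memory_bytes(124_000_000, bytes_per_param=4)
    for name, value in estimate.items():
        print(f"{name:>28}: {to_gib(value):.2f} GiB")
```

The total returned here is the 4× floor; the gap between 4× and 6× in the rule of thumb is the allowance for activations.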

Formula Reference

Token Embedding = vocab_size × d_model
Position Embedding = max_seq_len × d_model (if learned)
Attention (per block) = 4 × d_model² (Q,K,V,O projections, no biases)
MLP (per block) = 2 × d_model × d_ff (expand + project)
LayerNorms (per block) = 4 × d_model (2 LayerNorms × 2 vectors each)
LM Head = d_model × vocab_size (0 when weight tying is ON)

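Total = Embeddings + n_layers × (Attention + MLP + Norms) + Final_LN + LM_Head
where Embeddings = Token Embedding + Position Embedding and Final_LN = 2 × d_model
Biases, if present, add 4 × d_model per block for the attention projections (Q, K, V, O) and d_ff + d_model per block for the two MLP layers. They are negligible, and the formulas above omit them.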
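As a check on the formulas, here is a minimal Python sketch that applies them to the base configuration shown under Configuration. The names `GPTConfig` and `count_parameters` are hypothetical, chosen for illustration; like the formulas, the sketch ignores biases and assumes learned position embeddings. Note that n_heads does not appear: splitting d_model across heads does not change the size of the Q, K, V, O projections.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Defaults mirror the base model in the Configuration section.
    vocab_size: int = 50257
    d_model: int = 768
    n_layers: int = 12
    d_ff: int = 3072
    max_seq_len: int = 1024
    weight_tying: bool = True


def count_parameters(cfg: GPTConfig) -> dict:
    """Apply the Formula Reference term by term (biases omitted)."""
    token_emb = cfg.vocab_size * cfg.d_model
    pos_emb = cfg.max_seq_len * cfg.d_model      # learned positions assumed
    attention = 4 * cfg.d_model ** 2             # Q, K, V, O projections
    mlp = 2 * cfg.d_model * cfg.d_ff             # expand + project
    norms = 4 * cfg.d_model                      # 2 LayerNorms x (scale, shift)
    final_ln = 2 * cfg.d_model
    lm_head = 0 if cfg.weight_tying else cfg.d_model * cfg.vocab_size

    per_block = attention + mlp + norms
    total = token_emb + pos_emb + cfg.n_layers * per_block + final_ln + lm_head
    return {
        "token_embedding": token_emb,
        "position_embedding": pos_emb,
        "per_block": per_block,
        "all_blocks": cfg.n_layers * per_block,
        "final_layernorm": final_ln,
        "lm_head": lm_head,
        "total": total,
    }


if __name__ == "__main__":
    for name, value in count_parameters(GPTConfig()).items():
        print(f"{name:>20}: {value:,}")
    # With the defaults above, the total line reads 124,356,864.
```

Setting `weight_tying=False` adds a separate d_model × vocab_size output matrix (38,597,376 parameters for this configuration), which shows how much weight tying saves at this scale.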
