LLM Parameter Counter

Break down the parameter count of a GPT-style model layer by layer. See exactly where every parameter lives.

Configuration

vocab_size = 50257
d_model = 768
n_layers = 12
n_heads = 12
d_ff = 3072
max_seq_len = 1024
Weight tying: On (lm_head skipped, shared with embeddings)

Parameter Breakdown

Base Model

Total Parameters
    FP32 Memory (4 bytes per parameter)
    BF16 Memory (2 bytes per parameter)
    Memory estimate notes
    Training needs memory beyond the weights themselves: gradients (about the same size as the parameters), optimizer states (Adam stores momentum and variance, so 2× the parameters), and activations. Rule of thumb: budget 4–6× the parameter memory for full training.
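The memory figures above are simple multiplications, and a short sketch can make the rule of thumb concrete. The function names and the round 124,000,000 parameter count below are illustrative assumptions, not part of the tool; the 4× and 6× factors come from the rule of thumb and do not model activations precisely.

```python
def weight_memory_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def training_memory_range(n_params: int, bytes_per_param: int = 4,
                          low: int = 4, high: int = 6) -> tuple[int, int]:
    """Rough training budget: 4x to 6x the weight memory (rule of thumb only)."""
    base = weight_memory_bytes(n_params, bytes_per_param)
    return base * low, base * high


# Hypothetical model size, chosen only to keep the arithmetic easy to follow.
n_params = 124_000_000

fp32 = weight_memory_bytes(n_params, 4)    # 496_000_000 bytes
bf16 = weight_memory_bytes(n_params, 2)    # 248_000_000 bytes
lo, hi = training_memory_range(n_params)   # (1_984_000_000, 2_976_000_000)

print(f"FP32 weights: {fp32:,} bytes")
print(f"BF16 weights: {bf16:,} bytes")
print(f"FP32 training (rule of thumb): {lo:,} to {hi:,} bytes")
```

The lower bound of 4× corresponds to weights plus gradients plus the two Adam states; the extra headroom up to 6× is a loose allowance for activations, which in practice depend on batch size and sequence length.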

    Formula Reference

    Token Embedding = vocab_size × d_model
    Position Embedding = max_seq_len × d_model (if learned)
    Attention (per block) = 4 × d_model² (Q, K, V, O projections, no biases)
    MLP (per block) = 2 × d_model × d_ff (expand + project)
    LayerNorms (per block) = 4 × d_model (2 LayerNorms × 2 vectors each)
    LM Head = d_model × vocab_size (0 when weight tying is ON)

    Total = Embeddings + n_layers × (Attention + MLP + Norms) + Final_LN + LM_Head
    where Embeddings = Token Embedding + Position Embedding and Final_LN = 2 × d_model
    Biases, when present, add d_model per attention projection (4 × d_model per block for Q, K, V, O) and d_ff + d_model per MLP block. They are negligible and are not part of the formulas above.
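The formulas in this reference can be turned into a small calculator. The sketch below is one possible reading of them, not the tool's own code: the `GPTConfig` class, the `count_params` function, and the `include_biases` flag are hypothetical names, and the defaults assume the configuration values shown earlier map to vocab_size, d_model, n_layers, d_ff, and max_seq_len in that order. The bias terms assume one bias vector per linear layer.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Defaults mirror the values in the Configuration section (assumed mapping).
    vocab_size: int = 50257
    d_model: int = 768
    n_layers: int = 12
    d_ff: int = 3072
    max_seq_len: int = 1024
    tie_weights: bool = True       # LM head shares the token embedding matrix
    learned_pos_emb: bool = True   # position embeddings are a learned table
    include_biases: bool = False   # assumption: one bias per linear layer


def count_params(cfg: GPTConfig) -> dict:
    d = cfg.d_model
    token_emb = cfg.vocab_size * d
    pos_emb = cfg.max_seq_len * d if cfg.learned_pos_emb else 0
    attn = 4 * d * d               # Q, K, V, O weight matrices
    mlp = 2 * d * cfg.d_ff         # expand + project weight matrices
    norms = 4 * d                  # 2 LayerNorms x (scale + shift)
    if cfg.include_biases:
        attn += 4 * d              # one d_model bias per attention projection
        mlp += cfg.d_ff + d        # expand bias + project bias
    final_ln = 2 * d
    lm_head = 0 if cfg.tie_weights else d * cfg.vocab_size
    per_block = attn + mlp + norms
    total = token_emb + pos_emb + cfg.n_layers * per_block + final_ln + lm_head
    return {
        "token_emb": token_emb, "pos_emb": pos_emb,
        "attn_per_block": attn, "mlp_per_block": mlp,
        "norms_per_block": norms, "final_ln": final_ln,
        "lm_head": lm_head, "total": total,
    }


if __name__ == "__main__":
    for name, value in count_params(GPTConfig()).items():
        print(f"{name:>16}: {value:,}")
    # With the defaults (no biases, tied weights) the total is 124,356,864.
    # With include_biases=True it becomes 124,439,808.
```

Returning a dictionary rather than a single number keeps the per-component breakdown visible, which is the point of counting layer by layer; turning `tie_weights` off adds a separate `d_model × vocab_size` LM head to the total.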