Understand every component by implementing it yourself. Tokenization, attention, transformers, training loops, and text generation: no black boxes, just working code.
Curriculum based on angelos-p/llm-from-scratch by Angelos P.
Turn raw text into numbers the model can understand
Week 1 Focus: Text to tokens to IDs
Why tokenization matters, character-level basics, and the case for Byte Pair Encoding
Neural networks operate on numeric tensors, but text is variable-length and symbolic. Tokenization bridges the gap by mapping each unit of text to a unique integer ID in a finite vocabulary. The vocabulary size V sets the number of rows in the embedding matrix and the output dimension of the final projection layer, so it accounts for a large share of a small model's parameter count.
Consider the sentence "The cat sat.". At the character level, this becomes ['T','h','e',' ','c','a','t',' ','s','a','t','.'] (length 12). At the word level, it becomes ['The','cat','sat.'] (length 3). At the subword level (BPE), it might become ['The','Ġcat','Ġsat.'] (length 3, with the space encoded as Ġ).
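The character-level and word-level counts above can be checked in a few lines. This is a minimal sketch: the word-level split uses plain whitespace splitting, which is an assumption here (real word tokenizers usually treat punctuation separately).

```python
sentence = "The cat sat."

# Character level: one token per character, spaces and punctuation included
char_tokens = list(sentence)

# Word level: naive whitespace split (assumption, keeps "sat." as one token)
word_tokens = sentence.split()

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
```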
For example, a subword tokenizer might split "unfortunately" into ['un', 'fortun', 'ately'].
The simplest tokenizer maps each unique character to an ID. With a character-level approach, your "vocabulary" is just the set of unique characters in your training corpus. This is trivial to build and deterministic: no learning required.
# Build a character-level vocabulary from a corpus
def build_char_vocab(text):
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

# Encode text to IDs
def encode_char(text, vocab):
    return [vocab[ch] for ch in text]

# Decode IDs back to text
def decode_char(ids, vocab):
    itos = {i: ch for ch, i in vocab.items()}
    return ''.join(itos[i] for i in ids)

# Example
corpus = "The quick brown fox jumps over the lazy dog"
vocab = build_char_vocab(corpus)
# vocab: {' ': 0, 'T': 1, 'a': 2, 'b': 3, ...} (28 entries)
ids = encode_char(corpus, vocab)
# ids: [1, 9, 6, 0, ...] - sequence length = 43
Run len(set(text)) to see your vocabulary size. What happens to sequence length if you include Unicode characters?
Transformer attention is O(n^2) in sequence length n. If your sequences are 10x longer because you used characters instead of subwords, the attention cost grows by roughly 100x. This is the fundamental reason character-level tokenization is rarely used for large models, though it is common in toy implementations and specific domains (DNA sequences, for example).
Character-level: for "The quick brown fox..." (43 chars), sequence length = 43 and vocabulary = 28 (26 lowercase letters, 'T', and the space). Cheap embedding layer, expensive attention everywhere else.
BPE: the same text might be ~10 tokens with a vocabulary of 32,000. Larger embedding layer, but attention cost drops by ~18x (43^2 vs 10^2). Net win for any real model.
Byte Pair Encoding starts with character-level tokens, then iteratively merges the most frequent pair of adjacent tokens. After enough merges, common sequences like "ing", "tion", or whole words like "the" become single tokens.
The algorithm:
# Conceptual BPE merge step
def get_pair_counts(tokens):
    counts = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# After many merges, "t" + "h" -> "th", "th" + "e" -> "the"
# The merge table stores: {(116, 104): 256, (256, 101): 257, ...}
The critical detail: encoding at inference time uses the same merge table. You apply the merges in the order they were learned (earliest merge first), which reproduces the tokenization the model was trained with.
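To make the train-then-encode cycle concrete, here is a minimal byte-level BPE sketch. The function names (merge_pair, train_bpe, encode_bpe, decode_bpe) and the sample text are illustrative assumptions, not part of any library, and production tokenizers add pre-tokenization rules and faster data structures on top of this.

```python
def get_pair_counts(tokens):
    """Count how often each adjacent pair of token IDs occurs."""
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn a merge table; start from raw UTF-8 bytes (IDs 0..255)."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # dicts keep insertion order, i.e. the order merges were learned
    for step in range(num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        tokens = merge_pair(tokens, best, new_id)
        merges[best] = new_id
    return merges

def encode_bpe(text, merges):
    """Apply the learned merges in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        tokens = merge_pair(tokens, pair, new_id)
    return tokens

def decode_bpe(ids, merges):
    """Expand every ID back to its byte string and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

sample = "the cat sat on the mat, the end"  # made-up toy corpus
merges = train_bpe(sample, num_merges=5)
ids = encode_bpe(sample, merges)
print(len(sample), "bytes ->", len(ids), "tokens")
print(decode_bpe(ids, merges) == sample)
```

Because new IDs are only ever built from IDs that already exist, replaying the merge table in learned order is enough to reproduce the training-time tokenization.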
In practice, you rarely train BPE from scratch. You use a library like tiktoken (OpenAI's tokenizer) or tokenizers (Hugging Face). This ensures compatibility with pretrained models and handles edge cases like Unicode correctly.
import tiktoken

# Load GPT-4's tokenizer (cl100k_base)
enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog"
ids = enc.encode(text)
# ids is a short list of integers, roughly one token per word here,
# far fewer than the 43 characters. The exact IDs depend on the encoding.
print(len(ids))

decoded = enc.decode(ids)
assert decoded == text  # round-trip integrity

# Check vocabulary size
print(enc.n_vocab)  # 100277 for cl100k_base
Install tiktoken or tokenizers. Tokenize a sample text at character level and with BPE. Compare sequence lengths. Try encoding a word that wasn't in your training corpus (for a from-scratch BPE) vs a pretrained one. Observe how subword splitting differs.
PyTorch Datasets, DataLoaders, and the art of creating training batches
Language modeling is fundamentally next-token prediction. Given a sequence of token IDs [x1, x2, ..., xn], the model learns to predict [x2, x3, ..., xn+1]. Input and target are offset by one position.
# For a sequence of token IDs
tokens = [464, 2068, 7586, 21831, 1917]  # "The quick brown..."

# Input: all tokens except the last
inputs = tokens[:-1]   # [464, 2068, 7586, 21831]

# Target: all tokens except the first (shifted left by 1)
targets = tokens[1:]   # [2068, 7586, 21831, 1917]

# Model learns: given 464 -> predict 2068, given 2068 -> predict 7586, etc.
When working with sequences longer than one step (which you always should), inputs and targets become matrices of shape (batch_size, sequence_length). The loss is computed over all positions in the sequence.
A Dataset in PyTorch defines how to get a single item. For language modeling, an item is typically a chunk of text of fixed length. The dataset needs to tokenize the raw text and serve up (input_chunk, target_chunk) pairs.
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text, tokenizer, seq_len):
        self.tokenizer = tokenizer
        self.seq_len = seq_len
        # Tokenize the entire text once
        self.ids = tokenizer.encode(text)
        # Number of complete sequences we can make
        self.n_samples = len(self.ids) // seq_len

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        start = idx * self.seq_len
        end = start + self.seq_len
        # Input: tokens[start:end-1], Target: tokens[start+1:end]
        # (so x and y each hold seq_len - 1 tokens)
        chunk = self.ids[start:end]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y
Try different seq_len values: 16, 64, 256, 1024. For each, print the number of samples and the shape of x and y. What happens to the number of samples as seq_len increases? What's the trade-off between longer sequences (more context) and more samples (more gradient updates per epoch)?
The DataLoader wraps your Dataset and adds batching, shuffling, and multi-process loading. For language modeling, you typically want:
dataset = TextDataset(text, tokenizer, seq_len=128)
dataloader = DataLoader(
    dataset,
    batch_size=32,     # 32 sequences per batch
    shuffle=True,      # Randomize order each epoch
    num_workers=4,     # Parallel loading (set to 0 on Windows if issues)
    pin_memory=True    # Speeds up GPU transfer
)

for batch_x, batch_y in dataloader:
    # batch_x: shape (32, 127) - 32 sequences of 127 tokens each
    # batch_y: shape (32, 127) - the targets
    print(batch_x.shape, batch_y.shape)
    break  # Just check the first batch
You need separate data for training and validation. The validation set measures generalization: if your model memorizes the training set but fails on validation, you're overfitting. A typical split is 90/10 or 95/5 for large corpora.
def train_val_split(text, train_ratio=0.95):
    split_idx = int(len(text) * train_ratio)
    return text[:split_idx], text[split_idx:]

train_text, val_text = train_val_split(raw_text)
train_dataset = TextDataset(train_text, tokenizer, seq_len=128)
val_dataset = TextDataset(val_text, tokenizer, seq_len=128)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
What happens when seq_len doesn't divide the corpus evenly? You have two options: discard the remainder (simple, loses a bit of data) or pad the last sequence (requires attention masks). For simplicity in a from-scratch implementation, discarding is fine.
# In __init__ of TextDataset:
self.n_samples = len(self.ids) // seq_len          # Integer division discards remainder
self.ids = self.ids[:self.n_samples * seq_len]     # Trim to exact multiple
For very large datasets, you might also implement a streaming dataset that doesn't load everything into memory. But for learning purposes, loading the full tokenized corpus is simpler and fast enough for corpora up to ~100MB.
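As a rough illustration of the streaming idea, the generator below yields training pairs from any iterator of token IDs while holding only one window in memory. The name stream_examples is hypothetical, and unlike the TextDataset above it yields inputs and targets of exactly seq_len tokens by reading seq_len + 1 tokens per window.

```python
from itertools import islice

def stream_examples(token_iter, seq_len):
    """Yield (inputs, targets) lists of length seq_len from a token iterator."""
    token_iter = iter(token_iter)
    window = list(islice(token_iter, seq_len + 1))
    while len(window) == seq_len + 1:
        yield window[:-1], window[1:]
        # Keep the last token so it becomes the first input of the next window
        window = window[-1:] + list(islice(token_iter, seq_len))
    # A trailing partial window is discarded, matching the "discard" option above

# Any iterable works: a list, a generator reading a file, etc.
pairs = list(stream_examples(range(10), seq_len=3))
for x, y in pairs:
    print(x, y)
```

In a PyTorch pipeline the same generator could back a torch.utils.data.IterableDataset, which is the class intended for data that cannot be indexed up front.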
Turn token IDs into vectors, and teach the model where each token sits in sequence
An embedding layer is literally a matrix of shape (vocab_size, d_model). When you look up token ID 42, you get row 42 of this matrix. The values in this matrix are learned during training: they are parameters, just like weights in a linear layer.
The intuition: tokens with similar meanings should have similar embedding vectors (small cosine distance). "King" and "Queen" should be close; "King" and "Toaster" should be far apart. The model learns this by adjusting the embedding matrix to minimize the training loss.
import torch
import torch.nn as nn

vocab_size = 50257  # GPT-2 vocabulary size
d_model = 768       # Embedding dimension (GPT-2 small)

embedding = nn.Embedding(vocab_size, d_model)
# embedding.weight: shape (50257, 768) - these are all learnable parameters

# Forward pass: batch of token IDs
token_ids = torch.tensor([[464, 2068, 7586]])  # shape (1, 3)
embeddings = embedding(token_ids)              # shape (1, 3, 768)
# Each token ID is replaced by its 768-dim vector
The embedding lookup described above is position-agnostic. The vector for token ID 42 at position 0 is identical to the vector for token ID 42 at position 500. But "The cat" (cat at position 1) and "Cat the" (cat at position 0) mean different things! The model needs to know where each token sits in the sequence.
This is solved by adding position information to each token's embedding. There are two main approaches:
Learned position embeddings: simple to implement (another nn.Embedding) and effective in practice. GPT-2 and GPT-3 use this approach; many newer LLMs use rotary position embeddings (RoPE) instead. The model figures out what position information is useful.
Sinusoidal encodings: an elegant idea in which different dimensions oscillate at different frequencies. They have no learned parameters, so they are less flexible than learned embeddings, and extrapolation to longer sequences still needs care. Mostly historical now.
class EmbeddingWithPosition(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len=1024):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        seq_len = token_ids.shape[1]
        # Position indices: [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=token_ids.device)
        # Look up embeddings
        tok_emb = self.token_embedding(token_ids)      # (batch, seq_len, d_model)
        pos_emb = self.position_embedding(positions)   # (seq_len, d_model)
        # Add them! Broadcasting handles the batch dimension.
        return tok_emb + pos_emb
The addition is the key operation. Each token's final representation is token_embedding[token_id] + position_embedding[position]. Both are d_model-dimensional vectors, so they add element-wise.
The original Transformer paper (Vaswani et al., 2017) used fixed sinusoidal encodings. The formula:
For dimension i (where i is even):
PE(pos, i) = sin(pos / 10000^(i/d_model))
For dimension i (where i is odd):
PE(pos, i) = cos(pos / 10000^((i-1)/d_model))
import math
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # Position indices as a column vector: (seq_len, 1)
    positions = torch.arange(seq_len).unsqueeze(1).float()
    # div_term[k] = 1 / 10000^(2k/d_model), one value per sin/cos pair
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    # Allocate the encoding matrix (assumes d_model is even)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return pe  # (seq_len, d_model) - add this to token embeddings
Plot the encoding matrix with matplotlib.pyplot.imshow(). Observe the different frequency patterns. Now compare with learned position embeddings from a trained model (if available). Which ones look more structured? Try changing the 10000 base in the formula to 500 or 50000. How does it affect the encoding?
You might wonder: why add the position encoding? Why not concatenate token embedding (768-dim) with position encoding (768-dim) to get 1536-dim, then project down?
The answer is simplicity and parameter efficiency. Adding keeps the dimension at d_model throughout the network. Concatenation would double the effective dimension at the cost of more parameters. Addition works because the model learns to "interpret" each dimension as encoding both token identity and position simultaneously.
This is also why the initialization scale of position embeddings matters. If position embeddings are too large relative to token embeddings, the token information gets overwhelmed. If they're too small, the model can't distinguish positions. In practice, both are typically initialized with similar scales (e.g., a zero-mean normal distribution with std=0.02).
The engine room: optimizer, loss function, backpropagation, and gradient clipping
Every training loop follows this structure. The details change, but the flow is invariant:
def train(model, dataloader, optimizer, device, epoch):
    model.train()  # Enable dropout, batch norm updates, etc.
    total_loss = 0
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        # Move data to device (GPU if available)
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        logits = model(inputs)  # shape: (batch, seq_len, vocab_size)
        loss = compute_loss(logits, targets)

        # Backward pass
        optimizer.zero_grad()   # Clear previous gradients
        loss.backward()         # Compute gradients via backprop
        optimizer.step()        # Update parameters

        total_loss += loss.item()
        if batch_idx % 100 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
    return total_loss / len(dataloader)
The model outputs logits of shape (batch, seq_len, vocab_size). For each position, this is a vector of raw scores (not probabilities) for each token in the vocabulary. Cross-entropy loss measures how far these scores are from the target token ID.
Mathematically: CE(p, y) = -log(p[y]) where p is the softmax of the logits and y is the target token ID. PyTorch's CrossEntropyLoss combines softmax and negative log-likelihood in one numerically stable operation.
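A worked example of the formula for a single position, written in plain Python so every step is visible. The four-token vocabulary and the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """CE = -log(p[target]) where p = softmax(logits)."""
    return -math.log(softmax(logits)[target])

# All logits equal: the model is maximally unsure, so the loss is log(4)
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)

# A large logit on the correct token drives the loss toward 0
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)

# A large logit on the wrong token is penalized heavily
wrong_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], target=2)

print(uniform_loss, confident_loss, wrong_loss)
```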
import torch.nn.functional as F

def compute_loss(logits, targets):
    # logits:  (batch, seq_len, vocab_size)
    # targets: (batch, seq_len)
    # F.cross_entropy expects logits of shape (N, C) and targets of shape (N,),
    # where N = batch * seq_len and C = vocab_size, so flatten the first two dims.
    batch_size, seq_len, vocab_size = logits.shape
    logits = logits.view(-1, vocab_size)   # (batch*seq_len, vocab_size)
    targets = targets.view(-1)             # (batch*seq_len,)
    return F.cross_entropy(logits, targets)
A key detail: we compute loss over all positions in the sequence. The model is learning to predict every next token simultaneously. This is what makes transformers efficient: one forward pass teaches the model about every position in the context window.
AdamW (Adam with decoupled weight decay) is the standard optimizer for transformers. It adapts the learning rate per parameter based on the history of gradients, and it applies weight decay directly to the weights instead of adding an L2 penalty to the gradient, so the decay is not rescaled by the adaptive learning rate.
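The update rule can be written out for a single scalar parameter. This is an illustrative sketch of the standard AdamW step, not PyTorch's implementation; the helper name adamw_step and the example values are assumptions.

```python
import math

def adamw_step(p, g, m, v, t, lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter p with gradient g.
    m, v are the running moment estimates; t is the 1-based step count."""
    b1, b2 = betas
    # Decoupled weight decay: shrink the weight directly, independent of g
    p = p * (1.0 - lr * weight_decay)
    # Adam moment estimates with bias correction
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    # Adaptive step: gradient scaled by its own running magnitude
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Zero gradient: only the weight decay acts on the parameter
p_decay, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)

# No decay: on the first step the update is about lr whatever the gradient scale
p_small, _, _ = adamw_step(1.0, 0.01, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
p_large, _, _ = adamw_step(1.0, 100.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)

print(p_decay, p_small, p_large)
```

The last two results show why Adam-style optimizers are insensitive to the raw gradient scale: a gradient of 0.01 and a gradient of 100 move the parameter by almost the same amount.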
import torch.optim as optim

optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,            # Learning rate: 0.0003 is a good starting point
    weight_decay=0.1,   # Weight decay: helps prevent overfitting
    betas=(0.9, 0.95)   # Adam momentum parameters
)
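Good defaults: lr=3e-4 is a common starting point for small transformers (the GPT-3 paper used peak rates from 6e-4 down to 0.6e-4 depending on model size). weight_decay=0.1 is standard. betas=(0.9, 0.95) shortens the averaging window of the second-moment estimate compared with the default (0.9, 0.999), so it reacts faster to changes in gradient scale.
Common mistakes: using SGD (too slow convergence). Forgetting weight_decay (overfitting). Setting lr too high (>1e-3, loss explodes). Setting lr too low (<1e-5, barely learns).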
Transformers are prone to gradient explosion, especially early in training. A single large gradient can destroy your model's weights. Gradient clipping solves this by scaling the gradients if their norm exceeds a threshold.
# Inside the training loop, after loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Then optimizer.step()
The value max_norm=1.0 means: if the total gradient norm exceeds 1.0, scale all gradients down proportionally so the norm equals 1.0. This keeps training stable without changing the direction of the gradient update.
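The scaling rule can be sketched in plain Python on a flat list of gradient values. The function name clip_by_global_norm is hypothetical; it mirrors the idea behind clip_grad_norm_ (one shared scale factor for all gradients) rather than reproducing PyTorch's code.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their L2 norm is at most max_norm.
    Returns the clipped gradients and the norm measured before clipping."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Norm is 5.0, so every gradient is scaled by about 1/5
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

# Norm is 0.5, already under the threshold, so nothing changes
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, clipped, untouched)
```

Because every component is multiplied by the same factor, the ratio between gradient components (the update direction) is preserved.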
Log the return value of torch.nn.utils.clip_grad_norm_() (it returns the norm before clipping). Plot gradient norms over batches. Without clipping, do you see occasional massive spikes? Try max_norm values of 0.5, 1.0, and 5.0. Which produces the most stable training?
A constant learning rate is rarely optimal. The standard schedule for transformers is warmup followed by cosine decay: start with a very small LR, ramp up to the target LR over N steps, then decay down to near-zero.
import math
import torch.optim as optim

def get_lr_scheduler(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        # Linear warmup from 0 to the target LR
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        # Cosine decay to 0; clamp progress so the LR stays at 0 past total_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage
scheduler = get_lr_scheduler(optimizer, warmup_steps=2000, total_steps=100000)
# After optimizer.step(): scheduler.step()
Warmup prevents early instability (the model's loss is very high at the start, leading to large gradients). Cosine decay allows fine-tuning in the later stages of training.
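To see the shape of the schedule without building an optimizer, the multiplier can be evaluated directly at a few steps. This standalone sketch uses a hypothetical helper lr_multiplier and the same warmup_steps=2000, total_steps=100000, and peak rate 3e-4 used in this section.

```python
import math

def lr_multiplier(step, warmup_steps, total_steps):
    """Fraction of the peak learning rate to use at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                    # hold at 0 after the end
    return 0.5 * (1.0 + math.cos(math.pi * progress))

peak_lr = 3e-4
checkpoints = [0, 1000, 2000, 51000, 100000]
schedule = {s: peak_lr * lr_multiplier(s, 2000, 100000) for s in checkpoints}

for step, lr in schedule.items():
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The rate is half the peak midway through warmup (step 1000), reaches the peak at step 2000, is back to half the peak midway through the decay (step 51000), and ends at zero.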
The full transformer architecture: layers, dimensions, and how it all connects
A GPT-style transformer follows this data flow:
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=12, n_heads=12, max_seq_len=1024):
        super().__init__()
        self.embedding = EmbeddingWithPosition(vocab_size, d_model, max_seq_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)                # shape preserved
        x = self.ln_final(x)            # (batch, seq_len, d_model)
        logits = self.lm_head(x)        # (batch, seq_len, vocab_size)
        return logits
Each transformer block has two main sub-layers, each wrapped with residual connections and layer normalization (pre-norm is the modern standard):
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        # Pre-norm residual: norm -> sublayer -> residual add
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x
Attention is the core innovation of the Transformer. It computes, for each token, a weighted sum over the tokens in the sequence (including itself). In a GPT-style decoder, a causal mask restricts this to the current and earlier tokens. The weights are determined by how "relevant" each other token is to the current token.
Multi-head attention runs this process in parallel across multiple "heads", each learning different types of relationships:
The full mathematical form (you'll implement this in Week 2):
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension per head.
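A single-head version of this formula fits in a few lines of PyTorch. This is a minimal sketch under some assumptions: it adds the causal mask a GPT-style decoder needs, it skips the learned Q/K/V projections (the same tensor is passed for all three), and the function name causal_attention is illustrative.

```python
import math
import torch

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.
    q, k, v: (batch, seq_len, d_k). Returns (output, attention_weights)."""
    d_k = q.size(-1)
    seq_len = q.size(1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, T, T)
    # Lower-triangular mask: position t may only look at positions <= t
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 5, 16)            # (batch=2, seq_len=5, d_k=16), random stand-in data
out, weights = causal_attention(x, x, x)
print(out.shape)                     # same shape as the input
print(weights[0])                    # upper triangle is zero
```

Masked positions receive a score of negative infinity, so after the softmax they carry exactly zero weight and the first token can only attend to itself.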
The FFN is surprisingly simple: two linear layers with a non-linearity (typically GELU) in between. It expands to 4 * d_model internally, then projects back to d_model.
class FeedForward(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # Expand
            nn.GELU(),                         # Non-linearity (smoother than ReLU)
            nn.Linear(4 * d_model, d_model),   # Project back
        )

    def forward(self, x):
        return self.net(x)
The 4x expansion factor is a hyperparameter. GPT-2 and GPT-3 use 4x; other architectures pick different ratios. With a 4x expansion, the FFN accounts for about 2/3 of each block's parameters (8 * d_model^2, versus 4 * d_model^2 for attention).
Understanding the parameter count helps you choose model dimensions that fit your hardware. Here's the breakdown for a GPT-style model:
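# Rough parameter count for a GPT-2-small-sized model (biases and layer norms ignored)
d_model = 768
n_layers = 12
vocab_size = 50257
max_seq_len = 1024

# Attention: Q, K, V projections + output projection = 4 * d_model * d_model
attn_params = 4 * d_model * d_model                          # ~2.4M
# FFN: expand (d_model -> 4*d_model) + project (4*d_model -> d_model)
ffn_params = d_model * 4 * d_model + 4 * d_model * d_model    # ~4.7M
# Layer norms: 2 * 2 * d_model per block (negligible)
block_params = attn_params + ffn_params
print(block_params)                                           # 7,077,888 (~7.1M per block)

# Token + position embeddings
emb_params = vocab_size * d_model + max_seq_len * d_model     # ~39.4M
print(n_layers * block_params + emb_params)                   # 124,318,464 (~85M blocks + ~39M embeddings)
# This assumes lm_head shares its weights with the token embedding;
# an untied lm_head adds another vocab_size * d_model (~38.6M).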
Use sum(p.numel() for p in model.parameters()) to verify your calculation on an actual PyTorch model.
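For an exact figure to compare against, the count can include biases and layer norms as well. The sketch below assumes a GPT-2-small layout: biases on every linear layer, learned position embeddings, and an lm_head tied to the token embedding. The function name gpt2_param_count is hypothetical. The GPT class sketched in this section does not tie lm_head, so its total would be larger by vocab_size * d_model.

```python
def gpt2_param_count(vocab_size=50257, d_model=768, n_layers=12, max_seq_len=1024):
    """Exact parameter count under the assumptions stated above."""
    # Attention: Q/K/V projections (3 * d_model outputs) plus output projection, with biases
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # FFN: expand to 4 * d_model and project back, with biases
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * 2 * d_model
    block = attn + ffn + norms
    # Token and position embedding tables
    embeddings = vocab_size * d_model + max_seq_len * d_model
    # Final LayerNorm before the output projection
    final_norm = 2 * d_model
    # lm_head is assumed tied to the token embedding, so it adds no parameters
    return n_layers * block + embeddings + final_norm

total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```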
How to evaluate your model, run ablation studies, and debug training issues
Perplexity is the exponential of the average negative log-likelihood. Lower is better. A perplexity of N means the model is as confused as if it had to choose uniformly from N tokens at each step.
Mathematically: Perplexity = exp(cross_entropy_loss). If your model achieves a cross-entropy loss of 3.0, perplexity is e^3 ≈ 20.09, meaning the model is about as uncertain as choosing from 20 tokens randomly.
import math
import torch

def compute_perplexity(model, dataloader, device):
    model.eval()
    total_loss = 0
    total_tokens = 0
    with torch.no_grad():  # No gradient computation needed
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = compute_loss(logits, targets)
            # loss is a per-token mean, so weight it by the token count
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
    avg_loss = total_loss / total_tokens
    return math.exp(avg_loss)  # Perplexity

# Usage
val_ppl = compute_perplexity(model, val_loader, device)
print(f"Validation Perplexity: {val_ppl:.2f}")
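On TinyShakespeare, ~30-50 is decent for a small model with a subword tokenizer; with a character-level tokenizer the numbers are far lower (a loss of 1.5 is a perplexity of about 4.5). GPT-3 on Common Crawl: roughly 20-30 as a ballpark. Lower means better predictions, but perplexity depends heavily on the dataset and tokenizer, so only compare values measured on the same setup.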
Perplexity equal to vocab_size (e.g., 50,000) means the model is predicting uniformly at random. Perplexity that increases during training means something is broken (learning rate too high, bug in loss, etc.).
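The relationship between loss and perplexity, including the uniform-guessing baseline, can be checked numerically. The helper names below are illustrative.

```python
import math

def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the mean per-token loss (in nats)."""
    return math.exp(avg_cross_entropy)

def uniform_loss(vocab_size):
    """Loss of a model that gives every token probability 1 / vocab_size."""
    return math.log(vocab_size)

print(perplexity(3.0))                   # about 20.09
print(uniform_loss(50000))               # about 10.82 nats
print(perplexity(uniform_loss(50000)))   # back to about 50000
```

A freshly initialized model should therefore start near a loss of log(vocab_size); a validation loss that stays there means nothing has been learned.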
An ablation study removes one component at a time to measure its contribution. This is how you build understanding of what each part does.
# Size sweep: run the same training with different configurations
# (a true ablation would instead remove one component, e.g. position embeddings)
configs = [
    {"d_model": 64,  "n_layers": 2, "n_heads": 2, "name": "tiny"},
    {"d_model": 128, "n_layers": 4, "n_heads": 4, "name": "small"},
    {"d_model": 256, "n_layers": 6, "n_heads": 8, "name": "medium"},
]

for cfg in configs:
    model = GPT(vocab_size, cfg["d_model"], cfg["n_layers"], cfg["n_heads"]).to(device)
    # Train and evaluate...
    ppl = compute_perplexity(model, val_loader, device)
    print(f"{cfg['name']}: Perplexity = {ppl:.2f}")
When training fails, it's usually one of these issues. Here's how to diagnose each:
optimizer.zero_grad() and observe gradient accumulation.
Before training on a large dataset, try overfitting on a single batch of 10-20 examples. If your model can't memorize 10 examples, something is fundamentally broken.
# Sanity check: overfit a single batch
single_batch = next(iter(train_loader))  # Get one batch
inputs, targets = single_batch
inputs, targets = inputs.to(device), targets.to(device)

for step in range(1000):
    logits = model(inputs)
    loss = compute_loss(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")
    if loss.item() < 0.01:
        print("SUCCESS: Model can memorize!")
        break
This test should take less than a minute. If loss doesn't go below 0.01 within 1000 steps, debug your model before wasting time on full training.
Scaling laws describe how model performance improves as you increase parameters, data, or compute. You can observe this empirically with small models:
# Run the same training with increasing model sizes
# Plot: d_model vs final perplexity
# Expect: perplexity decreases as d_model increases (diminishing returns)

# Also try: fixed model, varying training data size
# Expect: more data = lower perplexity (up to a point)
The key insight from scaling laws research: performance follows a power law with compute. Doubling compute (model size or training steps) gives you a predictable improvement in perplexity. This is why big tech companies keep building bigger models: the relationship is reliable.
Combine tokenization, data pipeline, embeddings, and training loop into one complete working script that generates text
Before writing code, trace the data flow. Each component you built in Days 1–6 occupies a specific slot in the pipeline:
Token IDs are split into (input, target) pairs and batched.
Every component is already implemented. Day 7 is about wiring them together correctly and verifying the combined system works.
Here is a single self-contained training script that combines every week-1 component. Read it as a checklist: each section maps to a day.
import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# ── 1. TOKENIZER (Day 1) ───────────────────────────────────────
def build_vocab(text):
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return ''.join(itos[i] for i in ids)

# ── 2. DATA PIPELINE (Day 2) ───────────────────────────────────
class TextDataset(Dataset):
    def __init__(self, ids, seq_len):
        self.ids = torch.tensor(ids, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        return len(self.ids) - self.seq_len

    def __getitem__(self, i):
        x = self.ids[i : i + self.seq_len]
        y = self.ids[i + 1 : i + self.seq_len + 1]
        return x, y

# ── 3. MODEL: embeddings + tiny transformer skeleton (Days 3 & 5)
# NOTE: no attention yet. Each block is a position-wise MLP, so a
# position sees only its own token and position index.
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, seq_len, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, x):
        B, T = x.shape
        pos = torch.arange(T, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)
        for block in self.blocks:
            h = h + block(h)  # residual connection
        return self.lm_head(self.ln_f(h))

# ── 4. TRAINING LOOP (Day 4) ───────────────────────────────────
def train(model, loader, optimizer, device, max_steps=3000):
    model.train()
    step = 0
    for epoch in range(999):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            step += 1
            if step % 500 == 0:
                print(f"step {step:,} loss {loss.item():.4f}")
            if step >= max_steps:
                return

# ── 5. GENERATION ─────────────────────────────────────────────
@torch.no_grad()
def generate(model, prompt_ids, itos, seq_len, n_tokens=200, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device  # build inputs on the model's device
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        # Keep only the last seq_len tokens as context
        ctx = torch.tensor([ids[-seq_len:]], dtype=torch.long, device=device)
        logits = model(ctx)[0, -1] / temperature
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1).item()
        ids.append(nxt)
    return decode(ids, itos)

# ── 6. MAIN ───────────────────────────────────────────────────
if __name__ == "__main__":
    # Load TinyShakespeare (download from karpathy/char-rnn repo)
    with open("input.txt") as f:
        text = f.read()

    stoi, itos = build_vocab(text)
    ids = encode(text, stoi)
    split = int(0.9 * len(ids))
    train_ids, val_ids = ids[:split], ids[split:]

    SEQ_LEN, BATCH = 64, 32
    train_ds = TextDataset(train_ids, SEQ_LEN)
    val_ds = TextDataset(val_ids, SEQ_LEN)
    train_dl = DataLoader(train_ds, batch_size=BATCH, shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyLM(vocab_size=len(stoi), d_model=128,
                   seq_len=SEQ_LEN, n_layers=4).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    print(f"vocab size: {len(stoi)} params: {sum(p.numel() for p in model.parameters()):,}")
    train(model, train_dl, opt, device, max_steps=3000)

    prompt = encode("HAMLET:\n", stoi)
    print(generate(model, prompt, itos, SEQ_LEN))
Before committing to a full training run, run a single forward pass and verify every tensor shape. This catches the majority of integration bugs in under a second.
# Shape smoke test: run before training
B, T = 4, SEQ_LEN
dummy_x = torch.zeros(B, T, dtype=torch.long).to(device)
logits = model(dummy_x)
print(logits.shape)  # expect torch.Size([4, 64, vocab_size])

dummy_y = torch.zeros(B, T, dtype=torch.long).to(device)
loss = nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), dummy_y.view(-1))
print(loss.item())   # expect ≈ log(vocab_size), ~4.2 for 65 chars
A randomly initialized model should produce a loss close to ln(vocab_size). For a 65-character vocabulary, that's ln(65) ≈ 4.17. If your initial loss is wildly different, there's a bug in the loss computation or model output shape.
If loss starts at 0, the target tensor is all zeros and you're accidentally predicting the padding token perfectly, or your labels and inputs are the same tensor. Check the offset in your TextDataset.__getitem__.
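With 3,000 steps on TinyShakespeare you should see loss drop steadily from ~4.2. A character-level model with attention typically reaches somewhere around 1.5–1.8. TinyLM above has no attention, so each position sees only its own token and position and the model behaves like a bigram predictor: expect its loss to level off around 2.5 instead. Here's what to look for at each checkpoint: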
If the loss is not decreasing, check: (1) Is optimizer.zero_grad() called before loss.backward()? (2) Is the model in model.train() mode? (3) Is the learning rate too low (1e-5 or below)? (4) Are all tensors on the same device?
The generate function above uses temperature sampling. Temperature controls how "sharp" the probability distribution is before sampling:
# temperature = 1.0 → unchanged distribution
# temperature < 1.0 → sharper, more repetitive but coherent
# temperature > 1.0 → flatter, more random and creative
for temp in [0.5, 0.8, 1.0, 1.2]:
    out = generate(model, prompt, itos, SEQ_LEN, n_tokens=100, temperature=temp)
    print(f"\n── temperature={temp} ──\n{out}")
At temperature=0.5 the model will repeat common phrases. At temperature=1.2 it will occasionally produce nonsense characters. The sweet spot for readability is usually 0.7–0.9.
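The effect of temperature on the distribution itself is easy to see on a toy example. The three logits below are made-up scores for a three-token vocabulary, and softmax_with_temperature is an illustrative helper, not part of the training script.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores; token 0 is the model's favorite
results = {}
for t in (0.5, 1.0, 1.2):
    results[t] = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in results[t]])
```

Lower temperatures push probability mass toward the top-scoring token, and higher temperatures spread it toward the unlikely ones, which is where the occasional nonsense characters come from.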
Replace torch.multinomial(probs, 1) with torch.argmax(logits).item(). Compare output to temperature=0.5 sampling. Does greedy produce better or worse text? Why might it produce repetitive loops?
Run train.py to completion and print the generated text. Then make three targeted changes and observe the effect: (1) double d_model from 128 to 256: does loss drop lower? (2) Set temperature=0.5 and temperature=1.2 and compare the outputs. (3) Change the prompt from "HAMLET:\n" to "OPHELIA:\n": does the model produce different-flavored text? Document your findings in a comment at the top of the file.
Build the GPT architecture: attention, layers, and forward pass
Week 2 Focus: The GPT architecture from scratch
Big picture overview, GPTConfig dataclass, and the forward pass skeleton
This lesson builds a GPTConfig dataclass to hold hyperparameters and a skeleton forward() method that shows the data flow. The key insight is that every design decision (d_model, n_heads, n_layers) directly impacts parameter count and compute cost. Understanding the skeleton helps you see where each component fits before implementing it.
GPT models have many hyperparameters: embedding dimension, number of heads, number of layers, vocabulary size, sequence length, dropout rate. Hardcoding these leads to messy code and makes experimentation painful. A dataclass centralizes all configuration in one place.
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class GPTConfig:
    """Configuration for a GPT model."""
    block_size: int = 1024   # Maximum sequence length (context window)
    vocab_size: int = 50257  # GPT-2 vocabulary size
    n_layer: int = 12        # Number of transformer blocks
    n_head: int = 12         # Number of attention heads
    n_embd: int = 768        # Embedding dimension (d_model)
    dropout: float = 0.1
    bias: bool = False       # Use bias in Linear/LayerNorm layers? (GPT-2 uses them; False is slightly leaner)

    # Derived property (computed from the fields above)
    @property
    def head_size(self):
        return self.n_embd // self.n_head  # d_k = d_model / n_heads

# Create a small GPT config for experimentation
config = GPTConfig(
    block_size=256,
    vocab_size=50257,
    n_layer=6,
    n_head=6,
    n_embd=384,
)
n_embd from 384 to 512 or 768. Notice how head_size changes automatically. What happens if you set n_head to 7 with n_embd=384? (Hint: 384 is not divisible by 7). This is why d_model is almost always a multiple of n_heads.
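The divisibility constraint can be checked before any model is built. The sketch below uses a hypothetical helper, head_size_or_error, which is not part of the curriculum code; it only illustrates why n_head=7 fails for n_embd=384.

```python
def head_size_or_error(n_embd: int, n_head: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven.

    Hypothetical helper for illustration; not part of GPTConfig.
    """
    if n_embd % n_head != 0:
        raise ValueError(
            f"n_embd={n_embd} is not divisible by n_head={n_head} "
            f"(remainder {n_embd % n_head})"
        )
    return n_embd // n_head


print(head_size_or_error(384, 6))   # 64
print(head_size_or_error(768, 12))  # 64

try:
    head_size_or_error(384, 7)
except ValueError as err:
    print(err)
```

Plain integer division would silently give head_size=54 here, and 7 * 54 = 378 would no longer cover all 384 embedding dimensions, which is why failing loudly is the safer design.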
The GPT class composes five main components: token embeddings, position embeddings, a stack of transformer blocks, a final LayerNorm, and an LM head (output projection). The forward pass flows: input IDs -> token_emb + pos_emb -> dropout -> blocks -> ln_f -> lm_head -> logits.
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        # Token embeddings: map token IDs to vectors
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd, bias=config.bias),
        ))

        # Language model head (projects to vocabulary)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (batch, sequence_length) - token IDs
        B, T = idx.shape
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"

        # Positional indices
        pos = torch.arange(T, device=idx.device)

        # Embeddings
        tok_emb = self.transformer.wte(idx)   # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)   # (T, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final layernorm
        x = self.transformer.ln_f(x)

        # Output projection (logits)
        logits = self.lm_head(x)              # (B, T, vocab_size)

        # Loss computation (if targets provided)
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss
The shape annotations in comments are crucial. They help you trace the tensor dimensions as data flows through the model. If you get a shape mismatch error, these annotations are your debugging guide.
GPT follows a specific architectural pattern that differs from the original Transformer encoder-decoder. Understanding WHY each piece is placed where it is helps you debug and modify the architecture.
GPT (decoder-only): Self-attention + FFN blocks only. No encoder. Causal masking ensures autoregressive generation. Simpler, and works very well for generation tasks.
Original Transformer (encoder-decoder): Has both an encoder (bidirectional attention) and a decoder (causal self-attention plus cross-attention to the encoder output). More complex, better for translation, overkill for language modeling.
GPTConfig dataclass and the GPT class skeleton. Print the model architecture using print(model). Count the parameters with sum(p.numel() for p in model.parameters()). Try different configs: a tiny model (n_embd=128, n_layer=2, n_head=4) vs a GPT-2 medium (n_embd=1024, n_layer=24, n_head=16).
QKV projections, scaled dot-product attention, causal masking, and why masking matters
CausalSelfAttention class with Query, Key, and Value projections, scaled dot-product attention, and the crucial causal mask that prevents the model from cheating by looking at future tokens. The causal mask is what makes the model autoregressive: it can only attend to past and current positions, never future ones.
Attention works by comparing each token's Query against all tokens' Keys to compute compatibility scores, then using those scores to take a weighted sum of the Values. Think of it as a soft dictionary lookup: "given my Query, which Keys are relevant, and what are their Values?"
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0, "n_embd must be divisible by n_head"

        # Single linear projection for Q, K, V (more efficient than separate ones)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        self.head_size = config.n_embd // config.n_head
self.c_attn.weight. It should be (3*n_embd, n_embd). Why do we project Q, K, V together? (Hint: it's one matrix multiply instead of three). What's the total parameter count of this layer for n_embd=768?
The attention formula is: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. The scaling by sqrt(d_k) prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
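The effect of the sqrt(d_k) scaling can be seen numerically. The following is a minimal sketch using only the standard library; it assumes query and key entries are independent unit-variance Gaussians, which is an idealization of what a trained model produces. Under that assumption the raw dot product has variance close to d_k, and dividing by sqrt(d_k) brings it back to about 1.

```python
import math
import random
import statistics

random.seed(0)


def dot_product_samples(d_k, n_samples=2000):
    """Dot products of random vectors with unit-variance Gaussian entries."""
    samples = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(qi * ki for qi, ki in zip(q, k)))
    return samples


for d_k in (4, 64, 256):
    raw = dot_product_samples(d_k)
    scaled = [s / math.sqrt(d_k) for s in raw]
    print(
        f"d_k={d_k:4d}  var(raw)={statistics.variance(raw):8.2f}  "
        f"var(scaled)={statistics.variance(scaled):5.2f}"
    )
```

Without the scaling, scores for d_k=64 would routinely reach magnitudes around 8 or more, and softmax over such values is nearly one-hot, which is where the small gradients come from.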
    def forward(self, x):
        # x: (batch, sequence_length, n_embd)
        B, T, C = x.size()

        # Project to Q, K, V and split
        qkv = self.c_attn(x)  # (B, T, 3 * n_embd)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape for multi-head: (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        # head_size = n_embd / n_head
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Compute attention scores: (B, n_head, T, T)
        # Each token's query compares against all tokens' keys
        att = (q @ k.transpose(-2, -1)) / (self.head_size ** 0.5)

        # APPLY CAUSAL MASK HERE (next step)

        # Softmax to get attention weights
        att = nn.functional.softmax(att, dim=-1)

        # Apply attention to values
        y = att @ v  # (B, n_head, T, head_size)

        # Re-assemble heads: (B, n_head, T, head_size) -> (B, T, n_embd)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # Final projection
        y = self.c_proj(y)
        return y
The transpose operations (transpose(1, 2)) are needed to get the shape (B, n_head, T, head_size) for parallel attention computation across heads. The final transpose back recombines the heads.
Without a causal mask, each token can attend to ALL tokens including future ones. This would let the model "cheat" during training by simply copying the next token's information. The causal mask ensures token i can only attend to tokens 0 through i.
# Create causal mask: True above the diagonal marks future positions
# Shape: (T, T) where causal_mask[i, j] is True if j > i and False if j <= i
causal_mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
# Masked scores become -inf, so softmax assigns them exactly 0 weight
att = att.masked_fill(causal_mask, float('-inf'))
After applying the mask and softmax, future positions have exactly 0 attention weight. The model cannot "see" tokens that haven't been generated yet, which is essential for autoregressive generation.
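A quick way to convince yourself of this is to run the mask on a tiny score matrix. The sketch below uses random numbers as a stand-in for one head's raw attention scores (T=4 is an arbitrary choice) and checks the properties claimed above.

```python
import torch

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)  # stand-in for one head's raw attention scores

# True above the diagonal marks "future" positions (j > i)
causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(weights)

# Every future position gets exactly zero weight ...
assert torch.all(weights[causal_mask] == 0)
# ... and each row is still a valid probability distribution.
assert torch.allclose(weights.sum(dim=-1), torch.ones(T))
# The first token can only attend to itself.
assert weights[0, 0].item() == 1.0
```

Note that the mask must be applied before softmax: zeroing weights afterwards would leave rows that no longer sum to 1.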
att[0].detach().cpu() for a small sequence.
CausalSelfAttention class. Test it with a random input tensor of shape (2, 16, 768). Print the attention weights for the first head and first batch. Verify that the upper triangle is all zeros (after softmax). Remove the causal mask and observe how the attention weights change: you should see non-zero values in the upper triangle.
Why one head isn't enough, splitting dimensions, and parallel computation across heads
Imagine trying to understand a sentence with only one type of attention. Should you attend to the subject? The verb? The object? The previous adjective? With a single head, you must average all these needs into one attention pattern. Multi-head lets each head specialize.
Multi-head: Head 1 might learn subject-verb relationships. Head 2 might track noun-adjective pairs. Head 3 might focus on positional proximity. All heads run in parallel at roughly the same cost as one full-width head.
Single-head: All relationship types compete for the same attention weights. The model must learn a "compromise" attention pattern that's suboptimal for any specific relationship type. Expressivity is severely limited.
The key insight: splitting d_model into n_heads chunks of size d_k means the score computation across all heads (n_heads * (T * d_k * T)) costs the same as for a single full-width head (T * d_model * T). You get more expressivity at essentially no extra matrix-multiply cost.
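The cost argument is simple arithmetic and can be checked directly. This sketch counts only the multiply-accumulates of the QK^T score computation (the helper name score_macs and the sizes are illustrative assumptions); it ignores softmax, which does grow with the number of heads because there are n_heads separate (T, T) matrices.

```python
def score_macs(T: int, width: int) -> int:
    """Multiply-accumulates to form a (T, T) score matrix from (T, width) Q and K."""
    return T * width * T


T, d_model, n_heads = 128, 768, 12
d_k = d_model // n_heads

single_head = score_macs(T, d_model)       # one head of width d_model
multi_head = n_heads * score_macs(T, d_k)  # n_heads heads of width d_k

print(single_head, multi_head)
assert single_head == multi_head
```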
Getting shapes right is 90% of the battle in multi-head attention. Here's the complete reshape sequence with explanations:
# Starting shape: (B, T, n_embd) where n_embd = n_head * head_size
# Example: (2, 16, 768) with n_head=12, head_size=64

# Step 1: Split embedding dimension into heads
# (B, T, n_embd) -> (B, T, n_head, head_size)
q = q.view(B, T, self.n_head, self.head_size)

# Step 2: Move head dimension before sequence for parallel computation
# (B, T, n_head, head_size) -> (B, n_head, T, head_size)
q = q.transpose(1, 2)

# Now we can do batch matrix multiply across heads (k reshaped the same way):
# (B, n_head, T, head_size) @ (B, n_head, head_size, T) -> (B, n_head, T, T)
scores = q @ k.transpose(-2, -1)
n_head=1 and observe the shapes. The transpose becomes a no-op. Then set n_head=768 (head_size=1). What happens to the attention scores shape? Try n_head=768 with a real input: does the model train? (Hint: head_size=1 is too small for meaningful attention).
Here's the complete forward method with proper multi-head handling, causal masking, and head recombination:
    def forward(self, x):
        B, T, C = x.size()

        # Project and split Q, K, V
        qkv = self.c_attn(x)  # (B, T, 3*C)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape for multi-head attention
        # (B, T, C) -> (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Scaled dot-product attention
        # (B, n_head, T, head_size) @ (B, n_head, head_size, T) -> (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_size ** 0.5))

        # Causal mask: prevent attending to future tokens
        # torch.triu creates upper triangular matrix
        causal_mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        att = att.masked_fill(causal_mask, float('-inf'))

        # Softmax to get attention weights (now upper triangle is 0)
        att = nn.functional.softmax(att, dim=-1)

        # Apply attention to values
        # (B, n_head, T, T) @ (B, n_head, T, head_size) -> (B, n_head, T, head_size)
        y = att @ v

        # Re-assemble: (B, n_head, T, head_size) -> (B, T, C)
        y = y.transpose(1, 2).contiguous()  # (B, T, n_head, head_size)
        y = y.view(B, T, C)                 # (B, T, n_embd)

        # Final projection + dropout
        y = self.c_proj(y)
        y = nn.functional.dropout(y, p=self.dropout, training=self.training)
        return y
The contiguous() call after transpose is important: transpose returns a view with rearranged strides, not a contiguous tensor. view() requires contiguous memory, so we call contiguous() first.
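The behaviour described here is easy to reproduce in isolation. In this small sketch (the sizes are arbitrary), view() fails on the transposed tensor, while contiguous().view() and reshape() both give the merged result.

```python
import torch

B, T, n_head, head_size = 2, 5, 4, 8
y = torch.randn(B, n_head, T, head_size)

yt = y.transpose(1, 2)     # (B, T, n_head, head_size), shares storage with y
print(yt.is_contiguous())  # False

try:
    yt.view(B, T, n_head * head_size)
except RuntimeError as err:
    print("view failed:", type(err).__name__)

merged = yt.contiguous().view(B, T, n_head * head_size)  # copy, then view
same = yt.reshape(B, T, n_head * head_size)              # reshape copies when it has to
assert torch.equal(merged, same)
```

reshape() is a convenient alternative, but writing contiguous().view() makes the memory copy explicit, which is presumably why implementations in this style prefer it.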
CausalSelfAttention implementation. Test with different n_head values: 1, 6, 12, 24. For each, print the shape of y to verify it's always (B, T, n_embd) regardless of head count. Time the forward pass with torch.cuda.Event if you have a GPU. Do more heads always mean a slower forward pass?
view + transpose) is the key to parallel attention across heads. Always verify tensor shapes at each step: (B, n_head, T, head_size) is your friend.
Feed-forward network, GELU vs ReLU, expansion ratio, and why we need it
The MLP serves a different purpose than attention. While attention mixes information across positions, the MLP processes each position independently, adding computational depth. The expansion to 4*d_model gives each position a wider hidden layer in which to compute nonlinear features, and the projection back to d_model returns the result to the residual stream's width. Note that this is the opposite of a bottleneck: the hidden layer is wider than its input and output.
class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
        # Note: GELU is applied in forward(), not stored as a Module

    def forward(self, x):
        # x: (B, T, n_embd)
        x = self.c_fc(x)            # (B, T, 4*n_embd)
        x = nn.functional.gelu(x)   # GELU activation
        x = self.c_proj(x)          # (B, T, n_embd)
        x = self.dropout(x)
        return x
ReLU (max(0, x)) was the standard activation, but it has problems: it's not smooth at 0, and neurons can "die" (always output 0) if weights push inputs consistently negative. GELU (Gaussian Error Linear Unit) is smoother and theoretically motivated by dropout.
GELU: Smooth everywhere, approximately x * sigmoid(1.702*x). Small negative values still pass through (stochastic regularization effect). Used in GPT-2, GPT-3, BERT. Better empirical performance.
ReLU: Not smooth at 0. "Dying ReLU" problem: if a neuron's weights push its input below 0 consistently, it never recovers. All negative values become exactly 0. Simpler, but generally a weaker choice for deep transformers.
# GELU implementation (PyTorch has it built-in)
# But here's what it looks like conceptually:
def gelu_approx(x):
    return x * torch.sigmoid(1.702 * x)

# Or the exact formulation (what PyTorch uses):
def gelu_exact(x):
    return 0.5 * x * (1.0 + torch.erf(x / (2.0 ** 0.5)))
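To get a feel for how close the sigmoid approximation is, and how GELU differs from ReLU near zero, the two formulas can be compared on plain floats. This is a standard-library sketch; the function names gelu_sigmoid and relu and the grid on [-4, 4] are illustrative choices, not part of the curriculum code.

```python
import math


def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x: float) -> float:
    # x * sigmoid(1.702 * x)
    return x / (1.0 + math.exp(-1.702 * x))


def relu(x: float) -> float:
    return max(0.0, x)


xs = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
print(f"max |exact - sigmoid approx| on [-4, 4]: {max_gap:.4f}")

for x in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"x={x:5.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}")
```

The table shows the qualitative difference: ReLU maps every negative input to exactly 0, while GELU lets small negative values through with a small negative output.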
The expansion ratio of 4 (d_ff = 4 * d_model) is a hyperparameter inherited from the original Transformer and widely used in GPT models. It's a sweet spot: too small and the MLP is a bottleneck that limits capacity; too large and you waste parameters on an overly wide network.
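For a transformer block with d_model=768 and expansion 4 (ignoring biases and LayerNorm): attention has 4 * 768^2 = 2,359,296 weights (1,769,472 for the QKV projection plus 589,824 for the output projection), and the MLP has 8 * 768^2 = 4,718,592 weights.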
The MLP accounts for ~2/3 of the block's parameters. This is typical and intentional: the MLP provides most of the model's capacity.
# Parameter count for MLP with expansion ratio E (biases ignored)
def mlp_params(d_model, expansion=4):
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model  # = 2 * expansion * d_model^2

print(f"{mlp_params(768, 4):,}")  # 4,718,592
print(f"{mlp_params(768, 2):,}")  # 2,359,296 (half the params)
print(f"{mlp_params(768, 8):,}")  # 9,437,184 (double the params)
MLP class. Create a GPT model and print the parameter count per component using sum(p.numel() for p in layer.parameters()). Verify that MLP params are roughly 2x the attention params. Try expansion ratios of 2, 4, and 8. How does this affect total model size and training speed?
Pre-norm vs post-norm, gradient flow, and why LayerNorm works better than BatchNorm here
Block class with pre-norm or post-norm ordering. The choice between pre-norm (LayerNorm before attention/MLP) and post-norm (LayerNorm after) dramatically affects training stability. Pre-norm, used in GPT-2 and later models, keeps the skip connection free of normalization, which provides better gradient flow and makes deeper networks easier to train.
BatchNorm normalizes across the batch dimension, which assumes a fixed batch size and similar statistics across samples. LayerNorm normalizes across the feature dimension (for each sample independently), making it batch-size-agnostic and better suited for sequences of varying length.
LayerNorm: Normalizes across features: LayerNorm(x) = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta, where mean and var are computed across the embedding dimension and eps is a small constant for numerical stability. Works with any batch size, any sequence length. No running statistics needed.
BatchNorm: Normalizes across the batch: requires tracking running statistics, behaves differently during training vs inference, struggles with small batches or variable-length sequences. Designed for CNNs, not transformers.
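The LayerNorm formula can be checked against PyTorch's built-in module. The sketch below uses a small random tensor (the shapes are arbitrary) and relies on the fact that a freshly constructed nn.LayerNorm has gamma=1 and beta=0, so the manual computation only needs the mean, the biased variance, and eps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8)  # (batch, seq_len, n_embd)
ln = nn.LayerNorm(8)      # gamma=1, beta=0 at initialization

# Manual version: statistics over the last (feature) dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as LayerNorm uses
manual = (x - mean) / torch.sqrt(var + ln.eps)

assert torch.allclose(manual, ln(x), atol=1e-5)

# Each token is normalized independently of the rest of the batch
assert torch.allclose(ln(x[:1]), ln(x)[:1])
```

The second assertion is the batch-size independence in action: normalizing one sample alone gives the same result as normalizing it inside a larger batch, which is not true of BatchNorm in training mode.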
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        # x: (B, T, n_embd)
        # Pre-norm: normalize BEFORE the sublayer
        x = x + self.attn(self.ln_1(x))  # Residual + Attention
        x = x + self.mlp(self.ln_2(x))   # Residual + MLP
        return x
The original Transformer used post-norm: x = ln(attention(x) + x). GPT-2 and later models switched to pre-norm: x = x + attention(ln(x)). Pre-norm has better gradient flow because the residual path is "cleaner": gradients can flow directly through the identity skip connection without passing through LayerNorm.
# Post-norm (original Transformer):
def forward_postnorm(self, x):
    x = self.ln_1(x + self.attn(x))  # LayerNorm AFTER residual
    x = self.ln_2(x + self.mlp(x))
    return x

# Pre-norm (GPT-2, more stable):
def forward_prenorm(self, x):
    x = x + self.attn(self.ln_1(x))  # LayerNorm BEFORE attention
    x = x + self.mlp(self.ln_2(x))
    return x
Pre-norm is now the default for most implementations because it allows training deeper networks without the optimization difficulties that post-norm can exhibit. The residual connection ensures that even if the attention or MLP output is garbage, the model can default to the identity function.
The residual connection x + sublayer(x) is crucial for deep networks. During backpropagation, the gradient can flow directly through the identity connection, avoiding the vanishing gradient problem that plagues deep networks.
Mathematically, if y = x + F(x), then dy/dx = I + dF/dx. The identity term I is always present, so even if F has tiny gradients (dF/dx close to 0), the gradient reaching x stays close to the gradient arriving at y instead of shrinking toward zero.
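A scalar toy model makes the effect of the identity term concrete. The sketch below is an idealization, not a simulation of a real transformer: each layer's F is assumed to be linear with the same small slope (local_grad, a hypothetical parameter), so the end-to-end gradient is just a product of identical per-layer derivatives.

```python
def end_to_end_gradient(n_layers: int, local_grad: float, residual: bool) -> float:
    """Gradient of the output w.r.t. the input for a stack of scalar layers.

    Each layer is y = F(x) (plain) or y = x + F(x) (residual), where F is a
    toy linear function with slope `local_grad`. By the chain rule, the
    end-to-end gradient is the product of the per-layer derivatives.
    """
    per_layer = (1.0 + local_grad) if residual else local_grad
    return per_layer ** n_layers


for n in (1, 12, 48):
    plain = end_to_end_gradient(n, 0.1, residual=False)
    skip = end_to_end_gradient(n, 0.1, residual=True)
    print(f"layers={n:2d}  plain={plain:.3e}  residual={skip:.3e}")
```

In a real network the derivatives are Jacobian matrices that vary per layer and per input, so the numbers differ, but the qualitative picture is the same: without the skip path the signal is multiplied by a small factor at every layer.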
# Visualizing the residual connection:
#
# Without residual:  x -> LayerNorm -> Attention -> output
#   Gradient must pass through all operations (can vanish)
#
# With residual:     x -> LayerNorm -> Attention -> + -> output
#                    |                              ^
#                    +------------------------------+
#   Gradient has a direct path through the skip connection
#
# This is why we can train 100+ layer transformers:
# each layer's skip connection provides a gradient superhighway
In the complete GPT model, the residual connections in each Block, combined with pre-norm, create a stable gradient flow that allows training models with billions of parameters.
Block class with pre-norm. Create a 12-layer GPT model. Visualize the computation graph or simply verify that the output shape matches the input shape: assert y.shape == x.shape. Remove the residual connections and observe how the model trains (or fails to train) on a simple task.
Sharing embeddings between input and output, lm_head, and the full forward pass
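In a typical language model, you have two vocabulary-sized matrices: an input embedding table that maps token IDs to vectors (vocab_size -> n_embd) and an output projection that maps vectors back to vocabulary logits (n_embd -> vocab_size).
As linear maps, these are transposes of each other. In PyTorch, both nn.Embedding(vocab_size, n_embd).weight and nn.Linear(n_embd, vocab_size).weight are stored with shape (vocab_size, n_embd), so weight tying can share the same matrix directly: lm_head.weight = wte.weight. This works because the operation is conceptually the same: measuring similarity between a vector and each token's embedding.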
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd, bias=config.bias),
        ))

        # Option 1: Separate lm_head (more parameters)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Option 2: Weight tying (share wte weights)
        # self.lm_head.weight = self.transformer.wte.weight  # Tie weights

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.config.block_size

        pos = torch.arange(T, device=idx.device)

        # Embeddings
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final LayerNorm
        x = self.transformer.ln_f(x)

        # Output projection to vocabulary logits
        if targets is not None:
            # During training: we need logits for all positions
            logits = self.lm_head(x)
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        else:
            # During inference: only need logits for the last position
            logits = self.lm_head(x[:, [-1], :])  # (B, 1, vocab_size)
            loss = None

        return logits, loss
Weight tying is motivated by the symmetry of the task. The input embedding asks: "What is the vector representation of token X?" The output embedding asks: "What is the probability of token X given this vector?" These are inverse operations, so sharing weights makes sense.
Mathematically, the output logits for position i are computed as:
# Without tying: W_out is a separate output-projection matrix
#   logits[i, j] = dot(x[i], W_out[j])   (+ b[j] if the head has a bias; ours uses bias=False)
#
# With tying: row j of the output matrix is token j's input embedding W_in[j]
#   logits[i, j] = dot(x[i], W_in[j])
# This is just measuring similarity between x[i] and token j's embedding
Weight tying is especially beneficial for smaller models where the embedding parameters are a significant fraction of total parameters. For GPT-3 scale models, the savings are less impactful but still meaningful.
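The parameter saving is easy to verify on a toy pair of layers. In the sketch below the sizes (vocab_size=1000, n_embd=64) are small illustrative values rather than GPT-2's, and the ModuleDict is only there so that parameters() can be called on both layers at once.

```python
import torch.nn as nn

vocab_size, n_embd = 1000, 64  # small illustrative sizes, not GPT-2's

wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# Both weights are stored as (vocab_size, n_embd), so they can be shared directly
assert wte.weight.shape == lm_head.weight.shape

lm_head.weight = wte.weight  # tie: both modules now hold the same Parameter
assert lm_head.weight is wte.weight

model = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # the shared Parameter is counted once
assert n_params == vocab_size * n_embd
```

Because the two modules share one Parameter object, gradients from both the embedding lookup and the output projection accumulate into the same tensor during training.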
With all components in place, the full forward pass is: input IDs → token embeddings + positional embeddings → dropout → N transformer blocks (each: pre-norm → attention → residual, pre-norm → MLP → residual) → final LayerNorm → output projection → logits.
The shape flow for a concrete example with batch=4, seq_len=128, n_embd=768:
idx: (4, 128) - token IDs
tok_emb: (4, 128, 768) - after wte
pos_emb: (128, 768) - broadcast to (4, 128, 768)
x after dropout: (4, 128, 768)
logits: (4, 128, 50257) - vocabulary-sized output
Each position's logits represent the model's prediction for the NEXT token at that position. The cross-entropy loss compares these predictions against the shifted target sequence.
generate() that takes input IDs and generates new tokens autoregressively. Test the full forward pass with a small batch. Verify shapes at each step. Enable weight tying and confirm that lm_head.weight is transformer.wte.weight returns True.
Parameter count analysis, testing the full model, and preparing for training
Let's count parameters for a GPT-2 small configuration: n_layer=12, n_head=12, n_embd=768, vocab_size=50257, block_size=1024.
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def gpt2_small_params():
    # Estimate ignores Linear biases and the per-block LayerNorms
    # Embeddings
    wte = 50257 * 768              # 38,597,376
    wpe = 1024 * 768               # 786,432

    # One transformer block
    # Attention:
    c_attn = 768 * (3 * 768)       # 1,769,472 (QKV projection)
    c_proj_attn = 768 * 768        # 589,824 (output projection)
    # MLP:
    c_fc = 768 * (4 * 768)         # 2,359,296
    c_proj_mlp = (4 * 768) * 768   # 2,359,296
    # LayerNorms: 2 * (2 * 768) = 3,072 per block (small, omitted here)
    block_params = c_attn + c_proj_attn + c_fc + c_proj_mlp  # 7,077,888

    # 12 blocks
    all_blocks = 12 * block_params # 84,934,656

    # Final LayerNorm
    ln_f = 2 * 768                 # 1,536

    # LM head (only adds parameters if it is NOT tied to wte)
    lm_head = 768 * 50257          # 38,597,376

    tied = wte + wpe + all_blocks + ln_f  # 124,320,000
    untied = tied + lm_head               # 162,917,376
    return tied, untied

tied, untied = gpt2_small_params()
print(f"GPT-2 Small estimated params (tied lm_head): {tied:,}")
print(f"GPT-2 Small estimated params (separate lm_head): {untied:,}")
Before training, verify that your model runs correctly with a forward pass using random inputs. Check output shapes, run a backward pass to verify gradients flow, and test generation.
import torch

# Create model
config = GPTConfig(vocab_size=50257, n_embd=768, n_layer=12, n_head=12, block_size=1024)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT(config).to(device)

# Test forward pass
B, T = 4, 128
idx = torch.randint(0, config.vocab_size, (B, T), device=device)
logits, loss = model(idx, targets=idx)  # Use same as targets for testing
print(f"Logits shape: {logits.shape}")  # Should be (4, 128, 50257)
print(f"Loss: {loss.item()}")

# Test backward pass
loss.backward()
print("Backward pass successful!")

# Check gradient norms (should be reasonable, not NaN or Inf)
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.detach().norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm}")
After Week 2, you have:
Week 3 will add the training loop: data loading, loss computation, optimizer setup (AdamW), learning rate scheduling, gradient clipping, and evaluation. Your model is ready to learn!
Teach your model language with data, loss, and optimization
Week 3 Focus: Teaching techniques
Next-token prediction, cross-entropy loss, and loss interpretation
Before we can train our model, we need to answer one fundamental question: what exactly are we trying to minimize? This is the training objective: the measure of how wrong the model is, and the signal that drives learning. For language models, the answer is elegantly simple: predict the next token. Every position in every training sequence becomes a supervised example where the context is the input and the following character is the label. Understanding this deeply, not just as a formula but as an intuition, is what separates someone who can run training from someone who can diagnose it when it goes wrong.
A language model is trained to predict the next token given all previous tokens in a sequence. This framing is called self-supervised learning: no human labeling is needed because the text itself provides its own labels. Given the sequence "hello", the model learns to predict "e" from "h", "l" from "he", "l" from "hel", and "o" from "hell", all from a single training example. A sequence of 256 characters gives the model 255 separate prediction tasks to learn from simultaneously.
Why is this powerful? Self-supervised learning unlocks the entire internet as training data. Manual annotation for next-token prediction would be impossible at scale: you'd need a human to label billions of individual token positions. Instead, any text file you can find becomes a valid training corpus with zero annotation cost.
The labels are simply the input sequence offset by one position: the label at position t is the token at position t+1. This is the "shift" you'll see in every data loader implementation:
# A single sequence becomes many prediction targets automatically
# Input sequence:  [h, e, l, l, o, ' ', w, o, r, l, d]
# Target sequence: [e, l, l, o, ' ', w, o, r, l, d, !]
#                   ↑ predict next char at every position in parallel
def get_input_and_labels(sequence):
    x = sequence[:-1]  # All tokens except the last (inputs)
    y = sequence[1:]   # All tokens except the first (labels, shifted by 1)
    return x, y        # Same length: every input position has a label
During training, the model sees all positions in parallel (thanks to the causal mask from Week 2), computing a prediction and a loss at every single position. This is why transformers are so data-efficient: one forward pass on a 256-token sequence gives you 256 gradient signals rather than just one.
Cross-entropy loss quantifies how surprised the model is by the correct answer. The model outputs a probability distribution over all vocabulary characters, and we measure how much probability mass it assigned to the actual next character. Mathematically: loss = -log(p_correct). If the model assigns 1% probability to the right character, -log(0.01) ≈ 4.6. If it assigns 50%, -log(0.5) ≈ 0.69. Perfect prediction would give loss 0.
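The numbers above can be reproduced with a few lines of standard-library Python. The helper below is a sketch of the per-position loss, computed from raw logits with the log-sum-exp trick; the three-element logits vector is a made-up example, not model output.

```python
import math


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# The probabilities quoted above: loss = log(1 / p_correct)
for p in (0.01, 0.5, 1.0):
    print(f"p_correct={p:.2f}  loss={math.log(1.0 / p):.3f}")

# A toy 3-token vocabulary where token 0 is the correct answer
logits = [2.0, 1.0, 0.1]
print(f"loss from logits: {cross_entropy(logits, 0):.4f}")
```

Working from logits rather than probabilities avoids taking the log of a value that has underflowed to zero, which is the reason loss functions such as PyTorch's F.cross_entropy take logits as input.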
Why cross-entropy specifically? It aligns exactly with maximum likelihood estimation: minimizing cross-entropy is mathematically equivalent to maximizing the probability the model assigns to the training data. It also has well-behaved gradients: when the model is very wrong, gradients are large; when it's nearly right, gradients are small. This helps updates settle down near the optimum instead of oscillating.
In practice we average the loss over every token in every sequence in the batch. This is why the logits need to be reshaped before passing to PyTorch's loss function:
import torch.nn.functional as F
# logits shape: (batch_size, seq_len, vocab_size)
# targets shape: (batch_size, seq_len)
# F.cross_entropy expects: (N, C) logits and (N,) targets
# So we flatten the batch and sequence dimensions together:
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),  # → (batch*seq_len, vocab_size)
    targets.view(-1)                   # → (batch*seq_len,)
)
# loss is now a single scalar: mean CE across all positions and batches
There is also a useful connection to perplexity: perplexity = e^loss. A loss of about 4.17 gives perplexity ≈ 65, which means the model is as uncertain as if it were choosing uniformly among 65 options, exactly the random baseline for a 65-character vocabulary. A loss of 1.0 gives perplexity ≈ 2.7, meaning the model is only as uncertain as if it were choosing uniformly among fewer than three options.
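The loss-to-perplexity conversion is a one-liner, shown here as a small sketch (the function name perplexity is just a convenience) together with the average probability assigned to the correct character, e^(-loss).

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)


vocab_size = 65
baseline = math.log(vocab_size)  # loss of a uniform guess over 65 characters

for loss in (baseline, 2.0, 1.0):
    print(
        f"loss={loss:.3f}  perplexity={perplexity(loss):6.2f}  "
        f"avg p(correct)={math.exp(-loss):.3f}"
    )
```

Perplexity and loss carry the same information; perplexity is often easier to reason about because it reads as an effective number of equally likely choices.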
Loss numbers are meaningless in isolation: they only make sense relative to your vocabulary size and your baseline. For a character-level model on TinyShakespeare (65 characters), here are the key milestones to know:
-ln(1/65)). The model has learned nothing. Training hasn't started or something is broken.
A concrete intuition for confidence: at loss=1.0, the model assigns on average e^(-1.0) ≈ 37% probability to the correct next character. At loss=2.0, it's e^(-2.0) ≈ 13.5%. At the random baseline of about 4.17, it's 1/65 ≈ 1.5%. Watch for train and val loss diverging: that's your first sign of overfitting.
Loss is a number, but training is about changing weights. The chain connecting them is backpropagation: once we have a loss scalar, PyTorch traces every operation that produced it and computes the gradient of the loss with respect to every learnable parameter. These gradients tell each weight which direction to move to reduce the loss.
High loss means large gradients: the model is very wrong and needs big corrections. Low loss means small gradients: the model is mostly right and only needs fine-tuning. This is why training curves have a characteristic shape: rapid initial improvement (large gradients pulling weights into roughly the right region), then slower refinement (small gradients making precise adjustments).
The three-line update you'll write in every training loop:
loss.backward() # Compute gradients for every parameter
optimizer.step() # Update each weight from its gradient (plain SGD: w -= lr * grad; AdamW rescales the step)
optimizer.zero_grad() # Clear gradients so they don't accumulate into next step
The order matters: zero_grad() must happen either before backward() or right after step(), never between them. If you forget it, gradients from previous batches pile up and your updates will be wildly too large.
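The accumulation behaviour is easy to observe on a two-element parameter. This is a minimal sketch with a made-up quadratic loss and plain SGD (chosen because its update is easy to verify by hand); it is not the optimizer setup used later in the course.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)


def loss_fn():
    return (w ** 2).sum()  # d(loss)/dw = 2w


loss_fn().backward()
first = w.grad.clone()  # gradient of one backward pass: 2w

# Forgetting zero_grad(): a second backward ADDS to the stored gradient
loss_fn().backward()
assert torch.equal(w.grad, 2 * first)

# Correct pattern: clear, recompute, then step
optimizer.zero_grad()
loss_fn().backward()
assert torch.equal(w.grad, first)
optimizer.step()
print(w)  # each weight moved by -0.1 * its gradient
```

The same accumulation is sometimes used on purpose: calling backward() on several small batches before a single step() simulates a larger batch size.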
import math; vocab_size = len(set(open('input.txt').read())); print(-math.log(1/vocab_size)). Write this number down: it's your starting line. Then add a one-liner to your training script that prints "Random baseline: {baseline:.4f}" before the first training step so you always know how far you've come.
Key Takeaway: Cross-entropy loss quantifies training progress. Character-level models with good performance hit 1.0-1.2 loss, while random guessing gives ~4.2.
Character-level tokenization, stoi/itos mappings, train/val split
The data pipeline is where most silent training bugs live. Before your model ever sees a weight update, you need to read raw text, convert every character to an integer, split the corpus into training and validation sets, and sample batches of correctly paired inputs and labels. Every step has a way to silently go wrong, and a wrong data pipeline will produce confidently incorrect training runs, with loss falling smoothly toward a wrong answer. Today you build and verify each piece before wiring them together.
Neural networks can't process raw text; they need numbers. Tokenization is the process of converting characters (or words, or subwords) into integer indices that can be looked up in an embedding table. At the character level, we collect every unique character in the dataset, sort them into a consistent order, and assign each one an integer from 0 to vocab_size - 1.
Why sort? Sorting guarantees reproducibility. Without it, Python's set() returns characters in an arbitrary order that varies between runs. Two training runs on the same data would produce different stoi mappings, making saved checkpoints incompatible with each other.
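A small check shows what sorting buys you. In this sketch, build_vocab is a hypothetical stand-alone helper and the two strings are arbitrary examples that contain the same characters in a different order.

```python
def build_vocab(text: str) -> dict:
    """Map each unique character to an ID, in sorted (code point) order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


a = build_vocab("to be or not to be")
b = build_vocab("be not or to be to")  # same characters, different order

print(a)
assert a == b                # the mapping depends only on the character set
assert list(a) == sorted(a)  # IDs follow sorted character order
```

With sorting, the mapping is a pure function of the set of characters, so a checkpoint trained in one run can be decoded correctly in another as long as the corpus contains the same characters.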
Why character-level for small datasets? Word-level tokenization needs enough examples of each word to learn its embedding reliably, typically thousands of occurrences. Character-level tokenization only needs a ~65-character vocabulary, so even 1MB of text gives you, on average, well over ten thousand examples per character. The tradeoff is longer sequences (characters instead of words), but this is manageable at our scale.
def load_data(path):
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Collect all unique characters and sort for reproducibility
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    # stoi: character → integer index (used during encoding)
    stoi = {ch: i for i, ch in enumerate(chars)}
    # itos: integer index → character (used during decoding/generation)
    itos = {i: ch for i, ch in enumerate(chars)}
    # Convenience wrappers so callers don't need to know stoi/itos directly
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: ''.join([itos[i] for i in ids])
    return text, stoi, itos, encode, decode, vocab_size
After calling this, encode("hello") gives you something like [32, 29, 36, 36, 39] and decode([32, 29, 36, 36, 39]) gives back "hello". Always verify that decode(encode(text)) == text before proceeding. If this fails, or if encode raises a KeyError, there are characters in your data that aren't in your vocabulary.
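The round-trip check can be written as a small standalone test. This sketch rebuilds a vocabulary from a short made-up corpus instead of calling load_data, and the unknown_chars helper is an illustrative addition rather than part of the lesson code.

```python
def build_vocab(text):
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos

def unknown_chars(s, stoi):
    # Characters in s that the vocabulary cannot encode
    return sorted(set(s) - set(stoi))

corpus = "hello world"
stoi, itos = build_vocab(corpus)
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Round trip on the corpus itself must be lossless
assert decode(encode(corpus)) == corpus

# Text from outside the corpus may contain characters the vocab lacks
print(unknown_chars("hello, World!", stoi))  # ['!', ',', 'W']
```

Checking a prompt with a helper like this before encoding gives a readable report instead of a KeyError deep inside a list comprehension.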
Once we have our encode function, we convert the entire text corpus into a single long tensor of integers. This is the format our model will consume throughout training. We then split it 90% / 10% into training and validation sets.
Why split sequentially, not randomly? For text, shuffling individual characters before splitting would leak information: the validation set would contain characters from the middle of training sequences, and the model would have implicitly "seen" the surrounding context during training. A clean sequential split ensures the validation set is genuinely unseen text.
import torch
# Encode the full text to a 1D integer tensor
# dtype=torch.long (int64) because embedding layers expect integer indices
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Dataset: {len(data):,} tokens, vocab size: {vocab_size}")
# e.g. "Dataset: 1,115,394 tokens, vocab size: 65"
# Sequential 90/10 split
split_idx = int(0.9 * len(data))
train_data = data[:split_idx] # ~1,003,854 tokens
val_data = data[split_idx:] # ~111,540 tokens
We don't train on the full dataset at once; memory and compute don't allow it. Instead, we sample small random batches of fixed-length subsequences. Each batch contains batch_size sequences of length context_len, drawn from random positions in the data. The label for each input sequence is the same window advanced by one position, so the target at each position is the token that follows the corresponding input token.
Why random positions? If we always sampled in order, the model would see the beginning of the text thousands of times before the end appears once. Random sampling gives every region of the corpus the same chance of appearing in each batch, making training more stable and preventing the model from learning position-specific patterns.
def get_batch(data, batch_size, context_len, device):
    # Sample batch_size random starting positions
    # Stop context_len from the end so we can always get a full sequence
    idx = torch.randint(len(data) - context_len, (batch_size,))
    # x: input sequences, shape (batch_size, context_len)
    x = torch.stack([data[i: i+context_len] for i in idx])
    # y: target sequences, shifted by 1, same shape
    y = torch.stack([data[i+1: i+context_len+1] for i in idx])
    # Move to the correct device before returning
    return x.to(device), y.to(device)

# Sanity check: y should be x shifted left by one position
x, y = get_batch(train_data, batch_size=4, context_len=8, device='cpu')
print(x[0]) # e.g. tensor([32, 29, 36, 36, 39, 1, 47, 53])
print(y[0]) # e.g. tensor([29, 36, 36, 39, 1, 47, 53, 56])
assert (x[0][1:] == y[0][:-1]).all(), "Labels must be inputs shifted by 1!"
Run that assertion every time you change the data pipeline. The most common bug in data loading is accidentally setting y = data[i:i+context_len] (same as x) instead of y = data[i+1:i+context_len+1]. The model will train, loss will drop, but it's learning to copy its input rather than predict the next token: a subtle and completely silent failure.
Download the dataset (wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). Implement load_data and get_batch from today's lesson. Print the vocab size, dataset length, and the shapes of one batch. Then verify that y[0] equals x[0] shifted left by one position. If they don't match, your labels are wrong and nothing will train correctly.
Key Takeaway: Character-level tokenization is ideal for small datasets. Always split into train/val sets to monitor overfitting.
MPS vs CUDA vs CPU, get_batch function, tensor shapes
Your model and its data must live on the same device: a matrix multiplication between a tensor on CPU and one on GPU raises an immediate error. More importantly, the difference between training on CPU vs a GPU is not marginal: a 10M parameter model that takes 3 hours on Apple Silicon M3 would take days on a laptop CPU. Day 17 wires together device detection with the encoding pipeline from Day 16 to produce a complete, device-aware training setup, and gives you the tools to verify everything is on the right device before launching a real run.
A "device" in PyTorch refers to a specific memory space and compute unit. A CPU tensor lives in your machine's RAM and is computed by your CPU cores. A CUDA tensor lives in your NVIDIA GPU's VRAM and is computed in parallel across thousands of GPU cores. An MPS tensor does the same on Apple Silicon's GPU. You can't mix them in a single operation.
Why does this matter for speed? Transformer training is dominated by large matrix multiplications. A CPU runs these serially across a handful of cores. A GPU runs them across thousands of cores simultaneously. For a model with ~10M parameters, this translates to a 20–50x speedup in practice. The time difference between training runs is what makes GPU selection worth thinking about before you start.
We prioritize: CUDA (NVIDIA) → MPS (Apple Silicon) → CPU (fallback for testing and debugging).
import torch
def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda')
    elif torch.backends.mps.is_available(): # Apple M1/M2/M3/M4
        device = torch.device('mps')
    else:
        device = torch.device('cpu')
    print(f"Using device: {device}")
    return device
# Verify your model is on the right device after moving it:
# model.to(device)
# print(next(model.parameters()).device) # Should match get_device() output
MPS caveat: Not all PyTorch operations are supported on MPS. By default an unsupported op raises a NotImplementedError. If you set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1, unsupported ops run on CPU instead, which can make timing benchmarks misleading. If you see unexpected slowness on MPS, check whether that variable is set and look for fallback warnings in the output, and run torch.backends.mps.is_built() to confirm MPS support is compiled in. (PYTORCH_MPS_HIGH_WATERMARK_RATIO controls the MPS memory limit; it does not report fallbacks.)
When we encode text to a tensor, we must use dtype=torch.long (64-bit integer). This is required because our tokens are indices: they get passed to an embedding layer which performs a table lookup, and embedding layers require integer inputs. If you accidentally use torch.float, you'll get a cryptic "expected Long" error at training time.
device = get_device()
# Encode: text string → 1D integer tensor (stays on CPU for now)
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Full dataset: {data.shape}, dtype: {data.dtype}")
# Full dataset: torch.Size([1115394]), dtype: torch.int64
# Sequential train/val split (do NOT shuffle text data)
split_idx = int(0.9 * len(data))
train_data = data[:split_idx]
val_data = data[split_idx:]
# Note: the full dataset stays on CPU here.
# Individual batches are moved to device inside get_batch() via .to(device).
# This avoids requiring GPU memory for the full dataset at once.
Keeping the full dataset on CPU and only moving individual batches to GPU is important for memory management. A 1MB text file encodes to ~1M tokens, which is ~8MB as int64: small enough for CPU RAM, but you're likely running other processes too. On the GPU, you want memory reserved for model weights, activations, and gradients, not static data that can be streamed in batch by batch.
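The memory figures above are simple arithmetic, sketched here with the dataset size printed earlier. The batch_size and context_len values are hypothetical training settings chosen only for illustration.

```python
num_tokens = 1_115_394        # dataset size printed earlier
bytes_per_token = 8           # torch.long is a 64-bit integer

total_bytes = num_tokens * bytes_per_token
print(f"{total_bytes:,} bytes = {total_bytes / 1e6:.1f} MB")  # 8,923,152 bytes = 8.9 MB

# One batch is tiny by comparison: inputs plus targets
batch_size, context_len = 64, 256   # hypothetical training settings
batch_bytes = 2 * batch_size * context_len * bytes_per_token
print(f"{batch_bytes:,} bytes per batch")  # 262,144 bytes per batch
```

At this scale either placement would fit, but the habit of streaming batches pays off as soon as the corpus grows.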
Before committing to a multi-hour training run, spend 5 minutes verifying your pipeline end-to-end. These checks catch the most common setup errors without wasting compute:
import math
import torch
import torch.nn.functional as F

# 1. Verify device is correct
device = get_device()
model = GPT(config).to(device)
print(next(model.parameters()).device) # Must match device
# 2. Verify data shapes
x, y = get_batch(train_data, batch_size=4, context_len=64, device=device)
print(x.shape, x.device) # torch.Size([4, 64]), cuda:0 (or mps, or cpu)
print(y.shape, y.device) # torch.Size([4, 64]), same device
# 3. Verify a single forward pass runs without errors
with torch.no_grad():
    logits = model(x)
    print(logits.shape) # (4, 64, vocab_size)
    # 4. Verify the initial loss is near the random baseline
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    print(f"Initial loss: {loss.item():.4f} (expect ~{math.log(vocab_size):.2f})")
That last check is especially valuable: an untrained model with random weights should assign roughly equal probability to every character, giving a loss very close to -ln(1/vocab_size) = ln(vocab_size), which is about 4.17 for a 65-character vocabulary. If your initial loss is significantly lower than the random baseline, something is wrong: likely your model or data has been accidentally pre-fitted, or there's a label leakage bug. If it's higher than the baseline, check your loss function is averaging correctly across all positions.
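The baseline is easy to compute by hand. The sketch below evaluates the cross-entropy of a uniform prediction for a few vocabulary sizes; 65 matches the character vocabulary used here, while 256 (raw bytes) and 50257 (the GPT-2 BPE vocabulary) are included only for comparison.

```python
import math

def uniform_cross_entropy(vocab_size):
    # Cross-entropy when every class gets probability 1/vocab_size
    p = 1.0 / vocab_size
    return -math.log(p)

for v in (65, 256, 50257):
    print(v, round(uniform_cross_entropy(v), 2))
```

For vocab_size=65 this prints 4.17, the number to compare against the initial loss from the check above.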
Add get_device() to your script and confirm it prints the right device. Then benchmark a single forward pass: create a dummy batch at your intended dimensions and use import time; t=time.time(); [model(x) for _ in range(10)]; print((time.time()-t)/10). Multiply by your planned number of steps to estimate total training time before committing to a full run.
Key Takeaway: Always use GPU/MPS when available for 20–50x faster training. Proper tensor shape management avoids runtime errors.
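A slightly more careful version of that benchmark is sketched below. GPU work is queued asynchronously, so a wall-clock timer can stop before the device has finished; the optional sync argument is a hook for a call such as torch.cuda.synchronize on CUDA. The helper names seconds_per_call and estimate_hours are hypothetical, and the workload is a stand-in for model(x).

```python
import time

def seconds_per_call(fn, n_calls=10, sync=None):
    # sync is an optional callable that waits for queued device work to finish
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / n_calls

def estimate_hours(sec_per_step, total_steps):
    return sec_per_step * total_steps / 3600

# Stand-in workload; in practice fn would be something like lambda: model(x)
per_call = seconds_per_call(lambda: sum(range(10_000)))
print(f"{per_call:.6f} s per call")
print(f"{estimate_hours(0.5, 5000):.2f} h for 5000 steps at 0.5 s/step")  # 0.69 h
```

A forward-only timing understates a full training step, which also includes the backward pass and the optimizer update, so treat the estimate as a lower bound.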
Warmup phase, cosine decay, get_lr function
The learning rate is the single most consequential hyperparameter in training. Too high and the model diverges: gradients become enormous, weights fly off to infinity, and your loss hits NaN. Too low and training crawls, getting stuck in shallow local minima, wasting hours of compute on negligible improvement. A fixed learning rate forces you to choose between these extremes. A schedule lets you have it both ways: aggressive steps early when you need rapid progress, precise steps late when you need to converge. Today you implement the warmup + cosine decay schedule used in GPT-3 and most modern transformer training runs.
At the start of training, the model's weights are randomly initialized and the gradients are noisy: they point in roughly the right direction but with a lot of variance. Using a large constant LR here leads to overshooting: the optimizer steps too far, bounces around the loss landscape, and can diverge entirely (loss → NaN or infinity). Using a small constant LR avoids this but then converges too slowly throughout the entire run.
The insight behind LR scheduling is that the two phases of training have different needs: early training needs large steps for rapid progress (once warmup has tamed the initial noise), while late training needs small, precise steps to converge.
A practical way to see this: with max_lr=3e-4 and your typical 5,000-step training run, a constant LR of 3e-4 often diverges in the first 100 steps. Drop it to 3e-5 and it's stable but barely moves. The schedule gives you 3e-4 where it's safe (after warmup) and backs off smoothly as precision matters.
The schedule has three regions. During warmup, LR rises linearly from 0 to max_lr. After warmup, it follows a cosine curve down to a minimum LR (typically 10% of max). After max_steps, it holds at the minimum.
Why cosine specifically? The cosine curve is flat near both ends and steepest in the middle. Right after warmup the LR stays close to max_lr, so the model keeps making fast progress while it still has a lot to improve. The largest reductions come mid-run, and the curve flattens again as it approaches min_lr, so the final steps make only gentle changes that avoid disrupting a nearly converged model.
import math

def get_lr(step, warmup_steps=100, max_lr=3e-4, max_steps=5000, min_lr=3e-5):
    # Phase 1: Linear warmup from 0 → max_lr over warmup_steps
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)
    # Phase 3: Hold at min_lr after decay is complete
    if step >= max_steps:
        return min_lr
    # Phase 2: Cosine decay from max_lr → min_lr
    # decay_ratio goes from 0.0 (start of decay) to 1.0 (end of decay)
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    # coeff goes from 1.0 → 0.0 following a cosine curve
    coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
    return min_lr + coeff * (max_lr - min_lr)
Warmup length rule of thumb: 1–2% of total training steps. For a 5,000-step run, that's 50–100 warmup steps. For a 50,000-step run, 500–1,000. Too short a warmup and you get early instability; too long and you waste steps at low LR when you could be making progress.
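Spot-checking a few steps by hand is a quick way to confirm the schedule before plotting it. The sketch below repeats get_lr from the lesson so that it runs on its own, and adds a hypothetical warmup_range helper that encodes the 1–2% rule of thumb.

```python
import math

def get_lr(step, warmup_steps=100, max_lr=3e-4, max_steps=5000, min_lr=3e-5):
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
    return min_lr + coeff * (max_lr - min_lr)

def warmup_range(total_steps):
    # 1-2% of total steps, using integer arithmetic to avoid float rounding
    return total_steps // 100, total_steps // 50

# Start, mid-warmup, peak, midpoint of decay, end of decay, past the end
for step in (0, 50, 100, 2550, 5000, 6000):
    print(step, f"{get_lr(step):.2e}")

print(warmup_range(5000))   # (50, 100)
print(warmup_range(50000))  # (500, 1000)
```

The midpoint of the decay (step 2550 with these defaults) should land exactly halfway between max_lr and min_lr, at 1.65e-4, which is a convenient value to check against.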
PyTorch optimizers take their LR at initialization. To change it mid-training with a hand-written schedule, you modify the optimizer's param_groups directly. This must happen before optimizer.step(); doing it at the start of each training step, before the forward pass, is the simplest way to guarantee that:
for step in range(max_steps):
    # 1. Update LR for this step (before the forward pass)
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # 2. Standard training step
    x, y = get_batch(train_data, batch_size, context_len, device)
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Confirm schedule is working: print LR every 100 steps
    if step % 100 == 0:
        actual_lr = optimizer.param_groups[0]['lr']
        expected_lr = get_lr(step)
        print(f"Step {step}: LR={actual_lr:.2e} (expected {expected_lr:.2e})")
Verifying that the actual LR in optimizer.param_groups[0]['lr'] matches your get_lr(step) output is a quick sanity check that the schedule is actually being applied. It's easy to accidentally update the wrong variable or miss a step, and printing this every 100 iterations costs nothing and catches schedule bugs immediately.
Copy get_lr() into a notebook. Generate LR values for steps 0–2000 with warmup_steps=500 and plot them: plt.plot(steps, [get_lr(s, warmup_steps=500) for s in steps]). Add a vertical dashed line at step 500. The curve should rise linearly then arc down smoothly. If it doesn't, your formula has a bug. Fix it now before wiring it into your training loop, where a broken schedule is nearly impossible to debug.
Key Takeaway: Warmup + cosine decay stabilizes training and improves convergence. Never use constant LR for transformer models.
AdamW vs SGD, weight decay, betas, why AdamW for transformers
Gradient descent points us in the right direction (downhill on the loss surface), but it doesn't tell us how far to step, or whether to step equally in every direction. The optimizer makes these decisions. SGD takes the same-sized step for every parameter based only on the current gradient. AdamW keeps a running history of gradients and their magnitude, effectively building a per-parameter model of the loss surface curvature. This makes it dramatically more effective for transformers, where different parameters (embeddings, attention weights, MLP weights) have very different gradient statistics. AdamW is the default optimizer for most large language models, and understanding why helps you tune it confidently.
Adam maintains two exponential moving averages for each parameter: m (the first moment: the gradient direction) and v (the second moment: the squared gradient magnitude). The update for each weight uses the ratio m / sqrt(v), which normalizes the gradient by its recent magnitude. This means parameters that receive large gradients take smaller effective steps, and parameters with small gradients take larger steps relative to their signal.
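A scalar sketch of one Adam update makes the normalization visible. It follows the update described above and adds the bias-correction terms from the Adam algorithm (which compensate for m and v starting at zero); the function name adam_step and the example gradients are illustrative.

```python
import math

def adam_step(grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v

# Two parameters whose gradients differ by a factor of 10,000
for grad in (100.0, 0.01):
    update, _, _ = adam_step(grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad:>6}: Adam step={update:.2e}, SGD step={3e-4 * grad:.2e}")
```

Both Adam steps come out at roughly lr, while the plain SGD steps differ by the same factor of 10,000 as the gradients.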
Why this matters for transformers: Embedding parameters only receive gradient updates for the specific tokens present in a batch, so most of the embedding table gets zero gradient every step. SGD treats these sparse updates the same as dense ones, leading to inconsistent effective learning rates. Adam's per-parameter running averages adapt naturally to this sparsity: rarely-updated embeddings accumulate gradient history slowly and receive proportionally larger updates when they do appear.
SGD rule of thumb: it can work for convolutional networks where all parameters receive dense gradients every step. For transformers, you should always use Adam or a derivative, because training time and final quality both suffer significantly with SGD.
Adam implementations traditionally applied weight decay as L2 regularization, adding weight_decay × weight to the gradient before the adaptive update. This caused a subtle interaction: the decay term was scaled by the per-parameter adaptive LR, making it inconsistent, so parameters with large gradient variance got less regularization than intended. AdamW fixes this by applying weight decay directly to the weights, independently of the gradient update:
# Adam (with L2 regularization, the "wrong" way):
# gradient = gradient + weight_decay * weight ← mixes with adaptive LR
# AdamW (decoupled weight decay, the "right" way):
# weight = weight - lr * gradient_update ← adaptive gradient step
# weight = weight * (1 - lr * weight_decay) ← separate shrinkage term
from torch.optim import AdamW
optimizer = AdamW(
    model.parameters(),
    lr=3e-4, # Initial LR; will be overridden by get_lr() schedule
    betas=(0.9, 0.95), # beta1=0.9: gradient momentum; beta2=0.95: squared gradient momentum
    weight_decay=0.1 # Penalizes large weights to prevent overfitting
)
What beta1=0.9 means: the first moment (gradient direction) is 90% old estimate + 10% new gradient. This smooths out noisy batch-to-batch gradient variation. What beta2=0.95 means: the second moment (gradient magnitude) is averaged over a longer window (95% old + 5% new) because variance estimates are noisier and need more history to stabilize. GPT-3 used beta2=0.95; PyTorch's default of 0.999 is often too slow to adapt for transformer training.
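One way to build intuition for these values is to ask how much history each moving average effectively remembers. The sketch below uses the common approximation that an exponential moving average with decay beta spans about 1 / (1 - beta) steps; the helper names are hypothetical.

```python
def effective_window(beta):
    # An EMA with decay beta averages over roughly 1 / (1 - beta) recent steps
    return 1.0 / (1.0 - beta)

def ema_after_shift(beta, steps):
    # Start with an EMA of 0.0, then feed a constant signal of 1.0
    ema = 0.0
    for _ in range(steps):
        ema = beta * ema + (1 - beta) * 1.0
    return ema

for beta in (0.9, 0.95, 0.999):
    window = round(effective_window(beta))
    caught_up = ema_after_shift(beta, 50)
    print(f"beta={beta}: window ~{window} steps, after 50 steps EMA={caught_up:.3f}")
```

With beta=0.999 the average has barely moved 50 steps after the signal changes, which illustrates why a smaller beta2 reacts faster when gradient statistics shift during training.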
Weight decay=0.1: at each step, every weight shrinks by lr × weight_decay = 3e-4 × 0.1 = 3e-5 of its current value. This continuously pushes weights toward zero, preventing any single weight from growing too large. The gradient signal must overcome this shrinkage to maintain a weight's magnitude, which means only useful, well-supported weights stay large.
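The cumulative effect of that per-step shrinkage can be worked out directly. This sketch assumes a constant learning rate and a zero gradient term, which isolates the decay; with the cosine schedule the actual shrinkage per step varies with the LR.

```python
import math

lr, weight_decay = 3e-4, 0.1
shrink = 1 - lr * weight_decay      # per-step multiplicative factor

def decayed_weight(w0, steps):
    # Weight after `steps` updates when the gradient term is zero
    return w0 * shrink ** steps

print(f"per-step shrink: {1 - shrink:.1e}")          # 3.0e-05
print(f"after 10,000 steps: {decayed_weight(1.0, 10_000):.4f}")
# Close to exp(-lr * weight_decay * steps) when lr * weight_decay is small
print(f"approximation:      {math.exp(-lr * weight_decay * 10_000):.4f}")
```

A weight that receives no supporting gradient loses about a quarter of its magnitude over 10,000 steps under these settings, which is the pressure the gradient signal has to overcome.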
Not all parameters should have weight decay. Bias terms and LayerNorm parameters are typically excluded: they're 1D tensors that don't benefit from magnitude regularization and can be actively harmed by it (shrinking a bias term that's learned to be +5.0 is counterproductive). The standard practice from GPT papers is to separate parameters into two groups:
def configure_optimizer(model, lr, weight_decay):
    # Separate parameters: apply weight decay only to 2D+ tensors (weights, not biases/norms)
    decay_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and p.dim() >= 2]
    no_decay_params = [p for n, p in model.named_parameters()
                       if p.requires_grad and p.dim() < 2]
    param_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': no_decay_params, 'weight_decay': 0.0},
    ]
    return AdamW(param_groups, lr=lr, betas=(0.9, 0.95))

optimizer = configure_optimizer(model, lr=3e-4, weight_decay=0.1)
# decay_params / no_decay_params are local to configure_optimizer,
# so read the two groups back from the optimizer itself
decay_group, no_decay_group = optimizer.param_groups
print(f"Decay params: {sum(p.numel() for p in decay_group['params']):,}")
print(f"No-decay params: {sum(p.numel() for p in no_decay_group['params']):,}")
For a 10M parameter model, you'll typically find that 99%+ of parameters are in the decay group (all those weight matrices) and only a small fraction (biases, LayerNorm scales and shifts) are in the no-decay group. This split has a measurable effect on training stability and final model quality, particularly for smaller models where regularization pressure is more significant relative to the overall parameter count.
Print sum(p.numel() for p in model.parameters()) to confirm the total parameter count the optimizer will manage. Then run exactly 10 training steps and print the loss after each one. Loss should trend downward, though individual steps can tick up because every batch is different. If it's flat or rising, check that you're calling optimizer.zero_grad() before loss.backward(), not after.
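The 99%+ figure can be sanity-checked without building a model. The sketch below applies the same dimension rule to a table of hypothetical parameter shapes for one transformer block with embed_dim=128; the parameter names and shapes are assumptions for illustration, not the lesson's actual model.

```python
from math import prod

# Hypothetical parameter shapes for one small transformer block (embed_dim=128)
shapes = {
    "attn.qkv.weight": (384, 128),
    "attn.qkv.bias": (384,),
    "attn.proj.weight": (128, 128),
    "attn.proj.bias": (128,),
    "mlp.fc.weight": (512, 128),
    "mlp.fc.bias": (512,),
    "mlp.proj.weight": (128, 512),
    "mlp.proj.bias": (128,),
    "ln1.weight": (128,),
    "ln1.bias": (128,),
    "ln2.weight": (128,),
    "ln2.bias": (128,),
}

# Same rule as configure_optimizer: tensors with 2+ dimensions get weight decay
decay = sum(prod(s) for s in shapes.values() if len(s) >= 2)
no_decay = sum(prod(s) for s in shapes.values() if len(s) < 2)
total = decay + no_decay
print(f"decay: {decay:,} ({decay / total:.1%})")          # decay: 196,608 (99.2%)
print(f"no decay: {no_decay:,} ({no_decay / total:.1%})")  # no decay: 1,664 (0.8%)
```

The matrices scale with the square of the embedding dimension while biases and norm parameters scale linearly, so the decay group's share only grows as the model gets wider.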
Key Takeaway: AdamW with weight decay is the go-to optimizer for transformers. It balances fast convergence and overfitting prevention.
Eval mode, gradient clipping, checkpointing, sample generation
Every piece from this week (data loading, device setup, the LR schedule, the optimizer) comes together in the training loop. The loop itself is short: it fits on one screen. But the order of operations is strict, and violating it produces bugs that are often silent. Loss may still drop, but slower than it should, or toward the wrong answer. Today you build the full loop correctly, understand exactly why each line is where it is, and add the validation and checkpointing logic that transforms a training script into something you can actually trust.
Each training step must follow this exact sequence. Every deviation is either an error or a deliberate optimization that you need to understand first:
for step in range(total_steps):
    # 1. Set LR for this step
    lr = get_lr(step)
    for pg in optimizer.param_groups:
        pg['lr'] = lr
    # 2. Get a batch
    x, y = get_batch(train_data, batch_size, context_len, device)
    # 3. Zero gradients BEFORE the forward pass
    optimizer.zero_grad()
    # 4. Forward pass
    logits = model(x)
    # 5. Compute loss
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    # 6. Backward pass: compute gradients
    loss.backward()
    # 7. Clip gradients to prevent explosions
    # clip_grad_norm_ scales ALL gradients down if their collective L2 norm > 1.0
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    # 8. Update weights
    optimizer.step()
Why does zero_grad() placement matter? PyTorch accumulates gradients: calling backward() adds new gradients on top of existing ones. If you forget to zero before each step, gradients from 10 previous batches pile up and your effective update is 10x too large. The resulting instability can look like a bad learning rate, which is a subtle misdirection.
Why gradient clipping? During certain unlucky batches, the loss can be very large, producing enormous gradients that would cause an enormous weight update. Gradient clipping rescales the entire gradient vector if its L2 norm exceeds a threshold (1.0 is standard). This doesn't change the direction of the update, only its maximum magnitude, keeping training numerically stable without meaningfully slowing learning.
Validation runs after each epoch (or every N steps on longer runs). The critical difference from training: you use model.eval() and torch.no_grad(). These serve different purposes. eval() switches dropout to pass-through mode (no random dropping during validation, because we want deterministic outputs). torch.no_grad() tells PyTorch not to build a computation graph for these forward passes, saving memory and compute since we won't be backpropagating.
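The rescaling itself is only a few lines. The sketch below is a plain-Python illustration of clipping by global L2 norm over lists of numbers, written to mirror the behavior described above (one scale factor for all gradients, capped at 1.0); it is not the PyTorch implementation, and the eps value is an assumption.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # grads: list of gradient vectors, one per parameter tensor (flattened)
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale factor is capped at 1.0, so small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]            # global L2 norm = 13
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))  # 13.0 1.0

# Direction is preserved: the ratio between components is unchanged
print(clipped[0][0] / clipped[0][1])
```

Because every component is multiplied by the same factor, the update points the same way as before and only its length changes.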
best_val_loss = float('inf')
for epoch in range(num_epochs):
    # --- Training phase ---
    model.train()
    train_losses = []
    for step in range(steps_per_epoch):
        # ... training step from above ...
        # (pass a global step such as epoch * steps_per_epoch + step to get_lr,
        #  otherwise warmup restarts every epoch)
        train_losses.append(loss.item())
    # --- Validation phase ---
    model.eval()
    val_losses = []
    with torch.no_grad(): # No gradient tracking needed
        for _ in range(val_steps):
            x, y = get_batch(val_data, batch_size, context_len, device)
            logits = model(x)
            val_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            val_losses.append(val_loss.item())
    avg_train = sum(train_losses) / len(train_losses)
    avg_val = sum(val_losses) / len(val_losses)
    print(f"Epoch {epoch+1}: train={avg_train:.4f}, val={avg_val:.4f}, gap={avg_val-avg_train:.4f}")
    # Save best checkpoint (not just every N epochs)
    if avg_val < best_val_loss:
        best_val_loss = avg_val
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_loss': avg_train,
            'val_loss': avg_val,
            'stoi': stoi, # Save vocab mappings so you can decode later
            'itos': itos,
        }, 'best_checkpoint.pt')
        print(f" → New best val loss: {best_val_loss:.4f}, checkpoint saved")
Note that we save the vocabulary mappings (stoi and itos) in the checkpoint. If you reload a checkpoint later without them, you won't be able to decode the model's generated tokens back to text. A checkpoint that's missing the tokenizer is almost useless for generation.
Sometimes you run the training loop and something feels wrong. Here are the most common failure modes and how to identify them:
Check that y is x shifted by one position, not identical to x. Also verify the model output is (batch, seq_len, vocab_size); if the shape is wrong, cross_entropy can silently misbehave.
Check that clip_grad_norm_ is called. Try halving max_lr.
Print the get_lr() output at step 0: it should be near zero, but by step warmup_steps it should be at max_lr.
The most valuable debugging tool is printing loss every step for the first 20 steps. Loss should drop monotonically or near-monotonically. If it doesn't, the issue is most likely in the training step order, not the model architecture.
Print f"Epoch {e}: train={train_loss:.4f}, val={val_loss:.4f}" at the end of each epoch. Confirm that train loss drops from ~4.2 toward ~2.5 by epoch 3. If it doesn't budge, there's a bug in your data pipeline or gradient update order.
Key Takeaway: A full training loop includes validation, gradient clipping, checkpointing, and sample generation. The train/val loss gap is the best overfitting indicator.
Overfitting signs, train vs val loss, when to stop training
Training is only half the work. Understanding what the numbers and samples mean, and knowing when to stop, is what separates a model that was trained from a model that actually learned. A loss drop from 4.2 to 1.0 is a transformation from "random character guesser" to something that has internalized spelling, common words, and grammatical patterns. You can see this progression in the generated text long before the loss numbers tell the whole story. Day 21 gives you the diagnostic vocabulary to read what your training curves are telling you and make confident decisions about your model.
A healthy loss curve has a characteristic shape: a steep initial drop, followed by a long gradual improvement, followed by a plateau. Understanding what each phase looks like helps you diagnose problems early rather than waiting for a 3-hour run to finish.
The train/val loss gap is your primary overfitting diagnostic: a small, stable gap is healthy, while a large or widening gap signals overfitting.
One subtlety: train loss is always expected to be somewhat lower than val loss, because the model has seen the training data before (it's literally what it trained on). A small gap isn't a problem. Only a large or widening gap is concerning.
Model capacity (number of parameters) must be matched to the amount of training data. Too large a model on too little data will memorize the training set: it has enough capacity to store it rather than learning to generalize. Too small a model won't have enough capacity to represent the patterns in your data. On 1MB of text (~1M characters of TinyShakespeare), here's how different configurations perform:
# Config notation: (n_layers, n_heads, embed_dim) → approx parameter count
# 2L/2H/128D → ~0.5M params
# Trains fast (minutes on CPU, seconds on GPU)
# Sample: "the king of his the and to be the my"
# (valid words, no meaningful grammar or coherence)
# 4L/4H/256D → ~4M params
# Trains in ~30min on MPS/CUDA
# Sample: "HAMLET: I will not speak with her, my lord."
# (coherent sentences, character voices emerging)
# 6L/6H/384D → ~10M params
# Trains in ~2-3h on MPS, ~30min on good CUDA
# Sample: "KING HENRY: The queen hath given me many a tear"
# (multi-sentence coherence, consistent dramatic tone)
# Rule of thumb: for 1MB text, 10M params is about the right ceiling.
# Above this, you'll see val loss rise early no matter how you tune.
Start with the smallest config (2L/2H/128D) to verify your pipeline end-to-end in minutes. Then scale up once you're confident training is working correctly. There's no benefit to debugging a 10M parameter model when the same bug would show up in a 0.5M parameter model 20x faster.
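A rough parameter estimate helps when choosing between these configurations. The sketch below uses a common back-of-the-envelope rule for GPT-style blocks (about 12 × embed_dim² weights per layer, assuming a 4x-wide MLP) plus token and learned position embeddings. The formula, the context_len of 256, and the helper names are assumptions; it ignores biases, LayerNorm parameters, and any untied output head, so it will undercount somewhat.

```python
def approx_block_params(n_layers, embed_dim):
    # Per layer: ~4*d^2 for attention (Q, K, V, output) + ~8*d^2 for a 4x-wide MLP
    return 12 * n_layers * embed_dim ** 2

def approx_embedding_params(vocab_size, context_len, embed_dim):
    # Token embedding table plus learned position embeddings
    return (vocab_size + context_len) * embed_dim

for n_layers, embed_dim in ((2, 128), (4, 256), (6, 384)):
    blocks = approx_block_params(n_layers, embed_dim)
    emb = approx_embedding_params(vocab_size=65, context_len=256, embed_dim=embed_dim)
    print(f"{n_layers}L/{embed_dim}D: {blocks + emb:,} params")
```

The estimates land in the same ballpark as the figures quoted above and make the scaling visible: doubling embed_dim roughly quadruples the count, while adding layers grows it linearly.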
The model checkpoint with the lowest validation loss is your best model, not necessarily the one from the last epoch. If you overfitted, your final checkpoint is actually worse than your best one. This is why we save best_checkpoint.pt whenever val loss improves, as shown in Day 20. The code to find the best epoch and load it:
# If you've been printing per-epoch losses, find the best manually:
losses = {
    1: (3.21, 3.18), # (train, val)
    2: (2.45, 2.49),
    3: (1.82, 1.90),
    4: (1.51, 1.63),
    5: (1.24, 1.44), # ← best val loss
    6: (1.05, 1.52), # gap increasing: overfitting starting
    7: (0.89, 1.71), # clearly overfitting
}
best_epoch = min(losses, key=lambda e: losses[e][1]) # epoch 5

# Load the best checkpoint
checkpoint = torch.load('best_checkpoint.pt', weights_only=False)
model.load_state_dict(checkpoint['model_state_dict'])
stoi = checkpoint['stoi']
itos = checkpoint['itos']
print(f"Loaded best checkpoint from epoch {checkpoint['epoch']+1}")
print(f"Best val loss: {checkpoint['val_loss']:.4f}")
Once you've loaded the best checkpoint, generate a few samples (Week 4 covers generation in depth). Compare samples from epoch 5 vs epoch 7 in the example above: the epoch 7 model has lower train loss but higher val loss, and its outputs will be more repetitive and less varied because it has memorized specific training sequences rather than learning the underlying language patterns. This comparison makes overfitting concrete and visible, not just a number.
How to resume training from a checkpoint: Save optimizer_state_dict in your checkpoint (as shown in Day 20) and restore it with optimizer.load_state_dict(checkpoint['optimizer_state_dict']). This restores Adam's moment estimates, so training resumes from exactly where it left off. Without this, Adam's running averages reset to zero and you effectively restart warmup, sometimes causing a temporary loss spike in the first few steps after resuming.
Load your best checkpoint with torch.load('checkpoint_epochN.pt') and generate a 200-character sample. Compare it side-by-side with a sample from your final (potentially overfitted) checkpoint. The difference makes overfitting tangible, not just a number on a graph.
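The same per-epoch numbers can drive a simple stopping rule. The sketch below reuses the example losses from above; the should_stop helper and its patience value are illustrative assumptions, and it expects epochs to be numbered consecutively.

```python
def best_epoch(losses):
    # losses maps epoch -> (train_loss, val_loss)
    return min(losses, key=lambda e: losses[e][1])

def should_stop(losses, patience=2):
    # Stop once the best epoch is at least `patience` epochs behind the latest one
    latest = max(losses)
    return latest - best_epoch(losses) >= patience

losses = {
    1: (3.21, 3.18), 2: (2.45, 2.49), 3: (1.82, 1.90), 4: (1.51, 1.63),
    5: (1.24, 1.44), 6: (1.05, 1.52), 7: (0.89, 1.71),
}
print(best_epoch(losses))    # 5
print(should_stop(losses))   # True

# Train/val gap per epoch: it widens steadily after the best epoch
gaps = {e: round(val - train, 2) for e, (train, val) in losses.items()}
print(gaps)
```

With patience=2 the rule would have halted this run after epoch 7, two epochs past the best validation loss, while best_checkpoint.pt still holds the epoch 5 weights.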
Key Takeaway: Monitor train/val loss gap to detect overfitting. Stop training when val loss plateaus. Larger models produce better samples but need more data.
Make your model write: sampling strategies, temperature, and top-k
Week 4 Focus: Making the model write
Autoregressive generation: predict one token at a time, append and repeat
Language models predict the next token given a sequence of previous tokens. To generate text, you start with a prompt, predict the next token, append it to the sequence, and repeat. This is called autoregressive generation because each prediction depends on all previous predictions. The term "autoregressive" comes from time series analysis where each value is regressed on previous values of the same variable.
import torch

def generate(model, idx, max_new_tokens):
    """Generate text autoregressively.

    Args:
        model: The GPT model
        idx: Tensor of shape (batch, seq_len) with starting tokens
        max_new_tokens: Number of tokens to generate

    Returns:
        Tensor with original prompt + generated tokens
    """
    model.eval() # Set to evaluation mode
    for _ in range(max_new_tokens):
        # Get predictions for next token
        # Only use the last block_size tokens for context
        idx_cond = idx[:, -model.config.block_size:]
        # Forward pass: (batch, seq_len, vocab_size)
        logits, _ = model(idx_cond)
        # Focus on the last token's predictions
        # logits shape: (batch, seq_len, vocab_size)
        # We want: (batch, vocab_size) for the last position
        logits = logits[:, -1, :]
        # Convert to probabilities
        probs = torch.softmax(logits, dim=-1)
        # Sample the next token
        next_token = torch.multinomial(probs, num_samples=1)
        # Append to the sequence
        idx = torch.cat([idx, next_token], dim=1)
    return idx
Notice how the function calls model.eval() to disable dropout, and how it only passes the last block_size tokens to the model (because GPT can only handle sequences up to block_size tokens). What happens if you forget to truncate the sequence?
The generate loop has three key parts: (1) prepare the input sequence, (2) run the model to get logits for the next token, (3) sample or select the next token and append it. This loop repeats until you've generated the desired number of tokens. The "autoregressive" part means the input to each iteration includes all previously generated tokens.
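The three parts of the loop can be seen in isolation with a toy stand-in for the network. In this sketch toy_model is a made-up function that strongly prefers the token after the last one; it is not a trained model, and toy_generate works on plain lists rather than tensors.

```python
import random

def toy_model(context, vocab_size):
    # Stand-in for a trained network: favors (last_token + 1) % vocab_size
    probs = [0.1 / (vocab_size - 1)] * vocab_size
    probs[(context[-1] + 1) % vocab_size] = 0.9
    return probs

def toy_generate(context, max_new_tokens, vocab_size=5, block_size=4, seed=0):
    rng = random.Random(seed)
    tokens = list(context)
    for _ in range(max_new_tokens):
        window = tokens[-block_size:]              # (1) prepare the input
        probs = toy_model(window, vocab_size)      # (2) next-token distribution
        next_token = rng.choices(range(vocab_size), weights=probs)[0]  # (3) sample
        tokens.append(next_token)                  # append and repeat
    return tokens

out = toy_generate([0, 1], max_new_tokens=10)
print(out, len(out))
```

The output always has len(prompt) + max_new_tokens entries, and each sampled token immediately becomes part of the context for the next step.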
# How to use the generate function import torch from model import GPT, GPTConfig from tokenizer import get_stoi, get_itos # Load your trained model checkpoint = torch.load("checkpoint_final.pt", weights_only=False) model = GPT(checkpoint["config"]) model.load_state_dict(checkpoint["model_state_dict"]) # Get character mappings stoi = checkpoint["stoi"] itos = checkpoint["itos"] # Encode the prompt prompt = "To be or not" tokens = [stoi[c] for c in prompt if c in stoi] idx = torch.tensor([tokens], dtype=torch.long) # Generate! output_idx = generate(model, idx, max_new_tokens=200) # Decode back to text output_text = "".join([itos[i] for i in output_idx[0].tolist()]) print(output_text)
GPT models have a fixed context window: the maximum sequence length they were trained on (block_size). During generation the sequence grows by one token with every step. Once it exceeds block_size, you must truncate it. Otherwise the model is asked for a position it never saw during training, which either raises an index error in the position embedding or produces garbage output.
The fix is one line: always pass only the last block_size tokens into the model. The generated text still accumulates in full in idx (for decoding), but the model only "looks back" as far as its training allowed.
# Without truncation: works fine until sequence length > block_size
logits, _ = model(idx)  # idx grows indefinitely; will crash or produce garbage

# With truncation: always safe
idx_cond = idx[:, -model.config.block_size:]  # keep last block_size tokens
logits, _ = model(idx_cond)

# The full idx still stores the entire generated sequence for decoding.
# Only the model input is truncated, not the stored output.
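The sliding window can be seen in isolation with a few lines of plain Python. This is a minimal sketch, not the course's code: `generate_with_window` and `toy_next_token` are hypothetical names, and the "model" is a stand-in that simply returns the last ID plus one. It records how many tokens the model is shown at each step.

```python
def generate_with_window(next_token_fn, prompt_ids, max_new_tokens, block_size):
    """Toy autoregressive loop: the model only ever sees the last block_size IDs."""
    ids = list(prompt_ids)
    context_lengths = []
    for _ in range(max_new_tokens):
        context = ids[-block_size:]          # same idea as idx[:, -block_size:]
        context_lengths.append(len(context))
        ids.append(next_token_fn(context))   # the full history keeps growing
    return ids, context_lengths


def toy_next_token(context):
    # Hypothetical stand-in for a model: next ID is (last ID + 1) mod 10.
    return (context[-1] + 1) % 10


ids, context_lengths = generate_with_window(
    toy_next_token, [0, 1, 2], max_new_tokens=8, block_size=4
)
print(ids)              # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
print(context_lengths)  # [3, 4, 4, 4, 4, 4, 4, 4]
```

The stored sequence reaches 11 IDs while the context handed to the model never exceeds 4, which is the separation between "what is kept" and "what is seen" described above.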
Try it: remove the truncation on a model with block_size=256 and generate more than 256 tokens. Watch what happens after 256 tokens: the output will degrade or crash with an index-out-of-bounds error in the position embedding. Then add the truncation back and generate from the same prompt. The difference is dramatic.
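Calling model.eval() before generating is not optional. It changes the model's behaviour in two ways: dropout is disabled, and any batch-normalisation layers switch to their running statistics. A standard GPT uses LayerNorm, which behaves the same in both modes, so dropout is the one that matters here.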
# Correct inference setup
model.eval()  # Disable dropout, use running BN stats

# Even better: pair with no_grad to save memory
model.eval()
with torch.no_grad():
    output = generate(model, idx, max_new_tokens=200)

# If you need to continue training after generation:
model.train()  # Switch back to train mode
Forgetting model.eval() means dropout stays active during generation. Even greedy decoding will then produce different output on every call for the same prompt. If you're debugging generation quality and results are mysteriously inconsistent, check whether you set eval mode.
Write the generate function from scratch without copying. Given a model, a starting token tensor, and max_new_tokens, implement the loop: forward pass → slice last logit → softmax → multinomial sample → cat. Call it with your trained checkpoint and a short prompt. Print the raw output to the terminal. Verify the output grows by exactly max_new_tokens tokens.
Always pick the most probable token - why it's deterministic and repetitive
Greedy decoding always selects the token with the highest probability. It's simple and deterministic: the same prompt always produces the same output. But this is exactly why it's problematic. Language is inherently probabilistic, and the most probable continuation isn't always the most interesting or coherent one. Greedy decoding tends to get stuck in loops or produce bland, repetitive text.
def generate_greedy(model, idx, max_new_tokens):
    """Greedy decoding: always pick the most probable token.

    This is deterministic - same prompt = same output every time.
    """
    model.eval()

    with torch.no_grad():  # No need to track gradients for inference
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -model.config.block_size:]

            # Get logits for next token
            logits, _ = model(idx_cond)
            logits = logits[:, -1, :]

            # GREEDY: always pick the most probable token
            # argmax returns the index of the highest value
            next_token = logits.argmax(dim=-1, keepdim=True)

            # Append and continue
            idx = torch.cat([idx, next_token], dim=1)

    return idx
Greedy decoding has two main problems: (1) It's deterministic, which means no creativity or variation. (2) It tends to be repetitive because the highest-probability continuation often reinforces itself. For example, if the model predicts "the" as the most likely next token, and "the" leads to another "the", you get "the the the...". Sampling strategies like temperature and top-k break this cycle by introducing controlled randomness.
# Compare greedy vs sampled output

# Greedy (temperature=0.1, very close to argmax):
# "To be or not to be or not to be or not to be..."

# With temperature sampling (temperature=0.8):
# "To be or not to be, that is the question. Whether 'tis..."

# The greedy output is stuck in a loop!
# This happens because the model learns that certain
# patterns (like "to be or not") are highly probable,
# and greedy always picks the highest probability.
Repetition isn't accidental; it's the direct consequence of always choosing the most probable token. Here's why the loop forms:
The model isn't "broken"; it's doing exactly what it was trained to do: predict the most likely continuation. The problem is that the most likely continuation locally is not always the most interesting continuation globally. Greedy optimises for each token in isolation, not the quality of the full sequence.
# Greedy feedback loop visualised
# Each step the model sees its own previous output

# Step 1: "To be or not" → P(next=" to")  = 0.41  ← greedy picks this
# Step 2: "be or not to" → P(next=" be")  = 0.55  ← greedy picks this
# Step 3: "or not to be" → P(next=" or")  = 0.38  ← greedy picks this
# Step 4: "not to be or" → P(next=" not") = 0.47  ← loop locked in

# With sampling at T=0.8, other tokens get a chance:
# Step 1: "To be or not" → samples " to" (41%), " ,", " here", ...
# Step 2: might go an entirely different direction
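The feedback loop can be reproduced with a toy lookup table instead of a trained model. The sketch below is illustrative only: `BIGRAM` is a made-up table of next-word probabilities (not output from any real model), and `greedy_step`, `sample_step`, and `run` are hypothetical helpers. It uses plain Python so the cycle is visible without PyTorch.

```python
import random

# Hypothetical next-token probabilities: each row is P(next | current).
BIGRAM = {
    "to":    {"be": 0.60, "go": 0.40},
    "be":    {"or": 0.45, "still": 0.30, "gone": 0.25},
    "or":    {"not": 0.70, "else": 0.30},
    "not":   {"to": 0.60, "yet": 0.40},
    "go":    {"to": 0.55, "or": 0.45},
    "still": {"to": 0.55, "or": 0.45},
    "gone":  {"to": 0.55, "or": 0.45},
    "else":  {"to": 0.55, "or": 0.45},
    "yet":   {"to": 0.55, "or": 0.45},
}


def greedy_step(token):
    options = BIGRAM[token]
    return max(options, key=options.get)      # argmax over next-token probabilities


def sample_step(token, rng):
    options = BIGRAM[token]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def run(start, steps, step_fn):
    tokens = [start]
    for _ in range(steps):
        tokens.append(step_fn(tokens[-1]))
    return tokens


greedy = run("not", 8, greedy_step)
rng = random.Random(0)                        # seeded so the sampled run is repeatable
sampled = run("not", 8, lambda tok: sample_step(tok, rng))

print(" ".join(greedy))    # not to be or not to be or not
print(" ".join(sampled))   # some other path through the table
```

Greedy revisits the same four-word cycle forever because each argmax leads back to a state it has already been in; sampling can leave the cycle at any step where a second option has non-trivial probability.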
Beam search is the middle ground between greedy (one candidate) and full sampling (all candidates). Instead of keeping only the single best token at each step, beam search keeps the top k partial sequences (called "beams") and expands all of them. At the end, it picks the full sequence with the highest total log-probability.
Greedy: Fast, deterministic, but myopic. Can lock into loops. Used only when speed is critical and output quality is less important.
Beam search: Slower (roughly k× the forward-pass compute per step), but considers more options. Popular in translation and summarisation. Still deterministic: the same beam width always produces the same output.
For open-ended text generation (stories, Shakespeare, creative writing), neither greedy nor beam search is ideal. They both optimise for highest probability, which produces safe, bland text. Sampling with temperature and top-k, covered in the next two days, is the standard approach for creative generation because it introduces controlled randomness.
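Beam search is easier to follow on a two-token toy problem than on a real model. The sketch below assumes a made-up table `NEXT_PROBS` in which the locally best first token leads to a worse full sequence; `beam_search` is a hypothetical helper, and a beam width of 1 is used to stand in for greedy decoding.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
# None marks the start of the sequence.
NEXT_PROBS = {
    None: {"a": 0.6, "b": 0.4},
    "a":  {"a": 0.55, "b": 0.45},
    "b":  {"a": 0.9, "b": 0.1},
}


def beam_search(beam_width, length):
    """Return (tokens, total_log_prob) of the best sequence found."""
    beams = [([], 0.0)]  # each beam: (tokens so far, summed log-probability)
    for _ in range(length):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1] if tokens else None
            for tok, p in NEXT_PROBS[last].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_tokens, greedy_score = beam_search(beam_width=1, length=2)  # width 1 == greedy
beam_tokens, beam_score = beam_search(beam_width=2, length=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['a', 'a'] 0.33
print(beam_tokens, round(math.exp(beam_score), 2))      # ['b', 'a'] 0.36
```

Greedy commits to "a" because 0.6 > 0.4, but the best two-token sequence starts with "b". Keeping a second beam alive is what lets the search recover it.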
Add a generate_greedy variant to your generate file that uses argmax instead of multinomial. Run it with the prompt "To be or not" and generate 200 characters. Run it a second time with the exact same prompt; you should get identical output. Now compare it side-by-side with your sampling version. Write down one observation about where greedy gets stuck.
Softmax with temperature scaling: T->0 approaches greedy, T>1 flattens distribution
Temperature scaling divides the logits by a temperature value before applying softmax. This changes the probability distribution: low temperature makes high-probability tokens even more likely (sharper distribution), high temperature flattens the distribution and gives rare tokens more chance. The "sweet spot" for coherent but varied text is typically T=0.7-0.9.
def apply_temperature(logits, temperature=1.0):
    """Scale logits by temperature before softmax.

    Temperature effects:
    - T = 1.0: Normal probabilities (no change)
    - T -> 0: Approaches greedy (argmax)
    - T > 1.0: Flattens distribution, more random
    - T = 0.7-0.9: Sweet spot for coherent text
    """
    if temperature <= 0:
        raise ValueError("Temperature must be positive")

    # Divide logits by temperature
    # Higher T = flatter distribution
    # Lower T = sharper distribution (more peaked)
    scaled_logits = logits / temperature

    # Apply softmax to get probabilities
    probs = torch.softmax(scaled_logits, dim=-1)
    return probs
Softmax computes: P(token_i) = exp(logit_i) / sum(exp(logit_j)) for all j. When you divide logits by temperature T, you get: P(token_i) = exp(logit_i/T) / sum(exp(logit_j/T)). As T->0, the largest logit dominates (approaches argmax). As T->infinity, all probabilities approach 1/vocab_size (uniform distribution). This gives you a knob to control how "confident" the model should be.
# Visualizing temperature effects on a simple distribution
# Suppose logits for 3 tokens are: [2.0, 1.0, 0.5]

# T = 1.0 (normal):
#   exp(2.0) = 7.39, exp(1.0) = 2.72, exp(0.5) = 1.65
#   sum = 11.76
#   P = [0.63, 0.23, 0.14]  <- Original distribution

# T = 0.5 (sharper):
#   exp(2.0/0.5) = exp(4.0) = 54.6
#   exp(1.0/0.5) = exp(2.0) = 7.39
#   exp(0.5/0.5) = exp(1.0) = 2.72
#   P = [0.84, 0.11, 0.04]  <- More peaked!

# T = 2.0 (flatter):
#   exp(2.0/2.0) = exp(1.0)  = 2.72
#   exp(1.0/2.0) = exp(0.5)  = 1.65
#   exp(0.5/2.0) = exp(0.25) = 1.28
#   P = [0.48, 0.29, 0.23]  <- More uniform!
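The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library rather than torch.softmax; `softmax_with_temperature` is a hypothetical helper name, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """P(i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 2) for p in probs])
# 0.5 [0.84, 0.11, 0.04]
# 1.0 [0.63, 0.23, 0.14]
# 2.0 [0.48, 0.29, 0.23]
```

Because of rounding to two decimals, a printed row may not sum to exactly 1.00 even though the unrounded probabilities do.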
The math of temperature is clean, but what matters in practice is how each range feels when you read the output. Here is a field guide for a character-level Shakespeare model:
Low temperature: Output is almost identical every run. Tends to repeat common phrases from training data. Useful only when you want near-deterministic output for testing.
Sweet spot (T = 0.7–0.9): Varied enough to feel natural, structured enough to stay coherent. This is the default for creative text. Most practitioners land somewhere in this range.
T = 1.0: No transformation applied; you sample from the model's learned distribution exactly. May work well for strong models; tends to produce incoherent runs with smaller ones.
High temperature (T > 1.0): Rare characters and unusual combinations get boosted. Output becomes increasingly incoherent. Sometimes useful for "creative hallucination" but rarely for serious generation.
# Side-by-side with the same prompt "HAMLET:" and seed=42

# T=0.2 → "HAMLET: I will not be able to be able to be able to"
#         Repetition. Safe but boring.

# T=0.8 → "HAMLET: What is this shadow of my father's sword,"
#         Coherent and varied. Shakespeare-like.

# T=1.5 → "HAMLET: ,wixy the! Ard uf mou fath,no. Cre."
#         Character-valid but meaningless. Too much noise.
Temperature has a formal connection to entropy, the measure of uncertainty in a probability distribution. High entropy means many tokens are roughly equally likely (high uncertainty). Low entropy means one or a few tokens dominate (low uncertainty). Temperature directly controls entropy:
import torch

def distribution_entropy(logits, temperature):
    """Compute Shannon entropy of the sampling distribution."""
    scaled = logits / temperature
    probs = torch.softmax(scaled, dim=-1)
    # H = -sum(p * log(p)), in nats; the small constant avoids log(0)
    log_probs = torch.log(probs + 1e-10)
    entropy = -(probs * log_probs).sum().item()
    return entropy

# Example: logits for 5 tokens = [3.0, 2.0, 1.0, 0.5, 0.1]
logits = torch.tensor([3.0, 2.0, 1.0, 0.5, 0.1])
for T in [0.2, 0.5, 0.8, 1.0, 1.5]:
    print(f"T={T}: entropy={distribution_entropy(logits, T):.2f}")

# Approximate values, in nats (maximum possible for 5 tokens is ln 5 ≈ 1.61):
# T=0.2: entropy ≈ 0.04 (very peaked)
# T=0.5: entropy ≈ 0.49
# T=0.8: entropy ≈ 0.92
# T=1.0: entropy ≈ 1.11
# T=1.5: entropy ≈ 1.36 (flatter)
Thinking in entropy gives you intuition for how "surprised" the model is at each step. High-entropy steps are where multiple tokens are plausible (e.g., choosing between different plot directions). Low-entropy steps are where the next token is nearly forced (e.g., in English the letter after "q" is almost always "u").
Generate several samples at T=0.2, T=0.8, and T=1.4. Use the same starting prompt each time. Notice how much the outputs vary within a temperature setting and across settings. Find the temperature where the output is still recognisably English but not obviously repetitive; that is your personal sweet spot for this model.
Restrict to k most likely tokens, why it prevents extremely unlikely tokens
Even with temperature scaling, the model might still sample from the long tail of unlikely tokens. These tokens often produce gibberish. Top-k sampling solves this by only considering the k most probable tokens: set the logits of all other tokens to -inf (which becomes 0 after softmax). This gives you a hard cutoff: only sample from the top k tokens.
def apply_top_k(logits, top_k=40):
    """Filter logits to only keep the top-k most likely tokens.

    Args:
        logits: Tensor of shape (batch, vocab_size)
        top_k: Number of top tokens to keep

    Returns:
        Filtered logits (other tokens set to -inf)
    """
    if top_k <= 0:
        return logits  # No filtering

    # torch.topk raises an error if k exceeds the vocabulary size
    top_k = min(top_k, logits.size(-1))

    # values: (batch, top_k) - the k highest logits, sorted descending
    values, _ = torch.topk(logits, top_k)

    # The smallest of the top-k values is the threshold
    min_top_k_value = values[:, -1:]  # Shape: (batch, 1)

    # Set all logits below the threshold to -inf (0 probability after softmax).
    # masked_fill returns a new tensor, so the caller's logits are not modified.
    return logits.masked_fill(logits < min_top_k_value, float("-inf"))
Character-level models have small vocabularies (typically 65 tokens for Shakespeare: 26 lowercase and 26 uppercase letters plus punctuation and whitespace). With such a small vocabulary, top_k=40 is reasonable because it still considers most characters while excluding the very unlikely ones. For word-level models with 50,000+ tokens, you'd use larger k values like 50-100.
# For character-level model (vocab_size=65):
# top_k=40 means we consider ~62% of the vocabulary
# This excludes the bottom 25 characters (unlikely ones)

# For word-level model (vocab_size=50000):
# top_k=50 means we consider only 0.1% of the vocabulary!
# This is very restrictive but prevents nonsense words.

# Typical values:
# - Char-level: top_k=20-40
# - Word-level: top_k=50-100
# - BPE/subword: top_k=40-80

# The idea: filter out the "long tail" of unlikely tokens
# that would produce gibberish or break the text.
The weakness of top-k is that k is fixed regardless of the distribution shape. When the model is very confident, even k=5 might include tokens with near-zero probability. When the model is uncertain, k=40 might still exclude many plausible tokens. Top-p sampling (also called nucleus sampling) solves this by choosing k adaptively: keep the smallest set of tokens whose cumulative probability exceeds p.
def apply_top_p(logits, top_p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches top_p.

    Args:
        logits: Tensor of shape (batch, vocab_size) or (vocab_size,)
    """
    # Sort tokens by logit (descending) along the vocabulary dimension
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Remove a token only if the mass before it already exceeds top_p,
    # so the token that crosses the threshold is kept
    sorted_indices_to_remove = cumulative_probs - sorted_probs > top_p
    sorted_logits = sorted_logits.masked_fill(sorted_indices_to_remove, float("-inf"))

    # Unsort back to original token order. Scatter along the vocabulary
    # dimension (-1), not dim 0, so batched input works too.
    logits = torch.zeros_like(logits).scatter_(-1, sorted_indices, sorted_logits)
    return logits
Top-k: Always considers exactly k tokens. When the model is highly confident, k=40 still samples from tokens with <0.001% probability. Can let in noise.
Top-p: Considers however many tokens are needed to reach p probability mass. Naturally contracts to 1–2 tokens when the model is confident, expands to many when uncertain. More robust across different text types.
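The adaptive behaviour is easy to see by counting how many tokens land in the nucleus for two differently shaped distributions. This is a small sketch in plain Python; `nucleus_size` is a hypothetical helper and both probability lists are made-up examples.

```python
def nucleus_size(probs, top_p=0.9):
    """Number of tokens in the smallest set whose cumulative probability reaches top_p."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= top_p:
            return count
    return len(probs)


confident = [0.92, 0.05, 0.01, 0.01, 0.01]   # one token dominates
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]   # nearly flat

print(nucleus_size(confident))  # 1
print(nucleus_size(uncertain))  # 5
```

A fixed top_k=3 would keep three tokens in both cases: two near-zero candidates too many for the confident distribution, and two plausible candidates too few for the uncertain one.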
Implement apply_top_p and add a top_p parameter to your generate function. Try p=0.9 (common default) and p=0.95. Compare the outputs to top-k at k=40. For most character-level models, top-k and top-p produce similar output quality; the difference becomes more pronounced with large-vocabulary word-level models.
Top-k and temperature are usually applied together. The conventional pipeline is: apply temperature first (scale the logits), then apply top-k (filter the scaled logits), then apply softmax to get probabilities. For top-k the first two steps commute: dividing by a positive temperature preserves the ranking of the logits, so the same k tokens survive in either order and the final distribution is identical. Two orderings do matter. Softmax must come after filtering, so the surviving probabilities renormalise to sum to 1. And if you add top-p, temperature must come before it, because the nucleus is chosen from the actual probabilities, which temperature changes.
def sample_next_token(logits, temperature=0.8, top_k=40):
    """Conventional pipeline: temperature → top-k → softmax → sample."""
    # Step 1: Temperature scaling (modifies the distribution shape)
    logits = logits / temperature

    # Step 2: Top-k filtering (removes low-probability tokens)
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < values[:, [-1]]] = float("-inf")

    # Step 3: Softmax (convert to probabilities; must come after filtering)
    probs = torch.softmax(logits, dim=-1)

    # Step 4: Sample
    return torch.multinomial(probs, num_samples=1)

# Notes on ordering:
# Temperature rescales the gaps between logits but never changes their rank.
# Top-k selects by rank, so it keeps the same tokens before or after scaling.
# Softmax must run last so the kept tokens renormalise to sum to 1.
# Top-p selects by cumulative probability, so it must see the scaled logits.
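A quick way to see what reordering actually changes is to run both orders on a toy distribution. The sketch below is plain Python with hypothetical helpers (`softmax`, `top_k_filter`, `top_p_filter`, `scale`) and made-up logits; it is a check on the arithmetic, not the course's implementation.

```python
import math


def softmax(logits):
    peak = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - peak) for x in logits]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(logits, k):
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def top_p_filter(logits, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= top_p:
            break
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]


def scale(logits, T):
    return [x / T for x in logits]   # -inf / T stays -inf for T > 0


def kept_count(logits):
    return sum(x != -math.inf for x in logits)


logits, T = [2.0, 1.0, 0.5, -1.0], 0.5

# Top-k: both orders give the same final distribution.
temp_then_k = softmax(top_k_filter(scale(logits, T), 2))
k_then_temp = softmax(scale(top_k_filter(logits, 2), T))
print(temp_then_k == k_then_temp)

# Top-p: the order changes which tokens survive.
temp_then_p = top_p_filter(scale(logits, T), 0.9)
p_then_temp = scale(top_p_filter(logits, 0.9), T)
print(kept_count(temp_then_p), kept_count(p_then_temp))   # 2 3
```

With these numbers, sharpening the distribution first shrinks the 0.9 nucleus from three tokens to two, while the top-k result is unaffected by the order.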
Add apply_top_k to your generate pipeline and wire up a top_k parameter. Generate samples at k=5, k=20, and k=0 (no filtering), holding temperature fixed at T=0.8. For your character-level model with ~65-token vocab, note the difference between k=5 (very restrictive) and k=0. Write down which value produces the best trade-off between safety and variety.
Combining temperature + top-k, torch.multinomial for sampling, @torch.no_grad()
This section builds the full generate function: it uses @torch.no_grad() for memory-efficient inference, handles the details of prompt encoding (including unknown characters), and shows how to generate multiple independent samples in one call so you can compare different outputs from the same prompt.
@torch.no_grad() disables gradient computation during inference. When generating text, we don't need to compute gradients (we're not training), so disabling them saves memory and speeds up generation. Without it, PyTorch would track all operations for potential backpropagation, which wastes memory and compute. Always use it for inference!
@torch.no_grad()  # Disables gradient tracking for efficiency
def generate(model, prompt, stoi, itos, max_new_tokens=200, temperature=0.8, top_k=40):
    """Full generate function with temperature and top-k.

    Pipeline for each token:
    1. Run model -> get logits for next position
    2. Apply temperature scaling
    3. Filter with top-k (remove unlikely tokens)
    4. Convert to probabilities with softmax
    5. Sample from distribution with multinomial
    6. Append sampled token and repeat
    """
    device = next(model.parameters()).device

    # Encode the prompt (characters missing from stoi are dropped)
    tokens = [stoi[c] for c in prompt if c in stoi]
    idx = torch.tensor([tokens], dtype=torch.long, device=device)

    model.eval()  # Set to evaluation mode (disables dropout)

    for _ in range(max_new_tokens):
        # Only use last block_size tokens for context
        idx_cond = idx[:, -model.config.block_size:]

        # Forward pass
        logits, _ = model(idx_cond)

        # Focus on last token: (batch, vocab_size)
        logits = logits[:, -1, :]

        # Step 2: Apply temperature
        logits = logits / temperature

        # Step 3: Apply top-k filtering (clamp k to the vocabulary size)
        if top_k > 0:
            values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            min_top_k = values[:, -1:]
            logits[logits < min_top_k] = float("-inf")

        # Step 4: Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # Step 5: Sample from distribution
        # multinomial samples from probs, weighted by their values
        next_token = torch.multinomial(probs, num_samples=1)

        # Step 6: Append and continue
        idx = torch.cat([idx, next_token], dim=1)

    # Decode back to text
    return "".join([itos[i] for i in idx[0].tolist()])
torch.multinomial(probs, num_samples=1) samples one token from the probability distribution. Unlike argmax which always picks the highest probability, multinomial samples according to the probabilities: higher probability = more likely to be sampled, but not guaranteed. This is what makes the output varied and interesting.
# Example: sampling from a distribution
probs = torch.tensor([0.6, 0.3, 0.1])  # 3 tokens

# argmax always returns index 0 (highest prob)
print(probs.argmax())  # Always: tensor(0)

# multinomial samples according to probabilities
print(torch.multinomial(probs, num_samples=1))
# Could be 0 (60% chance), 1 (30% chance), or 2 (10% chance)

# This is the key to non-deterministic generation!
# Same prompt + different samples = different output
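Drawing many samples makes the "weighted, but not guaranteed" behaviour concrete. The sketch below uses random.choices from the standard library as a plain-Python stand-in for torch.multinomial (an assumption made so the example runs without PyTorch); the seed and the number of draws are arbitrary.

```python
import random
from collections import Counter

probs = [0.6, 0.3, 0.1]
rng = random.Random(0)             # fixed seed so the run is repeatable

# random.choices stands in for torch.multinomial here.
draws = rng.choices(range(len(probs)), weights=probs, k=10_000)
counts = Counter(draws)
frequencies = [counts[i] / len(draws) for i in range(len(probs))]

print(frequencies)                 # close to [0.6, 0.3, 0.1], not exactly equal
```

Index 0 wins most often, but indices 1 and 2 still appear at roughly their stated rates, which is exactly the variation that argmax removes.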
The generate function receives a string prompt and must convert it to token IDs before the model can process it. For character-level models this is simple (one character = one token), but there are edge cases you must handle:
Unknown characters: drop any character that is not in stoi.
Empty or fully unknown prompt: fall back to a newline ("\n") or the first token in the vocabulary.
Overlong prompt: keep only the last block_size characters before encoding.
def encode_prompt(prompt, stoi, block_size):
    """Encode a string prompt into token IDs, handling edge cases."""
    # Filter unknown characters (silently drop them)
    known_chars = [c for c in prompt if c in stoi]

    if not known_chars:
        # Fallback: start from newline if prompt is fully unknown
        print("Warning: prompt contained no known characters, using '\\n'")
        known_chars = ["\n"] if "\n" in stoi else [list(stoi.keys())[0]]

    # Truncate if longer than context window
    if len(known_chars) > block_size:
        known_chars = known_chars[-block_size:]
        print(f"Warning: prompt truncated to last {block_size} characters")

    # Convert to token IDs and add batch dimension
    token_ids = [stoi[c] for c in known_chars]
    return torch.tensor([token_ids], dtype=torch.long)

# Usage
idx = encode_prompt("To be or not", stoi, model.config.block_size)
print(f"Prompt encoded to {idx.shape[1]} tokens")  # 12 tokens if every character is in stoi; idx.shape == (1, 12)
The generate function takes a batch dimension: the first axis of idx. If you pass a batch of size N, you get N independent samples from the same prompt in one forward pass per token. This is much faster than calling generate N times, because the GPU can process the batch in parallel.
@torch.no_grad()
def generate_n_samples(model, prompt, stoi, itos, n=5, max_new_tokens=200, temperature=0.8, top_k=40):
    """Generate n independent samples from the same prompt."""
    device = next(model.parameters()).device

    # Encode prompt once, then repeat for the batch
    tokens = [stoi[c] for c in prompt if c in stoi]
    idx = torch.tensor([tokens], dtype=torch.long, device=device)
    idx = idx.repeat(n, 1)  # Shape: (n, prompt_len)

    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # (n, vocab_size)

        if top_k > 0:
            # Clamp k to the vocabulary size so torch.topk cannot fail
            values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < values[:, [-1]]] = float("-inf")

        probs = torch.softmax(logits, dim=-1)
        next_tokens = torch.multinomial(probs, num_samples=1)  # (n, 1)
        idx = torch.cat([idx, next_tokens], dim=1)

    # Decode each sample in the batch
    samples = []
    for i in range(n):
        text = "".join([itos[t] for t in idx[i].tolist()])
        samples.append(text)
    return samples

# Usage: generate 5 samples and print each
samples = generate_n_samples(model, "To be or not", stoi, itos, n=5)
for i, s in enumerate(samples):
    print(f"\n--- Sample {i+1} ---\n{s}")
Compare the wall-clock time of generate_n_samples(n=5) against calling your single-sample generate 5 times in a loop.
Write the full generate function decorated with @torch.no_grad(). Time a 500-token generation with and without the decorator using time.time(). Measure peak GPU or CPU memory with torch.cuda.max_memory_allocated() (or tracemalloc on CPU). Record the wall-clock and memory difference; this makes the cost of gradient tracking concrete.
You now have a full generate function that combines temperature, top-k filtering, and @torch.no_grad() for memory-efficient inference. This is the function you'll use to make your model write text.
argparse for command-line interface, seeds for reproducibility, loading checkpoints
Day 27 covers building a CLI with argparse that loads checkpoints, accepts prompts and generation parameters, and supports reproducibility with random seeds. A well-structured CLI also means you can run generation experiments from shell scripts, compare outputs across checkpoints without touching Python, and share a single command with someone else and get the exact same result.
A command-line interface makes your generate function easy to use. argparse lets you define arguments like the checkpoint path, prompt, temperature, top_k, and seed. This allows users to run generation from the terminal without modifying code. The key is to save all necessary data (config, stoi, itos) in the checkpoint during training.
if __name__ == "__main__":
    import argparse

    # Set up argument parser
    parser = argparse.ArgumentParser(
        description="Generate text from a trained GPT checkpoint"
    )

    # Required argument: checkpoint path
    parser.add_argument("checkpoint",
                        help="Path to checkpoint file (e.g. checkpoint_final.pt)")

    # Optional arguments with defaults
    parser.add_argument("--prompt", default="To be or not",
                        help="Starting text for generation")
    parser.add_argument("--max_new_tokens", type=int, default=200,
                        help="Number of tokens to generate")
    parser.add_argument("--temperature", type=float, default=0.8,
                        help="Sampling temperature (lower = more deterministic)")
    parser.add_argument("--top_k", type=int, default=40,
                        help="Only sample from top-k most likely tokens")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")

    args = parser.parse_args()

    # Set seed if provided (for reproducibility)
    if args.seed is not None:
        torch.manual_seed(args.seed)
        print(f"Seed set to {args.seed}")
Run python generate.py checkpoint_final.pt --prompt "Hello" --temperature 0.9 --seed 42. Try the same seed multiple times - you should get identical output. Try different seeds - you'll get different but equally valid outputs. This is the power of reproducible generation.
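The argument handling can be exercised without a checkpoint or a terminal, because parse_args accepts an explicit list in place of the real command line. The sketch below mirrors the arguments described above; wrapping them in a `build_parser` function is an assumption made here for testability, not something the course's generate.py is known to do.

```python
import argparse


def build_parser():
    # Hypothetical helper: same arguments as the CLI above, kept in a function
    # so the parser can be reused in tests.
    parser = argparse.ArgumentParser(description="Generate text from a trained GPT checkpoint")
    parser.add_argument("checkpoint")
    parser.add_argument("--prompt", default="To be or not")
    parser.add_argument("--max_new_tokens", type=int, default=200)
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--top_k", type=int, default=40)
    parser.add_argument("--seed", type=int, default=None)
    return parser


# Passing a list to parse_args() stands in for sys.argv[1:].
args = build_parser().parse_args(
    ["checkpoint_final.pt", "--prompt", "Hello", "--temperature", "0.9", "--seed", "42"]
)
print(args.prompt, args.temperature, args.top_k, args.seed)  # Hello 0.9 40 42
```

Values arrive already converted by the `type=` callables, and anything not supplied (here `--top_k` and `--max_new_tokens`) falls back to its default.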
Checkpoints should contain everything needed for generation: the model config, trained weights, and the character mappings (stoi/itos). Load the checkpoint, reconstruct the model, and run generation. This makes your trained model portable and easy to use.
    # Load the checkpoint
    print(f"Loading checkpoint: {args.checkpoint}")
    checkpoint = torch.load(args.checkpoint, weights_only=False)

    # Extract config and character mappings
    config = checkpoint["config"]
    stoi = checkpoint["stoi"]  # string to index (char -> token ID)
    itos = checkpoint["itos"]  # index to string (token ID -> char)

    # Reconstruct the model
    model = GPT(config)
    model.load_state_dict(checkpoint["model_state_dict"])
    print(f"Model loaded: {sum(p.numel() for p in model.parameters())} parameters")

    # Generate text!
    print(f"\nGenerating with prompt: '{args.prompt}'")
    print(f"Temperature: {args.temperature}, Top-k: {args.top_k}\n")

    output = generate(model, args.prompt, stoi, itos,
                      max_new_tokens=args.max_new_tokens,
                      temperature=args.temperature,
                      top_k=args.top_k)

    print("-" * 50)
    print(output)
    print("-" * 50)
Generation involves random sampling (torch.multinomial), so the same prompt produces different output each time. To get reproducible results, set a random seed with torch.manual_seed(seed). This ensures that the random number generator produces the same sequence of numbers, making generation deterministic for a given seed.
# Reproducible generation
torch.manual_seed(42)
print(generate(model, "To be or not", stoi, itos, temperature=0.8))
# Same output every time with seed=42!

# From the command line:
# python generate.py checkpoint_final.pt --prompt "To be or not" --seed 42

# Why seeds matter:
# 1. Debugging: reproduce issues with specific outputs
# 2. Comparison: compare model versions with same input
# 3. Demos: show consistent results to others
# 4. Testing: write tests that expect specific output
Once you have a CLI you can run generation sweeps: systematically testing many combinations of temperature, top-k, and prompt, and saving all outputs for comparison. A simple shell loop is enough. The outputs should be saved with metadata (checkpoint, seed, parameters) in the filename or a header so you can reconstruct what produced each sample later.
# In generate.py, add --output_file argument
parser.add_argument("--output_file", default=None,
                    help="If set, write output to this file instead of stdout")
parser.add_argument("--num_samples", type=int, default=1,
                    help="Number of independent samples to generate")

# In the generation section:
results = []
for i in range(args.num_samples):
    if args.seed is not None:
        torch.manual_seed(args.seed + i)  # different seed per sample
    output = generate(model, args.prompt, stoi, itos,
                      max_new_tokens=args.max_new_tokens,
                      temperature=args.temperature,
                      top_k=args.top_k)
    results.append(output)

# Save or print
if args.output_file:
    with open(args.output_file, "w") as f:
        header = f"# checkpoint={args.checkpoint} T={args.temperature} k={args.top_k}\n\n"
        f.write(header)
        for i, r in enumerate(results):
            f.write(f"--- Sample {i+1} ---\n{r}\n\n")
    print(f"Saved {args.num_samples} samples to {args.output_file}")
else:
    for i, r in enumerate(results):
        print(f"\n--- Sample {i+1} ---\n{r}")
#!/bin/bash
# Shell script: sweep temperatures and save all outputs
# (the shebang must be the first line of the file to take effect)
for T in 0.5 0.7 0.9 1.1; do
  python generate.py checkpoint_final.pt \
    --prompt "To be or not" \
    --temperature "$T" \
    --top_k 40 \
    --num_samples 3 \
    --seed 42 \
    --output_file "samples_T${T}.txt"
done
# Creates: samples_T0.5.txt, samples_T0.7.txt, samples_T0.9.txt, samples_T1.1.txt
Save stoi, itos, and the full GPTConfig inside the checkpoint file during training, not just the model weights. Without them, you cannot reconstruct the model or decode the output. A weights-only checkpoint is unusable for generation.
Build generate.py with an argparse CLI that accepts checkpoint, --prompt, --temperature, --top_k, --max_new_tokens, and --seed. Run it twice with --seed 42 and confirm the output is identical. Then run it twice without --seed and confirm the outputs differ. Share one of the seeded outputs; this is the first output from your model that someone else can reproduce exactly.
What to expect at different training steps, trying different settings, key takeaways
The quality of generated text improves as training progresses. Here are real samples from a training run (6L/6H/384D on Shakespeare). Early in training, output is random characters. Mid-training, words and structure emerge. Late training, you get plausible Shakespeare-like text. But beware: after too many steps, the model overfits and regurgitates memorized training data.
# Step 200 (val loss ~3.5) - Random characters
# The model hasn't learned anything meaningful yet
"To be or notis p ce mei odorethleedetire'ilethed ye m arkesothir fnon b tigb'i."

# Step 1000 (val loss 1.64) - Words and structure emerging
# The model is learning words and basic grammar
"To be or nothing are good men, The profent of little, our actory. CORIOLANUS: Is it now of your many death?"

# Step 2400 (val loss ~1.60) - Peak quality
# Plausible Shakespeare! But notice: it's starting to
# memorize specific phrases from the training data.
"To be or not to be some of you shall know That everlature by Romeo: what news, Which you had knock'd my part to speak"

# Step 5000+ (overfitting) - Memorized text
# The model regurgitates training data verbatim.
# This is why you should save checkpoints and compare!
Different generation settings produce dramatically different output. Temperature controls randomness, top_k controls the candidate pool. Experiment with combinations to find what works best for your model and use case. Here are some starting points for a trained Shakespeare model.
# Load your trained model
checkpoint = torch.load("checkpoint_final.pt", weights_only=False)
config = checkpoint["config"]
stoi = checkpoint["stoi"]
itos = checkpoint["itos"]
model = GPT(config)
model.load_state_dict(checkpoint["model_state_dict"])

# Deterministic, repetitive (T=0.1, close to greedy)
print("=== Low temperature (0.1) ===")
print(generate(model, "To be or not to be", stoi, itos, temperature=0.1, top_k=40))

# Balanced (T=0.8, default)
print("\n=== Balanced (0.8) ===")
print(generate(model, "To be or not to be", stoi, itos, temperature=0.8, top_k=40))

# Creative, potentially incoherent (T=1.5)
print("\n=== High temperature (1.5) ===")
print(generate(model, "To be or not to be", stoi, itos, temperature=1.5, top_k=40))

# Very restrictive (T=0.8, top_k=10)
print("\n=== Restrictive top-k (k=10) ===")
print(generate(model, "To be or not to be", stoi, itos, temperature=0.8, top_k=10))
You've built a complete text generation pipeline! Your model can now write text using autoregressive generation with temperature sampling and top-k filtering. Here are the key concepts to remember as you move forward.
When generated text looks wrong, the cause is almost always one of five things. Here is a diagnostic guide:
Cause: Model hasn't trained long enough, or val loss is still very high (>3.0). Fix: Train more steps. Check that training loss is actually decreasing.
Cause: Early-to-mid training. The model has learned vocabulary but not grammar or context. Fix: This is normal; keep training. This phase typically ends by val loss ~2.0.
Cause: Temperature too low, or greedy/near-greedy sampling. Fix: Raise temperature to 0.7–0.9. Add or raise top-k.
Cause: Model is overfitting , it has memorised training examples. Fix: Load an earlier checkpoint (before val loss started rising). Increase dropout.
Cause: Temperature too high, or top-k too large (sampling from the long tail). Fix: Lower temperature to 0.6–0.8. Set top-k=20–40.
The output reads like a plausible new sample from the training domain. Different each run but recognisably in the right style. Val loss typically 1.5–1.8 for Shakespeare at this model size.
# Quick diagnostic: check val loss before adjusting sampling params
# If val_loss > 3.0 → more training, sampling params don't matter yet
# If val_loss 1.8-2.5 → mid-quality model, temperature is the main lever
# If val_loss < 1.5 → good model, fine-tune sampling for best output
# If val_loss rising → overfitting, load earlier checkpoint

# Always check these three things first:
# 1. model.eval() is called before generate
# 2. @torch.no_grad() is active (speeds up and prevents memory issues)
# 3. Correct checkpoint loaded (stoi/itos match the model's training data)
sample_week4.txt. Then load two other checkpoints (one from earlier in training and one from later) and generate the same prompt with each. Put all three samples side by side in the file. Label them with their step number and val loss. This is your week 4 artefact: proof that your model can write, and a record of how generation quality evolves with training.
Train on real data, monitor progress, and experiment with model configurations
Week 5 Focus: Train → Generate → Experiment
Organize model.py, train.py, generate.py, and understand how the files interact
The project is split into four Python files plus a data directory and a checkpoints directory. Each file has a single, clear responsibility.
llm-from-scratch/
├── data/
│ └── input.txt # Training corpus (e.g. Shakespeare, ~1 MB)
├── model.py # GPT class: nothing but architecture
├── train.py # Training loop, optimizer, checkpointing
├── generate.py # Load checkpoint and sample text
├── utils.py # Data loading, batching, tokenization helpers
└── checkpoints/
    └── ckpt_05000.pt # Saved model weights + optimizer state
The separation matters because it enforces what machine-learning engineers call the train/infer split: the code path for training is completely separate from the code path for inference. Mixing them in one script leads to bugs where training-only operations (dropout, gradient accumulation) accidentally run during generation.
model.py should contain exactly one thing: the GPTLanguageModel class built in Weeks 1–4. No training loop, no data loading, no file I/O. When another file needs the model, it does from model import GPTLanguageModel and that is all.
# model.py (skeleton)
import torch
import torch.nn as nn

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        # Block: the transformer block class, assumed to be defined earlier in this file
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (B, T) token IDs → returns logits (B, T, V) and optional loss
        ...
Keeping all hyperparameters as constructor arguments (not module-level globals) means you can instantiate multiple model sizes in the same Python session, which is essential for the experiments in Days 33 and 34.
train.py is responsible for everything that happens during the learning phase: loading data, constructing batches, running the forward/backward pass, updating weights, logging losses, and saving checkpoints. It imports GPTLanguageModel from model.py and helpers from utils.py.
# train.py (top-level structure)
import torch
from model import GPTLanguageModel
from utils import get_batch, load_data

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 1. Hyperparameters. Architecture and training settings are kept apart
#    because only the architecture keys are constructor arguments.
#    dropout=0.1 is an assumed default; the constructor requires a value.
model_config = {'n_layer': 6, 'n_head': 6, 'n_embd': 384, 'block_size': 256, 'dropout': 0.1}
train_config = {'batch_size': 64, 'lr': 3e-4, 'max_steps': 5000}

# 2. Data
train_data, val_data, vocab_size = load_data('data/input.txt')

# 3. Model + optimizer
model = GPTLanguageModel(vocab_size, **model_config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=train_config['lr'])

# 4. Training loop
for step in range(train_config['max_steps']):
    xb, yb = get_batch(train_data, model_config['block_size'], train_config['batch_size'])
    xb, yb = xb.to(device), yb.to(device)
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
generate.py loads a saved checkpoint, reconstructs the model with the same hyperparameters used during training, and calls model.generate(). It never touches an optimizer or computes gradients, and it puts the model in eval() mode first.
utils.py holds the glue code that both files need: reading input.txt, building the character vocabulary, encoding/decoding strings to integer IDs, and the get_batch() function that slices random windows from the training data.
config.py file that defines the TINY, SMALL, and MEDIUM configuration dicts and import them in both train.py and generate.py. This ensures the model you generate with is always the same size as the model you trained; a size mismatch is a bug that is easy to make and hard to debug if the configs live in separate files.
model.py with the GPTLanguageModel class stub, train.py with the config dict and training loop structure, generate.py with checkpoint loading logic, and utils.py with load_data() and get_batch() stubs. Verify you can import from one file to another without errors.
model.py defines the architecture; train.py runs the learning loop; generate.py runs inference; utils.py holds shared helpers. This separation lets you swap out any single piece without touching the others.
Configure a GPU runtime, install dependencies, upload your files, and verify everything works
The core operation in every transformer layer is a matrix multiply: Q @ K.T and attention_weights @ V. A CPU executes matrix multiplies sequentially on a small number of cores. A GPU has thousands of tiny cores optimised to run the same arithmetic on thousands of values in parallel.
~45 minutes per 5 000-step run on the medium config. Fine for testing one config, painful for experiments.
~6–8 minutes per 5 000-step run. Fast enough to iterate through four or five hyperparameter combinations in a single session.
Colab defaults to a CPU runtime. You must switch before running any code:
Verify the GPU is active by running this in the first cell:
import torch
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # Tesla T4
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# VRAM: 15.8 GB
Colab has PyTorch pre-installed, but you still need tiktoken and a few utilities. Run this in a cell:
!pip install tiktoken --quiet

import os, urllib.request

# Create project directories
for d in ['data', 'checkpoints']:
    os.makedirs(d, exist_ok=True)

# Download Shakespeare training data (~1 MB)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "data/input.txt")

with open("data/input.txt") as f:
    text = f.read()
print(f"Corpus: {len(text):,} characters, ~{len(text)//1000} KB")
# Corpus: 1,115,394 characters, ~1115 KB
Colab notebooks are ephemeral: the filesystem is wiped when the runtime disconnects. The cleanest workflow is to keep your .py files in a GitHub repo and clone them at the start of each session:
# Option A: clone from GitHub (recommended)
!git clone https://github.com/YOUR_USERNAME/llm-from-scratch.git
%cd llm-from-scratch

# Option B: upload files manually via Colab's file browser
# (Left sidebar → Files icon → Upload)

# Option C: write files directly in notebook cells using %%writefile
# %%writefile model.py
# import torch ...
!free -h (RAM) and !df -h / (disk). Colab free tier gives you ~12 GB RAM and ~100 GB disk. Knowing these limits helps you decide how large a model and dataset you can use before running out of memory.
torch.cuda.is_available() returns True and the device name shows "Tesla T4". Install tiktoken, create the project directories, and download the Shakespeare corpus. Print the corpus length to confirm everything works before moving to training.
torch.cuda.is_available() before starting a long training run. A training job that silently falls back to CPU will appear to work but will be 30–100× slower; you will only notice hours later when you check the elapsed time.
Walk through train.py line by line, understand gradient clipping, and interpret the training output
train.py does, and why it is in that order, will help you debug anything that goes wrong.
Here is the core training loop with every line annotated:
model.train()  # activates dropout; must be set for training

for step in range(max_steps):
    # ── 1. Sample a random batch ────────────────────────────────────
    xb, yb = get_batch('train')
    # xb shape: (batch_size, block_size), input token IDs
    # yb shape: (batch_size, block_size), target token IDs (xb shifted by 1)

    # ── 2. Forward pass ─────────────────────────────────────────────
    logits, loss = model(xb, yb)
    # logits: (B, T, vocab_size), raw scores before softmax
    # loss: scalar cross-entropy between logits and yb

    # ── 3. Zero out stale gradients ─────────────────────────────────
    optimizer.zero_grad(set_to_none=True)
    # set_to_none=True is faster than zeroing: frees the memory entirely

    # ── 4. Backward pass ────────────────────────────────────────────
    loss.backward()
    # PyTorch walks the computation graph and fills .grad on every parameter

    # ── 5. Gradient clipping ────────────────────────────────────────
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    # Rescales gradients so their global norm ≤ 1.0
    # Prevents a single bad batch from causing a catastrophic weight update

    # ── 6. Weight update ────────────────────────────────────────────
    optimizer.step()
    # AdamW adjusts each parameter using its gradient + per-param momentum
Without clipping, a single batch that produces an unusually large gradient can push the model weights into a region where the loss explodes: you will see the loss suddenly jump to nan or a very large number. Clipping at max_norm=1.0 rescales the entire gradient vector (not individual gradients) so its magnitude is at most 1.0. This is cheap to compute and makes training dramatically more stable.
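To make "rescales the entire gradient vector" concrete, here is a minimal pure-Python sketch of clipping by global norm. The helper name clip_by_global_norm and the toy gradients are illustrative assumptions; in train.py the real work is done by torch.nn.utils.clip_grad_norm_, and this sketch only mirrors the idea on plain lists.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one shared factor so their global L2 norm is at most max_norm.

    grads is a list of flat lists, one per (hypothetical) parameter tensor.
    """
    # Global norm: square root of the sum of squares over every gradient entry
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # One shared scale factor; never scale gradients up
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Two toy "parameters" whose combined gradient has norm sqrt(3^2 + 4^2) = 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))
print(norm_before, norm_after)
```

Because every entry is multiplied by the same factor, the direction of the update is unchanged; only its length shrinks.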
Rare but catastrophic: one unlucky batch sends loss to nan. Training has to be restarted from the last checkpoint.
Worst-case update size is bounded. Training continues smoothly through hard batches. Small constant cost per step.
Every 500 steps, train.py evaluates on the validation set and prints a status line. Here is what each field means:
step  500 | train 4.312 | val 4.358 | lr 3.00e-04 | 2.1s/step
step 1000 | train 3.641 | val 3.702 | lr 3.00e-04 | 2.0s/step
step 2000 | train 2.894 | val 2.983 | lr 2.41e-04 | 2.0s/step
step 3000 | train 2.341 | val 2.476 | lr 1.64e-04 | 2.0s/step
step 5000 | train 1.821 | val 2.094 | lr 5.00e-05 | 2.0s/step
Save a checkpoint periodically so you can resume training after a Colab disconnect or load a specific model state for generation experiments:
def save_checkpoint(model, optimizer, step, val_loss, config, path):
    torch.save({
        'step': step,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'val_loss': val_loss,
        'config': config,  # save hyperparams too (including vocab_size) so the model can be rebuilt
    }, path)

# Inside the training loop: save every 500 steps, right after evaluation.
# Counting completed steps from 1 gives files ckpt_00500.pt ... ckpt_05000.pt.
if (step + 1) % 500 == 0:
    save_checkpoint(model, optimizer, step + 1, val_loss, model_config,
                    f'checkpoints/ckpt_{step + 1:05d}.pt')
max_steps=200 before committing to a full run. Verify the loss decreases, the checkpoint saves, and the generated sample text is at least recognisable as attempting to produce English. Only then increase max_steps to 5000.
lr each step), and a checkpoint is saved. Load the checkpoint back and confirm model.eval() works. Only then increase max_steps to 5000 for a full run.
backward() before zero_grad(), gradients from the previous step accumulate and your updates will be wrong. Always save the model config alongside the weights so you can reconstruct the exact architecture for inference later.
Load a checkpoint, understand temperature and top-k sampling, and generate text with different strategies
temperature and top_k let you navigate the tradeoff between diversity and coherence.
The checkpoint saved by train.py contains both model weights and the config dict. generate.py must reconstruct the same architecture before loading the weights, otherwise PyTorch will raise a shape mismatch error:
import torch
from model import GPTLanguageModel

ckpt = torch.load('checkpoints/ckpt_05000.pt', map_location='cpu')
config = ckpt['config']  # saved alongside the weights
model = GPTLanguageModel(**config)
model.load_state_dict(ckpt['model_state'])
model.eval()  # disables dropout; critical for inference
print(f"Loaded step {ckpt['step']}, val loss {ckpt['val_loss']:.4f}")
Before sampling, the model's raw output (logits) is divided by a temperature scalar. This changes the sharpness of the probability distribution:
# Inside model.generate()
logits = logits[:, -1, :] / temperature  # (B, vocab_size)
probs = torch.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
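A small worked example may help show what dividing by temperature does to the distribution. This is a sketch on plain Python lists with made-up logits; softmax_with_temperature is a hypothetical helper, not part of the course code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy scores for a three-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)    # T < 1: more peaked
neutral = softmax_with_temperature(logits, 1.0)  # T = 1: plain softmax
flat = softmax_with_temperature(logits, 2.0)     # T > 1: closer to uniform

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The ranking of tokens never changes; only how much probability mass the top token keeps.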
With a vocabulary of 50 000 tokens, even very low-probability tokens can occasionally be sampled. Top-k sampling zeroes out all but the k highest-probability tokens before sampling, eliminating outlier garbage:
def top_k_sample(logits, k, temperature=1.0):
    logits = logits / temperature
    k = min(k, logits.size(-1))          # torch.topk raises an error if k exceeds the vocabulary size
    # Keep only the top-k logits; set the rest to -inf
    top_vals, _ = torch.topk(logits, k)
    threshold = top_vals[..., -1, None]  # kth largest value
    logits = logits.masked_fill(logits < threshold, float('-inf'))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
Typical values: top_k=40 for creative writing, top_k=10 for more focused outputs. Setting top_k=1 is equivalent to greedy decoding.
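The claim that top_k=1 reduces to greedy decoding can be checked on a toy example. The sketch below works on plain lists rather than tensors; top_k_probs is a hypothetical helper that returns the filtered distribution instead of sampling from it.

```python
import math

def top_k_probs(logits, k):
    """Return the sampling distribution after keeping only the k largest logits."""
    k = min(k, len(logits))                      # k larger than the vocabulary means no filtering
    threshold = sorted(logits, reverse=True)[k - 1]
    masked = [x if x >= threshold else float("-inf") for x in logits]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]     # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
greedy = top_k_probs(logits, 1)     # all mass on the single best token
top2 = top_k_probs(logits, 2)       # two candidates survive
unfiltered = top_k_probs(logits, 10)  # k above vocabulary size: nothing is removed

print(greedy, top2, unfiltered)
```

With k=1 the distribution puts probability 1 on the argmax, so sampling from it always returns the same token as greedy decoding.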
temperature=0.7, top_k=40, then five more with temperature=1.2, top_k=200. Compare quality and diversity. Which setting gives more coherent Shakespearean language? Which gives more surprising (if sometimes broken) outputs?
Load checkpoints from different points in training and observe how output quality evolves. This concretely shows what "the model is learning" means:
# Assumes model, prompt_ids, and decode() are already set up as in generate.py
for step_num in [500, 1000, 2000, 5000]:
    ckpt = torch.load(f'checkpoints/ckpt_{step_num:05d}.pt', map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    model.eval()
    generated = model.generate(prompt_ids, max_new_tokens=100)
    print(f"\n--- Step {step_num} ---")
    print(decode(generated[0].tolist()))
temperature=0.5, top_k=10 for conservative output, (2) temperature=0.8, top_k=40 for balanced output, and (3) temperature=1.2, top_k=200 for creative output. Compare the results: which setting produces the most coherent Shakespearean language? Save your favorite output.
model.eval() before generating. Forgetting this leaves dropout active, which randomly drops connections during inference and makes outputs non-deterministic in the wrong way. Temperature controls how peaked the distribution is; top-k controls which tokens are even considered. Together they give you full control over the diversity-quality tradeoff.
Calculate parameter counts, measure training speed, and compare output quality across tiny, small, and medium models
The dominant cost is the embedding table and the attention + MLP blocks. A rough formula for a GPT-style model:
# Approximate parameter count
embedding_params = vocab_size * n_embd   # token embed table
position_params = block_size * n_embd    # position embed table
per_block = (
    4 * n_embd * n_embd +   # Q, K, V, out projections in attention
    8 * n_embd * n_embd     # two linear layers in MLP (4x expansion)
)
total = embedding_params + position_params + n_layer * per_block

# Verify against PyTorch:
total_actual = sum(p.numel() for p in model.parameters())
print(f"{total_actual / 1e6:.2f} M parameters")
Run these three configurations sequentially and record the results:
TINY = {
'n_layer': 2, 'n_head': 2, 'n_embd': 128,
'block_size': 128, 'batch_size': 32, 'max_steps': 3000
} # ~0.4 M params by the rough formula above, ~2 min on T4
SMALL = {
'n_layer': 4, 'n_head': 4, 'n_embd': 256,
'block_size': 256, 'batch_size': 64, 'max_steps': 5000
} # ~3.2 M params by the rough formula above, ~6 min on T4
MEDIUM = {
'n_layer': 6, 'n_head': 6, 'n_embd': 384,
'block_size': 256, 'batch_size': 64, 'max_steps': 5000
} # ~10.7 M params, ~8 min on T4
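As a cross-check, the rough formula above can be applied to the three configurations directly. This sketch assumes a character-level vocabulary of 65 symbols (the Shakespeare value quoted elsewhere in this course) and ignores biases, LayerNorm weights, and the output head, so PyTorch's exact count will come out somewhat higher. The function name approx_params is illustrative.

```python
def approx_params(n_embd, n_layer, block_size, vocab_size):
    """Rough GPT parameter count: embeddings plus 12 * n_embd^2 per transformer block."""
    embedding = vocab_size * n_embd      # token embedding table
    position = block_size * n_embd       # position embedding table
    per_block = 12 * n_embd * n_embd     # 4*n_embd^2 attention + 8*n_embd^2 MLP
    return embedding + position + n_layer * per_block

VOCAB_SIZE = 65  # assumption: character-level Shakespeare vocabulary

configs = {
    "TINY":   dict(n_embd=128, n_layer=2, block_size=128),
    "SMALL":  dict(n_embd=256, n_layer=4, block_size=256),
    "MEDIUM": dict(n_embd=384, n_layer=6, block_size=256),
}

counts = {name: approx_params(vocab_size=VOCAB_SIZE, **cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.2f} M parameters (approximate)")
```

Comparing these estimates with sum(p.numel() for p in model.parameters()) is a quick way to catch a config that did not reach the model.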
GPU memory is consumed by three things: model weights, activations (all intermediate tensors kept for backprop), and optimizer state (AdamW stores two extra tensors per parameter). A rough estimate:
# Memory estimate in GB (float32)
weights_gb = total_params * 4 / 1e9    # 4 bytes per float32
optimizer_gb = weights_gb * 2          # Adam m1 + m2 states
activation_gb = batch_size * block_size * n_embd * n_layer * 4 / 1e9

# MEDIUM config on T4 (15.8 GB VRAM)
# weights: ~0.04 GB, optimizer: ~0.08 GB, activations: ~0.15 GB by this formula
# (real activation memory is several times higher, because each block keeps
# several intermediate tensors for backprop, not just one)
# Total: well within 15.8 GB, so there is room to increase batch_size
If you get an OutOfMemoryError, the first levers to pull are: reduce batch_size by half, then reduce block_size, then reduce n_embd.
temperature=0.8, top_k=40. Score them subjectively on: (1) correct English words, (2) correct punctuation, (3) stylistic coherence with Shakespeare. Does the MEDIUM model's quality justify its longer training time relative to SMALL?
O(n_embd² × n_layer). Doubling n_embd quadruples the parameter count. In practice the SMALL model (about 3.2 M params) achieves most of the quality gains of MEDIUM (10.7 M params) for the Shakespeare task because the dataset is small enough that even SMALL can model it well. Bigger models pay off on larger, more diverse datasets.
Understand the quadratic cost of attention, learning rate warmup and cosine decay, and how to read loss curves
Self-attention computes a score for every token pair in the context. For a sequence of length T, that is T² pairs per head. Doubling block_size quadruples the attention computation and roughly doubles total memory.
# Context length experiment (hold model size constant)
for ctx in [64, 128, 256, 512]:
    config = {**SMALL, 'block_size': ctx, 'max_steps': 3000}
    train_and_log(config, tag=f'ctx_{ctx}')

# Expected val loss at 3000 steps (approximately):
# ctx=64:  2.81 (short context, misses long-range dependencies)
# ctx=128: 2.54
# ctx=256: 2.31 (sweet spot for Shakespeare)
# ctx=512: 2.28 (marginal gain, 2× memory cost)
For the Shakespeare corpus, passages rarely require more than ~200 characters of context to make sense. Beyond block_size=256 the improvement in validation loss becomes small relative to the memory and compute cost.
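The quadratic growth is easy to verify by counting attention-score entries. The sketch below counts only the T × T score matrices (one per head per layer) and leaves out the projections and MLP, which scale linearly in T; the function name and the SMALL-like settings are assumptions for illustration.

```python
def attention_score_entries(block_size, n_head, n_layer, batch_size=1):
    """Number of entries across all T x T attention-score matrices in one forward pass."""
    return batch_size * n_layer * n_head * block_size * block_size

# Assumed SMALL-like settings: 4 layers, 4 heads
baseline = attention_score_entries(128, n_head=4, n_layer=4)
ratios = {}
for ctx in [64, 128, 256, 512]:
    entries = attention_score_entries(ctx, n_head=4, n_layer=4)
    ratios[ctx] = entries / baseline
    print(f"block_size={ctx:4d}: {entries:>10,} score entries ({ratios[ctx]:.2f}x vs 128)")
```

Each doubling of block_size multiplies the count by four, which is why going from 128 to 512 costs 16 times as much for the score matrices.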
A fixed learning rate is rarely optimal. The standard schedule used in modern LLMs has two phases:
Warmup: the learning rate rises linearly from 0 to lr_max. Early in training, weights are random and gradients are large and noisy. A high LR at step 0 would cause chaotic updates. Warmup lets the model stabilise first.
Cosine decay: the learning rate then falls from lr_max to lr_min. As training converges the model is close to a good solution; smaller steps make finer adjustments without overshooting.
import math

def get_lr(step, warmup_steps, max_steps, lr_max, lr_min):
    if step < warmup_steps:
        return lr_max * step / warmup_steps  # linear warmup
    decay = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * decay))
    return lr_min + cosine * (lr_max - lr_min)  # cosine decay

# Apply at each step:
lr = get_lr(step, warmup_steps=100, max_steps=5000, lr_max=3e-4, lr_min=3e-5)
for g in optimizer.param_groups:
    g['lr'] = lr
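To see the shape of this schedule, the sketch below restates get_lr so it runs on its own and evaluates it at a few steps. The step values are chosen for illustration: the start, mid-warmup, the end of warmup, the midpoint of the decay, and the final step.

```python
import math

def get_lr(step, warmup_steps, max_steps, lr_max, lr_min):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    decay = (step - warmup_steps) / (max_steps - warmup_steps)  # 0 at end of warmup, 1 at max_steps
    cosine = 0.5 * (1 + math.cos(math.pi * decay))              # 1 at start of decay, 0 at the end
    return lr_min + cosine * (lr_max - lr_min)

schedule = {
    step: get_lr(step, warmup_steps=100, max_steps=5000, lr_max=3e-4, lr_min=3e-5)
    for step in [0, 50, 100, 2550, 5000]
}
for step, lr in schedule.items():
    print(f"step {step:5d}: lr = {lr:.2e}")
```

The rate starts at zero, reaches lr_max exactly when warmup ends, passes through the average of lr_max and lr_min halfway through the decay, and finishes at lr_min.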
Run three training jobs for 2000 steps each and observe the loss curves:
# Conservative: slow but stable
python train.py --lr-max 1e-4 --warmup 50 --tag lr_conservative

# Recommended: fast convergence, stable
python train.py --lr-max 3e-4 --warmup 100 --tag lr_default

# Aggressive: fastest early drop, risk of instability
python train.py --lr-max 1e-3 --warmup 200 --tag lr_aggressive
block_size ∈ {128, 256} × lr_max ∈ {1e-4, 3e-4}. Train each configuration for 3000 steps and record the final validation loss. Which combination gives the best loss? Does the optimal learning rate change when you increase the context length?
O(T²) in the context length T: increasing block_size from 128 to 512 uses 16× more compute for the attention layers alone. Always use a learning rate schedule: warmup prevents early instability; cosine decay squeezes out the last fraction of quality at the end of training. For small datasets like Shakespeare, lr_max=3e-4 with 100 warmup steps is a reliable default.
Plot loss curves with matplotlib, diagnose training problems, understand perplexity, and preview Week 6
Save a JSON log during training so you can plot it later without re-running anything:
import json

log = []
for step in range(max_steps):
    # ... training step ...
    if step % eval_interval == 0:
        # float() keeps the log JSON-serializable even if estimate_loss returns a tensor
        train_loss = float(estimate_loss('train'))
        val_loss = float(estimate_loss('val'))
        log.append({'step': step, 'train': train_loss, 'val': val_loss})

# Save after training finishes
with open('checkpoints/loss_log.json', 'w') as f:
    json.dump(log, f)
Use a separate estimate_loss() function (not the batch loss) that averages over 200 random batches on each split. This gives a much more stable estimate than the noisy single-batch loss from the training loop.
import json
import matplotlib.pyplot as plt

with open('checkpoints/loss_log.json') as f:
    log = json.load(f)

steps = [e['step'] for e in log]
train_loss = [e['train'] for e in log]
val_loss = [e['val'] for e in log]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(steps, train_loss, label='Train loss', lw=2)
ax.plot(steps, val_loss, label='Val loss', lw=2, linestyle='--')
ax.set_xlabel('Step')
ax.set_ylabel('Cross-entropy loss')
ax.set_title('Training dynamics')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('loss_curve.png', dpi=150)
plt.show()
A healthy run has both curves decreasing together and levelling off at similar values. Here are the four failure patterns to watch for:
max_norm=1.0) and lower lr_max by 3×.
max_steps, or check that the data pipeline is working correctly.
eval_batches count is too low; increase to 200+ batches so the estimate is stable enough to be informative.
Cross-entropy loss is convenient for training but hard to interpret: is a loss of 2.1 good or bad? Perplexity converts loss into an intuitive number: it measures the average number of tokens the model is "confused between" at each position.
import math

perplexity = math.exp(val_loss)

# Examples:
# val_loss = 4.0 → perplexity ≈ 55 (confused between ~55 tokens)
# val_loss = 2.5 → perplexity ≈ 12 (confused between ~12 tokens)
# val_loss = 1.5 → perplexity ≈ 4.5 (almost certain, ~4-5 candidates)
# GPT-2 small reports ~37.5 zero-shot perplexity on WikiText-103
# (token-level, so not directly comparable to character-level numbers)
# A uniform random baseline has perplexity equal to vocab_size ≈ 65 (char-level Shakespeare)
A character-level model on Shakespeare with val_loss ≈ 1.5 (perplexity ≈ 4.5) is performing well: from 65 possible characters, it narrows it down to about 4 or 5 at each step.
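The relationship between loss and perplexity can be checked in a few lines. In this sketch, uniform_loss is a hypothetical helper giving the cross-entropy of a model that guesses uniformly over the vocabulary; 65 is the character-level Shakespeare vocabulary size used above.

```python
import math

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)

def uniform_loss(vocab_size):
    """Cross-entropy of a model that assigns probability 1/vocab_size to every token."""
    return math.log(vocab_size)

baseline_loss = uniform_loss(65)
print(f"uniform baseline: loss {baseline_loss:.3f}, perplexity {perplexity(baseline_loss):.1f}")
for loss in [4.0, 2.5, 1.5]:
    print(f"val_loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")
```

A useful sanity check follows from this: an untrained character-level model should start near a loss of ln(65), and a first logged loss far above that usually points to an initialization or data bug.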
Optimize your model, fine-tune on custom data, and chart your path forward
Week 6 Focus: Expanding your experience
Why data quality is the highest-leverage variable, and how to curate aggressively
Every token in your training set is a teaching example. If 20% of your data is headers, metadata, encoding errors, or off-domain text, your model spends 20% of its capacity learning those patterns. The gradient doesn't distinguish signal from noise; it optimizes for whatever is in the file.
Headers, footers, navigation menus, repeated boilerplate, encoding artifacts, mixed languages. Model learns to generate any of these with equal probability.
Consistent format, high semantic density, no noise. Model's capacity is entirely spent on the patterns you actually want.
The best sources for small-scale training are collections with consistent format, professional editorial quality, and a clear domain:
Aim for a single domain. A model trained on Shakespeare + Python code + Twitter will be worse at each than one trained on just Shakespeare.
Most datasets need the same four cleaning steps, in this order:
import re

def clean_corpus(text):
    # 1. Strip non-content lines (headers, footers, page numbers)
    lines = [l for l in text.splitlines() if not looks_like_metadata(l)]
    # 2. Normalize whitespace (collapse multiple blank lines)
    text = re.sub(r'\n{3,}', '\n\n', '\n'.join(lines))
    # 3. Filter by document length (remove stubs and walls of text)
    docs = text.split('\n\n')
    docs = [d for d in docs if 50 < len(d.split()) < 2000]
    # 4. Deduplicate (exact match is enough at small scale)
    seen = set()
    unique = []
    for d in docs:
        key = d.strip()  # compare whole documents, so texts that only share an opening are kept
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return '\n\n'.join(unique)

def looks_like_metadata(line):
    line = line.strip()
    if not line:
        return False  # keep blank lines: they separate documents in step 3
    return (len(line) < 4 or
            line.startswith('***') or
            (line.isupper() and len(line) < 40))
How to match model capacity to data size , and why getting it wrong in either direction costs you
Two hyperparameters control the vast majority of the parameter count: n_embd (embedding dimension) and n_layer (transformer blocks). n_head (attention heads) only splits n_embd across heads and does not change the count. The dominant term scales as roughly 12 × n_embd² × n_layer:
def count_params(n_embd, n_layer, vocab_size=50257):
    embed = vocab_size * n_embd + 1024 * n_embd  # token table + position table (block_size=1024)
    per_block = 4 * n_embd * n_embd + 8 * n_embd * n_embd
    return embed + per_block * n_layer

print(count_params(384, 6))   # ~30M
print(count_params(512, 8))   # ~51M
print(count_params(768, 12))  # ~124M (GPT-2 small)
The 2022 Chinchilla paper showed that for a given compute budget, a smaller model trained on more data beats a larger model trained for fewer steps. The optimal ratio is roughly 20 tokens of training data per parameter:
At small scale (1–20MB datasets), this means you should use a much smaller model than you might think. The right move is to get more data, not a bigger model.
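The 20-tokens-per-parameter rule of thumb turns into simple arithmetic. The sketch below is only an illustration: the ratio was derived for large-scale training runs, so applying it to megabyte-sized corpora is an assumption, and the one-token-per-byte figure holds only for a character-level tokenizer on mostly ASCII text.

```python
TOKENS_PER_PARAM = 20  # approximate ratio quoted above; treat it as a rule of thumb

def tokens_needed(n_params):
    """Training tokens suggested for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def params_supported(n_tokens):
    """Largest parameter count that n_tokens of training data supports under the same ratio."""
    return n_tokens // TOKENS_PER_PARAM

# Assumption: a 5 MB corpus with a character-level tokenizer is roughly 5M tokens
corpus_tokens = 5_000_000
print(f"5M tokens supports about {params_supported(corpus_tokens):,} parameters")
print(f"A 10M-parameter model would want about {tokens_needed(10_000_000):,} tokens")
```

On these numbers even the smallest configurations in this course are far larger than the ratio suggests for a few megabytes of text, which is why overfitting shows up so quickly and why more data helps more than more parameters.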
Start small and scale up only when val loss has plateaued with more training steps:
# ~10M parameters (excluding embeddings): good for 1-5MB curated datasets
config_small = dict(n_layer=6, n_head=6, n_embd=384, block_size=256)

# ~25M parameters (excluding embeddings): good for 5-20MB datasets
config_medium = dict(n_layer=8, n_head=8, n_embd=512, block_size=512)

# ~85M parameters excluding embeddings (GPT-2 Small): needs 20MB+ of clean data
config_large = dict(n_layer=12, n_head=12, n_embd=768, block_size=1024)
If val loss is still dropping at the end of training, you need more data or epochs, not a bigger model. If val loss diverges from train loss early, you need more data or a smaller model.
config_small and config_medium on your curated dataset for the same number of steps. Plot both train and val loss curves on the same graph. Which model is still improving at the end? Which has started to overfit? Use this to decide whether to scale up or get more data first.
How tokenizer choice determines sequence length, attention cost, and training speed
Transformer attention is quadratic in sequence length. Double the sequence length and attention takes 4× as long and uses 4× the memory. This means the tokenizer choice has a direct, measurable impact on training speed:
13 tokens. Vocabulary: ~100 chars. A 5MB corpus → ~5M tokens → very long sequences. Simple to implement.
4 tokens. Vocabulary: 50,257. A 5MB corpus → ~1.25M tokens. 4× shorter sequences = 16× less attention compute.
For most projects, use a pretrained tokenizer rather than training your own. The GPT-2 BPE tokenizer is compact, well-tested, and handles Unicode cleanly:
import tiktoken
import numpy as np

enc = tiktoken.get_encoding("gpt2")  # 50,257 vocab size
with open("corpus.txt") as f:
    text = f.read()
tokens = enc.encode(text)
print(f"Characters: {len(text):,}")
print(f"Tokens: {len(tokens):,}")
print(f"Ratio: {len(text)/len(tokens):.1f} chars per token")

# Save as binary for fast DataLoader access
# (uint16 is enough because every GPT-2 token ID is below 65,536)
arr = np.array(tokens, dtype=np.uint16)
arr.tofile("corpus.bin")
For specialized domains (chemistry notation, a specific programming language, musical scores), a custom BPE tokenizer learns the most common subwords in your actual corpus and can outperform a general-purpose one:
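from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# Without a matching decoder, decode() returns the byte-level symbols (e.g. Ġ for a space)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=5000,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
    # Include every byte symbol so characters absent from the corpus still encode
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Verify round-trip integrity
enc = tokenizer.encode("Hello, world!")
assert tokenizer.decode(enc.ids) == "Hello, world!"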
Squeeze the last performance out of training: context length, dropout, early stopping, and learning rate
A larger context window lets the model condition on more history, which is better for long-range structure. But it scales attention compute quadratically, so it costs real training time. Set it based on your actual document lengths:
import numpy as np
# Measure your corpus before choosing block_size
doc_lengths = [len(doc.split()) for doc in your_documents]
print(f"Median doc length: {np.median(doc_lengths):.0f} words")
print(f"95th percentile: {np.percentile(doc_lengths, 95):.0f} words")
# Set block_size near the 95th percentile , no point paying for unused context
# Rough guidelines:
# block_size=256 → 1-5MB datasets, short documents
# block_size=512 → 5-20MB datasets, paragraph-length documents
# block_size=1024 → GPT-2 setting, needs 20MB+ data
Dropout randomly zeros activations during training, forcing the model to learn redundant representations and reducing overfitting. Use a low rate, because high dropout can prevent learning entirely:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.drop = nn.Dropout(config.dropout)  # 0.1 is a safe default

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)  # positions 0..T-1
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.drop(tok_emb + pos_emb)  # Applied after embedding sum
        # ... rest of forward
Always call model.eval() before generating: this disables dropout, so each forward pass is deterministic and uses the full model capacity.
Training to the final step is almost never optimal. Val loss improves, plateaus, then rises (overfitting). Track it explicitly and save the best checkpoint:
best_val_loss = float('inf')
patience, max_patience = 0, 5

for step in range(max_steps):
    train_step(model, optimizer, batch)
    if step % eval_interval == 0:
        val_loss = evaluate(model, val_data)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience = 0
            torch.save(model.state_dict(), 'checkpoint_best.pt')
        else:
            patience += 1
            if patience >= max_patience:
                print(f"Early stopping at step {step}")
                break

# Always load best checkpoint before generating
model.load_state_dict(torch.load('checkpoint_best.pt'))
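The patience logic can be exercised without a model by replaying a list of validation losses. This is a sketch under stated assumptions: early_stop_index is a hypothetical helper, and the loss values are made up to show a run that improves, plateaus, and then degrades.

```python
def early_stop_index(val_losses, max_patience=5):
    """Replay the patience rule over recorded val losses.

    Returns (best_index, stop_index); stop_index is None if patience never runs out.
    """
    best_loss, best_index, patience = float("inf"), None, 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, patience = loss, i, 0   # new best: reset patience
        else:
            patience += 1                                   # no improvement at this evaluation
            if patience >= max_patience:
                return best_index, i
    return best_index, None

# Made-up evaluations: improvement, then three evaluations in a row without a new best
losses = [2.5, 2.1, 1.9, 1.8, 1.82, 1.85, 1.9]
best_index, stop_index = early_stop_index(losses, max_patience=3)
print(f"best eval #{best_index} (loss {losses[best_index]}), stopped at eval #{stop_index}")
```

Note that patience counts evaluations, not training steps, so the effective wait before stopping is max_patience × eval_interval steps.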
The learning rate is the most sensitive hyperparameter. Use warmup for the first 1–5% of steps, then cosine decay to 10% of the peak LR:
3e-4 to 5e-4
1e-4 to 3e-4
6e-5 to 1e-4
Temperature, top-k, prompting, and how to develop taste for your model's output
Dividing logits by temperature before softmax changes how "peaked" the distribution is. Low temperature concentrates probability on likely tokens (focused, repetitive). High temperature flattens it (creative, incoherent). Temperature is not a preference setting; it changes the math:
@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=0.8, top_k=40):
    model.eval()
    x = torch.tensor([prompt_ids], dtype=torch.long)
    for _ in range(max_new_tokens):
        # Crop to the last block_size tokens; this forward() returns logits only
        logits = model(x[:, -model.config.block_size:])[0, -1, :]
        logits = logits / temperature  # <1.0 sharpens, >1.0 flattens
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[-1]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_token.unsqueeze(0)], dim=1)
    return x[0].tolist()
Typical ranges: 0.6–0.8 for structured text (code, templates), 0.8–1.0 for balanced creative text, 1.0–1.3 for maximum variety (risks incoherence).
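To see the effect on actual numbers, here is a small standard-library sketch. The logits are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of the generate script:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, using plain Python lists."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Run it and compare the first column of each row: the top token's share shrinks as temperature rises, while the tail tokens gain probability.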
Even at reasonable temperatures, the lowest-probability tokens in a 50k vocabulary are garbage: rare Unicode, mangled words, encoding artifacts. Top-k filtering zeros out everything outside the top-k candidates before sampling:
# top_k=1: greedy decoding, always picks the most likely token
# top_k=10: very focused, low variety
# top_k=40: GPT-2's default, good balance for most text
# top_k=200: broad sampling, approaches no filtering
# For char-level vocab (~100 tokens): top_k=5-10 makes sense
# For BPE vocab (~50k tokens): top_k=40-100 is typical
Top-k and temperature work together: top-k eliminates garbage tokens; temperature tunes creativity within the remaining candidates.
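The renormalization that top-k causes is easy to check by hand. The sketch below works on probabilities rather than logits (setting logits to -inf before softmax has the same effect); the distribution and the helper name `top_k_filter` are invented for illustration:

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero out the rest, renormalize."""
    k = min(k, len(probs))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]  # made-up next-token distribution
filtered = top_k_filter(probs, 2)
print([round(p, 3) for p in filtered])  # the two survivors now share all the mass
```

The surviving tokens keep their relative odds (5:3 here); only the tail is removed. Temperature, applied before this step, is what changes those relative odds.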
The prompt tokens you provide are conditioning: the model continues from whatever context you give it. A well-chosen prompt can dramatically shift the style, topic, and structure of output:
Adapt a pretrained model to a new domain without training from scratch
A model pretrained on hundreds of billions of tokens has already internalized grammar, facts, code syntax, and reasoning patterns. Fine-tuning steers that compressed knowledge toward your domain rather than relearning it from zero:
Training from scratch: the model must learn basic language statistics before it can learn your domain. Needs 100MB+ of data for good results. Slow and compute-intensive.
Fine-tuning: language knowledge is already there, so only domain adaptation needs to be learned. 1–10MB of data is often enough. Fast.
For domain adaptation, format data as plain continuation text. For instruction-following, use a consistent prompt/response template:
# Domain adaptation: plain continuation
with open("fine_tune_data.txt", "w") as f:
    for doc in your_documents:
        f.write(doc.strip() + "\n<|endoftext|>\n")
# Instruction format: for Q&A or chat behavior
template = "<|user|>{question}\n<|assistant|>{answer}<|endoftext|>"
with open("fine_tune_data.txt", "w") as f:
    for q, a in qa_pairs:
        f.write(template.format(question=q, answer=a) + "\n")
Same curation rules as Day 36 apply: 1–10MB of high-quality data beats 100MB of noise.
Fine-tuning is continued training at a much lower learning rate: low enough that the model adapts without overwriting what it already knows (a failure mode called catastrophic forgetting):
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.train()  # from_pretrained() returns the model in eval mode; re-enable dropout
# Key: 10-100× lower LR than pretraining (GPT-2-scale runs such as nanoGPT use ~6e-4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
for batch in fine_tune_dataloader:  # assumed to yield dicts with an "input_ids" tensor
    input_ids = batch["input_ids"]
    outputs = model(input_ids, labels=input_ids)  # labels are shifted inside the model
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.4f}")
LoRA freezes the original weights and inserts small trainable low-rank matrices alongside the attention projections. In practice it often gets close to full fine-tuning quality while training less than 1% of the parameters, which makes it feasible even on a laptop GPU:
from peft import get_peft_model, LoraConfig, TaskType
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, # Rank: higher = more capacity, more params
    lora_alpha=32, # Scaling factor (here 4× r)
    target_modules=["c_attn"], # GPT-2's combined QKV projection
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# prints roughly: trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
# Training loop is identical to full fine-tuning
model.save_pretrained("gpt2-lora-adapter/") # Saves only the tiny adapter
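You can reproduce the trainable-parameter count above by hand. This sketch assumes GPT-2 small's published shape (12 layers, hidden size 768, so c_attn maps 768 to 3 × 768) and that each adapted layer adds two matrices, A (r × d_in) and B (d_out × r). The helper name `lora_param_count` is made up for this example:

```python
def lora_param_count(d_in, d_out, r, n_layers):
    """Parameters added by LoRA: A is (r x d_in), B is (d_out x r), per adapted layer."""
    return n_layers * (r * d_in + d_out * r)

# GPT-2 small: c_attn projects 768 -> 2304 (Q, K, V concatenated) in each of 12 layers
added = lora_param_count(d_in=768, d_out=3 * 768, r=8, n_layers=12)
total = 124_734_720  # "all params" figure reported above (base model + adapter)

print(f"{added:,} trainable parameters")      # 294,912
print(f"{100 * added / total:.2f}% of total")  # 0.24%
```

Doubling `r` doubles the adapter size, which is why rank is the main capacity knob.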
Advanced sampling, model distillation, RLHF, essential papers, and where to go next
Top-p (nucleus) sampling adapts dynamically: instead of always considering exactly k tokens, it takes the smallest set whose cumulative probability exceeds p. When the model is confident, the nucleus is small. When it's uncertain, the nucleus expands:
import torch
import torch.nn.functional as F

def top_p_sample(logits, p=0.9, temperature=0.8):
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token if the mass before it already exceeds p (the top token is always kept)
    remove = cumprobs - sorted_probs > p
    sorted_logits[remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)
    idx = torch.multinomial(probs, 1)  # index into the sorted order
    return sorted_indices[idx]  # map back to a vocabulary ID
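To make the "nucleus grows with uncertainty" claim concrete, the following standard-library sketch counts how many tokens survive at p=0.9 for two invented distributions. `nucleus_size` is a hypothetical helper for illustration:

```python
def nucleus_size(probs, p=0.9):
    """Size of the smallest set of top tokens whose cumulative probability exceeds p."""
    total = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total > p:
            return n
    return len(probs)

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # model is nearly sure of one token
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # probability spread almost evenly

print(nucleus_size(confident))  # 1
print(nucleus_size(uncertain))  # 5
```

A fixed top-k of, say, 3 would keep two junk tokens in the first case and cut off two plausible ones in the second.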
Beam search maintains the k most probable partial sequences (the "beam") at each step. It produces coherent, grammatical output, but tends to feel generic and repetitive for creative tasks. Use it for summarization and translation; use sampling for open-ended generation.
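A toy version of beam search fits in a few lines. The sketch below is not tied to any model: `next_log_probs` stands in for a model's forward pass, and the bigram table `TOY` is invented purely so the example runs on its own.

```python
import math

def beam_search(next_log_probs, start, beam_width=3, max_len=4, eos=None):
    """Keep the beam_width highest-scoring sequences at each step.

    next_log_probs(seq) must return a dict {token: log_prob} for the next token.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if eos is not None and seq[-1] == eos:
                candidates.append((score, seq))  # finished: carry over unchanged
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Prune to the best beam_width candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

# Hypothetical bigram "model": next-token probabilities given the last token
TOY = {
    "the": {"cat": 0.5, "dog": 0.4, "end": 0.1},
    "cat": {"sat": 0.9, "end": 0.1},
    "dog": {"ran": 0.5, "sat": 0.3, "end": 0.2},
    "sat": {"end": 1.0},
    "ran": {"end": 1.0},
}

def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}

beams = beam_search(toy_next_log_probs, "the", beam_width=2, max_len=3, eos="end")
best_score, best_seq = beams[0]
print(best_seq, round(math.exp(best_score), 2))  # ['the', 'cat', 'sat', 'end'] 0.45
```

Because the result is the single highest-probability path, running it twice gives the same text. That determinism is what makes beam search feel generic for creative work.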
Distillation trains a small "student" to mimic a large "teacher." Instead of hard labels (one-hot targets), the student trains on the teacher's soft probability distribution, which contains far richer information about near-misses:
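import torch
import torch.nn.functional as F

T = 4.0 # Soften both distributions
# Assumes teacher and student return raw logits of shape (batch, seq_len, vocab)
with torch.no_grad():
    teacher_logits = teacher(input_ids)
student_logits = student(input_ids)
V = student_logits.size(-1)
soft_loss = F.kl_div(
    F.log_softmax(student_logits.reshape(-1, V) / T, dim=-1),
    F.softmax(teacher_logits.reshape(-1, V) / T, dim=-1),
    reduction="batchmean"
) * (T ** 2) # Rescale gradients by T²
# Hard loss uses the unscaled student logits, flattened to (batch * seq_len, vocab)
hard_loss = F.cross_entropy(student_logits.reshape(-1, V), labels.reshape(-1))
loss = 0.5 * soft_loss + 0.5 * hard_loss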
DistilGPT-2 (82M parameters) was distilled from GPT-2 (124M) in this way, giving up some quality in exchange for a smaller, faster model. Distillation is how a large model is made deployable without pretraining a small one from scratch.
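RLHF (Reinforcement Learning from Human Feedback) transforms a text-completion model into an assistant that follows instructions. Three stages: supervised fine-tuning on human-written demonstrations, training a reward model on human preference comparisons, and optimizing the model against that reward model with reinforcement learning (typically PPO).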
DPO (Direct Preference Optimization) skips the reward model and optimizes preferences directly using a reparameterized loss. It's simpler, more stable, and has become the standard for smaller-scale alignment work. If you want to explore alignment, start with DPO.
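For a single preference pair, the DPO loss reduces to a few lines of arithmetic. This sketch assumes you already have the summed log-probability of the chosen and rejected responses under both the model being trained (the policy) and a frozen reference copy. The function name and `beta=0.1` are illustrative choices:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given summed response log-probs."""
    # How much more the policy likes each response than the reference does
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference learned yet, loss = log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted toward the chosen response: loss drops below log(2)
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The reference terms are what keep the model anchored: the loss only rewards liking the chosen response more than the reference did, relative to the rejected one.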
Read these in order; each builds on the last, and you'll recognize every concept:
Key tools: Hugging Face Transformers (standard model library), PEFT (LoRA and friends), vLLM (high-throughput serving), llama.cpp (4-bit inference on CPU/Apple Silicon), nanoGPT (Karpathy's clean reference implementation).
First: add a --top_p flag to your generate script alongside --temperature and --top_k. Test it at p=0.9 and compare outputs to top-k at k=40. Second: read the abstract, introduction, and conclusion of "Attention Is All You Need." Write one sentence summarizing the problem it was solving when it was published in 2017. The habit of reading papers is how you stay current as the field moves.