
Tokenization

Turn raw text into numbers the model can understand

Week 1 Focus: Text to tokens to IDs


Day 1

Character-Level Tokenization vs BPE

Why tokenization matters, character-level basics, and the case for Byte Pair Encoding

Tokenization is the bridge between human language and numerical computation. A language model never "sees" text directly: it operates on sequences of integers. Day 1 establishes why this translation layer exists, explores the simplest approach (character-level), and motivates the move to BPE (Byte Pair Encoding), which powers models like GPT and Llama. The key insight is that tokenization is a trade-off between vocabulary size, sequence length, and semantic coherence. Character-level is trivial to implement but produces long sequences with minimal semantic granularity. BPE compresses text intelligently but requires a learned merge table.

Why tokenization is not optional

A neural network operates on numeric tensors, but text is symbolic and variable-length. Tokenization bridges the gap by mapping each unit of text to an integer ID in a finite vocabulary. The vocabulary size V sets the number of rows in the embedding table (V x d_model) and the output dimension of the final prediction layer, so it accounts for a large portion of a small model's parameter count.

Consider the sentence "The cat sat.". At the character level, this becomes ['T','h','e',' ','c','a','t',' ','s','a','t','.'] (length 12). At the word level (splitting on spaces), it becomes ['The','cat','sat.'] (length 3). At the subword level (BPE), it might become ['The','Ġcat','Ġsat','.'] (length 4, with the leading space encoded as Ġ and the period split off as its own token).

Character-level: Vocabulary is tiny (256 entries if you work on raw bytes). Sequences are long, and a single character carries almost no meaning on its own, so the model has to learn words and morphemes from scratch.
Word-level: Vocabulary explodes. You need an entry for every word form: "run", "running", "runs", "runner" are all separate. Out-of-vocabulary (OOV) words are a hard problem.
Subword (BPE): Vocabulary is manageable (roughly 32k to 100k entries in common tokenizers). Common words stay whole. Rare words get split into meaningful pieces. "unfortunately" might become ['un', 'fortun', 'ately'].
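
To make the trade-off concrete, here is a minimal sketch that splits the example sentence at the character level and at a naive word level (splitting on single spaces, which is an assumption made only for illustration). It uses plain Python and no tokenizer library.

```python
# Compare a character-level split with a naive whitespace "word-level" split.
sentence = "The cat sat."

char_tokens = list(sentence)        # one token per character
word_tokens = sentence.split(" ")   # naive split on single spaces

print(len(char_tokens), char_tokens)  # 12 tokens
print(len(word_tokens), word_tokens)  # 3 tokens

# The vocabulary is the set of distinct tokens seen in the text.
char_vocab = sorted(set(char_tokens))
word_vocab = sorted(set(word_tokens))
print(len(char_vocab), len(word_vocab))  # 9 distinct characters, 3 distinct words
```

On a single sentence both vocabularies are small. On a real corpus the character vocabulary stays small while the word vocabulary keeps growing with every new word form.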

Implementing character-level tokenization

The simplest tokenizer maps each unique character to an ID. With a character-level approach, your "vocabulary" is just the set of unique characters in your training corpus. This is trivial to build and deterministic: no learning required.

# Build a character-level vocabulary from a corpus
def build_char_vocab(text):
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

# Encode text to IDs
def encode_char(text, vocab):
    return [vocab[ch] for ch in text]

# Decode IDs back to text
def decode_char(ids, vocab):
    itos = {i: ch for ch, i in vocab.items()}
    return ''.join(itos[i] for i in ids)

# Example
corpus = "The quick brown fox jumps over the lazy dog"
vocab = build_char_vocab(corpus)
# vocab might be: {' ': 0, 'T': 1, 'a': 2, 'b': 3, ...}
ids = encode_char(corpus, vocab)
# ids: [1, 9, 6, 0, ...] - sequence length = 43

Experiment

Change the corpus to a longer text (try a paragraph from Wikipedia). Compare the sequence length of character-level tokenization versus splitting on spaces. How much longer is the character-level sequence? Try len(set(text)) to see your vocabulary size. What happens to sequence length if you include Unicode characters?

The problem with character-level: sequence length

Transformer self-attention costs O(n^2) in sequence length n. If your sequences are 10x longer because you used characters instead of subwords, the attention computation grows by roughly 100x (the rest of the model grows linearly, about 10x). This is the fundamental reason character-level tokenization is rarely used for large models, though it is common in toy implementations and specific domains (DNA sequences, for example).

Character-level

For "The quick brown fox..." (43 chars), sequence length = 43. Vocabulary = 28 (26 lowercase letters, 'T', and the space). Cheap embedding layer, expensive attention everywhere else.

Subword (BPE)

Same text might be ~10 tokens. Vocabulary = 32,000. Larger embedding layer, but attention cost drops by ~18x (43^2 vs 10^2). Net win for any real model.
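
The quadratic comparison can be checked with a few lines of arithmetic. This sketch only counts pairwise attention scores (n^2) and ignores every other cost in the model; the function name is a made-up helper.

```python
def attention_cost_ratio(n_long, n_short):
    """Ratio of pairwise attention-score counts: n_long^2 / n_short^2."""
    return (n_long ** 2) / (n_short ** 2)

# 43 characters versus roughly 10 subword tokens for the same sentence
print(attention_cost_ratio(43, 10))     # 18.49

# A 10x longer sequence costs 100x more attention work
print(attention_cost_ratio(1000, 100))  # 100.0
```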

BPE: how it works conceptually

Byte Pair Encoding starts with character-level tokens, then iteratively merges the most frequent pair of adjacent tokens. After enough merges, common sequences like "ing", "tion", or whole words like "the" become single tokens.

The algorithm:

Initialize vocabulary with all individual characters/bytes (256 entries for byte-level BPE).
Count all adjacent pairs in the training corpus.
Merge the most frequent pair into a new token. Add it to the vocabulary.
Repeat for a fixed number of merges (e.g., 32,000 times).
Store the merge table. This is your tokenizer.

# Conceptual BPE merge step
def get_pair_counts(tokens):
    counts = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i+1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# After many merges, "t" + "h" -> "th", "th" + "e" -> "the"
# The merge table stores: {(116, 104): 256, (256, 101): 257, ...}

The critical detail: encoding at inference time uses the same merge table. You apply merges in the order they were learned, earliest (highest-priority) merge first, which reproduces the tokenization the model was trained with.
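
The steps above can be sketched end to end as a toy byte-level BPE. This is a minimal illustration under simplifying assumptions (no pre-tokenization, ties broken by first occurrence, a tiny corpus), not how production tokenizers are implemented; `merge_pair`, `train_bpe`, `encode_bpe`, and `decode_bpe` are names invented for this sketch.

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id; dict order is the merge order
    for step in range(num_merges):
        counts = get_pair_counts(tokens)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step                 # IDs 0-255 are the raw bytes
        tokens = merge_pair(tokens, pair, new_id)
        merges[pair] = new_id
    return merges

def encode_bpe(text, merges):
    """Encode by replaying the merges in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        tokens = merge_pair(tokens, pair, new_id)
    return tokens

def decode_bpe(ids, merges):
    """Expand every token ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

# Usage on a toy corpus (real tokenizers train on far more text)
corpus = "the theme then there"
merges = train_bpe(corpus, num_merges=5)
ids = encode_bpe(corpus, merges)
print(len(corpus.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode_bpe(ids, merges) == corpus)  # True: the round trip is lossless
```

On this corpus the first two learned merges are "t" + "h" and then "th" + "e", which mirrors the example in the comment above.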

Using a pretrained BPE tokenizer

In practice, you rarely train BPE from scratch. You use a library like tiktoken (OpenAI's tokenizer) or tokenizers (Hugging Face). This ensures compatibility with pretrained models and handles edge cases like Unicode correctly.

import tiktoken

# Load GPT-4's tokenizer (cl100k_base)
enc = tiktoken.get_encoding("cl100k_base")

text = "The quick brown fox jumps over the lazy dog"
ids = enc.encode(text)
# ids is a list of integers, roughly one per word for this sentence (print it to see the exact values)
# Around 9-10 tokens for 43 characters!

decoded = enc.decode(ids)
# decoded == text (round-trip integrity)

# Check vocabulary size
print(enc.n_vocab)  # 100277

Important Never mix tokenizers. If you train a model using one tokenizer's vocabulary, you must use the exact same tokenizer at inference time. Merge tables and vocabularies are not interchangeable between tokenizers, even when both use BPE.


Interactive companion tool

Watch BPE tokenization happen step by step. Control how many merge steps to apply, see character pairs merge in real time, and compare character-level vs BPE sequence lengths side by side.

bpe-visualizer.html ↗

Your action Install tiktoken or tokenizers. Tokenize a sample text at character level and with BPE. Compare sequence lengths. Try encoding a word that wasn't in your training corpus (for a from-scratch BPE) vs a pretrained one. Observe how subword splitting differs.

Key takeaway

Tokenization is a lossless re-encoding of text (decoding the IDs returns the original string), and the choice of scheme dramatically affects model compute cost. BPE finds the sweet spot: common words stay whole, rare words split meaningfully, and the vocabulary stays bounded. Character-level is simpler but costs you quadratically in sequence length.

Day 2

Data Loading and Batch Creation

PyTorch Datasets, DataLoaders, and the art of creating training batches

A language model learns by looking at many examples of "given this context, what comes next?" To provide these examples efficiently, you need a data pipeline that can serve up thousands of (input, target) pairs per second. Day 2 covers how to turn a raw text file into a PyTorch DataLoader that yields properly shaped batches, and why the details of that shaping matter more than you might expect.

The (input, target) pair for language modeling

Language modeling is fundamentally next-token prediction. Given a sequence of token IDs [x1, x2, ..., xn], the model learns to predict [x2, x3, ..., xn+1]. Input and target are offset by one position.

# For a sequence of token IDs
tokens = [464, 2068, 7586, 21831, 1917]  # "The quick brown..."

# Input: all tokens except the last
inputs = tokens[:-1]   # [464, 2068, 7586, 21831]

# Target: all tokens except the first (shifted left by 1)
targets = tokens[1:]    # [2068, 7586, 21831, 1917]

# Model learns: given 464 -> predict 2068, given 2068 -> predict 7586, etc.

In practice you train on whole sequences rather than single tokens, so inputs and targets become matrices of shape (batch_size, sequence_length). The loss is computed over all positions in the sequence.
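
As a rough illustration of those matrix shapes, the sketch below cuts a flat token list into rows using plain Python lists. The helper `make_batch` is hypothetical, and it assumes each row reads seq_len + 1 tokens so that inputs and targets both have exactly seq_len entries.

```python
def make_batch(token_ids, seq_len, batch_size):
    """Cut a flat token list into (input, target) rows of length seq_len."""
    inputs, targets = [], []
    for b in range(batch_size):
        start = b * seq_len
        chunk = token_ids[start : start + seq_len + 1]  # one extra token for the shift
        if len(chunk) < seq_len + 1:
            break  # not enough tokens left for a full row
        inputs.append(chunk[:-1])   # everything except the last token
        targets.append(chunk[1:])   # everything except the first token
    return inputs, targets

stream = list(range(100, 120))  # stand-in token IDs
x, y = make_batch(stream, seq_len=4, batch_size=3)
print(x)  # [[100, 101, 102, 103], [104, 105, 106, 107], [108, 109, 110, 111]]
print(y)  # [[101, 102, 103, 104], [105, 106, 107, 108], [109, 110, 111, 112]]
```

Each target row is its input row shifted left by one position, and both have shape (batch_size, seq_len) = (3, 4).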

Building a PyTorch Dataset

A Dataset in PyTorch defines how to get a single item. For language modeling, an item is typically a chunk of text of fixed length. The dataset needs to tokenize the raw text and serve up (input_chunk, target_chunk) pairs.

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text, tokenizer, seq_len):
        self.tokenizer = tokenizer
        self.seq_len = seq_len
        # Tokenize the entire text once
        self.ids = tokenizer.encode(text)
        # Number of complete sequences we can make
        self.n_samples = len(self.ids) // seq_len

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        start = idx * self.seq_len
        end = start + self.seq_len
        chunk = self.ids[start:end]  # seq_len tokens
        # Shift by one: x drops the last token, y drops the first,
        # so x and y each hold seq_len - 1 tokens.
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y

Experiment

Try different seq_len values: 16, 64, 256, 1024. For each, print the number of samples and the shape of x and y. What happens to the number of samples as seq_len increases? What's the trade-off between longer sequences (more context) and more samples (more gradient updates per epoch)?

DataLoader: batching and shuffling

The DataLoader wraps your Dataset and adds batching, shuffling, and multi-process loading. For language modeling, you typically want:

batch_size: How many sequences per batch. Larger = more stable gradients, more memory usage.
shuffle: Whether to randomize sample order each epoch. Almost always True for training.
num_workers: Parallel data loading. More workers = faster data pipeline, but more CPU usage.

dataset = TextDataset(text, tokenizer, seq_len=128)
dataloader = DataLoader(
    dataset,
    batch_size=32,      # 32 sequences per batch
    shuffle=True,          # Randomize order each epoch
    num_workers=4,        # Parallel loading (set to 0 on Windows if issues)
    pin_memory=True       # Speeds up GPU transfer
)

for batch_x, batch_y in dataloader:
    # batch_x: shape (32, 127) - 32 sequences of 127 tokens each
    # batch_y: shape (32, 127) - the targets
    print(batch_x.shape, batch_y.shape)
    break  # Just check the first batch

The train/validation split

You need separate data for training and validation. The validation set measures generalization: if your model memorizes the training set but fails on validation, you're overfitting. A typical split is 90/10 or 95/5 for large corpora.

def train_val_split(text, train_ratio=0.95):
    split_idx = int(len(text) * train_ratio)
    return text[:split_idx], text[split_idx:]

train_text, val_text = train_val_split(raw_text)

train_dataset = TextDataset(train_text, tokenizer, seq_len=128)
val_dataset = TextDataset(val_text, tokenizer, seq_len=128)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

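Critical detail Split the raw text first, then tokenize each portion separately, so the training and validation sets are built independently. Either kind of split can land mid-word: a character-index split may cut a word in half, and a token-index split may separate the pieces of a multi-token word. The effect is one odd token at the boundary, which is negligible for a large corpus. If it matters, move split_idx to the nearest space or newline.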

Handling the end of the corpus

What happens when seq_len doesn't divide the corpus evenly? You have two options: discard the remainder (simple, loses a bit of data) or pad the last sequence (requires attention masks). For simplicity in a from-scratch implementation, discarding is fine.

# In __init__ of TextDataset:
self.n_samples = len(self.ids) // seq_len  # Integer division discards remainder
self.ids = self.ids[:self.n_samples * seq_len]  # Trim to exact multiple

For very large datasets, you might also implement a streaming dataset that doesn't load everything into memory. But for learning purposes, loading the full tokenized corpus is simpler and fast enough for corpora up to ~100MB.


Interactive companion tool

Visualize how your dataset gets sliced into training batches. Toggle sliding-window sampling, see the input/target offset, and compute tokens, batches, and memory for any config.

batch-shapes.html ↗

Your action Download a sample text corpus (try the TinyShakespeare dataset). Create a TextDataset with seq_len=128 and batch_size=32. Iterate through one epoch and print the total number of batches. Verify that inputs and targets are correctly offset by one position for the first sample.

Key takeaway

The data pipeline is where many training bugs hide. Inputs and targets must be offset by exactly one position. The sequence length determines both your context window and your memory usage. Always verify your shapes with a single batch before starting training.

Day 3

Embedding Layer and Position Encodings

Turn token IDs into vectors, and teach the model where each token sits in sequence

An embedding layer is a learned lookup table that maps each token ID to a dense vector. Position encodings solve a fundamental problem: the embedding lookup ignores position, and self-attention on its own treats its input as an unordered set, so without extra information the model cannot tell whether one token comes before or after another. Position encodings inject order information into the representation. Day 3 builds both from scratch and explains why the choice of position encoding matters enormously for model quality.

The embedding layer: what it actually is

An embedding layer is literally a matrix of shape (vocab_size, d_model). When you look up token ID 42, you get row 42 of this matrix. The values in this matrix are learned during training: they are parameters, just like weights in a linear layer.

The intuition: tokens with similar meanings should have similar embedding vectors (small cosine distance). "King" and "Queen" should be close; "King" and "Toaster" should be far apart. The model learns this by adjusting the embedding matrix to minimize the training loss.

import torch
import torch.nn as nn

vocab_size = 50257   # GPT-2 vocabulary size
d_model = 768        # Embedding dimension (GPT-2 small)

embedding = nn.Embedding(vocab_size, d_model)
# embedding.weight: shape (50257, 768) - these are all learnable parameters

# Forward pass: batch of token IDs
token_ids = torch.tensor([[464, 2068, 7586]])  # shape (1, 3)
embeddings = embedding(token_ids)          # shape (1, 3, 768)
# Each token ID is replaced by its 768-dim vector
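
To see the lookup and the similarity intuition without any framework, here is a small sketch in plain Python. The table values and the word labels are invented for illustration; a trained model would learn its own values.

```python
import math

# A toy embedding table: 4 token IDs, 3 dimensions. Values are made up.
embedding_table = [
    [0.9, 0.1, 0.0],  # id 0, imagine "king"
    [0.8, 0.2, 0.1],  # id 1, imagine "queen"
    [0.0, 0.1, 0.9],  # id 2, imagine "toaster"
    [0.1, 0.9, 0.0],  # id 3
]

def embed(token_ids):
    """An embedding lookup is just row indexing into the table."""
    return [embedding_table[i] for i in token_ids]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king, queen, toaster = embed([0, 1, 2])
print(cosine_similarity(king, queen))    # close to 1
print(cosine_similarity(king, toaster))  # close to 0
```

nn.Embedding performs the same row lookup, with the table stored as a trainable parameter.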

38.6M

Parameters in GPT-2's embedding layer alone (50k x 768)

768

Dimensions per token in GPT-2 small

Why position matters: the permutation problem

The embedding lookup described above is position-agnostic. The vector for token ID 42 at position 0 is identical to the vector for token ID 42 at position 500. But "The cat" (cat at position 1) and "Cat the" (cat at position 0) mean different things! The model needs to know where each token sits in the sequence.

This is solved by adding position information to each token's embedding. There are two main approaches:

Learned positional embeddings: Just another lookup table, mapping position 0, 1, 2, ... to vectors. These are learned during training alongside token embeddings.
Sinusoidal encodings (original Transformer): Fixed mathematical formula using sine and cosine waves of different frequencies. Not learned, just computed.

Learned positions

Simple to implement (another nn.Embedding). Works well in practice. GPT-2 and GPT-3 use this approach, while many newer LLMs (Llama, for example) use rotary position embeddings (RoPE) instead. The model figures out what position information is useful.

Sinusoidal (fixed)

Elegant idea: different positions get different "frequencies". It needs no parameters and can be computed for any position, whereas a learned table stops at max_seq_len, but quality on sequences longer than those seen in training is still not guaranteed. Mostly historical now.

Implementing learned positional embeddings

class EmbeddingWithPosition(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len=1024):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        seq_len = token_ids.shape[1]
        # Position indices: [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=token_ids.device)

        # Look up embeddings
        tok_emb = self.token_embedding(token_ids)      # (batch, seq_len, d_model)
        pos_emb = self.position_embedding(positions)    # (seq_len, d_model)

        # Add them! Broadcasting handles the batch dimension.
        return tok_emb + pos_emb

The addition is the key operation. Each token's final representation is token_embedding[token_id] + position_embedding[position]. Both are d_model-dimensional vectors, so they add element-wise.
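
A tiny worked example makes the effect of the addition visible. The two tables below are made-up numbers with d_model = 2, and `embed_with_position` is a hypothetical helper written with plain lists.

```python
# Toy tables: 3 token IDs and 3 positions, d_model = 2. Values are made up.
token_table = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
position_table = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]

def embed_with_position(token_ids):
    """Element-wise sum of the token vector and the position vector."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

out = embed_with_position([2, 0, 2])
for pos, vec in enumerate(out):
    print(pos, vec)
# Token 2 appears at positions 0 and 2 and ends up with two different vectors.
```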

Sinusoidal encodings: the original approach

The original Transformer paper (Vaswani et al., 2017) used fixed sinusoidal encodings. The formula:

For dimension i (where i is even):
PE(pos, i) = sin(pos / 10000^(i/d_model))

For dimension i (where i is odd):
PE(pos, i) = cos(pos / 10000^((i-1)/d_model))

import math
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # Assumes d_model is even (sin/cos columns are filled in pairs)
    # Position indices as a column vector
    positions = torch.arange(seq_len).unsqueeze(1).float()  # (seq_len, 1)

    # div_term[k] = 1 / 10000^(2k/d_model), one value per (sin, cos) pair
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float()
        * -(math.log(10000.0) / d_model)
    )  # (d_model/2,)

    # Allocate the encoding matrix
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions

    return pe  # (seq_len, d_model) - add this to token embeddings

Experiment

Visualize the sinusoidal encodings for positions 0-99 and dimensions 0-63 using matplotlib.pyplot.imshow(). Observe the different frequency patterns. Now compare with learned position embeddings from a trained model (if available). Which ones look more structured? Try changing the 10000 base in the formula to 500 or 50000. How does it affect the encoding?

Why not just concatenate?

You might wonder: why add the position encoding? Why not concatenate token embedding (768-dim) with position encoding (768-dim) to get 1536-dim, then project down?

The answer is simplicity and parameter efficiency. Adding keeps the dimension at d_model throughout the network. Concatenation would double the effective dimension at the cost of more parameters. Addition works because the model learns to "interpret" each dimension as encoding both token identity and position simultaneously.

This is also why the initialization scale of position embeddings matters. If position embeddings are too large relative to token embeddings, the token information gets overwhelmed. If they're too small, the model can't distinguish positions. In practice, both are typically initialized with similar scales (e.g., a normal distribution with mean 0 and std=0.02).

Your action Implement both learned and sinusoidal position encodings. Pass a small batch through each and verify the output shapes. Try visualizing the learned position embeddings after a few training steps: do they show any structure? Compare the parameter counts between the two approaches.

Key takeaway

Embeddings turn discrete token IDs into continuous vectors that neural networks can process. Position encodings solve the permutation problem by injecting order information. The addition of token and position embeddings is simple but effective: the model learns to interpret each dimension as encoding both what the token is and where it sits.

Day 4

Training Loop Basics

The engine room: optimizer, loss function, backpropagation, and gradient clipping

The training loop is where your model actually learns. It's the repetitive process of: forward pass -> compute loss -> backward pass -> update weights. Day 4 covers the full loop, the cross-entropy loss function (the standard for language modeling), the AdamW optimizer, and gradient clipping (essential for stabilizing transformer training). Getting these details right is the difference between a model that learns and one that explodes.

The training loop skeleton

Every training loop follows this structure. The details change, but the flow is invariant:

def train(model, dataloader, optimizer, device, epoch):
    model.train()  # Enable dropout, batch norm updates, etc.
    total_loss = 0

    for batch_idx, (inputs, targets) in enumerate(dataloader):
        # Move data to device (GPU if available)
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        logits = model(inputs)  # shape: (batch, seq_len, vocab_size)
        loss = compute_loss(logits, targets)

        # Backward pass
        optimizer.zero_grad()   # Clear previous gradients
        loss.backward()         # Compute gradients via backprop
        optimizer.step()        # Update parameters

        total_loss += loss.item()

        if batch_idx % 100 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    return total_loss / len(dataloader)

Cross-entropy loss for language modeling

The model outputs logits of shape (batch, seq_len, vocab_size). For each position, this is a vector of raw scores (not probabilities) for each token in the vocabulary. Cross-entropy loss measures how far these scores are from the target token ID.

Mathematically: CE(p, y) = -log(p[y]) where p is the softmax of the logits and y is the target token ID. PyTorch's CrossEntropyLoss combines log-softmax and negative log-likelihood in one numerically stable operation.
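
To make the formula concrete, here is a hand-rolled version for a single position, written in plain Python with a toy 3-token vocabulary. It is a sketch for intuition only; in training you would use the PyTorch function.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """CE = -log(p[target]) for one position."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 1.0, 0.1]          # raw scores over a 3-token vocabulary
print(cross_entropy(logits, 0))   # low loss: the model favoured token 0
print(cross_entropy(logits, 2))   # high loss: the model gave token 2 little mass

# With all logits equal, the loss is log(vocab_size)
print(cross_entropy([0.0] * 4, 1), math.log(4))
```

The last line is a useful sanity check: an untrained model with roughly uniform outputs should start near a loss of log(vocab_size).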

import torch.nn.functional as F

def compute_loss(logits, targets):
    # logits: (batch, seq_len, vocab_size)
    # targets: (batch, seq_len)

    # Flatten for F.cross_entropy, which expects logits of shape (N, C)
    # and targets of shape (N,), with N = batch*seq_len and C = vocab_size
    batch_size, seq_len, vocab_size = logits.shape
    logits = logits.view(-1, vocab_size)   # (batch*seq_len, vocab_size)
    targets = targets.view(-1)            # (batch*seq_len,)

    return F.cross_entropy(logits, targets)

A key detail: we compute loss over all positions in the sequence. The model is learning to predict every next token simultaneously. This is what makes transformers efficient: one forward pass teaches the model about every position in the context window.

The AdamW optimizer

AdamW (Adam with decoupled weight decay) is the standard optimizer for transformers. It adapts the learning rate per parameter based on the history of gradients, and applies weight decay directly to the weights instead of adding an L2 penalty to the gradient, so the decay is not rescaled by the adaptive learning rate.

import torch.optim as optim

optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,           # Learning rate: 0.0003 is a good starting point
    weight_decay=0.1,   # Weight decay: helps prevent overfitting
    betas=(0.9, 0.95)  # Adam momentum parameters
)

AdamW parameters

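lr=3e-4 is a common starting point for small models (the GPT-3 paper used peak rates from 6e-4 for its smallest model down to 0.6e-4 for the largest). weight_decay=0.1 is standard. betas=(0.9, 0.95) lowers beta2 from PyTorch's default of 0.999, so the second-moment estimate averages over a shorter window and reacts faster to changes in gradient scale.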

Common mistakes

Using SGD (too slow convergence). Forgetting weight_decay (overfitting). Setting lr too high (>1e-3, loss explodes). Setting lr too low (<1e-5, barely learns).

Gradient clipping: preventing explosion

Transformers are prone to gradient explosion, especially early in training. A single large gradient can destroy your model's weights. Gradient clipping solves this by scaling the gradients if their norm exceeds a threshold.

# Inside the training loop, after loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Then optimizer.step()

The value max_norm=1.0 means: if the total gradient norm exceeds 1.0, scale all gradients down proportionally so the norm equals 1.0. This keeps training stable without changing the direction of the gradient update.
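
The scaling rule can be written out by hand. The sketch below works on plain lists of numbers instead of tensors and is meant to mirror the idea of norm-based clipping, not to replace the PyTorch utility; `clip_by_global_norm` is a name chosen for this example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of per-parameter gradient lists.
    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

# Two "parameters" whose gradients have a combined norm of 5.0
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [[0.6], [0.8]]: same direction, norm about 1.0

# Gradients already under the threshold are left unchanged
print(clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)[0])
```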

Experiment

Try training with and without gradient clipping. Monitor the gradient norm using torch.nn.utils.clip_grad_norm_() (it returns the norm before clipping). Plot gradient norms over batches. Without clipping, do you see occasional massive spikes? Try max_norm values of 0.5, 1.0, and 5.0. Which produces the most stable training?

Learning rate scheduling

A constant learning rate is rarely optimal. The standard schedule for transformers is warmup followed by cosine decay: start with a very small LR, ramp up to the target LR over N steps, then decay down to near-zero.

import math

def get_lr_scheduler(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        # Linear warmup from 0 to 1
        if step < warmup_steps:
            return float(step) / float(max(1, warmup_steps))
        # Cosine decay from 1 to 0; clamp so the LR stays at 0 past total_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    return optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage
scheduler = get_lr_scheduler(optimizer, warmup_steps=2000, total_steps=100000)
# After optimizer.step():
scheduler.step()

Warmup prevents early instability (the model's loss is very high at the start, leading to large gradients). Cosine decay shrinks the step size late in training, letting the model settle into a minimum with smaller, finer updates.
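
It can help to print the schedule at a few steps before committing to a long run. This sketch recomputes the multiplier in plain Python, using the same warmup and total step counts as the usage example; `lr_multiplier` is a stand-alone helper written for illustration, and it clamps the decay so the rate stays at zero after total_steps.

```python
import math

def lr_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                     # hold at the end value
    return 0.5 * (1.0 + math.cos(math.pi * progress)) # cosine decay

base_lr = 3e-4
for step in (0, 1000, 2000, 51000, 100000):
    print(step, base_lr * lr_multiplier(step, 2000, 100000))
# Step 0 starts at zero, step 1000 is halfway through warmup,
# step 2000 reaches the full base_lr, step 51000 is back to about half,
# and step 100000 ends at (approximately) zero.
```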

Your action Write a complete training loop with the components above. Use a tiny model (2 layers, 64 d_model) and a small dataset. Train for 1000 steps. Plot the loss over time. Verify that loss decreases. Add gradient clipping and observe whether it changes the training dynamics.

Key takeaway

The training loop is straightforward in structure but sensitive in detail. Cross-entropy loss over all positions teaches the model to predict every next token. AdamW with weight decay is the standard optimizer. Gradient clipping is essential for stability. Learning rate warmup and decay help the model converge to a better solution.

Day 5

Model Architecture Preview

The full transformer architecture: layers, dimensions, and how it all connects

Before diving into the full transformer implementation in Week 2, Day 5 provides a high-level architectural overview. You'll understand the full stack: token embeddings + position encodings feed into a stack of transformer blocks, each containing multi-head attention and feed-forward layers, topped with a language modeling head that projects back to vocabulary size. Understanding the shape flow through the entire model is the goal.

The full architecture at a glance

A GPT-style transformer follows this data flow:

Token IDs (batch, seq_len) -> Token Embedding -> (batch, seq_len, d_model)
Add Position Encoding -> (batch, seq_len, d_model)
Pass through N Transformer Blocks (each preserves shape)
Final Layer Norm -> (batch, seq_len, d_model)
Language Modeling Head (linear layer) -> (batch, seq_len, vocab_size)
Optionally apply softmax to get probabilities

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=12, n_heads=12, max_seq_len=1024):
        super().__init__()
        self.embedding = EmbeddingWithPosition(vocab_size, d_model, max_seq_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads)
            for _ in range(n_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):
        x = self.embedding(token_ids)        # (batch, seq_len, d_model)
        for block in self.blocks:
            x = block(x)                     # shape preserved
        x = self.ln_final(x)                 # (batch, seq_len, d_model)
        logits = self.lm_head(x)             # (batch, seq_len, vocab_size)
        return logits

What's inside a Transformer Block

Each transformer block has two main sub-layers, each wrapped with residual connections and layer normalization (pre-norm is the modern standard):

Multi-Head Self-Attention: The mechanism that lets tokens "see" other tokens in the sequence. This is where the model builds context understanding.
Feed-Forward Network: A simple 2-layer MLP applied independently to each position. This is where the model processes the information gathered by attention.

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        # Pre-norm residual: norm -> sublayer -> residual add
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

Pre-norm vs Post-norm The original Transformer used post-norm (sublayer -> residual -> norm). Modern implementations (including GPT-2/3) use pre-norm (norm -> sublayer -> residual). Pre-norm trains more stably and is the recommended approach for new implementations.

Multi-Head Attention preview

Attention is the core innovation of the Transformer. It computes, for each token, a weighted sum over the tokens it is allowed to see: in a GPT-style (causal) model, that means itself and all earlier tokens, never later ones. The weights are determined by how "relevant" each of those tokens is to the current token.

Multi-head attention runs this process in parallel across multiple "heads", each learning different types of relationships:

Some heads might focus on syntactic relationships (subject-verb agreement)
Some heads might focus on semantic relationships (related concepts)
Some heads might focus on positional patterns (nearby tokens vs distant ones)

The full mathematical form (you'll implement this in Week 2):

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension per head.
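
As a preview of that formula, here is a single-head sketch in plain Python that works on small lists of vectors. It assumes Q, K, and V are given directly (the learned linear projections are skipped) and adds an optional causal mask, as a GPT-style model would use. The real implementation in Week 2 will use batched tensor operations.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V for one head, on lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future positions: position i may only look at j <= i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([
            sum(w[j] * V[j][c] for j in range(len(V)))
            for c in range(len(V[0]))
        ])
    return outputs, weights

# Three tokens with 2-dimensional vectors, reused as Q, K, and V
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)
for row in w:
    print([round(p, 3) for p in row])
# The first row is [1.0, 0.0, 0.0]: the first token can only attend to itself.
```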

The Feed-Forward Network

The FFN is surprisingly simple: two linear layers with a non-linearity (typically GELU) in between. It expands to 4 * d_model internally, then projects back to d_model.

class FeedForward(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # Expand
            nn.GELU(),                       # Non-linearity (smoother than ReLU)
            nn.Linear(4 * d_model, d_model), # Project back
        )

    def forward(self, x):
        return self.net(x)

The 4x expansion factor is a hyperparameter. GPT-3 uses 4x. Some models use 8x. With 4x, the FFN holds about 2/3 of each transformer block's weights (8 * d_model^2, versus 4 * d_model^2 for attention).

Parameter count estimation

Understanding the parameter count helps you choose model dimensions that fit your hardware. Here's the breakdown for a GPT-style model:

~39M

Parameters in GPT-2 Small (124M total) from token + position embeddings (the LM head reuses the token embedding weights)

~7.1M

Parameters per transformer block in GPT-2 Small (x12 blocks ≈ 85M)

# Rough parameter count for a GPT-2 Small sized model (biases and layer norms ignored):
vocab_size = 50257
max_seq_len = 1024
d_model = 768
n_heads = 12   # does not change the count
n_layers = 12
# Attention: Q, K, V projections + output projection = 4 * d_model * d_model
attn_params = 4 * d_model * d_model  # ~2.4M
# FFN: expand (d_model -> 4*d_model) + project (4*d_model -> d_model)
ffn_params = d_model * 4 * d_model + 4 * d_model * d_model  # ~4.7M
block_params = attn_params + ffn_params
print(block_params)  # ~7.1M per block
# Token + position embeddings (LM head assumed tied to the token embedding)
emb_params = vocab_size * d_model + max_seq_len * d_model  # ~39.4M
print(n_layers * block_params + emb_params)  # ~85M + ~39M = ~124M total
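
The same arithmetic can be wrapped in a reusable helper for trying other configurations. This is a rough estimate that counts only the large weight matrices (biases and layer norms are ignored); `estimate_gpt_params` and its `tie_weights` flag are assumptions made for this sketch, with tying meaning that the LM head shares the token embedding matrix.

```python
def estimate_gpt_params(vocab_size, d_model, n_layers, max_seq_len, tie_weights=True):
    """Approximate weight count for a GPT-style model with a 4x FFN."""
    attn = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = 2 * d_model * (4 * d_model)     # expand + project back
    blocks = n_layers * (attn + ffn)
    embeddings = vocab_size * d_model + max_seq_len * d_model
    lm_head = 0 if tie_weights else vocab_size * d_model
    return {
        "per_block": attn + ffn,
        "blocks": blocks,
        "embeddings": embeddings,
        "lm_head": lm_head,
        "total": blocks + embeddings + lm_head,
    }

# GPT-2 Small sized configuration
est = estimate_gpt_params(vocab_size=50257, d_model=768, n_layers=12, max_seq_len=1024)
print(est["per_block"])  # 7077888
print(est["total"])      # 124318464
```

Setting tie_weights=False adds another vocab_size * d_model parameters for a separate output projection, which is what a model with an independent nn.Linear LM head would have.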

Experiment

Calculate the parameter count for a model with d_model=512, n_layers=6, n_heads=8. Now try d_model=1024, n_layers=24, n_heads=16. How does the parameter count scale? What's the bottleneck: attention or FFN? Use sum(p.numel() for p in model.parameters()) to verify your calculation on an actual PyTorch model.


Interactive companion tool

Break down parameter counts for any GPT config. Toggle weight tying, compare architectures side by side, and estimate memory usage.

parameter-counter.html ↗

Your action Sketch the full architecture on paper or in code comments. Trace the shape of a tensor through each component: (batch, seq_len) -> ... -> (batch, seq_len, vocab_size). Build a minimal GPT model with 2 layers and d_model=64. Print the parameter count and verify it matches your manual calculation.

Key takeaway

The transformer architecture is a stack of identical blocks, each combining multi-head self-attention (for building context) and a feed-forward network (for processing information). The shape is preserved throughout: (batch, seq_len, d_model). The final linear layer projects back to vocabulary size for next-token prediction.

Day 6

Experiments and Testing Approaches

How to evaluate your model, run ablation studies, and debug training issues

Building the model is only half the battle. Day 6 covers how to rigorously test and experiment with your implementation. You'll learn to compute perplexity (the standard LM evaluation metric), run ablation studies to understand which components matter, debug common training failures, and design experiments that teach you something about how transformers work. The goal is to build intuition, not just a working model.

Perplexity: the evaluation metric for LMs

Perplexity is the exponential of the average negative log-likelihood. Lower is better. A perplexity of N means the model is as confused as if it had to choose uniformly from N tokens at each step.

Mathematically: Perplexity = exp(cross_entropy_loss). If your model achieves a cross-entropy loss of 3.0, perplexity is e^3 ≈ 20.09, meaning the model is about as uncertain as choosing from 20 tokens randomly.

import math
import torch

def compute_perplexity(model, dataloader, device):
    # compute_loss is assumed to return the mean cross-entropy per token
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():  # No gradient computation needed
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = compute_loss(logits, targets)
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()

    avg_loss = total_loss / total_tokens
    return math.exp(avg_loss)  # Perplexity

# Usage
val_ppl = compute_perplexity(model, val_loader, device)
print(f"Validation Perplexity: {val_ppl:.2f}")
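The multiplication by targets.numel() in the function above matters whenever batches differ in size, for example a short final batch. A minimal plain-Python sketch of the difference follows; the helper name corpus_perplexity and the (loss, token count) pairs are made up for illustration, not measurements.

```python
import math

def corpus_perplexity(batch_stats):
    """batch_stats: list of (mean_loss, n_tokens) pairs, one per batch."""
    total_loss = sum(loss * n for loss, n in batch_stats)
    total_tokens = sum(n for _, n in batch_stats)
    return math.exp(total_loss / total_tokens)

# Illustrative values only: one full batch and one short final batch
stats = [(2.0, 100), (4.0, 10)]
weighted = corpus_perplexity(stats)
naive = math.exp(sum(loss for loss, _ in stats) / len(stats))  # ignores batch size
print(f"token-weighted: {weighted:.2f}, naive mean of batch losses: {naive:.2f}")
```

The naive average lets the 10-token batch count as much as the 100-token one, which inflates the reported perplexity.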

Good perplexity

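On TinyShakespeare: ~30-50 is decent for a small model. For scale, GPT-3 reports a zero-shot perplexity of about 20.5 on Penn Treebank. Lower means better predictions. But perplexity depends heavily on the dataset and tokenizer: with a 65-character vocabulary, uniform guessing already scores 65 and a loss of 2.5 is a perplexity of about 12, so only compare numbers that share both.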

Bad perplexity

Perplexity equal to vocab_size (e.g., 50,000) means the model is predicting uniformly at random. Perplexity that increases during training means something is broken (learning rate too high, bug in loss, etc.).

Interactive companion tool

Convert between cross-entropy loss and perplexity. See where your model sits on the scale from random to very good, with benchmark references for common checkpoints.

perplexity.html ↗

Ablation studies: what happens when you remove things?

An ablation study removes one component at a time to measure its contribution. This is how you build understanding of what each part does.

No position encoding: Does the model still learn? (Expect worse results on order-sensitive tasks, though a causal mask by itself leaks some position information, so the model may not fail outright.)
Single head (n_heads=1): How much does multi-head attention help?
No FFN: Just attention blocks. Does the model collapse?
Smaller d_model: How does capacity affect perplexity?
Character-level vs BPE: Same model, different tokenization. Which converges faster?

# Ablation: run the same training with different configurations
configs = [
    {"d_model": 64, "n_layers": 2, "n_heads": 2, "name": "tiny"},
    {"d_model": 128, "n_layers": 4, "n_heads": 4, "name": "small"},
    {"d_model": 256, "n_layers": 6, "n_heads": 8, "name": "medium"},
]

for cfg in configs:
    model = GPT(vocab_size, cfg["d_model"], cfg["n_layers"], cfg["n_heads"])
    # Train and evaluate...
    ppl = compute_perplexity(model, val_loader, device)
    print(f"{cfg['name']}: Perplexity = {ppl:.2f}")

Debugging training failures

When training fails, it's usually one of these issues. Here's how to diagnose each:

Loss is NaN: Learning rate too high. Gradient explosion. Try gradient clipping, lower LR, or check for bugs in your attention mask.
Loss doesn't decrease: Learning rate too low. Model has insufficient capacity. Bug in the training loop (e.g., forgot optimizer.step()).
Loss decreases on train but not val: Overfitting. Use weight decay, dropout, or get more data.
Loss spikes randomly: Gradient explosion. Enable gradient clipping with max_norm=1.0.
Model outputs the same token repeatedly: The model has "collapsed" to predicting the most frequent token. This can happen with too-high learning rates early in training.
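Several of the fixes above rely on gradient clipping by global norm. The sketch below shows the arithmetic in plain Python on a flat list of gradient values. clip_by_global_norm is a hypothetical helper written for illustration; it mirrors the idea behind torch.nn.utils.clip_grad_norm_, not its implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm      # already small enough: unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Because every gradient is multiplied by the same factor, clipping shrinks the step size but keeps the update direction.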

Experiment

Intentionally break your training loop in each of the ways described above. Observe the symptoms. Then fix them. This "break it to understand it" approach is one of the fastest ways to build debugging intuition. Try setting lr=1.0 (absurdly high) and watch the loss explode. Try commenting out optimizer.zero_grad() and observe gradient accumulation.

Overfitting deliberately: the Sanity Check

Before training on a large dataset, try overfitting on a single batch of 10-20 examples. If your model can't memorize 10 examples, something is fundamentally broken.

# Sanity check: overfit a single batch
single_batch = next(iter(train_loader))  # Get one batch
inputs, targets = single_batch
inputs, targets = inputs.to(device), targets.to(device)

for step in range(1000):
    logits = model(inputs)
    loss = compute_loss(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")
        if loss.item() < 0.01:
            print("SUCCESS: Model can memorize!")
            break

This test should take less than a minute. If loss doesn't go below 0.01 within 1000 steps, debug your model before wasting time on full training.

Experiment: scaling laws in practice

Scaling laws describe how model performance improves as you increase parameters, data, or compute. You can observe this empirically with small models:

# Run the same training with increasing model sizes
# Plot: d_model vs final perplexity
# Expect: perplexity decreases as d_model increases (diminishing returns)

# Also try: fixed model, varying training data size
# Expect: more data = lower perplexity (up to a point)

The key insight from scaling laws research: performance follows a power law with compute. Doubling compute (model size or training steps) gives you a predictable improvement in perplexity. This is why big tech companies keep building bigger models: the relationship is reliable.
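If you log (compute, loss) pairs from your own runs, you can check for power-law behavior by fitting a straight line in log-log space. The sketch below is one simple way to do that. fit_power_law is a hypothetical helper, and the data is synthetic, generated from a known power law so the fit can be checked; it is not a real measurement.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope       # (a, b)

# Synthetic data from loss = 50 * compute**-0.05, NOT real training results
compute = [1e15, 1e16, 1e17, 1e18]
loss = [50.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
print(f"a = {a:.2f}, b = {b:.3f}")
```

Real loss curves usually flatten toward an irreducible floor, which a pure power law ignores, so treat a fit like this as a first look and not a forecast.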

Your action Run the overfitting sanity check with a tiny model. Compute perplexity on train and validation sets after training. Run one ablation: compare a model with and without position encodings. Does position encoding matter for a next-token prediction task on a small corpus? Measure the difference in perplexity.

Key takeaway

Evaluation is as important as training. Perplexity gives you a scalar measure of model quality. Ablation studies reveal what each component contributes. Deliberate overfitting on a small batch is the ultimate sanity check. When something goes wrong, the symptoms (NaN loss, stagnant loss, random spikes) point directly to the cause.

Day 7

Putting It All Together: Your First End-to-End Language Model

Combine tokenization, data pipeline, embeddings, and training loop into one complete working script that generates text

Days 1–6 built each component in isolation: tokenizer, data pipeline, embeddings, training loop, architecture sketch, evaluation. Day 7 assembles them into a single cohesive script. By the end you'll have a working character-level language model trained on TinyShakespeare that actually generates text: crude text, but text that demonstrates the whole pipeline functions end-to-end. This integration step is where subtle bugs surface and where the abstraction layers you've built start earning their keep.

The integration map: how the pieces connect

Before writing code, trace the data flow. Each component you built in Days 1–6 occupies a specific slot in the pipeline:

Raw text → tokenizer (Day 1): text string becomes a flat list of integer IDs.
Token IDs → DataLoader (Day 2): the ID list is sliced into overlapping (input, target) pairs and batched.
Batch → embedding layer (Day 3): each token ID looks up a learned vector; position encodings are added.
Embedded batch → tiny model (Day 5): a stack of linear layers and activations transforms the embeddings into logits over the vocabulary.
Logits → loss → optimizer (Day 4): cross-entropy measures prediction error; AdamW updates weights; gradient clipping keeps things stable.
Trained model → perplexity + generation (Day 6): evaluate quality and sample text to verify the model learned something.

Every component is already implemented. Day 7 is about wiring them together correctly and verifying the combined system works.
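The first two stages of the map can be traced by hand on a toy string before any tensors are involved. A minimal plain-Python sketch, where make_pairs is a hypothetical stand-in for what a Dataset's __getitem__ would return:

```python
def build_vocab(text):
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def make_pairs(ids, seq_len):
    """Overlapping (input, target) windows; the target is the input shifted by one."""
    return [(ids[i:i + seq_len], ids[i + 1:i + seq_len + 1])
            for i in range(len(ids) - seq_len)]

text = "hello world"
stoi, itos = build_vocab(text)
ids = [stoi[ch] for ch in text]
pairs = make_pairs(ids, seq_len=4)
x, y = pairs[0]
print("".join(itos[i] for i in x), "->", "".join(itos[i] for i in y))  # hell -> ello
```

If the decoded target is not the input shifted by exactly one character, the off-by-one bug is in the data stage, not in the model.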

Complete script: train.py

Here is a single self-contained training script that combines every week-1 component. Read it as a checklist: each section maps to a day.

import math, torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# ── 1. TOKENIZER (Day 1) ───────────────────────────────────────
def build_vocab(text):
    chars = sorted(set(text))
    stoi  = {ch: i for i, ch in enumerate(chars)}
    itos  = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi): return [stoi[ch] for ch in text]
def decode(ids, itos):  return ''.join(itos[i] for i in ids)

# ── 2. DATA PIPELINE (Day 2) ───────────────────────────────────
class TextDataset(Dataset):
    def __init__(self, ids, seq_len):
        self.ids     = torch.tensor(ids, dtype=torch.long)
        self.seq_len = seq_len
    def __len__(self):
        return len(self.ids) - self.seq_len
    def __getitem__(self, i):
        x = self.ids[i : i + self.seq_len]
        y = self.ids[i + 1 : i + self.seq_len + 1]
        return x, y

# ── 3. MODEL: embeddings + tiny transformer skeleton (Days 3 & 5)
class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, seq_len, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        self.blocks  = nn.Sequential(*[
            nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_layers)
        ])
        self.ln_f  = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, x):
        B, T = x.shape
        pos  = torch.arange(T, device=x.device)
        h    = self.tok_emb(x) + self.pos_emb(pos)
        for block in self.blocks:
            h = h + block(h)   # residual connection
        return self.lm_head(self.ln_f(h))

# ── 4. TRAINING LOOP (Day 4) ───────────────────────────────────
def train(model, loader, optimizer, device, max_steps=3000):
    model.train()
    step = 0
    for epoch in range(999):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss   = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            step += 1
            if step % 500 == 0:
                print(f"step {step:,}  loss {loss.item():.4f}")
            if step >= max_steps:
                return

# ── 5. GENERATION ─────────────────────────────────────────────
@torch.no_grad()
def generate(model, prompt_ids, itos, seq_len, n_tokens=200, temperature=0.8):
    model.eval()
    ids = list(prompt_ids)
    for _ in range(n_tokens):
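        # Build the context on the model's device (a CPU tensor fails when the model is on CUDA)
        ctx    = torch.tensor([ids[-seq_len:]], dtype=torch.long,
                              device=next(model.parameters()).device)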
        logits = model(ctx)[0, -1] / temperature
        probs  = torch.softmax(logits, dim=-1)
        nxt    = torch.multinomial(probs, num_samples=1).item()
        ids.append(nxt)
    return decode(ids, itos)

# ── 6. MAIN ───────────────────────────────────────────────────
if __name__ == "__main__":
    # Load TinyShakespeare (download from karpathy/char-rnn repo)
    with open("input.txt") as f:
        text = f.read()

    stoi, itos = build_vocab(text)
    ids  = encode(text, stoi)
    split = int(0.9 * len(ids))
    train_ids, val_ids = ids[:split], ids[split:]

    SEQ_LEN, BATCH = 64, 32
    train_ds = TextDataset(train_ids, SEQ_LEN)
    val_ds   = TextDataset(val_ids,   SEQ_LEN)
    train_dl = DataLoader(train_ds, batch_size=BATCH, shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model  = TinyLM(vocab_size=len(stoi), d_model=128,
                    seq_len=SEQ_LEN, n_layers=4).to(device)
    opt    = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    print(f"vocab size: {len(stoi)}  params: {sum(p.numel() for p in model.parameters()):,}")
    train(model, train_dl, opt, device, max_steps=3000)

    prompt = encode("HAMLET:\n", stoi)
    print(generate(model, prompt, itos, SEQ_LEN))

Shape verification: catching bugs before training

Before committing to a full training run, run a single forward pass and verify every tensor shape. This catches the majority of integration bugs in under a second.

# Shape smoke test: run before training
B, T = 4, SEQ_LEN
dummy_x = torch.zeros(B, T, dtype=torch.long).to(device)
logits  = model(dummy_x)
print(logits.shape)  # expect (4, 64, vocab_size)

dummy_y = torch.zeros(B, T, dtype=torch.long).to(device)
loss    = nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), dummy_y.view(-1))
print(loss.item())   # expect ≈ log(vocab_size), ~4.2 for 65 chars

Initial loss sanity check

A randomly initialized model should produce a loss close to ln(vocab_size). For a 65-character vocabulary, that's ln(65) ≈ 4.17. If your initial loss is wildly different, there's a bug in the loss computation or model output shape.

Loss = 0 immediately

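If loss is ~0 from the very first steps, the task has become trivial. Typical causes: the target tensor is constant (for example all zeros, so the model only ever has to predict one token), or your labels and inputs are the same tensor, so the model just copies its input. Check the offset in your TextDataset.__getitem__.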
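The ln(vocab_size) rule can be checked without a model: an untrained network is roughly equivalent to one that outputs equal logits for every token. A small sketch in plain Python, where cross_entropy is a hand-written helper for a single prediction and not the PyTorch function:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one prediction: -log softmax(logits)[target]."""
    m = max(logits)                                    # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 65
uniform_logits = [0.0] * vocab_size                    # every token equally likely
initial_loss = cross_entropy(uniform_logits, target=0)
print(f"{initial_loss:.4f} vs ln({vocab_size}) = {math.log(vocab_size):.4f}")
```

A real random initialization gives slightly uneven logits, so expect a first-step loss near this value, not identical to it.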

Reading the training output

With 3,000 steps on TinyShakespeare you should see loss drop from ~4.2 to roughly 2.5. It will not go much lower: TinyLM has no attention, so each position is predicted from the current character (plus its position) alone, which makes it effectively a bigram model. Here's what to look for as training progresses:

Steps 0–500: Loss drops fast from ~4.2. The model is learning frequent characters and basic bigram patterns.
Steps 500–1500: Loss improves more slowly as it approaches the bigram limit. Generated text starts looking like English noise.
Steps 1500–3000: Loss plateaus around 2.5, largely regardless of model size. Expect plausible letter pairs, occasional short real words and some punctuation structure in generated text.

Loss not decreasing past step 500? Check: (1) Is optimizer.zero_grad() called before loss.backward()? (2) Is the model in model.train() mode? (3) Is the learning rate too low (1e-5 or below)? (4) Are all tensors on the same device?

Generating text: temperature and sampling

The generate function above uses temperature sampling. Temperature controls how "sharp" the probability distribution is before sampling:

# temperature = 1.0 → unchanged distribution
# temperature < 1.0 → sharper, more repetitive but coherent
# temperature > 1.0 → flatter, more random and creative

for temp in [0.5, 0.8, 1.0, 1.2]:
    out = generate(model, prompt, itos, SEQ_LEN, n_tokens=100, temperature=temp)
    print(f"\n── temperature={temp} ──\n{out}")

At temperature=0.5 the model will repeat common phrases. At temperature=1.2 it will occasionally produce nonsense characters. The sweet spot for readability is usually 0.7–0.9.
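To see what dividing by the temperature does to a distribution, here is a plain-Python sketch on three made-up logits. softmax_with_temperature is a hand-written helper that mirrors the two lines inside generate; it is not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                # made-up scores for three tokens
for temp in [0.5, 1.0, 1.2]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures push probability mass onto the top token, and higher temperatures spread it out, which is why low settings repeat themselves and high settings wander.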

Experiment

Try greedy decoding (always pick the most likely token) by replacing torch.multinomial(probs, num_samples=1).item() with torch.argmax(logits).item(). Compare output to temperature=0.5 sampling. Does greedy produce better or worse text? Why might it produce repetitive loops?

Your action Run train.py to completion and print the generated text. Then make three targeted changes and observe the effect: (1) double d_model from 128 to 256: does loss drop lower? (2) set temperature=0.5 and temperature=1.2 and compare the outputs. (3) Change the prompt from "HAMLET:\n" to "OPHELIA:\n": does the model produce different-flavored text? Document your findings in a comment at the top of the file.

Key takeaway

A working end-to-end script is worth more than six isolated components. Integration reveals the bugs that unit-testing misses: shape mismatches between pipeline stages, device placement errors, off-by-one errors in the target sequence. After Day 7 you have a real, runnable language model. Week 2 replaces the MLP blocks with proper causal self-attention, the only architectural change needed to go from this toy to the GPT family.

Tokenization understood: Character-level vs BPE, implemented both, understand the trade-offs

Data pipeline built: PyTorch Dataset/DataLoader, batches flowing, shapes verified

Embeddings mastered: Token + position embeddings, shape flow traced, parameters counted

Training loop ready: Cross-entropy, AdamW, gradient clipping, LR scheduling in place

Next week: The Transformer. You'll implement multi-head self-attention from scratch, build the transformer block, add causal masking so the model can't cheat by looking ahead, and train your first attention-based language model.

2WK

The Transformer

Build the GPT architecture: attention, layers, and forward pass

Week 2 Focus: The GPT architecture from scratch

Day 8

The GPT Architecture at a Glance

Big picture overview, GPTConfig dataclass, and the forward pass skeleton

Before diving into individual components, you need to see the full picture. A GPT model is a stack of transformer blocks that process token embeddings through self-attention and feed-forward layers. Day 8 establishes the high-level architecture using a GPTConfig dataclass to hold hyperparameters and a skeleton forward() method that shows the data flow. The key insight is that every design decision (d_model, n_heads, n_layers) directly impacts parameter count and compute cost. Understanding the skeleton helps you see where each component fits before implementing it.

Why a config dataclass?

GPT models have many hyperparameters: embedding dimension, number of heads, number of layers, vocabulary size, sequence length, dropout rate. Hardcoding these leads to messy code and makes experimentation painful. A dataclass centralizes all configuration in one place.

from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class GPTConfig:
    """Configuration for a GPT model."""
    block_size: int = 1024      # Maximum sequence length (context window)
    vocab_size: int = 50257      # GPT-2 vocabulary size
    n_layer: int = 12           # Number of transformer blocks
    n_head: int = 12             # Number of attention heads
    n_embd: int = 768            # Embedding dimension (d_model)
    dropout: float = 0.1
    bias: bool = False            # Use bias in Linear/LayerNorm layers? (GPT-2 uses them; False is slightly leaner)

    # Derived properties (computed automatically)
    @property
    def head_size(self):
        return self.n_embd // self.n_head  # d_k = d_model / n_heads

# Create a small GPT config for experimentation
config = GPTConfig(
    block_size=256,
    vocab_size=50257,
    n_layer=6,
    n_head=6,
    n_embd=384
)

Experiment

Try changing n_embd from 384 to 512 or 768. Notice how head_size changes automatically. What happens if you set n_head to 7 with n_embd=384? (Hint: 384 is not divisible by 7). This is why d_model is almost always a multiple of n_heads.
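With plain integer division, 384 // 7 silently gives 54, and the mismatch only surfaces later as a reshape error. One option is to reject such configs at construction time. The sketch below uses a hypothetical, trimmed-down CheckedConfig to show the idea with the standard dataclass __post_init__ hook; it is not part of the course's GPTConfig.

```python
from dataclasses import dataclass

@dataclass
class CheckedConfig:                      # hypothetical, trimmed-down config
    n_embd: int = 384
    n_head: int = 6

    def __post_init__(self):
        # Fail at construction time instead of deep inside a reshape
        if self.n_embd % self.n_head != 0:
            raise ValueError(
                f"n_embd={self.n_embd} is not divisible by n_head={self.n_head}")

    @property
    def head_size(self):
        return self.n_embd // self.n_head

print(CheckedConfig(n_embd=384, n_head=6).head_size)   # 64
try:
    CheckedConfig(n_embd=384, n_head=7)
except ValueError as err:
    print("rejected:", err)
```

Failing early keeps the error message next to the line that set the bad value.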

The GPT class skeleton

The GPT class composes five main components: token embeddings, position embeddings, a stack of transformer blocks, a final LayerNorm, and an LM head (output projection). The forward pass flows: input IDs -> token_emb + pos_emb -> dropout -> blocks -> ln_f -> lm_head -> logits.

class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        # Token embeddings: map token IDs to vectors
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd, bias=config.bias),
        ))

        # Language model head (projects to vocabulary)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (batch, sequence_length) - token IDs
        B, T = idx.shape

        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"

        # Positional indices
        pos = torch.arange(T, device=idx.device)

        # Embeddings
        tok_emb = self.transformer.wte(idx)          # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)          # (T, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final layernorm
        x = self.transformer.ln_f(x)

        # Output projection (logits)
        logits = self.lm_head(x)                     # (B, T, vocab_size)

        # Loss computation (if targets provided)
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

The shape annotations in comments are crucial. They help you trace the tensor dimensions as data flows through the model. If you get a shape mismatch error, these annotations are your debugging guide.

Why this architecture order?

GPT follows a specific architectural pattern that differs from the original Transformer encoder-decoder. Understanding WHY each piece is placed where it is helps you debug and modify the architecture.

Embeddings first: Raw token IDs are meaningless numbers. Embeddings map them to a semantic vector space where similar tokens are close together.
Positional embeddings added: Without them, self-attention is permutation-equivariant (shuffling the input tokens just shuffles the outputs), so it has no notion of token order. Positional embeddings inject that order information.
Blocks in sequence: Each block refines the representation. Early blocks learn low-level patterns; later blocks learn high-level semantics.
Final LayerNorm: Stabilizes the output before the LM head. Without it, the last block's output might have drifting activations.
LM head at the end: Projects the final representation to vocabulary-sized logits for next-token prediction.

GPT style (decoder-only)

Self-attention + FFN blocks only. No encoder. Causal masking ensures autoregressive generation. Simpler, works excellently for generation tasks.

Full encoder-decoder

The original Transformer has both an encoder (bidirectional attention) and decoder (causal attention). More complex, better for translation, overkill for language modeling.

Your action Create the GPTConfig dataclass and the GPT class skeleton. Print the model architecture using print(model). Count the parameters with sum(p.numel() for p in model.parameters()). Try different configs: a tiny model (n_embd=128, n_layer=2, n_head=4) vs a GPT-2 medium (n_embd=1024, n_layer=24, n_head=16).

Key takeaway

The GPT architecture is a carefully ordered pipeline: embeddings provide semantic meaning, positional information gives order, transformer blocks progressively refine representations, and the LM head converts to vocabulary predictions. Every line in the forward pass has a specific purpose: removing any component degrades performance.

Day 9

Self-Attention Mechanism

QKV projections, scaled dot-product attention, causal masking, and why masking matters

Self-attention is the core innovation that allows transformers to model long-range dependencies. Unlike RNNs that process tokens sequentially, attention lets every token "look at" every other token directly. Day 9 implements the CausalSelfAttention class with Query, Key, and Value projections, scaled dot-product attention, and the crucial causal mask that prevents the model from cheating by looking at future tokens. The causal mask is what makes the model autoregressive: it can only attend to past and current positions, never future ones.

Query, Key, and Value: what are they?

Attention works by comparing each token's Query against all tokens' Keys to compute compatibility scores, then using those scores to take a weighted sum of the Values. Think of it as a soft dictionary lookup: "given my Query, which Keys are relevant, and what are their Values?"

Query (Q): What the current token is looking for. "I am a verb, I need to find my subject."
Key (K): What each token offers for matching. "I am a noun, here is my identifying information."
Value (V): The actual information to aggregate. "Here is my semantic content to pass along."

class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0, "n_embd must be divisible by n_head"

        # Single linear projection for Q, K, V (more efficient than separate ones)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        self.head_size = config.n_embd // config.n_head

Experiment

Print the shape of self.c_attn.weight. It should be (3*n_embd, n_embd). Why do we project Q, K, V together? (Hint: it's one matrix multiply instead of three). What's the total parameter count of this layer for n_embd=768?

Scaled dot-product attention

The attention formula is: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. The scaling by sqrt(d_k) prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
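You can check the reason for the sqrt(d_k) factor numerically. If the entries of q and k are independent with unit variance, their dot product has variance of about d_k, and dividing by sqrt(d_k) brings it back to about 1. A small simulation in plain Python, where the function name, sample count and seed are arbitrary choices for illustration:

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k for q, k with independent N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / n_samples
    return sum((d - mean) ** 2 for d in dots) / n_samples

raw = dot_product_variance(d_k=64)
scaled = dot_product_variance(d_k=64, scale=True)
print(f"unscaled variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

Scores with a standard deviation of about 8 (the unscaled case for d_k=64) push softmax close to one-hot, which is where its gradients become tiny.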

    def forward(self, x):
        # x: (batch, sequence_length, n_embd)
        B, T, C = x.size()

        # Project to Q, K, V and split
        qkv = self.c_attn(x)  # (B, T, 3 * n_embd)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape for multi-head: (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        # head_size = n_embd / n_head
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Compute attention scores: (B, n_head, T, T)
        # Each token's query compares against all tokens' keys
        att = (q @ k.transpose(-2, -1)) / (self.head_size ** 0.5)

        # APPLY CAUSAL MASK HERE (next step)

        # Softmax to get attention weights
        att = nn.functional.softmax(att, dim=-1)

        # Apply attention to values
        y = att @ v  # (B, n_head, T, head_size)

        # Re-assemble heads: (B, n_head, T, head_size) -> (B, T, n_embd)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # Final projection
        y = self.c_proj(y)

        return y

The transpose operations (transpose(1, 2)) are needed to get the shape (B, n_head, T, head_size) for parallel attention computation across heads. The final transpose back recombines the heads.

Causal masking: why it's crucial

Without a causal mask, each token can attend to ALL tokens including future ones. This would let the model "cheat" during training by simply copying the next token's information. The causal mask ensures token i can only attend to tokens 0 through i.

        # Create causal mask: True above the diagonal marks future positions
        # Shape: (T, T) where causal_mask[i, j] is True if j > i; those scores are set to -inf
        causal_mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        att = att.masked_fill(causal_mask, float('-inf'))

After applying the mask and softmax, future positions have exactly 0 attention weight. The model cannot "see" tokens that haven't been generated yet, which is essential for autoregressive generation.
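The "exactly 0" claim is easy to confirm on a tiny example. The sketch below applies the same mask-then-softmax logic to a 3x3 score matrix in plain Python. causal_softmax is a hand-written illustration with arbitrary scores, not the PyTorch code path.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                            # finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, -0.3],                        # arbitrary raw attention scores
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.0]]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first row comes out as [1.0, 0.0, 0.0]: the first token can only attend to itself, no matter how large the scores for later positions are.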

Common bug Forgetting the causal mask is a silent killer. Your model will achieve great training loss (because it's cheating) but will generate gibberish because during inference, future tokens don't exist. Always verify your mask by printing att[0].detach().cpu() for a small sequence.

Interactive companion tool

See the full attention computation laid out: Q, K, V projections as heatmaps, the raw QKᵀ matrix, scaled and masked, and the final softmax weights. Toggle causal masking on/off and explore multi-head mode.

attention-explorer.html ↗

Your action Implement the full CausalSelfAttention class. Test it with a random input tensor of shape (2, 16, 768). Print the attention weights for the first head and first batch. Verify that the upper triangle is all zeros (after softmax). Remove the causal mask and observe how the attention weights change: you should see non-zero values in the upper triangle.

Key takeaway

Self-attention lets tokens communicate directly regardless of distance. Q, K, V projections create the query-key-value framework. The causal mask is what makes attention autoregressive: without it, the model cheats by looking ahead. Scaled dot-product prevents gradient vanishing in the softmax.

Day 10

Multi-Head Attention

Why one head isn't enough, splitting dimensions, and parallel computation across heads

Single-head attention forces all relationships to compete for the same attention space. Multi-head attention gives the model multiple "perspectives": one head can focus on syntactic relationships (subject-verb), another on semantic similarity, another on positional patterns. Day 10 completes the multi-head implementation by properly reshaping tensors and running attention in parallel across heads. The magic number is d_k = d_model / n_heads: this keeps the compute cost of multi-head attention roughly equal to single-head while providing multiple attention perspectives.

Why multiple heads?

Imagine trying to understand a sentence with only one type of attention. Should you attend to the subject? The verb? The object? The previous adjective? With a single head, you must average all these needs into one attention pattern. Multi-head lets each head specialize.

Multi-head (n_head=12)

Head 1 might learn subject-verb relationships. Head 2 might track noun-adjective pairs. Head 3 might focus on positional proximity. All run in parallel with the same cost as single-head.

Single-head (n_head=1)

All relationship types compete for the same attention weights. The model must learn a "compromise" attention pattern that's suboptimal for any specific relationship type. Expressivity is severely limited.

The key insight: splitting d_model into n_heads chunks of size d_k means the total compute (n_heads * (T * d_k * T)) equals the single-head compute (T * d_model * T). You get more expressivity for free.
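The equality is plain arithmetic, and a few lines of Python confirm it. attention_score_macs is a hypothetical helper that counts only the multiply-accumulates for the QK^T score matrices of one sequence.

```python
def attention_score_macs(seq_len, d_model, n_heads):
    """Multiply-accumulates needed to form the QK^T score matrices for one sequence."""
    assert d_model % n_heads == 0
    d_k = d_model // n_heads
    return n_heads * (seq_len * d_k * seq_len)   # each head: a (T, d_k) @ (d_k, T) matmul

single = attention_score_macs(seq_len=16, d_model=768, n_heads=1)
multi = attention_score_macs(seq_len=16, d_model=768, n_heads=12)
print(single, multi, single == multi)  # 196608 196608 True
```

"Free" here refers to matmul cost only: with n_heads heads the model stores n_heads separate (T, T) weight matrices, so attention-weight memory does grow with the head count.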

The reshape dance explained

Getting shapes right is 90% of the battle in multi-head attention. Here's the complete reshape sequence with explanations:

# Starting shape: (B, T, n_embd) where n_embd = n_head * head_size
# Example: (2, 16, 768) with n_head=12, head_size=64

# Step 1: Split embedding dimension into heads
# (B, T, n_embd) -> (B, T, n_head, head_size)
q = q.view(B, T, self.n_head, self.head_size)

# Step 2: Move head dimension before sequence for parallel computation
# (B, T, n_head, head_size) -> (B, n_head, T, head_size)
q = q.transpose(1, 2)

# Now we can do batch matrix multiply across heads:
# (B, n_head, T, head_size) @ (B, n_head, head_size, T) -> (B, n_head, T, T)
scores = q @ k.transpose(-2, -1)

Experiment

Set n_head=1 and observe the shapes. The transpose only moves a size-1 dimension, so it is effectively a no-op. Then set n_head=768 (head_size=1). What happens to the attention scores shape? Try n_head=768 with a real input: does the model train? (Hint: head_size=1 is too small for meaningful attention.)

Complete multi-head forward pass

Here's the complete forward method with proper multi-head handling, causal masking, and head recombination:

    def forward(self, x):
        B, T, C = x.size()

        # Project and split Q, K, V
        qkv = self.c_attn(x)  # (B, T, 3*C)
        q, k, v = qkv.split(self.n_embd, dim=2)

        # Reshape for multi-head attention
        # (B, T, C) -> (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Scaled dot-product attention
        # (B, n_head, T, head_size) @ (B, n_head, head_size, T) -> (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / (self.head_size ** 0.5))

        # Causal mask: prevent attending to future tokens
        # torch.triu creates upper triangular matrix
        causal_mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        att = att.masked_fill(causal_mask, float('-inf'))

        # Softmax to get attention weights (now upper triangle is 0)
        att = nn.functional.softmax(att, dim=-1)

        # Apply attention to values
        # (B, n_head, T, T) @ (B, n_head, T, head_size) -> (B, n_head, T, head_size)
        y = att @ v

        # Re-assemble: (B, n_head, T, head_size) -> (B, T, C)
        y = y.transpose(1, 2).contiguous()  # (B, T, n_head, head_size)
        y = y.view(B, T, C)  # (B, T, n_embd)

        # Final projection + dropout
        y = self.c_proj(y)
        y = nn.functional.dropout(y, p=self.dropout, training=self.training)

        return y
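The causal masking step in the forward pass above can be checked in isolation. The following is a plain-Python sketch (no PyTorch) of the same idea on a tiny 3x3 score matrix; the function name causal_softmax and the score values are made up for illustration.

```python
import math

def causal_softmax(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may attend to positions 0..i; later positions get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.0]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row is always [1.0, 0.0, 0.0]: the first token can only attend to itself. Every row sums to 1 and the upper triangle is exactly zero.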

The contiguous() call after transpose is important: transpose returns a view with rearranged strides, not a contiguous tensor. view() requires contiguous memory, so we call contiguous() first.
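To see the failure mode directly, here is a small sketch with made-up sizes (B=2, n_head=4, T=8, head_size=16). It assumes a standard PyTorch install; with these sizes, view() on the transposed tensor raises a RuntimeError, while contiguous() followed by view() succeeds.

```python
import torch

B, n_head, T, head_size = 2, 4, 8, 16
y = torch.randn(B, n_head, T, head_size)

y_t = y.transpose(1, 2)        # (B, T, n_head, head_size), a strided view
print(y_t.is_contiguous())     # False

try:
    y_t.view(B, T, n_head * head_size)
    view_failed = False
except RuntimeError:
    view_failed = True         # view() cannot merge the last two dims of this layout

# contiguous() copies the data into row-major order, after which view() works
merged = y_t.contiguous().view(B, T, n_head * head_size)
same = torch.equal(merged, y_t.reshape(B, T, n_head * head_size))
print(view_failed, merged.shape, same)
```

reshape() is a common alternative: it returns a view when possible and copies when it has to, so it gives the same result here.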

Your action Complete the CausalSelfAttention implementation. Test with different n_head values: 1, 6, 12, 24. For each, print the shape of y to verify it's always (B, T, n_embd) regardless of head count. Time the forward pass with torch.cuda.Event if you have a GPU. Do more heads always mean a slower forward pass?

Key takeaway

Multi-head attention gives the model multiple perspectives on the same sequence at no extra compute cost (due to the d_k split). The reshape dance (view + transpose) is the key to parallel attention across heads. Always verify tensor shapes at each step: (B, n_head, T, head_size) is your friend.

Day 11

The MLP Block

Feed-forward network, GELU vs ReLU, expansion ratio, and why we need it

After attention mixes information across tokens, the MLP (Multi-Layer Perceptron) block processes each token independently, adding non-linearity and increasing model capacity. Day 11 implements the feed-forward network that expands the embedding dimension fourfold (expansion ratio of 4), applies GELU activation, and projects back. The MLP typically accounts for ~2/3 of each transformer block's parameters, making it the dominant component by size. GELU is preferred over ReLU because it's smoother and avoids the "dying ReLU" problem where neurons get stuck at 0.

Why two transformations?

The MLP serves a different purpose than attention. While attention mixes information across positions, the MLP processes each position independently, adding computational depth. The expansion to 4*d_model followed by projection back to d_model is an expand-then-compress design (the opposite of a bottleneck): the wide hidden layer gives the non-linearity room to work, and the projection forces the result back into a compact d_model-sized representation.

First linear layer (expansion): Projects from d_model to 4*d_model. This gives the model more "room" to represent complex patterns.
GELU activation: Adds non-linearity. Without it, the entire transformer would be a linear model regardless of depth.
Second linear layer (projection): Projects back from 4*d_model to d_model. This compresses the representation back to the embedding space.

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
        # Note: GELU is applied in forward(), not stored as a Module

    def forward(self, x):
        # x: (B, T, n_embd)
        x = self.c_fc(x)          # (B, T, 4*n_embd)
        x = nn.functional.gelu(x) # GELU activation
        x = self.c_proj(x)        # (B, T, n_embd)
        x = self.dropout(x)
        return x

GELU vs ReLU: why it matters

ReLU (max(0, x)) was the standard activation, but it has problems: it's not smooth at 0, and neurons can "die" (always output 0) if weights push inputs consistently negative. GELU (Gaussian Error Linear Unit) is smoother and theoretically motivated by dropout.

GELU activation

Smooth everywhere, approximately x * sigmoid(1.702*x). Small negative values still pass through (stochastic regularization effect). Used in GPT-2, GPT-3, BERT. Better empirical performance.

ReLU activation

Not smooth at 0. "Dying ReLU" problem: if a neuron's weights push input below 0 consistently, it never recovers. All negative values become exactly 0. Simpler but inferior for deep transformers.

# GELU implementation (PyTorch has it built-in)
# But here's what it looks like conceptually:
def gelu_approx(x):
    return x * torch.sigmoid(1.702 * x)

# Or the exact formulation (what PyTorch uses):
def gelu_exact(x):
    return 0.5 * x * (1.0 + torch.erf(x / (2.0 ** 0.5)))
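To get a feel for how close the sigmoid approximation is, the two formulas can be compared on scalars with only the standard library. This is a minimal sketch; the function names and the [-5, 5] test range are choices made here for illustration.

```python
import math

def gelu_exact_scalar(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_scalar(x):
    # Sigmoid approximation: x * sigmoid(1.702 * x)
    return x / (1.0 + math.exp(-1.702 * x))

xs = [i / 100 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact_scalar(x) - gelu_sigmoid_scalar(x)) for x in xs)
print(f"max |exact - approx| on [-5, 5]: {max_gap:.4f}")

# A small negative output, where ReLU would give exactly 0
print(gelu_exact_scalar(-1.0))
```

The gap stays small across the range, which is why the approximation is usable when erf is expensive or unavailable.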

Experiment

Create a small neural network with one hidden layer. Train it on a simple task (like XOR or a small classification problem) using ReLU, then GELU. Compare convergence speed and final accuracy. Plot the activations: GELU should show a smoother transition around 0.

The expansion ratio: why 4?

The expansion ratio of 4 (d_ff = 4 * d_model) is a hyperparameter inherited from the original Transformer and widely used in GPT models. It's a sweet spot: too small and the MLP is a bottleneck that limits capacity; too large and you waste parameters on an overly wide network.

For a transformer block with d_model=768 and expansion 4:

Attention QKV projection: 768 * (3 * 768) = 1,769,472 parameters
Attention output projection: 768 * 768 = 589,824 parameters
MLP first layer: 768 * (4 * 768) = 2,359,296 parameters
MLP second layer: (4 * 768) * 768 = 2,359,296 parameters

The MLP accounts for ~2/3 of the block's parameters. This is typical and intentional: the MLP provides most of the model's capacity.

# Parameter count for MLP with expansion ratio E
def mlp_params(d_model, expansion=4):
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model  # = 2 * expansion * d_model^2

print(mlp_params(768, 4))   # 4,718,592
print(mlp_params(768, 2))   # 2,359,296 (half the params)
print(mlp_params(768, 8))   # 9,437,184 (double the params)

Your action Implement the MLP class. Create a GPT model and print the parameter count per component using sum(p.numel() for p in layer.parameters()). Verify that MLP params are roughly 2x the attention params. Try expansion ratios of 2, 4, and 8. How does this affect total model size and training speed?

Key takeaway

The MLP block is the parameter-heavy component of transformers, accounting for ~2/3 of each block's parameters. GELU activation generally works better than ReLU in deep transformers. The expansion ratio of 4 is a well-tuned hyperparameter: it provides enough capacity without wasting parameters. The MLP processes each token independently, complementing attention's cross-token mixing.

Day 12

LayerNorm and Residual Connections

Pre-norm vs post-norm, gradient flow, and why LayerNorm works better than BatchNorm here

A transformer block needs two more pieces to work: LayerNorm for stabilizing activations and residual connections for gradient flow. Day 12 implements the complete Block class with pre-norm or post-norm ordering. The choice between pre-norm (LayerNorm before attention/MLP) and post-norm (LayerNorm after) dramatically affects training stability. Pre-norm, used in GPT-2 and later models, provides better gradient flow and allows deeper networks to train stably.

Why LayerNorm, not BatchNorm?

BatchNorm normalizes across the batch dimension, which assumes a fixed batch size and similar statistics across samples. LayerNorm normalizes across the feature dimension (for each sample independently), making it batch-size-agnostic and better suited for sequences of varying length.

LayerNorm

Normalizes across features: LayerNorm(x) = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta, where the mean and variance are computed across the embedding dimension and eps is a small constant for numerical stability. Works with any batch size, any sequence length. No running statistics needed.

BatchNorm (wrong choice)

Normalizes across batch: requires tracking running statistics, behaves differently during training vs inference, struggles with small batches or variable-length sequences. Designed for CNNs, not transformers.
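The LayerNorm computation can be written out for a single token in plain Python. This is an illustrative sketch, not PyTorch's implementation: the function name layer_norm and the example vector are made up, and it follows the same conventions as nn.LayerNorm's defaults (biased variance, eps=1e-5).

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's feature vector (a list of floats)."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)        # eps guards against division by zero
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

token = [2.0, 4.0, 6.0, 8.0]
out = layer_norm(token)
print([round(v, 3) for v in out])
```

Nothing here depends on other tokens or other samples in the batch, which is why LayerNorm behaves identically in training and inference.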

class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        # x: (B, T, n_embd)
        # Pre-norm: normalize BEFORE the sublayer
        x = x + self.attn(self.ln_1(x))  # Residual + Attention
        x = x + self.mlp(self.ln_2(x))   # Residual + MLP
        return x

Pre-norm vs Post-norm

The original Transformer used post-norm: x = ln(attention(x) + x). GPT-2 and later models switched to pre-norm: x = x + attention(ln(x)). Pre-norm has better gradient flow because the residual path is "cleaner": gradients can flow directly through the identity skip connection without passing through LayerNorm.

# Post-norm (original Transformer):
def forward_postnorm(self, x):
    x = self.ln_1(x + self.attn(x))  # LayerNorm AFTER residual
    x = self.ln_2(x + self.mlp(x))
    return x

# Pre-norm (GPT-2, more stable):
def forward_prenorm(self, x):
    x = x + self.attn(self.ln_1(x))  # LayerNorm BEFORE attention
    x = x + self.mlp(self.ln_2(x))
    return x

Pre-norm is now the default for most implementations because it allows training deeper networks without the optimization difficulties that post-norm can exhibit. The residual connection ensures that even if the attention or MLP output is garbage, the model can default to the identity function.

Experiment

Implement both pre-norm and post-norm blocks. Train a small model (6 layers) on a tiny dataset. Compare training stability (loss curve smoothness) and final performance. Try scaling to 24 layers: post-norm might diverge while pre-norm trains stably.

Residual connections and gradient flow

The residual connection x + sublayer(x) is crucial for deep networks. During backpropagation, the gradient can flow directly through the identity connection, avoiding the vanishing gradient problem that plagues deep networks.

Mathematically, if y = x + F(x), then dy/dx = I + dF/dx. The identity term I ensures that even if F(x) has tiny gradients, the upstream gradient still passes through unchanged via the identity contribution instead of shrinking toward zero.
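A scalar toy model makes the effect of the identity term concrete. This sketch assumes every sublayer has the same small derivative (0.1, a hypothetical value) and multiplies the per-layer factors across 24 layers; real Jacobians are matrices, so treat this as intuition only.

```python
def gradient_scale(layer_derivs, residual):
    """Product of per-layer scalar derivatives along the backward path."""
    scale = 1.0
    for d in layer_derivs:
        # With a residual, each layer contributes (1 + dF/dx); without, just dF/dx.
        scale *= (1.0 + d) if residual else d
    return scale

derivs = [0.1] * 24   # hypothetical: every sublayer has a small derivative of 0.1
plain = gradient_scale(derivs, residual=False)
skip = gradient_scale(derivs, residual=True)
print(f"{plain:.3e}  {skip:.3f}")
```

Without the skip path the gradient collapses to about 1e-24; with it, the gradient stays on the order of 1 to 10.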

# Visualizing the residual connection:
# Without residual: x -> LayerNorm -> Attention -> output
# Gradient must pass through all operations (can vanish)

# With residual: x -+-> LayerNorm -> Attention -> (+) -> output
#                   |                              ^
#                   +------------------------------+
# Gradient has a direct path through the skip connection

# This is why we can train 100+ layer transformers:
# Each layer's skip connection provides a gradient superhighway

In the complete GPT model, the residual connections in each Block, combined with pre-norm, create a stable gradient flow that allows training models with billions of parameters.

Your action Implement the Block class with pre-norm. Create a 12-layer GPT model. Visualize the computation graph or simply verify that the output shape matches the input shape: assert y.shape == x.shape. Remove the residual connections and observe how the model trains (or fails to train) on a simple task.

Key takeaway

LayerNorm stabilizes activations by normalizing across features, making it ideal for transformers. Pre-norm (LayerNorm before sublayer) provides better gradient flow than post-norm, enabling deeper networks. Residual connections are the gradient superhighway that makes deep transformers trainable at all.

Day 13

Weight Tying and Output Layer

Sharing embeddings between input and output, lm_head, and the full forward pass

The output projection (lm_head) maps final representations back to vocabulary-sized logits. There's a clever optimization: the output embedding matrix can be shared with the input embedding matrix, a technique called "weight tying." Day 13 implements weight tying in the GPT class and completes the full forward pass. Weight tying reduces parameters by ~n_embd * vocab_size and creates a nice symmetry: the input embedding projects tokens into the representation space, and the output embedding (tied weights) projects back to token space using the same transformation matrix.

What is weight tying?

In a typical language model, you have:

wte (input embeddings): Maps token IDs to vectors. Shape: (vocab_size, n_embd).
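lm_head (output projection): Maps vectors to vocabulary logits, an (n_embd → vocab_size) linear map. PyTorch stores nn.Linear weights as (out_features, in_features), so lm_head.weight has shape (vocab_size, n_embd).

Notice that the two weight tensors have the same shape, so weight tying can share one matrix for both: lm_head.weight = wte.weight. The embedding lookup reads row j of the matrix; the output projection takes the dot product of a vector with every row. This works because the output step is conceptually a similarity measurement between a vector and each token's embedding.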

class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd, bias=config.bias),
        ))

        # Option 1: Separate lm_head (more parameters)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Option 2: Weight tying (share wte weights)
        # self.lm_head.weight = self.transformer.wte.weight  # Tie weights

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.config.block_size

        pos = torch.arange(T, device=idx.device)

        # Embeddings
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Final LayerNorm
        x = self.transformer.ln_f(x)

        # Output projection to vocabulary logits
        if targets is not None:
            # During training: we need logits for all positions
            logits = self.lm_head(x)
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        else:
            # During inference: only need logits for the last position
            logits = self.lm_head(x[:, [-1], :])
            loss = None

        return logits, loss

Experiment

Compare parameter counts with and without weight tying. For GPT-2 (n_embd=768, vocab_size=50257), the lm_head alone has 768 * 50257 ≈ 38.6M parameters. With tying, you save all those parameters. Does weight tying affect model quality? Train two small models and compare perplexity.

Why weight tying works

Weight tying is motivated by the symmetry of the task. The input embedding asks: "What is the vector representation of token X?" The output embedding asks: "What is the probability of token X given this vector?" These are inverse operations, so sharing weights makes sense.

Mathematically, the output logits for position i are computed as:

# Without tying:
logits[i, j] = dot(x[i], W_out[j])
# Where W_out is a separate (vocab_size, n_embd) matrix; lm_head has no bias term (bias=False)

# With tying: W_out = W_in (in matrix form, logits = x @ W_in^T)
logits[i, j] = dot(x[i], W_in[j])
# This is just measuring similarity between x[i] and token j's embedding

Weight tying is especially beneficial for smaller models where the embedding parameters are a significant fraction of total parameters. For GPT-3 scale models, the savings are less impactful but still meaningful.
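The tying step can be tried on a toy configuration. This is a minimal sketch with small made-up sizes (vocab_size=100, n_embd=16) rather than GPT-2's; it assumes a standard PyTorch install and uses only nn.Embedding and nn.Linear.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 16   # small illustrative sizes, not GPT-2's
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# nn.Linear stores weight as (out_features, in_features) = (vocab_size, n_embd),
# the same shape as the embedding table, so the Parameter can be shared directly.
print(wte.weight.shape, lm_head.weight.shape)
lm_head.weight = wte.weight

x = torch.randn(2, 5, n_embd)
logits = lm_head(x)                      # (2, 5, vocab_size)
manual = x @ wte.weight.T                # dot product with every token embedding
print(lm_head.weight is wte.weight, torch.allclose(logits, manual, atol=1e-5))
```

Because both modules now hold the same Parameter object, an optimizer step that updates one updates the other.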

Interactive companion tool

See the impact of weight tying on total parameter count. Toggle it on/off for any GPT config and watch the LM head parameters appear or disappear.

parameter-counter.html ↗

The complete forward pass

With all components in place, the full forward pass is: input IDs → token embeddings + positional embeddings → dropout → N transformer blocks (each: pre-norm → attention → residual, pre-norm → MLP → residual) → final LayerNorm → output projection → logits.

The shape flow for a concrete example with batch=4, seq_len=128, n_embd=768:

idx: (4, 128) - token IDs
tok_emb: (4, 128, 768) - after wte
pos_emb: (128, 768) - broadcast to (4, 128, 768)
x after dropout: (4, 128, 768)
After N blocks: still (4, 128, 768)
After ln_f: (4, 128, 768)
logits: (4, 128, 50257) - vocabulary-sized output

Each position's logits represent the model's prediction for the NEXT token at that position. The cross-entropy loss compares these predictions against the shifted target sequence.

Your action Complete the GPT class with weight tying. Add a method generate() that takes input IDs and generates new tokens autoregressively. Test the full forward pass with a small batch. Verify shapes at each step. Enable weight tying and confirm that lm_head.weight is transformer.wte.weight returns True.

Key takeaway

Weight tying shares the input embedding matrix with the output projection, reducing parameters and creating conceptual symmetry. The full forward pass is a carefully ordered pipeline where each component has a clear purpose. After Day 13, you have a complete (though untrained) GPT model ready for training.

Day 14

Week 2 Wrap-up

Parameter count analysis, testing the full model, and preparing for training

Week 2 is complete! You've built the entire GPT architecture from scratch. Day 14 wraps up by analyzing the parameter count, testing the model with random inputs, and ensuring everything works before Week 3's training loop. Understanding your model's parameter budget is crucial: it tells you how much memory you need, how fast training will be, and whether your architecture choices are reasonable. A GPT-2 small has ~124M parameters: here's where they all come from.

Parameter count breakdown

Let's count parameters for a GPT-2 small configuration: n_layer=12, n_head=12, n_embd=768, vocab_size=50257, block_size=1024.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def gpt2_small_params(tie_weights=True):
    # Embeddings
    wte = 50257 * 768          # 38,597,376
    wpe = 1024 * 768           # 786,432

    # One transformer block (weight matrices only; linear biases omitted)
    # Attention:
    c_attn = 768 * (3 * 768)    # 1,769,472 (QKV projection)
    c_proj_attn = 768 * 768    # 589,824 (output projection)
    # MLP:
    c_fc = 768 * (4 * 768)        # 2,359,296
    c_proj_mlp = (4 * 768) * 768  # 2,359,296
    # LayerNorms: 2 * (2 * 768) = 3,072 (small, negligible, left out here)
    block_params = c_attn + c_proj_attn + c_fc + c_proj_mlp  # 7,077,888

    # 12 blocks
    all_blocks = 12 * block_params    # 84,934,656

    # Final LayerNorm
    ln_f = 2 * 768              # 1,536

    # LM head: adds nothing when tied to wte (as in GPT-2)
    lm_head = 0 if tie_weights else 768 * 50257   # 38,597,376 if untied

    total = wte + wpe + all_blocks + ln_f + lm_head
    return total  # 124,320,000 tied (~124M); 162,917,376 untied

print(f"GPT-2 Small estimated params: {gpt2_small_params():,}")

Interactive companion tool

Explore the full parameter count breakdown for GPT-2 Small/Medium/Large and any custom config. Includes memory estimates for FP32 and BF16.

parameter-counter.html ↗

124M

Total parameters (GPT-2 Small)

39M

Embeddings (wte + wpe)

85M

Transformer blocks (12 layers)

Testing the full model

Before training, verify that your model runs correctly with a forward pass using random inputs. Check output shapes, run a backward pass to verify gradients flow, and test generation.

import torch
from torch.utils.data import DataLoader

# Create model
config = GPTConfig(vocab_size=50257, n_embd=768, n_layer=12, n_head=12, block_size=1024)
model = GPT(config)
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Test forward pass
B, T = 4, 128
idx = torch.randint(0, config.vocab_size, (B, T))
if torch.cuda.is_available():
    idx = idx.cuda()

logits, loss = model(idx, targets=idx)  # Use same as targets for testing
print(f"Logits shape: {logits.shape}")  # Should be (4, 128, 50257)
print(f"Loss: {loss.item()}")

# Test backward pass
loss.backward()
print("Backward pass successful!")

# Check gradient norms (should be reasonable, not NaN or Inf)
total_norm = 0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm}")

Experiment

Time the forward and backward pass. Try different batch sizes (1, 4, 16, 32) and observe memory usage. What's the maximum batch size you can fit on your GPU? How does the gradient norm change with batch size? (Hint: because the loss is averaged over the batch, the noisy part of the gradient should shrink roughly as 1/sqrt(batch_size); this assumes no gradient accumulation.)

Ready for training!

After Week 2, you have:

A complete GPT model architecture with all components implemented from scratch
Understanding of why each component exists and how it fits in the pipeline
Verified shape flow and gradient propagation
Parameter count under control (you know where every parameter lives)

Week 3 will add the training loop: data loading, loss computation, optimizer setup (AdamW), learning rate scheduling, gradient clipping, and evaluation. Your model is ready to learn!

Before Week 3 Make sure your model forward pass works perfectly. A bug in the architecture will be nearly impossible to diagnose once training starts. Run the tests, verify shapes, check gradients. The time you spend debugging now saves hours during training.

Your action Run the full parameter count analysis for your model. Verify the forward and backward pass. Test with multiple batch sizes. Calculate the memory needed for training (parameters + gradients + optimizer states + activations). For a 124M parameter model in FP32, the weights alone take roughly 0.5 GB; adding gradients and AdamW's two moment buffers brings the total to roughly 2 GB before activations, which grow with batch size and sequence length.
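A back-of-the-envelope version of the memory calculation, as a minimal sketch: it assumes FP32 (4 bytes per value) and an AdamW-style optimizer keeping two moment buffers per parameter, and it deliberately leaves out activations, which depend on batch size and sequence length. The function name training_memory_gb is hypothetical.

```python
def training_memory_gb(n_params, bytes_per_value=4):
    """Rough FP32 memory for weights, gradients, and AdamW state (no activations)."""
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adam_state = 2 * n_params * bytes_per_value   # first and second moment buffers
    gb = 1024 ** 3
    return {
        "weights": weights / gb,
        "grads": grads / gb,
        "adam_state": adam_state / gb,
        "total": (weights + grads + adam_state) / gb,
    }

est = training_memory_gb(124_000_000)
for name, value in est.items():
    print(f"{name:>10}: {value:.2f} GB")
```

The fixed cost is about four times the size of the weights; whatever GPU memory remains is the budget for activations.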

Key takeaway

You've built a complete GPT model from scratch! The parameter count breakdown shows that most parameters live in the embeddings and MLP layers. Testing the full model before training is essential: debug now, not during training. Week 3 will bring your model to life with a training loop.

Architecture understood
GPTConfig, attention, MLP, LayerNorm, residuals: all implemented

Multi-head attention
QKV projections, causal masking, parallel heads: working

Transformer blocks
Pre-norm, residual connections, MLP with GELU: stacked and tested

Model complete
Full forward pass, weight tying, parameter count verified: ready to train

Next week: Training Loop. You'll implement the complete training loop with AdamW optimizer, learning rate scheduling (warmup + cosine decay), gradient clipping, mixed precision training, and checkpointing. Your model will finally learn to predict text!

3WK

The Training Loop

Teach your model language with data, loss, and optimization

Week 3 Focus: Teaching techniques

Day 15

The Training Objective

Next-token prediction, cross-entropy loss, and loss interpretation

Before we can train our model, we need to answer one fundamental question: what exactly are we trying to minimize? This is the training objective: the measure of how wrong the model is, and the signal that drives learning. For language models, the answer is elegantly simple: predict the next token. Every position in every training sequence becomes a supervised example where the context is the input and the following token is the label. Understanding this deeply (not just as a formula but as an intuition) is what separates someone who can run training from someone who can diagnose it when it goes wrong.

1 Next-Token Prediction & Self-Supervised Learning

A language model is trained to predict the next token given all previous tokens in a sequence. This framing is called self-supervised learning: no human labeling is needed because the text itself provides its own labels. Given the sequence "hello", the model learns to predict "e" from "h", "l" from "he", "l" from "hel", and "o" from "hell", all from a single training example. A sequence of 256 characters gives the model 255 separate prediction tasks to learn from simultaneously.

Why is this powerful? Self-supervised learning unlocks the entire internet as training data. Manual annotation for next-token prediction would be impossible at scale: you'd need a human to label billions of individual token positions. Instead, any text file you can find becomes a valid training corpus with zero annotation cost.

The label for any input is simply the input shifted by one position: the target at position t is the token at position t+1. This is the "shift" you'll see in every data loader implementation:

# A single sequence becomes many prediction targets automatically
# Input sequence:  [h, e, l, l, o,  ,  w, o, r, l, d]
# Target sequence: [e, l, l, o,  ,  w, o, r, l, d, !]
#                  ↑ predict next char at every position in parallel

def get_input_and_labels(sequence):
    x = sequence[:-1]   # All tokens except the last (inputs)
    y = sequence[1:]    # All tokens except the first (labels, shifted by 1)
    return x, y         # Same length: every input position has a label

During training, the model sees all positions in parallel (thanks to the causal mask from Week 2), computing a prediction and a loss at every single position. This is why transformer training is so efficient: one forward pass on a 256-token input yields 256 prediction losses rather than just one.
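To make the "many examples per sequence" point concrete, the pairs hidden inside one string can be enumerated. This is an illustrative sketch on raw characters; the helper name next_token_examples is hypothetical, and a real loader would work on token IDs rather than strings.

```python
def next_token_examples(sequence):
    """List every (context, target) pair contained in one sequence."""
    x, y = sequence[:-1], sequence[1:]
    # At position t the model sees x[:t+1] and must predict y[t].
    return [(x[: t + 1], y[t]) for t in range(len(x))]

for context, target in next_token_examples("hello"):
    print(f"{context!r} -> {target!r}")
# 'h' -> 'e'
# 'he' -> 'l'
# 'hel' -> 'l'
# 'hell' -> 'o'
```

A 256-character sequence yields 255 such pairs, matching the count given earlier.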

2 Cross-Entropy Loss

Cross-entropy loss quantifies how surprised the model is by the correct answer. The model outputs a probability distribution over all vocabulary characters, and we measure how much probability mass it assigned to the actual next character. Mathematically: loss = -log(p_correct). If the model assigns 1% probability to the right character, -log(0.01) ≈ 4.6. If it assigns 50%, -log(0.5) ≈ 0.69. Perfect prediction would give loss 0.

Why cross-entropy specifically? It aligns exactly with maximum likelihood estimation: minimizing cross-entropy is mathematically equivalent to maximizing the probability the model assigns to the training data. It also has convenient gradient properties: when the model is very wrong, gradients are large; when it's nearly right, gradients are small. This helps updates settle down as the model approaches a good solution.
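The formula loss = -log(p_correct) can be computed directly from raw logits. This is a plain-Python sketch for a single position (the function name and the logit values are made up); it uses the log-sum-exp trick, the standard way to keep the softmax numerically stable.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0] * 65, target=3)          # no preference among 65 chars
confident = cross_entropy([0.0] * 64 + [10.0], target=64)
wrong = cross_entropy([0.0] * 64 + [10.0], target=0)   # confident about the wrong char
print(f"{uniform:.3f} {confident:.4f} {wrong:.3f}")
```

Uniform logits give exactly ln(65), the random baseline. A confident correct prediction is close to 0, and a confident wrong one is penalized heavily.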

In practice we average the loss over every token in every sequence in the batch. This is why the logits need to be reshaped before passing to PyTorch's loss function:

import torch.nn.functional as F

# logits shape: (batch_size, seq_len, vocab_size)
# targets shape: (batch_size, seq_len)
# F.cross_entropy expects: (N, C) logits and (N,) targets
# So we flatten the batch and sequence dimensions together:

loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),  # → (batch*seq_len, vocab_size)
    targets.view(-1)                    # → (batch*seq_len,)
)
# loss is now a single scalar: mean CE across all positions and batches

There is also a useful connection to perplexity: perplexity = e^loss. A loss of 4.17 gives perplexity ≈ 65, which means the model is as uncertain as if it were choosing uniformly from 65 options, exactly what random guessing over a 65-character vocabulary looks like. A loss of 1.0 gives perplexity ≈ 2.7, meaning the model is about as uncertain as a uniform choice among fewer than three characters.
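The conversions are one-liners. A small sketch (function names are illustrative) that turns a loss into a perplexity and into the geometric-mean probability assigned to the correct token:

```python
import math

def perplexity(loss):
    return math.exp(loss)

def mean_correct_prob(loss):
    # Geometric-mean probability assigned to the correct token
    return math.exp(-loss)

for loss in (math.log(65), 2.0, 1.0):
    print(f"loss={loss:.3f}  ppl={perplexity(loss):.1f}  p={mean_correct_prob(loss):.3f}")
```

The two quantities are reciprocals of each other, so either one can be read off a loss curve.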

3 Interpreting Loss Numbers

Loss numbers are meaningless in isolation: they only make sense relative to your vocabulary size and your baseline. For a character-level model on TinyShakespeare (65 characters), here are the key milestones to know:

~4.2: Random baseline (-ln(1/65)). The model has learned nothing. Training hasn't started or something is broken.
~3.0–3.5: The model has learned that some characters appear more often than others (e.g., spaces, 'e', 't'). It's using frequency statistics.
~2.0–2.5: Common letter combinations are being predicted (e.g., "th", "he", "in"). Outputs start looking like scrambled English words.
~1.5–2.0: Whole words appear in outputs. The model understands basic vocabulary.
~1.0–1.2: Good performance. Grammatically plausible sentences, consistent character voices in Shakespeare-like text.
<0.8: Likely overfitting if val loss is higher. The model is memorizing training sequences.

A concrete intuition for confidence: At loss=1.0, the model assigns on average (in the geometric-mean sense) e^(-1.0) ≈ 37% probability to the correct next character. At loss=2.0, it's e^(-2.0) ≈ 13.5%. At the random baseline of ~4.17, it's 1/65 ≈ 1.5%. Watch for train and val loss diverging: that's your first sign of overfitting.

Interactive companion tool

Convert between cross-entropy loss and perplexity. See where your model sits on the scale from random to very good, with benchmark references for common checkpoints.

perplexity.html ↗

4 From Loss to Weight Updates

Loss is a number, but training is about changing weights. The chain connecting them is backpropagation: once we have a loss scalar, PyTorch traces every operation that produced it and computes the gradient of the loss with respect to every learnable parameter. These gradients tell each weight which direction to move to reduce the loss.

High loss means large gradients: the model is very wrong and needs big corrections. Low loss means small gradients: the model is mostly right and only needs fine-tuning. This is why training curves have a characteristic shape: rapid initial improvement (large gradients pulling weights into roughly the right region), then slower refinement (small gradients making precise adjustments).

The three-line update you'll write in every training loop:

loss.backward()          # Compute gradients for every parameter
optimizer.step()         # Update each weight from its gradient (plain SGD: w -= lr * grad)
optimizer.zero_grad()    # Clear gradients so they don't accumulate into next step

The order matters: zero_grad() must happen either before backward() or right after step(), never between them. If you forget it, gradients from previous batches pile up and your updates will be wildly too large.
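The accumulation behavior is easy to reproduce on a single scalar parameter. This is a minimal sketch assuming a standard PyTorch install; it resets the gradient by hand with w.grad.zero_(), which serves the same purpose as optimizer.zero_grad() in a real loop.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()      # d(3w)/dw = 3

(3.0 * w).backward()       # no reset in between
stacked = w.grad.item()    # gradients add up: 3 + 3 = 6

w.grad.zero_()             # manual reset of the accumulated gradient
(3.0 * w).backward()
fresh = w.grad.item()      # back to 3

print(first, stacked, fresh)  # 3.0 6.0 3.0
```

The same accumulation is used on purpose for gradient accumulation across micro-batches; it only becomes a bug when it is unintended.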

Your action Open a Python session and compute the random-guess baseline for your dataset: import math; vocab_size = len(set(open('input.txt').read())); print(-math.log(1/vocab_size)). Write this number down: it's your starting line. Then add a one-liner to your training script that prints "Random baseline: {baseline:.4f}" before the first training step so you always know how far you've come.

Key Takeaway: Cross-entropy loss quantifies training progress. Character-level models with good performance hit 1.0-1.2 loss, while random guessing gives ~4.2.

Day 16

Data Loading

Character-level tokenization, stoi/itos mappings, train/val split

The data pipeline is where most silent training bugs live. Before your model ever sees a weight update, you need to read raw text, convert every character to an integer, split the corpus into training and validation sets, and sample batches of correctly paired inputs and labels. Every step has a way to silently go wrong, and a wrong data pipeline will produce confidently incorrect training runs, with loss falling smoothly toward a wrong answer. Today you build and verify each piece before wiring them together.

1 Character-Level Tokenization

Neural networks can't process raw text; they need numbers. Tokenization is the process of converting characters (or words, or subwords) into integer indices that can be looked up in an embedding table. At the character level, we collect every unique character in the dataset, sort them into a consistent order, and assign each one an integer from 0 to vocab_size−1.

Why sort? Sorting guarantees reproducibility. Without it, Python's set() returns characters in an arbitrary order that varies between runs. Two training runs on the same data would produce different stoi mappings, making saved checkpoints incompatible with each other.

Why character-level for small datasets? Word-level tokenization needs enough examples of each word to learn its embedding reliably, typically thousands of occurrences. Character-level tokenization only needs a ~65-character vocabulary, so even 1MB of text gives you tens of thousands of examples per character. The tradeoff is longer sequences (characters instead of words), but this is manageable at our scale.

def load_data(path):
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Collect all unique characters and sort for reproducibility
    chars = sorted(list(set(text)))
    vocab_size = len(chars)

    # stoi: character → integer index (used during encoding)
    stoi = {ch: i for i, ch in enumerate(chars)}
    # itos: integer index → character (used during decoding/generation)
    itos = {i: ch for i, ch in enumerate(chars)}

    # Convenience wrappers so callers don't need to know stoi/itos directly
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: ''.join([itos[i] for i in ids])

    return text, stoi, itos, encode, decode, vocab_size

After calling this, encode("hello") gives you something like [32, 29, 36, 36, 39] (the exact IDs depend on your vocabulary) and decode([32, 29, 36, 36, 39]) gives back "hello". Always verify that decode(encode(text)) == text before proceeding. If encode raises a KeyError, or the round trip returns a different string, the stoi/itos mappings were not built from the same text you are encoding.
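The round-trip check can be tried on a tiny string before touching a real corpus. This is a minimal sketch: build_vocab and the sample string are hypothetical stand-ins for load_data and your dataset. It also shows what happens when you encode a character the vocabulary has never seen.

```python
def build_vocab(text):
    # Sorted so the character -> index mapping is identical on every run
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

sample = "hello world"                      # illustrative stand-in for a corpus
stoi, itos = build_vocab(sample)
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids, decode(ids))

# Round trip over the full text must be lossless
round_trip_ok = decode(encode(sample)) == sample

# A character that never appeared in the text has no index
try:
    encode("hello!")
    unseen_char_encodes = True
except KeyError:
    unseen_char_encodes = False
```

With this eight-character vocabulary (space, d, e, h, l, o, r, w in sorted order), "hello" encodes to [3, 2, 4, 4, 5], and the "!" triggers a KeyError.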

2 Encoding the Full Dataset & Train/Val Split

Once we have our encode function, we convert the entire text corpus into a single long tensor of integers. This is the format our model will consume throughout training. We then split it 90% / 10% into training and validation sets.

Why split sequentially, not randomly? For text, shuffling individual characters before splitting would leak information: the validation set would contain characters from the middle of training sequences, and the model would have implicitly "seen" the surrounding context during training. A clean sequential split ensures the validation set is genuinely unseen text.

import torch

# Encode the full text to a 1D integer tensor
# dtype=torch.long (int64) because embedding layers expect integer indices
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Dataset: {len(data):,} tokens, vocab size: {vocab_size}")
# e.g. "Dataset: 1,115,394 tokens, vocab size: 65"

# Sequential 90/10 split
split_idx = int(0.9 * len(data))
train_data = data[:split_idx]   # ~1,003,854 tokens
val_data   = data[split_idx:]   # ~111,540 tokens

3 Batch Sampling with get_batch

We don't train on the full dataset at once; memory and compute don't allow it. Instead, we sample small random batches of fixed-length subsequences. Each batch contains batch_size sequences of length context_len, drawn from random positions in the data. The label for each input sequence is the same window moved one position forward in the data, so y[t] is the character that follows x[t].

Why random positions? If we always walked through the text in order, consecutive batches would come from the same neighborhood of the corpus, so each update would be biased toward whatever that passage happens to look like. Random sampling draws every batch from across the whole corpus, which covers the text uniformly in expectation, makes training more stable, and prevents the model from learning position-specific patterns.

def get_batch(data, batch_size, context_len, device):
    # Sample batch_size random starting positions
    # Stop context_len from the end so we can always get a full sequence
    idx = torch.randint(len(data) - context_len, (batch_size,))

    # x: input sequences, shape (batch_size, context_len)
    x = torch.stack([data[i: i+context_len] for i in idx])
    # y: target sequences, shifted by 1, same shape
    y = torch.stack([data[i+1: i+context_len+1] for i in idx])

    # Move to the correct device before returning
    return x.to(device), y.to(device)

# Sanity check: y should be x shifted left by one position
x, y = get_batch(train_data, batch_size=4, context_len=8, device='cpu')
print(x[0])           # e.g. tensor([32, 29, 36, 36, 39,  1, 47, 53])
print(y[0])           # e.g. tensor([29, 36, 36, 39,  1, 47, 53, 56])
assert (x[0][1:] == y[0][:-1]).all(), "Labels must be inputs shifted by 1!"

Run that assertion every time you change the data pipeline. The most common bug in data loading is accidentally setting y = data[i:i+context_len] (same as x) instead of y = data[i+1:i+context_len+1]. The model will train, loss will drop, but it's learning to copy its input rather than predict the next token: a subtle and completely silent failure.
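It can help to see what one (x, y) pair actually asks the model to do. The sketch below uses a plain Python list of made-up token IDs instead of a tensor. Each position t in a window is its own training example: the context x[:t+1] must predict y[t].

```python
# Made-up token IDs standing in for an encoded corpus
data = list(range(100, 110))
context_len = 4
i = 2                                      # one sampled start position

x = data[i : i + context_len]              # inputs
y = data[i + 1 : i + context_len + 1]      # targets: same window, one step ahead

# One window yields context_len (context -> next token) examples
pairs = [(x[: t + 1], y[t]) for t in range(context_len)]
for context, target in pairs:
    print(f"{context} -> {target}")
```

Here x is [102, 103, 104, 105] and y is [103, 104, 105, 106], so the last example asks the model to predict 106 from the full four-token context.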

Interactive companion tool

Debug your batch shapes interactively. See how dataset size, sequence length, and batch size affect samples per epoch, tokens seen, and memory footprint.

batch-shapes.html ↗

Your action Download the TinyShakespeare dataset (wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). Implement load_data and get_batch from today's lesson. Print the vocab size, dataset length, and the shapes of one batch. Then verify that y[0] equals x[0] shifted left by one position; if they don't match, your labels are wrong and nothing will train correctly.

Key Takeaway: Character-level tokenization is ideal for small datasets. Always split into train/val sets to monitor overfitting.

Day 17

Device Setup & Data Loading

MPS vs CUDA vs CPU, get_batch function, tensor shapes

Your model and its data must live on the same device: a matrix multiplication between a tensor on CPU and one on GPU raises an immediate error. More importantly, the difference between training on CPU vs a GPU is not marginal: a 10M parameter model that takes 3 hours on Apple Silicon M3 would take days on a laptop CPU. Day 17 wires together device detection with the encoding pipeline from Day 16 to produce a complete, device-aware training setup, and gives you the tools to verify everything is on the right device before launching a real run.

1 Device Selection: MPS vs CUDA vs CPU

A "device" in PyTorch refers to a specific memory space and compute unit. A CPU tensor lives in your machine's RAM and is computed by your CPU cores. A CUDA tensor lives in your NVIDIA GPU's VRAM and is computed in parallel across thousands of GPU cores. An MPS tensor does the same on Apple Silicon's GPU. You can't mix them in a single operation.

Why does this matter for speed? Transformer training is dominated by large matrix multiplications. A CPU runs these serially across a handful of cores. A GPU runs them across thousands of cores simultaneously. For a model with ~10M parameters, this translates to a 20–50x speedup in practice. The time difference between training runs is what makes GPU selection worth thinking about before you start.

We prioritize: CUDA (NVIDIA) → MPS (Apple Silicon) → CPU (fallback for testing and debugging).

import torch

def get_device():
    if torch.cuda.is_available():
        device = torch.device('cuda')
    elif torch.backends.mps.is_available():  # Apple M1/M2/M3/M4
        device = torch.device('mps')
    else:
        device = torch.device('cpu')
    print(f"Using device: {device}")
    return device

# Verify your model is on the right device after moving it:
# model.to(device)
# print(next(model.parameters()).device)  # Should match get_device() output

MPS caveat: Not all PyTorch operations are supported on MPS. By default an unsupported op raises a NotImplementedError. If you set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1, such ops fall back to the CPU with a warning instead, which can make timing benchmarks misleading. If you see unexpected slowness on MPS, look for those fallback warnings in your output, and run torch.backends.mps.is_built() to confirm MPS support is compiled in.

2 Tensor Dtypes & the Complete Data Pipeline

When we encode text to a tensor, we must use dtype=torch.long (64-bit integer). This is required because our tokens are indices: they get passed to an embedding layer, which performs a table lookup and therefore requires integer inputs. If you accidentally use torch.float, you'll get an error at training time complaining that the indices must be an integer type (Long or Int).

device = get_device()

# Encode: text string → 1D integer tensor (stays on CPU for now)
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Full dataset: {data.shape}, dtype: {data.dtype}")
# Full dataset: torch.Size([1115394]), dtype: torch.int64

# Sequential train/val split (do NOT shuffle text data)
split_idx = int(0.9 * len(data))
train_data = data[:split_idx]
val_data   = data[split_idx:]

# Note: the full dataset stays on CPU here.
# Individual batches are moved to device inside get_batch() via .to(device).
# This avoids requiring GPU memory for the full dataset at once.

Keeping the full dataset on CPU and only moving individual batches to GPU is important for memory management. A 1MB text file encodes to ~1M tokens, which is ~8MB as int64: trivial for CPU RAM. At this scale it would also fit on the GPU, but the habit pays off as datasets grow. On the GPU, you want memory reserved for model weights, activations, and gradients, not static data that can be streamed in batch by batch.
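The memory arithmetic above is simple enough to check directly. This sketch uses the TinyShakespeare token count quoted earlier; the batch_size and context_len values are assumptions picked for illustration, not settings from this lesson.

```python
BYTES_PER_INT64 = 8

def tensor_megabytes(n_elements, bytes_per_element=BYTES_PER_INT64):
    # Size of a dense tensor in megabytes (1 MB = 1e6 bytes)
    return n_elements * bytes_per_element / 1e6

# Full encoded dataset, kept on CPU
dataset_mb = tensor_megabytes(1_115_394)

# One batch of inputs plus targets, moved to the GPU each step
batch_size, context_len = 64, 256          # assumed values for illustration
batch_mb = tensor_megabytes(2 * batch_size * context_len)

print(f"dataset: {dataset_mb:.2f} MB, one batch (x and y): {batch_mb:.3f} MB")
```

Under these assumptions, each batch on the GPU is a small fraction of the full dataset's size, which is why streaming batches scales to corpora far larger than GPU memory.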

3 Pre-Training Sanity Checks

Before committing to a multi-hour training run, spend 5 minutes verifying your pipeline end-to-end. These checks catch the most common setup errors without wasting compute:

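import math
import torch.nn.functional as F

# 1. Verify device is correct
# (GPT and config are the model class and settings defined in earlier lessons)
device = get_device()
model = GPT(config).to(device)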
print(next(model.parameters()).device)   # Must match device

# 2. Verify data shapes
x, y = get_batch(train_data, batch_size=4, context_len=64, device=device)
print(x.shape, x.device)  # torch.Size([4, 64]), cuda:0 (or mps, or cpu)
print(y.shape, y.device)  # torch.Size([4, 64]), same device

# 3. Verify a single forward pass runs without errors
with torch.no_grad():
    logits = model(x)
    print(logits.shape)   # (4, 64, vocab_size)

# 4. Verify the initial loss is near the random baseline
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
print(f"Initial loss: {loss.item():.4f} (expect ~{math.log(vocab_size):.2f})")

That last check is especially valuable: an untrained model with random weights should assign roughly equal probability to every character, giving a loss very close to -ln(1/vocab_size). If your initial loss is significantly lower than the random baseline, something is wrong: likely your model or data has been accidentally pre-fitted, or there's a label leakage bug. If it's higher than the baseline, check your loss function is averaging correctly across all positions.
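The link between "equal probability for every character" and the baseline number can be verified without a model. The sketch below is a hand-written cross-entropy for a single position, not PyTorch's implementation, and the logit values are illustrative.

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed with the log-sum-exp trick for stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 65

# An untrained model that has no preference: identical logits for every character
uniform_logits = [0.0] * vocab_size
baseline = cross_entropy(uniform_logits, target=10)

# A model that is confidently wrong scores WORSE than random guessing
wrong_logits = [0.0] * vocab_size
wrong_logits[3] = 5.0                       # high score on the wrong character
confident_wrong = cross_entropy(wrong_logits, target=10)

print(f"uniform: {baseline:.4f}  ln(V): {math.log(vocab_size):.4f}")
print(f"confidently wrong: {confident_wrong:.4f}")
```

Uniform logits reproduce ln(vocab_size) exactly, while a confident mistake lands above it, which is one way an initial loss can exceed the baseline when the initial logits are too large.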

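Your action Add get_device() to your script and confirm it prints the right device. Then benchmark a single forward pass: create a dummy batch at your intended dimensions and use import time; t=time.time(); [model(x) for _ in range(10)]; print((time.time()-t)/10). On CUDA or MPS, call torch.cuda.synchronize() or torch.mps.synchronize() before reading the clock, because GPU work is launched asynchronously. A training step also runs a backward pass, which typically costs about twice the forward pass, so multiply the forward time by roughly three and then by your planned number of steps to estimate total training time before committing to a full run.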

Key Takeaway: Always use GPU/MPS when available for 10-50x faster training. Proper tensor shape management avoids runtime errors.

Day 18

Learning Rate Schedule

Warmup phase, cosine decay, get_lr function

The learning rate is the single most consequential hyperparameter in training. Too high and the model diverges: gradients become enormous, weights fly off to infinity, and your loss hits NaN. Too low and training crawls, getting stuck in shallow local minima, wasting hours of compute on negligible improvement. A fixed learning rate forces you to choose between these extremes. A schedule lets you have it both ways: aggressive steps early when you need rapid progress, precise steps late when you need to converge. Today you implement the warmup + cosine decay schedule used in GPT-3 and most modern transformer training runs.

1 Why Constant Learning Rate Fails

At the start of training, the model's weights are randomly initialized and the gradients are noisy: they point in roughly the right direction but with a lot of variance. Using a large constant LR here leads to overshooting: the optimizer steps too far, bounces around the loss landscape, and can diverge entirely (loss → NaN or infinity). Using a small constant LR avoids this but then converges too slowly throughout the entire run.

The insight behind LR scheduling is that the two phases of training have different needs:

Early training: weights are far from optimal, so we want large-ish steps to make rapid progress, but not so large that noisy early gradients cause instability. Solution: warmup, gradually increasing LR from near-zero to maximum over the first 1–2% of training steps.
Late training: weights are in a good region, we want small, precise steps to converge to the minimum without overshooting. Solution: decay, gradually reduce LR from maximum toward a small minimum value.

A practical way to see this: with max_lr=3e-4 and your typical 5,000-step training run, a constant LR of 3e-4 can be unstable in the first 100 steps. Drop it to 3e-5 and it's stable but barely moves. The schedule gives you 3e-4 where it's safe (after warmup) and backs off smoothly as precision matters.

2 Linear Warmup + Cosine Decay

The schedule has three regions. During warmup, LR rises linearly from 0 to max_lr. After warmup, it follows a cosine curve down to a minimum LR (typically 10% of max). After max_steps, it holds at the minimum.

Why cosine specifically? The cosine curve is flat near both ends and steepest in the middle. Right after warmup the LR stays close to max_lr, so the model keeps making fast progress while it still has a lot to improve. Through the middle of the run the LR falls quickly. Near the end it levels off toward min_lr, so the final steps make small, gentle adjustments that don't disrupt a nearly-converged model.

import math

def get_lr(step, warmup_steps=100, max_lr=3e-4, max_steps=5000, min_lr=3e-5):
    # Phase 1: Linear warmup from 0 → max_lr over warmup_steps
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)

    # Phase 3: Hold at min_lr after decay is complete
    if step >= max_steps:
        return min_lr

    # Phase 2: Cosine decay from max_lr → min_lr
    # decay_ratio goes from 0.0 (start of decay) to 1.0 (end of decay)
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    # coeff goes from 1.0 → 0.0 following a cosine curve
    coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)
    return min_lr + coeff * (max_lr - min_lr)
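To see the shape of the decay numerically, the sketch below evaluates only the cosine coefficient from Phase 2 at five evenly spaced points. cosine_coeff is a hypothetical helper that repeats the same formula as the coeff line in get_lr.

```python
import math

def cosine_coeff(decay_ratio):
    # Same formula as the coeff line in get_lr: 1.0 at the start of decay, 0.0 at the end
    return 0.5 * (math.cos(math.pi * decay_ratio) + 1.0)

ratios = [0.0, 0.25, 0.5, 0.75, 1.0]
coeffs = [cosine_coeff(r) for r in ratios]

# How much the coefficient falls over each quarter of the decay phase
drops = [coeffs[k] - coeffs[k + 1] for k in range(len(coeffs) - 1)]

for r, c in zip(ratios, coeffs):
    print(f"decay_ratio={r:.2f}  coeff={c:.4f}")
print("drop per quarter:", [round(d, 4) for d in drops])
```

The coefficient falls by about 0.15 in the first and last quarters and by about 0.35 in each of the two middle quarters: slow at the ends, fast in the middle.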

Warmup length rule of thumb: 1–2% of total training steps. For a 5,000-step run, that's 50–100 warmup steps. For a 50,000-step run, 500–1,000. Too short a warmup and you get early instability; too long and you waste steps at low LR when you could be making progress.

3 Applying the Schedule in the Training Loop

PyTorch optimizers have a fixed LR set at initialization. To change it mid-training, you modify the optimizer's param_groups directly. This must happen at the start of each training step, before the forward pass:

for step in range(max_steps):
    # 1. Update LR for this step (before the forward pass)
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # 2. Standard training step
    x, y = get_batch(train_data, batch_size, context_len, device)
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Confirm schedule is working: print LR every 100 steps
    if step % 100 == 0:
        actual_lr = optimizer.param_groups[0]['lr']
        expected_lr = get_lr(step)
        print(f"Step {step}: LR={actual_lr:.2e} (expected {expected_lr:.2e})")

Verifying that the actual LR in optimizer.param_groups[0]['lr'] matches your get_lr(step) output is a quick sanity check that the schedule is actually being applied. It's easy to accidentally update the wrong variable or miss a step; printing this every 100 iterations costs nothing and catches schedule bugs immediately.

Your action Copy get_lr() into a notebook. Generate LR values for steps 0–2000 with warmup_steps=500 and max_steps=2000 and plot them: plt.plot(steps, [get_lr(s, warmup_steps=500, max_steps=2000) for s in steps]). Add a vertical dashed line at step 500. The curve should rise linearly then arc down smoothly; if it doesn't, your formula has a bug. Fix it now before wiring it into your training loop, where a broken schedule is nearly impossible to debug.

Interactive companion tool

Plot any learning rate schedule interactively. Compare warmup + cosine decay against constant or linear schedules. See how schedule choices affect loss convergence.

lr-schedule.html ↗

Key Takeaway: Warmup + cosine decay stabilizes training and improves convergence. Never use constant LR for transformer models.

Day 19

The Optimizer

AdamW vs SGD, weight decay, betas, why AdamW for transformers

Gradient descent points us in the right direction (downhill on the loss surface), but it doesn't tell us how far to step, or whether to step equally in every direction. The optimizer makes these decisions. SGD applies the same learning rate to every parameter and uses only the current gradient. AdamW keeps a running history of gradients and their magnitude, effectively building a rough per-parameter model of the loss surface curvature. This makes it dramatically more effective for transformers, where different parameters (embeddings, attention weights, MLP weights) have very different gradient statistics. AdamW is the standard optimizer for most large language models; understanding why helps you tune it confidently.

1 How Adam Works: Adaptive Moments

Adam maintains two exponential moving averages for each parameter: m (first moment, the gradient direction) and v (second moment, the squared gradient magnitude). The update for each weight uses the ratio m / sqrt(v), which normalizes the gradient by its recent magnitude. This means parameters that receive large gradients take smaller effective steps, and parameters with small gradients take larger steps relative to their signal.
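The normalization is easiest to see on a single scalar parameter. The sketch below is a hand-written version of one Adam update, including the standard bias-correction terms that the text above does not spell out. It is an illustration, not PyTorch's implementation. adam_step and the gradient values are hypothetical.

```python
import math

def adam_step(grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: compensates for m and v starting at zero (t counts from 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v

# Two parameters whose gradients differ by a factor of 10,000
big_update, _, _ = adam_step(grad=100.0, m=0.0, v=0.0, t=1)
small_update, _, _ = adam_step(grad=0.01, m=0.0, v=0.0, t=1)

print(f"grad=100.0 -> step {big_update:.6e}")
print(f"grad=0.01  -> step {small_update:.6e}")
```

Both parameters move by almost exactly lr on the first step, even though plain SGD would move them by amounts 10,000x apart.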

Why this matters for transformers: Embedding parameters only receive gradient updates for the specific tokens present in a batch; most of the embedding table gets zero gradient every step. SGD treats these sparse updates the same as dense ones, leading to inconsistent effective learning rates. Adam's per-parameter running averages adapt naturally to this sparsity: rarely-updated embeddings accumulate gradient history slowly and receive proportionally larger updates when they do appear.

SGD rule of thumb: it can work for convolutional networks where all parameters receive dense gradients every step. For transformers, you should always use Adam or a derivative; training time and final quality both suffer significantly with SGD.

2 AdamW: Fixing Adam's Weight Decay Bug

Adam is commonly combined with weight decay implemented as L2 regularization, where the decay term is added to the gradient before the adaptive update. This causes a subtle interaction: the decay is scaled by the per-parameter adaptive LR, making it inconsistent, so parameters with large gradient variance get less regularization than intended. AdamW fixes this by applying weight decay directly to the weights, independently of the gradient update:

# Adam (with L2 regularization, the "wrong" way):
# gradient = gradient + weight_decay * weight   ← mixes with adaptive LR

# AdamW (decoupled weight decay, the "right" way):
# weight = weight - lr * gradient_update        ← adaptive gradient step
# weight = weight * (1 - lr * weight_decay)     ← separate shrinkage term

from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),
    lr=3e-4,           # Initial LR; will be overridden by get_lr() schedule
    betas=(0.9, 0.95), # beta1=0.9: gradient momentum; beta2=0.95: squared gradient momentum
    weight_decay=0.1   # Penalizes large weights to prevent overfitting
)

What beta1=0.9 means: the first moment (gradient direction) is 90% old estimate + 10% new gradient. This smooths out noisy batch-to-batch gradient variation. What beta2=0.95 means: the second moment (gradient magnitude) decays slower (95% old + 5% new) because variance estimates are noisier and need more history to stabilize. GPT-3 used beta2=0.95; PyTorch's default of 0.999 is often too slow to adapt for transformer training.
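One way to put a number on "too slow to adapt" is to ask how many steps a moving average needs to register a sudden change. The sketch below feeds a constant signal of 1.0 into an average that starts at 0.0 and counts steps until it reaches halfway. steps_to_half is a hypothetical helper, and bias correction is ignored for simplicity.

```python
def steps_to_half(beta):
    # Count updates until an EMA starting at 0.0 reaches 0.5 on a constant input of 1.0
    avg, steps = 0.0, 0
    while avg < 0.5:
        avg = beta * avg + (1.0 - beta) * 1.0
        steps += 1
    return steps

for beta in (0.9, 0.95, 0.999):
    print(f"beta={beta}: {steps_to_half(beta)} steps to reach half of the new level")
```

With beta=0.95 the average catches up in 14 steps; with beta=0.999 it takes 693, a large share of a short training run.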

Weight decay=0.1: at each step, every weight shrinks by lr × weight_decay = 3e-4 × 0.1 = 3e-5 of its current value. This continuously pushes weights toward zero, preventing any single weight from growing too large. The gradient signal must overcome this shrinkage to maintain a weight's magnitude, which means only useful, well-supported weights stay large.
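The per-step shrinkage looks tiny, so it is worth compounding it over a run. This sketch assumes a constant learning rate and no gradient signal at all, a deliberate simplification to isolate the decay term. The 5,000-step horizon is borrowed from the schedule examples.

```python
lr, weight_decay = 3e-4, 0.1

# Fraction of its value a weight keeps after one decay step
per_step_factor = 1.0 - lr * weight_decay

# Compounded over a whole run (assumes constant lr and zero gradient)
steps = 5000
cumulative_factor = per_step_factor ** steps

print(f"per step: {per_step_factor:.5f}")
print(f"after {steps} steps: {cumulative_factor:.4f}")
```

Under these assumptions a weight that receives no supporting gradient ends the run at roughly 86% of its starting value. With a decaying LR schedule the shrinkage would be smaller.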

3 Selective Weight Decay with Parameter Groups

Not all parameters should have weight decay. Bias terms and LayerNorm parameters are typically excluded: they're 1D vectors that don't benefit from magnitude regularization and can be actively harmed by it (shrinking a bias term that's learned to be +5.0 is counterproductive). A common practice in GPT-style training code is to separate parameters into two groups:

def configure_optimizer(model, lr, weight_decay):
    # Separate parameters: apply weight decay only to 2D+ tensors (weights, not biases/norms)
    decay_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and p.dim() >= 2]
    no_decay_params = [p for n, p in model.named_parameters()
                       if p.requires_grad and p.dim() < 2]

    param_groups = [
        {'params': decay_params,    'weight_decay': weight_decay},
        {'params': no_decay_params, 'weight_decay': 0.0},
    ]
    return AdamW(param_groups, lr=lr, betas=(0.9, 0.95))

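optimizer = configure_optimizer(model, lr=3e-4, weight_decay=0.1)
# decay_params / no_decay_params are local to configure_optimizer,
# so read the two groups back from the optimizer itself
decay_group, no_decay_group = optimizer.param_groups
print(f"Decay params: {sum(p.numel() for p in decay_group['params']):,}")
print(f"No-decay params: {sum(p.numel() for p in no_decay_group['params']):,}")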

For a 10M parameter model, you'll typically find that 99%+ of parameters are in the decay group (all those weight matrices) and only a small fraction (biases, LayerNorm scales and shifts) are in the no-decay group. This split has a measurable effect on training stability and final model quality, particularly for smaller models where regularization pressure is more significant relative to the overall parameter count.

Your action Initialize your model and an AdamW optimizer. Print sum(p.numel() for p in model.parameters()) to confirm the total parameter count the optimizer will manage. Then run exactly 10 training steps and print the loss after each one; loss should trend downward, though individual batches are noisy. If it's flat or rising, check that you're calling optimizer.zero_grad() before loss.backward(), not between loss.backward() and optimizer.step().

Key Takeaway: AdamW with weight decay is the go-to optimizer for transformers. It balances fast convergence and overfitting prevention.

Day 20

The Full Training Loop

Eval mode, gradient clipping, checkpointing, sample generation

Every piece from this week (data loading, device setup, the LR schedule, the optimizer) comes together in the training loop. The loop itself is short: it fits on one screen. But the order of operations is strict, and violating it produces bugs that are often silent. Loss may still drop, but slower than it should, or toward the wrong answer. Today you build the full loop correctly, understand exactly why each line is where it is, and add the validation and checkpointing logic that transforms a training script into something you can actually trust.

1 The Training Step: Order of Operations

Each training step must follow this exact sequence. Every deviation is either an error or a deliberate optimization that you need to understand first:

Update the learning rate for this step (before the forward pass)
Sample a batch of inputs and targets
Zero gradients so previous step's gradients don't accumulate
Forward pass: feed the input batch through the model to get logits
Compute loss: cross-entropy between logits and targets
Backward pass: compute gradients via backpropagation
Clip gradients: cap the total gradient norm to prevent explosions
Optimizer step: update weights using the clipped gradients

for step in range(total_steps):
    # 1. Set LR for this step
    lr = get_lr(step)
    for pg in optimizer.param_groups:
        pg['lr'] = lr

    # 2. Get a batch
    x, y = get_batch(train_data, batch_size, context_len, device)

    # 3. Zero gradients BEFORE the forward pass
    optimizer.zero_grad()

    # 4. Forward pass
    logits = model(x)

    # 5. Compute loss
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

    # 6. Backward pass: compute gradients
    loss.backward()

    # 7. Clip gradients to prevent explosions
    # clip_grad_norm_ scales ALL gradients down if their collective L2 norm > 1.0
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # 8. Update weights
    optimizer.step()

Why does zero_grad() placement matter? PyTorch accumulates gradients: calling backward() adds new gradients on top of existing ones. If you forget to zero before each step, gradients from all previous batches pile up; after 10 batches the accumulated gradient is roughly 10x the size of a single one. The resulting instability can look like a bad learning rate, which is a subtle misdirection.

Why gradient clipping? During certain unlucky batches, the loss can be very large, producing enormous gradients that would cause an enormous weight update. Gradient clipping rescales the entire gradient vector if its L2 norm exceeds a threshold (1.0 is a common choice). This doesn't change the direction of the update, only its maximum magnitude, keeping training numerically stable without meaningfully slowing learning.
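The rescaling can be written out by hand for a two-element gradient. This is a pure-Python sketch of global-norm clipping meant to mirror the idea behind clip_grad_norm_, not its actual source. clip_by_global_norm and the gradient values are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # One norm over ALL gradient values, then one shared scale factor
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Norm 5.0: too large, gets scaled down to (almost exactly) norm 1.0
clipped, norm_before = clip_by_global_norm([3.0, 4.0])

# Norm 0.5: already under the threshold, left untouched
untouched, _ = clip_by_global_norm([0.3, 0.4])

print(norm_before, clipped, untouched)
```

The clipped gradient is approximately [0.6, 0.8]: the 3:4 ratio between the components, and therefore the direction, is preserved.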

2 Validation, Checkpointing & Sample Generation

Validation runs after each epoch (or every N steps on longer runs). The critical difference from training: you use model.eval() and torch.no_grad(). These serve different purposes: eval() switches dropout to pass-through mode (no random dropping during validation, since we want deterministic outputs). torch.no_grad() tells PyTorch not to build a computation graph for these forward passes, saving memory and compute since we won't be backpropagating.
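That eval() and no_grad() are independent switches can be checked on a tiny stand-in model. The layer below is a hypothetical two-module stack, not the GPT model from this course.

```python
import torch
import torch.nn as nn

# Tiny stand-in model with a dropout layer (illustrative only)
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

layer.eval()                                   # dropout becomes a pass-through
out_a = layer(x)
out_b = layer(x)
deterministic_in_eval = torch.equal(out_a, out_b)
graph_built_in_eval = out_a.requires_grad      # eval() alone does NOT stop autograd

with torch.no_grad():                          # now no computation graph is recorded
    out_c = layer(x)
graph_built_under_no_grad = out_c.requires_grad

layer.train()                                  # restore training mode afterwards
print(deterministic_in_eval, graph_built_in_eval, graph_built_under_no_grad)
```

eval() makes the outputs repeatable but still tracks gradients. Only no_grad() removes the graph, which is why a validation loop uses both.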

best_val_loss = float('inf')

for epoch in range(num_epochs):
    # --- Training phase ---
    model.train()
    train_losses = []
    for step in range(steps_per_epoch):
        # ... training step from above ...
        train_losses.append(loss.item())

    # --- Validation phase ---
    model.eval()
    val_losses = []
    with torch.no_grad():          # No gradient tracking needed
        for _ in range(val_steps):
            x, y = get_batch(val_data, batch_size, context_len, device)
            logits = model(x)
            val_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            val_losses.append(val_loss.item())

    avg_train = sum(train_losses) / len(train_losses)
    avg_val   = sum(val_losses)   / len(val_losses)
    print(f"Epoch {epoch+1}: train={avg_train:.4f}, val={avg_val:.4f}, gap={avg_val-avg_train:.4f}")

    # Save best checkpoint (not just every N epochs)
    if avg_val < best_val_loss:
        best_val_loss = avg_val
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'train_loss': avg_train,
            'val_loss': avg_val,
            'stoi': stoi,   # Save vocab mappings so you can decode later
            'itos': itos,
        }, 'best_checkpoint.pt')
        print(f"  → New best val loss: {best_val_loss:.4f}, checkpoint saved")

Note that we save the vocabulary mappings (stoi and itos) in the checkpoint. If you reload a checkpoint later without them, you won't be able to decode the model's generated tokens back to text. A checkpoint that's missing the tokenizer is almost useless for generation.

3 Diagnosing a Training Loop That Isn't Converging

Sometimes you run the training loop and something feels wrong. Here are the most common failure modes and how to identify them:

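Loss stuck at the random baseline (~4.2): Almost always a data pipeline bug. Check that y is the same window as x shifted by one position rather than drawn from unrelated positions (if y is identical to x, the loss instead collapses toward zero suspiciously fast). Also verify the model output is (batch, seq_len, vocab_size): a wrong shape usually raises an error, but if the element counts happen to line up after .view(), cross_entropy silently pairs logits with the wrong targets.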
Loss is NaN after a few steps: Learning rate too high, or missing gradient clipping. Check that clip_grad_norm_ is called. Try halving max_lr.
Loss drops smoothly but very slowly: LR too low, or warmup too long. Check your get_lr() output: at step 0 it should be zero, and by step warmup_steps it should be at max_lr.
Train loss drops but val loss stays flat: The model may be too large for your dataset and overfitting from the very first epoch. Try a smaller model config (fewer layers or smaller embedding dimension) or increase weight decay.
Loss oscillates wildly: Batch size too small or LR too high. Increasing batch size often stabilizes this without changing the effective learning rate.

The most valuable debugging tool is printing loss every step for the first 20 steps. Loss should drop monotonically or near-monotonically. If it doesn't, the issue is more likely in the training step order than in the model architecture.

Your action Assemble the complete training script using every piece from this week: data loader, device setup, LR schedule, and AdamW. Run 3 epochs on TinyShakespeare with a 2L/2H/128D model. Print f"Epoch {e}: train={train_loss:.4f}, val={val_loss:.4f}" at the end of each epoch. Confirm that train loss drops from ~4.2 toward ~2.5 by epoch 3; if it doesn't budge, there's a bug in your data pipeline or gradient update order.

Key Takeaway: A full training loop includes validation, gradient clipping, checkpointing, and sample generation. The train/val loss gap is the best overfitting indicator.

Day 21

Watching the Model Learn

Overfitting signs, train vs val loss, when to stop training

Training is only half the work. Understanding what the numbers and samples mean, and knowing when to stop, is what separates a model that was trained from a model that actually learned. A loss drop from 4.2 to 1.0 is a transformation from "random character guesser" to something that has internalized spelling, common words, and grammatical patterns. You can see this progression in the generated text long before the loss numbers tell the whole story. Day 21 gives you the diagnostic vocabulary to read what your training curves are telling you and make confident decisions about your model.

1 Reading the Loss Curve

A healthy loss curve has a characteristic shape: a steep initial drop, followed by a long gradual improvement, followed by a plateau. Understanding what each phase looks like helps you diagnose problems early rather than waiting for a 3-hour run to finish.

Rapid initial drop (epochs 1–2): The model is learning the most obvious patterns: character frequency distribution, common bigrams. Loss typically falls from ~4.2 to ~2.5 quickly. If this doesn't happen, your training loop has a bug.
Slow improvement (epochs 3–10): Learning higher-order patterns: words, syntax, style. Loss moves from ~2.5 toward ~1.2. This is the long, productive middle of training.
Plateau or divergence: Val loss stops improving or starts rising while train loss continues to drop. This is overfitting: the model is memorizing training examples rather than generalizing.

The train/val loss gap is your primary overfitting diagnostic:

Gap < 0.2: Little or no overfitting; if both losses are still high, the model is underfitting (too small, or not trained long enough)
Gap 0.2–0.5: Healthy generalization: the model has learned patterns, not memorized data
Gap > 0.5: Overfitting: the model is too large for your dataset, or you've trained too long
Val loss rising while train drops: Stop training now and use the checkpoint with the lowest val loss

One subtlety: train loss is usually expected to be somewhat lower than val loss, because the model has seen the training data before (it's literally what it trained on). A small gap isn't a problem. Only a large or widening gap is concerning. Early in training the order can even flip, because dropout is active only while computing train loss.

2 Model Size vs Data: Finding the Right Match

Model capacity (number of parameters) must be matched to the amount of training data. Too large a model on too little data will memorize the training set: it has enough capacity to store the data rather than learning to generalize. Too small a model won't have enough capacity to represent the patterns in your data. On 1MB of text (~1M characters of TinyShakespeare), here's how different configurations perform:

# Config notation: (n_layers, n_heads, embed_dim) → approx parameter count

# 2L/2H/128D → ~0.5M params
# Trains fast (minutes on CPU, seconds on GPU)
# Sample: "the king of his the and to be the my"
# (valid words, no meaningful grammar or coherence)

# 4L/4H/256D → ~3M params
# Trains in ~30min on MPS/CUDA
# Sample: "HAMLET: I will not speak with her, my lord."
# (coherent sentences, character voices emerging)

# 6L/6H/384D → ~10M params
# Trains in ~2-3h on MPS, ~30min on good CUDA
# Sample: "KING HENRY: The queen hath given me many a tear"
# (multi-sentence coherence, consistent dramatic tone)

# Rule of thumb: for 1MB text, 10M params is about the right ceiling.
# Above this, you'll see val loss rise early no matter how you tune.

Start with the smallest config (2L/2H/128D) to verify your pipeline end-to-end in minutes. Then scale up once you're confident training is working correctly. There's no benefit to debugging a 10M parameter model when the same bug would show up in a 0.5M parameter model 20x faster.
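To sanity-check the parameter counts for a configuration before training, a rough estimate is enough. The sketch below assumes a standard GPT block (four attention projections plus an MLP with 4x expansion), a 65-token vocabulary, block_size=256, and a tied output head; it ignores biases and LayerNorm weights. All of these are assumptions, so expect the same ballpark as the figures quoted above rather than identical numbers.

```python
def estimate_params(n_layers, embed_dim, vocab_size=65, block_size=256):
    """Rough GPT parameter count (hypothetical helper, not from the course code)."""
    attention = 4 * embed_dim * embed_dim   # Q, K, V and output projections
    mlp = 8 * embed_dim * embed_dim         # two linear layers with 4x expansion
    embeddings = (vocab_size + block_size) * embed_dim  # token + position tables
    return n_layers * (attention + mlp) + embeddings

configs = [("2L/2H/128D", 2, 128), ("4L/4H/256D", 4, 256), ("6L/6H/384D", 6, 384)]
for name, layers, dim in configs:
    print(f"{name}: ~{estimate_params(layers, dim) / 1e6:.1f}M parameters")
```

Note that the number of heads does not appear in the formula: splitting embed_dim across more heads changes how attention is computed, not how many weights it has.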

3 Early Stopping & Loading the Best Checkpoint

The model checkpoint with the lowest validation loss is your best model, not necessarily the one from the last epoch. If the model has overfit, your final checkpoint is actually worse than your best one. This is why we save best_checkpoint.pt whenever val loss improves, as shown in Day 20. The code to find the best epoch and load it:

import torch

# If you've been printing per-epoch losses, find the best manually:
losses = {
    1: (3.21, 3.18),   # (train, val)
    2: (2.45, 2.49),
    3: (1.82, 1.90),
    4: (1.51, 1.63),
    5: (1.24, 1.44),   # ← best val loss
    6: (1.05, 1.52),   # gap increasing: overfitting starting
    7: (0.89, 1.71),   # clearly overfitting
}
best_epoch = min(losses, key=lambda e: losses[e][1])  # epoch 5

# Load the best checkpoint
checkpoint = torch.load('best_checkpoint.pt', weights_only=False)
model.load_state_dict(checkpoint['model_state_dict'])
stoi = checkpoint['stoi']
itos = checkpoint['itos']
print(f"Loaded best checkpoint from epoch {checkpoint['epoch']+1}")
print(f"Best val loss: {checkpoint['val_loss']:.4f}")
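The manual lookup above can also be automated with a patience counter, so training stops on its own once val loss has stopped improving. This is a minimal sketch: the function name and the patience value are illustrative choices, not part of the Day 20 script.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch val losses.

    Epochs are 1-indexed. Training stops once val loss has failed to improve
    for `patience` consecutive epochs; stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            # New best: this is where best_checkpoint.pt would be saved
            best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Val losses taken from the example table above
val_losses = [3.18, 2.49, 1.90, 1.63, 1.44, 1.52, 1.71]
print(early_stopping_epoch(val_losses, patience=2))  # (5, 7)
```

A patience of 1 would stop one epoch earlier but is more easily fooled by a single noisy evaluation; larger values cost extra epochs in exchange for robustness.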

Once you've loaded the best checkpoint, generate a few samples (Week 4 covers generation in depth). Compare samples from epoch 5 vs epoch 7 in the example above: the epoch 7 model has lower train loss but higher val loss, and its outputs will be more repetitive and less varied because it has memorized specific training sequences rather than learning the underlying language patterns. This comparison makes overfitting concrete and visible, not just a number.

How to resume training from a checkpoint: Save optimizer_state_dict in your checkpoint (as shown in Day 20) and restore it with optimizer.load_state_dict(checkpoint['optimizer_state_dict']). This restores Adam's moment estimates, so the optimizer resumes from where it left off. Without this, Adam's running averages reset to zero and have to be rebuilt over the next several steps, sometimes causing a temporary loss spike right after resuming. If you use a learning-rate schedule, restore its step count as well; otherwise the schedule restarts from warmup.

▶

Interactive companion tool

Convert between cross-entropy loss and perplexity. See where your model sits on the scale from random to very good, with benchmark references for common checkpoints.

perplexity.html ↗

Your action: Take the train and val loss values from your Day 20 run and plot both curves on the same graph with matplotlib. Identify the epoch where val loss is at its lowest; that's your best checkpoint. Load it with torch.load('checkpoint_epochN.pt') and generate a 200-character sample. Compare it side-by-side with a sample from your final (potentially overfitted) checkpoint. The difference makes overfitting tangible, not just a number on a graph.

Key Takeaway: Monitor train/val loss gap to detect overfitting. Stop training when val loss plateaus. Larger models produce better samples but need more data.

4WK

Text Generation

Make your model write: sampling strategies, temperature, and top-k

Week 4 Focus: Making the model write

Day 22

The Generate Function

Autoregressive generation: predict one token at a time, append and repeat

Text generation with a GPT is autoregressive: generate one token at a time, append it to the input, and repeat. This is the core loop that makes your model "write". Day 22 establishes the basic generate function structure that you'll build upon throughout the week. By the end of this day you'll understand why generation is fundamentally different from training, what the context window limit means during inference, and how the generate loop maps to real code.

Why autoregressive generation?

Language models predict the next token given a sequence of previous tokens. To generate text, you start with a prompt, predict the next token, append it to the sequence, and repeat. This is called autoregressive generation because each prediction depends on all previous predictions. The term "autoregressive" comes from time series analysis where each value is regressed on previous values of the same variable.

import torch

def generate(model, idx, max_new_tokens):
    """Generate text autoregressively.

    Args:
        model: The GPT model
        idx: Tensor of shape (batch, seq_len) with starting tokens
        max_new_tokens: Number of tokens to generate
    Returns:
        Tensor with original prompt + generated tokens
    """
    model.eval()  # Set to evaluation mode
    for _ in range(max_new_tokens):
        # Get predictions for next token
        # Only use the last block_size tokens for context
        idx_cond = idx[:, -model.config.block_size:]

        # Forward pass: (batch, seq_len, vocab_size)
        logits, _ = model(idx_cond)

        # Focus on the last token's predictions
        # logits shape: (batch, seq_len, vocab_size)
        # We want: (batch, vocab_size) for the last position
        logits = logits[:, -1, :]

        # Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # Sample the next token
        next_token = torch.multinomial(probs, num_samples=1)

        # Append to the sequence
        idx = torch.cat([idx, next_token], dim=1)

    return idx

Experiment

Try running this basic generate function with a trained model. Notice how it uses model.eval() to disable dropout, and how it only passes the last block_size tokens to the model (because GPT can only handle sequences up to block_size tokens). What happens if you forget to truncate the sequence?

The basic loop structure

The generate loop has three key parts: (1) prepare the input sequence, (2) run the model to get logits for the next token, (3) sample or select the next token and append it. This loop repeats until you've generated the desired number of tokens. The "autoregressive" part means the input to each iteration includes all previously generated tokens.

# How to use the generate function
import torch
from model import GPT, GPTConfig  # GPTConfig must be importable to unpickle checkpoint["config"]
# No tokenizer import needed: stoi/itos are stored in the checkpoint

# Load your trained model
checkpoint = torch.load("checkpoint_final.pt", weights_only=False)
model = GPT(checkpoint["config"])
model.load_state_dict(checkpoint["model_state_dict"])

# Get character mappings
stoi = checkpoint["stoi"]
itos = checkpoint["itos"]

# Encode the prompt
prompt = "To be or not"
tokens = [stoi[c] for c in prompt if c in stoi]
idx = torch.tensor([tokens], dtype=torch.long)

# Generate!
output_idx = generate(model, idx, max_new_tokens=200)

# Decode back to text
output_text = "".join([itos[i] for i in output_idx[0].tolist()])
print(output_text)

Managing the context window

GPT models have a fixed context window: the maximum sequence length they were trained on (block_size). During generation the sequence grows longer with every token you append. Once it exceeds block_size, you must truncate it; otherwise the model receives position indices it never saw during training. With a learned position-embedding table of size block_size this raises an index error, and in architectures without that hard limit the output typically degrades.

The fix is one line: always pass only the last block_size tokens into the model. The generated text still accumulates in full in idx (for decoding), but the model only "looks back" as far as its training allowed.

# Without truncation: works only while sequence length <= block_size
logits, _ = model(idx)  # idx grows indefinitely; will eventually crash or degrade

# With truncation: always safe
idx_cond = idx[:, -model.config.block_size:]  # keep last block_size tokens
logits, _ = model(idx_cond)

# The full idx still stores the entire generated sequence for decoding.
# Only the model input is truncated, not the stored output.
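The slicing behaviour is easy to check without a model. The sketch below uses a plain Python list in place of the idx tensor and a stand-in for the sampled token; block_size=8 is an arbitrary small value chosen for illustration.

```python
block_size = 8            # tiny context window for illustration
idx = list(range(5))      # a 5-token prompt (token IDs 0..4)

for _ in range(6):
    # Same idea as idx[:, -block_size:] on a tensor: when the sequence is
    # shorter than block_size, the slice simply returns the whole sequence.
    idx_cond = idx[-block_size:]
    assert len(idx_cond) <= block_size
    next_token = len(idx)  # stand-in for the token the model would sample
    idx.append(next_token)

print(len(idx))            # 11: the stored sequence keeps growing
print(idx[-block_size:])   # [3, 4, 5, 6, 7, 8, 9, 10]: what the model sees next
```

Because a negative slice never fails on a short sequence, the truncation line needs no special case for prompts shorter than block_size.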

Experiment

Remove the truncation line and generate 1000 tokens from a model with block_size=256. Watch what happens after 256 tokens: the output will degrade or crash with an index-out-of-bounds error in the position embedding. Then add the truncation back and generate the same prompt. The difference is dramatic.

eval mode vs train mode

Calling model.eval() before generating is not optional. It changes the model's behaviour in two important ways:

Dropout is disabled. During training, dropout randomly zeroes some activations to regularise the model. During generation you want deterministic, reproducible forward passes. With dropout on, you get different outputs from the same prompt even with greedy (argmax) decoding.
Batch normalisation uses running statistics. If your model has BatchNorm layers (GPT typically doesn't, but worth knowing), eval mode freezes the running mean and variance instead of computing them from the current batch.

# Correct inference setup
model.eval()  # Disable dropout, use running BN stats

# Even better: pair with no_grad to save memory
model.eval()
with torch.no_grad():
    output = generate(model, idx, max_new_tokens=200)

# If you need to continue training after generation:
model.train()  # Switch back to train mode

⚠

Common mistake: Forgetting model.eval() means dropout is active during generation. Even greedy (argmax) decoding will then produce different output for the same prompt on every call. If you're debugging generation quality and results are mysteriously inconsistent, check whether you set eval mode.

~30%

Typical dropout rate during training; silently active if you forget model.eval()

O(T²)

Attention cost per forward pass; grows quadratically with sequence length, which is why block_size matters

Your action: Write the generate function from scratch without copying. Given a model, a starting token tensor, and max_new_tokens, implement the loop: forward pass → slice last logit → softmax → multinomial sample → cat. Call it with your trained checkpoint and a short prompt. Print the raw output to the terminal. Verify the output grows by exactly max_new_tokens tokens.

Key Takeaway: Autoregressive generation is the foundation of how GPT models write text. Each token prediction depends on all previous tokens, creating a chain of predictions that builds up the output one token at a time.

Day 23

Greedy Decoding

Always pick the most probable token - why it's deterministic and repetitive

The simplest decoding strategy is greedy decoding: always pick the most probable next token (argmax). While this seems logical, it leads to deterministic and repetitive output. Day 23 explores why greedy decoding fails, dissects the feedback loop that creates repetition, and shows why you need to introduce randomness to get interesting text. Understanding greedy's failure mode is the motivation for everything else in this week.

Why greedy decoding is the naive approach

Greedy decoding always selects the token with the highest probability. It's simple and deterministic: the same prompt always produces the same output. But this is exactly why it's problematic. Language is inherently probabilistic, and the most probable continuation isn't always the most interesting or coherent one. Greedy decoding tends to get stuck in loops or produce bland, repetitive text.

def generate_greedy(model, idx, max_new_tokens):
    """Greedy decoding: always pick the most probable token.

    This is deterministic - same prompt = same output every time.
    """
    model.eval()
    with torch.no_grad():  # No need to track gradients for inference
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -model.config.block_size:]

            # Get logits for next token
            logits, _ = model(idx_cond)
            logits = logits[:, -1, :]

            # GREEDY: always pick the most probable token
            # argmax returns the index of the highest value
            next_token = logits.argmax(dim=-1, keepdim=True)

            # Append and continue
            idx = torch.cat([idx, next_token], dim=1)

    return idx

Experiment

Run greedy decoding with your trained model. Try different prompts. Notice how the output is always the same for the same prompt. Also notice how it tends to repeat phrases or get stuck in loops. This is why we need sampling strategies that introduce randomness.

The problem with greedy decoding

Greedy decoding has two main problems: (1) It's deterministic, which means no creativity or variation. (2) It tends to be repetitive because the highest-probability continuation often reinforces itself. For example, if the model predicts "the" as the most likely next token, and "the" leads to another "the", you get "the the the...". Sampling strategies like temperature and top-k break this cycle by introducing controlled randomness.

# Compare greedy vs sampled output
# Greedy (argmax, the T -> 0 limit):
# "To be or not to be or not to be or not to be..."

# With temperature sampling (temperature=0.8):
# "To be or not to be, that is the question. Whether 'tis..."

# The greedy output is stuck in a loop!
# This happens because the model learns that certain
# patterns (like "to be or not") are highly probable,
# and greedy always picks the highest probability.

The feedback loop that causes repetition

Repetition isn't accidental; it's the direct consequence of always choosing the most probable token. Here's why the loop forms:

The model sees "to be or not" and assigns highest probability to " to" (it has seen this phrase many times in training).
Greedy picks " to". Now the context is "to be or not to".
In training, "or not to" was followed by " be" many times, so " be" gets highest probability.
The cycle continues: "to be or not to be or not to be..."

The model isn't "broken"; it's doing exactly what it was trained to do: predict the most likely continuation. The problem is that the most likely continuation locally is not always the most interesting continuation globally. Greedy optimises for each token in isolation, not the quality of the full sequence.

# Greedy feedback loop visualised
# Each step the model sees its own previous output

# Step 1: "To be or not" → P(next=" to") = 0.41  ← greedy picks this
# Step 2: "be or not to" → P(next=" be") = 0.55  ← greedy picks this
# Step 3: "or not to be" → P(next=" or") = 0.38  ← greedy picks this
# Step 4: "not to be or" → P(next=" not") = 0.47 ← loop locked in

# With sampling at T=0.8, other tokens get a chance:
# Step 1: "To be or not" → samples " to" most often, but sometimes ",", " here", ...
# Step 2: might go an entirely different direction
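The loop can be reproduced without a neural network. The sketch below uses a hypothetical next-word table (the words and probabilities are made up to mirror the steps above) and compares argmax selection with weighted sampling.

```python
import random

# Hypothetical next-word table; probabilities are invented for illustration.
toy_model = {
    "to":  {"be": 0.55, "see": 0.45},
    "be":  {"or": 0.38, ",": 0.35, "gone": 0.27},
    "or":  {"not": 0.47, "else": 0.30, "so": 0.23},
    "not": {"to": 0.41, "here": 0.33, ",": 0.26},
}
fallback = {"to": 0.5, "not": 0.5}  # used for words missing from the table

def generate_words(start, steps, greedy, rng):
    words = [start]
    for _ in range(steps):
        dist = toy_model.get(words[-1], fallback)
        if greedy:
            nxt = max(dist, key=dist.get)  # argmax: always the top entry
        else:
            nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(nxt)
    return words

rng = random.Random(0)
print(" ".join(generate_words("not", 8, greedy=True, rng=rng)))
# not to be or not to be or not
print(" ".join(generate_words("not", 8, greedy=False, rng=rng)))
# varies with the seed; the cycle can break whenever a non-top word is drawn
```

Even though the top choice at each step has well under 60% probability, argmax follows it 100% of the time, which is what turns a mild preference into a closed cycle.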

Beam search: the "smarter" alternative to greedy

Beam search is the middle ground between greedy (one candidate) and an exhaustive search over every possible sequence. Instead of keeping only the single best token at each step, beam search keeps the top k partial sequences (called "beams") and expands all of them. At the end, it picks the full sequence with the highest total log-probability.

Greedy (beam=1)

Fast, deterministic, but myopic. Can lock into loops. Used only when speed is critical and output quality is less important.

Beam search (beam=4–10)

Slower (k× more forward passes), but considers more options. Popular in translation and summarisation. Still deterministic: the same beam width always produces the same output.
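A toy example shows how keeping more than one beam can find a higher-probability sequence than greedy. This is a sketch under stated assumptions: toy_next_log_probs is a hypothetical two-token "model" whose distribution depends only on the last token, and the search omits refinements such as length normalisation.

```python
import math

def toy_next_log_probs(seq):
    """Hypothetical model: next-token distribution depends only on the last token."""
    table = {
        None: {"a": 0.60, "b": 0.40},   # first step
        "a":  {"a": 0.55, "b": 0.45},
        "b":  {"a": 0.90, "b": 0.10},
    }
    last = seq[-1] if seq else None
    return {tok: math.log(p) for tok, p in table[last].items()}

def beam_search(next_log_probs, beam_width, max_new_tokens):
    beams = [(0.0, [])]  # (total log-probability, token sequence)
    for _ in range(max_new_tokens):
        candidates = []
        for score, seq in beams:
            for tok, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [tok]))
        # Keep only the beam_width best partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

for width in (1, 2):
    score, seq = beam_search(toy_next_log_probs, width, max_new_tokens=2)
    print(width, seq, round(math.exp(score), 2))
# 1 ['a', 'a'] 0.33
# 2 ['b', 'a'] 0.36
```

With beam_width=1 the search commits to "a" at the first step and can never recover; with beam_width=2 it keeps "b" alive long enough to discover that "b a" is the more probable sequence overall.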

For open-ended text generation (stories, Shakespeare, creative writing), neither greedy nor beam search is ideal: they both optimise for highest probability, which produces safe, bland text. Sampling with temperature and top-k, covered in the next two days, is the standard approach for creative generation because it introduces controlled randomness.

⚠

Beam search for LLMs: Modern large language models like GPT-4 generate chat responses by sampling rather than beam search. Beam search produces "average" high-probability text, which feels robotic. Sampling produces varied, human-like text. Beam search remains common in constrained tasks such as translation and summarisation, where the range of acceptable outputs is narrow.

Your action: Add a generate_greedy variant to your generate file that uses argmax instead of multinomial. Run it with the prompt "To be or not" and generate 200 characters. Run it a second time with the exact same prompt; you should get identical output. Now compare it side-by-side with your sampling version. Write down one observation about where greedy gets stuck.

Key Takeaway: Greedy decoding is deterministic and tends to be repetitive. The highest-probability token isn't always the best choice for generating interesting, coherent text. We need sampling strategies that balance probability with diversity.

Day 24

Temperature Sampling

Softmax with temperature scaling: T->0 approaches greedy, T>1 flattens distribution

Temperature controls the randomness of sampling by scaling the logits before applying softmax. Higher temperature = more random, lower temperature = more deterministic. Day 24 explains the math behind temperature scaling, shows you exactly what it does to a probability distribution, and gives you a practical guide for choosing the right temperature for different generation goals.

Why temperature helps

Temperature scaling divides the logits by a temperature value before applying softmax. This changes the probability distribution: low temperature makes high-probability tokens even more likely (sharper distribution), high temperature flattens the distribution and gives rare tokens more chance. The "sweet spot" for coherent but varied text is typically T=0.7-0.9.

def apply_temperature(logits, temperature=1.0):
    """Scale logits by temperature before softmax.

    Temperature effects:
    - T = 1.0: Normal probabilities (no change)
    - T -> 0: Approaches greedy (argmax)
    - T > 1.0: Flattens distribution, more random
    - T = 0.7-0.9: Sweet spot for coherent text
    """
    if temperature <= 0:
        raise ValueError("Temperature must be positive")

    # Divide logits by temperature
    # Higher T = flatter distribution
    # Lower T = sharper distribution (more peaked)
    scaled_logits = logits / temperature

    # Apply softmax to get probabilities
    probs = torch.softmax(scaled_logits, dim=-1)

    return probs

Experiment

Try generating text with different temperatures: T=0.1 (almost greedy, very repetitive), T=0.8 (balanced), T=1.5 (very creative but potentially incoherent). Notice how temperature affects both the diversity and coherence of the output. The math: softmax computes exp(logit_i) / sum(exp(logit_j)). Dividing by T changes how much each logit contributes.

The math behind temperature

Softmax computes: P(token_i) = exp(logit_i) / sum(exp(logit_j)) for all j. When you divide logits by temperature T, you get: P(token_i) = exp(logit_i/T) / sum(exp(logit_j/T)). As T->0, the largest logit dominates (approaches argmax). As T->infinity, all probabilities approach 1/vocab_size (uniform distribution). This gives you a knob to control how "confident" the model should be.

# Visualizing temperature effects on a simple distribution
# Suppose logits for 3 tokens are: [2.0, 1.0, 0.5]

# T = 1.0 (normal):
# exp(2.0) = 7.39, exp(1.0) = 2.72, exp(0.5) = 1.65
# sum = 11.76
# P = [0.63, 0.23, 0.14]  <- Original distribution

# T = 0.5 (sharper):
# exp(2.0/0.5) = exp(4.0) = 54.6
# exp(1.0/0.5) = exp(2.0) = 7.39
# exp(0.5/0.5) = exp(1.0) = 2.72
# P = [0.84, 0.11, 0.04]  <- More peaked!

# T = 2.0 (flatter):
# exp(2.0/2.0) = exp(1.0) = 2.72
# exp(1.0/2.0) = exp(0.5) = 1.65
# exp(0.5/2.0) = exp(0.25) = 1.28
# P = [0.48, 0.29, 0.23]  <- More uniform!
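The worked numbers above can be reproduced in a few lines of plain Python. This sketch subtracts the largest scaled logit before exponentiating, a standard numerical-stability trick that does not change the result; the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("Temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 2) for p in probs])
# 0.5 [0.84, 0.11, 0.04]
# 1.0 [0.63, 0.23, 0.14]
# 2.0 [0.48, 0.29, 0.23]
```

Try a very large temperature such as T=1000 to see the probabilities approach 1/3 each, the uniform limit described above.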

Practical guide: what each temperature range feels like

The math of temperature is clean, but what matters in practice is how each range feels when you read the output. Here is a field guide for a character-level Shakespeare model:

T = 0.1 – 0.4 (very focused)

Output is almost identical every run. Tends to repeat common phrases from training data. Useful only when you want near-deterministic output for testing.

T = 0.6 – 0.9 (the sweet spot)

Varied enough to feel natural, structured enough to stay coherent. This is the default for creative text. Most practitioners land somewhere in this range.

T = 1.0 (raw model)

No transformation applied: you sample from the model's learned distribution exactly. May work well for strong models; tends to produce incoherent runs with smaller ones.

T > 1.2 (chaotic)

Rare characters and unusual combinations get boosted. Output becomes increasingly incoherent. Sometimes useful for "creative hallucination" but rarely for serious generation.

# Side-by-side with the same prompt "HAMLET:" and seed=42

# T=0.2 → "HAMLET: I will not be able to be able to be able to"
#   Repetition. Safe but boring.

# T=0.8 → "HAMLET: What is this shadow of my father's sword,"
#   Coherent and varied. Shakespeare-like.

# T=1.5 → "HAMLET: ,wixy the! Ard uf mou fath,no. Cre."
#   Character-valid but meaningless. Too much noise.

Temperature and entropy: the deeper connection

Temperature has a formal connection to entropy, the measure of uncertainty in a probability distribution. High entropy means many tokens are roughly equally likely (high uncertainty). Low entropy means one or a few tokens dominate (low uncertainty). Temperature directly controls entropy:

T → 0: entropy → 0. The distribution collapses to a spike on one token (argmax / greedy). Zero uncertainty.
T = 1: entropy is whatever the model learned. The natural output of the model.
T → ∞: entropy → log(vocab_size). All tokens are equally likely. Maximum uncertainty.

import torch

def distribution_entropy(logits, temperature):
    """Compute Shannon entropy of the sampling distribution."""
    scaled = logits / temperature
    probs = torch.softmax(scaled, dim=-1)
    # H = -sum(p * log(p)), in nats
    log_probs = torch.log(probs + 1e-10)
    entropy = -(probs * log_probs).sum().item()
    return entropy

# Example: logits for 5 tokens = [3.0, 2.0, 1.0, 0.5, 0.1]
logits = torch.tensor([3.0, 2.0, 1.0, 0.5, 0.1])
for T in [0.2, 0.5, 0.8, 1.0, 1.5]:
    print(f"T={T}: entropy={distribution_entropy(logits, T):.3f}")
# T=0.2: entropy=0.041  (very peaked)
# T=0.5: entropy=0.491
# T=0.8: entropy=0.918
# T=1.0: entropy=1.107
# T=1.5: entropy=1.357  (flatter)

Thinking in entropy gives you intuition for how "surprised" the model is at each step. High-entropy steps are where multiple tokens are plausible (e.g., choosing between different plot directions). Low-entropy steps are where the next token is nearly forced (e.g., the second letter of "qu" in English is almost always "u").

▶

Interactive companion tool

Adjust temperature and watch the probability distribution reshape in real time. Layer on top-k or top-p filtering and see which tokens survive.

sampling-playground.html ↗

Your action: Generate a 200-character sample five times at each of three temperatures: T=0.2, T=0.8, and T=1.4. Use the same starting prompt each time. Notice how much the outputs vary within a temperature setting and across settings. Find the temperature where the output is still recognisably English but not obviously repetitive; that is your personal sweet spot for this model.

Key Takeaway: Temperature gives you control over the randomness of generation. T=0.7-0.9 is usually the sweet spot for coherent but varied text. Lower temperatures for more focused output, higher for more creative exploration.

Day 25

Top-k Sampling

Restrict to k most likely tokens, why it prevents extremely unlikely tokens

Top-k sampling restricts the model to only consider the k most likely tokens. This prevents the model from sampling extremely unlikely tokens that would produce nonsense. Day 25 covers how top-k works, how to choose k for different vocabulary sizes, and introduces its more adaptive cousin, top-p (nucleus) sampling, which adjusts the candidate set dynamically based on probability mass rather than a fixed count.

Why top-k prevents nonsense

Even with temperature scaling, the model might still sample from the long tail of unlikely tokens. These tokens often produce gibberish. Top-k sampling solves this by only considering the k most probable tokens: set the logits of all other tokens to -inf (which becomes 0 after softmax). This gives you a hard cutoff: only sample from the top k tokens.

def apply_top_k(logits, top_k=40):
    """Filter logits to only keep the top-k most likely tokens.

    Args:
        logits: Tensor of shape (batch, vocab_size)
        top_k: Number of top tokens to keep
    Returns:
        Filtered logits (other tokens set to -inf)
    """
    if top_k is None or top_k <= 0:
        return logits  # No filtering

    # Clamp k so torch.topk never asks for more tokens than the vocabulary has
    top_k = min(top_k, logits.size(-1))

    # Get the top-k values (sorted descending); the indices are not needed
    # values: (batch, top_k) - the k highest logits
    values, _ = torch.topk(logits, top_k)

    # Get the smallest value among the top-k
    # This is our threshold: anything below this gets filtered
    min_top_k_value = values[:, -1:]  # Shape: (batch, 1)

    # Set all logits below the threshold to -inf
    # After softmax, -inf becomes 0 probability
    # masked_fill returns a new tensor, so the caller's logits are not modified
    return logits.masked_fill(logits < min_top_k_value, float("-inf"))

Experiment

Try different top_k values: k=5 (very restrictive, may be repetitive), k=40 (good for char-level), k=0 or None (no filtering). For character-level models with vocab_size=65, top_k=40 is reasonable - it still considers most characters but excludes the very unlikely ones. For word-level models with large vocabularies, you might use k=50 or k=100.

Good values for char-level models

Character-level models have small vocabularies (typically 65 tokens for Shakespeare: 26 lowercase + 26 uppercase letters, plus punctuation, space, and newline). With such a small vocabulary, top_k=40 is reasonable because it still considers most characters while excluding the very unlikely ones. For word-level models with 50,000+ tokens, you'd use larger k values like 50-100.

# For character-level model (vocab_size=65):
# top_k=40 means we consider ~62% of the vocabulary
# This excludes the bottom 25 characters (unlikely ones)

# For word-level model (vocab_size=50000):
# top_k=50 means we consider only 0.1% of the vocabulary!
# This is very restrictive but prevents nonsense words.

# Typical values:
# - Char-level: top_k=20-40
# - Word-level: top_k=50-100
# - BPE/subword: top_k=40-80

# The idea: filter out the "long tail" of unlikely tokens
# that would produce gibberish or break the text.

Top-p (nucleus) sampling: the adaptive alternative

The weakness of top-k is that k is fixed regardless of the distribution shape. When the model is very confident, even k=5 might include tokens with near-zero probability. When the model is uncertain, k=40 might still exclude many plausible tokens. Top-p sampling (also called nucleus sampling) solves this by choosing k adaptively: keep the smallest set of tokens whose cumulative probability exceeds p.

def apply_top_p(logits, top_p=0.9):
    """Nucleus sampling: keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    # Sort tokens by probability (descending)
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Remove tokens with cumulative probability > top_p
    # Shift by 1 to include the token that pushes over the threshold
    sorted_indices_to_remove = cumulative_probs - sorted_probs > top_p
    sorted_logits[sorted_indices_to_remove] = float("-inf")

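    # Unsort back to original token order (scatter along the vocabulary dimension,
    # so this also works for batched logits of shape (batch, vocab_size))
    logits = torch.zeros_like(logits).scatter_(-1, sorted_indices, sorted_logits)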
    return logits

Top-k (fixed count)

Always considers exactly k tokens. When the model is highly confident, k=40 still samples from tokens with <0.001% probability. Can let in noise.

Top-p (adaptive count)

Considers however many tokens are needed to reach p probability mass. Naturally contracts to 1–2 tokens when the model is confident, expands to many when uncertain. More robust across different text types.
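The contrast between a fixed and an adaptive candidate count is easy to see on two toy distributions. In this sketch the logits are invented for illustration, and nucleus_size is a hypothetical helper that counts how many tokens top-p would keep.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p):
    """Count tokens in the smallest sorted prefix whose mass reaches top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

# Invented logits over an 8-token vocabulary
confident = softmax([8.0, 2.0, 1.0, 0.5, 0.0, -1.0, -3.0, -3.0])
uncertain = softmax([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, -3.0, -3.0])

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 6
```

A fixed top_k=4 would keep three near-zero tokens in the confident case and drop two plausible ones in the uncertain case; top-p adapts to both.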

Experiment

Implement apply_top_p and add a top_p parameter to your generate function. Try p=0.9 (common default) and p=0.95. Compare the outputs to top-k at k=40. For most character-level models, top-k and top-p produce similar output quality; the difference becomes more pronounced with large-vocabulary word-level models.

Combining top-k and temperature: order matters

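Top-k and temperature are often applied together, so it is worth being precise about ordering. The standard pipeline is: apply temperature first (scale the logits), then apply top-k (filter the scaled logits), then apply softmax to get probabilities. For top-k alone, swapping the first two steps gives the same result, because dividing by a positive temperature preserves the ranking of the logits and -inf stays -inf. Two orderings do matter: softmax must come after both steps so that the surviving probabilities renormalise to 1, and top-p must be applied after temperature, because the nucleus is defined on probabilities, which temperature changes.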

def sample_next_token(logits, temperature=0.8, top_k=40):
    """Correct pipeline: temperature → top-k → softmax → sample."""

    # Step 1: Temperature scaling (modifies the distribution shape)
    logits = logits / temperature

    # Step 2: Top-k filtering (removes low-probability tokens)
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < values[:, [-1]]] = float("-inf")

    # Step 3: Softmax (convert to probabilities; must come after both filters)
    probs = torch.softmax(logits, dim=-1)

    # Step 4: Sample
    return torch.multinomial(probs, num_samples=1)

# Why order matters:
# Temperature changes the relative gaps between logits, but not their ranking.
# Top-k selects by rank, so it keeps the same tokens before or after scaling.
# Top-p selects by cumulative probability, so it must see the scaled logits.
# Softmax must come last so the kept tokens renormalise to sum to 1.
# Applying temperature first, then top-k/top-p, then softmax is the standard.
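Because dividing by a positive temperature preserves the ranking of the logits, one way to see where ordering actually changes the outcome is to run both orders on a toy distribution. The sketch below is plain Python with invented logits; top_k_filter and top_p_filter are simplified stand-ins for the tensor versions, not the course implementations.

```python
import math

NEG_INF = float("-inf")

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def scale(logits, T):
    return [x / T for x in logits]               # -inf / T stays -inf

def top_k_filter(logits, k):
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]

def top_p_filter(logits, p):
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        if cumulative > p:       # mass before this token already exceeds p
            break
        kept.add(i)
        cumulative += probs[i]
    return [x if i in kept else NEG_INF for i, x in enumerate(logits)]

logits, T = [2.0, 1.0, 0.5, -1.0], 2.0

k_temp_first = softmax(top_k_filter(scale(logits, T), 2))
k_filter_first = softmax(scale(top_k_filter(logits, 2), T))
p_temp_first = softmax(top_p_filter(scale(logits, T), 0.6))
p_filter_first = softmax(scale(top_p_filter(logits, 0.6), T))

print([round(x, 2) for x in k_temp_first])    # [0.62, 0.38, 0.0, 0.0]
print([round(x, 2) for x in k_filter_first])  # [0.62, 0.38, 0.0, 0.0]
print([round(x, 2) for x in p_temp_first])    # [0.62, 0.38, 0.0, 0.0]
print([round(x, 2) for x in p_filter_first])  # [1.0, 0.0, 0.0, 0.0]
```

Top-k gives the same distribution in either order, while top-p applied to the raw logits keeps a single token where the temperature-scaled version keeps two. Applying temperature first in all cases means you never have to remember which filter is order-sensitive.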

p=0.9

A widely used top-p setting for nucleus sampling (note that some APIs, including OpenAI's, default to top_p=1, meaning no nucleus filtering)

k=40

Common top-k default for character-level models; k=50 for word-level

▶

Interactive companion tool

Explore top-k, top-p, and the order-matters problem with temperature. See exactly how each sampling strategy changes the candidate token set.

sampling-playground.html ↗

Your action: Add apply_top_k to your generate pipeline and wire up a top_k parameter. Generate samples at k=5, k=20, and k=0 (no filtering), holding temperature fixed at T=0.8. For your character-level model with ~65-token vocab, note the difference between k=5 (very restrictive) and k=0. Write down which value produces the best trade-off between safety and variety.

Key Takeaway: Top-k sampling prevents the model from sampling extremely unlikely tokens by restricting the choice to the k most probable ones. For character-level models, top_k=40 is a good starting point.

Day 26

The Full Generate Function

Combining temperature + top-k, torch.multinomial for sampling, @torch.no_grad()

Now we combine everything: temperature scaling, top-k filtering, and multinomial sampling into a full generate function. Day 26 also introduces @torch.no_grad() for memory-efficient inference, handles the details of prompt encoding (including unknown characters), and shows how to generate multiple independent samples in one call so you can compare different outputs from the same prompt.

Why @torch.no_grad()?

@torch.no_grad() disables gradient computation during inference. When generating text, we don't need to compute gradients (we're not training), so disabling them saves memory and speeds up generation. Without it, PyTorch would track all operations for potential backpropagation, which wastes memory and compute. Always use it for inference!

@torch.no_grad()  # Disables gradient tracking for efficiency
def generate(model, prompt, stoi, itos,
             max_new_tokens=200, temperature=0.8, top_k=40):
    """Full generate function with temperature and top-k.

    Pipeline for each token:
    1. Run model -> get logits for next position
    2. Apply temperature scaling
    3. Filter with top-k (remove unlikely tokens)
    4. Convert to probabilities with softmax
    5. Sample from distribution with multinomial
    6. Append sampled token and repeat
    """
    device = next(model.parameters()).device

    # Encode the prompt
    tokens = [stoi[c] for c in prompt if c in stoi]
    idx = torch.tensor([tokens], dtype=torch.long, device=device)

    model.eval()  # Set to evaluation mode (disables dropout)

    for _ in range(max_new_tokens):
        # Only use last block_size tokens for context
        idx_cond = idx[:, -model.config.block_size:]

        # Forward pass
        logits, _ = model(idx_cond)

        # Focus on last token: (batch, vocab_size)
        logits = logits[:, -1, :]

        # Step 2: Apply temperature
        logits = logits / temperature

        # Step 3: Apply top-k filtering
        if top_k > 0:
            # Clamp k so a top_k larger than the vocabulary does not raise an error
            values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            min_top_k = values[:, -1:]
            logits[logits < min_top_k] = float("-inf")

        # Step 4: Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # Step 5: Sample from distribution
        # multinomial samples from probs, weighted by their values
        next_token = torch.multinomial(probs, num_samples=1)

        # Step 6: Append and continue
        idx = torch.cat([idx, next_token], dim=1)

    # Decode back to text
    return "".join([itos[i] for i in idx[0].tolist()])

Experiment

Test the full generate function with different combinations of temperature and top_k. Try (T=0.8, k=40) as the default, then experiment: (T=0.5, k=20) for more focused output, (T=1.2, k=0) for wild creativity. Notice how temperature and top_k work together: temperature controls the shape of the distribution, top_k controls the number of candidates.

torch.multinomial for sampling

torch.multinomial(probs, num_samples=1) samples one token from the probability distribution. Unlike argmax which always picks the highest probability, multinomial samples according to the probabilities: higher probability = more likely to be sampled, but not guaranteed. This is what makes the output varied and interesting.

# Example: sampling from a distribution
probs = torch.tensor([0.6, 0.3, 0.1])  # 3 tokens

# argmax always returns index 0 (highest prob)
print(probs.argmax())  # Always: tensor(0)

# multinomial samples according to probabilities
print(torch.multinomial(probs, num_samples=1))
# Could be 0 (60% chance), 1 (30% chance), or 2 (10% chance)

# This is the key to non-deterministic generation!
# Same prompt + different samples = different output
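
The 60/30/10 claim can be checked empirically by drawing many samples and counting. The sketch below uses the standard library's random.choices as a stand-in for torch.multinomial (an assumption made so the example runs without PyTorch); the variable names are illustrative.

```python
import random
from collections import Counter

probs = [0.6, 0.3, 0.1]          # same toy distribution as above
rng = random.Random(0)           # local generator so the run is repeatable

# Greedy choice: index of the largest probability, identical on every call
greedy = max(range(len(probs)), key=probs.__getitem__)

# Sampling: draw 10,000 indices with weights given by probs
draws = rng.choices(range(len(probs)), weights=probs, k=10_000)
counts = Counter(draws)
freqs = [counts[i] / len(draws) for i in range(len(probs))]

print("greedy pick:", greedy)
print("empirical frequencies:", freqs)   # close to [0.6, 0.3, 0.1], not exact
```

The empirical frequencies approach the probabilities as the number of draws grows, while the greedy pick never changes.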

Prompt encoding and handling unknown characters

The generate function receives a string prompt and must convert it to token IDs before the model can process it. For character-level models this is simple (one character = one token), but there are edge cases you must handle:

Unknown characters: The model only knows characters seen in the training corpus. If your prompt contains a character that wasn't in training (e.g., an emoji or a non-ASCII letter), the lookup will fail. You must either skip, replace, or raise an error.
Empty prompt: If all prompt characters are unknown, the token list is empty. A common fallback is to start from a newline character ("\n") or the first token in the vocabulary.
Prompt longer than block_size: Truncate to the last block_size characters before encoding.

def encode_prompt(prompt, stoi, block_size):
    """Encode a string prompt into token IDs, handling edge cases."""
    # Filter unknown characters (silently drop them)
    known_chars = [c for c in prompt if c in stoi]

    if not known_chars:
        # Fallback: start from newline if prompt is fully unknown
        print("Warning: prompt contained no known characters, using '\\n'")
        known_chars = ["\n"] if "\n" in stoi else [list(stoi.keys())[0]]

    # Truncate if longer than context window
    if len(known_chars) > block_size:
        known_chars = known_chars[-block_size:]
        print(f"Warning: prompt truncated to last {block_size} characters")

    # Convert to token IDs and add batch dimension
    token_ids = [stoi[c] for c in known_chars]
    return torch.tensor([token_ids], dtype=torch.long)

# Usage
idx = encode_prompt("To be or not", stoi, model.config.block_size)
print(f"Prompt encoded to {idx.shape[1]} tokens")  # idx.shape is (1, 12): one token per character
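
encode_prompt implements only the "skip" option. As a rough sketch of all three policies mentioned above (skip, replace, raise), the helper below works on plain lists; the name encode_chars, the on_unknown parameter, and the toy vocabulary are hypothetical and exist only for illustration.

```python
def encode_chars(prompt, stoi, on_unknown="skip", replacement="?"):
    """Map characters to token IDs with a selectable policy for unknown characters.

    on_unknown: "skip" drops them, "replace" substitutes `replacement`,
    "raise" fails fast. These names are illustrative, not course API.
    """
    if on_unknown not in ("skip", "replace", "raise"):
        raise ValueError(f"unknown policy: {on_unknown!r}")
    if on_unknown == "replace" and replacement not in stoi:
        raise ValueError("replacement character must be in the vocabulary")

    ids = []
    for ch in prompt:
        if ch in stoi:
            ids.append(stoi[ch])
        elif on_unknown == "replace":
            ids.append(stoi[replacement])
        elif on_unknown == "raise":
            raise KeyError(f"character {ch!r} is not in the vocabulary")
        # "skip": fall through and drop the character
    return ids

# Toy vocabulary for demonstration only: ' ', '?', 'b', 'e', 'o', 't'
toy_stoi = {ch: i for i, ch in enumerate(sorted(set("to be?")))}
print(encode_chars("to b#e", toy_stoi, on_unknown="skip"))
print(encode_chars("to b#e", toy_stoi, on_unknown="replace"))
```

Skipping silently changes the prompt, replacing keeps its length, and raising surfaces the problem immediately; which one fits depends on whether the caller is a person at a terminal or another program.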

Generating multiple samples at once

The generate function already carries a batch dimension: the first axis of idx. If you pass a batch of size N, you get N independent samples from the same prompt, with one forward pass per generated token for the whole batch. On a GPU this is much faster than calling generate N times, because the batch is processed in parallel.

@torch.no_grad()
def generate_n_samples(model, prompt, stoi, itos, n=5,
                        max_new_tokens=200, temperature=0.8, top_k=40):
    """Generate n independent samples from the same prompt."""
    device = next(model.parameters()).device

    # Encode prompt once, then repeat for the batch
    tokens = [stoi[c] for c in prompt if c in stoi]
    idx = torch.tensor([tokens], dtype=torch.long, device=device)
    idx = idx.repeat(n, 1)  # Shape: (n, prompt_len)

    model.eval()

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # (n, vocab_size)

        if top_k > 0:
            # Clamp k so a top_k larger than the vocabulary does not raise an error
            values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < values[:, [-1]]] = float("-inf")

        probs = torch.softmax(logits, dim=-1)
        next_tokens = torch.multinomial(probs, num_samples=1)  # (n, 1)
        idx = torch.cat([idx, next_tokens], dim=1)

    # Decode each sample in the batch
    samples = []
    for i in range(n):
        text = "".join([itos[t] for t in idx[i].tolist()])
        samples.append(text)
    return samples

# Usage: generate 5 samples and print each
samples = generate_n_samples(model, "To be or not", stoi, itos, n=5)
for i, s in enumerate(samples):
    print(f"\n--- Sample {i+1} ---\n{s}")

Experiment

Generate 5 samples at once and read them all. You'll notice each goes in a different creative direction from the same prompt. This is the power of sampling. Now pick your favourite sample: that's the kind of output that's impossible to get from greedy decoding. Compare the wall-clock time of generate_n_samples(n=5) vs calling your single-sample generate 5 times in a loop.

Your action Combine temperature and top-k into one generate function decorated with @torch.no_grad(). Time a 500-token generation with and without the decorator using time.perf_counter() (on a GPU, call torch.cuda.synchronize() before reading the clock). Measure peak GPU memory with torch.cuda.max_memory_allocated(); on CPU, tracemalloc only sees Python-level allocations, not tensor storage, so treat its numbers as a rough indicator. Record the wall-clock and memory difference: this makes the cost of gradient tracking concrete.

Key Takeaway: The full generate function combines temperature scaling, top-k filtering, and multinomial sampling. Use @torch.no_grad() for memory-efficient inference. This is the function you'll use to make your model write text.

Day 27

CLI & Reproducibility

argparse for command-line interface, seeds for reproducibility, loading checkpoints

Turn your generate function into a proper command-line tool using argparse. Day 27 covers building a CLI that loads checkpoints, accepts prompts and generation parameters, and supports reproducibility with random seeds. A well-structured CLI also means you can run generation experiments from shell scripts, compare outputs across checkpoints without touching Python, and share a single command with someone else and get the exact same result.

Building the CLI with argparse

A command-line interface makes your generate function easy to use. argparse lets you define arguments like the checkpoint path, prompt, temperature, top_k, and seed. This allows users to run generation from the terminal without modifying code. The key is to save all necessary data (config, stoi, itos) in the checkpoint during training.

if __name__ == "__main__":
    import argparse

    # Set up argument parser
    parser = argparse.ArgumentParser(
        description="Generate text from a trained GPT checkpoint"
    )

    # Required argument: checkpoint path
    parser.add_argument("checkpoint",
                        help="Path to checkpoint file (e.g. checkpoint_final.pt)")

    # Optional arguments with defaults
    parser.add_argument("--prompt", default="To be or not",
                        help="Starting text for generation")
    parser.add_argument("--max_new_tokens", type=int, default=200,
                        help="Number of tokens to generate")
    parser.add_argument("--temperature", type=float, default=0.8,
                        help="Sampling temperature (lower = more deterministic)")
    parser.add_argument("--top_k", type=int, default=40,
                        help="Only sample from top-k most likely tokens")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")

    args = parser.parse_args()

    # Set seed if provided (for reproducibility)
    if args.seed is not None:
        torch.manual_seed(args.seed)
        print(f"Seed set to {args.seed}")

Experiment

Run the CLI with different arguments: python generate.py checkpoint_final.pt --prompt "Hello" --temperature 0.9 --seed 42. Try the same seed multiple times: you should get identical output. Try different seeds: you'll get different but equally valid outputs. This is the power of reproducible generation.
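
Two small refinements are worth considering for the parser above: validating arguments at parse time, and building the parser in a function so it can be exercised without a shell. The sketch below is one possible arrangement; positive_float and build_parser are hypothetical helper names, not part of the course code.

```python
import argparse

def positive_float(text):
    """argparse type: accept only floats > 0 (temperature divides the logits)."""
    value = float(text)
    if value <= 0:
        raise argparse.ArgumentTypeError(f"must be > 0, got {value}")
    return value

def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate text from a trained GPT checkpoint")
    parser.add_argument("checkpoint")
    parser.add_argument("--prompt", default="To be or not")
    parser.add_argument("--temperature", type=positive_float, default=0.8)
    parser.add_argument("--top_k", type=int, default=40)
    parser.add_argument("--seed", type=int, default=None)
    return parser

# Passing a list to parse_args() bypasses sys.argv, which is handy in tests
args = build_parser().parse_args(
    ["checkpoint_final.pt", "--prompt", "Hello", "--temperature", "0.9", "--seed", "42"]
)
print(args.checkpoint, args.prompt, args.temperature, args.top_k, args.seed)
```

With this in place, --temperature 0 is rejected with a usage message instead of causing a division by zero deep inside generate.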

Loading checkpoints and running generation

Checkpoints should contain everything needed for generation: the model config, trained weights, and the character mappings (stoi/itos). Load the checkpoint, reconstruct the model, and run generation. This makes your trained model portable and easy to use.

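    # Load the checkpoint. weights_only=False unpickles arbitrary Python objects
    # (needed here for config/stoi/itos), so only load files you trust.
    # map_location="cpu" lets a checkpoint saved on a GPU load on any machine.
    print(f"Loading checkpoint: {args.checkpoint}")
    checkpoint = torch.load(args.checkpoint, map_location="cpu", weights_only=False)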

    # Extract config and character mappings
    config = checkpoint["config"]
    stoi = checkpoint["stoi"]  # string to index (char -> token ID)
    itos = checkpoint["itos"]  # index to string (token ID -> char)

    # Reconstruct the model
    model = GPT(config)
    model.load_state_dict(checkpoint["model_state_dict"])
    print(f"Model loaded: {sum(p.numel() for p in model.parameters())} parameters")

    # Generate text!
    print(f"\nGenerating with prompt: '{args.prompt}'")
    print(f"Temperature: {args.temperature}, Top-k: {args.top_k}\n")

    output = generate(model, args.prompt, stoi, itos,
                      max_new_tokens=args.max_new_tokens,
                      temperature=args.temperature,
                      top_k=args.top_k)

    print("-" * 50)
    print(output)
    print("-" * 50)

Reproducibility with seeds

Generation involves random sampling (torch.multinomial), so the same prompt produces different output each time. To get reproducible results, set a random seed with torch.manual_seed(seed). This makes the random number generator produce the same sequence of numbers, so generation is repeatable for a given seed on the same machine and PyTorch version. Results can still differ across hardware, PyTorch releases, or between CPU and GPU.

# Reproducible generation
torch.manual_seed(42)
print(generate(model, "To be or not", stoi, itos, temperature=0.8))
# Same output every time with seed=42!

# From the command line:
# python generate.py checkpoint_final.pt --prompt "To be or not" --seed 42

# Why seeds matter:
# 1. Debugging: reproduce issues with specific outputs
# 2. Comparison: compare model versions with same input
# 3. Demos: show consistent results to others
# 4. Testing: write tests that expect specific output

Saving outputs and running generation sweeps

Once you have a CLI you can run generation sweeps: systematically testing many combinations of temperature, top-k, and prompt, and saving all outputs for comparison. A simple shell loop is enough. The outputs should be saved with metadata (checkpoint, seed, parameters) in the filename or a header so you can reconstruct what produced each sample later.

# In generate.py, add --output_file argument
parser.add_argument("--output_file", default=None,
                    help="If set, write output to this file instead of stdout")
parser.add_argument("--num_samples", type=int, default=1,
                    help="Number of independent samples to generate")

# In the generation section:
results = []
for i in range(args.num_samples):
    if args.seed is not None:
        torch.manual_seed(args.seed + i)  # different seed per sample
    output = generate(model, args.prompt, stoi, itos,
                      max_new_tokens=args.max_new_tokens,
                      temperature=args.temperature, top_k=args.top_k)
    results.append(output)

# Save or print
if args.output_file:
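    with open(args.output_file, "w", encoding="utf-8") as f:
        # Record the seed too, so each sample can be regenerated later
        header = (f"# checkpoint={args.checkpoint} seed={args.seed} "
                  f"T={args.temperature} k={args.top_k}\n\n")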
        f.write(header)
        for i, r in enumerate(results):
            f.write(f"--- Sample {i+1} ---\n{r}\n\n")
    print(f"Saved {args.num_samples} samples to {args.output_file}")
else:
    for i, r in enumerate(results):
        print(f"\n--- Sample {i+1} ---\n{r}")

#!/bin/bash
# Shell script: sweep temperatures and save all outputs
for T in 0.5 0.7 0.9 1.1; do
  python generate.py checkpoint_final.pt \
    --prompt "To be or not" \
    --temperature "$T" \
    --top_k 40 \
    --num_samples 3 \
    --seed 42 \
    --output_file "samples_T${T}.txt"
done
# Creates: samples_T0.5.txt, samples_T0.7.txt, samples_T0.9.txt, samples_T1.1.txt

⚠

Checkpoint compatibility Always save stoi, itos, and the full GPTConfig inside the checkpoint file during training, not just the model weights. Without them, you cannot reconstruct the model or decode the output. A checkpoint that holds only the weights is unusable for generation on its own.
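
A cheap way to catch an incomplete checkpoint early is to validate the loaded dict before building the model. The sketch below assumes the four keys used by the loading code in this section and assumes stoi and itos are plain dicts; check_checkpoint is a hypothetical helper name.

```python
REQUIRED_KEYS = ("config", "model_state_dict", "stoi", "itos")

def check_checkpoint(ckpt):
    """Return a list of problems found in a checkpoint dict (empty list means OK)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in ckpt]
    if "stoi" in ckpt and "itos" in ckpt:
        stoi, itos = ckpt["stoi"], ckpt["itos"]
        if len(stoi) != len(itos):
            problems.append("stoi and itos have different sizes")
        # Every character must round-trip: char -> id -> char
        elif any(itos.get(i) != ch for ch, i in stoi.items()):
            problems.append("stoi and itos are not inverse mappings")
    return problems

# Toy checkpoints for demonstration only (empty state dicts, tiny vocabulary)
chars = sorted(set("hello world"))
good = {"config": {"block_size": 8}, "model_state_dict": {},
        "stoi": {c: i for i, c in enumerate(chars)},
        "itos": {i: c for i, c in enumerate(chars)}}
weights_only = {"model_state_dict": {}}

print(check_checkpoint(good))
print(check_checkpoint(weights_only))
```

Calling a check like this right after torch.load turns a confusing KeyError halfway through generation into a clear message at startup.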

Your action Create generate.py with an argparse CLI that accepts checkpoint, --prompt, --temperature, --top_k, --max_new_tokens, and --seed. Run it twice with --seed 42 and confirm the output is identical. Then run it twice without --seed and confirm the outputs differ. Share one of the seeded outputs: this is the first output from your model that someone else can reproduce by running the same command.

Key Takeaway: A CLI makes your model easy to use from the terminal. Save config, weights, and character mappings in checkpoints. Use random seeds for reproducible generation: the same seed gives the same output on the same setup.

Day 28

Week 4 Wrap-up

What to expect at different training steps, trying different settings, key takeaways

Week 4 complete! Your model can now write text. Day 28 shows what to expect at different stages of training, how to interpret generated samples, the most common generation pitfalls and how to diagnose them, and a full summary of the key concepts from this week. By the time you finish Day 28, you'll be able to look at any generated sample and diagnose whether the issue is the model, the sampling parameters, or training progress.

What to expect: samples at different training steps

The quality of generated text improves as training progresses. Here are real samples from a training run (6L/6H/384D on Shakespeare). Early in training, output is random characters. Mid-training, words and structure emerge. Late training, you get plausible Shakespeare-like text. But beware: after too many steps, the model overfits and regurgitates memorized training data.

# Step 200 (val loss ~3.5) - Random characters
# The model hasn't learned anything meaningful yet
"To be or notis p ce mei odorethleedetire'ilethed ye m arkesothir fnon b tigb'i."

# Step 1000 (val loss 1.64) - Words and structure emerging
# The model is learning words and basic grammar
"To be or nothing are good men,
The profent of little, our actory.

CORIOLANUS:
Is it now of your many death?"

# Step 2400 (val loss ~1.60) - Peak quality
# Plausible Shakespeare! But notice: it's starting to
# memorize specific phrases from the training data.
"To be or not to be some of you shall know
That everlature by Romeo: what news,
Which you had knock'd my part to speak"

# Step 5000+ (overfitting) - Memorized text
# The model regurgitates training data verbatim.
# This is why you should save checkpoints and compare!

Experiment

Generate samples at different checkpoints during training (e.g., every 500 steps). Keep a log of the outputs. You'll see the progression from gibberish to words to coherent text. Also try different temperature/top_k combinations at each checkpoint: early checkpoints benefit from higher temperature (more exploration), while later checkpoints might need lower temperature (more focus).

Trying different settings

Different generation settings produce dramatically different output. Temperature controls randomness, top_k controls the candidate pool. Experiment with combinations to find what works best for your model and use case. Here are some starting points for a trained Shakespeare model.

# Load your trained model
checkpoint = torch.load("checkpoint_final.pt", weights_only=False)
config = checkpoint["config"]
stoi = checkpoint["stoi"]
itos = checkpoint["itos"]
model = GPT(config)
model.load_state_dict(checkpoint["model_state_dict"])

# Near-deterministic, repetitive (T=0.1, close to greedy)
print("=== Low temperature (0.1) ===")
print(generate(model, "To be or not to be", stoi, itos,
              temperature=0.1, top_k=40))

# Balanced (T=0.8, default)
print("\n=== Balanced (0.8) ===")
print(generate(model, "To be or not to be", stoi, itos,
              temperature=0.8, top_k=40))

# Creative, potentially incoherent (T=1.5)
print("\n=== High temperature (1.5) ===")
print(generate(model, "To be or not to be", stoi, itos,
              temperature=1.5, top_k=40))

# Very restrictive (T=0.8, top_k=10)
print("\n=== Restrictive top-k (k=10) ===")
print(generate(model, "To be or not to be", stoi, itos,
              temperature=0.8, top_k=10))
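
To see why these three temperatures behave so differently, it can help to look at what they do to a single toy distribution. The sketch below is plain Python; the four logits are made up and softmax_with_temperature is an illustrative helper, not course code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed on a plain list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [3.0, 2.0, 1.0, 0.0]            # made-up scores for four tokens
for t in (0.1, 0.8, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At T=0.1 almost all the mass sits on the top token, which is why the output looks greedy; at T=1.5 the tail tokens receive a substantial share, which is where incoherent samples come from.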

Week 4 Key Takeaways

You've built a complete text generation pipeline! Your model can now write text using autoregressive generation with temperature sampling and top-k filtering. Here are the key concepts to remember as you move forward.

Autoregressive generation: Predict one token, append, repeat. This is the foundation of how GPT models write text.

Greedy decoding is deterministic and repetitive: Always picking the most probable token leads to boring, repetitive output.

Temperature controls randomness: T=0.7-0.9 is usually the sweet spot for coherent but varied text.

Top-k removes extremely unlikely tokens: Prevents the model from sampling nonsense by restricting to the k most probable tokens.

Use @torch.no_grad() for inference: Disables gradient computation to save memory and speed up generation.

Reproducibility with seeds: Set a random seed to get the same output every time. Useful for debugging and demos.

Watch your model learn: Generate samples during training to see the progression from gibberish to coherent text. For the 6L/6H/384D Shakespeare run shown above, the best output is usually around step 1500-2500, before overfitting sets in.

Common generation pitfalls and how to diagnose them

When generated text looks wrong, the cause is almost always one of five things. Here is a diagnostic guide:

Symptom: pure gibberish, no words

Cause: Model hasn't trained long enough, or val loss is still very high (>3.0). Fix: Train more steps. Check that training loss is actually decreasing.

Symptom: words but no structure

Cause: Early-to-mid training. The model has learned vocabulary but not grammar or context. Fix: This is normal; keep training. This phase typically ends by val loss ~2.0.

Symptom: coherent but endlessly repetitive

Cause: Temperature too low, or greedy/near-greedy sampling. Fix: Raise temperature to 0.7–0.9. If top-k is very small, raise it so more candidates survive.

Symptom: sounds like training data verbatim

Cause: The model is overfitting and has memorised training examples. Fix: Load an earlier checkpoint (before val loss started rising). Increase dropout.

Symptom: valid characters but nonsensical words

Cause: Temperature too high, or top-k too large (sampling from the long tail). Fix: Lower temperature to 0.6–0.8. Set top-k=20–40.

Target: varied, coherent, style-consistent

The output reads like a plausible new sample from the training domain. Different each run but recognisably in the right style. Val loss typically 1.5–1.8 for Shakespeare at this model size.

# Quick diagnostic: check val loss before adjusting sampling params
# If val_loss > 3.0   → more training, sampling params don't matter yet
# If val_loss 1.8-3.0 → mid-quality model, temperature is the main lever
# If val_loss < 1.8   → good model, fine-tune sampling for best output
# If val_loss rising  → overfitting, load earlier checkpoint

# Always check these three things first:
# 1. model.eval() is called before generate
# 2. @torch.no_grad() is active (speeds up and prevents memory issues)
# 3. Correct checkpoint loaded (stoi/itos match the model's training data)
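
The checklist above can be turned into a small helper that is called after each evaluation. This is only a sketch: diagnose is a hypothetical function, the thresholds are the rough guide values for the Shakespeare model in this section, and losses that fall between the listed bands are folded into the neighbouring band (1.5–1.8 counts as good, matching the target range given earlier). Treating a single increase in val loss as overfitting is a simplification; in practice you would look at a trend over several evaluations.

```python
def diagnose(val_loss, previous_val_loss=None):
    """Suggest the next action from validation loss, following the guide bands."""
    # A rising validation loss takes priority over the absolute value
    if previous_val_loss is not None and val_loss > previous_val_loss:
        return "overfitting: load an earlier checkpoint"
    if val_loss > 3.0:
        return "undertrained: train longer before touching sampling parameters"
    if val_loss >= 1.8:
        return "mid-quality: temperature is the main lever"
    return "good: fine-tune temperature and top-k"

for loss, prev in [(3.5, None), (2.0, 2.4), (1.6, 1.7), (1.7, 1.6)]:
    print(f"val_loss={loss}, previous={prev} -> {diagnose(loss, prev)}")
```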

~1.5

Val loss target for good Shakespeare generation (6L/6H/384D model)

~2500

Training steps to peak quality before overfitting on TinyShakespeare at this size

Your action Pick your best checkpoint and generate a 500-character sample with your tuned temperature and top-k settings. Save it to a file called sample_week4.txt. Then load two other checkpoints (one from earlier in training and one from later) and generate the same prompt with each. Put all three samples side by side in the file. Label them with their step number and val loss. This is your week 4 artefact: proof that your model can write, and a record of how generation quality evolves with training.

Week 4 Complete! Your model can now write text. Next week: putting it all together into an end-to-end pipeline, training a complete model, and seeing what it can do.

5WK

Putting It All Together

Train on real data, monitor progress, and experiment with model configurations

Week 5 Focus: Train → Generate → Experiment

Day 29

Project Structure

Organize model.py, train.py, generate.py, and understand how the files interact

Weeks 1–4 built every component of a GPT model piece by piece. Week 5 assembles those pieces into a real, runnable project. Before writing a single line of training code, it is worth thinking hard about project layout. A clean file structure makes your code modular, debuggable, and easy to share. The pattern used here (one file per concern) is the same pattern used by serious ML research codebases, just without the extra complexity.

The four-file layout

The project is split into four Python files plus a data directory and a checkpoints directory. Each file has a single, clear responsibility.

llm-from-scratch/
├── data/
│   └── input.txt          # Training corpus (e.g. Shakespeare, ~1 MB)
├── model.py               # GPT class: nothing but architecture
├── train.py               # Training loop, optimizer, checkpointing
├── generate.py            # Load checkpoint and sample text
├── utils.py               # Data loading, batching, tokenization helpers
└── checkpoints/
    └── ckpt_step5000.pt   # Saved model weights + optimizer state

The separation matters because it enforces what machine-learning engineers call the train/infer split: the code path for training is completely separate from the code path for inference. Mixing them in one script leads to bugs where training-only operations (dropout, gradient accumulation) accidentally run during generation.

model.py: architecture only

model.py should contain exactly one thing: the GPTLanguageModel class built in Weeks 1–4. No training loop, no data loading, no file I/O. When another file needs the model, it does from model import GPTLanguageModel and that is all.

# model.py (skeleton)
import torch
import torch.nn as nn

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        self.token_embedding  = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, block_size, dropout)
                                       for _ in range(n_layer)])
        self.ln_f  = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        # idx: (B, T) token IDs  →  returns logits (B, T, V) and optional loss
        ...

Keeping all hyperparameters as constructor arguments (not module-level globals) means you can instantiate multiple model sizes in the same Python session, which is essential for the experiments in Days 33 and 34.

train.py: the training engine

train.py is responsible for everything that happens during the learning phase: loading data, constructing batches, running the forward/backward pass, updating weights, logging losses, and saving checkpoints. It imports GPTLanguageModel from model.py and helpers from utils.py.

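# train.py (top-level structure)
import torch

from model import GPTLanguageModel
from utils import get_batch, load_data

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 1. Hyperparameters
#    Architecture and optimisation settings live in separate dicts so that
#    **model_config matches the GPTLanguageModel constructor exactly.
model_config = { 'n_layer': 6, 'n_head': 6, 'n_embd': 384,
                 'block_size': 256,
                 'dropout': 0.2 }   # placeholder value; tune for your run
train_config = { 'batch_size': 64, 'lr': 3e-4, 'max_steps': 5000 }

# 2. Data
train_data, val_data, vocab_size = load_data('data/input.txt')

# 3. Model + optimizer
model = GPTLanguageModel(vocab_size, **model_config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=train_config['lr'])

# 4. Training loop
model.train()
for step in range(train_config['max_steps']):
    xb, yb = get_batch(train_data, model_config['block_size'],
                       train_config['batch_size'])
    xb, yb = xb.to(device), yb.to(device)
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()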

generate.py and utils.py

generate.py loads a saved checkpoint, reconstructs the model with the same hyperparameters used during training, and calls model.generate(). It never touches an optimizer or computes gradients: it puts the model in eval() mode (which disables dropout) and runs under torch.no_grad() (which disables gradient tracking).

utils.py holds the glue code that both files need: reading input.txt, building the character vocabulary, encoding/decoding strings to integer IDs, and the get_batch() function that slices random windows from the training data.
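
As a rough picture of what those helpers do, here is a standard-library sketch that works on plain Python lists. It is an assumption-laden stand-in: the real utils.py would return PyTorch tensors, and the function signatures here (including the optional rng argument) are illustrative rather than the course's exact API.

```python
import random

def build_vocab(text):
    """Character vocabulary: sorted unique characters mapped to integer IDs."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(s, stoi):
    return [stoi[ch] for ch in s]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

def get_batch(data, block_size, batch_size, rng=random):
    """Pick random windows; targets are the inputs shifted one position right."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xb = [data[s:s + block_size] for s in starts]
    yb = [data[s + 1:s + block_size + 1] for s in starts]
    return xb, yb

text = "To be, or not to be, that is the question."
stoi, itos = build_vocab(text)
data = encode(text, stoi)
xb, yb = get_batch(data, block_size=8, batch_size=4, rng=random.Random(0))
print(repr(decode(xb[0], itos)), "->", repr(decode(yb[0], itos)))
```

Each target window is the input window moved one character to the right, which is exactly the next-token prediction task the model is trained on.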

Experiment

Add a config.py file that defines the TINY, SMALL, and MEDIUM configuration dicts and import them in both train.py and generate.py. This ensures the model you generate with is always the same size as the model you trained. A size mismatch is easy to introduce and hard to debug when the configs live in separate files.
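
One possible shape for such a file is sketched below. Only MEDIUM's layer, head, width, and block-size values mirror the train.py config shown earlier; the TINY and SMALL numbers, all dropout values, and the get_config helper are placeholders chosen for illustration.

```python
# config.py (hypothetical sketch; adjust the placeholder values to your own runs)
TINY   = {"n_layer": 2, "n_head": 2, "n_embd": 64,  "block_size": 64,  "dropout": 0.0}
SMALL  = {"n_layer": 4, "n_head": 4, "n_embd": 128, "block_size": 128, "dropout": 0.1}
MEDIUM = {"n_layer": 6, "n_head": 6, "n_embd": 384, "block_size": 256, "dropout": 0.2}

CONFIGS = {"tiny": TINY, "small": SMALL, "medium": MEDIUM}

def get_config(name):
    """Look up a config by name and check it is internally consistent."""
    cfg = dict(CONFIGS[name])            # copy so callers cannot mutate the shared dict
    # Multi-head attention splits n_embd evenly across the heads
    if cfg["n_embd"] % cfg["n_head"] != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return cfg

print(get_config("medium"))
```

Both train.py and generate.py would then call get_config with the same name, so the two scripts cannot drift apart.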

Your action Create the four-file project structure on your local machine or Colab. Write the skeleton of each file: model.py with the GPTLanguageModel class stub, train.py with the config dict and training loop structure, generate.py with checkpoint loading logic, and utils.py with load_data() and get_batch() stubs. Verify you can import from one file to another without errors.

Key Takeaway

One file, one responsibility. model.py defines the architecture; train.py runs the learning loop; generate.py runs inference; utils.py holds shared helpers. This separation lets you swap out any single piece without touching the others.

Day 30

Google Colab Setup

Configure a GPU runtime, install dependencies, upload your files, and verify everything works

Training on a CPU is possible for tiny models, but even a 10 M-parameter model can take the better part of an hour per run. Google Colab gives you free access to an NVIDIA T4 GPU, which can be 30–100× faster for the matrix operations that dominate transformer training. Day 30 walks through every step of getting Colab ready so that Day 31's training run goes smoothly.
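
For a sense of where a figure like 10 M comes from, a back-of-the-envelope count is enough. The sketch below assumes a standard GPT block (four d×d attention matrices plus an MLP with 4× hidden expansion), ignores biases and LayerNorm parameters, and uses the 6-layer, 384-wide configuration with the ~65-character vocabulary mentioned earlier; approx_gpt_params is an illustrative helper.

```python
def approx_gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Rough parameter count for a GPT-style decoder (biases and LayerNorms ignored).

    Assumes each block has 4*d^2 attention weights (Q, K, V, output projection)
    and 8*d^2 MLP weights (4x hidden expansion), plus an untied output head.
    """
    per_block = 12 * n_embd ** 2
    embeddings = vocab_size * n_embd + block_size * n_embd
    lm_head = n_embd * vocab_size
    return n_layer * per_block + embeddings + lm_head

n = approx_gpt_params(n_layer=6, n_embd=384, vocab_size=65, block_size=256)
print(f"{n:,} parameters (~{n / 1e6:.1f} M)")
```

Almost all of the parameters sit in the transformer blocks; with a character-level vocabulary the embedding tables are a rounding error.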

Why a GPU matters for training

The core operation in every transformer layer is a matrix multiply: Q @ K.T and attention_weights @ V. A CPU executes matrix multiplies sequentially on a small number of cores. A GPU has thousands of tiny cores optimised to run the same arithmetic on thousands of values in parallel.

CPU (M3 Pro, 12 cores)

~45 minutes per 5 000-step run on the medium config. Fine for testing one config, painful for experiments.

T4 GPU (Colab free tier)

~6–8 minutes per 5 000-step run. Fast enough to iterate through four or five hyperparameter combinations in a single session.

Enable the GPU runtime

Colab defaults to a CPU runtime. You must switch before running any code:

Click Runtime in the top menu bar
Select Change runtime type
Set Hardware accelerator to T4 GPU
Click Save; the runtime will restart

Verify the GPU is active by running this in the first cell:

import torch
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))  # Tesla T4
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# VRAM: 15.8 GB

Install dependencies and create the project layout

Colab has PyTorch pre-installed, but you still need tiktoken and a few utilities. Run this in a cell:

!pip install tiktoken --quiet

import os, urllib.request

# Create project directories
for d in ['data', 'checkpoints']:
    os.makedirs(d, exist_ok=True)

# Download Shakespeare training data (~1 MB)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "data/input.txt")

with open("data/input.txt") as f:
    text = f.read()
print(f"Corpus: {len(text):,} characters, ~{len(text)//1000} KB")
# Corpus: 1,115,394 characters, ~1115 KB

Upload your Python files

Colab runtimes are ephemeral: the filesystem is wiped when the runtime disconnects. The cleanest workflow is to keep your .py files in a GitHub repo and clone them at the start of each session:

# Option A: clone from GitHub (recommended)
!git clone https://github.com/YOUR_USERNAME/llm-from-scratch.git
%cd llm-from-scratch

# Option B: upload files manually via Colab's file browser
# (Left sidebar → Files icon → Upload)

# Option C: write files directly in notebook cells using %%writefile
# %%writefile model.py
# import torch ...

Experiment

Check how much RAM and disk you have available: !free -h (RAM) and !df -h / (disk). Colab free tier typically gives you around 12 GB RAM and on the order of 100 GB disk, though the exact figures vary between sessions. Knowing these limits helps you decide how large a model and dataset you can use before running out of memory.

Your action Set up a Google Colab notebook with a T4 GPU runtime. Run the verification code to confirm torch.cuda.is_available() returns True and the device name shows "Tesla T4". Install tiktoken, create the project directories, and download the Shakespeare corpus. Print the corpus length to confirm everything works before moving to training.

Key Takeaway

Always verify the GPU is active with torch.cuda.is_available() before starting a long training run. A training job that silently falls back to CPU will appear to work but will run many times slower, and you may only notice much later when you check the elapsed time.

Day 31

Running Training

Walk through train.py line by line, understand gradient clipping, and interpret the training output

Training a language model is a loop that runs thousands of times. Each iteration is called a step. Within each step, four core operations happen in order: forward pass (compute predictions), loss calculation (measure error), backward pass (compute gradients), and optimizer step (update weights). Two housekeeping operations, zeroing stale gradients and clipping, sit around the backward pass. Understanding what each line of train.py does, and why it is in that order, will help you debug anything that goes wrong.

The complete training loop, explained

Here is the core training loop with every line annotated:

model.train()  # enables dropout; must be set for training

for step in range(max_steps):

    # ── 1. Sample a random batch ────────────────────────────────────
    xb, yb = get_batch('train')
    # xb shape: (batch_size, block_size); input token IDs
    # yb shape: (batch_size, block_size); target token IDs (xb shifted by 1)

    # ── 2. Forward pass ─────────────────────────────────────────────
    logits, loss = model(xb, yb)
    # logits: (B, T, vocab_size); raw scores before softmax
    # loss: scalar cross-entropy between logits and yb

    # ── 3. Zero out stale gradients ─────────────────────────────────
    optimizer.zero_grad(set_to_none=True)
    # set_to_none=True is faster than zeroing: frees the memory entirely

    # ── 4. Backward pass ────────────────────────────────────────────
    loss.backward()
    # PyTorch walks the computation graph and fills .grad on every parameter

    # ── 5. Gradient clipping ────────────────────────────────────────
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    # Rescales gradients so their global norm ≤ 1.0
    # Prevents a single bad batch from causing a catastrophic weight update

    # ── 6. Weight update ────────────────────────────────────────────
    optimizer.step()
    # AdamW adjusts each parameter using its gradient + per-param momentum

Why gradient clipping?

Without clipping, a single batch that produces an unusually large gradient can push the model weights into a region where the loss explodes: you will see the loss suddenly jump to nan or a very large number. Clipping at max_norm=1.0 rescales the entire gradient vector (not individual gradients) so its magnitude is at most 1.0. This is cheap to compute and makes training dramatically more stable.

Without clipping

Rare but catastrophic: one unlucky batch sends loss to nan. Training has to be restarted from the last checkpoint.

With clipping (norm ≤ 1.0)

Worst-case update size is bounded. Training continues smoothly through hard batches. Small constant cost per step.
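
To make the rescaling concrete, here is a minimal pure-Python sketch of global-norm clipping. It mirrors the idea behind clip_grad_norm_ but is not PyTorch's implementation; the function name clip_by_global_norm and the toy gradient values are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one shared factor so the global L2 norm is at most max_norm.

    `grads` is a list of flat lists of floats, one list per parameter tensor.
    Returns (clipped_grads, original_global_norm).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        # Already small enough: return an unchanged copy
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm

# Hypothetical gradients for two parameter tensors:
# global norm = sqrt(3^2 + 4^2 + 12^2) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

Because every component is multiplied by the same factor, the direction of the update is preserved; only its length changes.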

Reading the training output

Every 500 steps, train.py evaluates on the validation set and prints a status line. Here is what each field means:

step    500 | train 4.312 | val 4.358 | lr 3.00e-04 | 2.1s/step
step   1000 | train 3.641 | val 3.702 | lr 3.00e-04 | 2.0s/step
step   2000 | train 2.894 | val 2.983 | lr 2.41e-04 | 2.0s/step
step   3000 | train 2.341 | val 2.476 | lr 1.64e-04 | 2.0s/step
step   5000 | train 1.821 | val 2.094 | lr 5.00e-05 | 2.0s/step

train / val: Cross-entropy loss on training and validation splits. Both should decrease. A widening gap signals overfitting.
lr: Current learning rate after the cosine decay schedule. Starts high, ends low.
s/step: Seconds per training step. Should be stable; a spike suggests memory pressure or a CPU-bound bottleneck.
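
If you want to track these fields programmatically (for example, to watch the train/val gap), the status lines can be parsed with a regular expression. This is a sketch that assumes the exact layout of the sample lines above; the names STATUS_RE and parse_status are illustrative and not part of train.py.

```python
import re

# Pattern matches the sample status lines shown above; adjust it if your
# train.py prints a different format (this layout is an assumption).
STATUS_RE = re.compile(
    r"step\s+(?P<step>\d+)\s*\|\s*train\s+(?P<train>[\d.]+)\s*\|\s*"
    r"val\s+(?P<val>[\d.]+)\s*\|\s*lr\s+(?P<lr>[\d.eE+-]+)\s*\|\s*"
    r"(?P<sec>[\d.]+)s/step"
)

def parse_status(line):
    """Return the numeric fields of one status line as a dict, or None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "train": float(m["train"]),
        "val": float(m["val"]),
        "lr": float(m["lr"]),
        "sec_per_step": float(m["sec"]),
    }

lines = [
    "step    500 | train 4.312 | val 4.358 | lr 3.00e-04 | 2.1s/step",
    "step   5000 | train 1.821 | val 2.094 | lr 5.00e-05 | 2.0s/step",
]
records = [parse_status(line) for line in lines]
for rec in records:
    gap = rec["val"] - rec["train"]
    print(f"step {rec['step']:>5}: val - train gap = {gap:.3f}")
```

A gap that keeps growing from one evaluation to the next is the overfitting signal described above.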

Checkpointing

Save a checkpoint periodically so you can resume training after a Colab disconnect or load a specific model state for generation experiments:

def save_checkpoint(model, optimizer, step, loss, path):
    torch.save({
        'step':            step,
        'model_state':     model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'val_loss':        loss,
        'config':          model_config,   # save hyperparams too!
    }, path)

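# Save every 500 completed steps. Counting completed steps (step + 1) means
# the last checkpoint of a 5000-step run is named ckpt_05000.pt
if (step + 1) % 500 == 0:
    save_checkpoint(model, optimizer, step + 1, val_loss,
                    f'checkpoints/ckpt_{step + 1:05d}.pt')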

Experiment

Run a short training session of just 200 steps with max_steps=200 before committing to a full run. Verify the loss decreases, the checkpoint saves, and the generated sample text is at least recognisable as attempting to produce English. Only then increase max_steps to 5000.

Your action Run a short 200-step training session using the SMALL config. Watch the training output: verify the loss decreases, the learning rate schedule is working (print lr each step), and a checkpoint is saved. Load the checkpoint back and confirm model.eval() works. Only then increase max_steps to 5000 for a full run.

Key Takeaway

The training loop is always: forward → loss → zero gradients → backward → clip → step. The hard ordering constraint on zero_grad() is that it runs before backward(): zeroing before the forward pass works equally well, but if you skip it, gradients from the previous step accumulate and your updates will be wrong. Clipping must come after backward() and before step(). Always save the model config alongside the weights so you can reconstruct the exact architecture for inference later.

Day 32

Text Generation

Load a checkpoint, understand temperature and top-k sampling, and generate text with different strategies

A trained model produces a probability distribution over the next token at each position. Sampling strategy decides how to turn that distribution into an actual token. The choice matters: pure greedy decoding (always pick the most probable token) produces repetitive, boring text. Pure random sampling produces gibberish. The parameters temperature and top_k let you navigate the tradeoff between diversity and coherence.

Loading a checkpoint

The checkpoint saved by train.py contains both model weights and the config dict. generate.py must reconstruct the same architecture before loading the weights, otherwise PyTorch will raise a shape mismatch error:

import torch
from model import GPTLanguageModel

ckpt = torch.load('checkpoints/ckpt_05000.pt', map_location='cpu')
config = ckpt['config']   # saved alongside the weights

model = GPTLanguageModel(**config)
model.load_state_dict(ckpt['model_state'])
model.eval()  # disables dropout; critical for inference
              
print(f"Loaded step {ckpt['step']}, val loss {ckpt['val_loss']:.4f}")

Temperature: controlling randomness

Before sampling, the model's raw output (logits) is divided by a temperature scalar. This changes the sharpness of the probability distribution:

# Inside model.generate()
logits = logits[:, -1, :] / temperature   # (B, vocab_size)
probs  = torch.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)

temperature = 1.0: unchanged distribution, baseline behaviour
temperature < 1.0 (e.g. 0.5): distribution sharpened; model becomes more conservative, picks high-probability tokens more often. Less creative, fewer mistakes.
temperature > 1.0 (e.g. 1.5): distribution flattened; model becomes more adventurous. More creative, more likely to produce nonsense.
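
The effect of the three settings is easy to see on a toy example. The sketch below uses plain Python rather than PyTorch, and the three logit values are made up for illustration; softmax_with_temperature is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy scores for a 3-token vocabulary (illustrative)
results = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 1.5)}
for t, probs in results.items():
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token rises as the temperature drops below 1.0 and falls as it rises above 1.0, while the ranking of the tokens never changes.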

Top-k sampling: cutting off the long tail

With a vocabulary of 50 000 tokens, even very low-probability tokens can occasionally be sampled. Top-k sampling zeroes out all but the k highest-probability tokens before sampling, eliminating outlier garbage:

def top_k_sample(logits, k, temperature=1.0):
    logits = logits / temperature
    # Keep only the top-k logits; set the rest to -inf
    top_vals, _ = torch.topk(logits, k)
    threshold = top_vals[..., -1, None]  # kth largest value
    logits = logits.masked_fill(logits < threshold, float('-inf'))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Typical values: top_k=40 for creative writing, top_k=10 for more focused outputs. Setting top_k=1 is equivalent to greedy decoding.
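
The same filtering logic can be checked without PyTorch. The following is a pure-Python sketch of the idea, not the generate.py implementation; top_k_probs and the five toy logits are assumptions for illustration.

```python
import math
import random

def top_k_probs(logits, k, temperature=1.0):
    """Return sampling probabilities with all but the k largest logits masked out."""
    scaled = [x / temperature for x in logits]
    threshold = sorted(scaled, reverse=True)[k - 1]   # kth largest value
    masked = [x if x >= threshold else float("-inf") for x in scaled]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]          # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 2.5, -1.0, 0.5]   # toy 5-token vocabulary (illustrative)

probs_k2 = top_k_probs(logits, k=2)   # only tokens 0 and 2 survive
probs_k1 = top_k_probs(logits, k=1)   # all mass on the single best token
print(probs_k2)
print(probs_k1)

# Sample one token from the filtered distribution
rng = random.Random(0)
token = rng.choices(range(len(logits)), weights=probs_k2, k=1)[0]
print(token)
```

With k=1 the whole probability mass lands on the highest-scoring token, which is why top_k=1 behaves like greedy decoding.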

Experiment

Generate five outputs from the same prompt using temperature=0.7, top_k=40, then five more with temperature=1.2, top_k=200. Compare quality and diversity. Which setting gives more coherent Shakespearean language? Which gives more surprising (if sometimes broken) outputs?

Observing quality at different training steps

Load checkpoints from different points in training and observe how output quality evolves. This concretely shows what "the model is learning" means:

for step_num in [500, 1000, 2000, 5000]:
    ckpt = torch.load(f'checkpoints/ckpt_{step_num:05d}.pt')
    model.load_state_dict(ckpt['model_state'])
    model.eval()
    generated = model.generate(prompt_ids, max_new_tokens=100)
    print(f"\n--- Step {step_num} ---")
    print(decode(generated[0].tolist()))

Step 500: Mostly repeated characters or common bigrams ("ee ee ee"); the model has learned letter frequencies but not words.
Step 1000–2000: Real English words appear. Punctuation starts to look right. No semantic coherence yet.
Step 5000+: Coherent phrases, correct iambic stress, character-appropriate dialogue; still Shakespeare-like rather than truly intelligent.

Your action Load your trained checkpoint from Day 31 and generate text with three different settings: (1) temperature=0.5, top_k=10 for conservative output, (2) temperature=0.8, top_k=40 for balanced output, and (3) temperature=1.2, top_k=200 for creative output. Compare the results: which setting produces the most coherent Shakespearean language? Save your favorite output.

Key Takeaway

Always call model.eval() before generating: forgetting this leaves dropout active, which randomly drops connections during inference and makes outputs non-deterministic in the wrong way. Temperature controls how peaked the distribution is; top-k controls which tokens are even considered. Together they give you full control over the diversity-quality tradeoff.

Day 33

Experiment: Model Size vs Quality

Calculate parameter counts, measure training speed, and compare output quality across tiny, small, and medium models

"Bigger is better" is true in ML, up to the point where you run out of compute or memory. The practical question is: how much quality do you get for each unit of compute? Today you run three model sizes side-by-side and build intuition for the parameter-count/quality tradeoff that governs every production LLM decision.

How to count parameters

The dominant cost is the embedding table and the attention + MLP blocks. A rough formula for a GPT-style model:

# Approximate parameter count
embedding_params = vocab_size * n_embd           # token embed table
position_params  = block_size * n_embd           # position embed table
per_block = (
    4 * n_embd * n_embd +   # Q, K, V, out projections in attention
    8 * n_embd * n_embd     # two linear layers in MLP (4x expansion)
)
total = embedding_params + position_params + n_layer * per_block

# Verify against PyTorch:
total_actual = sum(p.numel() for p in model.parameters())
print(f"{total_actual / 1e6:.2f} M parameters")

Three configurations to compare

Run these three configurations sequentially and record the results:

TINY = {
    'n_layer': 2, 'n_head': 2, 'n_embd': 128,
    'block_size': 128, 'batch_size': 32, 'max_steps': 3000
}   # ~0.4 M params (char-level vocab of 65), ~2 min on T4

SMALL = {
    'n_layer': 4, 'n_head': 4, 'n_embd': 256,
    'block_size': 256, 'batch_size': 64, 'max_steps': 5000
}   # ~3.2 M params, ~6 min on T4

MEDIUM = {
    'n_layer': 6, 'n_head': 6, 'n_embd': 384,
    'block_size': 256, 'batch_size': 64, 'max_steps': 5000
}   # ~10.7 M params, ~8 min on T4
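
As a quick check, the rough formula from the previous section can be applied to all three configurations. This sketch assumes a character-level vocabulary of 65 symbols and ignores biases, LayerNorm weights, and the output head, so expect the value reported by sum(p.numel() ...) on a real model to differ somewhat. The helper name approx_params is illustrative.

```python
def approx_params(n_layer, n_embd, block_size, vocab_size):
    """Approximate GPT parameter count (ignores biases, LayerNorm, and the output head)."""
    embedding = vocab_size * n_embd                 # token embedding table
    position = block_size * n_embd                  # position embedding table
    per_block = 4 * n_embd * n_embd + 8 * n_embd * n_embd
    return embedding + position + n_layer * per_block

VOCAB_SIZE = 65  # assumption: character-level Shakespeare vocabulary

configs = {
    "TINY":   dict(n_layer=2, n_embd=128, block_size=128),
    "SMALL":  dict(n_layer=4, n_embd=256, block_size=256),
    "MEDIUM": dict(n_layer=6, n_embd=384, block_size=256),
}

counts = {name: approx_params(vocab_size=VOCAB_SIZE, **cfg)
          for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name:<6} {n:>12,} params ({n / 1e6:.2f} M)")
```

Swapping VOCAB_SIZE for 50257 (GPT-2 BPE) shows how quickly the embedding table comes to dominate a small model.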

Interactive companion tool

See how your model compares to the Chinchilla-informed scaling frontier, and find the optimal model size for your training budget.

chinchilla.html ↗

Memory and throughput

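GPU memory is consumed by four things: model weights, gradients (one tensor per parameter), activations (all intermediate tensors kept for backprop), and optimizer state (AdamW stores two extra tensors per parameter). A rough estimate:

# Memory estimate in GB (float32)
weights_gb    = total_params * 4 / 1e9     # 4 bytes per float32
gradients_gb  = weights_gb                 # one .grad tensor per parameter
optimizer_gb  = weights_gb * 2             # AdamW first- and second-moment states
# Lower bound: one (batch, block, n_embd) tensor per layer. Real usage is
# several times higher (MLP hidden states, attention score matrices).
activation_gb = batch_size * block_size * n_embd * n_layer * 4 / 1e9

# MEDIUM config on a T4 (16 GB card, about 15 GB usable)
# weights: ~0.04 GB, gradients: ~0.04 GB, optimizer: ~0.09 GB,
# activations: at least ~0.15 GB by this formula
# Total: well within the T4's memory, so there is room to increase batch_size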

If you get an OutOfMemoryError, the first levers to pull are: reduce batch_size by half, then reduce block_size, then reduce n_embd.
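
To see which lever matters most, the estimate can be wrapped in a small function and re-run with different settings. This is a sketch under the same rough assumptions as the estimate above, and it also counts one gradient tensor per parameter; estimate_memory_gb is a hypothetical helper, and its activation term is a lower bound rather than a prediction.

```python
def estimate_memory_gb(total_params, batch_size, block_size, n_embd, n_layer,
                       bytes_per_value=4):
    """Rough float32 training-memory estimate in GB.

    The activation term counts one (batch, block, n_embd) tensor per layer,
    so treat it as a lower bound.
    """
    weights = total_params * bytes_per_value / 1e9
    gradients = weights                          # one gradient per parameter
    optimizer = 2 * weights                      # AdamW keeps two moment tensors
    activations = batch_size * block_size * n_embd * n_layer * bytes_per_value / 1e9
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations,
        "total": weights + gradients + optimizer + activations,
    }

# MEDIUM config; 10.7e6 is the parameter count quoted for MEDIUM
est = estimate_memory_gb(total_params=10.7e6, batch_size=64, block_size=256,
                         n_embd=384, n_layer=6)
for name, gb in est.items():
    print(f"{name:<12} {gb:.3f} GB")

# Halving batch_size halves only the activation term
half = estimate_memory_gb(10.7e6, 32, 256, 384, 6)
print(half["activations"] / est["activations"])
```

Weights, gradients, and optimizer state do not depend on batch_size or block_size, which is why those two are the first levers to pull when memory runs out.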

Experiment

After training all three sizes, generate the same 100-token sequence from each with temperature=0.8, top_k=40. Score them subjectively on: (1) correct English words, (2) correct punctuation, (3) stylistic coherence with Shakespeare. Does the MEDIUM model's quality justify its longer training time relative to SMALL?

Your action Calculate the parameter count for TINY, SMALL, and MEDIUM configurations using the formula provided. Then train all three sizes for 3000 steps each (or use checkpoints from previous runs). Generate 100 tokens from each with identical settings and compare: (1) correct English words, (2) correct punctuation, (3) stylistic coherence with Shakespeare. Does the MEDIUM model justify its longer training time?

Key Takeaway

Parameter count scales as O(n_embd² × n_layer). Doubling n_embd roughly quadruples the parameter count. In practice the SMALL model (about 3.2 M params by the formula above) achieves most of the quality gains of MEDIUM (10.7 M params) for the Shakespeare task because the dataset is small enough that even SMALL can model it well. Bigger models pay off on larger, more diverse datasets.

Day 34

Experiment: Context Length & Learning Rate

Understand the quadratic cost of attention, learning rate warmup and cosine decay, and how to read loss curves

Two hyperparameters shape training more than any other: block_size (how many tokens the model sees at once) and learning rate (how aggressively it updates weights). Both have non-linear effects: too small and you leave capability on the table; too large and training destabilises. Today you measure these effects directly by running targeted experiments and reading the resulting loss curves.

The quadratic cost of context length

Self-attention computes a score for every token pair in the context. For a sequence of length T, that is T² pairs per head. Doubling block_size quadruples the attention-score computation and the memory for the attention matrices, while the other activations grow linearly, so total memory at these sizes roughly doubles.

# Context length experiment (hold model size constant)
for ctx in [64, 128, 256, 512]:
    config = {**SMALL, 'block_size': ctx, 'max_steps': 3000}
    train_and_log(config, tag=f'ctx_{ctx}')

# Expected val loss at 3000 steps (approximately):
# ctx=64:  2.81  (short context, misses long-range dependencies)
# ctx=128: 2.54
# ctx=256: 2.31  (sweet spot for Shakespeare)
# ctx=512: 2.28  (marginal gain, 2× memory cost)

For the Shakespeare corpus, passages rarely require more than ~200 characters of context to make sense. Beyond block_size=256 the improvement in validation loss becomes small relative to the memory and compute cost.
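
The quadratic growth is easy to tabulate for the block sizes used in the experiment. A minimal sketch; attention_pairs is an illustrative helper that counts score entries only and says nothing about the rest of the model.

```python
def attention_pairs(block_size, n_head=1):
    """Number of query-key score entries computed per layer for one sequence."""
    return n_head * block_size * block_size

base = attention_pairs(64)
table = {ctx: attention_pairs(ctx) for ctx in (64, 128, 256, 512)}
for ctx, pairs in table.items():
    print(f"block_size={ctx:>3}: {pairs:>7,} pairs per head "
          f"({pairs // base}x the block_size=64 cost)")
```

A causal mask leaves roughly half of these entries unused, but the growth is still quadratic in block_size.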

Learning rate warmup and cosine decay

A fixed learning rate is rarely optimal. The standard schedule used in modern LLMs has two phases:

Warmup (steps 0 → warmup_steps): LR increases linearly from 0 to lr_max. Early in training, weights are random and gradients are large and noisy. A high LR at step 0 would cause chaotic updates. Warmup lets the model stabilise first.
Cosine decay (steps warmup_steps → max_steps): LR decreases following a cosine curve from lr_max to lr_min. As training converges the model is close to a good solution; smaller steps make finer adjustments without overshooting.

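import math

def get_lr(step, warmup_steps, max_steps, lr_max, lr_min):
    if step < warmup_steps:
        return lr_max * step / warmup_steps         # linear warmup
    # Fraction of the decay phase completed, clamped so that steps past
    # max_steps stay at lr_min instead of climbing back up the cosine
    decay = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    cosine = 0.5 * (1 + math.cos(math.pi * decay))
    return lr_min + cosine * (lr_max - lr_min)   # cosine decay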

# Apply at each step:
lr = get_lr(step, warmup_steps=100, max_steps=5000,
            lr_max=3e-4, lr_min=3e-5)
for g in optimizer.param_groups:
    g['lr'] = lr
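
Before wiring the schedule into a training run, it is worth printing a few values to confirm the shape. The sketch below restates get_lr so it runs on its own; the clamp on the decay fraction is an added safeguard for steps past max_steps, and the chosen sample steps are arbitrary.

```python
import math

def get_lr(step, warmup_steps, max_steps, lr_max, lr_min):
    if step < warmup_steps:
        return lr_max * step / warmup_steps          # linear warmup
    # Clamp so steps past max_steps stay at lr_min (an added safeguard)
    decay = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    cosine = 0.5 * (1 + math.cos(math.pi * decay))
    return lr_min + cosine * (lr_max - lr_min)       # cosine decay

# Start, mid-warmup, peak, midpoint of the decay, and final step
schedule = {s: get_lr(s, warmup_steps=100, max_steps=5000, lr_max=3e-4, lr_min=3e-5)
            for s in (0, 50, 100, 2550, 5000)}
for s, lr in schedule.items():
    print(f"step {s:>4}: lr = {lr:.2e}")
```

The rate starts at zero, peaks at lr_max when warmup ends, passes the halfway point between lr_max and lr_min in the middle of the decay phase, and finishes at lr_min.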

Learning rate range experiment

Run three training jobs for 2000 steps each and observe the loss curves:

# Conservative: slow but stable
python train.py --lr-max 1e-4 --warmup 50  --tag lr_conservative

# Recommended: fast convergence, stable
python train.py --lr-max 3e-4 --warmup 100 --tag lr_default

# Aggressive: fastest early drop, risk of instability
python train.py --lr-max 1e-3 --warmup 200 --tag lr_aggressive

Too low (1e-4): Loss decreases very slowly. Needs 2–3× more steps to reach the same quality. Safe but inefficient.
Sweet spot (3e-4): Fast early progress, clean convergence. The loss curve is smooth with no spikes.
Too high (1e-3): Loss drops fast initially but may spike or plateau early as the optimizer overshoots the minimum. With warmup, this is usually manageable.

Experiment

Try a 2×2 grid: block_size ∈ {128, 256} × lr_max ∈ {1e-4, 3e-4}. Train each for 3000 steps and record the final validation loss. Which combination gives the best loss? Does the optimal learning rate change when you increase the context length?

Your action Run the 2×2 grid experiment: block_size ∈ {128, 256} × lr_max ∈ {1e-4, 3e-4}. Train each configuration for 3000 steps and record the final validation loss. Which combination gives the best loss? Does the optimal learning rate change when you increase the context length?

Interactive companion tool

Visualize learning rate schedules interactively. Experiment with warmup length, peak LR, and decay type to understand their effect on training.

lr-schedule.html ↗

Key Takeaway

Attention cost is O(T²) in the context length T: increasing block_size from 128 to 512 uses 16× more compute for the attention layers alone. Always use a learning rate schedule: warmup prevents early instability; cosine decay squeezes out the last fraction of quality at the end of training. For small datasets like Shakespeare, lr_max=3e-4 with 100 warmup steps is a reliable default.

Day 35

Monitoring Training & Next Steps

Plot loss curves with matplotlib, diagnose training problems, understand perplexity, and preview Week 6

A loss number is useful; a loss curve is invaluable. Plotting both train and validation loss over time lets you see when things went wrong, not just that they did. Day 35 covers visualisation, the perplexity metric, and a systematic approach to diagnosing common training failures, giving you everything you need to submit your best model to the Week 6 competition.

Logging losses during training

Save a JSON log during training so you can plot it later without re-running anything:

import json

log = []

for step in range(max_steps):
    # ... training step ...
    if step % eval_interval == 0:
        train_loss = estimate_loss('train')
        val_loss   = estimate_loss('val')
        # float() converts 0-dim tensors so json.dump can serialise the log
        log.append({'step': step, 'train': float(train_loss), 'val': float(val_loss)})

# Save after training finishes
with open('checkpoints/loss_log.json', 'w') as f:
    json.dump(log, f)

Use a separate estimate_loss() function (not the batch loss) that averages over 200 random batches on each split. This gives a much more stable estimate than the noisy single-batch loss from the training loop.

Plotting with matplotlib

import json, matplotlib.pyplot as plt

with open('checkpoints/loss_log.json') as f:
    log = json.load(f)

steps      = [e['step']  for e in log]
train_loss = [e['train'] for e in log]
val_loss   = [e['val']   for e in log]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(steps, train_loss, label='Train loss', lw=2)
ax.plot(steps, val_loss,   label='Val loss',   lw=2, linestyle='--')
ax.set_xlabel('Step'); ax.set_ylabel('Cross-entropy loss')
ax.set_title('Training dynamics')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout(); plt.savefig('loss_curve.png', dpi=150)
plt.show()

What the curve should look like, and how to diagnose problems

A healthy run has both curves decreasing together and levelling off at similar values. Here are the four failure patterns to watch for:

Loss spikes to NaN: Learning rate too high, or a bad batch slipped through. Add gradient clipping (max_norm=1.0) and lower lr_max by 3×.
Both losses plateau early (high): Model is underfitting. Either increase model size, increase max_steps, or check that the data pipeline is working correctly.
Train loss falls, val loss rises: Overfitting. The model has memorised training examples. Fix by increasing dropout (try 0.2 → 0.3), reducing model size, or augmenting the training data.
Val loss oscillates wildly: eval_batches count is too low; increase to 200+ batches so the estimate is stable enough to be informative.
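
The four patterns can be turned into rough automated checks on the JSON log. The sketch below is one possible set of heuristics, not part of train.py; the function name diagnose and both thresholds are assumptions chosen for illustration, and a "plateau" result still needs a human to judge whether the loss levelled off too high.

```python
import math

def diagnose(log, rise_tol=0.05, flat_tol=0.01):
    """Classify a loss log (list of {'step', 'train', 'val'} dicts) with simple heuristics."""
    train = [e["train"] for e in log]
    val = [e["val"] for e in log]
    if not all(math.isfinite(x) for x in train + val):
        return "nan_or_inf"
    if len(val) < 4:
        return "too_short_to_tell"
    diffs = [b - a for a, b in zip(val, val[1:])]
    # Count direction changes in the validation curve
    flips = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    if flips >= len(diffs) // 2 and max(abs(d) for d in diffs) > rise_tol:
        return "val_oscillating"
    # Validation has risen from its best value while training loss still falls
    if val[-1] > min(val) + rise_tol and train[-1] < train[-2]:
        return "overfitting"
    # Both curves have barely moved over the last two evaluations
    if abs(val[-1] - val[-3]) < flat_tol and abs(train[-1] - train[-3]) < flat_tol:
        return "plateau"
    return "healthy"

def make_log(train, val):
    """Build a log in the same shape as loss_log.json from two lists of losses."""
    return [{"step": 500 * i, "train": t, "val": v}
            for i, (t, v) in enumerate(zip(train, val))]

healthy = make_log([4.3, 3.6, 2.9, 2.3, 1.8], [4.4, 3.7, 3.0, 2.5, 2.1])
overfit = make_log([3.0, 2.0, 1.5, 1.0, 0.6], [3.1, 2.3, 2.1, 2.3, 2.6])
print(diagnose(healthy), diagnose(overfit))
```

Treat the output as a prompt to look at the plot, not as a replacement for it.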

Perplexity: an interpretable metric

Cross-entropy loss is convenient for training but hard to interpret: is a loss of 2.1 good or bad? Perplexity converts loss into an intuitive number: it measures the average number of tokens the model is "confused between" at each position.

import math

perplexity = math.exp(val_loss)

# Examples:
# val_loss = 4.0  →  perplexity ≈ 55   (confused between ~55 tokens)
# val_loss = 2.5  →  perplexity ≈ 12   (confused between ~12 tokens)
# val_loss = 1.5  →  perplexity ≈ 4.5  (almost certain, ~4-5 candidates)

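# GPT-2 small reports ~37.5 zero-shot perplexity on WikiText-103 (per BPE token,
# so not directly comparable to character-level numbers)
# A uniform random baseline has perplexity equal to vocab_size ≈ 65 (char-level Shakespeare)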

A character-level model on Shakespeare with val_loss ≈ 1.5 (perplexity ≈ 4.5) is performing well: from 65 possible characters, it narrows it down to about 4 or 5 at each step.
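
The conversion runs in both directions, which gives a useful sanity check: a model that guesses uniformly over the vocabulary has perplexity equal to the vocabulary size. A minimal sketch, assuming the loss is measured in nats (natural log, as PyTorch's cross-entropy reports it); the helper names are illustrative.

```python
import math

def loss_to_perplexity(loss):
    """Perplexity is exp(cross-entropy), with the loss measured in nats."""
    return math.exp(loss)

def perplexity_to_loss(perplexity):
    """Inverse conversion: the loss that corresponds to a given perplexity."""
    return math.log(perplexity)

VOCAB_SIZE = 65  # character-level Shakespeare

# A uniform guess over 65 characters corresponds to a loss of ln(65)
uniform_loss = perplexity_to_loss(VOCAB_SIZE)
print(f"uniform baseline: loss {uniform_loss:.3f}, perplexity {VOCAB_SIZE}")

for loss in (4.0, 2.5, 1.5):
    print(f"val_loss {loss:.1f} -> perplexity {loss_to_perplexity(loss):.1f}")
```

An untrained character-level model should therefore start near a loss of about 4.17; a first-step loss far above that usually points to an initialisation or data problem.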

Final Challenge

Combine everything from this week: train your best model configuration (chosen from Days 33–34 experiments), log the full loss curve, compute perplexity at the end, and generate a 500-token sample. Save the checkpoint. This is your Week 6 competition entry: lowest validation perplexity wins.

Interactive companion tool

Convert between cross-entropy loss and perplexity. See where your model sits on the scale from random to very good, with benchmark references for common checkpoints.

perplexity.html ↗

Your action Log your training loss to a JSON file and plot the curve with matplotlib. Identify which of the four failure patterns (NaN spike, early plateau, overfitting, wild oscillation) best describes your run, or confirm you have a healthy curve. Compute perplexity from your final validation loss. Then complete the Final Challenge: train your best model configuration, save the checkpoint, and prepare it for the Week 6 competition.

Week 5 Summary

You now know how to structure a complete LLM training project, set up a GPU environment, run and interpret a training loop, generate text with controlled randomness, and systematically experiment with model size, context length, and learning rate. Week 6 puts it all together: optimizing your model with better data, architecture, and generation settings, then extending your skills toward fine-tuning and the broader LLM landscape.

6WK

Putting It All Together

Optimize your model, fine-tune on custom data, and chart your path forward

Week 6 Focus: Expanding your experience

Day 36

Better Data

Why data quality is the highest-leverage variable, and how to curate aggressively

Data quality is the single highest-leverage variable when starting out. A model trained on 1MB of clean, well-curated text will consistently outperform one trained on 10MB of noisy, mixed-quality data, because bad patterns get memorized just as efficiently as good ones. Before you touch the model architecture, get your data right.

Why quality beats quantity

Every token in your training set is a teaching example. If 20% of your data is headers, metadata, encoding errors, or off-domain text, your model spends 20% of its capacity learning those patterns. The gradient doesn't distinguish signal from noise; it optimizes for whatever is in the file.

10MB raw scraped text

Headers, footers, navigation menus, repeated boilerplate, encoding artifacts, mixed languages. Model learns to generate any of these with equal probability.

1MB curated clean text

Consistent format, high semantic density, no noise. Model's capacity is entirely spent on the patterns you actually want.

Where to find good data

The best sources for small-scale training are collections with consistent format, professional editorial quality, and a clear domain:

Project Gutenberg: public domain literature; consistent prose, well-formatted, easy to filter by genre
Wikipedia dumps: factual prose with clear structure; good for general language modeling
GitHub repositories: code-only files in a single language; excellent consistency
Domain-specific archives: arXiv abstracts, news APIs, lyrics datasets; curated by the community

Aim for a single domain. A model trained on Shakespeare + Python code + Twitter will be worse at each than one trained on just Shakespeare.

A practical cleaning pipeline

Most datasets need the same four cleaning steps, in this order:

import re

def clean_corpus(text):
    # 1. Strip non-content lines (headers, footers, page numbers).
    #    Blank lines are kept: they mark the document boundaries used in step 3.
    lines = []
    for l in text.splitlines():
        if not l.strip():
            lines.append('')
        elif not looks_like_metadata(l):
            lines.append(l)

    # 2. Normalize whitespace (collapse multiple blank lines)
    text = re.sub(r'\n{3,}', '\n\n', '\n'.join(lines))

    # 3. Filter by document length (remove stubs and walls of text)
    docs = text.split('\n\n')
    docs = [d for d in docs if 50 < len(d.split()) < 2000]

    # 4. Deduplicate (exact match is enough at small scale)
    seen = set()
    unique = []
    for d in docs:
        if d not in seen:      # compare whole documents, not just a prefix
            seen.add(d)
            unique.append(d)

    return '\n\n'.join(unique)

def looks_like_metadata(line):
    line = line.strip()
    return (len(line) < 4 or
            line.startswith('***') or
            (line.isupper() and len(line) < 40))

Benchmark

Run this pipeline on a raw text download. Print token count before and after. If you're keeping more than 85% of tokens, your filter is probably too loose. If you're keeping less than 40%, it's too aggressive.

Your action Download a raw text dataset in a domain you care about (Project Gutenberg works well). Write a cleaning script that prints token count before and after. Then train a small model (6 layers, 384 dim) for 5 epochs on the raw data, and again on the cleaned data. Compare validation loss. The gap is your data quality dividend.

Key takeaway

Curating your dataset is the highest-impact thing you can do before touching model size, learning rate, or architecture. Garbage in, garbage out, and at small scale, "garbage" means anything that isn't your target domain.

Day 37

Bigger Models

How to match model capacity to data size, and why getting it wrong in either direction costs you

Bigger models can learn more expressive patterns, but only when the dataset is large enough to use that capacity. An 85M parameter model on 1MB of data doesn't learn better; it memorizes faster. Model size and data size need to grow together, and the relationship between them is more principled than it looks.

What drives parameter count

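Two hyperparameters control the vast majority of the transformer-block parameter count: n_embd (embedding dimension) and n_layer (transformer blocks). n_head (attention heads) only splits n_embd across heads and does not change the count, while vocab_size sets the size of the embedding table. The dominant term scales as roughly 12 × n_embd² × n_layer: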

def count_params(n_embd, n_layer, vocab_size=50257):
    embed = vocab_size * n_embd + 1024 * n_embd
    per_block = 4 * n_embd * n_embd + 8 * n_embd * n_embd
    return embed + per_block * n_layer

print(count_params(384, 6))    # ~30M (about 10.6M in blocks + 19.7M in embeddings)
print(count_params(512, 8))    # ~51M
print(count_params(768, 12))   # ~124M (GPT-2 small)

Interactive companion tool

Experiment with model dimensions and see how n_embd, n_layer, and d_ff drive parameter count. Compare tiny toy models to GPT-2 scale architectures.

parameter-counter.html ↗

The Chinchilla insight: most models are undertrained

The 2022 Chinchilla paper showed that for a given compute budget, a smaller model trained on more data beats a larger model trained for fewer steps. The optimal ratio is roughly 20 tokens of training data per parameter:

10M parameter model → ~200M tokens (~800MB of English text at ~4 characters per BPE token) to train optimally
25M parameter model → ~500M tokens (~2GB text)
85M parameter model → ~1.7B tokens (~6.8GB text)

At small scale (1–20MB datasets), this means you should use a much smaller model than you might think. The right move is to get more data, not a bigger model.
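
The rule of thumb is simple enough to script in both directions. This sketch assumes about 4 characters per token (typical for English text with a BPE tokenizer) and a single pass over the data; the constants and helper names are illustrative, not values from the Chinchilla paper beyond the 20:1 ratio quoted above.

```python
TOKENS_PER_PARAM = 20      # rule of thumb quoted above
CHARS_PER_TOKEN = 4        # assumption: typical English text with a BPE tokenizer

def chinchilla_tokens(n_params):
    """Training tokens suggested by the ~20 tokens-per-parameter rule of thumb."""
    return TOKENS_PER_PARAM * n_params

def max_params_for_corpus(corpus_bytes, chars_per_token=CHARS_PER_TOKEN):
    """Largest model the rule of thumb supports for one pass over a corpus."""
    tokens = corpus_bytes / chars_per_token
    return tokens / TOKENS_PER_PARAM

for n_params in (10e6, 25e6, 85e6):
    tokens = chinchilla_tokens(n_params)
    print(f"{n_params / 1e6:>4.0f}M params -> {tokens / 1e6:,.0f}M tokens "
          f"(~{tokens * CHARS_PER_TOKEN / 1e6:,.0f} MB of text)")

# A 5 MB curated corpus (about 1.25M BPE tokens)
small_corpus_params = max_params_for_corpus(5e6)
print(f"{small_corpus_params:,.0f} params")
```

Small experiments usually run many epochs over the same data, so read the result as a sense of scale rather than a hard limit.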

Interactive companion tool

Enter any model size and see where it falls on the Chinchilla optimal frontier. Find the ideal training token count or the optimal model size for your compute budget.

chinchilla.html ↗

Practical configurations

Start small and scale up only when val loss has plateaued with more training steps:

# ~10M non-embedding parameters; good for 1-5MB curated datasets
config_small  = dict(n_layer=6,  n_head=6,  n_embd=384, block_size=256)

# ~25M non-embedding parameters; good for 5-20MB datasets
config_medium = dict(n_layer=8,  n_head=8,  n_embd=512, block_size=512)

# ~85M non-embedding parameters (GPT-2 Small); needs 20MB+ of clean data
config_large  = dict(n_layer=12, n_head=12, n_embd=768, block_size=1024)

If val loss is still dropping at the end of training, you need more data or epochs, not a bigger model. If val loss diverges from train loss early, you need more data or a smaller model.

Your action Train config_small and config_medium on your curated dataset for the same number of steps. Plot both train and val loss curves on the same graph. Which model is still improving at the end? Which has started to overfit? Use this to decide whether to scale up or get more data first.

Key takeaway

Model capacity and data volume are coupled: optimize one without the other and you're wasting compute. When in doubt, start smaller and scale up after you've confirmed your data pipeline is solid.

Day 38

Better Tokenizer

How tokenizer choice determines sequence length, attention cost, and training speed

The tokenizer is often treated as an afterthought, but it determines sequence length, which determines attention cost, which scales as O(n²). Switching from character-level to BPE on the same dataset can cut sequence length by 4–5×, reducing attention compute by 16–25×. That's a free performance win before you change a single training hyperparameter.

The sequence length problem

Transformer attention is quadratic in sequence length. Double the sequence length and attention takes 4× as long and uses 4× the memory. This means the tokenizer choice has a direct, measurable impact on training speed:

Character-level on "Hello, world!"

13 tokens. Vocabulary: ~100 chars. A 5MB corpus → ~5M tokens → very long sequences. Simple to implement.

GPT-2 BPE on "Hello, world!"

4 tokens. Vocabulary: 50,257. A 5MB corpus → ~1.25M tokens. 4× shorter sequences = 16× less attention compute.
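
The comparison above translates directly into training steps. The sketch below assumes the 5MB corpus from the comparison, about 4 characters per BPE token, and non-overlapping windows; steps_per_epoch is an illustrative helper, and real counts depend on how get_batch samples.

```python
def steps_per_epoch(n_tokens, batch_size, block_size):
    """Steps needed to see n_tokens once with non-overlapping windows."""
    tokens_per_step = batch_size * block_size
    return -(-n_tokens // tokens_per_step)   # ceiling division

corpus_chars = 5_000_000                 # 5 MB corpus from the comparison above
token_counts = {
    "char-level": corpus_chars,          # one token per character
    "gpt2-bpe":   corpus_chars // 4,     # assumption: ~4 characters per token
}

results = {name: steps_per_epoch(n, batch_size=32, block_size=256)
           for name, n in token_counts.items()}
for name, steps in results.items():
    print(f"{name:<10} {token_counts[name]:>9,} tokens -> {steps:,} steps per epoch")
```

At the same block_size, each BPE window also covers about four times as much text, so the model sees more context per step as well as needing fewer steps per pass.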

Using a pretrained BPE tokenizer

For most projects, use a pretrained tokenizer rather than training your own. The GPT-2 BPE tokenizer is compact, well-tested, and handles Unicode cleanly:

import tiktoken
import numpy as np

enc = tiktoken.get_encoding("gpt2")  # 50,257 vocab size

with open("corpus.txt") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"Characters: {len(text):,}")
print(f"Tokens:     {len(tokens):,}")
print(f"Ratio:      {len(text)/len(tokens):.1f} chars per token")

# Save as binary for fast DataLoader access
arr = np.array(tokens, dtype=np.uint16)
arr.tofile("corpus.bin")

Benchmark

A typical English text compresses to ~4 characters per token. Code is closer to 3. If you're getting below 2, your text may have unusual characters or encoding issues worth investigating before training.

Training a custom BPE tokenizer

For specialized domains (chemistry notation, a specific programming language, musical scores), a custom BPE tokenizer learns the most common subwords in your actual corpus and can outperform a general-purpose one:

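from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# Byte-level BPE needs the matching decoder to map tokens back to text;
# without it the round-trip check below fails
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=5000,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
    # Seed the vocabulary with all 256 byte symbols so no input is out-of-vocabulary
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Verify round-trip integrity
enc = tokenizer.encode("Hello, world!")
assert tokenizer.decode(enc.ids) == "Hello, world!"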

Your action Tokenize your corpus with both character-level and GPT-2 BPE. For each, print: vocabulary size, total token count, and average sequence length at block_size=256. Then calculate how many training steps per epoch each tokenizer requires at batch_size=32. The step-count ratio is how much faster BPE training would be for the same number of passes over your data.
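As a starting point for the step-count part of this exercise, here is a small sketch. It assumes non-overlapping blocks and drops the final partial batch; `steps_per_epoch` is a hypothetical helper, and the token counts are the rough 5MB figures quoted earlier, not measurements.

```python
def steps_per_epoch(n_tokens, block_size=256, batch_size=32):
    """Optimizer steps needed to see every token once (partial batch dropped)."""
    tokens_per_step = block_size * batch_size
    return n_tokens // tokens_per_step


char_steps = steps_per_epoch(5_000_000)  # ~5M tokens at character level
bpe_steps = steps_per_epoch(1_250_000)   # ~1.25M tokens with GPT-2 BPE
print(char_steps, bpe_steps)
print(f"step ratio: {char_steps / bpe_steps:.2f}")
```

Replace the two token counts with the values you measure on your own corpus.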

Key takeaway

For datasets over 5MB, switching from character-level to BPE is one of the easiest wins available. The compression ratio translates directly to shorter sequences, less attention compute, and faster training, with no change to the model architecture.

Day 39

Training Tweaks

Squeeze the last performance out of training: context length, dropout, early stopping, and learning rate

The architecture is set, the data is clean. Now extract the last performance with training discipline. Each of these tweaks is individually modest, but they compound: the difference between a carelessly-trained model and a carefully-trained one on the same data and architecture is often 10–20% in validation loss.

Context window (block_size)

A larger context window lets the model condition on more history, which helps with long-range structure. But it scales attention compute quadratically, so it costs real training time. Set it based on your actual document lengths:

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # swap in your own tokenizer if you use a different one

# Measure your corpus in tokens (the unit block_size is counted in) before choosing block_size
doc_lengths = [len(enc.encode(doc)) for doc in your_documents]
print(f"Median doc length:    {np.median(doc_lengths):.0f} tokens")
print(f"95th percentile:      {np.percentile(doc_lengths, 95):.0f} tokens")
# Set block_size near the 95th percentile: no point paying for unused context

# Rough guidelines:
# block_size=256  → 1-5MB datasets, short documents
# block_size=512  → 5-20MB datasets, paragraph-length documents
# block_size=1024 → GPT-2 setting, needs 20MB+ data

Dropout for regularization

Dropout randomly zeros activations during training, forcing the model to learn redundant representations and reducing overfitting. Use a low rate, since high dropout can prevent learning entirely:

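import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # ... self.transformer (wte, wpe, blocks) is defined as in the earlier architecture days
        self.drop = nn.Dropout(config.dropout)  # 0.1 is a safe default

    def forward(self, idx):
        # Position indices 0..T-1 for a batch of shape (B, T)
        pos = torch.arange(idx.size(1), device=idx.device)
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.drop(tok_emb + pos_emb)  # Applied after embedding sum
        # ... rest of forward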

Always call model.eval() before generating: this disables dropout, so the forward pass is deterministic and uses the full model capacity. Any remaining randomness then comes only from the sampling step.

Early stopping: save the best checkpoint

Training to the final step is almost never optimal. Val loss improves, plateaus, then rises (overfitting). Track it explicitly and save the best checkpoint:

best_val_loss = float('inf')
patience, max_patience = 0, 5

for step in range(max_steps):
    train_step(model, optimizer, batch)

    if step % eval_interval == 0:
        val_loss = evaluate(model, val_data)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience = 0
            torch.save(model.state_dict(), 'checkpoint_best.pt')
        else:
            patience += 1
            if patience >= max_patience:
                print(f"Early stopping at step {step}")
                break

# Always load best checkpoint before generating
model.load_state_dict(torch.load('checkpoint_best.pt'))

Learning rate selection

The learning rate is the most sensitive hyperparameter. Use warmup for the first 1–5% of steps, then cosine decay to 10% of the peak LR:

Small models (~10M): peak LR 3e-4 to 5e-4
Medium models (~25M): peak LR 1e-4 to 3e-4
Large models (~85M): peak LR 6e-5 to 1e-4
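The schedule described above (warmup, then cosine decay to 10% of the peak) can be sketched as a single function. This is an illustration under stated assumptions: linear warmup, a 3% warmup fraction, and the name `get_lr` are choices made here, not something the course prescribes.

```python
import math


def get_lr(step, max_steps, peak_lr=3e-4, warmup_frac=0.03, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * max_steps))
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero LR.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    min_lr = min_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 30, 500, 1000):
    print(step, f"{get_lr(step, max_steps=1000):.2e}")
```

In a training loop, the returned value would typically be written into each optimizer parameter group's learning rate before every step.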

Sanity check

At step 0, your loss should be approximately log(vocab_size), using the natural log: for a 50k BPE vocab that's ~10.8; for char-level with 100 chars it's ~4.6. If your initial loss is much lower, you may have a data leak. If it's much higher, check your loss computation.
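The two reference values follow from the cross-entropy of a uniform guess over the vocabulary. A quick check, where `expected_initial_loss` is an illustrative helper name:

```python
import math


def expected_initial_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)


print(f"{expected_initial_loss(50257):.2f}")  # 10.82
print(f"{expected_initial_loss(100):.2f}")    # 4.61
```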

Your action Add train and val loss logging to your training loop, evaluating every 200 steps. After training, plot both curves on the same graph. Identify the exact step where val loss stops improving. Then run training again with early stopping that checkpoints at that step. Compare text generated from the final step vs the best checkpoint; you should see a noticeable quality difference.

Interactive companion tool

Experiment with learning rate schedules to see how warmup, decay, and peak LR affect training dynamics and convergence.

lr-schedule.html ↗

Key takeaway

A well-tuned training loop consistently beats a bigger model trained carelessly on the same data. Context window, dropout, early stopping, and learning rate are not cosmetic: each has a measurable effect on the final model.

Day 40

Generation Tricks

Temperature, top-k, prompting, and how to develop taste for your model's output

The trained model is only half the equation. Sampling strategy determines whether a capable model produces coherent, interesting output or repetitive noise. The same model weights with different generation settings can feel like completely different models, and this is entirely in your control.

Temperature: scaling the probability distribution

Dividing logits by temperature before softmax changes how "peaked" the distribution is. Low temperature concentrates probability on likely tokens (focused, repetitive). High temperature flattens it (creative, incoherent). Temperature is not a preference setting; it changes the math:

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=0.8, top_k=40):
    model.eval()
    x = torch.tensor([prompt_ids], dtype=torch.long)

    for _ in range(max_new_tokens):
        logits = model(x[:, -model.config.block_size:])[0, -1, :]
        logits = logits / temperature  # <1.0 sharpens, >1.0 flattens

        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[-1]] = float('-inf')

        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_token.unsqueeze(0)], dim=1)

    return x[0].tolist()

Typical ranges: 0.6–0.8 for structured text (code, templates), 0.8–1.0 for balanced creative text, 1.0–1.3 for maximum variety (risks incoherence).
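To see the effect of these ranges numerically, the sketch below applies temperature to a toy logit vector and reports the entropy of the result. It uses only the standard library; the logits are made-up values, and `softmax_with_temperature` and `entropy` are illustrative helper names.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed stably in pure Python."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not from a real model
for t in (0.5, 1.0, 1.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Entropy rises with temperature: the same logits yield a sharper distribution at 0.5 and a flatter one at 1.3.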

Top-k: eliminating the long tail

Even at reasonable temperatures, the lowest-probability tokens in a 50k vocabulary are garbage: rare Unicode, mangled words, encoding artifacts. Top-k filtering zeros out everything outside the top-k candidates before sampling:

# top_k=1:   greedy decoding, always picks the most likely token
# top_k=10:  very focused, low variety
# top_k=40:  the value used for GPT-2's published samples, good balance for most text
# top_k=200: broad sampling, approaches no filtering

# For char-level vocab (~100 tokens): top_k=5-10 makes sense
# For BPE vocab (~50k tokens): top_k=40-100 is typical

Top-k and temperature work together: top-k eliminates garbage tokens; temperature tunes creativity within the remaining candidates.
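A standard-library sketch of that interaction: scale by temperature, keep the k largest logits, and renormalize over the survivors. The logits are toy values and `top_k_probs` is an illustrative name; ties at the cutoff are kept, mirroring the `logits < v[-1]` comparison in the generate function above.

```python
import math


def top_k_probs(logits, k, temperature=1.0):
    """Scale by temperature, keep the k largest logits, renormalize."""
    scaled = [x / temperature for x in logits]
    k = min(k, len(scaled))
    threshold = sorted(scaled, reverse=True)[k - 1]  # k-th largest logit
    m = max(scaled)
    # Tokens below the threshold get probability exactly 0.
    exps = [math.exp(x - m) if x >= threshold else 0.0 for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 2.5, 1.0, -2.0, -4.0]  # toy values
print([round(p, 3) for p in top_k_probs(logits, k=2)])
print([round(p, 3) for p in top_k_probs(logits, k=1)])  # greedy: all mass on one token
```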

Interactive companion tool

Visualise how temperature, top-k, and top-p reshape a probability distribution. See why applying temperature before filtering is critical.

sampling-playground.html ↗

Prompt engineering: the context window is your lever

The prompt tokens you provide are conditioning: the model continues from whatever context you give it. A well-chosen prompt can dramatically shift the style, topic, and structure of output:

No prompt: model picks its own starting point; high variance, good for exploring defaults
Title or genre cue ("Chapter 1:", "def fibonacci(", "[Sonnet]"): steers domain and structure
First sentence: strongest conditioning; model continues in the established style and tone
Register cue ("The following is a formal scientific description of:"): shifts vocabulary and style

Taste test

Generate 5 outputs with no prompt, 5 with a single word, and 5 with a full sentence. Notice how the coherence and on-topic consistency of outputs improves dramatically as you provide more context. The model's uncertainty about style drops with each additional token of conditioning.

Your action Generate 20 samples from your best checkpoint, 4 at each of these temperatures: 0.5, 0.7, 0.9, 1.1, 1.3. Use the same prompt and top_k=40 for all. Read every output. Write down which temperature produces the best balance of coherence and variety for your domain. That's your model's "personality setting": the value you'll use for all future generation from this checkpoint.

Key takeaway

Generation settings are not cosmetic. Temperature controls the entropy of the output distribution; top-k eliminates the garbage tail; prompt engineering shifts the conditioning. The same model weights with the wrong settings can produce output that feels 10× worse than it actually is.

Day 41

Fine-tuning on Custom Data

Adapt a pretrained model to a new domain without training from scratch

Training from scratch is the right way to understand LLMs, which is why this course covered it. In practice, you almost never train from scratch. Fine-tuning starts from a model that already understands language and adapts it to your specific domain, style, or task in a fraction of the time and compute. This is how most real-world LLM applications are built.

Why fine-tuning works

A model pretrained on hundreds of billions of tokens has already internalized grammar, facts, code syntax, and reasoning patterns. Fine-tuning steers that compressed knowledge toward your domain rather than relearning it from zero:

Training from scratch on domain data

Model must learn basic language statistics before learning your domain. Needs 100MB+ for good results. Slow and compute-intensive.

Fine-tuning a pretrained model

Language knowledge is already there. Only domain adaptation needs to be learned. 1–10MB of data is often enough. Fast.

Data preparation and format

For domain adaptation, format data as plain continuation text. For instruction-following, use a consistent prompt/response template:

# Domain adaptation: plain continuation
with open("fine_tune_data.txt", "w", encoding="utf-8") as f:
    for doc in your_documents:
        f.write(doc.strip() + "\n<|endoftext|>\n")

# Instruction format: for Q&A or chat behavior
# (an alternative to the block above; running both would overwrite the same file)
template = "<|user|>{question}\n<|assistant|>{answer}<|endoftext|>"
with open("fine_tune_data.txt", "w", encoding="utf-8") as f:
    for q, a in qa_pairs:
        f.write(template.format(question=q, answer=a) + "\n")

Same curation rules as Day 36 apply: 1–10MB of high-quality data beats 100MB of noise.

Full fine-tuning

Fine-tuning is continued training at a much lower learning rate, low enough that the model adapts without overwriting what it already knows (a failure mode called catastrophic forgetting):

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.train()  # from_pretrained returns the model in eval mode (dropout off)

# Key: 10-100× lower LR than pretraining (GPT-3's 125M configuration used 6e-4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

for batch in fine_tune_dataloader:
    input_ids = batch["input_ids"]
    # Passing labels=input_ids makes the model shift them internally for next-token loss
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.4f}")

LoRA: parameter-efficient fine-tuning

LoRA freezes the original weights and inserts small trainable low-rank matrices alongside the attention projections. It often comes close to full fine-tuning quality while training less than 1% of the parameters, which makes it feasible even on a laptop GPU:

from peft import get_peft_model, LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # Rank: higher = more capacity, more params
    lora_alpha=32,              # Scaling factor: updates are scaled by lora_alpha / r (here 4)
    target_modules=["c_attn"],  # GPT-2's combined QKV projection
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable: 294,912 / 124,734,720 total (0.24%)

# Training loop is identical to full fine-tuning
model.save_pretrained("gpt2-lora-adapter/")  # Saves only the tiny adapter
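The trainable-parameter count in the listing above can be reproduced by hand. Each adapted projection gains two matrices, A of shape (r × d_in) and B of shape (d_out × r). The sketch assumes GPT-2 (124M) dimensions, 12 layers with hidden size 768 and a `c_attn` projection from 768 to 3 × 768; `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(d_in, d_out, r, n_layers):
    """Trainable parameters added by LoRA: A is (r x d_in), B is (d_out x r)."""
    per_layer = r * d_in + d_out * r
    return per_layer * n_layers


# GPT-2 (124M): 12 layers, hidden size 768, c_attn maps 768 -> 3 * 768.
n = lora_param_count(d_in=768, d_out=3 * 768, r=8, n_layers=12)
print(n)                         # 294912
print(f"{n / 124_734_720:.2%}")  # 0.24%
```

Doubling `r` doubles the adapter size, which is why rank is the main capacity knob.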

Compare

Train full fine-tuning and LoRA on the same data. Compare: training time per step, GPU memory usage, and generated sample quality after 3 epochs. LoRA should be noticeably faster with similar output quality.

Your action Load GPT-2 (124M) from Hugging Face. Fine-tune it for 3 epochs on a domain corpus of your choice: song lyrics, Wikipedia articles on one topic, or Python code. Generate 5 samples from the base model and 5 from your fine-tuned model using the same prompt. Read them side by side. The shift in style and domain vocabulary should be clearly visible.

Key takeaway

Fine-tuning, especially with LoRA, is the practical path for most LLM applications. Training from scratch taught you how it works; fine-tuning is how you make it work for your specific problem, with the resources you actually have.

Day 42

The Road Ahead

Advanced sampling, model distillation, RLHF, essential papers, and where to go next

You've built a language model from raw text to generation, understood every layer, and trained it by hand. That foundation is enough to understand everything in the field, because every frontier model is built on the same conceptual stack. This final day maps what's next.

Advanced sampling: top-p and beam search

Top-p (nucleus) sampling adapts dynamically: instead of always considering exactly k tokens, it takes the smallest set whose cumulative probability exceeds p. When the model is confident, the nucleus is small. When it's uncertain, the nucleus expands:

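import torch
import torch.nn.functional as F

def top_p_sample(logits, p=0.9, temperature=0.8):
    # logits: 1-D tensor of shape (vocab_size,)
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token once the mass *before* it already exceeds p, so the token
    # that crosses the threshold (and always the top token) is kept
    remove = cumprobs - sorted_probs > p
    sorted_logits[remove] = float('-inf')
    probs = F.softmax(sorted_logits, dim=-1)  # renormalize over the nucleus
    idx = torch.multinomial(probs, 1)
    return sorted_indices[idx]  # map back from sorted position to token ID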

Beam search keeps the k highest-scoring partial sequences (the beam) at each step. It produces coherent, grammatical output, but the results tend to feel generic and repetitive for creative tasks. Use it for summarization and translation; use sampling for open-ended generation.
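A toy beam search makes the pruning step concrete. This is a sketch, not a production decoder: `toy_model` is a hypothetical stand-in that returns next-token log-probabilities over two tokens, and there is no end-of-sequence handling or length normalization.

```python
import math


def beam_search(next_log_probs, start, steps, beam_width=2):
    """Keep the beam_width highest-scoring sequences at each step.

    next_log_probs(seq) returns {token: log_prob} for the next token.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [token]))
        # Sort by score, best first, and prune to the beam width.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(seq):
    """Hypothetical stand-in for a language model over tokens 'a' and 'b'."""
    if seq[-1] == "a":
        return {"a": math.log(0.4), "b": math.log(0.6)}
    return {"a": math.log(0.9), "b": math.log(0.1)}


for score, seq in beam_search(toy_model, start="a", steps=2, beam_width=2):
    print(seq, round(math.exp(score), 3))
```

With these toy probabilities, the two surviving sequences are a-b-a (probability 0.54) and a-a-b (probability 0.24).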

Model distillation: smaller models that punch above their weight

Distillation trains a small "student" to mimic a large "teacher." Instead of hard labels (one-hot targets), the student trains on the teacher's soft probability distribution, which contains far richer information about near-misses:

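import torch
import torch.nn.functional as F

T = 4.0  # Soften both distributions
with torch.no_grad():
    teacher_logits = teacher(input_ids)  # assumed shape (B, L, V)

student_logits = student(input_ids)

# Flatten to (B*L, V) so both losses average over every token position
V = student_logits.size(-1)
s = student_logits.reshape(-1, V)
t = teacher_logits.reshape(-1, V)

soft_loss = F.kl_div(
    F.log_softmax(s / T, dim=-1),
    F.softmax(t / T, dim=-1),
    reduction="batchmean"
) * (T ** 2)  # Rescale gradients by T²

hard_loss = F.cross_entropy(s, labels.reshape(-1))  # Hard labels use unscaled logits
loss = 0.5 * soft_loss + 0.5 * hard_loss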

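DistilGPT-2 (82M) is a distilled version of GPT-2 (124M): roughly a third fewer parameters and faster inference, at the cost of somewhat higher perplexity. (The often-quoted "97% of the quality" figure comes from DistilBERT, not DistilGPT-2.) Distillation is how a large model's behavior is transferred into a smaller, deployable one without pretraining the small model on raw data alone.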

RLHF and alignment: how models become assistants

RLHF (Reinforcement Learning from Human Feedback) transforms a text-completion model into an assistant that follows instructions. Three stages:

Supervised fine-tuning (SFT): fine-tune on human-written demonstrations of good assistant responses
Reward model (RM): train a classifier to predict which of two responses a human would prefer; this becomes the optimization signal
RL optimization (PPO/GRPO): use the reward model as a scalar signal to update the SFT model toward higher-scoring responses via policy gradient

DPO (Direct Preference Optimization) skips the separate reward model and optimizes on preference pairs directly using a reparameterized loss. It's simpler to implement, tends to be more stable than PPO, and has become a common default for smaller-scale alignment work. If you want to explore alignment, start with DPO.
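The reparameterized loss for a single preference pair can be sketched in a few lines. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name, argument names, and `beta=0.1` are illustrative assumptions, and the numbers below are toy values.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_reward = policy_chosen_lp - ref_chosen_lp
    rejected_reward = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Toy numbers: the policy equals the reference, so the margin is 0.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931 = log(2)
# The policy now favors the chosen response more than the reference does.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0) < math.log(2))  # True
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is the preference signal a reward model would otherwise supply.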

Essential papers and tools

Read these in order. Each builds on the last, and you'll recognize every concept:

"Attention Is All You Need" (Vaswani et al., 2017): the original transformer; short, readable, every diagram will click now
GPT-2 (Radford et al., 2019): shows that scale alone produces capability; defines the autoregressive LLM recipe
GPT-3 (Brown et al., 2020): introduces few-shot prompting and emergent capabilities at scale
Chinchilla (Hoffmann et al., 2022): revises scaling wisdom; most models are undertrained, not undersized
LoRA (Hu et al., 2021): explains mathematically why the low-rank adaptation from Day 41 works
InstructGPT (Ouyang et al., 2022): the RLHF paper whose recipe underlies ChatGPT; full alignment pipeline

Key tools: Hugging Face Transformers (standard model library), PEFT (LoRA and friends), vLLM (high-throughput serving), llama.cpp (4-bit inference on CPU/Apple Silicon), nanoGPT (Karpathy's clean reference implementation).

Your action Two tasks. First: implement top-p sampling and add it as a --top_p flag to your generate script alongside --temperature and --top_k. Test it at p=0.9 and compare outputs to top-k at k=40. Second: read the abstract, introduction, and conclusion of "Attention Is All You Need." Write one sentence summarizing the problem it was solving when it was published in 2017. The habit of reading papers is how you stay current as the field moves.

Key takeaway

You've gone from raw text to tokenization, through the transformer architecture, training loops, and generation: the same conceptual stack that powers every frontier model. The field moves fast, but these fundamentals don't change. Everything new is built on what you now understand.

Build a Large Language Model From Scratch
