Non-overlapping chunks:
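samples_per_epoch = ⌊dataset_size ÷ seq_len⌋
batches_per_epoch = ⌊samples_per_epoch ÷ batch_size⌋
(trailing partial chunk and incomplete final batch dropped; round up instead if the final batch is kept)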
Sliding window (stride=1):
samples_per_epoch = dataset_size − seq_len + 1
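The two counting formulas can be checked with a minimal Python sketch. The function names and example sizes are illustrative assumptions, and division is taken as floor division, so a trailing partial chunk or batch is dropped.

```python
def samples_non_overlapping(dataset_size: int, seq_len: int) -> int:
    # Each sample consumes seq_len fresh tokens; a trailing partial chunk is dropped.
    return dataset_size // seq_len


def samples_sliding(dataset_size: int, seq_len: int, stride: int = 1) -> int:
    # Counts valid window start positions; stride=1 gives dataset_size - seq_len + 1.
    if dataset_size < seq_len:
        return 0
    return (dataset_size - seq_len) // stride + 1


def batches_per_epoch(samples: int, batch_size: int) -> int:
    # Assumes the incomplete final batch is dropped.
    return samples // batch_size


# Hypothetical example sizes.
dataset_size, seq_len, batch_size = 10_000, 128, 32

chunked = samples_non_overlapping(dataset_size, seq_len)
sliding = samples_sliding(dataset_size, seq_len)

print(chunked, batches_per_epoch(chunked, batch_size))  # 78 2
print(sliding, batches_per_epoch(sliding, batch_size))  # 9873 308
```

Setting stride=seq_len in samples_sliding reproduces the non-overlapping count, so the two cases are the ends of one formula.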
Memory per batch (FP32):
bytes = 2 × batch_size × seq_len × 4
(2 tensors: inputs + targets; 4 bytes per element)
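A small sketch of the memory estimate, assuming both tensors have shape (batch_size, seq_len). The function name and the bytes_per_element parameter are illustrative. The default of 4 matches the formula above. If token IDs are stored as 64-bit integers, as is common for PyTorch embedding inputs, pass 8 instead.

```python
def batch_bytes(batch_size: int, seq_len: int, bytes_per_element: int = 4) -> int:
    # Two tensors (inputs and targets), each of shape (batch_size, seq_len).
    return 2 * batch_size * seq_len * bytes_per_element


# Hypothetical example sizes.
print(batch_bytes(32, 128))                       # 32768 (32 KiB)
print(batch_bytes(32, 128, bytes_per_element=8))  # 65536 (64 KiB)
```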
Tokens seen after N steps:
tokens = batch_size × seq_len × N
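epochs = tokens ÷ dataset_size
(non-overlapping chunks; with stride=1 one epoch processes (dataset_size − seq_len + 1) × seq_len tokens, so divide by that instead)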
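The step-count arithmetic can be sketched as follows. The function names and example numbers are illustrative assumptions, and the epoch estimate applies to non-overlapping chunks, where one epoch covers roughly dataset_size tokens.

```python
def tokens_seen(batch_size: int, seq_len: int, steps: int) -> int:
    # Every step processes batch_size sequences of seq_len tokens.
    return batch_size * seq_len * steps


def epochs_seen(tokens: int, dataset_size: int) -> float:
    # Valid for non-overlapping chunks, where one epoch is about dataset_size tokens.
    return tokens / dataset_size


# Hypothetical example: 1,000 steps over a 1M-token dataset.
tokens = tokens_seen(batch_size=32, seq_len=128, steps=1_000)
print(tokens)                          # 4096000
print(epochs_seen(tokens, 1_000_000))  # 4.096
```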
Key insight: With non-overlapping chunks, each token appears in at most one sample per epoch (tokens in a trailing partial chunk are dropped). With stride=1, each token appears in up to seq_len different samples per epoch (fewer near the start and end of the dataset). This gives the model more opportunities to learn from each token, but consecutive samples are highly correlated and an epoch is roughly seq_len times longer.
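The coverage claim can be verified by brute force on a toy dataset. This is a minimal sketch: the coverage helper and the sizes are illustrative assumptions, and it only counts how many samples contain each token position.

```python
from collections import Counter


def coverage(dataset_size: int, seq_len: int, stride: int) -> Counter:
    # Count how many samples contain each token position.
    counts = Counter()
    for start in range(0, dataset_size - seq_len + 1, stride):
        for pos in range(start, start + seq_len):
            counts[pos] += 1
    return counts


# Hypothetical toy sizes.
dataset_size, seq_len = 20, 4

chunked = coverage(dataset_size, seq_len, stride=seq_len)
sliding = coverage(dataset_size, seq_len, stride=1)

print(max(chunked.values()))    # 1: each token is in at most one sample
print(sliding[10])              # 4: an interior token is in seq_len samples
print(sliding[0], sliding[19])  # 1 1: edge tokens appear in fewer samples
```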