Training Batch Shape Debugger

See exactly how your dataset gets sliced into training batches. Compute samples, tokens, and memory for any configuration.

Dataset Configuration

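Dataset size (tokens): 1.00M
Sequence length (seq_len): 128
Batch size: 32
Training steps (N): 1000
Sliding window (stride = 1): Instead of non-overlapping chunks, slide by 1 token for maximum samples per epoch.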

Batch Statistics

Samples per Epoch: 7,812 (sequences in the dataset)
Batches per Epoch: 244 (at current batch size)
Tokens after N steps: 4.1M (32 × 128 × 1000 steps)
Input/target tensor memory: 32 KB (FP32, one batch)
Tokens per batch: 4,096
Input shape: (32, 128)
Target shape: (32, 128)
Epochs after N steps: 4.1

Chunking Visualizer

See how a sample text gets sliced into input and target chunks with the one-position offset. Each cell is one token ID.

Input tokens (x)
Target tokens (y)
Both (overlap)
How to read this: Each row is one sample. The blue cells are the model's input (x), the green cells are the targets (y). Notice how y is x shifted left by one position: the model predicts the next token at every position simultaneously. Purple cells appear where a token serves as both input (for predicting the next position) and target (predicted from the previous position).
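The one-position offset described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the helper name `make_pairs` and the placeholder token IDs are assumptions made for the example.

```python
def make_pairs(token_ids, seq_len, stride=None):
    """Slice a flat token list into (input, target) pairs.

    The target is the input shifted by one position. make_pairs is a
    hypothetical helper written for this illustration.
    """
    if stride is None:
        stride = seq_len  # non-overlapping chunks by default
    pairs = []
    # Each pair needs seq_len + 1 tokens: seq_len inputs plus one extra
    # token that serves as the final target.
    for start in range(0, len(token_ids) - seq_len, stride):
        x = token_ids[start : start + seq_len]
        y = token_ids[start + 1 : start + seq_len + 1]
        pairs.append((x, y))
    return pairs


tokens = list(range(10, 20))  # ten placeholder token IDs: 10..19
pairs = make_pairs(tokens, seq_len=4)
for x, y in pairs:
    print("x:", x, "y:", y)
```

Within each pair, every token except the first input and the last target plays both roles, which is what the purple cells mark.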

Formulas & Details

Non-overlapping chunks:
samples_per_epoch = floor(dataset_size ÷ seq_len)
batches_per_epoch = floor(samples_per_epoch ÷ batch_size)
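As a quick check, the two divisions can be reproduced with integer division. This sketch assumes that a trailing partial chunk and a trailing partial batch are both dropped, which is consistent with the 7,812 and 244 shown under Batch Statistics; the function name is hypothetical.

```python
def non_overlapping_counts(dataset_size, seq_len, batch_size):
    """Return (samples_per_epoch, batches_per_epoch) for non-overlapping chunks.

    Assumes partial chunks and partial batches are discarded (floor division).
    """
    samples = dataset_size // seq_len
    batches = samples // batch_size
    return samples, batches


samples, batches = non_overlapping_counts(1_000_000, 128, 32)
print(samples, batches)  # 7812 244
```

If the last partial batch is kept instead of dropped, the batch count would be rounded up rather than down.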

Sliding window (stride=1):
samples_per_epoch = dataset_size − seq_len + 1
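The sliding-window count generalizes to any stride. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and it counts input windows only. An implementation that also needs a next-token target for the final window would yield one fewer sample at stride 1.

```python
def sliding_window_count(dataset_size, seq_len, stride=1):
    """Number of length-seq_len windows that fit in the dataset.

    With stride=1 this reduces to dataset_size - seq_len + 1.
    """
    if dataset_size < seq_len:
        return 0
    return (dataset_size - seq_len) // stride + 1


dense = sliding_window_count(1_000_000, 128)          # stride 1
chunked = sliding_window_count(1_000_000, 128, 128)   # stride = seq_len
print(dense, chunked)  # 999873 7812
```

Setting the stride equal to seq_len recovers the non-overlapping count, so the two modes are the endpoints of one formula.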

Memory per batch (FP32):
bytes = 2 × batch_size × seq_len × 4
(2 tensors: inputs + targets; 4 bytes per FP32 element)
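The memory formula is plain arithmetic and can be evaluated directly. In this sketch the function and parameter names are hypothetical; `bytes_per_element` is exposed because token IDs are often stored as 64-bit integers in practice, which would double the figure.

```python
def batch_memory_bytes(batch_size, seq_len, bytes_per_element=4, num_tensors=2):
    """Bytes needed for the input and target tensors of one batch."""
    return num_tensors * batch_size * seq_len * bytes_per_element


nbytes = batch_memory_bytes(32, 128)                          # 4-byte elements
nbytes_int64 = batch_memory_bytes(32, 128, bytes_per_element=8)
print(nbytes, nbytes / 1024)   # 32768 bytes, 32.0 KiB
print(nbytes_int64)            # 65536 bytes
```

This covers only the ID tensors themselves, not activations, gradients, or model weights.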

Tokens seen after N steps:
tokens = batch_size × seq_len × N
epochs = tokens ÷ dataset_size
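A short sketch of the same calculation, using the configuration shown above (batch size 32, sequence length 128, 1000 steps, 1M-token dataset). The function name is an assumption for illustration.

```python
def tokens_and_epochs(batch_size, seq_len, steps, dataset_size):
    """Return (tokens processed after `steps` steps, equivalent epochs)."""
    tokens = batch_size * seq_len * steps
    epochs = tokens / dataset_size
    return tokens, epochs


tokens_seen, epochs = tokens_and_epochs(32, 128, 1000, 1_000_000)
print(tokens_seen, round(epochs, 1))  # 4096000 4.1
```

Note that "epochs" here measures tokens processed relative to the dataset size. With a sliding window, one full pass over all samples processes far more tokens than the dataset contains, so this ratio is best read as dataset-equivalents of tokens.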

Key insight: With non-overlapping chunks, each token appears in exactly one input position per epoch. With stride=1, each token appears in up to seq_len different samples per epoch (fewer for tokens near the start and end of the dataset). This gives the model more opportunities to learn from each token, but at the cost of seeing highly correlated samples.
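The difference in token reuse can be verified by brute force on a tiny dataset. This is a counting sketch only, with a hypothetical helper name and toy sizes (12 tokens, seq_len of 4) chosen so the result is easy to inspect.

```python
from collections import Counter


def appearance_counts(dataset_size, seq_len, stride):
    """Count how many input windows each token position falls into."""
    counts = Counter()
    for start in range(0, dataset_size - seq_len + 1, stride):
        for pos in range(start, start + seq_len):
            counts[pos] += 1
    return counts


dense = appearance_counts(12, 4, stride=1)   # sliding window
sparse = appearance_counts(12, 4, stride=4)  # non-overlapping chunks

print([dense[i] for i in range(12)])   # [1, 2, 3, 4, 4, 4, 4, 4, 4, 3, 2, 1]
print([sparse[i] for i in range(12)])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Interior tokens reach the full seq_len count under stride 1, while tokens within seq_len − 1 positions of either end appear in fewer windows.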