Plot and compare learning rate schedules. Understand how warmup, decay, and schedule shape affect training convergence.
How different LR schedules affect training loss. The lighter line shows the raw, noisy loss; the bold line shows the smoothed trend. Bad schedules cause instability, slow convergence, or an early plateau.
Each card below describes one configuration and how it behaves. All but the recommended one illustrate a failure mode.
Diverges: Constant LR at 3e-3. Loss often diverges (NaN) within the first 100 steps. The optimizer overshoots and never recovers.
Slow: Constant LR at 1e-5. Training is stable but painfully slow. Needs 5-10x more steps to reach the same loss as a good schedule.
Unstable: Cosine decay with 0 warmup steps. The first few updates are chaotic with noisy gradients at full LR, causing early instability.
Wasted compute: 50% of steps spent in warmup. Most of the training budget is wasted at sub-optimal LRs when the model could be learning faster.
Recommended: 100 warmup steps + cosine decay to 10% of max. Fast early progress, smooth convergence. The gold standard for transformers.
Too fast decay: Decays to 0 in very few steps. The model gets stuck early with minimal LR, converging to a poor local minimum.
Copy this Python function directly into your training script.
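As a rough sketch of what such a function might look like for the recommended configuration (100 warmup steps, then cosine decay to 10% of the maximum LR): the function name `lr_at_step`, its parameter names, the peak LR of 3e-4, and the 10,000-step budget are all illustrative assumptions, not values taken from this page, and this is not necessarily the code the page itself generates.

```python
import math

def lr_at_step(step, total_steps=10_000, warmup_steps=100,
               peak_lr=3e-4, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr.

    All defaults except warmup_steps=100 and min_lr_ratio=0.1 are
    assumed values chosen for illustration.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp; (step + 1) keeps the very first update from using LR = 0.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped so that steps past
    # total_steps stay at the floor instead of rising again.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine
```

In a hand-written training loop, the returned value can be assigned to each optimizer parameter group before the update. With PyTorch, one option is `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier of the optimizer's base LR rather than an absolute value, so the function would be called with `peak_lr=1.0` in that case.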