
Stanford Study: AdamW Holds Its Ground in the Optimizer Race
Since Adam was proposed in 2014, it and its improved variant AdamW have dominated the pre-training of open-weight language models, remaining stable and converging quickly on large datasets.
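For reference, here is a minimal single-parameter sketch of the AdamW update, whose defining feature is decoupled weight decay applied directly to the weights rather than folded into the gradient. The hyperparameter defaults shown are common conventions, not the study's settings.

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative defaults)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term is added outside the adaptive scaling.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v
```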
As model scales grow rapidly, pre-training has become one of the most compute-intensive workloads, often accounting for the largest share of the computational cost in large-model research and development. In this context, optimizer design directly affects convergence speed and training cost.
A study from Percy Liang's team at Stanford University indicates that, despite many alternatives claiming significant speedups, AdamW remains a robust first choice for pre-training, while matrix-based methods show clear advantages only at specific data-to-model ratios.
The researchers argue that the inflated speedup claims may stem from two methodological flaws: baseline optimizers are often under-tuned, and fixing hyperparameters shared across methods does not guarantee a fair comparison. Moreover, most prior tests use only small-scale models or stick to the 1x data-to-model ratio proposed in the Chinchilla paper, leaving open how the results change for larger models or higher data ratios.
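To illustrate the fairness point, below is a minimal sketch of tuning each optimizer's hyperparameters independently rather than fixing shared values across all of them. The `train_and_eval` callback and the search grids are hypothetical placeholders, not the study's actual protocol.

```python
from itertools import product

# Hypothetical per-optimizer search spaces; real sweeps would be broader.
SEARCH_SPACES = {
    "adamw": {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.01, 0.1]},
    "muon":  {"lr": [3e-4, 1e-3, 3e-3], "weight_decay": [0.0, 0.01]},
    "soap":  {"lr": [1e-4, 3e-4, 1e-3], "weight_decay": [0.0, 0.01]},
}

def tune(optimizer_name, train_and_eval):
    """Grid-search this optimizer's own hyperparameters (placeholder loop)."""
    space = SEARCH_SPACES[optimizer_name]
    keys = list(space)
    best = None
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        loss = train_and_eval(optimizer_name, **config)  # hypothetical trainer
        if best is None or loss < best[0]:
            best = (loss, config)
    return best
```

The point of the sketch is simply that each method gets its own best configuration before any cross-optimizer comparison is made.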
The team conducted a systematic comparison of eleven deep-learning optimizers. The findings suggest that the optimal choice depends on the training regime: at the standard Chinchilla data ratio, Muon performs best; when the data-to-model ratio rises to eight times Chinchilla or more, Soap becomes the better choice.
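As a rough illustration of those regimes, assuming the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (an assumption of this sketch, not a figure from the study):

```python
def chinchilla_multiple(num_params, num_tokens, tokens_per_param_at_1x=20):
    """Data-to-model ratio as a multiple of the ~20-tokens-per-parameter rule of thumb."""
    return (num_tokens / num_params) / tokens_per_param_at_1x

# Example: a 1B-parameter model trained on 20B tokens sits at ~1x Chinchilla,
# the regime where the study reports Muon performing best; training the same
# model on 160B tokens puts it at ~8x, where Soap is reported to pull ahead.
print(chinchilla_multiple(1e9, 20e9))    # 1.0
print(chinchilla_multiple(1e9, 160e9))   # 8.0
```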