Explore a comprehensive research seminar that investigates efficient training algorithms for Transformer-based language models, focusing on the computational challenges and effectiveness of various optimization methods. Learn about three key categories of algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia). Discover the findings from pre-training BERT and T5 models under fixed computation budgets, and understand the proposed evaluation protocol based on reference system time. Delve into potential pitfalls, experimental setups, and practical implications for training efficiency. Gain insights from speakers Jean Kaddour and Oscar Key as they present their research findings, supported by publicly available code and their published paper. Master concepts including layer stacking, selective backprop, and efficient optimizers, and understand the overheads and conclusions drawn from their extensive experimentation.
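To make the batch-selection category mentioned above more concrete, here is a minimal sketch of a selective-backprop-style training step: compute per-example losses, keep only the hardest fraction of the batch, and backpropagate through those examples alone. This is a simplified top-k variant for illustration (the original selective backprop method selects examples probabilistically based on loss percentiles); the `model`, `optimizer`, and `keep_fraction` names are hypothetical placeholders and are not taken from the speakers' released code.

```python
import torch
import torch.nn.functional as F

def selective_backprop_step(model, optimizer, inputs, targets, keep_fraction=0.5):
    """One training step that backpropagates only through the highest-loss examples.

    Simplified sketch of the selective-backprop idea; `keep_fraction` is a
    hypothetical hyperparameter, not a value from the paper.
    """
    model.train()

    # First forward pass without building a graph: we only need per-example
    # losses to rank the batch.
    with torch.no_grad():
        logits = model(inputs)
        per_example_loss = F.cross_entropy(logits, targets, reduction="none")

    # Keep the top-k hardest (highest-loss) examples.
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = per_example_loss.topk(k).indices

    # Second forward/backward pass on the selected subset only.
    optimizer.zero_grad()
    selected_logits = model(inputs[top_idx])
    loss = F.cross_entropy(selected_logits, targets[top_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the extra scoring pass is itself an overhead, which is one reason the seminar evaluates these methods against total reference system time rather than step counts.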
No Train No Gain - Revisiting Efficient Training Algorithms for Transformer-based Language Models