Toronto Machine Learning Series (TMLS)
Efficient Inference of Extremely Large Transformer Models
Explore techniques for optimizing inference with extremely large transformer models, including model compression, efficient attention, and parallelism, to make models smaller and inference faster and more cost-effective.