1. LLM Full fine-tuning with lower VRAM
2. Video Overview
3. Understanding Optimizers
4. Stochastic Gradient Descent (SGD)
5. AdamW Optimizer and VRAM requirements
6. AdamW 8-bit optimizer
7. Adafactor optimizer and memory requirements
8. GaLore - reducing gradient and optimizer VRAM
9. LoRA versus GaLore
10. Better and Faster GaLore via Subspace Descent
11. Layerwise gradient updates
12. Training Scripts
13. How gradient checkpointing works to reduce memory
14. AdamW Performance
15. AdamW 8-bit Performance
16. Adafactor with manual learning rate and schedule
17. Adafactor with default/auto learning rate
18. GaLore AdamW
19. GaLore AdamW with Subspace Descent
20. Using AdamW 8-bit and Adafactor with GaLore
21. Notebook demo of layerwise gradient updates
22. Running with LoRA
23. Running Inference and Pushing Models to the Hub
24. Single-GPU Recommendations
25. Multi-GPU Recommendations
26. Resources
Description:
Explore advanced techniques for full fine-tuning of large language models with limited GPU resources in this comprehensive video tutorial. Dive deep into optimizer strategies, including Stochastic Gradient Descent (SGD), AdamW, and Adafactor, and learn about their VRAM requirements and performance trade-offs. Discover the GaLore method for reducing gradient and optimizer VRAM usage, and compare it with LoRA. Gain insights into layerwise gradient updates, gradient checkpointing, and the implementation of various optimizers. Follow along with practical demonstrations, including a notebook demo of layerwise gradient updates and running models with LoRA. Learn how to run inference and push models to the Hugging Face Hub, and get recommendations for both single- and multi-GPU setups. Access resources, including slides, GitHub repositories, and support channels, to deepen your understanding of advanced fine-tuning techniques.
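
For orientation, below is a minimal sketch (not the video's actual training script) of how these memory-saving options are typically combined with the Hugging Face transformers Trainer. The model name, dataset, repo id, and hyperparameters are illustrative placeholders; GaLore support requires the galore-torch package alongside transformers.

    import torch
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # Placeholder dataset; any causal-LM text dataset works the same way.
    dataset = load_dataset("imdb", split="train[:1000]")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,   # simulate a larger batch without extra VRAM
        gradient_checkpointing=True,     # recompute activations on the backward pass
        # GaLore stores optimizer state for a low-rank projection of each gradient
        # instead of the full gradient. Drop-in alternatives for `optim` include
        # "adamw_torch", "adamw_bnb_8bit", "adafactor", or
        # "galore_adamw_layerwise" for layer-by-layer updates.
        optim="galore_adamw",
        optim_target_modules=["attn", "mlp"],  # regex-matched modules to apply GaLore to
        learning_rate=1e-4,
        logging_steps=10,
        max_steps=100,
        bf16=True,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # Push the fine-tuned model to the Hub (run `huggingface-cli login` first).
    model.push_to_hub("your-username/your-model")      # placeholder repo id
    tokenizer.push_to_hub("your-username/your-model")

The two memory levers here are independent: gradient checkpointing trades extra forward-pass compute for lower activation memory, while the optimizer choice (8-bit AdamW, Adafactor, or GaLore) shrinks optimizer state. Note that GaLore still updates the full-rank weights, which is what distinguishes it from LoRA's frozen-base, low-rank-adapter approach.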

Full Fine-tuning LLMs with Lower VRAM: Optimizers, GaLore, and Advanced Techniques

Trelis Research