Chapters:
1. Intro
2. In this video
3. What are transformers and attention?
4. Attention explained simply
5. Attention used in CNNs
6. Transformers and attention
7. What the Vision Transformer (ViT) does differently
8. Images to patch embeddings (sketched in code after this list)
9. 1. Building image patches
10. 2. Linear projection
11. 3. Learnable class embedding
12. 4. Adding positional embeddings
13. ViT implementation in Python with Hugging Face
14. Packages, dataset, and Colab GPU
15. Initialize the Hugging Face ViT Feature Extractor
16. Hugging Face Trainer setup
17. Training and CUDA device error
18. Evaluation and classification predictions with ViT
19. Final thoughts
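
Chapters 8-12 outline how an image becomes a sequence of patch embeddings. The following is a minimal PyTorch sketch of those four steps, not the video's code; the 224x224 input, 16x16 patches, and 768-dimensional embeddings are assumptions taken from the standard ViT-Base configuration.

import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2        # 14 x 14 = 196 patches

# Steps 1 + 2: a strided convolution cuts the image into 16x16 patches and
# linearly projects each patch to a 768-dimensional embedding in one pass.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# Step 3: learnable class embedding prepended to the patch sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Step 4: learnable positional embeddings, one per token (patches + class token).
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(1, 3, image_size, image_size)         # dummy RGB image batch
patches = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 768)
tokens = torch.cat([cls_token, patches], dim=1)       # (1, 197, 768)
tokens = tokens + pos_embed                            # input to the transformer encoder
print(tokens.shape)                                    # torch.Size([1, 197, 768])

The convolution with stride equal to the patch size is the usual implementation trick: it is equivalent to flattening each 16x16 patch and applying one shared linear projection to every patch.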
Description:
Explore the Vision Transformer (ViT) model in this comprehensive tutorial video. Dive into the intuition behind how ViT works and how it bridges the gap between vision and language processing in machine learning. Learn about attention mechanisms, image patch embeddings, and the key components that make ViT effective. Follow along with a hands-on Python implementation using the Hugging Face transformers library for image classification. Gain insight into setting up the environment, initializing the ViT Feature Extractor, configuring the Hugging Face Trainer, and evaluating model performance. Ideal for anyone interested in cutting-edge developments in computer vision and natural language processing.
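
The implementation half of the video follows that workflow: install packages, initialize the ViT Feature Extractor, configure the Hugging Face Trainer, then train and evaluate. The snippet below is a minimal sketch of such a fine-tuning run, not the video's actual notebook; the CIFAR-10 dataset, the google/vit-base-patch16-224-in21k checkpoint, and all hyperparameters are illustrative assumptions.

import torch
from datasets import load_dataset
from transformers import (ViTFeatureExtractor, ViTForImageClassification,
                          Trainer, TrainingArguments)

# Hypothetical dataset choice: CIFAR-10 from the Hugging Face Hub,
# trimmed to a small subset so the sketch runs quickly on a Colab GPU.
dataset = load_dataset("cifar10")
train_ds = dataset["train"].shuffle(seed=0).select(range(2000))
test_ds = dataset["test"].select(range(500))
labels = dataset["train"].features["label"].names

feature_extractor = ViTFeatureExtractor.from_pretrained(
    "google/vit-base-patch16-224-in21k")

def transform(batch):
    # Resize and normalize PIL images into pixel_values tensors.
    inputs = feature_extractor([img.convert("RGB") for img in batch["img"]],
                               return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

train_ds = train_ds.with_transform(transform)
test_ds = test_ds.with_transform(transform)

def collate_fn(examples):
    # Stack per-example tensors into a training batch.
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=len(labels))

args = TrainingArguments(
    output_dir="./vit-cifar10",        # hypothetical output path
    per_device_train_batch_size=16,
    num_train_epochs=1,
    remove_unused_columns=False,       # keep the raw "img" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)

trainer.train()
print(trainer.evaluate())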

Vision Transformers Explained + Fine-Tuning in Python

James Briggs