Chapters:
1. Intro
2. In this video
3. What are transformers and attention?
4. Attention explained simply
5. Attention used in CNNs
6. Transformers and attention
7. What the Vision Transformer (ViT) does differently
8. Images to patch embeddings (sketched in code after this list)
9. 1. Building image patches
10. 2. Linear projection
11. 3. Learnable class embedding
12. 4. Adding positional embeddings
13. ViT implementation in Python with Hugging Face
14. Packages, dataset, and Colab GPU
15. Initialize the Hugging Face ViT Feature Extractor
16. Hugging Face Trainer setup
17. Training and CUDA device error
18. Evaluation and classification predictions with ViT
19. Final thoughts
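
Chapters 8-12 outline how an image becomes a sequence of patch embeddings. The following is a minimal PyTorch sketch of those four steps, not the video's code; the 224x224 input, 16x16 patches, and 768-dimensional embeddings are assumptions taken from the standard ViT-Base configuration.

import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2        # 14 x 14 = 196 patches

# Steps 1 + 2: a strided convolution cuts the image into 16x16 patches and
# linearly projects each patch to a 768-dimensional embedding in one pass.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# Step 3: learnable class embedding prepended to the patch sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Step 4: learnable positional embeddings, one per token (patches + class token).
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(1, 3, image_size, image_size)         # dummy RGB image batch
patches = patch_embed(x).flatten(2).transpose(1, 2)   # (1, 196, 768)
tokens = torch.cat([cls_token, patches], dim=1)       # (1, 197, 768)
tokens = tokens + pos_embed                            # input to the transformer encoder
print(tokens.shape)                                    # torch.Size([1, 197, 768])

The convolution with stride equal to the patch size is the usual implementation trick: it is equivalent to flattening each 16x16 patch and applying one shared linear projection to every patch.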
Description:
Explore the Vision Transformer (ViT) model in this comprehensive tutorial video. Dive into the intuition behind how ViT works and how it bridges the gap between vision and language processing in machine learning. Learn about attention mechanisms, image patch embeddings, and the key components that make ViT effective. Follow along with a hands-on Python implementation using the Hugging Face transformers library for image classification. Gain insight into setting up the environment, initializing the ViT Feature Extractor, configuring the Hugging Face Trainer, and evaluating model performance. Ideal for anyone interested in cutting-edge developments in computer vision and natural language processing.
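
The implementation half of the video follows that workflow: install packages, initialize the ViT Feature Extractor, configure the Hugging Face Trainer, then train and evaluate. The snippet below is a minimal sketch of such a fine-tuning run, not the video's actual notebook; the CIFAR-10 dataset, the google/vit-base-patch16-224-in21k checkpoint, and all hyperparameters are illustrative assumptions.

import torch
from datasets import load_dataset
from transformers import (ViTFeatureExtractor, ViTForImageClassification,
                          Trainer, TrainingArguments)

# Hypothetical dataset choice: CIFAR-10 from the Hugging Face Hub,
# trimmed to a small subset so the sketch runs quickly on a Colab GPU.
dataset = load_dataset("cifar10")
train_ds = dataset["train"].shuffle(seed=0).select(range(2000))
test_ds = dataset["test"].select(range(500))
labels = dataset["train"].features["label"].names

feature_extractor = ViTFeatureExtractor.from_pretrained(
    "google/vit-base-patch16-224-in21k")

def transform(batch):
    # Resize and normalize PIL images into pixel_values tensors.
    inputs = feature_extractor([img.convert("RGB") for img in batch["img"]],
                               return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

train_ds = train_ds.with_transform(transform)
test_ds = test_ds.with_transform(transform)

def collate_fn(examples):
    # Stack per-example tensors into a training batch.
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=len(labels))

args = TrainingArguments(
    output_dir="./vit-cifar10",        # hypothetical output path
    per_device_train_batch_size=16,
    num_train_epochs=1,
    remove_unused_columns=False,       # keep the raw "img" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)

trainer.train()
print(trainer.evaluate())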

Vision Transformers Explained + Fine-Tuning in Python

James Briggs