Vision Transformer is Invariant to Position of Patches
Position Embedding
Learnable Class Embedding
Why Layer Norm?
Why Skip Connection?
Why Multi-Head Self-Attention?
A Transformer Encoder is Made of L Encoder Modules Stacked Together
Versions Based on Layers, MLP Size, MSA Heads
Pre-training on a large dataset, fine-tuning on the target dataset
Training by Knowledge Distillation (DeiT)
Semantic Segmentation (mIoU: 50.3 SETR vs baseline PSPNet on ADE20K)
Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
Vision Transformer for STR (ViTSTR)
Parameter, FLOPS, and Speed Efficient
Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT baseline)
Limitations
Recommended Open-Source Implementations of ViT
Description:
Explore a 35-minute talk on Vision Transformer and its applications in computer vision. Delve into the breakthrough model architecture, focusing on self-attention and its role in vision. Examine various implementations utilizing Vision Transformer as the main backbone, including applications in recognition, detection, segmentation, multi-modal learning, and scene text recognition. Discover the potential of self-attention beyond transformers in building general-purpose model architectures capable of processing diverse data formats such as text, audio, image, and video. Learn about training techniques, including pre-training on large datasets and knowledge distillation. Investigate the model's performance in semantic segmentation and medical image segmentation, as well as its parameter, FLOPS, and speed efficiency. Understand the limitations of Vision Transformers and gain insights into recommended open-source implementations.
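
To make the architecture topics in the outline concrete (position embedding, learnable class embedding, layer norm, skip connections, multi-head self-attention, and the L stacked encoder modules), here is a minimal PyTorch sketch of a ViT-style encoder. The dimensions shown (768-dim tokens, 12 heads, 3072-dim MLP, 12 blocks, 196 patches) correspond to the common ViT-Base/16 configuration at 224x224 input and are assumptions for illustration, not details taken from the talk.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """One pre-norm encoder module: LayerNorm -> MSA -> skip, then LayerNorm -> MLP -> skip."""
        def __init__(self, dim=768, heads=12, mlp_dim=3072):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # skip connection around MSA
            x = x + self.mlp(self.norm2(x))                    # skip connection around MLP
            return x

    class ViTEncoder(nn.Module):
        """L encoder modules stacked together, preceded by a learnable class embedding
        and learnable position embeddings added to the patch embeddings; without the
        position embeddings, self-attention is invariant to the order of the patches."""
        def __init__(self, num_patches=196, dim=768, depth=12, heads=12, mlp_dim=3072):
            super().__init__()
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable class embedding
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable position embedding
            self.blocks = nn.ModuleList([EncoderBlock(dim, heads, mlp_dim) for _ in range(depth)])

        def forward(self, patch_embeddings):  # (batch, num_patches, dim)
            cls = self.cls_token.expand(patch_embeddings.shape[0], -1, -1)
            x = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed
            for block in self.blocks:
                x = block(x)
            return x[:, 0]  # class-token output, used as input to the classifier head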
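
For the knowledge distillation topic (DeiT), a rough sketch of the hard-label distillation objective is shown below. The equal 0.5/0.5 weighting follows DeiT's hard-distillation variant; the function and tensor names are illustrative, not taken from the talk.

    import torch.nn.functional as F

    def hard_distillation_loss(student_cls_logits, student_dist_logits, teacher_logits, labels):
        # The classification token is supervised by the ground-truth labels ...
        ce = F.cross_entropy(student_cls_logits, labels)
        # ... while a separate distillation token is supervised by the teacher's
        # hard predictions (argmax), as in DeiT's hard-distillation variant.
        teacher_labels = teacher_logits.argmax(dim=-1)
        kd = F.cross_entropy(student_dist_logits, teacher_labels)
        return 0.5 * ce + 0.5 * kd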
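
For the pre-train-then-fine-tune workflow and the open-source implementations mentioned at the end, a minimal fine-tuning sketch using the timm library is given below. timm is one widely used ViT implementation, named here only as an example rather than as the talk's specific recommendation; the model name and 10-class setup are placeholders.

    import timm
    import torch
    import torch.nn.functional as F

    # Load an ImageNet-pretrained ViT and replace its head for a 10-class target task.
    model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    # One fine-tuning step on a stand-in batch; a real run would loop over a DataLoader.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()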