Vision Transformer is Invariant to Position of Patches
Position Embedding
Learnable Class Embedding
Why Layer Norm?
Why Skip Connection?
Why Multi-Head Self-Attention?
A Transformer Encoder is Made of L Encoder Modules Stacked Together
Versions Based on Layers, MLP Size, MSA Heads
Pre-training on a large dataset, fine-tuning on the target dataset
Training by Knowledge Distillation (DeiT)
Semantic Segmentation (mIoU: 50.3 SETR vs baseline PSPNet on ADE20K)
Semantic Segmentation (mIoU: 84.4 SegFormer vs 82.2 SETR on Cityscapes)
Vision Transformer for STR (ViTSTR)
Parameter, FLOPS, and Speed Efficient
Medical Image Segmentation (DSC: 77.5 TransUNet vs 71.3 R50-ViT baseline)
Limitations
Recommended Open-Source Implementations of ViT
Description:
Explore a 35-minute talk on Vision Transformer and its applications in computer vision. Delve into the breakthrough model architecture, focusing on self-attention and its role in vision. Examine various implementations utilizing Vision Transformer as the main backbone, including applications in recognition, detection, segmentation, multi-modal learning, and scene text recognition. Discover the potential of self-attention beyond transformers in building general-purpose model architectures capable of processing diverse data formats such as text, audio, image, and video. Learn about training techniques, including pre-training on large datasets and knowledge distillation. Investigate the model's performance in semantic segmentation and medical image segmentation, as well as its parameter, FLOPS, and speed efficiency. Understand the limitations of Vision Transformers and gain insights into recommended open-source implementations.
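
To make the architecture topics in the outline concrete (position embedding, learnable class embedding, layer norm, skip connections, multi-head self-attention, and the L stacked encoder modules), here is a minimal PyTorch sketch of a ViT-style encoder. The dimensions shown (768-dim tokens, 12 heads, 3072-dim MLP, 12 blocks, 196 patches) correspond to the common ViT-Base/16 configuration at 224x224 input and are assumptions for illustration, not details taken from the talk.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """One pre-norm encoder module: LayerNorm -> MSA -> skip, then LayerNorm -> MLP -> skip."""
        def __init__(self, dim=768, heads=12, mlp_dim=3072):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # skip connection around MSA
            x = x + self.mlp(self.norm2(x))                    # skip connection around MLP
            return x

    class ViTEncoder(nn.Module):
        """L encoder modules stacked together, preceded by a learnable class embedding
        and learnable position embeddings added to the patch embeddings; without the
        position embeddings, self-attention is invariant to the order of the patches."""
        def __init__(self, num_patches=196, dim=768, depth=12, heads=12, mlp_dim=3072):
            super().__init__()
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # learnable class embedding
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable position embedding
            self.blocks = nn.ModuleList([EncoderBlock(dim, heads, mlp_dim) for _ in range(depth)])

        def forward(self, patch_embeddings):  # (batch, num_patches, dim)
            cls = self.cls_token.expand(patch_embeddings.shape[0], -1, -1)
            x = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed
            for block in self.blocks:
                x = block(x)
            return x[:, 0]  # class-token output, used as input to the classifier head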
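
For the knowledge distillation topic (DeiT), a rough sketch of the hard-label distillation objective is shown below. The equal 0.5/0.5 weighting follows DeiT's hard-distillation variant; the function and tensor names are illustrative, not taken from the talk.

    import torch.nn.functional as F

    def hard_distillation_loss(student_cls_logits, student_dist_logits, teacher_logits, labels):
        # The classification token is supervised by the ground-truth labels ...
        ce = F.cross_entropy(student_cls_logits, labels)
        # ... while a separate distillation token is supervised by the teacher's
        # hard predictions (argmax), as in DeiT's hard-distillation variant.
        teacher_labels = teacher_logits.argmax(dim=-1)
        kd = F.cross_entropy(student_dist_logits, teacher_labels)
        return 0.5 * ce + 0.5 * kd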
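
For the pre-train-then-fine-tune workflow and the open-source implementations mentioned at the end, a minimal fine-tuning sketch using the timm library is given below. timm is one widely used ViT implementation, named here only as an example rather than as the talk's specific recommendation; the model name and 10-class setup are placeholders.

    import timm
    import torch
    import torch.nn.functional as F

    # Load an ImageNet-pretrained ViT and replace its head for a 10-class target task.
    model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

    # One fine-tuning step on a stand-in batch; a real run would loop over a DataLoader.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()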