Syllabus:
1. Intro
2. Grounded Visual Question Answering
3. Limitations of Existing VQA Systems
4. Grounded VQA Systems
5. Problem Setup
6. Transformers with Capsules
7. Approach
8. Capsule-based Tokens
9. Input to Intermediate Transformer Layers
10. Text-based Residual Connection
11. Pre-training Tasks
12. Masked Language Modeling (MLM)
13. Image Text Matching
14. Pre-training Datasets
15. Fine-tuning on Downstream Task
16. Qualitative Comparison - GQA
17. Evaluation Metrics
18. Results - GQA
19. Conclusion and Future Work
Description:
Explore the concept of Grounded Visual Question Answering (VQA) in this 22-minute lecture from the University of Central Florida. Delve into the limitations of existing VQA systems and discover how grounded VQA systems aim to overcome these challenges. Learn about the problem setup, including the use of transformers with capsules, capsule-based tokens, and text-based residual connections. Examine pre-training tasks such as Masked Language Modeling (MLM) and Image Text Matching, along with the datasets used for pre-training. Investigate the fine-tuning process for downstream tasks and analyze qualitative comparisons using the GQA dataset. Review evaluation metrics and results before concluding with insights into future work in this rapidly evolving field of artificial intelligence and computer vision.
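The lecture itself presents the model in detail; purely as an illustration of the ideas named above, the Python sketch below shows how capsule-based visual tokens and text tokens might be concatenated and fed to a transformer encoder, with Masked Language Modeling (MLM) and Image Text Matching (ITM) heads attached for pre-training. All class names, dimensions, and the simple pooling step are hypothetical simplifications, not the architecture presented in the talk.

# Minimal, assumed-for-illustration sketch of capsule-style visual tokens
# plus text tokens in a joint transformer, with MLM and ITM heads.
import torch
import torch.nn as nn

class CapsuleTokens(nn.Module):
    """Projects region features into a fixed number of capsule-like tokens
    (a simplification; the real method's routing is not shown here)."""
    def __init__(self, feat_dim=2048, num_capsules=16, dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_capsules * dim)
        self.num_capsules, self.dim = num_capsules, dim

    def forward(self, region_feats):                 # (B, R, feat_dim)
        caps = self.proj(region_feats)               # (B, R, num_capsules * dim)
        caps = caps.view(region_feats.size(0), -1, self.num_capsules, self.dim)
        return caps.mean(dim=1)                      # pool over regions -> (B, C, dim)

class GroundedVQASketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, layers=4, heads=8):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.capsules = CapsuleTokens(dim=dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.mlm_head = nn.Linear(dim, vocab_size)   # predict masked word tokens
        self.itm_head = nn.Linear(dim, 2)            # image-text matched / mismatched

    def forward(self, token_ids, region_feats):
        text = self.text_emb(token_ids)              # (B, T, dim)
        vis = self.capsules(region_feats)            # (B, C, dim)
        h = self.encoder(torch.cat([text, vis], dim=1))
        mlm_logits = self.mlm_head(h[:, :token_ids.size(1)])  # per text position
        itm_logits = self.itm_head(h[:, 0])                    # first token as summary
        return mlm_logits, itm_logits

In this kind of setup, MLM is trained by masking some input word tokens and computing cross-entropy on their logits, while ITM is trained as a binary classification over matched and mismatched image-question pairs; fine-tuning for the downstream VQA task then replaces or supplements these heads with an answer classifier.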

Visual Question Answering: Grounded Systems and Transformer Capsules

University of Central Florida