1. Intro
2. Challenge for Automatic Speech Recognition
3. A Perspective on Spoken Language Processing: most of the world's languages have not been addressed by resource- and expert-intensive supervised methods
4. Crossing the Vision Language Boundary
5. Learning an Audio/Visual Embedding Space?
6. Joint Audio-Visual Analysis Architecture
7. Crowdsourcing Audio-Visual Data
8. Evaluation: Image Search and Annotation
9. Evaluating via Image Search
10. Evaluating via Image Annotation
11. Time-varying Audio-Visual Affiliation
12. Audio-Visual Grounding for Localization
13. Examples of Audio-Visual Clusters
14. Cluster Analysis
15. Spatial Distribution of Speech Clusters
16. Final Message
Description:
Explore the cutting-edge research on unsupervised learning of spoken language using visual context in this 34-minute talk by Jim Glass from MIT. Delve into the challenges of automatic speech recognition and the potential of audio-visual embedding spaces to revolutionize language learning. Discover how deep learning models can associate images with spoken descriptions, creating word-like units from unannotated speech. Examine the experimental evaluation and analysis demonstrating the model's ability to cluster visual objects and their spoken counterparts. Learn about crowdsourcing audio-visual data, evaluation techniques for image search and annotation, and time-varying audio-visual affiliation. Gain insights into audio-visual grounding for localization, spatial distribution of speech clusters, and the broader implications for advancing speech recognition capabilities across the world's languages.
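The core technique the talk describes is a joint audio-visual embedding model trained on images paired with spoken captions. Below is a minimal sketch of that general idea, not the speaker's implementation: a dual-encoder in PyTorch in which an image encoder and a speech encoder project into a shared space, and a margin ranking loss pushes matched image/caption pairs above mismatched ones. All class names, layer sizes, and the toy data are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    # Projects a precomputed image feature vector into the shared embedding space.
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, x):                        # x: (batch, feat_dim)
        return F.normalize(self.proj(x), dim=-1)

class SpeechEncoder(nn.Module):
    # 1-D convolutions over a log-mel spectrogram, mean-pooled over time.
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, embed_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, spec):                     # spec: (batch, n_mels, frames)
        h = self.conv(spec).mean(dim=-1)         # pool over time -> (batch, embed_dim)
        return F.normalize(h, dim=-1)

def ranking_loss(img_emb, spc_emb, margin=1.0):
    # Matched pairs sit on the diagonal of the similarity matrix; every
    # off-diagonal entry is an impostor that should score lower by `margin`.
    sim = img_emb @ spc_emb.t()                  # (batch, batch)
    pos = sim.diag().unsqueeze(1)                # (batch, 1)
    off_diag = 1.0 - torch.eye(sim.size(0))
    cost_img = F.relu(margin + sim - pos) * off_diag      # image vs. impostor captions
    cost_spc = F.relu(margin + sim - pos.t()) * off_diag  # caption vs. impostor images
    return (cost_img + cost_spc).mean()

if __name__ == "__main__":
    img_enc, spc_enc = ImageEncoder(), SpeechEncoder()
    images = torch.randn(8, 2048)                # stand-in image features
    speech = torch.randn(8, 40, 300)             # stand-in spectrograms (40 mels, 300 frames)
    loss = ranking_loss(img_enc(images), spc_enc(speech))
    loss.backward()
    print(f"toy ranking loss: {loss.item():.3f}")

The image-search and image-annotation evaluations mentioned above would then rank this same cross-modal similarity matrix and report recall at N in each direction; the word-like units, localization, and clustering results discussed later in the talk extend the similarity computation to image regions and temporal speech segments, which this sketch omits.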

Unsupervised Learning of Spoken Language with Visual Context

MITCBMM