Unsupervised Learning of Spoken Language with Visual Context

A Perspective on Spoken Language Processing
Crossing the Vision-Language Boundary
Learning an Audio/Visual Embedding Space?
Joint Audio-Visual Analysis Architecture
Crowdsourcing Audio-Visual Data
Evaluation: Image Search and Annotation
Evaluating via Image Search
Evaluating via Image Annotation
Time-varying Audio-Visual Affiliation
Audio-Visual Grounding for Localization
Examples of Audio-Visual Clusters
Cluster Analysis
Spatial Distribution of Speech Clusters
Final Message
Description:
Explore the cutting-edge research on unsupervised learning of spoken language using visual context in this 34-minute talk by Jim Glass from MIT. Delve into the challenges of automatic speech recognition and the potential of audio-visual embedding spaces to revolutionize language learning. Discover how deep learning models can associate images with spoken descriptions, creating word-like units from unannotated speech. Examine the experimental evaluation and analysis demonstrating the model's ability to cluster visual objects and their spoken counterparts. Learn about crowdsourcing audio-visual data, evaluation techniques for image search and annotation, and time-varying audio-visual affiliation. Gain insights into audio-visual grounding for localization, spatial distribution of speech clusters, and the broader implications for advancing speech recognition capabilities across the world's languages.
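To make the audio-visual embedding idea concrete, the sketch below pairs a toy image encoder with a toy spectrogram encoder in a shared embedding space and trains them with a margin ranking loss, so that matched image/caption pairs score higher than mismatched pairs within a batch. This is a minimal PyTorch illustration, not the model from the talk: the encoder architectures, layer sizes, dot-product similarity, and 0.2 margin are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy image branch: conv features pooled into a fixed-size embedding."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.conv(images).flatten(1)   # (B, 128)
        return F.normalize(self.proj(feats), dim=-1)

class SpeechEncoder(nn.Module):
    """Toy speech branch: 1-D convs over a spectrogram, pooled over time."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, spectrograms):                  # (B, n_mels, T)
        feats = self.conv(spectrograms).mean(dim=-1)  # pool over time
        return F.normalize(self.proj(feats), dim=-1)

def ranking_loss(img_emb, sp_emb, margin=0.2):
    """Hinge ranking loss: matched pairs should outscore mismatched ones."""
    sims = img_emb @ sp_emb.t()                # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)             # matched image/caption scores
    cost_sp = (margin + sims - pos).clamp(min=0)      # impostor captions
    cost_im = (margin + sims - pos.t()).clamp(min=0)  # impostor images
    off_diag = 1 - torch.eye(sims.size(0), device=sims.device)
    return ((cost_sp + cost_im) * off_diag).sum() / sims.size(0)

# Usage sketch on random data.
img_enc, sp_enc = ImageEncoder(), SpeechEncoder()
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 40, 1024)
loss = ranking_loss(img_enc(images), sp_enc(spectrograms))
loss.backward()
```

The sketch only covers whole-image, whole-caption matching; the word-like units, time-varying affiliation, and localization results described in the talk come from analyzing which speech segments and image regions align within such a shared space.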