Unsupervised Learning of Spoken Language with Visual Context

A Perspective on Spoken Language Processing
Crossing the Vision-Language Boundary
Learning an Audio/Visual Embedding Space?
Joint Audio-Visual Analysis Architecture
Crowdsourcing Audio-Visual Data
Evaluation: Image Search and Annotation
Evaluating via Image Search
Evaluating via Image Annotation
Time-varying Audio-Visual Affiliation
Audio-Visual Grounding for Localization
Examples of Audio-Visual Clusters
Cluster Analysis
Spatial Distribution of Speech Clusters
Final Message
Description:
Explore the cutting-edge research on unsupervised learning of spoken language using visual context in this 34-minute talk by Jim Glass from MIT. Delve into the challenges of automatic speech recognition and the potential of audio-visual embedding spaces to revolutionize language learning. Discover how deep learning models can associate images with spoken descriptions, creating word-like units from unannotated speech. Examine the experimental evaluation and analysis demonstrating the model's ability to cluster visual objects and their spoken counterparts. Learn about crowdsourcing audio-visual data, evaluation techniques for image search and annotation, and time-varying audio-visual affiliation. Gain insights into audio-visual grounding for localization, spatial distribution of speech clusters, and the broader implications for advancing speech recognition capabilities across the world's languages.
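To make the audio-visual embedding idea concrete, the sketch below pairs a toy image encoder with a toy spectrogram encoder in a shared embedding space and trains them with a margin ranking loss, so that matched image/caption pairs score higher than mismatched pairs within a batch. This is a minimal PyTorch illustration, not the model from the talk: the encoder architectures, layer sizes, dot-product similarity, and 0.2 margin are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy image branch: conv features pooled into a fixed-size embedding."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.conv(images).flatten(1)   # (B, 128)
        return F.normalize(self.proj(feats), dim=-1)

class SpeechEncoder(nn.Module):
    """Toy speech branch: 1-D convs over a spectrogram, pooled over time."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, spectrograms):                  # (B, n_mels, T)
        feats = self.conv(spectrograms).mean(dim=-1)  # pool over time
        return F.normalize(self.proj(feats), dim=-1)

def ranking_loss(img_emb, sp_emb, margin=0.2):
    """Hinge ranking loss: matched pairs should outscore mismatched ones."""
    sims = img_emb @ sp_emb.t()                # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)             # matched image/caption scores
    cost_sp = (margin + sims - pos).clamp(min=0)      # impostor captions
    cost_im = (margin + sims - pos.t()).clamp(min=0)  # impostor images
    off_diag = 1 - torch.eye(sims.size(0), device=sims.device)
    return ((cost_sp + cost_im) * off_diag).sum() / sims.size(0)

# Usage sketch on random data.
img_enc, sp_enc = ImageEncoder(), SpeechEncoder()
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 40, 1024)
loss = ranking_loss(img_enc(images), sp_enc(spectrograms))
loss.backward()
```

The sketch only covers whole-image, whole-caption matching; the word-like units, time-varying affiliation, and localization results described in the talk come from analyzing which speech segments and image regions align within such a shared space.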