Explore a 32-minute technical video detailing NVIDIA's Hymba model, a hybrid-head architecture for small language models that combines transformer attention with state-space models (SSMs). Learn how the parallel hybrid heads pair attention heads, which provide high-resolution memory recall, with SSM heads that efficiently summarize global context. Discover the meta tokens concept: learnable embeddings prepended to the input that serve as learned initializations, optimizing attention distribution and mitigating the "attention sink" effect. Examine the memory optimizations, including cross-layer key-value cache sharing and partial sliding window attention, that achieve an 11.67× reduction in cache size and a 3.49× improvement in throughput compared to larger models. Follow along as the presentation demonstrates Hymba's performance across benchmarks, showing how this sub-2B-parameter model outperforms conventional approaches in accuracy, throughput, and memory efficiency, setting a new standard for resource-efficient language models.
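To make the hybrid-head idea concrete, here is a minimal PyTorch sketch of a block that runs an attention branch and a simplified state-space-style branch in parallel over a sequence with learnable meta tokens prepended. The names (HybridHeadBlock, num_meta_tokens) and the gated-recurrence stand-in for the SSM branch are illustrative assumptions, not the paper's reference implementation.

```python
# Hedged sketch of a Hymba-style hybrid-head block (not the official code).
import torch
import torch.nn as nn


class HybridHeadBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, num_meta_tokens: int = 8):
        super().__init__()
        # Learnable meta tokens prepended to every sequence: a learned initial
        # "memory" that attention can sink into instead of real tokens.
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, d_model) * 0.02)

        # Attention branch: high-resolution recall over (meta + input) tokens.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        # SSM branch (heavily simplified): a gated, input-dependent linear
        # recurrence that summarizes global context at O(L) cost.
        self.ssm_in = nn.Linear(d_model, 2 * d_model)
        self.ssm_out = nn.Linear(d_model, d_model)

        self.out_proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def _ssm_branch(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). A decay-gated running state stands in for a
        # full selective state-space scan.
        u, gate = self.ssm_in(x).chunk(2, dim=-1)
        decay = torch.sigmoid(gate)                      # per-step forget gate
        state = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):
            state = decay[:, t] * state + (1 - decay[:, t]) * u[:, t]
            outs.append(state)
        return self.ssm_out(torch.stack(outs, dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = x.shape
        meta = self.meta_tokens.expand(b, -1, -1)
        h = self.norm(torch.cat([meta, x], dim=1))       # prepend meta tokens

        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self._ssm_branch(h)

        # Fuse the two heads' views, then drop the meta-token positions.
        fused = self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))
        return x + fused[:, meta.size(1):]


if __name__ == "__main__":
    block = HybridHeadBlock(d_model=256, n_heads=4)
    tokens = torch.randn(2, 32, 256)                     # (batch, seq, d_model)
    print(block(tokens).shape)                           # torch.Size([2, 32, 256])
```

The sketch only illustrates the parallel attention/SSM fusion and the meta tokens; the cross-layer KV cache sharing and partial sliding window attention described in the video are additional cache-side optimizations layered on top of blocks like this.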
NVIDIA Hymba: A Hybrid-Head Architecture for Small Language Models with Meta Tokens