Chapters:
1. New NVIDIA Hymba LLM
2. Inference run with test-time training
3. Transformer in parallel with Mamba
4. Meta tokens introduced
5. Task-specific meta tokens
6. Meta tokens explained in detail
7. NVIDIA Hymba beats Llama 3.2 3B
8. Attention map entropy per head
9. Key-value cache in Transformer and Mamba
10. My crazy idea of meta tokens and ICL (NVIDIA)
Description:
Explore a 32-minute technical video detailing NVIDIA's Hymba model, a hybrid-head architecture for small language models that combines transformer attention mechanisms with state-space models. Learn about the parallel design that pairs attention heads, for high-resolution memory recall, with SSM heads, for efficient global context summarization. Discover the meta tokens concept: learnable embeddings prepended to the input that serve as task-specific initializations, shaping the attention distribution and mitigating the "attention sink" effect. Examine the memory optimizations, including cross-layer key-value cache sharing and partial sliding window attention, that achieve an 11.67× reduction in cache size and a 3.49× improvement in throughput compared to larger models. Follow along as the presentation demonstrates Hymba's performance across benchmarks, showing how this sub-2B parameter model outperforms conventional approaches in accuracy, throughput, and memory efficiency, setting a new standard for resource-efficient language models.
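
To make the hybrid-head idea concrete, here is a minimal PyTorch sketch of one Hymba-style block: learnable meta tokens are prepended to the sequence, an attention path and a simplified diagonal state-space recurrence run in parallel over the same input, and their outputs are fused by simple addition. The class name, dimensions, the toy SSM, and the additive fusion are illustrative assumptions for this sketch, not NVIDIA's implementation (the paper uses Mamba-style SSM heads and a learned normalization/weighting when merging the two paths).

# Illustrative sketch of a hybrid-head block with meta tokens.
# Shapes, names, and the fusion rule are assumptions, not Hymba's code.
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_state=16, n_meta=8):
        super().__init__()
        self.n_meta = n_meta
        # Learnable meta tokens prepended to every input sequence.
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta, d_model) * 0.02)
        # Attention path: high-resolution recall over the full sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SSM path: simplified diagonal recurrence summarizing global context
        # into a fixed-size state (stand-in for a Mamba head).
        self.in_proj = nn.Linear(d_model, d_model)
        self.A = nn.Parameter(torch.rand(d_model, d_state))        # decay rates
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.02)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.02)
        self.out_proj = nn.Linear(d_model, d_model)

    def ssm(self, x):
        # x: (batch, seq, d_model) -> per-channel linear recurrence.
        b, t, d = x.shape
        u = self.in_proj(x)
        decay = torch.sigmoid(self.A)             # keep the recurrence stable
        h = torch.zeros(b, d, self.A.shape[1], device=x.device)
        outs = []
        for step in range(t):
            # h_t = decay * h_{t-1} + B * u_t (elementwise, per channel)
            h = decay * h + self.B * u[:, step].unsqueeze(-1)
            outs.append((h * self.C).sum(-1))     # read out to (batch, d_model)
        return self.out_proj(torch.stack(outs, dim=1))

    def forward(self, x):
        # Prepend meta tokens so attention has a learned "landing spot"
        # instead of dumping probability mass on the first real token.
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([meta, x], dim=1)
        attn_out, _ = self.attn(x, x, x)          # attention head path
        ssm_out = self.ssm(x)                     # SSM head path
        fused = attn_out + ssm_out                # naive fusion of parallel paths
        return fused[:, self.n_meta:]             # drop the meta positions

# Usage: a batch of 2 sequences, 32 tokens each, 256-dim embeddings.
block = HybridHeadBlock()
y = block(torch.randn(2, 32, 256))
print(y.shape)  # torch.Size([2, 32, 256])

The slow Python loop in ssm() is only there to make the recurrence explicit; it shows why the SSM path needs no growing key-value cache, which is the property the video's cache-size comparison builds on.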

NVIDIA HYMBA: A Hybrid-Head Architecture for Small Language Models with MetaTokens

Discover AI