Understanding Medusa: A Framework for LLM Inference Acceleration with Multiple Decoding Heads

Oxen

Chapters:
1. Introducing Daniel Varoli from Zapata.ai
2. The Problem with LLMs Today
3. How We Can Solve These Problems
4. Normal vs. Speculative Architecture
5. Speculative Decoding Example
6. Introducing Medusa
7. Medusa's Decoding Heads
8. Generating Tokens with Medusa Heads
9. Verifying Candidates with Medusa
10. What if We Mess Up?
11. Rejection Sampling for Accepting Candidates
12. Considering Many Completion Candidates at Once
13. Tree Attention Diagrams
14. How to Integrate Medusa into an LLM
15. Results
Description:
Explore a comprehensive 52-minute technical presentation on the Medusa framework for accelerating Large Language Model (LLM) inference through parallel token prediction. Daniel Varoli of Zapata.ai explains the bottleneck of today's LLMs, sequential token-by-token decoding, and introduces Medusa's solution: multiple decoding heads combined with a tree-based attention mechanism. The talk contrasts normal and speculative decoding architectures, walks through a practical speculative-decoding example, and shows how Medusa generates candidate tokens and verifies them. It also covers rejection sampling for accepting candidates, considering many completion candidates at once, and tree attention diagrams, along with guidance on integrating Medusa into an existing LLM. The presentation concludes with performance results and practical implementation details, making it valuable for AI researchers and developers working on LLM optimization.
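The chapters on Medusa's decoding heads (7–8) describe extra prediction heads attached to the base model's final hidden state, each guessing a token further ahead. Below is a minimal PyTorch sketch of that idea, assuming the residual feed-forward head design described in the Medusa paper; the `MedusaHead` name and the toy sizes are illustrative, not the presenter's code.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: a residual feed-forward block followed by
    a vocabulary projection, predicting the token k+1 steps ahead."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the head close to the base model's
        # representation, which makes the heads cheap to train.
        hidden = hidden + self.act(self.proj(hidden))
        return self.lm_head(hidden)  # logits over the vocabulary

# Toy sizes for illustration; a real model would use its own dimensions.
hidden_size, vocab_size, num_heads = 256, 1000, 4
heads = nn.ModuleList([MedusaHead(hidden_size, vocab_size) for _ in range(num_heads)])

last_hidden = torch.randn(1, hidden_size)       # base model's final hidden state
head_logits = [h(last_hidden) for h in heads]   # one logit vector per lookahead step
# The base LM head still predicts position t+1; head k predicts position t+1+k.
```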

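Chapters 9–11 cover verifying the drafted candidates and deciding how many tokens to accept. Medusa's paper proposes a "typical acceptance" scheme; the sketch below substitutes the simpler greedy-match rule from classic speculative decoding to show the shape of the verification step. The function name and the toy tensors are assumptions for illustration.

```python
import torch

def accept_longest_prefix(candidate: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """Greedy verification (a simplified stand-in for Medusa's typical
    acceptance): keep the longest prefix of the drafted tokens that the
    base model itself would have produced greedily.

    candidate:   (k,) drafted token ids for positions t+1 .. t+k
    base_logits: (k, vocab) base-model logits at those positions, obtained
                 from a single forward pass over prefix + candidate
    """
    greedy = base_logits.argmax(dim=-1)       # what the base model prefers
    matches = (greedy == candidate).long()    # 1 where the draft agrees
    keep = int(matches.cumprod(dim=0).sum())  # length before first mismatch
    return candidate[:keep]

# Toy example: the base model agrees with the first two drafted tokens only.
vocab = 10
candidate = torch.tensor([4, 7, 2])
logits = torch.full((3, vocab), -1.0)
logits[0, 4] = 5.0
logits[1, 7] = 5.0
logits[2, 5] = 5.0  # at step 3 the model prefers token 5, not the drafted 2
print(accept_longest_prefix(candidate, logits))  # tensor([4, 7])
```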

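For the chapters on considering many candidates at once with tree attention (12–13), here is one way such a mask could be built: every node in the candidate tree attends only to its ancestors, so candidates that share a prefix share computation without attending across branches. The `tree_attention_mask` helper and the token ids are hypothetical.

```python
import torch

def tree_attention_mask(paths: list[list[int]]) -> torch.Tensor:
    """Boolean attention mask for a tree of candidate continuations.

    Each unique prefix of each candidate becomes one tree node (one
    row/column); node j may attend to node i only if i is an ancestor
    of j, or j itself.
    """
    node_index: dict[tuple[int, ...], int] = {}
    for path in paths:
        for depth in range(1, len(path) + 1):
            node_index.setdefault(tuple(path[:depth]), len(node_index))

    n = len(node_index)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for prefix, j in node_index.items():
        for depth in range(1, len(prefix) + 1):
            mask[j, node_index[tuple(prefix[:depth])]] = True  # ancestors + self
    return mask

# Two candidates sharing their first token: [7, 3] and [7, 9].
print(tree_attention_mask([[7, 3], [7, 9]]).int())
# tensor([[1, 0, 0],
#         [1, 1, 0],
#         [1, 0, 1]], dtype=torch.int32)
```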