1. How to pick a GPU and software for inference
2. Video Overview
3. Effect of Quantization on Quality
4. Effect of Quantization on Speed
5. Effect of GPU bandwidth relative to model size
6. Effect of de-quantization on inference speed
7. Marlin Kernels, AWQ and Neural Magic
8. Inference Software - vLLM, TGI, SGLang, NIM
9. Deploying one-click templates for inference
10. Testing inference speed for a batch size of 1 and 64
11. SGLang inference speed
12. vLLM inference speed
13. Text Generation Inference Speed
14. Nvidia NIM Inference Speed
15. Comparing vLLM, SGLang, TGI and NIM Inference Speed
16. Comparing inference costs for A40, A6000, A100 and H100
17. Inference Setup for Llama 3.1 70B and 405B
18. Running inference on Llama 8B on A40, A6000, A100 and H100
19. Inference cost comparison for Llama 8B
20. Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
21. Inference cost comparison for Llama 70B and 405B
22. OpenAI GPT4o Inference Costs versus Llama 3.1 8B, 70B, 405B
23. Final Inference Tips
24. Resources
Description:
Dive into a comprehensive video tutorial on selecting the right GPU and inference engine for machine learning projects. Learn about the impact of quantization on model quality and speed, the relationship between GPU bandwidth and model size, and the effect of de-quantization on inference speed. Explore advanced topics such as Marlin kernels, AWQ, and Neural Magic. Compare popular inference software, including vLLM, TGI, SGLang, and NIM, and learn how to deploy one-click templates for inference. Analyze detailed performance comparisons across GPUs (A40, A6000, A100, H100) and model sizes (Llama 3.1 8B, 70B, 405B), including cost considerations, and see how OpenAI GPT4o inference costs compare with the Llama models. The tutorial concludes with practical tips for optimizing inference setups and additional resources for further learning.
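The bandwidth-versus-model-size relationship covered in the tutorial can be sketched with a common back-of-envelope rule of thumb (this sketch is not taken from the video itself): at batch size 1, decoding is memory-bandwidth bound, so the token rate is roughly GPU memory bandwidth divided by the bytes needed to read the weights once. The ~2,000 GB/s figure below is an assumed A100 80GB SXM bandwidth.

```python
def estimated_tokens_per_sec(params_billions: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed, assuming the
    decode step is limited by reading all weights once per token."""
    model_gb = params_billions * bytes_per_param  # weight footprint in GB
    return bandwidth_gb_s / model_gb

# Llama 3.1 8B in FP16 (2 bytes/param) on an assumed ~2,000 GB/s GPU:
print(estimated_tokens_per_sec(8, 2, 2000))    # ~125 tokens/s ceiling
# The same model quantized to ~4-bit (0.5 bytes/param) raises the ceiling 4x:
print(estimated_tokens_per_sec(8, 0.5, 2000))  # ~500 tokens/s ceiling
```

This also shows why quantization speeds up small-batch inference (fewer bytes to read per token) and why a model that barely fits a GPU's memory runs slowly relative to that GPU's bandwidth.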

How to Pick a GPU and Inference Engine for Large Language Models

Trelis Research