Intro

Massive Deep Models are Great

The Neural Network Pruning Problem

The Mathematics of Compression

One-Shot Compression of GPT Models

The General Approach

Our Approach: Quantization Version

Experimental Validation

Combining Sparsity and Quantization

Exploiting with DeepSparse

Software Beats Hardware (continued)

Transforming the Pareto Frontier

Enabling Anyone to Run

Enabling Anyone to Sparsify

Questions

Description:

Explore SparseGPT, a one-shot model compression technique for large language models (LLMs), in this 42-minute video presentation by Neural Magic. Learn how to prune and quantize LLMs in a single step so they can be deployed on standard CPUs at GPU-like speeds. The talk covers the neural network pruning problem, the mathematics of compression, one-shot compression of GPT models, experimental validation, and the combination of sparsity and quantization. It then shows how DeepSparse exploits the resulting sparsity, presents deployment benchmarks for LLMs on CPU hardware, and argues that software optimization can outperform hardware solutions. It closes with how SparseGPT transforms the Pareto frontier, making it possible for anyone to run and sparsify LLMs, followed by a Q&A session.

Unlock Faster and More Efficient LLMs with SparseGPT - Neural Magic

Neural Magic

#Computer Science #Machine Learning #Model Optimization #Quantization #Model Compression