1. Introduction
2. Computing system design
3. Transformer architecture
4. Uniform quantization
5. Uniform quantization scheme
6. Uniform quantization limits
7. Is it still useful?
8. BCQ
9. Example
10. Critical problems
11. Lookup table
12. Transformer structure
13. Quantizing embedding layers
14. Mixed precision quantization
15. Encoder and Decoder
16. Retraining
17. Quantization Results
18. Latency Improvements
19. Quantization
20. Q&A
21. Strategic Partners
Description:
Explore extremely low-bit quantization techniques for Transformers in this tinyML Asia 2021 conference talk. Delve into the challenges of deploying the Transformer architecture on resource-limited devices and learn about effective quantization strategies. Discover how different Transformer blocks contribute to model accuracy and inference computations, and understand the varying impacts of individual words within embedding blocks. Examine a proposed mixed precision quantization approach for representing Transformer weights using fewer than 3 bits, including a method for assigning different quantization bits to each word in an embedding block based on statistical properties. Gain insights into a novel matrix multiplication kernel that eliminates the need for dequantization steps. Cover topics such as computing system design, uniform quantization schemes, critical problems in quantization, and the Transformer structure. Explore quantization results, latency improvements, and participate in a Q&A session to deepen your understanding of this cutting-edge approach to optimizing Transformer models for mobile and edge devices.
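
The per-word mixed precision idea summarized above can be sketched in a few lines. The following is a minimal illustration only, assuming a symmetric uniform quantizer, a token-frequency statistic, and a 3-bit/2-bit split; none of these specifics are confirmed by the talk, which covers BCQ and a lookup-table-based kernel in separate chapters.

```python
# Illustrative sketch: assign per-word bit widths to an embedding matrix
# from a simple statistic (here, token frequency), then quantize each row
# with its own bit budget. All constants below are assumptions for clarity.
import numpy as np

def uniform_quantize(row, n_bits):
    """Symmetric uniform quantization of one embedding row to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1                  # 3-bit -> 3, 2-bit -> 1
    max_abs = np.max(np.abs(row))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(row / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def assign_bits(token_freq, high_bits=3, low_bits=2, top_fraction=0.1):
    """Give the most frequent tokens more bits, the rest fewer."""
    threshold = np.quantile(token_freq, 1.0 - top_fraction)
    return np.where(token_freq >= threshold, high_bits, low_bits)

def quantize_embedding(embedding, token_freq):
    """Quantize each embedding row with its assigned bit width."""
    bits = assign_bits(token_freq)
    rows, scales = [], []
    for row, b in zip(embedding, bits):
        q, s = uniform_quantize(row, int(b))
        rows.append(q)
        scales.append(s)
    return rows, np.array(scales), bits

# Toy usage: 1000-word vocabulary, 64-dim embeddings, Zipf-like frequencies.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64)).astype(np.float32)
freq = 1.0 / np.arange(1, 1001)
q_rows, scales, bits = quantize_embedding(emb, freq)
print("average bits per word:", bits.mean())     # below 3 bits on average
```

A deployment along the lines described in the talk would then feed the quantized codes to a matrix multiplication kernel that consumes them directly (for example, via lookup tables) rather than dequantizing the weights back to floating point first.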

Extremely Low-Bit Quantization for Transformers - tinyML Asia 2021

tinyML