Inference and Quantization for AI - Session 3
Nvidia

Outline:
1. Intro
2. Outline
3. 4-Bit Quantization
4. Quantization for Inference
5. Binary Neural Networks
6. Using Tensor Cores
7. Quantized Network Accuracy
8. Maintaining Speed at Best Accuracy
9. Scale-Only Quantization
10. Per-Channel Scaling
11. Training for Quantization
12. Conclusion
13. Post-Training Calibration
14. Mixed Precision Networks
15. The Root Cause
16. Bring Your Own Calibration
17. Summary
18. INT Performance
19. Also in TensorRT
20. TF-TRT Relative Performance
21. Object Detection - NMS
22. Using the New NMS Op
23. Now Available on GitHub
24. TensorRT Hyperscale Inference Platform
25. Inefficiency Limits Innovation
26. NVIDIA TensorRT Inference Server
27. Current Features
28. Available Metrics
29. Dynamic Batching
30. Concurrent Model Execution - ResNet-50
31. NVIDIA Research AI Playground
32. NV Learn More and Download to Use
33. Additional Resources
Description:
Explore advanced techniques for AI inference and quantization in this session from the NVIDIA AI Tech Workshop at NeurIPS Expo 2018. Dive into quantized inference, NVIDIA TensorRT™ 5 and TensorFlow integration, and the TensorRT Inference Server. Learn about 4-bit quantization, binary neural networks, tensor cores, and strategies for maintaining speed while optimizing accuracy. Discover post-training calibration techniques, mixed precision networks, and the benefits of per-channel scaling. Gain insights into object detection with NMS, the TensorRT hyperscale inference platform, and the NVIDIA TensorRT Inference Server's features including dynamic batching and concurrent model execution. Access additional resources and tools to enhance your AI inference capabilities.
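
As a taste of the scale-only, per-channel quantization the session covers, below is a minimal NumPy sketch. It is not taken from the talk or from TensorRT's API; the function name quantize_per_channel and the random example weights are illustrative. The idea: each output channel gets one symmetric scale derived from its maximum absolute weight, and values are rounded into the signed integer range.

    import numpy as np

    def quantize_per_channel(weights, num_bits=8):
        """Symmetric, scale-only quantization with one scale per output channel.

        weights: float32 tensor of shape (out_channels, ...).
        Returns (integer weights, per-channel scales) such that
        weights ~= q * scale, broadcast over each channel.
        """
        qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for INT8
        flat = weights.reshape(weights.shape[0], -1)
        # One scale per output channel, from that channel's max magnitude.
        scales = np.abs(flat).max(axis=1) / qmax
        scales = np.where(scales == 0, 1.0, scales)  # guard all-zero channels
        q = np.clip(np.round(flat / scales[:, None]), -qmax, qmax).astype(np.int8)
        return q.reshape(weights.shape), scales

    # Example: quantize a conv weight tensor and check reconstruction error.
    w = np.random.randn(64, 32, 3, 3).astype(np.float32)
    q, s = quantize_per_channel(w)
    w_hat = q.astype(np.float32) * s[:, None, None, None]
    print("max abs error:", np.abs(w - w_hat).max())

Per-channel scaling like this keeps a single outlier channel from degrading the resolution of every other channel, which is one of the accuracy strategies the session discusses.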
