KungFu - Making Training in Distributed Machine Learning Adaptive
1. Intro
2. Training in Distributed ML Systems
3. Parameters in Distributed ML Systems
4. Issues with Empirical Parameter Tuning
5. Proposals for Automatic Parameter Adaptation
6. Open Challenges
7. Existing Approaches for Adaptation
8. KungFu Overview
9. Adaptation Policies
10. Example: Adaptation Policy for GNS (see the sketch after this list)
11. Embedding Monitoring Inside Dataflow (Problem: high monitoring cost reduces the adaptation benefit; Idea: improve efficiency by adding monitoring operators to the dataflow graph)
12. Challenges of Dataflow Collective Communication
13. Making Collective Communication Asynchronous (Idea: use asynchronous collective communication)
14. Issues When Adapting System Parameters
15. Distributed Mechanism for Parameter Adaptation
16. How Effectively Does KungFu Adapt?
17. Conclusions: KungFu
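
Chapter 10 covers an adaptation policy driven by gradient noise scale (GNS), an estimate of the largest useful batch size computed from gradient statistics. The sketch below is a rough, hypothetical illustration only: the two-batch estimator follows McCandlish et al., "An Empirical Model of Large-Batch Training", and the helper names (gradient_noise_scale, gns_policy_step) are made up for this example, not the actual KungFu API.

    import numpy as np

    def gradient_noise_scale(grad_small, grad_big, b_small, b_big):
        """Two-batch GNS estimator: compare gradient norms measured at a
        small and a large batch size to separate signal from noise."""
        g2_small = float(np.sum(grad_small ** 2))
        g2_big = float(np.sum(grad_big ** 2))
        # Unbiased estimate of the true squared gradient norm |G|^2 ...
        g2 = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
        # ... and of the per-example gradient variance S.
        s = (g2_small - g2_big) / (1.0 / b_small - 1.0 / b_big)
        return s / g2  # noise scale B_noise = S / |G|^2

    def gns_policy_step(step, gns, batch_size, interval=100, max_batch=4096):
        """Every `interval` steps, double the global batch size while it
        is still below the measured noise scale (and a hard cap)."""
        if step % interval == 0 and batch_size < min(gns, max_batch):
            return min(2 * batch_size, max_batch)
        return batch_size

    # Toy values: the GNS estimate comes out to ~55.7, which exceeds the
    # current batch size of 32, so the policy doubles it at this step.
    grad_small = np.full(10, 1.5)  # gradient measured at batch size 32
    grad_big = np.full(10, 1.0)    # gradient measured at batch size 256
    gns = gradient_noise_scale(grad_small, grad_big, b_small=32, b_big=256)
    print(gns_policy_step(step=100, gns=gns, batch_size=32))  # -> 64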
Description:
Explore KungFu, a distributed machine learning library for TensorFlow designed to make training adaptive, in this OSDI '20 conference talk. Dive into the challenges of configuring the many parameters of distributed ML systems and discover how KungFu addresses them through high-level Adaptation Policies (APs). Learn how the library dynamically adjusts hyper-parameters and system parameters during training based on metrics monitored in real time. Understand how monitoring and control operators are embedded in the dataflow graph, and how an efficient asynchronous collective communication layer ensures concurrency and consistency. Gain insight into the effectiveness of KungFu's adaptive approach, its distributed mechanism for parameter adaptation, and its potential to improve the efficiency and performance of distributed machine learning training.
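
The asynchronous collective communication layer mentioned above is what keeps monitoring cheap. The toy sketch below illustrates only the general idea, with plain Python threads standing in for the NCCL/MPI-style collectives a real system would use; the AsyncAllReduce class and its peer list are hypothetical. The all-reduce starts in the background and is awaited only when the averaged gradient is needed, so communication overlaps with computation.

    import threading
    import numpy as np

    class AsyncAllReduce:
        """Toy stand-in for an asynchronous all-reduce operator."""

        def __init__(self, peer_grads):
            self.peer_grads = peer_grads  # mock of other workers' gradients

        def start(self, grad):
            """Kick off the reduction in the background, return at once."""
            self._done = threading.Event()
            def run():
                # Average this worker's gradient with its peers'
                # (a real ring all-reduce would exchange chunks instead).
                total = grad + sum(self.peer_grads)
                self._result = total / (1 + len(self.peer_grads))
                self._done.set()
            threading.Thread(target=run, daemon=True).start()

        def wait(self):
            """Block until the averaged gradient is available."""
            self._done.wait()
            return self._result

    # Usage: launch the reduction, keep computing, collect the result later.
    op = AsyncAllReduce(peer_grads=[np.ones(4), 3 * np.ones(4)])
    op.start(np.full(4, 2.0))
    # ... the next layer's backward pass could run here, in parallel ...
    print(op.wait())  # -> [2. 2. 2. 2.], since (2 + 1 + 3) / 3 == 2

Deferring the wait until the averaged value is actually consumed is what lets communication and monitoring operators run concurrently with training, which is the property the talk highlights.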


USENIX