Tutorial: High Performance Distributed Deep Learning (PPoPP 2020 - Workshops and Tutorials)

Who

Dhabaleswar K. Panda, Ammar Ahmad Awan, Hari Subramoni

Track

PPoPP 2020 Workshops and Tutorials

When

Sat 22 Feb 2020 13:00 - 17:00 - Tutorial: High Performance Distributed Deep Learning (Riviera)

Abstract

Aim of the tutorial

Recent advances in Deep Learning (DL) have created many exciting challenges and opportunities for CS and AI researchers alike. Modern DL frameworks such as TensorFlow, PyTorch, and several others offer ease of use and flexibility for training and deploying various types of Deep Neural Networks (DNNs). In this tutorial, we provide an overview of interesting trends in DNN design and show how cutting-edge hardware architectures and high-performance interconnects are playing a key role in moving the field forward. We also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node design, but approaches to parallelizing DNN training are being actively explored. The DL community has adopted a range of distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU and GPU architectures in order to efficiently support large-scale distributed DNN training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on the cutting-edge CPU and GPU architectures available on modern HPC clusters. Finally, we include hands-on exercises that give attendees first-hand experience of running distributed DNN training experiments on a modern GPU cluster.
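To make the role of the communication runtime concrete: in data-parallel training, each worker computes gradients on its own mini-batch, and the workers then combine those gradients with a collective operation (an allreduce) before updating their model replicas. The sketch below simulates one well-known allreduce algorithm, the ring allreduce, in plain Python. It is an illustration only and not code from the tutorial; the function names `ring_allreduce` and `average_gradients` are hypothetical, and real runtimes such as MPI or NCCL choose among several algorithms and move data over the network rather than between Python lists.

```python
from typing import List


def ring_allreduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring allreduce (sum) across workers.

    worker_grads[r] is the gradient vector held by worker r. Every worker
    ends up with the element-wise sum of all vectors.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    bufs = [list(g) for g in worker_grads]  # each worker's local buffer

    # Split the vector into n contiguous chunks (some may be empty).
    bounds = [(c * length) // n for c in range(n + 1)]

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing data first so all sends happen "simultaneously".
        sends = [[bufs[r][i] for i in chunk((r - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk((r - s) % n), sends[r]):
                bufs[dst][i] += v
    # Now worker r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2, allgather: pass the fully reduced chunks around the ring.
    for s in range(n - 1):
        sends = [[bufs[r][i] for i in chunk((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, v in zip(chunk((r + 1 - s) % n), sends[r]):
                bufs[dst][i] = v
    return bufs


def average_gradients(worker_grads: List[List[float]]) -> List[List[float]]:
    """Allreduce followed by division by the number of workers."""
    n = len(worker_grads)
    return [[v / n for v in buf] for buf in ring_allreduce(worker_grads)]


# Three simulated workers, each holding a 4-element gradient vector.
grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
averaged = average_gradients(grads)
```

In this scheme each worker sends roughly 2(n - 1)/n times the vector size in total, nearly independent of the number of workers n, which is one reason ring-based designs are popular for large gradient buffers.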

Outline

Introduction

The Past, Present, and Future of Deep Learning (DL)

Brief History and Current/Future Trends
DL Resurgence in the Many-core Era

What are Deep Neural Networks?

Brief Introduction
Training and Inference

Diverse Applications of Deep Learning

Vision, Speech, Text, and Autonomous Driving

Deep Learning Frameworks

Why do we need DL frameworks? Define-by-run vs. Define-and-run Frameworks
Caffe, Caffe2, Cognitive Toolkit, Chainer, PyTorch, and TensorFlow

Overview of Execution Environments

Where do we run our DL Framework? (Conventional vs. Upcoming Execution Environments)
Holistic Performance Characterization – DL Frameworks and Underlying (BLAS/DNN) Libraries

Parallel and Distributed DNN Training

The Need for Parallel and Distributed Training
Parallelization Strategies, Communication Runtimes, and Scale-up and Scale-out

Latest Trends in HPC Technologies

HPC Hardware

Interconnects (InfiniBand, RoCE, and Omni-Path)
GPUs, Multi-/Many-cores, FPGAs, TPUs, and Intelligence Processing Units (IPUs)
Storage - NVMe, SSDs, Burst Buffers, etc.

Communication Middleware

Message Passing Interface (MPI), NVIDIA NCCL/NCCL2, Facebook Gloo, and Intel MLSL

Challenges in Exploiting HPC Technologies

Large Batch and Model Size, Accuracy, and Scalability
Exploiting GPUs and CUDA-Aware MPI
Co-design of Communication Runtimes and DL Frameworks
Efficient Collective Communication for DL Workloads

Solutions and Case Studies

NVIDIA NCCL/NCCL2, LLNL Aluminum, Baidu-allreduce, and Facebook Gloo
Co-design of MPI Runtimes and DL Frameworks
Distributed Training for TensorFlow
Scaling DNN Training on Multi-/Many-core CPUs
PowerAI Distributed Deep Learning

Hands-on Exercises
Open Issues and Challenges
Conclusion

Prerequisite Knowledge

There is no fixed prerequisite. Attendees with a general knowledge of HPC and networking will be able to understand and appreciate the material. The tutorial is designed so that attendees are introduced to the topics in a smooth and progressive manner. The content level will be as follows: 60% beginner, 30% intermediate, and 10% advanced.

Dhabaleswar K. Panda

Ohio State University

Ammar Ahmad Awan

Ohio State University

United States

Hari Subramoni

The Ohio State University