Tutorial: High Performance Distributed Deep Learning
Aim of the tutorial
Recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities for CS and AI researchers alike. Modern DL frameworks such as TensorFlow, PyTorch, and several others offer ease of use and flexibility to train and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we will provide an overview of interesting trends in DNN design and how cutting-edge hardware architectures and high-performance interconnects are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node design. However, approaches to parallelize DNN training are also being actively explored. The DL community has pursued different distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU and GPU architectures to efficiently support large-scale distributed DNN training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Finally, we include hands-on exercises that give attendees first-hand experience of running distributed DNN training experiments on a modern GPU cluster.
Outline
- Introduction
- The Past, Present, and Future of Deep Learning (DL)
- Brief History and Current/Future Trends
- DL Resurgence in the Many-core Era
- What are Deep Neural Networks?
- Brief Introduction
- Training and Inference
- Diverse Applications of Deep Learning
- Vision, Speech, Text, and Autonomous Driving
- Deep Learning Frameworks
- Why do we need DL frameworks? Define-by-run vs. Define-and-run Frameworks
- Caffe, Caffe2, Cognitive Toolkit, Chainer, PyTorch, and TensorFlow
- Overview of Execution Environments
- Where do we run our DL Framework? (Conventional vs. Upcoming Execution Environments)
- Holistic Performance Characterization – DL Frameworks and Underlying (BLAS/DNN) Libraries
- Parallel and Distributed DNN Training
- The Need for Parallel and Distributed Training
- Parallelization Strategies, Communication Runtimes, and Scale-up and Scale-out
- Latest Trends in HPC Technologies
- HPC Hardware
- Interconnects (InfiniBand, RoCE, and Omni-Path)
- GPUs, Multi-/Many-cores, FPGAs, TPUs, and Intelligence Processing Unit (IPU)
- Storage - NVMe, SSDs, Burst Buffers, etc.
- Communication Middleware
- Message Passing Interface (MPI), NVIDIA NCCL/NCCL2, Facebook Gloo, and Intel MLSL
- Challenges in Exploiting HPC Technologies
- Large Batch and Model Size, Accuracy, and Scalability
- Exploiting GPUs and CUDA-Aware MPI
- Co-design of Communication Runtimes and DL Frameworks
- Efficient Collective Communication for DL Workloads
- Solutions and Case Studies
- NVIDIA NCCL/NCCL2, LLNL Aluminum, Baidu-allreduce, and Facebook Gloo
- Co-design MPI Runtimes and DL Frameworks
- Distributed Training for TensorFlow
- Scaling DNN Training on Multi-/Many-core CPUs
- PowerAI Distributed Deep Learning
- Hands-on Exercises
- Open Issues and Challenges
- Conclusion
Prerequisite Knowledge
There is no fixed prerequisite. Attendees with a general knowledge of HPC and networking will be able to understand and appreciate the tutorial. The tutorial is designed so that attendees are exposed to the topics in a smooth and progressive manner. The content level will be as follows: 60% beginner, 30% intermediate, and 10% advanced.
Sat 22 Feb, 13:00 - 17:00 (time zone: Tijuana, Baja California)
Tutorial: High Performance Distributed Deep Learning (Riviera), Workshops and Tutorials
13:00, 4h, Demonstration: Dhabaleswar K. Panda (Ohio State University), Ammar Ahmad Awan (Ohio State University), Hari Subramoni (The Ohio State University)