Poster: Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems (PPoPP 2020 - Brief Announcements)

Sat 22 - Wed 26 February 2020 San Diego, California, United States

Who

Chaoyang Shui, yuxianzhi, Yujin Yan, Yinshan Wang, Ke Meng, Guangming Tan

Track

PPoPP 2020 Brief Announcements

Abstract

As the gap between GPU computing capability and that of other components (CPU, PCIe bus, and communication network) widens, it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large volumes of data transfer over the low-speed PCIe bus. Second, communication overhead across the network severely affects scalability. To address these issues, we advocate a paradigm shift to GPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of the new design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency outperforms the NVIDIA proprietary counterparts by 5% on average.
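
The benefit of fine-grained pipelining can be illustrated with a toy timing model. The sketch below is not the authors' implementation. It assumes a simple two-stage pipeline in which a copy engine moves data over PCIe while the GPU computes on the previously delivered chunk, and the function names (`serial_time`, `pipelined_time`) and the timing values are hypothetical, chosen only for illustration.

```python
def serial_time(transfer, compute):
    """Coarse-grained offload: copy all data first, then compute."""
    return transfer + compute


def pipelined_time(transfer, compute, chunks):
    """Makespan when the work is split into equal chunks and the copy of
    one chunk overlaps the computation on the previous chunk.

    Assumption: one copy engine and one compute engine, each processing
    chunks in order, with no per-chunk launch or latency overhead.
    """
    t_copy = transfer / chunks      # time to copy one chunk
    t_comp = compute / chunks       # time to compute on one chunk
    copy_done = 0.0                 # when the current chunk's copy finishes
    compute_done = 0.0              # when the current chunk's compute finishes
    for _ in range(chunks):
        copy_done += t_copy                               # copies run back to back
        compute_done = max(compute_done, copy_done) + t_comp  # wait for data, then compute
    return compute_done


if __name__ == "__main__":
    # Hypothetical totals: 4 time units of PCIe transfer, 8 of GPU compute.
    print(serial_time(4.0, 8.0))
    for n in (1, 4, 8):
        print(n, pipelined_time(4.0, 8.0, n))
```

In this model, finer chunks hide more of the transfer behind computation, so the total time approaches the larger of the two stage times. A real system would also pay a fixed cost per chunk, which this sketch ignores.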

Chaoyang Shui

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

yuxianzhi

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Yujin Yan

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Yinshan Wang

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Ke Meng

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences