Poster: Revisiting Linpack Algorithm on Large-scale CPU-GPU Heterogeneous Systems
As the gap between GPU computing capability and that of other components (CPU, PCIe bus, and communication network) widens, it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading the kernel library to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, communication overhead over the network severely affects scalability. To address these issues, we advocate a paradigm shift to GPU-centric, fine-grained pipelining algorithm design. Taking the Linpack benchmark as a case study, we show the effectiveness of this new algorithm design paradigm. Our optimized Linpack program achieves 63.79 PFlops on 16384 GPUs. Its floating-point efficiency exceeds that of the NVIDIA proprietary counterparts by 5% on average.