
As the gap between GPU computing capability and that of other components (CPU, PCIe bus, and communication network) widens, it is increasingly challenging to design high-performance parallel algorithms for large CPU-GPU heterogeneous systems. There are two main reasons. First, simply offloading kernel library calls to the GPU incurs large-volume data transfers over the low-speed PCIe bus. Second, communication overhead across the network severely limits scalability. To address these issues, we advocate a paradigm shift to GPU-centric, fine-grained pipelined algorithm design. Taking the Linpack benchmark as a case study, we demonstrate the effectiveness of this new design paradigm. Our optimized Linpack program achieves $63.79$ PFlops on 16384 GPUs. Its floating-point efficiency outperforms NVIDIA's proprietary counterpart by $5\%$ on average.
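To make the intuition behind fine-grained pipelining concrete, the following is a minimal timing sketch, not the authors' algorithm. It assumes an idealized two-stage pipeline (data transfer, then GPU compute) in which the work is split into equal chunks and the transfer of one chunk fully overlaps with the compute of the previous one. The function name `pipelined_time` and all numbers are hypothetical and chosen only for illustration.

```python
def pipelined_time(total_transfer: float, total_compute: float, n_chunks: int) -> float:
    """Estimated makespan of an idealized two-stage pipeline.

    Assumption (illustrative model only): the work splits into n_chunks
    equal pieces, and transferring chunk i+1 overlaps perfectly with
    computing chunk i. With n_chunks == 1 there is no overlap at all.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    t = total_transfer / n_chunks  # transfer time per chunk
    c = total_compute / n_chunks   # compute time per chunk
    # First transfer and last compute cannot be hidden; the slower stage
    # sets the pace for the remaining n_chunks - 1 steps.
    return t + c + (n_chunks - 1) * max(t, c)


if __name__ == "__main__":
    # Hypothetical workload: 4 time units of transfer, 8 of compute.
    for n in (1, 4, 8, 64):
        print(n, pipelined_time(4.0, 8.0, n))
```

Under this model, finer chunks push the total time toward `max(total_transfer, total_compute)`, which is the sense in which a finer-grained pipeline hides more of the transfer cost. Real systems add per-chunk launch and latency overheads that this sketch ignores.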