Wed 26 Feb 2020 10:25 - 10:50 - Concurrency and GPU
Chair(s): Ang Li

In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other ones, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag with the data. Applying Pilot provides 10%-380% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.
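As a rough illustration of the piggybacking idea described in the abstract, the sketch below packs a flag and a small payload into one naturally aligned 64-bit word. A single store then publishes both, so no barrier is needed to order a data store before a flag store. This is not the authors' Pilot implementation. The names `PiggybackSlot`, `pack`, `send`, and `try_receive` are hypothetical, as are the 32/32-bit split and the use of a sequence number as the flag. It also assumes `std::atomic<std::uint64_t>` is lock-free on the target, so that the store is single-copy atomic.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical slot layout (an assumption, not taken from the paper):
// high 32 bits = sequence number acting as the "flag",
// low 32 bits  = payload acting as the "data".
struct PiggybackSlot {
    std::atomic<std::uint64_t> word{0};  // sequence 0 means "nothing sent yet"
};

// Combine flag and data into one 64-bit value.
inline std::uint64_t pack(std::uint32_t seq, std::uint32_t payload) {
    return (static_cast<std::uint64_t>(seq) << 32) | payload;
}

// Sender: a single relaxed 64-bit store publishes flag and data together.
// Because the store is single-copy atomic, a reader sees either the old
// word or the new word, never a mix, so no barrier separates "data" and "flag".
inline void send(PiggybackSlot& slot, std::uint32_t seq, std::uint32_t payload) {
    slot.word.store(pack(seq, payload), std::memory_order_relaxed);
}

// Receiver: one relaxed load; the payload is valid only if the sequence
// number carried in the same word matches the one the receiver expects.
// Returns true and writes the payload to `out` on a match.
inline bool try_receive(const PiggybackSlot& slot,
                        std::uint32_t expected_seq,
                        std::uint32_t& out) {
    const std::uint64_t w = slot.word.load(std::memory_order_relaxed);
    if (static_cast<std::uint32_t>(w >> 32) != expected_seq) {
        return false;  // flag not set for this round yet
    }
    out = static_cast<std::uint32_t>(w);  // low 32 bits carry the payload
    return true;
}
```

In this sketch the sender starts at sequence number 1 and increments it for each message, while the receiver polls `try_receive` with the number it expects next. Payloads wider than the bits left over after the flag would have to be split across several such words, which this sketch does not attempt.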

09:35 - 10:50: Main Conference - Concurrency and GPU (Mediterranean Ballroom)
Chair(s): Ang Li (Pacific Northwest National Laboratory)
Jaehoon Jung (Seoul National University), Daeyoung Park (Seoul National University), Youngdong Do (Seoul National University), Jungho Park (Seoul National University), Jaejin Lee (Seoul National University)
Khaled Hamidouche (Advanced Micro Devices (AMD)), Michael LeBeane (Advanced Micro Devices (AMD))
Nian Liu (Shanghai Jiao Tong University), Binyu Zang (Shanghai Jiao Tong University), Haibo Chen (Shanghai Jiao Tong University)