No Barrier in the Road: A Comprehensive Study and Optimization of ARM Barriers
In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations from several abstracted models and validate them in scenarios where barriers are used intensively. We find that (1) order-preserving approaches that do not involve the bus significantly outperform the others, and (2) most of the overhead comes from barriers that directly follow remote memory references. Such barriers are usually inserted when threads exchange data: they enforce the relative order between storing the data to a shared buffer and setting a flag that informs the receiver. Based on these observations, we propose a new mechanism, Pilot, which removes such barriers by leveraging single-copy atomicity to piggyback the flag on the data. Applying Pilot yields 10%-380% performance improvements in multiple benchmarks, close to the ideal performance without barriers.
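To make the piggybacking idea concrete, the following is a minimal sketch in C++, not the paper's Pilot implementation. It assumes a platform with lock-free 64-bit atomics, where an aligned 64-bit store becomes visible as a single unit. The `Mailbox` type, the 32/32-bit split, and the use of a sequence number as the flag are hypothetical choices made for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical single-slot mailbox (illustrative only).
// High 32 bits: sequence number acting as the "ready" flag.
// Low 32 bits:  the payload.
struct Mailbox {
    std::atomic<uint64_t> word{0};  // sequence 0 means "nothing published yet"

    // The conventional pattern would be: store data; barrier; store flag.
    // Here flag and data travel in one word, so a single relaxed store
    // publishes both and no barrier is needed to order them.
    void send(uint32_t seq, uint32_t payload) {
        uint64_t packed = (static_cast<uint64_t>(seq) << 32) | payload;
        word.store(packed, std::memory_order_relaxed);
    }

    // Returns true and fills `out` once the expected sequence is visible.
    // Flag and payload come from the same load, so they cannot be mismatched.
    bool try_receive(uint32_t expected_seq, uint32_t& out) const {
        uint64_t snapshot = word.load(std::memory_order_relaxed);
        if (static_cast<uint32_t>(snapshot >> 32) != expected_seq) {
            return false;  // the sender has not published this message yet
        }
        out = static_cast<uint32_t>(snapshot);  // keep the low 32 bits
        return true;
    }
};
```

In a real exchange the sender and receiver would run on different threads, with the receiver polling `try_receive` for the next sequence number. The sketch only covers payloads that fit beside the flag in one atomically written word, and it leaves out how the sender learns that a slot may be reused.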
Wed 26 Feb. Displayed time zone: Tijuana, Baja California
09:35 - 10:50 | Concurrency and GPU (Mediterranean Ballroom) | Main Conference | Chair(s): Ang Li (Pacific Northwest National Laboratory)
09:35 | 25m | Talk | Overlapping Host-to-Device Copy and Computation using Hidden Unified Memory | Main Conference | Jaehoon Jung (Seoul National University), Daeyoung Park (Seoul National University), Youngdong Do (Seoul National University), Jungho Park (Seoul National University), Jaejin Lee (Seoul National University)
10:00 | 25m | Talk | GPU Initiated OpenSHMEM: Correct and Efficient Intra-Kernel Networking for dGPUs | Main Conference |
10:25 | 25m | Talk | No Barrier in the Road: A Comprehensive Study and Optimization of ARM Barriers | Main Conference | Nian Liu (Shanghai Jiao Tong University), Binyu Zang (Shanghai Jiao Tong University), Haibo Chen (Shanghai Jiao Tong University)