Overlapping Host-to-Device Copy and Computation using Hidden Unified Memory
In this paper, we propose a runtime, called HUM, which hides host-to-device memory copy time without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. HUM provides wrapper functions for CUDA commands and executes host-to-device memory copy commands asynchronously. We also propose two runtime techniques. The first checks whether it is correct to make a synchronous host-to-device memory copy command asynchronous; if not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The second subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and the CUDA Code Samples, and compare their performance under HUM with that of hand-optimized implementations. The evaluation shows that the applications run, on average, 1.21 times faster under HUM than under the original CUDA runtime, comparable to the average speedup of 1.22 achieved by the hand-optimized Unified Memory implementations.
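To make the overlap concrete, the following is a minimal manual sketch of the pattern HUM automates transparently: a host-to-device copy is subdivided into smaller chunks, and each chunk's copy and kernel work are issued on the same CUDA stream so computation on early chunks overlaps with the transfer of later ones. The kernel, chunk count, and variable names here are illustrative assumptions, not part of HUM itself.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: scale each element of a chunk by a factor.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20, CHUNKS = 4, C = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned memory enables async copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) {
        cudaStreamCreate(&s[c]);
        // Copy chunk c asynchronously, then launch the kernel on the same
        // stream; the kernel starts as soon as its own chunk has arrived,
        // overlapping with the copies of the remaining chunks.
        cudaMemcpyAsync(d + c * C, h + c * C, C * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(C + 255) / 256, 256, 0, s[c]>>>(d + c * C, C, 2.0f);
    }
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);  // each element was scaled by 2.0

    for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Unlike this hand-written version, HUM performs the subdivision and round-robin scheduling inside wrapper functions, so the original synchronous `cudaMemcpy` calls in an application remain unchanged.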
Session: Wed 26 Feb, 09:35 - 10:50 (time zone: Tijuana, Baja California)