Overlapping Host-to-Device Copy and Computation using Hidden Unified Memory
In this paper, we propose a runtime, called HUM, which hides host-to-device memory copy time without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and fault mechanisms. HUM provides wrapper functions for CUDA commands and executes host-to-device memory copy commands asynchronously. We also propose two runtime techniques. The first checks whether it is correct to make a synchronous host-to-device memory copy command asynchronous; if not, HUM makes the host computation or the kernel computation wait until the memory copy completes. The second subdivides consecutive host-to-device memory copy commands into smaller memory copy requests and schedules the requests from different commands in a round-robin manner. As a result, kernel execution can be scheduled as early as possible to maximize the overlap. We evaluate HUM using 51 applications from Parboil, Rodinia, and the CUDA Code Samples, and compare their performance under HUM with that of hand-optimized implementations. The evaluation shows that the applications run, on average, 1.21 times faster under HUM than under the original CUDA runtime, comparable to the average speedup of 1.22 achieved by the hand-optimized Unified Memory implementations.
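To make the overlap concrete, the following is a minimal manual sketch of the pattern HUM automates transparently: a host-to-device copy is subdivided into smaller chunks, and each chunk's copy and kernel work are issued on the same CUDA stream so computation on early chunks overlaps with the transfer of later ones. The kernel, chunk count, and variable names here are illustrative assumptions, not part of HUM itself.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy kernel: scale each element of a chunk by a factor.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20, CHUNKS = 4, C = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned memory enables async copies
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) {
        cudaStreamCreate(&s[c]);
        // Copy chunk c asynchronously, then launch the kernel on the same
        // stream; the kernel starts as soon as its own chunk has arrived,
        // overlapping with the copies of the remaining chunks.
        cudaMemcpyAsync(d + c * C, h + c * C, C * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(C + 255) / 256, 256, 0, s[c]>>>(d + c * C, C, 2.0f);
    }
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);  // each element was scaled by 2.0

    for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Unlike this hand-written version, HUM performs the subdivision and round-robin scheduling inside wrapper functions, so the original synchronous `cudaMemcpy` calls in an application remain unchanged.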
Session: Wed 26 Feb, 09:35 - 10:50 (time zone: Tijuana, Baja California)