OpenMP (Offload)

Since version 4.0, the OpenMP standard has supported heterogeneous systems, which consist of a host and one or more attached accelerator devices. The host is where the program begins execution, while the target devices (such as GPUs) are accelerators attached to the host that can execute portions of the computation. A key aim of the OpenMP offload model is performance portability across different HPC clusters: it hides device-specific architectural details from the user.

OpenMP offloads work to these accelerators with the target construct, which transfers both code and data from the host to the target device for execution. OpenMP also provides device-specific API routines for querying device information, managing device data, and managing thread hierarchies. In addition, the standard defines environment variables, read when the program starts, that configure how the device executes kernels (for example, which device is used by default).
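As a minimal sketch of the target construct and a device-query routine, the C fragment below offloads a simple vector update. The function names available_devices and saxpy_offload are hypothetical and chosen only for illustration. The map clauses state which arrays are copied to the device and which are copied back. The sketch assumes an OpenMP 4.5 or later compiler; the header guard lets it also build without OpenMP, in which case the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of offload devices the runtime can see.
 * Returns 0 when the code is built without OpenMP support. */
int available_devices(void)
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical example kernel: y = a * x + y, offloaded with a target construct.
 * map(to: ...)     copies x to the device before the region runs.
 * map(tofrom: ...) copies y to the device and back to the host afterwards. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Calling saxpy_offload(4, 2.0f, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y = {12, 24, 36, 48}. If no device is available, the target region by default falls back to host execution. The standard environment variables OMP_DEFAULT_DEVICE and, since OpenMP 5.0, OMP_TARGET_OFFLOAD influence device selection and this fallback behavior.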

The typical workflow for executing kernels on a device involves three main steps: 1) the host maps its data to the target device’s data environment; 2) the host offloads one or more OpenMP target regions to the device, which can reuse the mapped data across regions; and 3) the host copies the computed results back from the device.
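The three steps can be sketched in C with a target data region that encloses two target regions. This is an illustrative example under stated assumptions, not a prescribed pattern: the function name scale_then_sum is hypothetical, and the code assumes an OpenMP 4.5 or later compiler. Because the array is already present on the device inside the target data region, the inner map clauses do not trigger additional transfers of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale v in place on the device, then sum it there.
 * Returns the sum of the scaled values. */
double scale_then_sum(size_t n, double factor, double *v)
{
    assert(v != NULL);
    double sum = 0.0;

    /* Step 1: map v into the device data environment (copied in here).
     * Step 3: tofrom also copies v back to the host when the region ends. */
    #pragma omp target data map(tofrom: v[0:n])
    {
        /* Step 2a: first target region; v is already on the device. */
        #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
        for (size_t i = 0; i < n; ++i)
            v[i] *= factor;

        /* Step 2b: second target region reuses the same device copy of v.
         * Only the scalar sum is transferred for the reduction result. */
        #pragma omp target teams distribute parallel for \
            map(tofrom: v[0:n]) map(tofrom: sum) reduction(+: sum)
        for (size_t i = 0; i < n; ++i)
            sum += v[i];
    }

    return sum;
}
```

For v = {1, 2, 3} and factor = 2, the call returns 12 and leaves v = {2, 4, 6} on the host. Keeping both loops inside one target data region avoids copying v between host and device after the first loop, which is the main benefit of reusing the data environment.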