List of Programming Models

CUDA

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing platform and scalable programming model for NVIDIA graphics processing units (GPUs). It allows C/C++ and Fortran developers to write device functions called kernels, which are executed in parallel by groups of threads and can thus make efficient use of the large number of CUDA cores available on current GPUs. The CUDA programming model comprises a hierarchy of thread groups, a hierarchy of shared and private memories with separate memory spaces, and synchronization mechanisms. These abstractions provide fine-grained data and thread parallelism nested within coarse-grained data and task parallelism. For example, a problem can be partitioned into coarse sub-problems that are solved independently in parallel by blocks of threads, and each sub-problem can be further divided into finer pieces that are solved cooperatively in parallel by all threads within the block.
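The two-level decomposition into blocks and threads can be illustrated without a GPU. The following is a minimal sketch in plain C, not CUDA code: the outer loop stands in for the thread blocks and the inner loop for the threads within one block, using the global-index arithmetic that a one-dimensional CUDA kernel commonly uses (block index times block size plus thread index). The names `vector_add_emulated` and `BLOCK_DIM` are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

/* Hypothetical block size; in CUDA this would be a kernel launch parameter. */
#define BLOCK_DIM 4

/* Serial emulation of a block/thread decomposition for c = a + b. */
void vector_add_emulated(const float *a, const float *b, float *c, size_t n)
{
    /* Number of blocks needed to cover n elements (ceiling division). */
    size_t grid_dim = (n + BLOCK_DIM - 1) / BLOCK_DIM;

    /* Coarse-grained level: each block handles an independent sub-problem. */
    for (size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        /* Fine-grained level: each thread of the block handles one element. */
        for (size_t thread_idx = 0; thread_idx < BLOCK_DIM; ++thread_idx) {
            size_t i = block_idx * BLOCK_DIM + thread_idx;
            if (i < n)  /* guard: the last block may be only partially used */
                c[i] = a[i] + b[i];
        }
    }
}
```

On a GPU both loops disappear: every (block, thread) pair runs the loop body concurrently, and only the index computation and the bounds guard remain inside the kernel.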

Programs: GPU-Kernel

HIP

Cross-platform GPU programming without vendor lock-in enables greater cost efficiency and makes applications more future-proof against potential hardware changes. Applications developed with the Heterogeneous-computing Interface for Portability (HIP) (https://github.com/ROCm/HIP) can target both AMD and NVIDIA GPUs through the vendor-specific ROCm and CUDA platforms. The HIP API is designed to closely resemble CUDA. This eases the conversion of existing applications, either manually or with the HIPIFY translation tools, and shortens the learning curve for developers writing new applications from scratch. A further advantage is that HIP applications can be profiled and debugged with the AMD or NVIDIA tools of the targeted platform.

MPI

MPI (Message Passing Interface) is a portable message-passing standard for programming parallel computers. MPI provides communicators, point-to-point communication, collective communication, derived datatypes, and more recent concepts such as one-sided communication, dynamic process management, and parallel I/O. There are several well-tested and efficient implementations of MPI, such as MPICH and Open MPI. These implementations typically provide the compiler wrappers mpicc, mpic++ and mpif90 for C, C++ and Fortran, respectively.

Programs: BEM4I miniApp · BLAS Tuning · Communication computation trade-off · Communication Imbalance · Pils · DuMuX DUNE kernel · False communication-computation overlap · FFTXlib · Parallel File I/O · RankDLB · Sam(oa)²

OmpSs

OmpSs extends OpenMP with compiler directives for asynchronous parallelism and heterogeneous architectures (e.g., GPUs, FPGAs and other accelerators). It can also be understood as an extension of accelerator-based APIs such as CUDA or OpenCL. A detailed description can be found at https://pm.bsc.es/ompss.

oneAPI

oneAPI offers an open, unified programming model to simplify the development and deployment of data-centric workloads across CPUs, GPUs, FPGAs and other types of hardware architectures.

OpenACC

OpenACC (Open Accelerators) is a directive-based, high-level programming model similar to OpenMP but intended for accelerators. Parallel regions are annotated with compiler directives, which makes the code portable to a wide range of accelerators. The OpenACC accelerator model abstracts the multiple levels of parallelism of the processors and the hierarchy of memories. It allows offloading both data and computation from a host to an accelerator device. The host and the accelerator may have different architectures or the same one, with separate or shared memory spaces.
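As an illustration of decorating a loop with directives, here is a minimal sketch of an OpenACC-annotated loop in C. The function name `scale_acc` is hypothetical. The `copyin` and `copyout` data clauses express the offloading of data to and from the accelerator, and `parallel loop` offloads the computation.

```c
/* Hypothetical example: out[i] = factor * in[i], annotated for OpenACC. */
void scale_acc(float *out, const float *in, float factor, int n)
{
    /* copyin: the input array is copied to the device before the region;
       copyout: the output array is copied back to the host afterwards. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = factor * in[i];
}
```

A compiler without OpenACC support, or one invoked without the corresponding flag (for example `-fopenacc` with GCC), ignores the directive and the loop runs serially on the host with the same result.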

OpenCL

OpenCL (Open Computing Language) is a standard for programming heterogeneous systems, e.g. systems combining CPUs and GPUs, and supports both data and task parallelism. It defines abstract platform, execution, memory and programming models that describe the features and behaviour of the target system. The heterogeneous system consists of a host system and one or more OpenCL devices containing processing elements. The host application controls the communication with the devices and the in-order or out-of-order execution of compute kernel instances in the form of threads. The kernels are written in the OpenCL extensions of C or C++.

Programs: CPU to GPU

OpenMP

OpenMP is an API for shared-memory multithreading in which a primary (master) thread creates a specified number of child threads. The runtime splits the primary thread's computational task into smaller ones and distributes them among the threads, which then compute concurrently. Ideally each thread runs on a different processor core, although the actual placement is decided by the runtime and the operating system. Parts of the code that should run in parallel (parallel regions) are marked by compiler directives inserted into the code.
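A parallel region with a work-shared loop might look as follows. This is a minimal sketch, and the function name `parallel_sum` is hypothetical. The `parallel for` directive creates the team of threads and divides the loop iterations among them, while the `reduction` clause gives each thread a private partial sum that is combined when the region ends.

```c
#include <stddef.h>

/* Hypothetical example: sum of an array computed by a team of threads. */
double parallel_sum(const double *x, size_t n)
{
    double sum = 0.0;

    /* Iterations are distributed among the threads of the team;
       reduction(+:sum) combines the per-thread partial sums at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i];

    return sum;
}
```

With GCC or Clang the directive takes effect when compiling with `-fopenmp`. Without that flag it is ignored and the function runs serially.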

Programs: Alya assembly · BEM4I miniApp · CalculiX solver · Pils · FFTXlib · juKKR kloop · JuPedSim · OMP Collapse · OpenMP Critical · Sam(oa)²

OpenMP (Offload)

Since version 4.0, the OpenMP standard supports heterogeneous systems, which consist of a host architecture and one or more external accelerator devices. The host architecture is where the program begins its execution, while the target accelerators (such as GPUs) are external devices attached to the host that are capable of executing portions of the computation. A key aim of the OpenMP offload model is portability across different HPC clusters, achieved by hiding device-specific architectural details from the user.
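The host/target split can be sketched with a single offloaded loop. This is a minimal example with a hypothetical function name, `saxpy_offload`. The `target` construct moves execution of the loop to an accelerator if one is available, and the `map` clauses describe which arrays are copied to the device and which are copied back to the host.

```c
/* Hypothetical example: y = a * x + y, offloaded with OpenMP target directives. */
void saxpy_offload(float a, const float *x, float *y, int n)
{
    /* map(to: ...)     copies x to the device before the region;
       map(tofrom: ...) copies y to the device and back to the host. */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

If the program is built without offloading support, or no device is present at run time, the loop typically falls back to the host and produces the same result.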

PGAS

PGAS (Partitioned Global Address Space) assumes a global memory address space that is logically partitioned, with a portion of it local to each process, thread, or processing element. This model supports programming languages designed for productivity, which aim to reduce the time to solution, i.e. both development time and execution time. Languages based on PGAS include Unified Parallel C, Coarray Fortran, Titanium, X10 and Chapel.

POSIX threads

The Portable Operating System Interface (POSIX) defines an Application Programming Interface (API) for thread programming. Implementations of this interface exist for a large number of UNIX-like operating systems (GNU/Linux, Solaris, FreeBSD, OpenBSD, NetBSD, macOS), as well as for Microsoft Windows and others. Libraries that implement this standard, and the functions it defines, are usually called Pthreads.
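A minimal sketch of the thread API follows, using `pthread_create` and `pthread_join` to sum an array in slices. The names `threaded_sum`, `sum_slice`, `struct slice` and `NUM_THREADS` are hypothetical. Each thread writes only to its own slice record, so no locking is needed in this example.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4  /* hypothetical thread count for this sketch */

/* Work description and result slot for one thread. */
struct slice {
    const int *data;
    size_t begin, end;
    long partial;
};

/* Thread start routine: sums data[begin..end) into partial. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->partial = 0;
    for (size_t i = s->begin; i < s->end; ++i)
        s->partial += s->data[i];
    return NULL;
}

long threaded_sum(const int *data, size_t n)
{
    pthread_t threads[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    int created[NUM_THREADS];
    long total = 0;

    for (int t = 0; t < NUM_THREADS; ++t) {
        slices[t].data = data;
        slices[t].begin = n * t / NUM_THREADS;
        slices[t].end = n * (t + 1) / NUM_THREADS;
        slices[t].partial = 0;
        created[t] = (pthread_create(&threads[t], NULL, sum_slice, &slices[t]) == 0);
        if (!created[t])
            sum_slice(&slices[t]);  /* creation failed: do the work in this thread */
    }

    for (int t = 0; t < NUM_THREADS; ++t) {
        if (created[t])
            pthread_join(threads[t], NULL);  /* wait before reading the result */
        total += slices[t].partial;
    }
    return total;
}
```

With GCC or Clang, programs using Pthreads are usually compiled and linked with the `-pthread` option.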

Programs: CalculiX solver