In the early stages of scientific code development, developers (possibly scientists) tend to create a naive and easy-to-read implementation of a required algorithm. Such an implementation have a great chance to be inefficient in some performance aspect. (more...)
The fundamentals of this best-practice lie in the rule: Do not reinvent the wheel. If a developer recognizes a well-known math routine that possibly causes a bottleneck in the program, it is recommended to do a short research on available libraries that implement the recognized routine and suit the program.
Many libraries are well-established with long development history usually offer highly optimized implementations of the studied algorithms. The codes can implement bleeding-edge algorithms of a problem and hardware-specific tunings that provide higher efficiencies and performance than naive implementations.
As an example, we can mention PETSc and Trilinos toolkits that offer a wide range of scientific packages, BLAS and LAPACK providing all the necessary linear algebra routines, Intel MKL (Math Kernel Library) that is optimized for Intel architectures, NVIDIA CUDA libraries (BLAS, FFT, Math, RAND, Solvers) that are GPU-accelerated, or Pandas - python data analysis library.
This best practice was applied for the SIFEL application, where we identified a hotspot in naive manual implementation of a matrix factorization. This function took a disproportionate amount of time and was consuming more than 90 % of the runtime. Since matrix factorization is a common mathematical problem available in several libraries, we replaced the original implementation with a Pardiso library part of the Intel Math Kernel Library (MKL). Such modification required minor changes related to the transformation of the matrix into the appropriate format suitable for the library.
The library implementation of the matrix factorization showed a noticeable speedup (up to almost 30 times on tested matrices with more than forty thousand rows and columns) compared to the original implementation. The speedup was mainly caused by a distinct reduction in the number of instructions and partly by other optimizations and vector instructions. We also observed a noticeable lowering of CPU frequency that is most probably caused by a higher ratio of vector instructions.
|useful duration (total) [ms]||9.46e+06||3.57e+08||(26.47)|
|useful instructions (total)||2.75e+12||1.77e+11||(21.15)|
|average frequency [GHz]||3.26||2.40||(0.74)|
Table 1: Overview of the runtime, total useful duration, total useful instructions, IPC, and frequency with the relative ratio to the original code stated in brackets in the third column.
Recommended in program(s): SIFEL kernel (master) ·
Implemented in program(s): SIFEL kernel (factorization by math library) ·