QR decomposition on GPUs in multiple double precision

Abstract:

The aim is to compensate for the cost overhead of multiple double arithmetic using Graphics Processing Units (GPUs) capable of teraflop performance. Following the GPU acceleration of the blocked QR decomposition of [Kerr, Campbell, and Richards, GPGPU'09], the multiple double arithmetic from the QD library [Hida, Li, and Bailey, Arith-15 2001], extended with code generated by the CAMPARY software [Joldes, Muller, Popescu, and Tucker, ICMS 2016], is applied to accelerate the QR decomposition on the NVIDIA P100 and V100 GPUs. Because the problems become compute bound, teraflop performance is already observed in double double precision for matrices of dimension 1,024.
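
To make the cost overhead concrete, here is a minimal Python sketch of double double arithmetic, in which a number is stored as an unevaluated sum of two doubles (hi, lo). It uses the standard error-free transformations (Knuth's two-sum and Dekker's product with Veltkamp splitting) in their shorter, less accurate variants. This is not the code of QD or CAMPARY, and it is not the GPU code of the talk; the function names are chosen here for illustration only.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e equal to a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Same as two_sum, but valid only when |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def split(a):
    """Veltkamp split: a = hi + lo, each half short enough to multiply exactly."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e equal to a * b exactly
    (barring overflow and underflow)."""
    p = a * b
    ahi, alo = split(a)
    bhi, blo = split(b)
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
    return p, e


def dd_add(x, y):
    """Add two double doubles x = (hi, lo) and y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


def dd_mul(x, y):
    """Multiply two double doubles x = (hi, lo) and y = (hi, lo)."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return quick_two_sum(p, e)


# In plain double precision the small term is lost; the double double keeps it.
lost = 1.0 + 1e-20                        # equals 1.0
kept = dd_add((1.0, 0.0), (1e-20, 0.0))   # equals (1.0, 1e-20)
```

In this sketch, one double double addition takes about ten double operations and one multiplication about twenty. Each operation therefore does many more floating-point operations on the same data, which is consistent with the problems becoming compute bound.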

SIAM Parallel Processing 2022, Friday 25 February 2022, online

slides of the talk