Advanced MRI Reconstruction
===========================

an Application Case Study
-------------------------

Magnetic Resonance Imaging (MRI) is a safe and noninvasive probe
of the structure and function of tissues in the body.
MRI consists of two phases:

1. Acquisition or scan: the scanner samples data in the
   spatial-frequency domain along a predefined trajectory.

2. Reconstruction of the samples into an image.

The limitations of MRI are noise, imaging artifacts,
and long acquisition times.
We have three often conflicting goals:

1. Short scan time to reduce patient discomfort.

2. High resolution and fidelity for early detection.

3. High signal-to-noise ratio (SNR).

Massively parallel computing provides a disruptive breakthrough.

Consider the mathematical problem formulation.
The reconstructed image :math:`\widehat m({\bf r})` is

.. math::

   \widehat m({\bf r}) = \sum_j W({\bf k}_j) s({\bf k}_j)
   e^{i 2\pi {\bf k}_j \cdot {\bf r}}

where

* :math:`W({\bf k})` is the weighting function to account
  for nonuniform sampling;

* :math:`s({\bf k})` is the measured *k*-space data.

The reconstruction is an inverse fast Fourier Transform on :math:`s({\bf k})`.

A Cartesian scan samples the data on a rectangular grid.
Its acquisition is slow and may result in poor images.
A spiral scan of the data, as shown schematically in
:numref:`figreconstruct_mri_two`, is faster and,
with gridding, leads to images of better quality
than the ones reconstructed in rectangular coordinates.

.. _figreconstruct_mri_two:

.. figure:: ./figreconstruct_mri_two.png
   :align: center

   A spiral scan of data to reconstruct images, LS = Least Squares.

The problem is summarized in :numref:`figsodiumbrain`,
copied from the textbook of Hwu and Kirk.

.. _figsodiumbrain:

.. figure:: ./figsodiumbrain.png
   :align: center

   Problem statement copied from Hwu and Kirk.

The mathematical problem formulation leads to an iterative
reconstruction method.
We start with the formulation of a linear least squares problem,
as a quasi-Bayesian estimation problem:

.. math::

   \widehat{\rho} = {\rm arg}~\min_{\rho}
   \underbrace{|| {\bf F} \rho - {\bf d}||_2^2}_{\rm data~fidelity}
   + \underbrace{|| {\bf W} \rho ||_2^2}_{\rm prior~info},

where

* :math:`\widehat{\rho}` contains the voxel values of the reconstructed image,

* the matrix :math:`{\bf F}` models the imaging process,

* :math:`{\bf d}` is a vector of data samples, and

* the matrix :math:`{\bf W}` incorporates prior information,
  derived from reference images.

The solution to this linear least squares problem is

.. math::

   \widehat{\rho} = \left( {\bf F}^H {\bf F} + {\bf W}^H {\bf W} \right)^{-1}
   {\bf F}^H {\bf d}.

The advanced reconstruction algorithm consists of three primary computations:

1. :math:`\displaystyle Q({\bf x}_n) = \sum_{m=1}^M | \phi({\bf k}_m) |^2
   e^{i 2 \pi {\bf k}_m \cdot {\bf x}_n}`,

   where :math:`\phi(\cdot)` is the Fourier transform of the voxel basis
   function;

2. :math:`\displaystyle \left[ {\bf F}^H {\bf d} \right]_n
   = \sum_{m=1}^M \phi^*({\bf k}_m) {\bf d}({\bf k}_m)
   e^{i 2\pi {\bf k}_m \cdot {\bf x}_n}`;

3. the conjugate gradient solver, which performs the matrix inversion to solve
   :math:`\left( {\bf F}^H {\bf F} + {\bf W}^H {\bf W} \right) \rho
   = {\bf F}^H {\bf d}` (a sketch of this iteration follows below).

The calculation for :math:`{\bf F}^H {\bf d}` is an excellent candidate
for acceleration on the GPU because of its substantial data parallelism.
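The conjugate gradient solver itself is not spelled out in these notes.
The sketch below is a minimal illustration of the textbook conjugate
gradient iteration for :math:`A \rho = b` with :math:`A` symmetric
positive definite, assuming real-valued data for simplicity;
the names ``cg_solve``, ``matvec``, and ``apply_A``, which stands in
for the operator :math:`{\bf F}^H {\bf F} + {\bf W}^H {\bf W}`,
are illustrative and not taken from the actual reconstruction code.

::

   #include <stdlib.h>
   #include <math.h>

   typedef void (*matvec)(const float *in, float *out, int n);

   static float dot ( const float *a, const float *b, int n )
   {
      float s = 0.0f;
      for(int i = 0; i < n; i++) s += a[i]*b[i];
      return s;
   }

   /* solves A*rho = b by conjugate gradients, starting at rho = 0 */
   void cg_solve ( matvec apply_A, const float *b, float *rho,
                   int n, int max_iters, float tol )
   {
      float *r = (float*) malloc(n*sizeof(float));
      float *p = (float*) malloc(n*sizeof(float));
      float *Ap = (float*) malloc(n*sizeof(float));

      for(int i = 0; i < n; i++)   /* with rho = 0, r = p = b */
      {
         rho[i] = 0.0f; r[i] = b[i]; p[i] = b[i];
      }
      float rr = dot(r, r, n);

      for(int k = 0; k < max_iters && sqrtf(rr) > tol; k++)
      {
         apply_A(p, Ap, n);                /* Ap = A*p */
         float alpha = rr/dot(p, Ap, n);   /* optimal step length */
         for(int i = 0; i < n; i++)
         {
            rho[i] = rho[i] + alpha*p[i];  /* update the solution */
            r[i] = r[i] - alpha*Ap[i];     /* update the residual */
         }
         float rr_new = dot(r, r, n);
         float beta = rr_new/rr;
         for(int i = 0; i < n; i++)        /* new search direction */
            p[i] = r[i] + beta*p[i];
         rr = rr_new;
      }
      free(r); free(p); free(Ap);
   }

Observe that the solver only needs the action of the operator on a vector:
the matrix :math:`{\bf F}^H {\bf F} + {\bf W}^H {\bf W}` is never formed
explicitly.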
Acceleration on a GPU
---------------------

The computation of :math:`{\bf F}^H {\bf d}` is defined in the code below.

::

   for(m = 0; m < M; m++)
   {
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];

      for(n = 0; n < N; n++)
      {
         expFHd = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
         cArg = cos(expFHd);
         sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

In the development of the kernel we consider the Compute to Global Memory
Access (CGMA) ratio.  A first version of the kernel follows.
As in the textbook of Hwu and Kirk, the parameter lists of the kernels
are abbreviated: every parameter before ``int N`` has type ``float*``,
and the output arrays ``rFHd`` and ``iFHd`` are omitted from the lists.

::

   __global__ void cmpFHd ( float* rPhi, iPhi, rD, iD,
                            kx, ky, kz, x, y, z,
                            rMu, iMu, int N )
   {
      int m = blockIdx.x*FHd_THREADS_PER_BLOCK + threadIdx.x;

      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];

      for(int n = 0; n < N; n++)
      {
         float expFHd = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
         float cArg = cos(expFHd);
         float sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

A first modification is the splitting of the outer loop:

::

   for(m = 0; m < M; m++)
   {
      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
   }

   for(m = 0; m < M; m++)
   {
      for(n = 0; n < N; n++)
      {
         expFHd = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
         cArg = cos(expFHd);
         sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

We convert the first loop into a CUDA kernel:

::

   __global__ void cmpMu ( float* rPhi, iPhi, rD, iD, rMu, iMu )
   {
      int m = blockIdx.x*MU_THREADS_PER_BLOCK + threadIdx.x;

      rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
      iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
   }

Because :math:`M` can be very big, we will have many threads.
For example, if :math:`M = 65,536`, with 512 threads per block,
we have :math:`65,536/512 = 128` blocks.

Then the second loop is defined by another kernel:

::

   __global__ void cmpFHd ( float* rPhi, iPhi, phiMag,
                            kx, ky, kz, x, y, z,
                            rMu, iMu, int N )
   {
      int m = blockIdx.x*FHd_THREADS_PER_BLOCK + threadIdx.x;

      for(int n = 0; n < N; n++)
      {
         float expFHd = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
         float cArg = cos(expFHd);
         float sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

Threads with different values of ``m`` now update the same locations
``rFHd[n]`` and ``iFHd[n]``.
To avoid these conflicts between threads, we interchange the inner
and the outer loops:

::

   for(n = 0; n < N; n++)
   {
      for(m = 0; m < M; m++)
      {
         expFHd = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
         cArg = cos(expFHd);
         sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

In the corresponding kernel the thread index runs over ``n``,
so each thread accumulates its own pair ``rFHd[n]``, ``iFHd[n]``.
Because every thread now reads all of the *k*-space data,
the arrays ``kx``, ``ky``, and ``kz`` are copied chunk by chunk
into constant memory, and the kernel is invoked once per chunk:

::

   __constant__ float kx_c[CHUNK_SZ], ky_c[CHUNK_SZ], kz_c[CHUNK_SZ];

   __global__ void cmpFHd ( float* rPhi, iPhi, phiMag,
                            x, y, z, rMu, iMu, int M )
   {
      int n = blockIdx.x*FHd_THREADS_PER_BLOCK + threadIdx.x;

      for(int m = 0; m < M; m++)
      {
         float expFHd = 2*PI*(kx_c[m]*x[n] + ky_c[m]*y[n] + kz_c[m]*z[n]);
         float cArg = cos(expFHd);
         float sArg = sin(expFHd);
         rFHd[n] += rMu[m]*cArg - iMu[m]*sArg;
         iFHd[n] += iMu[m]*cArg + rMu[m]*sArg;
      }
   }

   ...

   for(i = 0; i < M/CHUNK_SZ; i++)
   {
      cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SZ], 4*CHUNK_SZ);
      cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SZ], 4*CHUNK_SZ);
      cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SZ], 4*CHUNK_SZ);

      cmpFHd<<<N/FHd_THREADS_PER_BLOCK, FHd_THREADS_PER_BLOCK>>>
            (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, M);
   }

Due to size limitations of constant memory and cache, we adjust
the memory layout.  Instead of storing the components of the
*k*-space data in three separate arrays, we use an array of structs:

::

   struct kdata
   {
      float x, y, z;
   };

   __constant__ struct kdata k[CHUNK_SZ];

and then in the kernel we use ``k[m].x``, ``k[m].y``, and ``k[m].z``.

In the next optimization, we use the hardware trigonometric functions.
Instead of ``cos`` and ``sin`` as implemented in software, the hardware
versions ``__cosf`` and ``__sinf`` provide a much higher throughput.
The ``__cosf`` and ``__sinf`` are implemented as hardware instructions
executed by the special function units.
We need to be careful about a loss of accuracy.

The validation involves a perfect image:

* a reverse process to generate scanned data;

* metrics: mean square error and signal-to-noise ratios,
  computed as in the sketch after this list.
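The sketch below is a minimal illustration of the two metrics,
assuming real-valued voxel arrays; the names ``mse`` and ``snr_db``
and the decibel formulation are ours, not taken from the toolset
used for the actual validation.

::

   #include <math.h>

   /* mean square error between a reference image ref
      and a reconstruction rec, both of n voxels */
   double mse ( const float *ref, const float *rec, int n )
   {
      double sum = 0.0;
      for(int i = 0; i < n; i++)
      {
         double d = (double) ref[i] - (double) rec[i];
         sum += d*d;
      }
      return sum/n;
   }

   /* signal-to-noise ratio in decibels:
      the power of the reference over the power of the error */
   double snr_db ( const float *ref, const float *rec, int n )
   {
      double signal = 0.0;
      for(int i = 0; i < n; i++)
         signal += (double) ref[i] * (double) ref[i];

      return 10.0*log10((signal/n)/mse(ref, rec, n));
   }

The scanned data generated from the perfect image are reconstructed
and the result is compared against the perfect image with these metrics.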
The last stage is the experimental performance tuning.

Bibliography
------------

* A. Lu, I.C. Atkinson, and K.R. Thulborn.
  **Sodium Magnetic Resonance Imaging and its Bioscale of Tissue Sodium
  Concentration**.
  *Encyclopedia of Magnetic Resonance*, John Wiley and Sons, 2010.

* S.S. Stone, J.P. Haldar, S.C. Tsao, W.-m.W. Hwu, B.P. Sutton, and
  Z.-P. Liang.
  **Accelerating advanced MRI reconstructions on GPUs**.
  *Journal of Parallel and Distributed Computing* 68(10): 1307-1318, 2008.

* The IMPATIENT MRI Toolset, open source software available at .