CUDA Kernel Study

Vector Addition
Vectorized Copy
Matrix Transpose
Warp Shuffle Intrinsics
LDG
Shared Memory
Parallel Reduction
Pipelining
SGEMM

References

CUDA C Programming Guide
Inline PTX Assembly