Cuda Beginning

##前言

由于一直在学习图形学，很多时候，图形学中的计算，如最近碰到的问题，计算随机点的中垂面，每个点的计算K近邻树，都是独立的。如果用GPU并行加速，可能会有好的提速效果。再加上导师最想希望我和实验室学长开始做一个基于Cuda的项目，故想开始学CUDA。

结构目录

Cuda是什么

Cuda程序的编译过程

我的第一个Cuda程序

利用GPU加速Matrix Multiplication

优化策略1 : shared memory

优化策略2 : texture memroy

总结

Cuda是什么

CUDA(Compute Unified Device Architecture)，是显卡厂商NVIDIA推出的运算平台。 CUDA™是一种由NVIDIA推出的通用并行计算架构，该架构使GPU能够解决复杂的计算问题。而我们关心的事情有2点：

Cuda是对GPU运算能力的并行利用；

Cuda很好的兼容了C/C++，在此之上做了扩展；

听到过一个不错的比喻：　
CPU可以类比于一个强壮的男人，随机现在硬件能力的不断提升，单个CPU已经拥有了很强大的能力，甚至CPU也有了多核，同样也有很多基于CPU的多线程编程。
而GPU则是一群小孩，每一个能力不大，但是由于GPU可以同时开启成千上万个线程，<线程格，线程块，线程>，因此可以同时做很多事情，极大的提升程序的运行效率，这就是并行的好处。

tips : GPU是不适合做太多的逻辑判断的。它更希望能做一些简单的计算工作。

Cuda程序的编译过程

Cuda程序一般分为两个部分，其中一部分是用Nvidia的编译器进行编译，在GPU上运行；另一部分，用原本的VS（我使用的编译器是Visual Studio）编译，在CPU上跑。

Cuda C是支持C/C++语言的。它只对C语言做了一个很小的扩展并且提供了一个C runtime library.
想要知道Cuda是怎么运行的，我们首先要知道Cuda程序的编译过程。

NVCC 的工作流主要分下面几步：

1，将程序中的host code 和 device code 区别开来。
2，将device code进行转化可装配形式（assembly form (PTX code)），进而转化成2进制流，用于交给GPU处理。
3，将Host code中不符合C语言标准的代码进行替换，然后按照正常的编译过程进行编译链接，在CPU中处理。

Kernel 核函数

核函数是作用在GPU上，用____global____定义，相当于是cuda提供给C/C++的一个接口，每次可以同时调用多个核函数，每个线程都会执行这个核函数，线程块和线程的数量可以自己根据计算的规模和硬件条件来设定。核函数是cuda编程的核心，在之后的例子中我们也能看到。

Initialization 初始化过程：

一般来说，Cuda并没有一个明确的开始标识，当第一个runtime function被调用的时候，GPU section就被初始化了。在初始化的过程中，会创建一个cuda contexet，初始创建的这个cuda环境是主要环境，被所有的host所共享。*cudaDeviceReset()*可以destory当前的上下文，直至下一个runtime function被调用时，将重新创建primary context。

Device Memory 设备内存：

设备内存相当于是GPU中的内存，一般的分配方法有两种，线性内存或者cuda arrays，其中cuda arrays是和纹理内存相关的，我们后续再谈。
线性内存通常使用*cudaMalloc()*分配，*cudaFree()*释放，*cudaMemcpy()*在主机和设备之间传递数据。

我的第一个Cuda程序

给出一个最基本的例子，也是在Visual Studio上创建一个.cu文件时，会自动生成的一个示例代码。

输入：两个长度一样的数组，假设长度为5；
输出：一个结果数组，数组里面的每一个数是2个输入数组里对应位置的和。

CPU思路，这是一道简单得不能再简单的题，通常CPU中的算法必然是定义一个循环，遍历2个输入数组并将他们对应元素的和相加，得到结果数组。

for(int i = 0; i < 5; ++i)
{
    _c[i] = a[i] + b[i];
}

可以看到，CPU中进行的是五次串行运算。

那么回到我们想要实现的GPU并行计算。之前也提到过，GPU是通过调用核函数来调用进程的。

 // Launch a kernel on the GPU with one thread for each element.
  addKernel<<<1, 5>>>(dev_c, dev_a, dev_b);
    
    __global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

可以大致想象，我们开启了5个线程，然后通过thread.x来获取所在线程的信息，同时计算结果数组C，以此来实现并行的效果。当然，在GPU中需要单独分配和释放内存，同时数据也会在CPU和GPU中来回传递，这个Cuda都已经为我们提供好了相应的接口，后续我们会继续介绍。

利用GPU加速Matrix Multiplication

1，矩阵乘法的CPU思路

在这里我是默认读者已经有基本的矩阵运算基础的。所以不难想到，对于CPU的矩阵乘法算法如下：
由于，对于C中的每一个数，都要经过n次乘法运算，因此总的时间复杂度应该是O(n^3)。
对于计算两个1024*1024的输入方阵，CPU串行计算时间大概在9000ms左右。

//CPU version 其中C是结果矩阵，AB是输入矩阵。
void MatrixMulCPU(float *_C, const float* _A, const float* _B, int WA, int HA, int WB, int HB)
{
	if (WA != HB)
	{
		printf("the matrix A and B cannot be multipled!");
			exit(0);
	}
	for (int i = 0; i < HA; ++i)
	{
		for (int j = 0; j < WB; ++j)
		{
			for (int k = 0; k < WA; ++k)
			{
				_C[i * WA + j] += _A[i * WA + k] * _B[k * WB + j];
			}
		}
	}
}

2，GPU优化思路。

借矩阵乘法的GPU算法，我在这里完整地说一下GPU调用核函数的方法。
首先申请的是CPU上的内存，也即是host memory。

   const int width_A = 1024;
const int height_A = 1024;
const int width_B = 1024;
const int height_B = 1024;
float *B = (float *)malloc(sizeof(float) * height_B * width_B);
float *A = (float *)malloc(sizeof(float) * height_A * width_A);
float *C = (float *)malloc(sizeof(float) * height_A * width_B);

我是给了A,B矩阵0-100的随机数。

//产生随机数生成器
	srand((unsigned)time(0));
	randomInit(B, height_B * width_B);
	randomInit(A, height_A * width_A);

之后申请device memory，并且把A, B所在内存拷贝过GPU上。

   float *dev_a = 0;
float *dev_b = 0;
float *dev_c = 0;
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
   cudaStatus = cudaSetDevice(0);
   
   cudaStatus = cudaMalloc((void**)&dev_c, HA * WB * sizeof(float));
   
   cudaStatus = cudaMemcpy(dev_a, a, HA * WA * sizeof(float), cudaMemcpyHostToDevice);
   cudaStatus = cudaMemcpy(dev_b, b, HB * WB * sizeof(float), cudaMemcpyHostToDevice);

到这一步后，我们的准备工作就算是做好了，之后的工作是自己来设定，我们需要在GPU上开启多少个线程，这个你可以把它也视为核函数的一个参数，并且把这些device上的内存地址当作参数传递给核函数。

在这个基本的GPU算法中，我们为C的每一行每一列都设置一个线程。

//为每一个C[i][j]设置一个线程进行计算
int block_size = 16;
dim3 Threads(block_size, block_size);
dim3 Blocks(WB / block_size, HA / block_size);
//用这个方式来调用核函数
MatrixMulGPU_1 << <Blocks, Threads >>>(dev_c, dev_a, dev_b, WA, WB);

核函数，我们通过blockIdx和threadIdx索引获得当前是哪个线程，来计算它所负责的部分。

__global__ void MatrixMulGPU_1(float *c, const float *a, const float *b, unsigned int WA, unsigned int WB)
{
	float sum = 0;
	//找出该线程所在的行和列
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	//线程Thread(row, col)负责计算C(row, col)
	for (int i = 0; i < WB; ++i)
	{
		sum += a[row * WA + i] * b[i * WB + col];
	}
	c[row * WB + col] = sum;
}

可以看到，每个线程只进行了N次运算，而每个线程之前是并行的，所以时间理论上只是O(n)，虽然在拷贝内存上还需要花费一些时间，但是从O(n^3)->O(n)这个提升也是巨大的，对时间的统计数据显示也是如此。从9000ms提升到50ms左右。

优化政策1，shared memory

那么在实现了基本的矩阵GPU算法后，我们还有没有什么方法继续优化我们的算法，使得速度进一步提高呢？结果是肯定的。在之前的算法中，我们的每一个C[i][j]都算了n次，而在这n次取数的过程中，还有我们可以做的很多事情。首先我们可以对每一行每一列进行细分，划分成更细的block_size大小（可以自己来设定），然后再计算每个block_size*block_size时，我们设计一个共享区，然后让数的计算存这个共享区中取数，就可以减少取数的时间。下面是相应的代码，可以在英伟达的标准实例代码中找到：

template<int BLOCK_SIZE> __global__ void MatrixMulGPU_2(float *c, const float *a, const float *b, unsigned int WA, unsigned int WB)
{
	// Block index
	int bx = blockIdx.x;
	int by = blockIdx.y;
	// Thread index
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	// Index of the first sub-matrix of A processed by the block
	int aBegin = WA * BLOCK_SIZE * by;
	// Index of the last sub-matrix of A processed by the block
	int aEnd = aBegin + WA - 1;
	// Step size used to iterate through the sub-matrices of A
	int aStep = BLOCK_SIZE;
	// Index of the first sub-matrix of B processed by the block
	int bBegin = BLOCK_SIZE * bx;
	// Step size used to iterate through the sub-matrices of B 
	int bStep = BLOCK_SIZE * WB;
	// Csub is used to store the element of the block sub-matrix
	// that is computed by the thread
	float Csub = 0;
	// Loop over all the sub-matrices of A and B
	// required to compute the block sub-matrix
	for (int i = aBegin, j = bBegin;
		i <= aEnd;
		i += aStep, j += bStep)
	{
		// Declaration of the shared memory array As used to
		// store the sub-matrix of A
		__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
		// Declaration of the shared memory array Bs used to
		// store the sub-matrix of B
		__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
		// Load the matrices from device memory
		// to shared memory; each thread loads
		// one element of each matrix
		As[ty][tx] = a[i + WA * ty + tx];
		Bs[ty][tx] = b[j + WB * ty + tx];
		// Synchronize to make sure the matrices are loaded
		__syncthreads();
		// Multiply the two matrices together;
		// each thread computes one element
		// of the block sub-matrix
#pragma unroll
		for (int k = 0; k < BLOCK_SIZE; ++k)
		{
			Csub += As[ty][k] * Bs[k][tx];
		}
		// Synchronize to make sure that the preceding
		// computation is done before loading two new
		// sub-matrices of A and B in the next iteration
		__syncthreads();
	}
	// Write the block sub-matrix to device memory;
	// each thread writes one element
	int k = WB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
	c[k + WB * ty + tx] = Csub;
}

其中syncthreads()是用于保证同步的，每一个线程在运行到这一步时，都会停下来等待，直至所有进程都进行到这才会继续下去，以此来保证同步。

优化政策2， Texture Memory

纹理这一块，我的理解是，一般的内存都是线性存放的，不论是二维或者是三维，但存放的本质就是串行线性。故在访问例如 a[1][2]和a[1][3]时，比访问a[1][2]和a[2][2]时，后者的耗时要多一些。但对于纹理，就不会出现这样的问题，以此来提高访问效率。

首先，先声明纹理

1 2	texture<float, 2, cudaReadModeElementType> texA; texture<float, 2, cudaReadModeElementType> texB;

然后以cudaArray为介质，将内存中的数放到纹理中

//GPU mode3 with texture memory
		cudaChannelFormatDesc channelDescA = cudaCreateChannelDesc((int)sizeof(float) * 8, 0, 0, 0, cudaChannelFormatKindFloat);
		cudaChannelFormatDesc channelDescB = cudaCreateChannelDesc((int)sizeof(float) * 8, 0, 0, 0, cudaChannelFormatKindFloat);
		cudaArray* mat_A;
		cudaArray* mat_B;
		cudaMallocArray(&mat_A, &channelDescA, width_A, height_A);
		cudaMallocArray(&mat_B, &channelDescB, width_B, height_B);
		cudaMemcpyToArray(mat_A, 0, 0, A, sizeof(float) * height_A * width_A, cudaMemcpyHostToDevice);
		cudaMemcpyToArray(mat_B, 0, 0, B, sizeof(float) * height_B * width_B, cudaMemcpyHostToDevice);

接着设置纹理的相关属性

texA.addressMode[0] = cudaAddressModeWrap;
texA.addressMode[1] = cudaAddressModeWrap;
texA.filterMode = cudaFilterModePoint;
texA.normalized = false;
texB.addressMode[0] = cudaAddressModeWrap;
   texB.addressMode[1] = cudaAddressModeWrap;
texB.filterMode = cudaFilterModePoint;
texB.normalized = false;

最后纹理的相关核函数如下:

__global__ void MatrixMul(float *c, unsigned int w, unsigned int h)
{
	float sum = 0;
	//找出该线程所在的行和列
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	/*
	//计算纹理坐标
	float u = row / (float)w;
	float v = col / (float)h;
	//线程Thread(row, col)负责计算C(row, col)
	u -= 0.5f;
	v -= 0.5f;
	*/
	for (int i = 0; i < w; ++i)
	{
		sum += tex2D(texA, i, row) * tex2D(texB, col, i);
	}
	c[row * w + col] = sum;
}

可以看到用2维纹理来访问我们相关的矩阵，是非常方便的，只需要用tex2D()就行。

最后是这些方法的截图。

注：在release版本中，使用texture优化的时间比正常的GPU算法要慢，这个现象由于矩阵乘法本身对取数是有一定规律性的，而texture优化对一些随机访问内存算法比较有效，在此我们不深究，只为学习纹理内存用。