原子操作 | Han's Blog Cell

1.为什么需要使用原子操作

对于一些（Read-Modify-Write）操作。在多个线程同时对同一地址进行操作时，根据线程的调度方式的不同，可能产生不同的结果（有的正确，有的错误）。而我们对于这种不确定性是绝对不能容忍的。

2.exmaple

在计算直方图问题中，如果直接采取GPU并行统计出现的次数，并将数据直接原子加到全局变量中。如下：

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &histo[buffer[i]], 1 );
        i += stride;
    }
}

性能是相当慢的，（比CPU的还慢了几倍）。
为了避免这种大量的线程往同一块较小内存中写数据的情况发生，我们利用shared memory提升性能。这种技巧并没有减少原子操作的次数，相反是增大了，同时又共享内存原子操作和全局内存原子操作。

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();
    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

正如在直方图计算中看到得，有时候依赖原子操作会带来性能问题，并且这些问题只能通过算法的某些部分进行重构来加以解决。但是在直方图实例中，我们使用了一种两阶段算法，该算法降低了在全局内存访问上竞争程度。通常，这种降低内存竞争程度的策略总能带来不错的效果，因此在使用原子操作的时候，要记住这种策略。