GPU Basic Concepts Explained

1 multiprocessor <-> 1 instruction unit <-> 8 processors <-> execute one warp <-> 32 threads

Each processor handles one thread at a time, so a 32-thread warp takes 4 clock cycles to execute an instruction on a multiprocessor with 8 processors (32 / 8 = 4).
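The cycle count above is a simple ceiling division. A minimal C++ sketch of that arithmetic follows; the function name `cycles_per_warp_instruction` is made up for illustration, and the model assumes every processor retires exactly one thread per clock cycle.

```cpp
#include <cassert>

// Illustrative model only: cycles needed to push one instruction through a
// whole warp, assuming each processor handles one thread per clock cycle.
constexpr int cycles_per_warp_instruction(int threads_per_warp,
                                          int processors_per_multiprocessor) {
    // Round up so that a partially filled last pass still costs one cycle.
    return (threads_per_warp + processors_per_multiprocessor - 1) /
           processors_per_multiprocessor;
}
```

With the numbers from these notes, `cycles_per_warp_instruction(32, 8)` evaluates to 4.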

A multiprocessor can host several blocks, but a block runs on exactly one multiprocessor.

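Each block has its own shared memory, which is divided into 16 banks, exactly enough to serve one half-warp (16 threads; a full warp has 32) at once. When several threads of a half-warp request different addresses in the same bank, a bank conflict occurs: the requests are queued and served one after another. That access is sequential, no better than on a CPU, and throws away the GPU's advantage, so avoid it where possible. The exception is when the threads read the same address: the bank then broadcasts the word and no serialization occurs.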
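To make the bank rule concrete, here is a rough host-side C++ model of how one half-warp's addresses map onto the 16 banks. It is a simplified sketch, not a description of the actual hardware scheduler: it assumes successive 4-byte words go to successive banks, counts distinct words per bank, and treats threads that hit the same word as one broadcast. The names `bank_of`, `serialized_passes` and `strided_addresses` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <set>

constexpr int kNumBanks = 16;   // banks per shared memory, as stated in these notes
constexpr int kHalfWarp = 16;   // threads served together
constexpr int kWordBytes = 4;   // successive 4-byte words map to successive banks

// Bank that holds the word at a given shared-memory byte address.
int bank_of(std::uint32_t byte_address) {
    return static_cast<int>((byte_address / kWordBytes) % kNumBanks);
}

// Simplified model: number of serialized passes for one half-warp access.
// 1 means conflict-free; threads reading the same word count once (broadcast).
int serialized_passes(const std::array<std::uint32_t, kHalfWarp>& byte_addresses) {
    int worst = 1;
    for (int bank = 0; bank < kNumBanks; ++bank) {
        std::set<std::uint32_t> words;  // distinct words requested from this bank
        for (std::uint32_t address : byte_addresses) {
            if (bank_of(address) == bank) {
                words.insert(address / kWordBytes);
            }
        }
        worst = std::max(worst, static_cast<int>(words.size()));
    }
    return worst;
}

// Addresses for thread t reading word t * stride_words (hypothetical helper).
std::array<std::uint32_t, kHalfWarp> strided_addresses(std::uint32_t stride_words) {
    std::array<std::uint32_t, kHalfWarp> addresses{};
    for (int t = 0; t < kHalfWarp; ++t) {
        addresses[t] = static_cast<std::uint32_t>(t) * stride_words * kWordBytes;
    }
    return addresses;
}
```

Under this model a stride of 1 word is conflict-free, a stride of 2 words needs 2 passes, and a stride of 16 words puts every thread in bank 0 and needs 16 passes. A stride of 0 (all threads read the same word) counts as a single broadcast.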

A single processor still handles only one thread at a time; the GPU outpaces a CPU because it has many processors running threads in parallel.

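Coalescing: global memory transactions are issued per half-warp (16 threads). For a half-warp's accesses to be coalesced, the address accessed by the first thread must be a multiple of 64 bytes, and the following 15 threads must access the consecutive addresses after it, one to one. Only then can the accesses be combined into a single batched global memory transaction; otherwise each thread issues its own transaction, which is very inefficient.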
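The coalescing condition can be written down as a small check. The C++ sketch below models only the rule stated here, assuming 4-byte accesses (16 threads x 4 bytes = 64 bytes) and that all 16 threads take part; `is_coalesced` and `half_warp_addresses` are names invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kHalfWarpThreads = 16;
constexpr std::uint64_t kAccessBytes = 4;     // assumption: each thread reads a 4-byte word
constexpr std::uint64_t kSegmentBytes = 64;   // 16 threads * 4 bytes

// True when the first address is 64-byte aligned and thread k accesses the
// k-th consecutive word after it (simplified model of the rule above).
bool is_coalesced(const std::array<std::uint64_t, kHalfWarpThreads>& byte_addresses) {
    if (byte_addresses[0] % kSegmentBytes != 0) {
        return false;  // base address is not a multiple of 64 bytes
    }
    for (int k = 0; k < kHalfWarpThreads; ++k) {
        const std::uint64_t expected =
            byte_addresses[0] + static_cast<std::uint64_t>(k) * kAccessBytes;
        if (byte_addresses[k] != expected) {
            return false;  // thread k does not map one-to-one onto word k
        }
    }
    return true;
}

// Addresses for thread k accessing base + k * stride_bytes (hypothetical helper).
std::array<std::uint64_t, kHalfWarpThreads> half_warp_addresses(std::uint64_t base,
                                                                std::uint64_t stride_bytes) {
    std::array<std::uint64_t, kHalfWarpThreads> addresses{};
    for (int k = 0; k < kHalfWarpThreads; ++k) {
        addresses[k] = base + static_cast<std::uint64_t>(k) * stride_bytes;
    }
    return addresses;
}
```

For instance, a base of 128 with a 4-byte stride passes the check, while a base of 132 (misaligned) or an 8-byte stride (gaps between threads) fails it.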

- Each active thread is allocated some registers for the entire lifetime of the thread
- Each active block is allocated shared memory for the entire lifetime of the block
  - Very fast on-chip memory
  - Can be used to avoid non-coalesced global memory accesses
  - Can be used to reduce global memory accesses
  - Shared memory is organized into 16 banks, where successive 4-byte words are assigned to successive banks
  - A memory load or store of n addresses by a half-warp that spans n distinct memory banks can be serviced simultaneously
  - If multiple addresses map to the same memory bank, the accesses are serialized
  - If multiple threads request the same memory address, a broadcast occurs
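Because registers are held per thread and shared memory per block for their whole lifetimes, these two allocations limit how many blocks a multiprocessor can keep active at once. The C++ sketch below shows that reasoning only; the struct `MultiprocessorLimits`, the function `resident_blocks`, and any limit values passed in are assumptions for illustration, and real hardware also applies allocation granularity that is ignored here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor resource limits; take real values from
// your device's documentation rather than from this sketch.
struct MultiprocessorLimits {
    int registers;            // total registers available
    int shared_memory_bytes;  // total shared memory available
    int max_blocks;           // hard cap on resident blocks
};

// Rough estimate of how many blocks fit on one multiprocessor at a time.
int resident_blocks(const MultiprocessorLimits& limits,
                    int threads_per_block,
                    int registers_per_thread,
                    int shared_bytes_per_block) {
    int blocks = limits.max_blocks;

    // Registers are allocated per thread, so a block needs threads * registers.
    const int registers_per_block = threads_per_block * registers_per_thread;
    if (registers_per_block > 0) {
        blocks = std::min(blocks, limits.registers / registers_per_block);
    }

    // Shared memory is allocated per block.
    if (shared_bytes_per_block > 0) {
        blocks = std::min(blocks, limits.shared_memory_bytes / shared_bytes_per_block);
    }
    return blocks;
}
```

With illustrative limits of 8192 registers, 16384 bytes of shared memory and at most 8 blocks, a kernel using 256 threads per block, 10 registers per thread and 4096 bytes of shared memory per block would be register-bound at 3 resident blocks.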

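Threads in the same block share that block's shared memory, and only threads in the same block can synchronize with each other via __syncthreads(). Block-level cooperation is thus tied to shared memory. Global memory is visible to every thread but has no notion of blocks: it does not group threads into blocks whose members cooperate.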


In device code, a variable declared without a memory-space qualifier normally lives in register memory.