Device Memory Spaces

CUDA devices use several memory spaces, which have different characteristics that reflect their distinct usages in CUDA applications. These memory spaces include global, local, shared, constant, and texture memory, as well as registers, as shown in Figure 1.

Figure 1. Memory Spaces on a CUDA Device

Of these different memory spaces, global and texture memory are the most plentiful; see Section F.1 of the CUDA C Programming Guide for the amounts of memory available in each memory space at each compute capability level. Global, local, and texture memory have the greatest access latency, followed by constant memory, registers, and shared memory.

The principal traits of the various memory types are summarized in Table 1.

Table 1. Salient Features of Device Memory
Memory     Location on/off chip   Cached   Access   Scope                    Lifetime
Register   On                     n/a      R/W      1 thread                 Thread
Local      Off                    Yes†     R/W      1 thread                 Thread
Shared     On                     n/a      R/W      All threads in block     Block
Global     Off                    Yes†     R/W      All threads + host       Host allocation
Constant   Off                    Yes      R        All threads + host       Host allocation
Texture    Off                    Yes      R        All threads + host       Host allocation
Note: † Cached only on devices of compute capability 2.x.
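The qualifiers that place data in each of these spaces can be illustrated in a short kernel. The following is a minimal sketch, not code from the guide; the variable and kernel names are illustrative.

```cuda
__constant__ float coeff[16];        // constant memory: cached, read-only in device code
__device__   float globalBuf[256];   // global memory: visible to all threads and the host API

__global__ void memorySpacesExample(const float *in, float *out)
{
    __shared__ float tile[128];      // shared memory: on-chip, scoped to the thread block

    int   i   = blockIdx.x * blockDim.x + threadIdx.x;  // scalars typically live in registers
    float tmp = in[i];               // 'tmp' has thread scope and thread lifetime

    tile[threadIdx.x] = tmp * coeff[0];
    __syncthreads();                 // shared memory is coherent within the block after this

    out[i] = tile[threadIdx.x] + globalBuf[i % 256];    // 'in'/'out' point into global memory
}
```

Note that registers and shared memory reside on-chip and are only live for the duration of a thread or block, respectively, whereas global and constant memory persist across kernel launches for the lifetime of the host allocation, matching the Scope and Lifetime columns of Table 1.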

If a texture reference is bound to a linear array (or, as of version 2.2 of the CUDA Toolkit, a pitch-linear array) in global memory, then device code can write to the underlying array. Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided: the texture caches are read-only and are not invalidated when the associated global memory is modified, so such reads may return stale data.
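The hazard can be made concrete with the texture reference API. The sketch below is illustrative, assuming a texture reference already bound to d_data with cudaBindTexture(); the kernel names are hypothetical.

```cuda
// Texture reference bound on the host to a float array in global memory,
// e.g. cudaBindTexture(NULL, texRef, d_data, n * sizeof(float));
texture<float, 1, cudaReadModeElementType> texRef;

// UNSAFE: this kernel reads d_data through the texture cache while also
// writing d_data in the same launch. The read-only texture cache is not
// invalidated by the writes, so threads may fetch stale values.
__global__ void scaleInPlace(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = 2.0f * tex1Dfetch(texRef, i);  // read/write hazard
}

// SAFE: write to a separate output array, so the memory backing the
// texture is never modified during the launch and cached reads stay
// consistent.
__global__ void scaleToOutput(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = 2.0f * tex1Dfetch(texRef, i);
}
```

Separating the output array (or launching a second kernel after the writes complete) guarantees that no texture fetch in a launch can observe a partially updated cache line.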