The NVIDIA nvcc compiler driver converts .cu files into C++ for the host system and CUDA assembly (PTX) or binary instructions for the device. It supports a number of command-line options, of which the following are especially useful for optimization and related best practices:
- -arch=sm_13 or higher is required for double precision. See Single vs. Double Precision.
- -maxrregcount=N specifies the maximum number of registers kernels can use at a per-file level. See Register Pressure. (See also the __launch_bounds__ qualifier discussed in Section B.17 of the CUDA C Programming Guide to control the number of registers used on a per-kernel basis.)
- --ptxas-options=-v or -Xptxas=-v lists per-kernel register, shared, and constant memory usage.
- -use_fast_math coerces every functionName() call to the equivalent __functionName() intrinsic call, where such an intrinsic exists. This makes the code run faster at the cost of diminished precision and accuracy. See Math Libraries.
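As a rough illustration of how the options listed above interact with source code, the following sketch uses a hypothetical kernel named sineBoth in a hypothetical file sine.cu. It applies __launch_bounds__ for per-kernel register control and calls both sinf() and the __sinf() intrinsic so the effect of -use_fast_math can be observed. The kernel, the file name, the block size of 256, and the sm_70 target in the comments are assumptions made for the example, not values taken from this guide.

```cuda
// sine.cu (hypothetical file name)
//
// Possible build lines; sm_70 is an assumed target, substitute your own:
//   nvcc -arch=sm_70 -Xptxas=-v sine.cu -o sine                 (report resource usage)
//   nvcc -arch=sm_70 -Xptxas=-v -use_fast_math sine.cu -o sine  (sinf() -> __sinf())
//   nvcc -arch=sm_70 -maxrregcount=32 sine.cu -o sine           (per-file register cap)
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// __launch_bounds__(256) promises the compiler that this kernel is never
// launched with more than 256 threads per block, which lets it budget
// registers for this kernel alone rather than for the whole file.
__global__ void __launch_bounds__(256)
sineBoth(const float *in, float *precise, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(in[i]);    // replaced by __sinf() under -use_fast_math
        fast[i]    = __sinf(in[i]);  // always the fast, less accurate intrinsic
    }
}

int main()
{
    const int n = 1024;
    float *in = nullptr, *precise = nullptr, *fast = nullptr;

    // Managed memory keeps the example short; explicit copies work equally well.
    cudaMallocManaged(&in,      n * sizeof(float));
    cudaMallocManaged(&precise, n * sizeof(float));
    cudaMallocManaged(&fast,    n * sizeof(float));

    for (int i = 0; i < n; ++i)
        in[i] = 0.01f * i;

    // 256 threads per block matches the launch bound declared on the kernel.
    sineBoth<<<(n + 255) / 256, 256>>>(in, precise, fast, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Largest absolute difference between the two results on this input.
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(precise[i] - fast[i]));
    printf("max |sinf - __sinf| = %g\n", maxDiff);

    cudaFree(in);
    cudaFree(precise);
    cudaFree(fast);
    return 0;
}
```

Building with -Xptxas=-v should print the register, shared, and constant memory usage of sineBoth at compile time. Building once with and once without -use_fast_math and comparing the reported difference is one simple way to check whether the accuracy loss is acceptable for a given input range.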