This Visual Profiler Optimization Guide is a manual to help developers obtain the best performance from the NVIDIA® CUDA™ architecture using version 4.2 of the NVIDIA Visual Profiler. It presents established optimization techniques and explains coding metaphors and idioms that can greatly simplify programming for the CUDA architecture.
This guide refers to and relies on several other documents that you should have at your disposal for reference, all of which are available at no cost from the CUDA website http://www.nvidia.com/object/cuda_develop.html.
The CUDA C Best Practices Guide is an especially valuable resource, as it provides much of the content found in this optimization guide along with additional advice. Be sure to download the manual that matches the CUDA Toolkit version and operating system you are using.
Throughout this guide, specific recommendations are made regarding the design and implementation of CUDA C code. These recommendations are categorized by priority, which is a blend of the benefit of the recommendation and its scope. Actions that yield substantial improvements for most CUDA applications have the highest priority, while small optimizations that affect only very specific situations are given a lower priority.
Before implementing lower-priority recommendations, it is good practice to make sure all relevant higher-priority recommendations have already been applied. This approach will tend to provide the best results for the time invested and will avoid the trap of premature optimization.
The criteria of benefit and scope used to establish priority will vary with the nature of the program; the rankings in this guide represent a typical case, and your code might weigh these factors differently. Even so, it is good practice to verify that no higher-priority recommendations have been overlooked before undertaking lower-priority items.
Code samples throughout the guide omit error checking for conciseness. Production code should, however, systematically check the error code returned by each API call and check for failures in kernel launches (or groups of kernel launches in the case of concurrent kernels) by calling cudaGetLastError().
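For example, a minimal error-checking pattern might look like the sketch below. The CHECK_CUDA macro is a hypothetical helper introduced here for illustration and is not part of the CUDA runtime API; cudaGetErrorString(), cudaGetLastError(), and cudaDeviceSynchronize(), however, are standard runtime calls.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Hypothetical helper macro (not part of the CUDA API): checks the
       cudaError_t returned by a runtime call and aborts with a descriptive
       message on failure. */
    #define CHECK_CUDA(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    /* Trivial kernel used only to demonstrate the checking pattern. */
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data;

        CHECK_CUDA(cudaMalloc((void **)&d_data, n * sizeof(float)));

        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

        /* A kernel launch returns no error code directly: check for launch
           failures with cudaGetLastError(), and for errors that occur
           during execution by checking the result of a synchronizing
           call. */
        CHECK_CUDA(cudaGetLastError());
        CHECK_CUDA(cudaDeviceSynchronize());

        CHECK_CUDA(cudaFree(d_data));
        return 0;
    }

Because kernel execution is asynchronous with respect to the host, an error raised inside a kernel surfaces only at the next synchronizing call, which is why the sketch checks both cudaGetLastError() immediately after the launch and the return value of cudaDeviceSynchronize().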