The latest version of the CUDA Toolkit, 11.8, is released by NVIDIA.

NVIDIA has released CUDA Toolkit 11.8, the latest version of the toolkit. This release focuses on enhancing the programming model and on speeding up CUDA applications through new hardware capabilities.

New architecture-specific features in NVIDIA Hopper and NVIDIA Ada Lovelace are initially being exposed through library and framework updates. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 series.

CUDA 11.8 includes several important features. This post gives an overview of the key capabilities.

Support for the NVIDIA Hopper and NVIDIA Ada Lovelace architectures

CUDA applications benefit immediately from the increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates of the new GPU families.

CUDA and its libraries expose new performance optimizations based on improvements to the GPU hardware architecture.
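To make use of the new architectures, an application generally has to be compiled for them. The sketch below builds the corresponding nvcc `-gencode` arguments. It assumes compute capability 9.0 (`sm_90`) for NVIDIA Hopper and 8.9 (`sm_89`) for NVIDIA Ada Lovelace; the `ARCHS` table and the `gencode_flags` helper are hypothetical names introduced here for illustration, not part of the toolkit.

```python
# Assumed compute capabilities for the two architectures discussed above.
ARCHS = {"hopper": 90, "ada": 89}


def gencode_flags(names, embed_ptx=True):
    """Return nvcc arguments that target the named architectures.

    One native (SASS) target is emitted per architecture. If embed_ptx is
    true, PTX for the newest requested architecture is embedded as well,
    so that later GPUs can JIT-compile the kernels.
    """
    flags = []
    for name in names:
        cc = ARCHS[name]
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    if embed_ptx and names:
        newest = max(ARCHS[name] for name in names)
        flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    return flags


if __name__ == "__main__":
    # Hypothetical source file name, shown only to illustrate the command line.
    print(" ".join(["nvcc", *gencode_flags(["hopper", "ada"]), "app.cu"]))
```

The returned list can be passed directly to a build script or to `subprocess.run`; whether embedding PTX is desirable depends on how the application is distributed.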

Lazy module loading

Building on the lazy kernel loading capability introduced in 11.7, NVIDIA has extended lazy loading to the CPU module side. As a result, functions and libraries load faster on the CPU, sometimes with significant memory footprint savings. The trade-off is a small amount of latency at the point in the programme where the functions are first loaded. Overall, this is lower than the total latency without lazy loading.

To be eligible for lazy loading, all libraries used with it must be built with CUDA 11.7 or later.

In this release, lazy loading is not enabled by default in the CUDA stack. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
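As a minimal sketch of how the variable can be set for a single run without changing the parent environment, the Python wrapper below launches a command with `CUDA_MODULE_LOADING=LAZY`. The helper names `lazy_loading_env` and `run_with_lazy_loading` are hypothetical, and the command you pass in is assumed to be your own CUDA application.

```python
import os
import subprocess
import sys


def lazy_loading_env(base=None):
    """Return a copy of the environment with lazy module loading requested."""
    env = dict(os.environ if base is None else base)
    env["CUDA_MODULE_LOADING"] = "LAZY"
    return env


def run_with_lazy_loading(cmd):
    """Run cmd (a list of arguments) with CUDA_MODULE_LOADING=LAZY set."""
    return subprocess.run(cmd, env=lazy_loading_env(), check=False)


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable it receives;
    # replace it with the command line of a real CUDA application.
    child = "import os; print(os.environ.get('CUDA_MODULE_LOADING'))"
    run_with_lazy_loading([sys.executable, "-c", child])
```

Running the same application once with and once without the variable, and comparing start-up time and memory use, is a simple way to judge whether lazy loading helps.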

Improved MPS signal handling

Applications running in Multi-Process Service (MPS) environments can now be terminated with SIGINT or SIGKILL without affecting other running processes. Although this is not full error isolation, the improvement makes it possible to control applications at a finer granularity, which is especially useful in bare-metal data centre environments.
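A supervisor script might use these two signals as follows: send SIGINT first, and escalate to SIGKILL only if the process does not exit within a grace period. This is a sketch for POSIX systems with nothing MPS-specific in it; the `stop_client` helper is a hypothetical name, and the child process in the demo is a stand-in for a CUDA application running as an MPS client.

```python
import signal
import subprocess
import sys


def stop_client(proc, grace_seconds=5.0):
    """Ask a child process to stop with SIGINT, then force it with SIGKILL.

    Returns the exit status reported by the child. On POSIX, a negative
    value -N means the process was terminated by signal N.
    """
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # Popen.kill() sends SIGKILL on POSIX
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a long-running MPS client application.
    client = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", stop_client(client))
```

Trying SIGINT first gives the application a chance to clean up, while the SIGKILL fallback puts a bound on how long shutdown can take.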

Simplified NVIDIA JetPack installation

NVIDIA JetPack provides a complete development environment for hardware-accelerated AI at the edge on Jetson platforms. Starting with CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions, and so keep up with the CUDA desktop releases, without also updating the NVIDIA JetPack version or the Jetson Linux BSP (board support package).

Updates to CUDA developer tools

Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.

To aid in the optimization of CUDA kernels, Nsight Compute lets you inspect low-level performance metrics, debug API calls, and analyse workloads. CUDA 11.8 adds new compute features to support performance tuning on the NVIDIA Hopper architecture.

You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance gains and more control over the GPU. Cluster tuning is being released together with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper mechanism for fast data transfer between global and shared memory.

Nsight Compute for CUDA 11.8 also includes a new sample. The sample provides source code together with precollected results that walk you through every step of identifying and fixing an uncoalesced memory access problem. Explore additional CUDA samples to learn how to use toolkit features and solve similar cases in your own application.
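To illustrate why uncoalesced accesses are costly, the following back-of-the-envelope model counts how many memory sectors a single warp-wide load touches for a given stride. It is a simplified sketch and is not taken from the Nsight Compute sample; it assumes 32 threads per warp, 4-byte elements, and 32-byte sectors, and `sectors_touched` is a hypothetical helper name.

```python
def sectors_touched(stride_elems, warp_size=32, elem_bytes=4, sector_bytes=32):
    """Count the distinct memory sectors one warp-wide load touches.

    Thread t is assumed to read the element at index t * stride_elems,
    starting from a sector-aligned base address.
    """
    addresses = [t * stride_elems * elem_bytes for t in range(warp_size)]
    return len({addr // sector_bytes for addr in addresses})


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {sectors_touched(stride)} sectors")
```

Under these assumptions, a unit-stride load touches 4 sectors, while a stride of 8 or more elements touches 32, one per thread, so the same amount of useful data costs eight times as many memory transactions.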

Profiling with Nsight Systems can shed light on problems such as GPU starvation, unnecessary GPU synchronisation, insufficient CPU parallelization, and expensive algorithms across CPUs and GPUs. Understanding these patterns, and the behaviour and load of deep learning frameworks such as PyTorch and TensorFlow, helps you tune your models and parameters to improve overall single-GPU or multi-GPU utilisation.

In summary, CUDA 11.8 is the first release to support NVIDIA Hopper and NVIDIA Ada Lovelace GPUs. Lazy module loading now covers CPU-side modules in addition to device-side kernels. The release also brings improved MPS signal handling for interrupting and terminating applications, simplified NVIDIA JetPack installation, and updates to the CUDA developer tools.
