GPUDirect RDMA: How it Works and More
What is NVIDIA GPUDirect® RDMA?
NVIDIA GPUDirect RDMA (remote direct memory access) is a part of the NVIDIA Magnum IO™ family of technologies and allows users to rapidly transfer data between GPUs and other devices without involving the CPU itself or any host memory.
In a traditional data transfer scenario, data moves from the GPU to system memory, then to the NIC, and finally to the destination device. This process involves multiple copies of the data and the CPU's participation, which increases latency and reduces overall system efficiency.

In contrast, the GPUDirect RDMA data path is optimized so data can be transferred directly between GPU memory and the NIC or other RDMA-capable devices without going through system memory or the CPU. This direct path to device memory significantly reduces latency and frees up the CPU for other tasks.
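To see what the direct path removes, here is a minimal sketch of the traditional staging pattern, assuming the CUDA runtime on a Linux host; send_over_network is a hypothetical stand-in for the real NIC send path.

```c
// Minimal sketch of the traditional staging ("bounce buffer") pattern
// that GPUDirect RDMA eliminates, assuming the CUDA runtime on Linux.
// send_over_network() is a hypothetical stand-in for the real NIC send
// path (e.g., a socket or verbs send).
#include <cuda_runtime.h>
#include <stdio.h>

static void send_over_network(const void *buf, size_t len) {
    (void)buf;
    printf("sending %zu bytes from host memory\n", len);
}

static void staged_send(const void *gpu_buf, size_t len) {
    void *bounce = NULL;
    cudaMallocHost(&bounce, len);                              // pinned host buffer
    cudaMemcpy(bounce, gpu_buf, len, cudaMemcpyDeviceToHost);  // the extra copy
    send_over_network(bounce, len);                            // NIC consumes host memory
    cudaFreeHost(bounce);
}

int main(void) {
    const size_t len = 1 << 20;   // 1 MiB payload
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);
    staged_send(gpu_buf, len);
    cudaFree(gpu_buf);
    return 0;
}
```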
GPUDirect RDMA offers a few basic benefits we’ll discuss in more detail below:
- Lower latency. Because transfers bypass the CPU and system memory, data moves with minimal delay.
- Higher bandwidth. By eliminating unnecessary data copies, the system can achieve higher data throughput.
- Reduced CPU overhead. Excluding the CPU from data transfer frees it up to handle other tasks — a special advantage in parallel computing environments.
GPUDirect RDMA is particularly useful for high-performance computing (HPC), data analytics, and AI workloads. It is designed to run on Tesla and Quadro-class GPUs.
GPUDirect RDMA Design
What does NVIDIA GPUDirect RDMA require to transfer data directly between GPU memory and other RDMA-capable devices such as NICs and storage controllers?
Here is a basic overview of design, hardware, and software components for NVIDIA GPUDirect RDMA.
Hardware-level integrations for GPUDirect RDMA include:
- GPU and NIC. Specialized hardware integration between NVIDIA GPUs and RDMA-capable NICs relies on Peripheral Component Interconnect Express (PCIe) peer-to-peer (P2P) memory access and communication capabilities.
- Memory mapping. GPU memory is mapped using memory-mapped I/O (MMIO) and advanced PCIe features so the NIC can access it directly without system memory involvement.
- Coherent data transfer. The hardware ensures data is correctly synchronized across all components.
Software stack requirements for GPUDirect RDMA include:
- Compute Unified Device Architecture (CUDA) drivers. NVIDIA CUDA drivers enable critical memory mapping and direct data transfer capabilities and support GPUDirect RDMA.
- RDMA APIs. The system supports RDMA, typically with a stack such as OpenFabrics Enterprise Distribution (OFED). RDMA APIs allow applications to initiate and manage direct memory access transfers.
- Operating system support for necessary GPU drivers and APIs. Linux is commonly used in environments that leverage GPUDirect RDMA due to its robust support for HPC and GPU computing.
System requirements for GPUDirect RDMA include:
- Compatible NVIDIA GPUs. The system must support GPUDirect RDMA, typically with the high-performance GPUs used in data centers, such as the NVIDIA A100, V100, or similar models.
- RDMA-capable NIC. A network interface card (NIC) that supports RDMA, such as one using InfiniBand, RoCE (RDMA over Converged Ethernet), or iWARP, is required because the NIC must interact directly with GPU memory.
- PCI express (PCIe) configuration. The system must support PCIe with P2P communication capabilities to allow the GPU and NIC to communicate directly. PCIe is closely related to GPUDirect RDMA and facilitates high-performance computing, particularly in environments where GPUs are heavily used, such as in data centers, scientific computing, and AI training.
- Compatible motherboard and CPU. The system should have sufficient PCIe lanes and bandwidth to handle high-speed data transfers without bottlenecks. Although not directly involved in the data transfer, the CPU should still be powerful enough to manage the overall system efficiently.
- Software environment. A Linux distribution such as Ubuntu, CentOS, or Red Hat is required, with kernel module support for RDMA and the NVIDIA GPUDirect RDMA drivers.
- Application support. Applications need to be designed or modified to take advantage of GPUDirect RDMA. This usually involves using specific libraries or APIs that support RDMA and direct GPU memory access.
Workflow in a GPUDirect RDMA System:
- Registration. The application registers GPU memory with the RDMA NIC, making that memory accessible for direct transfers (see the sketch after this list).
- Initiation. An application or process triggers the data transfer using RDMA APIs.
- Direct data transfer. The NIC directly reads or writes data to the GPU memory over the PCIe bus, bypassing the CPU and system memory.
- Completion handling. The application is notified of the transfer completion, and the data can be used by the GPU or another process.
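As a concrete illustration of the registration step, here is a minimal sketch using the CUDA runtime and the libibverbs API from OFED. It assumes a Linux host with an RDMA-capable NIC and the nvidia-peermem (formerly nv_peer_mem) kernel module loaded; device selection and error handling are simplified.

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    const size_t len = 1 << 20;   // 1 MiB transfer buffer
    void *gpu_buf = NULL;

    // 1. Allocate device memory; this is the region the NIC will access.
    cudaMalloc(&gpu_buf, len);

    // 2. Open the first RDMA device and create a protection domain.
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    // 3. Register the GPU buffer with the NIC. With nvidia-peermem
    //    loaded, ibv_reg_mr accepts a GPU pointer and pins it for
    //    peer-to-peer PCIe access.
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("GPU buffer registered: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    // From here, RDMA reads/writes posted to a queue pair can target
    // gpu_buf directly, bypassing host memory (steps 2-4 of the workflow).

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```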
GPUDirect RDMA CUDA
GPUDirect RDMA CUDA (Compute Unified Device Architecture) integrates RDMA technology into CUDA, NVIDIA's parallel computing platform and programming model, so that CUDA-based applications can interact directly with RDMA-capable devices.
How GPUDirect RDMA CUDA works:
CUDA Integration
CUDA applications can leverage GPUDirect RDMA with specific CUDA APIs and libraries that support this functionality. This enables the application to initiate and manage RDMA operations directly from the GPU, bypassing the CPU for data transfers.
For example, a CUDA application can use RDMA APIs to register GPU memory with an RDMA-capable NIC. Once registered, the NIC can directly read from or write to GPU memory, enabling efficient data exchange between the GPU and other devices or remote GPUs across a network.
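For example, before registering an allocation with the NIC, NVIDIA's GPUDirect RDMA documentation recommends forcing synchronous memory operations on it. Below is a minimal sketch using the CUDA driver API; the helper name is our own.

```c
#include <cuda.h>

/* Hypothetical helper: enable synchronous memory operations on a GPU
 * allocation before it is registered for RDMA, as recommended for
 * GPUDirect RDMA. Assumes a CUDA context is already current. */
static int enable_sync_memops(CUdeviceptr gpu_ptr) {
    unsigned int flag = 1;
    CUresult rc = cuPointerSetAttribute(&flag,
                                        CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                                        gpu_ptr);
    return rc == CUDA_SUCCESS ? 0 : -1;
}
```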
Use cases for GPUDirect RDMA CUDA applications:
Applications include multi-node training setups for distributed machine learning and scientific simulations for high-performance computing (HPC) environments. Other uses include applications that demand real-time data processing, such as financial trading systems or real-time video analytics.
GPUDirect RDMA Benchmark
There are several key GPUDirect RDMA benchmarks to understand, as well as multiple factors that can affect benchmarking outcomes.
Typical GPUDirect RDMA benchmark results:
- GPUDirect RDMA latency. For small messages, latencies can be as low as 2-5 microseconds.
- GPUDirect RDMA bandwidth. This ranges from 10 to 25 GB/s depending on the PCIe generation and the specific hardware configuration.
- Message rate vs. message size. High message rates for small messages indicate efficient handling of many transactions per second, which is crucial for distributed workloads.
Multiple factors influence benchmark results:
- PCIe generation. PCIe Gen4 offers higher bandwidth than Gen3, improving performance (a simple way to baseline a system's PCIe bandwidth is sketched after this list).
- GPU and NIC model. The specific GPU and NIC models, along with their firmware and driver versions, can significantly impact performance.
- System configuration. The overall system setup, including the number of GPUs, network topology, and CPU architecture, will influence benchmark outcomes.
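Because PCIe link speed often bounds these numbers, a pinned-memory copy test can serve as a rough baseline when interpreting GPUDirect RDMA results. A minimal sketch using the CUDA runtime; the 256 MiB buffer size is an arbitrary choice.

```c
// Minimal PCIe bandwidth baseline using pinned host memory and CUDA
// events. Not an RDMA benchmark; it only shows what the PCIe link
// itself can sustain on a given system.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t len = 256UL << 20;   // 256 MiB
    void *host = NULL, *dev = NULL;
    cudaMallocHost(&host, len);       // pinned host buffer
    cudaMalloc(&dev, len);

    // Warm-up copy so one-time initialization costs are excluded.
    cudaMemcpy(dev, host, len, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, len, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device bandwidth: %.1f GB/s\n",
           (len / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```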
GPUDirect RDMA Examples
NVIDIA GPUDirect RDMA is employed in various HPC and data-intensive applications in which minimizing latency and maximizing data throughput are critical. Here are some GPUDirect RDMA examples:
- Distributed deep learning and AI training. In large-scale deep learning tasks, models are often trained across GPUs over nodes in a distributed cluster. GPUDirect RDMA efficiently shares model parameters and gradients between GPUs across these nodes for training deep neural networks (DNNs) on tasks such as image recognition or natural language processing.
- Scientific simulations and HPC. Climate modeling and fluid or molecular dynamics often require the processing power of multiple GPUs. These simulations involve large datasets that must be shared and processed in parallel across different computing nodes. GPUDirect RDMA can transfer particle data directly between GPUs on different nodes, so scientists can simulate even complex molecular systems more quickly.
- Real-time data analytics and financial services. Applications such as high-frequency trading require ultra-low-latency processing of market data to make split-second decisions. GPUDirect RDMA can transfer data directly from network interfaces so that incoming data can be processed rapidly and trades can be executed faster relative to competitors.
- Telecommunications and 5G networks. Advanced systems that support high-bandwidth applications like video streaming, augmented reality, and IoT require fast, efficient processing of network data.
- Autonomous vehicles and robotics. These use GPUDirect RDMA to process sensor data from cameras, LIDAR, and RADAR in real-time for object detection, path planning, and decision-making.
GPUDirect RDMA vs GPUDirect Storage
Both GPUDirect Storage and GPUDirect RDMA improve performance without burdening the CPU, but they achieve that goal differently.
GPUDirect Storage specifically enables direct data transfers between storage devices and GPU memory, optimizing data access for GPU-accelerated applications regardless of where the storage sits.

In contrast, GPUDirect RDMA is a broader technology that enables direct memory access between GPU memory and other devices, including NICs and remote systems, for high-speed transfers across the network.
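To make the storage side of the comparison concrete, here is a minimal sketch using the cuFile API from libcufile, the user-space interface to GPUDirect Storage (available in recent CUDA toolkits). It assumes a GDS-enabled filesystem, and error handling is abbreviated.

```c
#define _GNU_SOURCE    // for O_DIRECT
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>

int read_file_into_gpu(const char *path, void *gpu_buf, size_t len) {
    cuFileDriverOpen();

    int fd = open(path, O_RDONLY | O_DIRECT);
    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    // Storage-to-GPU DMA: data lands in gpu_buf without a bounce
    // through host memory.
    ssize_t n = cuFileRead(fh, gpu_buf, len, /*file_offset=*/0,
                           /*devPtr_offset=*/0);

    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return n < 0 ? -1 : 0;
}
```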
GPUDirect RDMA vs NVLink
Like NVIDIA GPUDirect RDMA, NVLink enhances data transfer and communication in systems utilizing NVIDIA GPUs. However, it differs in a few significant ways:
- GPUDirect RDMA communicates directly with other devices over the PCIe bus, so its bandwidth is bounded by what PCIe can deliver; NVLink is an alternative to PCIe for GPU-to-GPU or CPU-to-GPU communication within the same node.
- NVLink is designed specifically for high-speed, low-latency communication, effectively creating a “mesh” of GPUs that can share memory and data rapidly. The mesh enables peer-to-peer access between multiple GPUs with very high bandwidth and low latency, communicating at much higher speeds than PCIe (see the sketch after this list).
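A quick way to check whether two GPUs in the same node can use this direct peer-to-peer path (over NVLink when present, otherwise PCIe P2P) is the standard CUDA runtime query below; it assumes at least two GPUs, indexed 0 and 1.

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    printf("GPU0 -> GPU1 peer access: %s\n", can_access ? "yes" : "no");

    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
        // cudaMemcpyPeer() now moves data GPU-to-GPU without staging
        // through host memory.
    }
    return 0;
}
```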
Benefits of GPUDirect RDMA
The benefits of GPUDirect remote direct memory access include:
- Reduced latency. Bypassing the CPU in data transfers reduces latency and can deliver 2 to 8 times more bandwidth, resulting in faster data access, more rapid reads and writes, and improved performance.
- Increased throughput. GPUDirect RDMA optimizes data transfer paths, allowing higher throughput between devices such as NICs or storage and GPU memory and enhancing overall system efficiency.
- Improved scalability. GPUDirect RDMA enables efficient data access in multi-GPU and distributed computing environments, supporting large-scale applications.
- Lower CPU overhead. Direct data transfers between devices and GPU memory reduce CPU overhead, freeing up CPU resources for other tasks.