Microsoft Research Customer Use Case: WekaIO™ and NVIDIA® GPUDirect® Storage Results with NVIDIA DGX-2™ Servers

Bob Bakh. October 20, 2020

WekaIO™ (Weka), in partnership with Microsoft Research, produced some of the highest aggregate NVIDIA® GPUDirect® Storage (GDS) throughput numbers of any storage solution tested to date. Using a single NVIDIA DGX-2 server* connected to a WekaFS™ cluster over a Mellanox InfiniBand switch, the testers achieved 97.9 GB/s of throughput to the server's 16 NVIDIA V100 GPUs using GPUDirect Storage. This level of performance was verified by running the NVIDIA GDSIO utility for more than 10 minutes and showing sustained performance over that duration.

Throughout testing, the engineers observed that the WekaFS system was not fully loaded. To maximize performance, they ran additional GDSIO processes on the same client server, this time using all 10 NICs and all 20 ports. (Note that the single-port NICs that ship with the DGX-2 were replaced with dual-port NICs in order to fully utilize the available PCIe bandwidth in the system.) This placed a full load on the DGX-2 GPUs and also engaged the CPUs. This second test configuration showed that a DGX-2 server with a single mount point to the WekaFS system could achieve 113.13 GB/s of throughput.

Both tests were run by Microsoft, which has the WekaIO file system (WekaFS) deployed in conjunction with multiple NVIDIA DGX-2 servers in a staging environment. Microsoft agreed to run performance measurements in that environment to determine what performance levels could be achieved by combining the new WekaFS code, version 3.8, with the GPUDirect Storage feature in its current DGX-2 environment. Microsoft Research was impressed with the test outcome and plans to upgrade its production environment to the newest version of the WekaIO file system, which is generally available and fully supports NVIDIA® Magnum IO, which includes NVIDIA GPUDirect Storage. (See “GPUDirect Storage: A Direct Path Between Storage and GPU Memory.”)

What Is GPUDirect® Storage?

GPUDirect Storage is a groundbreaking technology from NVIDIA that allows storage partners like WekaIO to develop solutions offering two significant benefits. The first is CPU bypass. Traditionally, the CPU loads data from storage into a bounce buffer in system memory before it reaches the GPUs, which can bottleneck application performance because the CPU is limited in the number of concurrent tasks it can run. GPUDirect Storage creates a direct path from storage to GPU memory, bypassing the CPU complex and freeing the sometimes-overburdened CPU resources on GPU servers for compute rather than storage, thereby potentially eliminating bottlenecks and improving real-time performance. The second is increased aggregate storage bandwidth: GPUDirect Storage allows storage vendors to deliver considerably more throughput. As witnessed with WekaFS, GPUDirect Storage enabled the testers to achieve the highest throughput of any solution tested to date.
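Applications reach this direct path through NVIDIA's cuFile API, which ships with GPUDirect Storage. The following is a minimal sketch of a GDS read into GPU memory, not a reproduction of the test harness used here; the file path is hypothetical, and it assumes a GDS-capable file system mount plus a CUDA toolkit with cuFile installed (link with `-lcufile -lcudart`):

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include "cufile.h"

int main(void) {
    const size_t size = 1 << 20;               /* 1 MiB read */
    cuFileDriverOpen();                        /* initialize the GDS driver */

    /* O_DIRECT avoids the page cache so DMA can target GPU memory. */
    int fd = open("/mnt/weka/data.bin", O_RDONLY | O_DIRECT);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* Register the POSIX fd with cuFile. */
    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    /* Allocate GPU memory and register it as a cuFile buffer. */
    void *devPtr = NULL;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);

    /* DMA straight from storage into GPU memory -- no CPU bounce buffer. */
    ssize_t n = cuFileRead(handle, devPtr, size, /*file_offset=*/0,
                           /*devPtr_offset=*/0);
    printf("read %zd bytes directly into GPU memory\n", n);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

Note the design point the sketch makes concrete: the CPU only orchestrates the transfer; the data itself moves by DMA from the NIC-attached storage path into GPU memory.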

How GPUDirect® Storage and WekaFS Impact AI Performance

Generally, when designing AI/ML environments, the most relevant consideration is the overall pipeline time. This can include the initial extract, transform, and load (ETL) phase as well as the time it takes to copy the data to the local GPU server, or possibly only the time it takes to train the model on the data. Storage performance improves overall pipeline time by accelerating, or completely removing the need for, some of these steps. In these tests, Microsoft saw that the GPUDirect Storage and WekaFS solution enables the GPUs on a server to ingest data at the speed they require. Moreover, the GPU server now has additional CPU cores available for its compute workloads, whereas before those cores would have been busy performing storage I/O.

 

| File System | Number of NICs | Throughput (GB/s) | IOPS (4 KB random read) |
| --- | --- | --- | --- |
| Local NVMe (RAID 0, 8 drives) | Local only | 20.6 | 1,900,000 |
| WekaFS | 1 | 23 | 400,000 |
| WekaFS with GPUDirect Storage | 10 | 113 | 5,000,000 |

Table 1: Performance metrics comparing local NVMe, WekaFS, and WekaFS with GPUDirect Storage

The value for organizations has become clear. The combination of WekaFS and NVIDIA GPUDirect Storage allows customers to use their current GPU environments to their maximum potential, as well as to accelerate the performance of their future AI/ML or other GPU workloads. Data scientists and engineers can derive the full benefit from their GPU infrastructure and can concentrate on improving their models and applications without being limited by the storage performance and idle GPUs.

NVIDIA GPUDirect Storage Webinar: Register here for a replay

Weka and NVIDIA: Weka AI and NVIDIA Accelerate AI Data Pipelines

WekaFS: 10 Reasons to Deploy the WekaFS™ Parallel File System

Weka Blog: How GPUDirect Storage Accelerates Big Data Analytics

Get Information about WekaFS or Schedule a Free Trial

 

* The NVIDIA DGX-2 server used was a non-standard configuration, with the single-port NICs replaced by dual-port NICs.
