Is Your Data Infrastructure Starving Your GPU-Driven AI? 

Robert Murphy | April 8, 2022
Your Data Infrastructure Might Be the Culprit.

AI is rapidly becoming the de facto standard for high-performance applications and workloads in modern organizations of every kind. According to a 2021 study conducted by PwC, 86% of companies surveyed said that AI is becoming a “mainstream technology” at their company. A 2021 WIRED article notes that NVIDIA GPUs were deployed in 97.4% of AI accelerator instances at the top four cloud providers (AWS, Google, Alibaba, and Azure) in 2019. Global independent research firm Omdia projects that annual market revenue for AI processors will reach $37.6 billion by 2026.

Organizations have tremendous expectations for AI – and a lot is riding on getting it right. From driving better patient and healthcare outcomes to enhancing customer experiences, from increasing worker productivity to strengthening physical security, and from improving the safety of self-driving cars to machine maintenance, many business and societal advancements rely on the successful deployment and application of AI.

Why GPUs?

GPUs might be best known for their role in producing rich graphic imagery and immersive video games, but they do so much more. Programmable, general-purpose GPUs play an essential role in powering high-performance computing, satellite imagery analysis, and life sciences innovation and discovery, to name only a few fields. They are especially adept at number crunching because each of the thousands of cores in a single GPU can perform calculations simultaneously; for comparison, a high-end CPU has between 8 and 128 cores. What makes GPUs particularly well suited to AI is training neural networks prior to deployment, which requires running vast amounts of data through a model.
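To make the contrast concrete, here is a toy sketch in plain Python (not real GPU code; the core count and the squaring operation are purely illustrative) of the idea that a GPU sweeps through data in wide parallel "waves," while a single CPU core steps through it one element at a time:

```python
# Conceptual sketch only: a GPU applies the same operation across
# thousands of elements at once; a CPU core processes them sequentially.

def cpu_style(values):
    # a single core works through the elements one after another
    return [v * v for v in values]

def gpu_style(values, cores=1024):
    # conceptually, the work is split into "waves" as wide as the core
    # count; every element within a wave is computed simultaneously
    out = []
    for i in range(0, len(values), cores):
        wave = values[i:i + cores]
        out.extend(v * v for v in wave)  # one parallel wave of results
    return out

print(cpu_style([1, 2, 3]))                # [1, 4, 9]
print(gpu_style(list(range(6)), cores=2))  # [0, 1, 4, 9, 16, 25]
```

Both functions produce identical results; the point is that the GPU-style version finishes an entire wave in the time a CPU core spends on a single element.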

Suppose you want to train an AI system to recognize and identify dogs. You might need to show the deep learning model 15 million images of dogs before it converges on a reliably accurate identification solution. This scale of data is a marked departure from previous GPU applications, which typically involved running internal calculations over constrained, local data sets and outputting the results.

The GPU bottleneck lurking in your data infrastructure 

It used to be widely believed that people use only 10% of their potential brain capacity at any given time. While that claim has been thoroughly debunked for humankind, the old saw may well have applicability in the realm of AI and neural networks: GPUs can spend much of their time sitting idle. If so, this presents a significant hurdle for organizations looking to deploy GPU-accelerated AI to support critical operations.

Once upon a time, GPUs processed only local datasets. Today, GPUs process massive amounts of data from disparate locations. In the case of our dog example above, your GPUs must look at a picture of a dog, identify relationships within the image, and then quickly move on to the next, processing sometimes millions of images until a full pass over the training set is complete (this pass is called an “epoch” for good reason). The GPUs move so fast that they constantly demand more input.
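In rough terms, one epoch is just a loop over every training sample. The Python sketch below is illustrative only: `train_step` stands in for a real forward/backward pass and is not any actual framework's API, but it shows why millions of images mean millions of back-to-back data requests.

```python
# Hedged sketch of a deep-learning "epoch": one full pass over the
# training set. train_step is a placeholder, not a real framework call.

def train_step(model, image, label):
    # stand-in for a forward/backward pass; here we just count updates
    model["updates"] += 1
    return model

def run_epoch(model, dataset):
    # the GPU consumes every sample once per epoch -- any delay fetching
    # the next sample leaves the GPU waiting with nothing to compute
    for image, label in dataset:
        model = train_step(model, image, label)
    return model

model = {"updates": 0}
dataset = [("img_%d.jpg" % i, "dog") for i in range(1000)]
model = run_epoch(model, dataset)
print(model["updates"])  # one update per sample: 1000
```

Scale that loop to 15 million images per epoch, repeated for many epochs, and the rate at which storage can serve the next sample becomes as important as the GPU's raw compute speed.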

This shift from computationally driven applications to data-driven deep learning is where your GPUs' potential can get bogged down. While your data infrastructure is retrieving the next image and shunting it across the network to local storage, your GPUs are effectively doing nothing, twiddling their thumbs and operating far below their potential. As a result, your organization benefits from only a small percentage of the actual capabilities of its GPUs.
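One common way data pipelines try to hide that retrieval latency is prefetching: loading the next samples in the background while the current one is being processed. The stdlib-only Python sketch below is illustrative, not a real data loader; the bounded queue stands in for the buffer between storage and compute, and the counter stands in for GPU work.

```python
# Illustrative sketch of prefetching: a background thread keeps a bounded
# queue of "loaded" samples full so the consumer (standing in for the GPU)
# spends less time waiting on storage. Sizes and names are hypothetical.
import queue
import threading

def loader(samples, q):
    for s in samples:
        q.put(s)      # in a real pipeline: a storage/network read
    q.put(None)       # sentinel: no more data

def consume(q):
    processed = 0
    while True:
        item = q.get()
        if item is None:
            break
        processed += 1  # stand-in for a training step on the GPU
    return processed

q = queue.Queue(maxsize=8)  # bounded buffer between storage and compute
t = threading.Thread(target=loader, args=(range(100), q))
t.start()
count = consume(q)
t.join()
print(count)  # 100
```

Prefetching helps, but only as far as the underlying storage can keep the buffer full; if reads are slow, the queue drains and the GPU stalls anyway, which is the bottleneck described above.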

Achieving AI objectives will be challenging without a system that can manage and deliver the data needed to train and sustain models. Traditional storage methods cannot meet the voracious demands of your AI-in-training. Numerous factors, including inadequate storage performance, data processing and management issues, data movement challenges, and the need to serve multiple data systems working with GPUs, can all contribute to the problem. Worse, an organization may not even know it isn't realizing the full potential of its GPUs.

The first data platform built for GPU-driven AI

Organizations have turned to local storage in the past, but this is no longer a feasible option when processing data with AI at scale. Just as organizations have moved away from traditional compute to support GPU acceleration, the time has come to move away from traditional local storage for high-performance workloads. GPU-led data science needs a data platform that is purpose-built to support it.

The WEKA Data Platform for AI collapses the conventional “multi-hop” data pipelines that starve modern GPU workloads of data into a single, zero-copy, high-performance data platform for AI. Its parallel data delivery works in harmony with the GPU's parallel architecture, providing direct access to data to optimize utilization and dramatically reduce AI training time.

Incorporating the WEKA Data Platform for AI into deep learning data pipelines dramatically increases data transfer rates to NVIDIA GPU systems, saturating the GPUs' cores with data and eliminating wasteful data copying and transfer times between storage silos. The result is a dramatic increase in the number of training data sets analyzed per day.

With a data platform designed to support GPUs, companies can now cost-effectively apply AI to a much wider variety of use cases. WEKA makes the entire data storage, management, and pipeline process more efficient and less complex. The net result is accelerated, optimized GPU utilization at every step, from training through deployment, ensuring that your AI applications can operate without limits and achieve their full potential.