Mo’ Better: Accelerating AI/ML Training
Andy Watson. June 7, 2019
Accelerating AI/ML training with GPUs intensifies the I/O demands on data storage infrastructure, and by rising to the challenge with the world’s fastest filesystem, WekaIO enables better outcomes.
Employers routinely demand that we should all “do more with less.” Hypothetical efficiencies predicted by someone only interested in the bottom right cell of an over-simplified spreadsheet can drive us crazy. We can gripe about it, but it won’t change anything: since about 1980, spreadsheet software has metastasized to become the universal root of all evil.
“Artificial intelligence experts thought that it would be at least 20 years before
a computer could beat a human playing Go.
Last year, AlphaGo (a computer program from Google’s DeepMind AI research acquisition)
absolutely crushed the world’s best player. And now it can crush and play the top 50
simultaneously and crush them all. That pace of progress is remarkable.”
— Elon Musk (2017)
Incrementalism is the death-by-a-thousand-cuts whittling away at every product, service and (by extension) every experience in the world. Nevertheless, there is hope: AI will be smarter than short-sighted human managers. That’s because AI has been ML-trained on a myriad of examples showing that bad things happen if you shave yet another 10% after repeated trimming with no regard to the history of prior cuts.
One important way to enhance the AI/ML training process is to provide a larger, deeper, richer data lake of examples, and then to use GPU-accelerated servers to chew through that larger pool of examples faster, because training benefits from maximizing data access. If only it were so straightforward...
GPU acceleration creates I/O demands that exceed the capabilities of conventional filesystems. Aggregate throughput across a whole cluster is irrelevant when the challenge is feeding data to a single mount point. When this problem was first encountered, the knee-jerk reaction was to install local NVMe flash storage in every compute server and, by brute force, make local copies of the data needed for each ML training event to consume. But there are two problems with that approach: (1) it encumbers the workflow with complexity and adds wall-clock delay for finding and copying the next data set; and (2) those local flash storage devices are constantly rewritten top to bottom, which shortens their usable lives.
WekaIO directly addresses all of the above concerns by providing a shared filesystem alternative that is faster than local NVMe flash: lower latency for file accesses, with throughput that saturates 100-Gbit/s networks. One WekaIO customer (who has asked not to be named) has a large GPU-accelerated compute cluster with continuous aggregate throughput to its multi-petabyte data lake in the range of 80 GB/s (gigabytes per second). With WekaIO’s filesystem, data access improved so dramatically that each ML training epoch completed roughly 80x faster: instead of taking 14 days, each epoch began finishing in only 4 hours.
Does that mean they finish faster and go home early? No. It means they can run 80x more training epochs, resulting in significant improvements to the quality of their outcomes. Mo’ Better!
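As a quick sanity check on the figures quoted above, the short sketch below works out the speedup implied by going from 14 days to 4 hours per epoch, and the minimum number of 100-Gbit/s links that an aggregate 80 GB/s would occupy at full line rate. It uses only numbers stated in this post; the function names are illustrative, and the link count ignores protocol overhead, so treat it as a lower bound rather than a sizing guide.

```python
import math

HOURS_PER_DAY = 24
BITS_PER_BYTE = 8


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of the old epoch duration to the new one."""
    return before_hours / after_hours


def min_links(aggregate_gbytes_per_s: float, link_gbits_per_s: float) -> int:
    """Fewest links needed to carry the aggregate rate at full line rate."""
    link_gbytes_per_s = link_gbits_per_s / BITS_PER_BYTE
    return math.ceil(aggregate_gbytes_per_s / link_gbytes_per_s)


# Figures from the post: 14 days per epoch before, 4 hours after.
epoch_speedup = speedup(14 * HOURS_PER_DAY, 4)

# Figures from the post: 80 GB/s aggregate over 100-Gbit/s networking.
links = min_links(80, 100)

print(f"Epoch speedup: {epoch_speedup:.0f}x")
print(f"100-Gbit/s links needed for 80 GB/s: at least {links}")
```

The exact ratio comes out slightly above the rounded "80x" quoted in the text, which is consistent with the numbers being approximate.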
Another often-overlooked consideration is how the files are organized. Even if you could find a way to improve the single-mount-point performance of filesystems designed decades ago, they all share another problem to varying degrees: they struggle with large numbers of small files because of metadata overhead. A typical “best practices” advisory for such relic filesystems dictates limiting the number of files per directory to under 100,000. That may sound like plenty until you realize that modern AI/ML data lakes grow to encompass many billions of files, and that tens (or hundreds) of thousands of incoming files are created automatically every day.
Does it make sense to divert the attention of your data scientists to housekeeping chores and reorganizing files to avoid overloading directories just because of metadata threshold issues endemic to antique filesystems? Of course not! With WekaIO, trillions of files can go in the same directory with no downside, liberating your brightest minds to focus on their research.
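To make that housekeeping chore concrete, here is a minimal sketch of the kind of hash-based directory sharding that a per-directory file limit pushes onto a team. The 100,000 figure comes from the advisory mentioned above; everything else (the function names, the two-level layout, the use of SHA-256, the example paths) is an assumption made for illustration, not a recommendation from any particular filesystem vendor.

```python
import hashlib
import math
from pathlib import PurePosixPath

# Per-directory ceiling taken from the advisory quoted in the text,
# treated here as an upper bound.
MAX_FILES_PER_DIR = 100_000


def min_directories(total_files: int, limit: int = MAX_FILES_PER_DIR) -> int:
    """Fewest directories needed so that none exceeds the limit."""
    return math.ceil(total_files / limit)


def sharded_path(root: str, filename: str, levels: int = 2) -> PurePosixPath:
    """Map a filename to a nested shard directory derived from its hash.

    Each level uses two hex characters of the digest, so two levels
    spread files across 256 * 256 = 65,536 leaf directories.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shards = [digest[2 * i : 2 * i + 2] for i in range(levels)]
    return PurePosixPath(root, *shards, filename)


# One billion files need at least this many directories under the limit.
dirs_needed = min_directories(1_000_000_000)

# Hypothetical file name and root, purely for illustration.
example = sharded_path("/datalake/images", "frame_000001.jpg")

print(f"Directories needed for 1e9 files: {dirs_needed:,}")
print(f"Example sharded location: {example}")
```

Every tool that reads or writes the data lake then has to agree on this mapping, which is exactly the sort of overhead the paragraph above argues data scientists should not have to carry.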
Also, recent research indicates that it is often beneficial to allow the ML algorithms to make their own roaming choices on which files to access as the next examples for training, instead of predetermining the choices in advance. Pooling more files into common directories simplifies that free-ranging scenario.
Enlarging the Funnel
“If you don’t know what a funnel is, get Mommy to show you one.”
― Kurt Vonnegut, The Sirens of Titan (1959)
If you think of each computational process, end-to-end, as being analogous to a grain of sand in an hourglass, then the capacity of the bottom vessel in the hourglass corresponds to the minimum number of processing events that will be needed to achieve AI competency. The top of the hourglass is full of examples which can be used for training — it is the “data lake” to feed the ML training processes. And the tube joining the two bulbs is the limiting constraint: the infamous bottleneck.
Hourglass and Grains of Sand Representation of Basic AI/ML Training
Now, imagine that you are more ambitious. If you want to go beyond minimum competency and train to the highest levels of expertise (e.g., bulletproof credit card fraud detection, unbeatable cybersecurity, uncanny tumor detection for medical radiology images, autonomous vehicle proficiency to enable cars to self-drive anywhere under any drivable conditions, etc.), then the capacity of the vessel at the bottom of the hourglass would be very much larger. Let’s suppose it is 50x larger. If you didn’t change anything else, your hourglass would become awkwardly asymmetrical.
Representation of AI/ML Training for Highest Expertise
As a data scientist and AI/ML researcher, you already own the goal of advancing your own intellectual property, the secret sauce that has the potential to most significantly differentiate your autonomous system from your competition. But you also realize that you’re going to need a great many more examples for training, validation, and testing. As your modeling matures, you might embrace a diversity of file types — for example, using videos as well as images — with increased file sizes such that the total amount of data ingested during each training epoch will grow. The vessel at the top of the hourglass is going to become much larger. Maybe even disproportionately larger relative to the ambitious scale of the bottom vessel, but for now let’s just say they need to be about the same.
Representation of AI/ML Training with Huge Data Lake of Examples for Highest Expertise
You’ve still got some challenges to overcome. For one thing, as you can obviously see in the picture, that channel connecting the top to the bottom might have been adequate when your data sets were small and your goals were less ambitious. But now it’s become a serious bottleneck. So you ask your IT infrastructure team to give you 100-Gigabit networking.*
Representation of AI/ML Training for Highest Expertise with Huge Data Lake of Examples and Bigger Networking Pipe
Alas, simply having a faster network turns out not to be enough. The computational processes are not yielding results at a rate commensurate with the vastly enlarged data lake now available; so you enhance all the compute servers with GPU acceleration. (Note that the computational processes are now represented with larger diameters in our diagram.) But your GPUs are stalled, starved for I/O because the infrastructure designed to feed data to the ordinary CPUs isn’t able to keep up with GPU demand for higher data access rates.
Representation of AI/ML Training for Highest Expertise with Huge Data Lake of Examples, Bigger Networking Pipe, and GPUs
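The starvation effect described above can be approximated with a deliberately simple toy model: if storage delivers data more slowly than a GPU can consume it, the GPU is busy only for the fraction of time that data is available, and the epoch stretches accordingly. The throughput and duration numbers below are hypothetical, chosen only to show the shape of the relationship; they are not measurements from any WekaIO deployment, and real pipelines (with prefetching, caching, and overlapping I/O and compute) behave less neatly.

```python
def gpu_utilization(delivered_gb_s: float, demanded_gb_s: float) -> float:
    """Fraction of time a GPU is busy when input rate is the only limit."""
    if demanded_gb_s <= 0:
        raise ValueError("demanded_gb_s must be positive")
    return min(1.0, delivered_gb_s / demanded_gb_s)


def epoch_wall_clock_hours(compute_hours: float, utilization: float) -> float:
    """Wall-clock epoch time given pure compute time and GPU utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return compute_hours / utilization


# Hypothetical: a GPU that could consume 8 GB/s is fed only 2 GB/s.
starved = gpu_utilization(delivered_gb_s=2.0, demanded_gb_s=8.0)

# Hypothetical: storage can deliver more than the GPU demands.
fed = gpu_utilization(delivered_gb_s=10.0, demanded_gb_s=8.0)

# Hypothetical epoch that needs 4 hours of pure GPU compute.
starved_hours = epoch_wall_clock_hours(4.0, starved)
fed_hours = epoch_wall_clock_hours(4.0, fed)

print(f"Starved: {starved:.0%} utilization, {starved_hours:.0f} h per epoch")
print(f"Fed:     {fed:.0%} utilization, {fed_hours:.0f} h per epoch")
```

In this model, faster GPUs raise the demanded rate, so without a matching increase in the delivered rate, utilization falls and the extra compute capacity sits idle.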
By adding WekaIO, so much data access can be achieved individually for each and every GPU that there is no longer contention, eliminating the I/O bottleneck. Even with the same physical network connectivity, Machine Learning is accomplished more effectively once GPU I/O starvation is eliminated.
Representation of AI/ML Training for Highest Expertise with Huge Data Lake of Examples, GPUs, and WekaIO
This is why being the world’s fastest filesystem is not about “speeds-and-feeds” for the sake of boastful specsmanship. This is about Q-O-O: Quality of Outcome. Mo’ Better, indeed!
To learn more about accelerating AI/ML training, click here.
For a related video: Andy Watson was interviewed on stage by Timothy Prickett-Morgan, editor at The Next Platform, at The Next AI Platform event in San Jose, Calif. To see that video, click here.
* I’m always amazed at how many people were happy with 10-GbE 15 years ago and for some reason, still seem to think that it ought to be enough now, in 2019. This is mind-boggling because their data is now measured in petabytes instead of terabytes; but that’s a topic for another blog.