For the Want of a Nail – Part 3 of 5: AI depends on large scale storage

Liran Zvibel. January 31, 2018

For the Want of a Nail Part 1 – How Infrastructure May Be Limiting AI Adoption
For the Want of a Nail – Part 2 – Aligning Data Center Storage with the Needs of AI Workloads
For the Want of a Nail – Part 4 – Want AI? You’ll Need a Modular Approach to Maximize GPU Performance
For the Want of a Nail – Part 5 – Enabling AI For Organizations Of All Sizes

As I discussed in my previous post, AI workloads are data driven. Today’s analytics-based AI requires tremendous amounts of storage. Without this capacity, you cannot benefit from the information and knowledge that AI workloads unlock. Whether you are sequencing human genomes, reading medical images, or performing clinical trial research, the amount of data to process is huge. The same is true of machine learning that monitors the Internet of Things (IoT); intelligent agents that enable customer support, purchase prediction, and fraud detection; and business intelligence and other analytical applications. In each case, the common thread is storage at a large, if not enormous, scale. Moreover, AI deployments generate even more data to digest and reprocess ad infinitum; hence the ever-growing need for more AI storage capacity.

Unlike traditional centralized workloads, where the bulk of the data resided neatly in the datacenter, AI data comes from a variety of locations. Consider IoT workloads, where data tends to originate at the outer edge of the network; after all, this is where the “things” reside and the action takes place. However, the processing takes place at the central hub rather than out on the spokes. Likewise, a genome comprises terabytes of raw data and is likely a shared resource that resides away from the application server. Managing data access in these environments requires not only massive scalability but also high I/O rates combined with low latency. Without these, workloads become severely I/O bound and processing grinds to a halt, leaving expensive GPU-based servers idle as they wait for data.
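To make the I/O-bound point concrete, here is a minimal back-of-the-envelope sketch in Python. The function name and the throughput figures are hypothetical, chosen purely for illustration, and the model is deliberately simple: it assumes the GPUs stay busy only for the fraction of the required data rate that storage can actually deliver.

```python
def gpu_utilization(required_gb_per_s: float, delivered_gb_per_s: float) -> float:
    """Estimate the fraction of time GPUs stay busy (0.0 to 1.0).

    required_gb_per_s:  data rate the GPUs need to run at full speed (assumed).
    delivered_gb_per_s: data rate the storage system actually sustains (assumed).
    """
    if required_gb_per_s <= 0:
        raise ValueError("required_gb_per_s must be positive")
    if delivered_gb_per_s < 0:
        raise ValueError("delivered_gb_per_s must not be negative")
    # Storage faster than the GPUs need cannot push utilization above 100%.
    return min(1.0, delivered_gb_per_s / required_gb_per_s)


if __name__ == "__main__":
    # Illustrative numbers only: GPUs that need 10 GB/s, fed at 2.5 GB/s.
    print(f"{gpu_utilization(10.0, 2.5):.0%} busy")
```

Under these assumed numbers, the servers would sit idle roughly three quarters of the time, which is the cost the paragraph above describes.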

Traditionally, a storage solution that met these criteria required expensive investments to achieve the requisite scalability and performance. As a result, all but the largest enterprises or most focused startups could not economically justify deploying AI to harvest the commercial benefits it could deliver. You might be thinking, “Yes, perhaps, but there are many alternatives today that were not available in the past.” That’s true. Let’s consider some of the commonly considered alternatives.

Cloud Service Providers (CSPs) are often viewed as an option; after all, they require minimal or no capital investment in new hardware and scale easily with a flexible pay-as-you-go pricing scheme. However, these low-cost approaches lack the configuration flexibility of onsite equipment (GPUs, networking, and AI storage) and therefore can rarely meet the specialized needs of AI workloads. CSPs bring additional problems as well. Noisy neighbors co-resident on the same physical infrastructure can degrade application performance. Limited network or storage bandwidth hurts latency and throughput. And large amounts of data must be moved to and from the cloud, which is both time consuming and costly.
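As a rough illustration of why moving data to and from the cloud is time consuming, the following Python sketch estimates transfer time from dataset size and link speed. The function name, the dataset size, the link speed, and the efficiency factor are all assumptions for illustration; real transfers also depend on protocol overhead, contention, and provider limits that this simple model ignores.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the nominal
    link speed that is actually achieved end to end.
    """
    if data_tb < 0 or link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("invalid input")
    bits_to_move = data_tb * 8e12                    # 1 TB = 8e12 bits (decimal units)
    effective_bits_per_s = link_gbps * 1e9 * efficiency
    return bits_to_move / effective_bits_per_s / 3600


if __name__ == "__main__":
    # Illustrative numbers only: 100 TB over a 10 Gbit/s link at 80% efficiency.
    print(f"{transfer_hours(100, 10):.1f} hours")
```

Even on a fully dedicated link, a dataset of this assumed size takes on the order of a day to move once, and AI datasets are typically reprocessed and moved repeatedly.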

Off-the-shelf NAS solutions might seem to be an easy and immediate answer to the AI storage dilemma. However, while perfectly acceptable at smaller scale, NAS has limited capacity scalability, and its performance drops and latency rises at the scale that AI workloads require. How about a hybrid approach? Theoretically, one could devise such a solution. However, remember that a hybrid approach combining multiple components would result in many moving parts, including metadata servers, multiple storage targets, and so on, each of which would require ongoing tuning to run at peak efficiency. Does this seem like a simple, cost-effective, and scalable solution?

After a closer examination, these choices aren’t as viable as first thought. But what is a viable alternative that is also affordable? A successful solution needs to prioritize fast and efficient data storage while being able to scale to petabytes of capacity. Remember, too, that scalability is bidirectional: a solution should scale down as gracefully as it scales up. Also, keep in mind that files are not one size fits all. A medical image is very different from a word processing document, log file, or database table. Consistent performance regardless of file size or type is an essential part of any data management and storage solution.

By now, I suspect that you are starting to appreciate the distinct data management and storage needs of demanding machine learning workloads. These workloads represent a change for almost any organization, and with change comes the challenge of meeting the need while maximizing ROI. In my next post, I’ll discuss the characteristics of the new approach to data management and AI storage infrastructure that you will need if you want to deploy AI cost-effectively. I invite you to join me there.