How to Rethink Storage for AI Workloads

WEKA. March 19, 2021

WekaIO recently hosted a webinar with Julia Palmer from Gartner in which she spoke about rethinking storage for Artificial Intelligence (AI) and Machine Learning (ML) workloads. She reminded us of this quote from Stewart Brand, which evokes a somewhat startling but metaphorically accurate image:

“Once a new technology rolls over you, if you’re not part of the steamroller, you’re part of the road.”

It’s true. New technologies are emerging in the AI world, and if you don’t embrace them, you will be left behind. Gartner research found that 85% of companies are embarking on AI projects despite the fact that AI workloads present unique challenges, things that even data storage veterans have never faced before. This means that IT infrastructure is in a state of continuous change, which can be scary and exciting, but IT people are not easily scared, so they’re usually up for the challenges that come their way.

Let’s talk here about getting ahead of the game and learning more about both AI/ML workloads and how to rethink your storage to account for them.

Gartner asked over 3,000 CEOs and CIOs what they consider the most important advanced technology in the tech market today. The overwhelming majority named AI and ML, and it’s the #1 technology in which they’re investing resources for the future, for both internal and external business. In fact, two out of five of them have already deployed some elements of AI within their own companies, some in production environments. What was most exciting, however, is that over 85% of them are looking into using AI in their infrastructures during the next two years. Julia’s message to all is that you’d better know the stages of the AI/ML pipeline and get ready to jump in with your own projects because AI is coming in full force!

Machine Learning (ML) Pipeline Stages

The ML pipeline has many moving parts, especially when we consider that there is always new data entering into the pipeline. Let’s look at the different stages of this pipeline from a storage perspective and see what takes place at each stage.

Data Ingest–The data is collected and goes into data lakes, where it needs to be classified and sometimes needs to be cleaned. On the front end, this phase is very data intensive and, thus, capacity intensive.
Data Preparation–With AI/ML you have a lot of extremely high-throughput operations, and the data preparation phase requires many sequential writes, so your write speed needs to be fast enough to keep up.
Model Training–This stage is a very resource-intensive part of the AI/ML pipeline. At this point the process moves from being a data-intensive process to a latency-sensitive process. Latencies can hold up the speed of model training and the ultimate time to result for the project. This stage probably offers the most challenges because here you need high throughput and low latency.
Inference–In this stage you need extremely low latency. Maybe you have smaller IO and mixed workloads, but extremely low latency is what’s important here.
Data Staging and Archive–The last but not least important stage is data staging and active archiving. A large amount of data needs to be stored at this stage because an AI/ML setup is only as good as the quantity of data it has for testing and running its models. We might be talking about a multi-petabyte scale of active archive in this stage.
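
To make the contrast between the stages easier to see, the list above can be captured as a small lookup table. This is only an illustrative sketch: the StageProfile class, its field names, and the distinct_needs helper are hypothetical, and the values simply restate the descriptions above rather than any measured profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageProfile:
    """One pipeline stage and what it mainly asks of storage (illustrative only)."""
    name: str
    dominant_need: str
    io_pattern: str


# Values paraphrase the stage descriptions in the article; they are not benchmarks.
PIPELINE = [
    StageProfile("Data Ingest", "capacity", "bulk data landing in data lakes"),
    StageProfile("Data Preparation", "write throughput", "many sequential writes"),
    StageProfile("Model Training", "throughput and low latency", "latency-sensitive, high-throughput access"),
    StageProfile("Inference", "low latency", "smaller IO, mixed workloads"),
    StageProfile("Data Staging and Archive", "capacity", "multi-petabyte active archive"),
]


def distinct_needs(pipeline):
    """Return the sorted set of storage needs a single system would have to cover."""
    return sorted({stage.dominant_need for stage in pipeline})


if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.name:26} {stage.dominant_need:28} {stage.io_pattern}")
    print("needs to cover:", distinct_needs(PIPELINE))
```

Listing the distinct needs side by side shows why a single system has to handle capacity, write throughput, and low latency at once if you want to avoid one storage silo per stage.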

The challenge for you is to determine what type of storage system you will use to handle all of this data. Will you need five different and cumbersome storage systems? Sometimes you just might if you don’t adequately think through the architecture at the beginning of your project. To make matters worse, these different systems will sometimes require some type of a data management system on top of them to move data and provide access control. So, if you think about all stages as separate systems requiring deployment of separate and complementary storage architectures, you’re going to be in a world of hurt, as the early adopters of AI/ML learned from their infrastructures.
As a result, there are now unique startups addressing these challenges, minimizing the architectural challenges and overhead and maximizing the gain from the huge data sets.

Storage Selection Requirements for AI/ML Infrastructures

There are six aspects of storage infrastructure to consider as you move forward:

1. Portability–It is becoming a requirement for most AI/ML practitioners to be able to deliver storage anywhere. Data centers are no longer the be-all and end-all of data orchestration and consumption. In modern environments data and storage can be anywhere, yet most storage solutions available today are designed with a single data center in mind. Those looking for storage solutions for their AI workloads should always ask about portability. Data might still be deployed within the data center, but chances are good that some data storage will also be deployed at the edge or at other collection points. The portability of the solution is a #1 requirement; however, the data itself must also be portable. AI practitioners need to embrace technologies that can move data from the edge to the cloud, or from the data center to the cloud, and vice versa. Both solution portability and data portability must be considered because you might do model training in the public cloud while you run inference at the edge, so data needs different delivery options.

2. Interoperability–This aspect is extremely important because AI/ML workloads do not take place in a silo. In fact, AI/ML storage needs to support many new and modern libraries. An AI/ML project is often deployed alongside (and self-provisioned by) developers or business owners, so the storage needs to be compatible with the tools you use for provisioning. It also needs to support the relevant APIs and handle the workload in self-service mode. Finally, your architecture needs to support container platforms, mostly Kubernetes-oriented applications. If you pick a storage system that is not fully embedded, managed, and deployable in Kubernetes, you’ll probably find that it’s not a solution that can be supported for the long term. For that reason, software-defined architectures are popular.

3. Scalability–People are always talking about large scale, especially for unstructured data. Keep in mind, however, that while large scale is very important, the ability to scale down is equally important, and the architecture needs to reflect that. A lot of projects start off very small, so you don’t want to invest in a multi-petabyte solution if your project’s initial stages require a small footprint. You might just be running proofs of concept for some kind of tool while you figure out the appropriate hardware to use, so you want to be able to start small and grow as you go. Notice that we didn’t use the term Network Attached Storage (NAS) here. That’s because NAS is a vertical (scale-up) platform that cannot scale to the sizes needed for AI/ML workloads. It might be perfectly fine for other workloads, but for AI/ML you need a scale-out architecture and a distributed file system: at multi-petabyte scale (sometimes dozens of petabytes) you need a distributed system that can support billions of files. You also need to think about how the metadata is going to scale. This is why a distributed, horizontal design is the way to go when you talk about scalability.

4. Performance–It’s almost unthinkable for a system not to provide performance these days. AI/ML workloads require everything you can throw at them on performance. Workloads that combine high throughput, large IO, and random reads and writes demand that throughput-oriented and latency-oriented workloads live in harmony, which is very difficult to do. That is why we need new protocols, new flash media, and new software that can meet all of these demands. Also, the training phase of AI/ML includes a very expensive component: GPUs. They need fast storage to deliver the data; otherwise, the data-hungry GPU sits idle waiting on storage IO. GPUs are highly effective accelerators that can help reduce time to results, but they’re not cost-effective if they’re not fed with data to their full capabilities, and an outdated storage file system or platform undermines that investment. Keep in mind that some storage file systems can post very impressive performance numbers when you test them, but make sure that the system you choose can provide sustained endurance and performance to achieve the maximum benefit. The reason is that AI/ML workflows are most often not steady. There can be spikes due to different activities happening at the same time within the same storage system, so you need to pay attention to every aspect of performance. Data performance is important, of course, but metadata performance should also scale out.

5. Software-defined models–In terms of portability, your model needs to be deployable anywhere, so software-defined architectures are the way to go. Your software-defined architecture needs to be appliance independent yet support the latest hardware innovations. For example, when NVMe first came onto the market, the first supporters of this technology were software-defined-storage vendors. They’re the first ones to embrace new technology, so they’re good partners to work with on AI/ML workloads: their hardware independence lets them move much faster, and it also makes it easier to deploy the data management software anywhere across the edge, core, and cloud.

6. Cost Optimization–You need to right-size your storage investment. When people think about storage for AI/ML they’re probably thinking about the storage for the most demanding phase of the project, which is the training phase, and maybe inference as well. However, they’re often not considering the entire data lifecycle. If they deploy all of their data on a single platform it becomes pretty big and very expensive, which can be overwhelming for the practitioners when they realize they cannot sustain it. This means that practitioners need to include all phases of the work when they choose a new solution, including active archive. When data is no longer actively needed, it needs to be moved to more cost-effective solutions for active archiving and deep archiving. These solutions should be just as scalable but not as expensive if practitioners want to optimize their TCO for the entire AI/ML process. Software allows for this type of tiering over the long term, and tiering can become an expensive part of the solution if it is not thought about on day one.
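
The performance and cost points in the list above lend themselves to some back-of-the-envelope arithmetic. The sketch below is one hypothetical way to estimate how much read throughput a set of GPUs needs to stay busy, and how much a two-tier layout might save compared with keeping everything on the fast tier. The function names are invented for illustration, and every figure (GPU count, sample size, prices per TB) is a placeholder, not a measurement or a vendor price.

```python
# Back-of-the-envelope sizing helpers. All inputs used below are
# hypothetical placeholders, not measurements or vendor prices.


def required_throughput_gb_per_s(num_gpus, samples_per_s_per_gpu, sample_bytes):
    """Aggregate read rate (GB/s) needed so that no GPU waits on storage."""
    return num_gpus * samples_per_s_per_gpu * sample_bytes / 1e9


def gpu_busy_fraction(storage_gb_per_s, required_gb_per_s):
    """Upper bound on GPU utilization when storage is the only bottleneck."""
    if required_gb_per_s <= 0:
        return 1.0
    return min(1.0, storage_gb_per_s / required_gb_per_s)


def monthly_storage_cost(total_tb, hot_percent, hot_price_per_tb, archive_price_per_tb):
    """Cost with hot_percent of the data on the fast tier and the rest on an archive tier."""
    hot_tb = total_tb * hot_percent / 100
    archive_tb = total_tb - hot_tb
    return hot_tb * hot_price_per_tb + archive_tb * archive_price_per_tb


if __name__ == "__main__":
    need = required_throughput_gb_per_s(num_gpus=8, samples_per_s_per_gpu=2000, sample_bytes=150_000)
    print(f"required read throughput: {need:.1f} GB/s")
    print(f"GPU busy fraction at 1.2 GB/s: {gpu_busy_fraction(1.2, need):.0%}")

    single_tier = monthly_storage_cost(1000, 100, 100, 10)
    tiered = monthly_storage_cost(1000, 20, 100, 10)
    print(f"single tier: {single_tier:.0f} per month, tiered: {tiered:.0f} per month")
```

With these placeholder inputs, storage that delivers half of the required throughput leaves the GPUs busy at most half the time, and moving 80% of the data to the cheaper tier cuts the monthly bill substantially. Real numbers will differ, but the same arithmetic is a reasonable starting point for a day-one sizing discussion.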

Conclusion

Starting any AI project is a daunting task, yet putting in the time and effort to plan all phases of the project in the AI pipeline–including storage considerations at every step–can help to ensure success and minimize frustration for the entire team.