Five Reasons Why IBM Spectrum Scale Can’t Cut It for AI Environments

WEKA. March 19, 2018

Data is the core of artificial intelligence; without data there is no learning and the more data available to the training systems the better the accuracy. Customers have been utilizing high-performance data storage solutions that were originally built for HPC environments to address the data challenges of machine learning. One of the more prominent high-performance storage solutions is IBM’s Spectrum Scale™ file system. In this blog post I outline five specific shortfalls of Spectrum Scale that limit its ability to meet the demands of AI systems.

1. GPUs Are Performance Hungry: You Need the Fastest Shared File System for NVMe

IBM Spectrum Scale (aka IBM GPFS) was developed 25 years ago to support the high-throughput needs of multimedia applications, and for the subsequent 15+ years it was optimized to deliver the highest large-file bandwidth from hard disk drives. But the world has changed dramatically in the last ten years: flash technology has revolutionized the data center, and workloads have shifted from large sequential transfers to metadata-intensive, small-file, I/O-intensive big data workloads. Technology designed and fine-tuned for streaming applications is not suited to the age of big data and analytics. IBM continues to develop its Spectrum Scale product, yet the file system still cannot deliver performance for small-file, low-latency, I/O-intensive workloads. Standard industry benchmarks show that a flash-native file system such as WEKA Matrix™ outperforms Spectrum Scale by a wide margin regardless of the workload.

2. Data is Your Most Valuable Asset, You Must have an Efficient Way to Protect It

Data is the most critical element for any training system, but you can’t leave it unprotected. Instead, the storage system needs to be self-protecting. IBM Spectrum Scale software lacks the following data protection features:

Durability – IBM Spectrum Scale can only protect data itself through replication between nodes or through triplication with its File Placement Optimizer (FPO) feature, and neither is cost effective for today's vast AI data sets. Otherwise, durability must come from the underlying storage system using traditional RAID or local erasure coding, and in that case data protection does not span the storage cluster. This approach provides a lower level of overall data protection, and when an error occurs, performance is impacted by the rebuild process.
End-to-end (E2E) data integrity – AI system accuracy requires that the training systems have the correct data to interpret. Users who care about data integrity have to purchase IBM's proprietary ESS appliances to get this feature.

WEKA's software-only solution comes standard with end-to-end data integrity.

3. Product Functionality Will Vary Depending on Deployment Type

IBM Spectrum Scale varies its feature set depending on how it is deployed, so what you get on-premises will not be the same as what you get in the cloud. Extensive data sets will sometimes need to move from on-premises to cloud deployments to take advantage of scale. Users must consider those feature-set differences when evaluating Spectrum Scale and may need to deploy additional storage solutions to achieve their operational goals. Almost every deployment is unique (network, client RAM, total storage, performance, etc.), which makes cloud agility an impossibility.

A true software-only file system that can seamlessly span from on-premises to the cloud based on the end user's requirements will ensure future AI systems can deliver the outcomes required. WEKA Matrix utilizes the same software on-premises or in the cloud and can seamlessly move data between the two environments.

4. You Need a Predictable, Cost-Efficient Storage Solution as Data Sets Grow

A useful AI storage system must be both scalable and affordable, two attributes that don't always coexist with IBM Spectrum Scale. Why? As mentioned above, how you deploy Spectrum Scale drives the cost to end users. Let's review some of the options:

IBM Elastic Storage Server – Pre-configured appliances that demand you purchase rigid capacity and performance configurations up front. Users may need to deploy more appliance capacity simply to reach their performance goals.
3rd Party SAN deployment – Many Spectrum Scale deployments require complex SAN storage underneath. However, this multi-vendor option increases cost and complexity, and feature sets will vary from the IBM solution.
- To achieve more data locality for the AI training system, users can deploy local SSDs as a local read-only cache (LROC) within Spectrum Scale systems. However, this simply increases cost to the end user, who must buy either ESS or SAN in addition to local SSD(s) in order to deploy LROC within an AI system.
IBM Spectrum Scale FPO – A deployment option similar to HDFS, where data locality via triple replication is the preferred form of data protection. This results in a significant increase in the amount of raw storage deployed to meet usable capacity goals.
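
To make the raw-capacity point concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical, and the 8+2 erasure-coding layout is an illustrative assumption chosen only for comparison; it does not describe any particular product's configuration.

```python
def raw_capacity_replicated(usable_tb: float, copies: int) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return usable_tb * copies


def raw_capacity_erasure_coded(usable_tb: float, data_units: int, parity_units: int) -> float:
    """Raw capacity needed for a data+parity stripe (layout is an assumption)."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("invalid stripe layout")
    return usable_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    usable = 1000  # TB of usable capacity, an arbitrary example figure
    print(raw_capacity_replicated(usable, 3))        # triple replication
    print(raw_capacity_erasure_coded(usable, 8, 2))  # hypothetical 8+2 stripe
```

Under these assumptions, triple replication needs three times the usable capacity in raw storage, while the hypothetical 8+2 stripe needs 1.25 times.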

Simply put, there is no single Spectrum Scale, and depending on your goals you may end up with multiple solutions to support the end-to-end workflow of machine learning applications.

5. Little Cloud Integration

With much of the AI/ML innovation occurring in the cloud, seamless cloud integration is an important requirement for new storage deployments. IBM recently added a cloud API (Swift, and S3 emulation via Swift), Transparent Cloud Tiering (TCT), and cloud data sharing to Spectrum Scale, but it falls short in:

Data Tiering

Transparent Cloud Tiering (TCT) is designed for inactive cold data, not for a data lake that may be called on in any training run. This is evident from TCT's performance limitation of four TCT nodes per file system. For enterprises that have invested in hybrid IT architectures, on-premises performance should both simplify and fully enable cloud data tiering, not limit it.

Efficiency

Spectrum Scale’s cloud tiering is designed for cold archive data, yet it cannot tier a file bigger than 5TB to the cloud, so large files must be kept on higher-cost on-premises storage.
Now suppose you need to change just a small portion of a file tiered to the cloud. Spectrum Scale must recall the entire file to update a small block range, rather than recalling only the portion that needs to be updated. As a result, the cloud tiering layer becomes both a performance bottleneck and a source of costly egress charges, because the entire file is recalled. This is especially impactful for data that has been encrypted, because the data must be decrypted prior to transfer, creating added latency and a potential security gap.
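
The whole-file recall penalty can be illustrated with a short sketch. The helper names are hypothetical, the file and update sizes are arbitrary examples, and "5TB" is assumed here to mean decimal terabytes.

```python
MAX_TIERABLE_BYTES = 5 * 10**12  # the 5TB limit cited above, assumed decimal


def can_tier(file_size_bytes: int) -> bool:
    """True if a file is small enough to be tiered under the cited limit."""
    return file_size_bytes <= MAX_TIERABLE_BYTES


def recall_bytes(file_size_bytes: int, changed_bytes: int, whole_file: bool) -> int:
    """Bytes pulled back from the cloud to update `changed_bytes` of a file."""
    if whole_file:
        return file_size_bytes
    return min(changed_bytes, file_size_bytes)


def recall_amplification(file_size_bytes: int, changed_bytes: int) -> float:
    """How many times more data a whole-file recall moves than a range recall."""
    return recall_bytes(file_size_bytes, changed_bytes, True) / recall_bytes(
        file_size_bytes, changed_bytes, False
    )


if __name__ == "__main__":
    one_tb, one_mb = 10**12, 10**6
    print(can_tier(6 * one_tb))                  # a 6 TB file exceeds the limit
    print(recall_amplification(one_tb, one_mb))  # 1 MB update to a 1 TB file
```

In this example, updating 1 MB of a 1 TB file with a whole-file recall moves a million times more data than a range recall would, and egress charges scale with the bytes moved.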

Scalability

Coming up on a deadline and need more resources to finish processing your AI workload? Wouldn’t it be great if you could create an environment in the cloud to get the job done on time? With Spectrum Scale, accomplishing this requires a complicated export/import between an on-premises Spectrum Scale cluster and a cloud Spectrum Scale cluster. As we highlighted earlier, there are several flavors of Spectrum Scale, each with different capabilities and limitations, so caveat emptor.

Spectrum Scale was designed in 1998 to optimize hard disk I/O, primarily for use cases that needed lots of bandwidth. However, today’s applications require a modern file system unencumbered by legacy design constraints rooted in 60-year-old hard disk technology. Many advances in virtualization techniques, solid-state disk technology, and hashing algorithms were not available when this file system was designed.

WEKA Matrix is a cloud native parallel file system that seamlessly integrates on-premises and public cloud deployments to support elastic compute in the cloud.