Running high performance computing (HPC) workloads in the cloud has the potential to be a great solution for enterprises that need on-demand resources. Many of the workloads are project-oriented, and the investment in a dedicated HPC infrastructure is hard to justify for many enterprises. Intersect360, a leading research firm focused on the HPC industry, noted in its most recent industry census that HPC in the cloud had a breakout year with 44% year-over-year growth. The rapid change in adoption was driven by the increased availability of cloud resources and a significant spike in deep learning applications.
I have previously shared HPC customer profile information as it relates to AI in the cloud, and AWS is consistently the market leader by a factor of at least 2x. This blog focuses on the latest challenge with running more traditional HPC workloads in the cloud: the de facto high performance parallel file system in wide deployment is Lustre. Lustre is an open source project that has been maintained by Intel for several years; in fact, Intel oversaw the porting of the software to AWS. The availability of a high performance file system with the backing of a great brand like Intel was significant in the adoption of HPC in the cloud.
However, recent developments mean that HPC customers on Lustre are at risk of being stranded without any clear path forward. Intel has divested its engineering team that maintained the Lustre distribution, and the cloud software is no longer available to new customers.
All software products have to be updated on a regular basis to keep up with new Linux versions and feature enhancements. This is where the Lustre distribution ran into issues. The officially supported Linux compatibility matrix is very limited, so users on unsupported Linux distributions have to compile their own Linux kernel module.
The Case for a Reliable Enterprise-Class Parallel File System in AWS
TRE ALTAMIRA is the world leader in displacement monitoring services for the exploration industry. It utilizes a satellite RADAR data processing service that provides earth imaging services to a multitude of customers, including those in oil and gas, mining, civil engineering and government. Services include measuring surface deformation, earth fractures, reservoir model calibration and caprock integrity surveillance. The company can monitor gas fields to ensure reservoir pressure and surface uplift remain within safe operational limits. These services are critical to safe oil and gas exploration.
Put simply, TRE ALTAMIRA provides a critical service that ensures the safety of the mining industry and its workers and requires a reliable HPC infrastructure to meet timely service level agreements with its customers. TRE ALTAMIRA made a strategic decision to utilize the public cloud to run its complex workflows and initially turned to Lustre for its production software.
However, when TRE ALTAMIRA chose the Ubuntu Linux distribution, it found that Ubuntu was not officially supported by Intel Lustre on AWS. The solution required TRE ALTAMIRA to perform a difficult recompile of the source code at its own risk, which resulted in overall stability issues that affected TRE ALTAMIRA’s ability to deliver on its services. During production runs, some instances would experience severe performance degradation while others would stop accessing the file system altogether. While the workloads are considered “scratch space”, a failed run amounts to wasted compute resources, lost profits from having to re-run a workload, and worst of all, delays in meeting customer service level agreements.
TRE ALTAMIRA considered alternative shared file services based on NFS; however, the solutions utilizing the NFS protocol were not able to deliver the performance that can be achieved with a parallel file system. Luckily, the team came across WekaIO at AWS re:Invent and decided to run a pilot study comparing WekaIO Matrix™ to Lustre on AWS. The software was evaluated on performance, stability and cost. After extensive testing under load and against production data, WekaIO’s Matrix proved to be a great replacement for TRE ALTAMIRA’s Lustre-based implementation. The TRE ALTAMIRA team really liked the consultative approach taken by the WekaIO technical team, which ensured minimal issues when onboarding to the Matrix file system. Unlike with open-source software, a dedicated support team that will address any questions or issues is only a phone call or email away.
The WekaIO solution is available on the AWS Marketplace as an on-demand solution or through a bring-your-own-license model. The software is fully enterprise ready with support for a broad spectrum of Linux distributions. It also offers features like snapshots, tiering to S3 for lower cost storage, backup and recovery, and snap-to-object for bursty applications (essentially, park the data in S3 and release all of the on-demand resources, then re-hydrate at a later date when you need to run a new workload). Finally, WekaIO’s Matrix is fully supported by a passionate team of support personnel and engineers who want to make sure you have the best experience in the AWS Cloud.