The AIconics awards are the world’s only independently judged awards championing the innovators and early adopters of artificial intelligence (AI) technologies for practical real-world applications. They are the first industry awards to recognize and showcase visionary technologies spanning image recognition and deep analytics on a world stage. Nominations are reviewed by a panel of 11 independent judges from across the industry, in addition to the co-founders of AI Business, to select the foremost innovators in the AI space. It was an honor for WekaIO to be shortlisted and ultimately judged the winner of the AIconics award for “Best Innovation in Machine Learning.”
What differentiated WekaIO Matrix from other market competitors such as Pure Storage and NetApp?
WekaIO Matrix is the first and only parallel file system built from scratch to leverage the latest generation of flash storage. Legacy parallel file systems, such as Lustre and GPFS, were architected prior to the emergence of solid-state drive (SSD) technology, when spinning disk (HDD) was ubiquitous. Most deep learning IT architectures now use GPU servers, which require a low latency, high bandwidth parallel file system to keep the GPUs fully saturated with data. For deep learning programs to be successful, the GPU must be able to process as many training epochs as possible in a short amount of time. Legacy storage systems simply cannot fulfill this I/O performance requirement, leaving GPUs underutilized and significantly lengthening overall training time.
In our autonomous driving vehicles use case submitted for the award, the software was installed on commodity servers from a leading server vendor and within minutes, a high-performance data pool was available to all the GPU servers.
How did WekaIO Matrix improve the customer deep learning environment?
Our autonomous driving vehicle customer compared WekaIO Matrix to an all-flash NAS appliance leveraging the NFS protocol. The NAS throughput peaked at 1-1.5 GBytes/second, frequently leaving the GPU servers starved of data. Data scientists had struggled with the performance limitations of NFS and had devised a workaround that copied training data sets onto the GPU servers’ local NVMe drives to improve overall performance. But this stop-gap solution could not scale to production-sized clusters, where the data sets were significantly larger than a GPU server could accommodate. The Solutions Architects calculated that just 10% GPU under-utilization would result in over $2M in wasted GPU infrastructure on a production cluster of 50 GPU servers.
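The waste figure above follows from simple back-of-envelope arithmetic. A minimal sketch, assuming the per-server cost implied by the quoted numbers (roughly $400K per GPU server, which is an inference from the $2M / 10% / 50-server figures, not a published price):

```python
# Back-of-envelope cost of GPU under-utilization, using the figures
# quoted above: $2M wasted at 10% under-utilization across 50 servers.
NUM_SERVERS = 50
UNDER_UTILIZATION = 0.10      # 10% of GPU time spent idle
WASTED_SPEND = 2_000_000      # $2M, per the Solutions Architects

# Implied total and per-server infrastructure cost (an inference,
# not a published price)
total_cost = WASTED_SPEND / UNDER_UTILIZATION   # $20,000,000
per_server_cost = total_cost / NUM_SERVERS      # $400,000

def wasted_dollars(under_utilization, num_servers, cost_per_server):
    """Dollars of GPU infrastructure effectively sitting idle."""
    return under_utilization * num_servers * cost_per_server

print(wasted_dollars(0.10, 50, per_server_cost))  # 2000000.0
```

The same formula shows why the problem compounds at scale: doubling the cluster size at the same under-utilization doubles the wasted spend.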
WekaIO Matrix proved it has the high bandwidth, low latency ingest rate needed to saturate a 100Gbit network link, delivering up to 11 GBytes/second per GPU server to meet the most demanding ingest requirements and keep the GPUs fully utilized. The tests showed that the customer was able to reduce a training epoch from two weeks to less than one day, resulting in faster iteration and better deep learning outcomes.
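A quick sanity check shows how close 11 GBytes/second comes to the theoretical limit of a 100Gbit link. This sketch uses decimal units and ignores protocol overhead, which is a simplifying assumption for a rough estimate:

```python
# Sanity check on the link-saturation claim: 11 GB/s per GPU server
# against a 100 Gbit/s network link (decimal units, no protocol
# overhead -- a simplifying assumption).
LINK_GBIT = 100
link_gbytes = LINK_GBIT / 8   # 12.5 GB/s theoretical line rate
observed = 11                 # GB/s per GPU server, per the tests

utilization = observed / link_gbytes
print(f"{utilization:.0%} of line rate")  # 88% of line rate
```

In other words, the measured throughput leaves little headroom on the wire, which supports the saturation claim.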
What was the impact on the bottom line?
The increase in GPU utilization means that the training systems can process more data in a shorter amount of time resulting in faster learning cycles and improved driver safety. More frequent continuous training runs have resulted in a car that gets “smarter” over time.
What follows are actual numbers achieved by the customer:
- Data ingest rates to the GPU cluster increased by 10X compared to the legacy NFS-based all-flash NAS
- Performance was 3X faster than locally attached NVMe drives inside the GPU servers, eliminating the need to constantly copy data to local storage
- Metadata performance improved by 300%, resulting in far better GPU utilization
- Models had “hands-free” access to a multi-Petabyte training catalog
- Project costs dropped by 50% compared to the all-flash NAS
In summary, WekaIO Matrix had a significant impact on the success of the customer’s project through improved GPU utilization and reduced training time. The overarching business benefits included lower infrastructure costs and improved ROI.