Most data practitioners strive to accelerate application outcomes while providing data mobility and availability. Their success is measured by their ability to drive improved operational efficiencies and sustainability without compromising performance, all within an affordable cost envelope.  

Building data infrastructure to support these goals is a multivariate equation with no perfect solution. It involves difficult tradeoffs, since most data management products and solutions excel in only one of these vectors at the expense of the rest.

Understanding the Variables 

Next-generation applications have different performance and storage requirements at various data management life cycle stages. Take, for example, an ML system that trains autonomous driving models for vehicles. Data ingestion must happen within a fixed period to accommodate data from multiple sources.  
The data transformation phase has a similar challenge: it must cleanse the raw data and extract and transform features from it within the same window. Next come the training and validation phases, which must likewise complete quickly. The data practitioner team running the project will be unwilling to compromise on performance in any of these data lifecycle phases, because performance is directly tied to business outcomes; completing an ML model run within three hours, for example, could be a goal the team is driving toward. This demand for high performance has been the driving factor in the continued dominance of NVMe-based storage systems over legacy hybrid systems.

Building an infrastructure with NVMe-based storage that drives superior I/O performance in all ML lifecycle phases could break the infrastructure budget for many enterprises. Data-driven enterprises seldom spread infrastructure investments evenly, and often settle for hybrid systems that are cost-optimized for storage efficiency alone. What they need are highly performant systems at an affordable price, two qualities rarely paired by data management vendors today. As a result, enterprises purchase islands of infrastructure tiers at varying costs and with differing performance characteristics and storage efficiencies, forcing teams to silo applications into the best-suited island of storage. It is not easy to build a data infrastructure that satisfies both performance-demanding applications and the remaining workloads AND that fits the CFO’s annual infrastructure budget. 

Next-generation applications operate at scale. For example, one of WEKA’s customers processes more than 2PB of data daily through an ML model that trains autonomous vehicles. Another customer runs a genomic sequencing data center that processes 1PB of data aggregated from more than 10 remote sites that collect samples. To run such operations at scale, teams are forced to make tradeoffs on performance and economics as enterprises are not typically offered solutions that can meet or exceed requirements in all the vectors of performance, cost, scale, and sustainability.  

Using technologies such as deduplication and compression provides cost benefits for small datasets that begin to evaporate as they scale (think petabytes, not terabytes). Technologies such as data tiering to an object store have established their place in large data environments to reduce cost while enabling data operations at scale. However, many traditional systems that tier data to HDD or an object store are not known for providing adequate performance, let alone the extreme performance that next-generation applications demand.  

Today, most data management vendors have only a superficial story around sustainability. Reducing power consumption and cooling the data center by reducing its infrastructure footprint have become table-stakes solutions, and these are the baby steps most enterprises are taking today to get themselves on the sustainability bandwagon.

The big strides in the sustainability vector require migrating workloads to more energy-efficient data centers, which may be in the cloud and potentially powered by renewable energy sources. The hard tradeoff here is identifying those high-performance workloads and migrating them to a more sustainable platform without ending up with multiple copies that consume more energy than before. Additionally, enterprises must implement solutions to operate their data infrastructure more efficiently and sustainably. 

Other Challenges 

Next-generation applications pose a set of constraints that data practitioner teams grapple with when optimizing their environment to accelerate application outcomes. These teams are intimately aware of the characteristics of their data, such as throughput, latency, resiliency, and storage efficiency, all of which vary across the applications they manage. Some applications, such as VFX rendering, stream data sequentially, and that data must be captured within a short window before it can be processed. Data from remote labs must be processed as it arrives in the data center. Data processing, including data protection, is often scheduled for when system resources are idle, and if such activities are initiated during crunch time, teams need the power to pause data transformation processes that delay their primary application outcomes.  

Also, given such varied application requirements, data practitioners may create a siloed best-of-breed environment for each application. From a storage perspective, they cannot gain efficiency even if all their applications’ data is stored in one infinite namespace, as many legacy vendors offer storage efficiency only within limited logical boundaries, or “volumes.” Data teams depend on infrastructure vendors to provide best-of-breed solutions that combine storage technologies such as deduplication and compression into the most cost-efficient infrastructure for their environment. They expect the solution to determine which technique to apply, whether the techniques must be applied together, and when they should be used based on the dataset type. For example, some datasets in the AI/ML space may compress well enough to provide cost-efficient storage, while others, such as VFX images, might not; the expectation is that the solution makes the right choice for each dataset. Ultimately, teams must decide whether their applications can tolerate data living in a less performant tier with limited capabilities while still being able to pull data to the performant tier on demand, or whether it is better to keep all their data in a more performant tier and apply data reduction techniques to cut infrastructure costs and achieve economies of scale. 
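The per-dataset decision described above can be sketched as a simple compressibility probe: compress a small sample of the data and only store the dataset compressed when the projected ratio clears a threshold. The sketch below is illustrative Python, not WEKA's implementation; the function name, the sample size, and the 1.2x threshold are all assumptions for demonstration.

```python
import zlib

def probe_compressibility(data: bytes, sample_size: int = 64 * 1024,
                          min_ratio: float = 1.2) -> bool:
    """Decide whether a dataset is worth compressing by probing a sample.

    Compresses up to `sample_size` bytes and compares the original length
    against the compressed length. Returns True only when the projected
    ratio clears `min_ratio`.
    """
    sample = data[:sample_size]
    if not sample:
        return False
    compressed = zlib.compress(sample, 6)
    ratio = len(sample) / len(compressed)
    return ratio >= min_ratio

# Repetitive, text-like data compresses well; random-looking data (for
# example, already-compressed VFX image formats) does not.
telemetry = b"sensor_reading,timestamp,value\n" * 4000
print(probe_compressibility(telemetry))   # highly repetitive -> True
```

A real system would probe several samples per dataset rather than one prefix, since compressibility can vary across a file, but the gating logic is the same.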

Data Reduction Done Differently 

The WEKA Data Platform is purpose-built for the most demanding high-performance workloads and handles operations such as data reduction with equal efficiency across all storage nodes, providing cluster-wide data reduction that uses advanced block-variable techniques to reduce storage capacity. With the WEKA Data Platform, customers have the flexibility to choose which application data is reduced. By design, WEKA’s data reduction is performed intelligently as a post-process activity: as new data is written to the cluster, it remains uncompressed, enabling accelerated I/O to the applications. When the system is consuming fewer resources serving data to applications, the data reduction process scans and compresses the data using these advanced block-variable techniques. Alternatively, customers may pause the process to run on-demand application workloads and maximize the benefits of the system. 
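The post-process pattern described above — uncompressed ingest, a deferred background compression pass, and the ability to pause that pass — can be illustrated with a toy in-memory model. This sketch uses plain zlib on whole blocks purely for illustration; it is not WEKA's architecture or its block-variable algorithm.

```python
import threading
import zlib

class PostProcessReducer:
    """Toy model of post-process data reduction: writes land uncompressed
    so the ingest path stays fast; a later background pass compresses the
    blocks, and the pass can be paused to yield resources to on-demand
    workloads."""

    def __init__(self):
        self.blocks = {}                 # block_id -> (is_compressed, bytes)
        self.pause = threading.Event()   # set() to pause the reduction pass

    def write(self, block_id, data: bytes) -> None:
        # Hot path: store raw data and defer reduction entirely.
        self.blocks[block_id] = (False, data)

    def read(self, block_id) -> bytes:
        # Reads are transparent regardless of reduction state.
        is_compressed, data = self.blocks[block_id]
        return zlib.decompress(data) if is_compressed else data

    def reduce_pass(self) -> None:
        # Background scan: compress any block not yet reduced.
        for block_id, (is_compressed, data) in list(self.blocks.items()):
            if self.pause.is_set():
                return                   # yield to foreground workloads
            if not is_compressed:
                self.blocks[block_id] = (True, zlib.compress(data, 6))
```

The essential property the sketch captures is that applications see identical data before and after the pass runs; only the stored representation changes.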

Customers can use WEKA’s Data Reduction Estimation Tool to understand the data reduction efficacy of different application datasets. This tool, designed by WEKA, scans the complete sample dataset provided and reports projected data reduction savings that could be obtained.  
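To make the idea of such an estimator concrete, the hypothetical snippet below walks a sample directory, compresses each file in fixed-size chunks, and reports the projected reduction ratio. It only approximates what a tool of this kind measures; WEKA's actual utility and its block-variable method are not reproduced here.

```python
import zlib
from pathlib import Path

def estimate_savings(root: str, chunk_size: int = 1 << 20) -> float:
    """Walk every file under `root`, compress it in chunks, and return the
    projected data reduction ratio (logical bytes / reduced bytes)."""
    logical = reduced = 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            while chunk := f.read(chunk_size):
                logical += len(chunk)
                reduced += len(zlib.compress(chunk, 6))
    return logical / reduced if reduced else 1.0
```

Running an estimate like this against a representative sample, rather than the full production dataset, is usually enough to decide whether reduction is worth enabling for a given application.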

WEKA’s data reduction can deliver savings of up to 8x for various application datasets. Application data from AI/ML modeling, EDA processing, databases, classic code compilation, and code repositories have gained the most benefit from WEKA’s data reduction technology. With the availability of large-capacity flash storage and more compute cores packed in small form factors, data-driven organizations can now accelerate their application workloads from one infrastructure without managing multiple islands of infrastructure spread across different locations. 

The insights organizations gain through the analysis of data have become a crucial part of business, research, and development. As technology propels us into the next generation of application possibilities, with advancements in deep learning, AI, Generative AI, and sustainable AI, the volume and velocity of data required skyrockets. For many data practitioners, this presents a significant challenge in which the dual requirements of performance and cost-effective capacity are equally high priorities. For this reason, silos of storage have emerged: one for performance, and one for capacity. This is where WEKA closes the gap with its enhanced data reduction technology, offering organizations both a high-performance data platform that supports the demands of the application profile and the economics of data reduction to support capacity, as well as intelligent tiering to object storage for even better cost efficiencies. WEKA is not a point data storage solution, but a true data platform able to serve multiple purposes while enabling customers to achieve their own desired business outcomes.  

Learn More about the WEKA Data Platform