WEKA Data Reduction


A demonstration of WEKA data reduction technology, including how to configure it and the space savings it provides.


Welcome to the WEKA Technical Demonstration series. In today’s video, we’re going to be talking about WEKA Data Reduction. We’ll give you a brief overview of how WEKA Data Reduction works, talk about the WEKA Data Reduction Estimation Tool (DRET) and how and where you’d want to use it, and then we’ll jump right into the demos.

WEKA data reduction works by scanning incoming blocks and storing a unique hash value for each one. It then groups blocks with similar hashes together and runs a modern delta compression technique to reduce how much data is stored. This capability is activated on a per-file-system basis but, as we’ll see later, is effective across an entire cluster.
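As a rough illustration of the idea (a toy sketch, not WEKA’s actual implementation), delta compression against a similar, already-stored block means only the differences cost space. Python’s zlib preset-dictionary support shows the effect:

```python
import hashlib
import zlib

def pseudo_random(n: int, seed: bytes) -> bytes:
    """Deterministic, incompressible test data."""
    out, h = b"", seed
    while len(out) < n:
        h = hashlib.sha256(h).digest()
        out += h
    return out[:n]

stored = pseudo_random(4096, b"block-A")              # block already on disk
incoming = stored[:2000] + b"EDITED" + stored[2006:]  # similar incoming block

# Compressed alone, the (incompressible) block barely shrinks:
plain = zlib.compress(incoming)

# Delta-compressed against the similar stored block (used here as a
# preset dictionary), only the changed bytes cost meaningful space:
c = zlib.compressobj(zdict=stored)
delta = c.compress(incoming) + c.flush()

print(len(incoming), len(plain), len(delta))
```

The hash index is what makes this cheap: the hash of the incoming block is how a candidate similar block like `stored` would be located in the first place.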

This delta compression is so effective across different data types that in practical use, WEKA has seen space savings of up to 6X in AI and model training pipelines, up to 8X in EDA workflows, up to 2X in media and entertainment environments, including images and video, and up to 2X in genomics workflows.
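To put those ratios in storage terms: an NX reduction means only 1/N of the logical bytes are physically stored. A quick back-of-the-envelope conversion:

```python
def savings_pct(ratio: float) -> float:
    # An "NX" reduction stores 1/N of the logical data.
    return (1 - 1 / ratio) * 100

for workload, ratio in [("AI/model training", 6), ("EDA", 8),
                        ("Media & Entertainment", 2), ("Genomics", 2)]:
    print(f"{workload}: up to {ratio}X = up to "
          f"{savings_pct(ratio):.0f}% fewer bytes stored")
```

So an 8X ratio corresponds to storing 87.5% fewer bytes, and 2X to 50% fewer.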

Not every data set is as well characterized as these. To help with capacity planning and estimating how much space you can save by moving your data to a WEKA Data Platform, WEKA provides the Data Reduction Estimation Tool, or DRET. DRET uses the same algorithm that a WEKA cluster would use, so the results are accurate. To run it, you just need a Linux system that can access the data to be analyzed, and some free space to store the calculated hashes used to produce your data reduction estimate.

So now let’s go to the demo.

For the demo, we have a small cloud-based WEKA cluster running WEKA 4.2.3. On the cluster, we have a file system called demo_drx_off, which has a 41GB data set consisting of everything from PowerPoints, GitHub repo scrapes, EDA layout data, movies, image files for graphics, some genomics image files, and more. The majority of the data in this dataset is composed of compressed images and video, so keep that in mind when you see the results later.

Now, we go to a client for the WEKA cluster.

We issue commands to see the CLI view of the file systems: how the demo_drx_off file system is mounted, what files are in the mount, and crucially, the unreduced size of the data. If the size looks different between WEKA and the Linux client, that’s because Linux tools calculate in gibibytes (GiB) while WEKA reports gigabytes (GB). The actual capacity is identical.
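The unit mismatch is easy to check: a gibibyte is 2^30 bytes while a gigabyte is 10^9 bytes, so the same 41 GB dataset shows up as roughly 38.2 GiB in a Linux tool:

```python
GB = 10**9    # decimal gigabyte
GiB = 2**30   # binary gibibyte, what `df -h`-style tools typically show

dataset_bytes = 41 * GB
print(f"{dataset_bytes / GiB:.2f} GiB")  # same bytes, smaller-looking number
```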

The next step is to create a file system with data reduction turned on. We’ll name it demo_drx_on and turn data reduction on, which also turns thin provisioning on at the same time. We’ll also make it one terabyte in size. Note that there is now a check mark in the data reduction column indicating that this capability is turned on.

We’ve gone ahead and mounted the demo_drx_on file system, so now we have two WEKA file systems mounted to the same client. As you can see here, demo_drx_on is empty. We’ll now fill it by copying in the same dataset that was in the file system with data reduction turned off.

While the data is copying in, let’s go back to the GUI and watch the status, including the savings we’re seeing. As the data copies in, you can see the file system filling up. At the same time, the data reduction background processes are hashing the data and reporting how much savings you can expect as the reduction process starts to work.

At the end here, a 32% reduction on a data set with a lot of pre-compressed data, that’s pretty good! And at petabyte scale, that’s quite a lot of savings. If you remember, data reduction turns thin provisioning on automatically, which means the space saved on disk can be used by any other file system in the WEKA cluster.
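A 32% reduction translates directly into reclaimable capacity. As a hypothetical back-of-the-envelope, assuming the same ratio held at larger scale:

```python
def saved_bytes(logical: int, reduction_pct: float) -> int:
    # Physical bytes reclaimed for a given logical size and reduction %.
    return int(logical * reduction_pct / 100)

demo_saved = saved_bytes(41 * 10**9, 32)  # ~13 GB back on the 41 GB demo set
pb_saved = saved_bytes(10**15, 32)        # 320 TB back at one petabyte
print(demo_saved, pb_saved)
```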

Now that we have data reduction on a single file system, let’s see what happens when we have a second file system with similar data in it. As I mentioned before, the hash and compare of similar blocks happens on a cluster wide level, even though data reduction is on a per file system basis. This should result in an even better outcome than the single file system reduction.
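Conceptually (a toy sketch, not WEKA’s actual data structures), a cluster-wide hash index is why a second file system with the same data adds almost no physical capacity:

```python
import hashlib

class ClusterBlockStore:
    """Toy cluster-wide block store: a block's hash is looked up across
    the whole cluster, so identical blocks written by different file
    systems are stored once. (WEKA additionally delta-compresses blocks
    that are merely similar, not just identical.)"""

    def __init__(self):
        self.blocks = {}   # hash -> block bytes, stored once cluster-wide
        self.fs_refs = {}  # file system name -> list of block hashes

    def write(self, fs: str, data: bytes, block_size: int = 4096):
        refs = self.fs_refs.setdefault(fs, [])
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)  # new physical block only if unseen
            refs.append(h)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = ClusterBlockStore()
dataset = b"".join(bytes([i]) * 4096 for i in range(4))  # four distinct blocks
store.write("demo_drx_on", dataset)     # first file system: 16 KiB stored
store.write("demo_2_drx_on", dataset)   # same data again: nothing new stored
print(store.stored_bytes())
```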

We’ll create a new 1 terabyte file system called demo_2_drx_on with data reduction enabled. Because it’s the same data copy process, we’ll speed up the video by skipping that part and going directly to the results in the WEKA GUI.

In the demo_2_drx_on file system, you can see significant savings being processed. This is due to the new hashes being compared to the existing hashes from the first file system. Because the data is the same, the effect is similar to deduplication: only one of the blocks with a matching hash needs to be stored. In addition, notice what is happening to the first file system. The reduction algorithm is finding hashes in the second file system that it can use to incrementally save even more space than before.

As you’ve seen, WEKA data reduction is incredibly simple to use, yet has powerful capabilities that work with the types of data used in next-generation workloads. In many of these workload pipelines, it’s not unusual for similar sets of data to be handed to different researchers, scientists, and artists for use in their individual workflows. In those cases, WEKA can help these environments even more.

For more information on the WEKA Data Platform and its capabilities, visit www.weka.io.