Accelerate Genomics Research with a Data Platform for Speed, Simplicity, and Scale

Shimon Ben David. February 15, 2023

40 exabytes of data. That’s the massive quantity of data the NIH estimates genomics projects may generate in the next decade. It’s fun to convert that to the number of blue whales, Empire State Buildings, or Millennium Falcons required to contain it all (the NIH likens the quantity to 429 sharks, each representing 100,000,000 GB of data). But the truth is that such vast amounts of data only matter if they can be studied and put to meaningful use.
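The shark comparison is easy to sanity-check with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and which unit convention the NIH used is an assumption, so both the decimal (1 EB = 10^9 GB) and binary (1 EB = 2^30 GB) readings are shown.

```python
# Figures quoted in the text above.
SHARKS = 429
GB_PER_SHARK = 100_000_000

# Total capacity represented by all the sharks, in gigabytes.
total_gb = SHARKS * GB_PER_SHARK


def gb_to_eb_decimal(gb: float) -> float:
    """Convert gigabytes to exabytes using decimal (SI) units: 1 EB = 10**9 GB."""
    return gb / 10**9


def gb_to_eb_binary(gb: float) -> float:
    """Convert gigabytes to exabytes using binary units: 1 EB = 2**30 GB."""
    return gb / 2**30


if __name__ == "__main__":
    print(f"decimal reading: {gb_to_eb_decimal(total_gb):.2f} EB")
    print(f"binary reading:  {gb_to_eb_binary(total_gb):.2f} EB")
```

Under the decimal reading the sharks add up to roughly 42.9 EB, and under the binary reading to roughly 39.95 EB, so the 40 exabyte headline figure and the 429 sharks are consistent with each other if binary units are assumed.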

The more salient considerations for researchers, IT managers, and life sciences organizations are: how can we ensure that our data infrastructure won’t stagger under the sizable load of information we create? Can our infrastructure keep pace with our research ambitions?

The Research Data Challenge Hiding in Plain Sight

You may have already experienced the signs of the coming data impediments. They look like this:

A researcher asks, “Where’s my data?” after a sequencer run
Local researchers are trying to copy data to multiple local storage drives and USB sticks
An IT administrator repeatedly juggles storage, moving data in response to one ticket, only to receive another ticket requiring them to move the data back
An institution’s decision-makers wonder why projects take ever-longer to complete

These are just a few indications that existing storage environments at many institutions are becoming too complex, cumbersome, and constrained despite best intentions. Some of the causes are easy to spot. One tool captures raw data coming from a sequencer. A data mover shifts that data to yet another tool that stores the data for analysis. An IT administrator needs to move the data for a waiting researcher somewhere along the way. Yet another data mover is required to move things for retention and still another to get everything into the archive. Do we even need to detail the difficulties of getting data back from the archive a year from now? In some cases the amount of data increases so much that these operations cannot even be performed anymore.

The combination of multiple tiers of storage and data movement tools from various vendors turns every step in the chain into a chokepoint requiring manual work and significant time. We’ve reached a point where the typical storage infrastructure doesn’t move as fast as your research. Data coming off a sequencer needs to be analyzed now, not in a day, week, or month. The stakes are high and getting higher.

Rethinking the Relationship between Data Infrastructure and Research

Modern genomics research demands infrastructure with the speed, simplicity, and scale needed to accelerate discoveries. The researcher needs to know that data coming off a sequencer or an application is where they expect to find it, ready for use anytime they need it; no tickets required. Administrators need to be able to focus on administration instead of taking care of data management and flow. The esoteric details of where data is stored for their end-users and applications need to be abstracted away. Institutions need greater productivity and can’t afford to have administrators and researchers idle while data is moved.

To date, institutions have been forced to amass labyrinthine, multi-vendor storage infrastructures, adding more components as new storage, application, and other requirements emerge. As a result, using your data is a constant IT engineering challenge. How does that benefit your research endeavors?

The solution is to have all of your sequencers and applications write and read directly to a single data platform. In other words, data is written to and read from one place, with no other data movement required. Busy researchers see one or more namespaces where all their data is written. They can see everything they’ve done today, last week, last month, and even ten years ago if the institution chooses to keep that data live. One challenge of centralization is that researchers typically have to make compromises as they weigh the cost per genome vs. the time it takes to complete sequencing on the genome. The ideal data platform solution enables centralization but does not require a significant tradeoff between performance and cost.

Driving Research Forward without Losing Sight of the Past

Access to historical data is crucial in research. For example, when a new reference genome is published, researchers usually want to read back their earlier data for comparison. Enabling such retrospective comparison requires that past data remain accessible to the user. Legacy storage environments typically accomplish this by endlessly bifurcating their data sets to separate new from old. This approach requires researchers to submit a support ticket whenever they want access to older data. That is a problem because you don’t learn new things while you are stuck waiting.

Life sciences organizations have also been constrained when managing the cost of storage media. Today their options are to keep all the data they want accessible on expensive media or go through the painfully manual processes of moving data back from lower-cost archival storage every time it is needed. A better approach would allow organizations to purchase different storage tiers to achieve the best balance of cost and performance while ensuring that data is always available no matter where it is stored and without moving the data.

Data should automatically be placed on and moved to the best storage medium (on-premises, in the cloud, or any hybrid combination) as dictated by workflow and cost requirements. Automation frees IT administrators from the time-consuming drudgery of moving data and reduces the number of support tickets created. Properly implemented, data movement should be entirely transparent. When a researcher, an application, or someone else within the institution goes looking for that data, it is simply visible and accessible.

Can a Data Platform Prepare You for Tomorrow’s Research Opportunities?

Scale is another major hurdle for many institutions. Project requirements vary tremendously. Today’s job might require 10,000 compute cores, but will you be ready if tomorrow’s grant needs a system with 150,000 cores?

Putting your organization in the position to say “yes!” to whatever work is around the corner requires the flexibility to see your data wherever it is stored and deliver that data no matter whether you choose to execute work on-premises or in the cloud. Such immediate flexibility at a massive scale is precious to researchers because it means their storage environment won’t slow down their work. Administrators don’t have to fine-tune the storage for a pipeline that might change ten minutes later. The institution has confidence that its storage environment will meet any demand thrown at it.

WEKA Helps Life Sciences Stay Focused on their Research

At WEKA, we believe life sciences institutions deserve a storage infrastructure that provides the speed, simplicity, and scale needed to support their missions fully. We start by taking most of the work out of managing your storage, obviating the need for multiple data movers and other tools from innumerable, often incompatible vendors. The WEKA® Data Platform ensures that past data is discoverable and accessible by researchers without the delays imposed by constantly creating support tickets.

That visibility is maintained throughout your data’s lifecycle, freeing organizations to manage the cost of their storage environment better. The WEKA platform gives you the flexibility to support different storage tiers to achieve the best balance of cost and performance. It also optimizes all that storage by placing your data on the best storage medium, whether that is on-premises, in the cloud, or a hybrid. The WEKA platform’s intelligent tiering can then automatically move data that has been untouched for days to lower-cost storage.
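To make the idea of age-based tiering concrete, here is a minimal sketch of the kind of rule such a policy applies: find files that have not been read or written for a given number of days and flag them as candidates for lower-cost storage. This is not WEKA's API or implementation. The function name `find_cold_files`, the `max_idle_days` parameter, and the file names in the demo are hypothetical, and a real platform would apply the policy internally without exposing any data movement to users.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_cold_files(root, max_idle_days, now=None):
    """Return files under `root` untouched for more than `max_idle_days`.

    "Untouched" here means neither accessed nor modified, judged by the
    later of the file's access and modification timestamps. This is an
    illustrative rule, not any vendor's actual tiering logic.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    cold = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        last_touched = max(stat.st_atime, stat.st_mtime)
        if last_touched < cutoff:
            cold.append(path)
    return sorted(cold)


if __name__ == "__main__":
    # Demo on a throwaway directory: one fresh file, one backdated 30 days.
    with tempfile.TemporaryDirectory() as workdir:
        fresh = Path(workdir) / "run_today.fastq"
        stale = Path(workdir) / "run_last_month.fastq"
        fresh.write_text("ACGT\n")
        stale.write_text("ACGT\n")
        backdated = time.time() - 30 * SECONDS_PER_DAY
        os.utime(stale, (backdated, backdated))
        for candidate in find_cold_files(workdir, max_idle_days=7):
            print("candidate for lower-cost tier:", candidate.name)
```

The sketch only identifies candidates. The point made in the text is that the platform then relocates such data while keeping it visible in the same namespace, so researchers never have to know which tier a file lives on.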

WEKA provides true freedom in how you use your storage. Our system effortlessly manages any data regardless of size, how it’s stored, or where it’s stored. It doesn’t care if you put millions of files in a single directory. If you need to put huge files next to small files in the same directory, the WEKA platform is equally happy to manage that. When your application needs simultaneous access to read and write large and small files randomly or sequentially, it has you covered. The WEKA platform flexibly ensures that your data is visible and automatically scalable to suit any task. So, if you need to burst into the cloud to support a massive new research opportunity, rest assured that your data will be accessible wherever researchers execute their workloads.

With the WEKA platform’s multi-protocol support, data can be streamed directly from the sequencer to the WEKA platform. This removes the need to copy it manually and allows the sequencer to immediately perform another operation instead of waiting for local capacity to free up.

Genomics research is about paving the way for a better future. With WEKA managing your storage, you’ll have the speed to save time that is better invested in research. You’ll have the kind of simplicity that ensures that critical data is always available to be read where it was written, without extra data movement or management burden. And your organization will have the scale that will allow it to adapt to any challenge.
