We established in a previous blog post that the explosion of data offers a myriad of opportunities to mine it for hidden treasures: insights that enable businesses and industries to solve their greatest challenges. Not the least of these industries is life sciences, in which pharmaceutical companies use copious amounts of data as they work to produce medicines that perform multiple functions, such as vaccines to prevent diseases, drugs to combat the symptoms of diseases, and painkillers to make life more tolerable. The method used to design these medicines is called structure-based drug design (SBDD), and it often employs a process called cryogenic electron microscopy (cryo-EM). Let’s talk a bit about each one and why data storage is so important to their specific workloads.

Cryo-EM Data Storage: Pharma and Bioimaging

People are living longer and staying active well into their senior years. Much of that is due to living healthier lifestyles, but a good amount is also due to “better living through chemistry.” Let’s face it: pharma is a booming industry that commands big dollars in sales and spends big dollars in research.

A revolution in high-speed microscopy is paving the way for new drug delivery mechanisms, using custom therapies aimed at sub-molecular targets. Powered by the latest cryogenic electron microscopy (cryo-EM) tools, these techniques can give clinicians a new arsenal of microscopic weapons to fight the deadliest diseases.

Here’s how it works. A cryo-EM sample yields thousands of high-resolution (4K) two-dimensional images, which must be transformed into 3D models and motion clips. At each stage in this process, systems work with large image and video files, performing computationally intensive tasks like blur removal, motion correction, and 3D image classification. A single step in the cryo-EM process can involve capturing thousands of images and generate anywhere from 1TB to 10TB of raw data per dataset.

Cryo-EM Data Processing

Ultimately, then, it’s easy to see how the cryo-EM process consumes huge amounts of data. And all of this data must be kept: data that doesn’t seem useful today could yield huge gains in the future as the sciences deepen their understanding and the technology to process it advances.

The main storage challenge with the cryo-EM pipeline is the high degree of variability in data sizes and access patterns between steps in the process. As established, the raw datasets are huge, and through cleanup (removing low-resolution images, duplicates, etc.) only about half of the original data moves on to the next step. Each successive step therefore works with a smaller dataset and a more random I/O pattern.
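As a back-of-the-envelope sketch of this reduction, the model below assumes a fixed ~50% retention per cleanup step, as described above; the step count and retention fraction are illustrative assumptions, not measured values.

```python
def dataset_size_through_pipeline(raw_tb, steps, retention=0.5):
    """Model how a raw cryo-EM dataset shrinks through successive
    cleanup/processing steps, assuming a fixed retention fraction.
    Illustrative numbers only -- real pipelines vary per step."""
    sizes = [raw_tb]
    for _ in range(steps):
        sizes.append(sizes[-1] * retention)
    return sizes

# A 10TB raw dataset passed through 4 cleanup steps:
print(dataset_size_through_pipeline(10, 4))
# [10, 5.0, 2.5, 1.25, 0.625]
```

Even in this simplified model, the storage system must serve a 10TB sequential ingest and a sub-1TB random-I/O working set within the same pipeline, which is exactly the variability that strains single-profile storage.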

Accelerating Modern Workloads

Life sciences organizations face large pipeline and productivity challenges. Even running in an HPC environment 24/7, processing a single sample can take three months—and sometimes this is an iterative process. A whole human genome at 30X coverage can require several hundred gigabytes of storage during the alignment and variant calling stages, and it can take more than 30 hours to process this data using CPU solutions. This has become a computational bottleneck when processing thousands of genomes.
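To see why the 30-hour-per-genome figure becomes a bottleneck at scale, a quick calculation (the 1,000-genome cohort size is a hypothetical example; the 30-hour figure is from the text):

```python
def total_cpu_hours(num_genomes, hours_per_genome=30):
    """Total compute time if each genome takes hours_per_genome
    on a single CPU-based pipeline (serialized, for illustration)."""
    return num_genomes * hours_per_genome

hours = total_cpu_hours(1000)   # 30,000 hours for a 1,000-genome cohort
years = hours / (24 * 365)      # ~3.4 years if run serially
print(hours, round(years, 1))
```

Even with heavy parallelization across an HPC cluster, that compute demand is why storage throughput to the processing nodes, not just CPU capacity, determines time to results.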

Now consider the fact that research organizations work on more than a single sample to attain a research goal. In the scope of this work they must transform millions of high-resolution cryo-EM images into useful 3D models. To date, that process has taken many months—sometimes a year or longer for major discoveries. However, as datasets grow, as people expect more dramatic results, as companies expect higher profits, and because “time is money,” the time it takes to produce results needs to be reduced dramatically. The good news is that it can.

Due to the large scale of data involved, cryo-EM has always benefited from advances in computing technology. To that end, life sciences organizations are adopting accelerators like GPUs, paired with technologies that speed I/O between storage and GPU server nodes, to remove potential storage bottlenecks in AI, ML, and HPC workloads. Together these act as force multipliers for end-to-end I/O acceleration, speeding up time to results.

Legacy Storage Architectures Can’t Keep Up

In the quest to achieve cutting-edge breakthroughs, legacy technologies can’t keep up. In fact, one of the major barriers to faster performance is antiquated hardware and legacy filesystems. Modern components like NVMe interconnects and NAND flash media are capable of order-of-magnitude I/O improvements, but the decades-old software and file systems that even the newest storage systems still run cannot exploit them. These aging architectures waste significant resources on processes like the following:

  • Converting data between file and block I/O, which gets more resource-intensive as data volumes grow
  • Maintaining global data maps at scale as the number of files grows exponentially
  • Ensuring global cache coherence across multiple nodes in a large cluster

Modern Workloads and Genomic Sequencing

For those of us who aren’t doctors, the subject of genomic sequencing can be daunting at best, conjuring memories of calculating the genome sequence of a fruit fly in high school biology labs. In those labs we asked, “Why do we need to know this?” An inherent part of being a lab scientist is that you ask questions, and the kinds of questions that lab scientists ask typically have broad ramifications across our society. They can be questions about the molecular structure of plants, about brain formation during gestational development, or about the rapid progression of certain diseases. Asking, and potentially answering, these questions helps us to better understand ourselves, the environment around us, and our future health and happiness. This certainly includes, but is not limited to, cures and treatments for diseases that affect millions of people across the globe.

To embark on this endeavor, however, scientists must marshal many resources (human and otherwise), processes, and practices toward the desired end result.
Genomic researchers are finding that data storage demands are doubling every seven months at current sequencing rates, and those demands are forecast to exceed 40 exabytes of capacity by 2025 for the human genome alone. Managing and protecting data at this scale will be a nightmare for the ill-prepared.
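To put a seven-month doubling period in perspective, here is a small projection sketch; the doubling rate comes from the text above, while the starting capacity and time horizon are arbitrary illustrative inputs.

```python
def projected_capacity(current, months, doubling_period=7):
    """Capacity after `months`, assuming demand doubles every
    `doubling_period` months. Growth rate from cited sequencing
    trends; inputs are illustrative."""
    return current * 2 ** (months / doubling_period)

# Starting from 1 unit of capacity, two years out:
print(round(projected_capacity(1.0, 24), 1))  # ~10.8x growth
```

A storage platform sized for today’s footprint would need to grow roughly an order of magnitude every two years just to keep pace, which is why independent capacity scaling matters so much in this field.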

Today’s complex scientific workflows require high throughput and IOPS at low latencies so that researchers can achieve faster discovery. Legacy storage systems that have limited scale are holding back research and, as with other workloads, creating data silos that are difficult and costly to manage. Supporting genomic sequencing workflows is further complicated by the large data sets and long retention periods they require.

Real-Life Example: Genomics England

Genomics England (GEL) needed a solution to support the UK National Health Service 5 Million Genomes Project but could not scale with its existing NAS solution from a leading vendor. Weka delivered a two-tier architecture that takes commodity flash and disk-based technologies and presents them as a single hybrid storage solution. The primary tier consists of 1.3 petabytes of high-performance NVMe-based flash storage that supports the working data sets. The secondary tier consists of 40 petabytes of object storage that provides a long-term data lake and repository. Weka presents the entire 41 petabytes as a single namespace. Each tier can scale independently: should GEL require more performance on the primary tier, it can scale that tier independently of the data lake. The system takes advantage of the geo-distributed capability of the object store, and data is protected across three locations 50 miles apart.

How Weka Accelerates Life Science Workloads

Weka helps laboratories continually speed data through each stage of the Cryo-EM pipeline to break through the backlog and process more workloads in far less time. With these capabilities, researchers can reduce the time needed to develop new molecular delivery mechanisms.

Combined with state-of-the-art genomic analysis and new AI and machine learning applications—all of which also benefit from the speed, simplicity, and scale of WekaFS—researchers can usher in the age of personalized medicine. They can help build a future where clinicians draw on new precision therapies to target the right patient, with the right drug, in the right dosage, at the right time.


Additional Helpful Resources

Life Sciences Data
AI-based Drug Discovery with Atomwise and Weka on AWS
Accelerating Genomic Discovery with Cost-Effective, Scalable Storage
Accelerating Discovery and Improving Patient Outcomes With Next-Generation Storage
How to Analyze Genome Sequence Data on AWS with WekaFS and NVIDIA Clara Parabricks
Top 5 Myths in HPC for Life Sciences