Learn About CryoEM Data Storage & Processing

Shimon Ben David. May 8, 2020

What is Cryo-EM?

To understand what Cryo-EM is, we first need to discuss another topic: drug discovery.

Pharmaceutical companies produce drugs that serve many purposes, from painkillers to cures for specific diseases and much more.

Another term we should discuss is “proteins”. You can think of proteins as little engines that perform specific functions within our body, for example fighting viruses or transmitting messages.

To produce an effective drug, researchers need to see the structure of the proteins inside our cells so that they can design a drug that binds effectively to a specific protein type, for example a drug that binds to a pain receptor protein to block the pain. Imagine puzzle pieces that fit each other perfectly, compared with mismatched pieces that only almost fit.

The better researchers can see the protein, the better the drug can be: it is more effective or requires a smaller dosage, which is healthier for the human body and especially the liver.

With that background, we can return to the main topic. Cryo-EM, short for Cryogenic Electron Microscopy, is one of the techniques used in a larger field of research called Structure-Based Drug Design (SBDD). It is the process of taking biological samples, freezing them, and then bombarding them with a beam of electrons. This generates multiple pictures of the proteins themselves. The sample can still move during this process even though it is frozen, which is why Cryo-EM can be compared to taking pictures of a moving object from multiple angles, with all of the challenges that entails.

Scientists can then use these 2D pictures to generate 3D images of the proteins. Because the protein moves during this time, the output is actually movie-like. Using that output, researchers can design a drug that binds better to that protein.

Cryo-EM Pipeline

The Cryo-EM process is composed of multiple steps that vary according to the exact pipeline. Multiple frameworks are typically in use as well, such as Relion, CryoSPARC, CTFFIND, and more. In general, the steps include motion correction, CTF estimation, particle picking, and particle extraction.
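As a rough illustration of how these steps chain together, here is a minimal Python sketch. The stage names come from the list above, but everything else is an assumption made for illustration: the `tag` helper, the short labels, and the `run_pipeline` function are hypothetical placeholders and do not call into Relion, CryoSPARC, or CTFFIND.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[List[str]], List[str]]]


def tag(label: str) -> Callable[[List[str]], List[str]]:
    """Build a placeholder stage that only records that it ran.

    A real stage would invoke a framework tool; this one just appends
    a label to each item so the order of processing is visible.
    """
    def stage(items: List[str]) -> List[str]:
        return [f"{item}+{label}" for item in items]
    return stage


# Order follows the text; real pipelines may add, skip, or reorder steps.
PIPELINE: List[Stage] = [
    ("motion_correction", tag("mc")),
    ("ctf_estimation", tag("ctf")),
    ("particle_picking", tag("pick")),
    ("particle_extraction", tag("extract")),
]


def run_pipeline(items: List[str], pipeline: List[Stage] = PIPELINE):
    """Feed the output of each stage into the next, in order."""
    history = []
    for name, stage in pipeline:
        items = stage(items)
        history.append(name)
    return items, history


if __name__ == "__main__":
    results, history = run_pipeline(["movie_001"])
    print(history)
    print(results)
```

The point of the sketch is only that each step consumes the previous step's output, which is why the data sizes and access patterns change as the pipeline progresses.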

Cryo-EM Data Processing

Cryo-EM microscopes are very expensive and in high demand, so they are usually run 24×7 and produce large amounts of data. The data then has to pass through multiple steps to reach the final 3D, movie-like result. The ability to process large amounts of data can also yield better images of the proteins, which in turn can improve the resulting drug.

Because this is a fairly new field, its software has been built with newer technologies in mind, and Cryo-EM frameworks such as Relion and CryoSPARC can already leverage GPUs to accelerate their workloads.

Data Storage Requirements for Cryo-EM

A single Cryo-EM run can capture thousands of images and generate anywhere between 1TB and 10TB of raw data. Each step in the pipeline usually halves that amount of data (removing bad images, low-resolution images, duplicates, etc.). The main storage challenges for this pipeline are the high degree of variability in data sizes and in the access patterns of each step. The pipeline begins as a high-throughput, sequential IO use case, and with each step it moves toward smaller, random IO. Additionally, multiple GPU servers need to go over the data in parallel efficiently in order to minimize processing time. The storage also needs to scale to PBs so that researchers are not required to delete the original data.
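To make the sizing concrete, here is a small Python sketch of the arithmetic. It assumes each step keeps exactly half of the previous step's data, which is a simplification of the "usually" above, and that every intermediate result is retained alongside the raw data. The function names and the `keep_ratio` parameter are hypothetical, introduced only for this example.

```python
from typing import List


def stage_sizes_tb(raw_tb: float, steps: int, keep_ratio: float = 0.5) -> List[float]:
    """Return the data size in TB for the raw data, then after each step.

    keep_ratio=0.5 models the assumption that every step halves the data.
    """
    if raw_tb < 0 or steps < 0 or not 0 <= keep_ratio <= 1:
        raise ValueError("raw_tb and steps must be >= 0, keep_ratio in [0, 1]")
    sizes = [raw_tb]
    for _ in range(steps):
        sizes.append(sizes[-1] * keep_ratio)
    return sizes


def total_footprint_tb(raw_tb: float, steps: int, keep_ratio: float = 0.5) -> float:
    """Total TB stored if the raw data and every intermediate are kept."""
    return sum(stage_sizes_tb(raw_tb, steps, keep_ratio))


if __name__ == "__main__":
    # A 10TB run passing through four steps.
    print(stage_sizes_tb(10.0, 4))      # [10.0, 5.0, 2.5, 1.25, 0.625]
    print(total_footprint_tb(10.0, 4))  # 19.375
```

Under these assumptions, a 10TB run that keeps all of its intermediates occupies about 19.4TB, so retaining the original data roughly doubles the footprint of a run. Many such runs are what push capacity toward the PB range.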

Because the pipeline is composed of multiple steps that access the data in different patterns, it takes a modern parallel file system such as WekaFS to accommodate the different access patterns and sizes, as well as the number of files that need to be analyzed to reach the end result. WekaFS’s ability to accelerate GPU workloads can further decrease the time required to complete the process, allowing researchers to run more pipelines and reach more accurate results.


