Cryo-EM data processing reference design from Clovertex and WEKA simplifies and accelerates computing and data storage for Cryo-EM

Cryo-EM and HPC are helping researchers understand protein structure to accelerate the design of new medicines such as vaccines, immunotherapies, and pain management drugs. However, the computation necessary to determine 3D structure creates significant computing, data storage and cost challenges:

  • Compute: Requires the latest GPU and CPU resources. Because these resources are expensive, high utilization is essential to maximize ROI.
  • Storage: Poor I/O performance can bottleneck Cryo-EM data processing. Data storage must accelerate the Cryo-EM pipeline and eliminate data copies.
  • Cost: Taken together, the compute, data storage, networking, and software licensing costs can be high. As with GPUs and CPUs, software tools such as CryoSPARC are expensive, so money is wasted when resources sit idle.

It can also be extremely difficult for an organization to procure and deploy the necessary infrastructure to support the needs of researchers—and keep up with rapid technology advancements. Research teams are turning to the public cloud to gain access to the computing resources they need—when they need them.

Clovertex, a specialist in bringing scientific computing applications to the cloud, and WEKA, an innovator in high performance file storage for the cloud and AI era, have joined forces to create a unique cloud-based reference design that offers significant advantages for drug companies and research scientists. This architecture can help you:

  • Determine protein structures in less time
  • Utilize cloud resources efficiently without deep domain expertise
  • Rapidly deploy a computing solution to exact specifications
  • Simplify ongoing operations
  • Control cloud compute and data storage costs

This blog describes the reference architecture in detail.

A reference design for Cryo-EM in the cloud

Our cloud design benefits from the proven cloud expertise of Clovertex combined with the power of the WEKA® Data Platform. The result is a complete and fully functional Cryo-EM data processing environment that reduces the burden on researchers, IT teams, and cloud architects while accelerating time to results.

The figure below illustrates deployment of the reference architecture on AWS.

Key elements of the reference architecture include:

  • Automation: As illustrated in the upper right of the diagram, the entire cloud deployment is automated using DevOps and Infrastructure-as-Code (IaC) methods. This means the entire infrastructure is deployed automatically with no manual processes, no mistakes, and no time-consuming initial debug and tuning. Infrastructure scales up and scales down in response to load, helping control costs.
  • Ingest: Data acquisition from Cryo-EM instruments takes place on-premises, as illustrated on the left side of the figure. Data can be ingested into the cloud using SMB or NFS or over S3, depending on the Cryo-EM equipment. A single WEKA Data Platform hosted file system supports multi-protocol access, simplifying data management for complex workflows.
  • Data storage: WEKA Data Platform running in the cloud satisfies all data access and storage needs. All steps in the Cryo-EM pipeline take advantage of WEKA’s simplicity, speed, and scale, eliminating the need for data copies. A single WEKA cluster satisfies all data access needs for multiple Cryo-EM jobs running at the same time.
  • Data tiering: The WEKA File System runs on a cluster of Amazon EC2 i3en instances. The WEKA Data Platform automatically tiers data to Amazon S3, providing massive scalability at an affordable cost. Data in S3 can be archived to Amazon Glacier for long-term retention. To learn more about the WEKA architecture, visit www.weka.io/how-it-works/ or read our architecture white paper.
  • Head node: The head node runs the CryoSPARC and RELION services (application server, database, etc.) as well as Slurm workload management. Head node services run in their own security group in the same availability zone as the compute nodes. WEKA client software runs on all instances requiring high-speed access to the WEKA Data Platform.
  • Compute nodes: This subnet provides access to the specialized GPU and CPU resources needed to accelerate Cryo-EM data processing. Control processes running in the head node subnet dispatch each Cryo-EM pipeline step to the appropriate resources in the compute subnet for execution. Resources can be elastically added and released in response to workload demands.
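To make the dispatch idea concrete, here is a minimal sketch of how a control process on the head node might map a pipeline step to a resource profile and build a Slurm submission. It is an illustration under stated assumptions, not the reference design's actual code: the step names, the partition names ("gpu", "cpu"), and the GPU and CPU counts are all hypothetical, and in practice CryoSPARC and RELION submit jobs through their own cluster integrations. Only standard sbatch options are used, and the command is built but not executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepProfile:
    """Resources requested for one pipeline step (all values hypothetical)."""
    partition: str  # assumed Slurm partition name, not from the reference design
    gpus: int
    cpus: int


# Illustrative mapping only: step names and sizes are assumptions.
STEP_PROFILES = {
    "motion_correction": StepProfile(partition="gpu", gpus=1, cpus=8),
    "ctf_estimation": StepProfile(partition="cpu", gpus=0, cpus=16),
    "refine_3d": StepProfile(partition="gpu", gpus=4, cpus=32),
}


def build_sbatch_command(step: str, command: str) -> list[str]:
    """Return the sbatch argument list for a step without running it."""
    profile = STEP_PROFILES[step]
    args = [
        "sbatch",
        f"--job-name={step}",
        f"--partition={profile.partition}",
        f"--cpus-per-task={profile.cpus}",
    ]
    # Request GPUs only for steps that need them, so CPU-only steps
    # do not hold expensive GPU instances.
    if profile.gpus > 0:
        args.append(f"--gres=gpu:{profile.gpus}")
    args.append(f"--wrap={command}")
    return args


if __name__ == "__main__":
    # "run_refinement.sh" is a placeholder script name.
    print(" ".join(build_sbatch_command("refine_3d", "run_refinement.sh")))
```

Keeping the step-to-resource mapping in one table means that changing an instance type or partition touches a single entry rather than every submission script.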

Reference design benefits

This architecture yields significant real-world benefits for researchers, pharmaceutical companies—and the technology teams supporting them:

  • Deploy Cryo-EM data processing clusters in days instead of months
  • Seamless data mobility including data transfer from Cryo-EM equipment to cloud
  • Provision capacity and throughput independently to match needs
  • Latest data is available immediately for processing
  • Automatic eviction provides faster caching for active projects
  • Faster runtime due to reduced I/O latency at all I/O sizes
  • Maintain and patch with the click of a button
  • Faster protein structure prediction to accelerate drug discovery

These technical benefits translate to greater value from Cryo-EM, faster scientific insights, and decreased time to market for new and potentially life-saving products.

Controlling cloud costs

Because this reference design automatically scales resources up and down, you gain much-needed control over cloud costs. The reference design also helps control licensing costs by eliminating data bottlenecks and maximizing the utilization of compute resources. Most organizations need to get the maximum amount of protein structure information for the minimum spend.
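The link between utilization and cost can be shown with a short calculation. The sketch below is only an illustration: the hourly rate and utilization figures are made-up inputs, not measurements from this reference design or published prices. It divides the hourly price of a resource by the fraction of time it does useful work, which gives the effective price of each productive hour.

```python
def cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Effective cost of one productive hour of a paid-for resource.

    hourly_rate: price paid per hour (hypothetical input).
    utilization: fraction of paid time spent on useful work, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate / utilization


if __name__ == "__main__":
    # Hypothetical example: the same 10.0-per-hour resource costs twice as
    # much per productive hour when it sits idle half of the time.
    print(cost_per_useful_hour(10.0, 0.5))  # 20.0
    print(cost_per_useful_hour(10.0, 0.8))  # 12.5
```

The same arithmetic applies to per-seat or per-GPU software licenses: any hour a licensed resource waits on I/O is paid for but produces nothing.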

This article has been co-developed with our WEKA X partner, Clovertex, a cloud organization specializing in architecting, automating, and managing applications for HPC in the cloud. Clovertex provides solutions tailored to specific research needs that allow HPC workloads to move seamlessly to the cloud.
