AI-based Drug Discovery with Atomwise and Weka on AWS
Shailesh Manjrekar. August 6, 2021
The Covid-19 pandemic has profoundly changed the world. The remote workplace has become the norm. We have started looking at personal health differently, and outdoor recreation and games have increased in popularity. The use of AI for drug discovery has also accelerated in the post-Covid-19 era. Today, drug discovery is an expensive proposition, with a cost of $2.6 billion over 10 years and a success rate of just 12%. AI promises to change this significantly, and innovative startups are attempting to reshape the landscape.
At the forefront is Atomwise, whose AtomNet® platform has succeeded in finding small-molecule hits for more undruggable targets than any other AI drug discovery platform.
Atomwise’s AtomNet® platform
AtomNet is built on best-in-class engineering architecture and tools, with Weka, NVIDIA, and AWS as key technology partners. The AtomNet® platform enables the massive scale and unprecedented speed needed to create a deep and broad pipeline of drugs to improve human health. The platform leverages convolutional neural networks (CNNs), which apply deep learning in three dimensions to the molecular recognition problem. In many ways it is the same approach as deep learning for image recognition. Instead of learning low-level image features, the networks learn low-level features of 3D molecular interactions and combine these into higher-order concepts that explain and predict important labels, such as binding affinity to a particular protein, which can then be used to target a disease. This AI-based approach is used very effectively for cancer research, SARS-CoV-2 drug discovery, and precision medicine use cases.
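The analogy with image recognition can be made concrete with a toy example. The following is a minimal sketch in plain Python, not Atomwise's actual architecture: it assumes atoms are given as (x, y, z) coordinates, bins them into a small cubic occupancy grid, and applies a single hand-written 3D filter followed by a ReLU. The function names `voxelize` and `conv3d`, the grid size, and the all-ones kernel are illustrative assumptions; a real network would learn many such filters from data.

```python
def voxelize(atoms, grid_size=8, resolution=1.0):
    """Bin atoms (x, y, z) into a cubic occupancy grid (hypothetical helper).

    Coordinates use the same length unit as `resolution`; atoms that fall
    outside the grid are ignored.
    """
    grid = [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for x, y, z in atoms:
        i, j, k = (int(c // resolution) for c in (x, y, z))
        if all(0 <= idx < grid_size for idx in (i, j, k)):
            grid[i][j][k] += 1.0
    return grid


def conv3d(grid, kernel):
    """Valid (no padding) 3D cross-correlation followed by a ReLU."""
    n, m = len(grid), len(kernel)
    out_size = n - m + 1
    out = [[[0.0] * out_size for _ in range(out_size)] for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            for k in range(out_size):
                total = 0.0
                for a in range(m):
                    for b in range(m):
                        for c in range(m):
                            total += grid[i + a][j + b][k + c] * kernel[a][b][c]
                out[i][j][k] = max(0.0, total)  # ReLU
    return out


# Three made-up atom positions: two close together, one far away.
atoms = [(1.2, 1.5, 1.1), (2.3, 1.4, 1.9), (5.0, 5.0, 5.0)]
grid = voxelize(atoms)

# An all-ones 2x2x2 kernel counts the atoms in each local neighbourhood.
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d(grid, kernel)
```

In this sketch the filter responds most strongly where the two neighbouring atoms fall into the same 2x2x2 window, which is the 3D counterpart of an image filter responding to a local pixel pattern.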
The Atomwise Data Challenge
The small-molecule drug discovery process is very data intensive. Essentially, Atomwise takes around 4,000 different protein structures with over 3 million molecule compounds and runs over 15 million experiments. This involves importing data from 15 million source databases, running ETL, and generating descriptors, which produces around 30 million small files used for CNN training. Taking all of these protein structures and sampling them against the molecule compounds presents a daunting data challenge, and it requires a distributed file system that can deliver excellent metadata performance and mixed read-write I/O performance.
To put this in perspective, each model requires roughly:
- 6 P2, P3, or P4 GPU instances on AWS, with around 10,000 such instances running at any given time
- 5 million weights
- 30 to 50 epochs, each taking 0.5 to 4 days
- 1 to 2 million random-access file lookups
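A rough back-of-the-envelope calculation shows what these figures imply. The sketch below uses only the numbers quoted in the list and makes two assumptions that the source does not state: epochs run one after another, and all of the roughly 10,000 concurrent instances are used for models of this size. The variable and function names are illustrative.

```python
# Figures quoted above.
GPU_INSTANCES_PER_MODEL = 6
CONCURRENT_INSTANCES = 10_000
EPOCH_DAYS = (0.5, 4.0)   # shortest and longest epoch time, in days
EPOCHS = (30, 50)         # fewest and most epochs per model


def training_days(epoch_days, epochs):
    """Best-case and worst-case wall-clock days, assuming sequential epochs."""
    return epoch_days[0] * epochs[0], epoch_days[1] * epochs[1]


low_days, high_days = training_days(EPOCH_DAYS, EPOCHS)

# Instance-days consumed by one model across its GPU instances.
low_instance_days = low_days * GPU_INSTANCES_PER_MODEL
high_instance_days = high_days * GPU_INSTANCES_PER_MODEL

# Upper bound on models in flight if every instance belongs to a 6-instance job.
concurrent_models = CONCURRENT_INSTANCES // GPU_INSTANCES_PER_MODEL

print(f"Wall-clock training time per model: {low_days:g} to {high_days:g} days")
print(f"Instance-days per model: {low_instance_days:g} to {high_instance_days:g}")
print(f"Models that could train concurrently: about {concurrent_models}")
```

Under these assumptions a single model spends between 15 and 200 days in training, which is why the storage layer's share of each epoch matters so much.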
Weka-Atomwise Solution on AWS
Atomwise evaluated several storage solutions while working through their data challenge: (1) a local file system on a multi-core server; (2) EBS with an NFS head; (3) AWS EFS; and (4) an in-memory Redis database server. They ultimately found that (5) WekaFS was the ideal solution to meet their data challenges.
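The source does not describe how the candidates were compared. As one hedged illustration of the kind of test that stresses the same access pattern (many small files, random lookups), the sketch below writes a set of small files into a directory and times random reads. The function name `benchmark_small_files`, the file naming, and all sizes are assumptions chosen to keep the example quick, and the numbers are far below the 30 million files mentioned earlier.

```python
import os
import random
import tempfile
import time


def benchmark_small_files(directory, num_files=200, file_size=1024,
                          lookups=500, seed=0):
    """Write small files, then time random-access reads (hypothetical helper)."""
    rng = random.Random(seed)
    payload = os.urandom(file_size)

    # Populate the directory with fixed-size small files.
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"descriptor_{i:06d}.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
        paths.append(path)

    # Time open + read on randomly chosen files.
    start = time.perf_counter()
    bytes_read = 0
    for _ in range(lookups):
        with open(rng.choice(paths), "rb") as fh:
            bytes_read += len(fh.read())
    elapsed = time.perf_counter() - start

    return {
        "lookups": lookups,
        "bytes_read": bytes_read,
        "seconds": elapsed,
        "lookups_per_second": lookups / elapsed if elapsed > 0 else float("inf"),
    }


# Run against a temporary directory; point `directory` at a mounted
# file system to compare storage back ends.
with tempfile.TemporaryDirectory() as tmp:
    result = benchmark_small_files(tmp)

print(f"{result['lookups']} lookups in {result['seconds']:.4f} s")
```

Because the files are read immediately after being written, the operating system's page cache will serve most reads in this form. A meaningful comparison would need a dataset much larger than memory, or reads issued from a different client than the one that wrote the files.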
Weka-Atomwise Solution Business Outcomes
The Weka and Atomwise solution on AWS proved the best fit for bursty workloads, such as those in computational drug discovery, and delivered the following key performance indicators (KPIs):
- Experimentation time down from 3 months to 1 week, improving time to market and time to insight
- Epoch times (convolutional neural network model training times) down by 2x
- Strong performance with large numbers of small files
- Excellent metadata performance across 30 million small and large files
- Bursty workload performance with 10,000 EC2 instances against a single Weka cluster
- Global namespace spanning EC2 and an S3 bucket
- S3 bucket attached to the EC2 NVMe instances used for file serving, with storage persistence provided by taking a snap2object snapshot of the file namespace and moving it to the S3 bucket
- Works with AWS Batch and Amazon EKS (Elastic Kubernetes Service) for job scheduling
- Caters to Atomwise’s entire data pipeline storage requirements, eliminating storage silos: ingest, ETL, training, inference, and lifecycle management
The Weka file solution in AWS is ideal for customers with vertical use cases, is well integrated with AWS, HPC, and Life Sciences solution stacks, and offers the following advantages:
|HPC and Life Science Use Cases|Weka and AWS Value Proposition|
|Computational chemistry and structural biology|Accelerated time to insights|
|Modeling and simulation|Scalability and dynamic resourcing with autoscaling support|
|Genomics|Compliant and secure environment with end-to-end encryption|
|BioImaging|Global fault tolerant infrastructure|
WekaFS is available on AWS Marketplace for best performance and economics.
Watch on demand: Accelerating AI Training Models for Faster Research and Drug Discovery in the Cloud