Is Data Growth the Cancer of Genomics?
Barbara Murphy. May 29, 2018
We recently attended BioIT World Boston, a leading show that brings together biologists, data scientists, IT infrastructure personnel, and leaders in HPC storage and other technologies, all focused on advancing medicine through new discovery methods. We surveyed our prospective customer base at the show in exchange for a chance to win one of the coolest giveaways there. The results provide some really great insight into the intersection of genomics, IT, and the cloud; this blog post summarizes what we learned.
Genomics Customers are Cloud Savvy
There has been an ongoing debate on how much HPC is really done in the cloud. According to Addison Snell, CEO of Intersect360, cloud adoption is the fastest growing segment of the HPC market space. He highlights that cloud is predominantly used when the workloads are inconsistent and the resources are used for a relatively short period of time.
Genomic workloads fit well within the profile that Intersect360 has outlined: the research pipeline has computationally intensive periods that suit the elasticity of cloud (public or private) infrastructure. Our research revealed that almost 50% of respondents were using public cloud services. AWS dominates HPC in the cloud, with Google in the #2 position and Microsoft Azure a distant third. While the volume of customers using cloud was significantly higher than we saw in our AI machine learning blog post, the relative ranking of the cloud vendors is the same.
That said, as data sets get larger, customers are more likely to keep data on premises. WekaIO Matrix has a great cloud-bursting feature that allows customers to take advantage of elastic compute in the cloud while maintaining large-scale data sets on premises.
Genomics is a Data Hungry Segment
Genomic sequencing generates a tremendous amount of data. A single human genome contains 3 billion base pairs, and after sorting and processing, the finished file takes approximately 100GBytes of capacity. Extrapolate that out to a large sample set and it does not take long before data growth explodes. It was no surprise, then, that the biggest pain point experienced by genomics customers is scaling capacity as data sets grow: over 40% of respondents cited data scaling as their biggest pain point. When we sampled show attendees, 75% had significant data sets, and when you consider that 100TB represents only 1,000 unique genomic samples, it is easy to see how storage grows exponentially. To add to the challenge, all of this data is relevant for future research, so none can be thrown away; in fact, one customer reported that she will have to maintain a 30-year history of her data set.
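The back-of-the-envelope arithmetic above can be sketched in a few lines. The 100GB-per-genome figure is the one cited in this post; the sample rates and retention window below are illustrative assumptions, not survey results.

```python
# Back-of-the-envelope storage growth for a sequencing program.
# Assumes ~100 GB per finished genome, as cited above; sample counts
# and retention period are hypothetical.

GB_PER_GENOME = 100

def total_capacity_tb(samples_per_year: int, years: int) -> float:
    """Cumulative storage in TB when no data is ever deleted."""
    return samples_per_year * years * GB_PER_GENOME / 1000

# 1,000 genomes in one year occupy ~100 TB:
print(total_capacity_tb(1000, 1))   # 100.0 TB
# A 30-year retention policy at the same sequencing pace:
print(total_capacity_tb(1000, 30))  # 3000.0 TB, i.e. 3 PB
```

Because nothing is deleted, capacity grows linearly with time for a fixed sequencing rate; in practice sequencing rates themselves grow year over year, which is what drives the exponential curve described above.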
Storage Performance to a Single GPU Server
The second biggest problem was poor application performance, which usually translates into poor storage performance. One of the biggest challenges with traditional genomic storage systems is that capacity scaling is achieved with a scale-out NAS solution, but access is over an NFS network share. NFS ("Not for Speed") is a protocol developed in 1984, when the average LAN speed was 1Mbit/second. Today's enterprise networks can easily handle 10,000 times more throughput, and the NFS protocol has become the bottleneck, with a bandwidth limitation of 1GByte/second. This was proven out in Nvidia's labs when testing WekaIO against NFS.
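A quick sanity check of the numbers above shows why the wire now outruns the protocol. The figures are the round numbers cited in the text, not measured values.

```python
# Sanity-check of the bandwidth claims above, using the round
# figures cited in the text (not measured values).

lan_1984_mbit = 1                       # ~1 Mbit/s average LAN speed in 1984
modern_mbit = lan_1984_mbit * 10_000    # "10,000 times more throughput"
modern_gbyte_per_s = modern_mbit / 8 / 1000  # Mbit/s -> GByte/s

nfs_ceiling_gbyte_per_s = 1             # cited NFS bandwidth limitation

print(modern_gbyte_per_s)               # 1.25 GByte/s on a 10 Gbit/s link
print(modern_gbyte_per_s > nfs_ceiling_gbyte_per_s)  # True: the network exceeds the NFS ceiling
```

Even a single modern 10 Gbit link can carry more data than a single NFS mount will deliver, so adding faster networking alone does not fix the application bottleneck.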
Research requires a single namespace so that researchers can access data quickly and easily. That workflow requires an intelligent architecture that can stage data according to its heat map, and WekaIO Matrix is the only file system that can solve the genome storage challenge with a system that scales to a global level.
The WekaIO Global Namespace
Matrix is the only file system that allows data to be archived on low-cost infrastructure while remaining fully accessible to applications – with no need for additional management software. It can scale to trillions of files and hundreds of petabytes in a single namespace, and with the ability to seamlessly integrate object storage, it delivers cost dynamics that make sequencing all life a possibility.